Keeping Score: Brier Scores and Decomposition

Glenn Brier was a meteorologist at the U.S. Weather Bureau who, in 1950, proposed a simple scoring rule for probabilistic forecasts. The Brier score is the mean squared error between the predicted probability and the binary outcome: BS = (1/N) times the sum of (f_i minus o_i) squared, where f_i is the forecast probability and o_i is 1 if the event occurred or 0 if it did not. Scores range from 0 to 1, and lower is better. A perfect forecaster who assigns probability 1.0 to events that happen and 0.0 to events that do not scores 0. A forecaster who always says 0.5 scores 0.25 regardless of the outcomes.¹
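As a minimal sketch of the formula above, the score can be computed in a few lines of Python. The function name brier_score and the sample forecasts are illustrative assumptions, not part of any particular library.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes.

    Hypothetical helper for illustration; lower is better.
    """
    if len(forecasts) == 0 or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and of equal length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# A perfect forecaster: probability 1.0 on events that happen, 0.0 on those that do not.
perfect = brier_score([1.0, 0.0, 1.0], [1, 0, 1])

# A forecaster who always says 0.5, whatever the outcomes turn out to be.
hedger = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])

print(perfect, hedger)  # 0.0 0.25
```

The two sample calls reproduce the reference points in the text: 0 for the perfect forecaster and 0.25 for the constant 0.5 forecast.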

The Brier score is a strictly proper scoring rule: a forecaster minimizes their expected score by reporting their true beliefs, and only by doing so. Unlike improper measures such as mean absolute error, which reward pushing every forecast to 0 or 1, it offers no incentive to strategically misreport. If you genuinely believe the probability is 0.7, reporting 0.7 minimizes your expected Brier score. Reporting 0.9 to seem more decisive or 0.5 to hedge will, on average, produce a worse score. This property, which the Brier score shares with other strictly proper rules such as the logarithmic score, is why it is a standard choice for evaluating probabilistic forecasts.
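The claim can be checked numerically. If the true probability is p and the forecaster reports q, the expected Brier score is p times (1 minus q) squared plus (1 minus p) times q squared, which simplifies to p(1 minus p) plus (q minus p) squared and is therefore smallest at q = p. The sketch below evaluates this over a grid of reports; the helper name expected_brier and the grid resolution are assumptions made for illustration.

```python
def expected_brier(true_p: float, reported_q: float) -> float:
    """Expected Brier score when the event occurs with probability true_p
    and the forecaster reports reported_q (hypothetical helper)."""
    return true_p * (1 - reported_q) ** 2 + (1 - true_p) * reported_q ** 2


true_p = 0.7

# Try every report from 0.00 to 1.00 in steps of 0.01 and keep the best one.
candidates = [i / 100 for i in range(101)]
best_report = min(candidates, key=lambda q: expected_brier(true_p, q))

honest = expected_brier(true_p, 0.7)     # about 0.21
decisive = expected_brier(true_p, 0.9)   # about 0.25
hedged = expected_brier(true_p, 0.5)     # about 0.25

print(best_report, honest, decisive, hedged)
```

With a true belief of 0.7, the honest report costs about 0.21 in expectation, while both the overconfident 0.9 and the hedged 0.5 cost about 0.25.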

Murphy's decomposition. Allan Murphy's 1973 decomposition reveals the internal structure of the Brier score. The total score equals Reliability minus Resolution plus Uncertainty.²

Reliability measures calibration. Group all predictions into probability buckets: 0-10%, 10-20%, and so on. For each bucket, compute the average predicted probability and the actual hit rate. Reliability is the mean squared difference between the two, weighted by the number of forecasts in each bucket. Lower is better: a perfectly calibrated forecaster has Reliability = 0. This is what the calibration curve in Article 18 visualizes.

Resolution measures discrimination. It is the weighted mean squared difference between each bucket's observed frequency and the overall base rate. A forecaster who assigns high probability to events that happen and low probability to events that do not happen has high resolution. A forecaster who assigns the same probability to everything has zero resolution. Resolution is subtracted because higher resolution improves the score.

Uncertainty is o_bar times (1 minus o_bar), where o_bar is the overall base rate of the event. It depends only on the base rate, which reflects the inherent difficulty of the prediction task, and is independent of the forecaster's skill. Predicting rare events (base rate 0.05) has low uncertainty: 0.05 times 0.95 = 0.0475. Predicting coin flips (base rate 0.50) has the maximum uncertainty of 0.25.
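The three components can be computed together. The following is a sketch under stated assumptions: ten equal-width buckets, a hypothetical function name murphy_decomposition, and made-up sample data. One caveat applies: Reliability minus Resolution plus Uncertainty reproduces the Brier score exactly only when every forecast in a bucket has the same value; when forecasts vary within a bucket, the binned decomposition is an approximation.

```python
from collections import defaultdict
from typing import Sequence, Tuple


def murphy_decomposition(
    forecasts: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> Tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty) using equal-width buckets.

    Hypothetical helper for illustration; n_bins=10 mirrors the 0-10%, 10-20% buckets.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Assign each forecast to a bucket; a forecast of exactly 1.0 joins the top bucket.
    buckets = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        buckets[min(int(f * n_bins), n_bins - 1)].append((f, o))

    reliability = 0.0
    resolution = 0.0
    for members in buckets.values():
        n_k = len(members)
        mean_forecast = sum(f for f, _ in members) / n_k
        hit_rate = sum(o for _, o in members) / n_k
        reliability += n_k * (mean_forecast - hit_rate) ** 2
        resolution += n_k * (hit_rate - base_rate) ** 2

    return reliability / n, resolution / n, base_rate * (1 - base_rate)


# Illustrative data: five forecasts of 0.2 (one hit) and five of 0.9 (four hits).
forecasts = [0.2] * 5 + [0.9] * 5
outcomes = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

rel, res, unc = murphy_decomposition(forecasts, outcomes)
direct = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

print(rel, res, unc, rel - res + unc, direct)
```

For this sample, Reliability is about 0.005 (the 0.9 forecasts hit only 80% of the time), Resolution is about 0.09, Uncertainty is 0.25, and the recombined total matches the directly computed Brier score of about 0.165.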

Why decomposition matters. A composite Brier score of 0.18 tells you the forecaster beats a constant 0.5 forecast (which scores 0.25), but it does not tell you whether they beat the base rate or how to improve. The decomposition is diagnostic:

PATTERN	DIAGNOSIS	FIX
Well calibrated (Reliability term near 0), low resolution	Honest but unhelpful	Better models, more data
Poorly calibrated (large Reliability term), high resolution	Useful signal, systematic bias	Recalibrate probability mapping
Well calibrated (Reliability term near 0), high resolution	Strong forecaster	Expand prediction surface
Poorly calibrated (large Reliability term), low resolution	Noise generator	Rebuild from scratch

Crystal Ball reports all three components alongside every accuracy analysis. When the decomposition shows that the constraint graph has high resolution (it can distinguish events that happen from events that don't) but mediocre calibration (a large Reliability term, meaning its probability estimates are systematically too high or too low), the fix is recalibration, not a new model. When resolution is low, the fix is structural: add nodes to the graph, improve the transfer functions, or extend the data sources. The decomposition converts a single number into an action plan.

The Brier Skill Score. The Brier Skill Score measures improvement over a reference forecast, typically climatology (always predicting the base rate). BSS = 1 minus (BS / BS_ref). A BSS of 0.30 means the forecaster's Brier score is 30 percent lower than always predicting the base rate. Tetlock's superforecasters achieved Brier Skill Scores in the range of 0.15 to 0.30 on geopolitical forecasting questions. Crystal Ball's target is BSS > 0.20 on the categories of predictions it generates, verified through a minimum of 200 resolved predictions. The calibration tracking system described in Article 18 provides the feedback loop that drives improvement toward this target.³
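A minimal sketch of the skill score follows, assuming the climatological reference is the base rate of the same sample being scored; in practice the reference would more often be a historical base rate known before the forecasts were made. The function name brier_skill_score and the sample data are illustrative assumptions.

```python
from typing import Sequence


def brier_skill_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """BSS = 1 - BS / BS_ref, with BS_ref from always predicting the sample base rate.

    Hypothetical helper; the in-sample base rate is an assumption of this sketch.
    """
    n = len(outcomes)
    base_rate = sum(outcomes) / n
    bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    bs_ref = sum((base_rate - o) ** 2 for o in outcomes) / n
    if bs_ref == 0:
        raise ValueError("reference score is zero: all outcomes are identical")
    return 1 - bs / bs_ref


# Illustrative data: five forecasts of 0.2 (one hit) and five of 0.9 (four hits).
forecasts = [0.2] * 5 + [0.9] * 5
outcomes = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

bss = brier_skill_score(forecasts, outcomes)
print(bss)  # about 0.34
```

Under this in-sample assumption the reference score equals the Uncertainty term, so BSS can also be read as (Resolution minus Reliability) divided by Uncertainty. A BSS of 0 means no improvement over the base rate, and a negative BSS means the forecaster did worse than it.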

References

Brier, G.W. (1950). "Verification of Forecasts Expressed in Terms of Probability." Monthly Weather Review, 78(1), 1-3.
Murphy, A.H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology, 12(4), 595-600.
Gneiting, T. & Raftery, A.E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association, 102(477), 359-378.
Wilks, D.S. (2011). Statistical Methods in the Atmospheric Sciences. 3rd ed. Academic Press.