A prediction's confidence at the moment of creation is the least informative data point about that prediction. What matters is the trajectory: how did the confidence evolve as new evidence arrived? Did it strengthen steadily as falsifiers failed to trigger, suggesting the model was correctly calibrated? Did it oscillate wildly, suggesting the model was responding to noise? Did it collapse suddenly when a single falsifier triggered, revealing a critical assumption that was wrong?
Temporal confidence tracking stores a snapshot of each prediction's state at regular intervals. Each snapshot records: the current confidence, the constraint graph simulation probability if available, the prediction market contract price if mapped, the number of active falsifiers, the number of triggered falsifiers, and the specific fact or event that caused the most recent confidence change. Over the lifetime of a prediction, this series of snapshots produces a confidence trajectory that is far more informative than any single estimate.
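As a minimal sketch of what one snapshot and its series might look like, the fields listed above map naturally onto a small record type. The class names, field names, and the chronological-order check below are illustrative assumptions, not a description of the actual storage schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ConfidenceSnapshot:
    """One periodic observation of a prediction's state (hypothetical schema)."""
    taken_at: datetime
    confidence: float                         # current confidence in [0, 1]
    simulation_probability: Optional[float]   # constraint graph simulation, if available
    market_price: Optional[float]             # prediction market contract price, if mapped
    active_falsifiers: int
    triggered_falsifiers: int
    last_change_cause: Optional[str]          # fact or event behind the latest confidence change


@dataclass
class ConfidenceTrajectory:
    """Ordered snapshots for a single prediction (hypothetical container)."""
    prediction_id: str
    snapshots: List[ConfidenceSnapshot] = field(default_factory=list)

    def record(self, snapshot: ConfidenceSnapshot) -> None:
        # Assumption: snapshots arrive in chronological order; reject older ones.
        if self.snapshots and snapshot.taken_at < self.snapshots[-1].taken_at:
            raise ValueError("snapshot is older than the latest recorded snapshot")
        self.snapshots.append(snapshot)

    def confidences(self) -> List[float]:
        # The bare confidence series, oldest first.
        return [s.confidence for s in self.snapshots]
```

Keeping the cause of the most recent change on every snapshot is what later makes a sudden drop diagnosable: the triggering fact is stored next to the number it moved.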
What trajectories reveal. A prediction that starts at 0.65, rises steadily to 0.82 over six months as falsifiers fail to trigger and supporting evidence accumulates, then resolves as correct at 0.85, is consistent with a well-calibrated model that identified a trend before the market did. A prediction that starts at 0.70, drops to 0.35 when a single fact contradicts a key assumption, then resolves as incorrect, points to a specific weakness in the model, one that can be diagnosed from the triggering fact. A prediction that oscillates between 0.40 and 0.60 throughout its lifetime suggests a fundamentally uncertain event on which the model has little or no edge over chance.
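The three patterns above can be told apart mechanically from the confidence series alone. The following is a rough heuristic sketch: the function name, the labels, and every threshold (a 0.25 single-step drop for a collapse, a 0.10 net change for a trend, the 0.7 directness ratio) are assumptions chosen for illustration and would need tuning against real trajectories.

```python
from typing import List


def classify_trajectory(confidences: List[float],
                        collapse_drop: float = 0.25,
                        trend_threshold: float = 0.10) -> str:
    """Label a confidence series as 'collapse', 'steady_rise', 'steady_fall',
    'oscillating', or 'flat'. All thresholds are illustrative assumptions."""
    if len(confidences) < 2:
        return "flat"
    steps = [b - a for a, b in zip(confidences, confidences[1:])]
    # One large downward step suggests a single fact broke a key assumption.
    if min(steps) <= -collapse_drop:
        return "collapse"
    net_change = confidences[-1] - confidences[0]
    total_movement = sum(abs(s) for s in steps)
    # A trend: the net change is large and accounts for most of the path length.
    if abs(net_change) >= trend_threshold and abs(net_change) >= 0.7 * total_movement:
        return "steady_rise" if net_change > 0 else "steady_fall"
    # Plenty of movement but little net change.
    if total_movement >= 2 * trend_threshold:
        return "oscillating"
    return "flat"


print(classify_trajectory([0.65, 0.70, 0.74, 0.78, 0.82, 0.85]))  # steady_rise
print(classify_trajectory([0.70, 0.72, 0.35, 0.33]))              # collapse
print(classify_trajectory([0.40, 0.60, 0.40, 0.60, 0.45]))        # oscillating
```

The collapse check runs first on purpose: a trajectory with one decisive drop is more useful to flag for diagnosis than to fold into a general downward trend.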
Decomposed Brier scores. The Brier score, introduced by Glenn Brier in 1950 for evaluating weather forecasts, is the mean squared error between predicted probabilities and binary outcomes: BS = (1/N) Σ (f_i - o_i)^2, where f_i is the forecast probability for prediction i and o_i is 1 if the event occurred and 0 otherwise. A perfect forecaster scores 0. A forecaster who always says 50 percent scores 0.25. Climatology, which always predicts the base rate p, scores p(1 - p): exactly 0.25 when p = 0.5, and lower the more skewed the base rate is.
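The definition is short enough to compute directly. This is a minimal sketch in plain Python; the function name and the input validation are assumptions for illustration.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) == 0 or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# A constant 50 percent forecast scores 0.25 whatever the outcomes are.
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # 0.25

# A perfect forecaster scores 0.
print(brier_score([1.0, 0.0, 1.0], [1, 0, 1]))  # 0.0
```

Forecasting a 20 percent base rate on a set where one event in five occurs gives roughly 0.2 × 0.8 = 0.16, which is why a skewed base rate makes climatology harder to beat than the 0.25 benchmark implies.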
Allan Murphy's 1973 decomposition separates the Brier score into three components. Reliability (calibration): when you say 70 percent, does it happen 70 percent of the time? This term measures the gap between stated probabilities and observed frequencies, so lower is better. Resolution: can you distinguish high-probability events from low-probability ones? A predictor who always says 50 percent has zero resolution regardless of calibration. Uncertainty: the inherent difficulty of the prediction task, determined by the base rate of the event category. BS = Reliability - Resolution + Uncertainty, so the score improves when reliability falls or resolution rises.
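A small sketch makes the three terms concrete. It groups predictions by their exact forecast value, which is the case where the identity BS = Reliability - Resolution + Uncertainty holds exactly; with continuous confidences one would bin forecasts instead, and the identity then holds only approximately. The function name and grouping choice are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def murphy_decomposition(forecasts: Sequence[float],
                         outcomes: Sequence[int]) -> Tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty), grouping by distinct forecast value."""
    n = len(forecasts)
    if n == 0 or n != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    base_rate = sum(outcomes) / n

    # Collect the outcomes observed for each distinct forecast value.
    groups: Dict[float, List[int]] = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = 0.0
    resolution = 0.0
    for f, observed in groups.items():
        freq = sum(observed) / len(observed)
        # Gap between the stated probability and what actually happened.
        reliability += len(observed) * (f - freq) ** 2
        # How far this group's observed frequency sits from the base rate.
        resolution += len(observed) * (freq - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty


forecasts = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]
outcomes = [1, 1, 1, 0, 0, 0, 0, 1]
rel, res, unc = murphy_decomposition(forecasts, outcomes)
print(rel, res, unc, rel - res + unc)
```

In this example the 0.8 forecasts come true 75 percent of the time and the 0.2 forecasts 25 percent of the time, so reliability is small (0.0025), resolution is 0.0625, uncertainty is 0.25, and the three recombine to the directly computed Brier score of 0.19.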
This decomposition is diagnostic. A system with poor calibration but high resolution has useful signal buried under systematic bias. The fix is recalibration: adjust the mapping from internal confidence to reported probability. A system with good calibration but low resolution is honest but unhelpful: it correctly reports its ignorance but doesn't actually distinguish likely from unlikely outcomes. The fix is better models or better data.
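One simple way to realize the recalibration step is histogram binning: bucket past predictions by internal confidence and report each bucket's observed hit rate instead of the raw number. The sketch below is one assumed approach among several (Platt scaling and isotonic regression are common alternatives), not a description of how Crystal Ball recalibrates; the function names, the bin count, and the midpoint fallback for empty bins are all illustrative.

```python
from typing import List, Sequence


def fit_histogram_recalibrator(confidences: Sequence[float],
                               outcomes: Sequence[int],
                               n_bins: int = 10) -> List[float]:
    """Return one reported probability per equal-width confidence bin."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for c, o in zip(confidences, outcomes):
        b = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        counts[b] += 1
        hits[b] += o
    # Assumption: bins with no history fall back to the bin midpoint.
    return [hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def recalibrate(confidence: float, table: List[float]) -> float:
    """Map an internal confidence to the observed frequency of its bin."""
    b = min(int(confidence * len(table)), len(table) - 1)
    return table[b]


# Historical predictions: stated ~0.92 came true half the time, ~0.12 a quarter.
history_conf = [0.92, 0.92, 0.92, 0.92, 0.12, 0.12, 0.12, 0.12]
history_out = [1, 1, 0, 0, 0, 0, 0, 1]
table = fit_histogram_recalibrator(history_conf, history_out)
print(recalibrate(0.95, table))  # 0.5
print(recalibrate(0.15, table))  # 0.25
```

Because the mapping only relabels buckets, it can shrink the reliability term but cannot add resolution: two predictions that landed in the same bucket still receive the same reported probability.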
Crystal Ball reports all three components alongside the composite score. Over time, the decomposition reveals whether improvements are coming from better calibration (the system is learning to estimate probabilities more honestly) or better resolution (the constraint graph is learning to distinguish events that happen from events that don't). Both are valuable. The decomposition tells you which lever to pull.
References
- Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1-3.
- Murphy, A.H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology, 12, 595-600.
- Gneiting, T. & Raftery, A.E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." Journal of the American Statistical Association, 102(477), 359-378.