The Calibration Problem — Future Studies

When you say "70 percent," you mean something specific. You mean that in a world where you make this statement one hundred times, the event should occur approximately seventy times. If it occurs ninety times, you are underconfident. If it occurs fifty times, you are overconfident. Calibration is the alignment between stated probability and actual frequency, and it is the most fundamental metric of forecasting quality.¹
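The definition above can be checked mechanically. The following is a minimal sketch, not a standard routine: the function name `diagnose` and the 5-point `tolerance` are illustrative assumptions, and the outcome lists are made-up data matching the examples in the text.

```python
def diagnose(stated_probability, outcomes, tolerance=0.05):
    """Compare a stated probability with how often the event actually occurred.

    outcomes is a list of 1 (event occurred) and 0 (it did not).
    tolerance is an arbitrary illustrative margin, not a standard threshold.
    """
    observed = sum(outcomes) / len(outcomes)
    if observed > stated_probability + tolerance:
        verdict = "underconfident"   # event happens more often than stated
    elif observed < stated_probability - tolerance:
        verdict = "overconfident"    # event happens less often than stated
    else:
        verdict = "calibrated"
    return observed, verdict


# One hundred forecasts, each stated at 70 percent (illustrative data).
print(diagnose(0.70, [1] * 90 + [0] * 10))
print(diagnose(0.70, [1] * 50 + [0] * 50))
print(diagnose(0.70, [1] * 70 + [0] * 30))
```

In practice the tolerance should depend on sample size: with only a handful of resolved forecasts, a gap of several points is indistinguishable from noise.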

Most humans are poorly calibrated. Studies consistently show that when people say they are "90 percent sure," they are correct approximately 75 percent of the time. When they say "99 percent sure," they are correct approximately 85 percent of the time. The overconfidence is systematic: people overestimate the reliability of their knowledge across virtually every domain tested.¹

Calibration can be improved. The Good Judgment Project demonstrated that feedback and training produced measurable calibration improvements in volunteer forecasters. Participants who were shown, on a reliability diagram, where their forecasts diverged from observed frequencies learned to adjust their probability estimates toward reality. The adjustment took weeks, not months, suggesting that miscalibration is more a matter of habit than ability.
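A reliability diagram is a plot of observed frequency against stated probability, one point per probability bucket. The sketch below computes the numbers behind such a plot. The function name `reliability_table`, the ten equal-width bins, and the sample data are assumptions made for illustration; they are not taken from the Good Judgment Project.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes, n_bins=10):
    """Return (mean stated probability, observed frequency, count)
    for each non-empty equal-width probability bin."""
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[index].append((p, y))

    table = []
    for index in sorted(bins):
        pairs = bins[index]
        mean_p = sum(p for p, _ in pairs) / len(pairs)
        hit_rate = sum(y for _, y in pairs) / len(pairs)
        table.append((mean_p, hit_rate, len(pairs)))
    return table


# Illustrative data: "60 percent" forecasts are right 6 times in 10,
# "90 percent" forecasts are right only 15 times in 20.
forecasts = [0.6] * 10 + [0.9] * 20
outcomes = [1] * 6 + [0] * 4 + [1] * 15 + [0] * 5

table = reliability_table(forecasts, outcomes)
for mean_p, hit_rate, count in table:
    gap = hit_rate - mean_p  # negative gap means overconfidence in this bin
    print(f"stated {mean_p:.2f}  observed {hit_rate:.2f}  n={count}  gap {gap:+.2f}")
```

A perfectly calibrated forecaster would show a gap of zero in every bin, which corresponds to points lying on the diagonal of the diagram.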

The meta-problem. Calibration alone is not enough. A predictor who always says "50 percent" on events that occur half the time is perfectly calibrated and useless. Perfect calibration with zero resolution means the forecaster correctly reports their ignorance but adds no information. The dual problem is a forecaster who always says "90 percent" or "10 percent," with high resolution but poor calibration: the forecasts separate likely events from unlikely ones, but the useful signal is buried under systematic bias. The signal can be extracted through recalibration, that is, by mapping the forecaster's stated probabilities to their actual hit rates. A forecaster who says "90 percent" and is right 75 percent of the time can be recalibrated by simply relabeling "90 percent" as "75 percent."²
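The relabeling step can be sketched as a lookup table built from a forecaster's track record. This is a deliberately simple illustration that assumes the forecaster uses a small set of discrete probability values and has enough resolved forecasts at each value; the names `fit_recalibration` and `recalibrate` and the history data are hypothetical.

```python
from collections import defaultdict


def fit_recalibration(forecasts, outcomes):
    """Map each distinct stated probability to its observed hit rate."""
    totals = defaultdict(lambda: [0, 0])  # stated probability -> [hits, count]
    for p, y in zip(forecasts, outcomes):
        totals[p][0] += y
        totals[p][1] += 1
    return {p: hits / count for p, (hits, count) in totals.items()}


def recalibrate(p, mapping):
    """Relabel a stated probability; values with no history pass through unchanged."""
    return mapping.get(p, p)


# Illustrative track record: "90 percent" is right 15 times in 20,
# "10 percent" events still happen 5 times in 20.
history_forecasts = [0.9] * 20 + [0.1] * 20
history_outcomes = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15

mapping = fit_recalibration(history_forecasts, history_outcomes)
print(recalibrate(0.9, mapping))  # 0.75
print(recalibrate(0.1, mapping))  # 0.25
```

Relabeling leaves the ordering of the forecasts intact, so the resolution is preserved while the bias is removed. It cannot help the always-"50 percent" forecaster, who has no signal to extract.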

For machine prediction systems, calibration training is replaced by calibration tracking. Crystal Ball records every prediction's stated confidence and its eventual resolution. Over hundreds of resolved predictions, the reliability diagram emerges automatically. If the 70-percent bucket has a hit rate of 85 percent, the system is systematically underconfident in that range. The recalibration can be applied automatically to future predictions, or the underlying model can be investigated to understand why it is producing conservative estimates in that confidence range. This is the predict-score-learn flywheel applied to the meta-level: the system learns about its own predictive characteristics. The Brier score decomposition described in Article 19 provides the diagnostic framework for identifying whether the problem is calibration, resolution, or both.
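To make the diagnostic concrete, the sketch below computes the standard Murphy decomposition of the Brier score into reliability (the calibration error), resolution, and uncertainty, with Brier score = reliability - resolution + uncertainty. Whether Article 19 uses exactly this form is an assumption, and the code is not Crystal Ball's implementation; the function name and data are illustrative. Forecasts are grouped by their exact stated value, which keeps the identity exact.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Return Brier score, reliability, resolution, and uncertainty.

    Lower reliability means better calibration; higher resolution means
    the forecasts separate events from non-events more sharply.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    groups = defaultdict(list)  # stated probability -> outcomes
    for p, y in zip(forecasts, outcomes):
        groups[p].append(y)

    reliability = 0.0
    resolution = 0.0
    for p, ys in groups.items():
        hit_rate = sum(ys) / len(ys)
        reliability += len(ys) * (p - hit_rate) ** 2
        resolution += len(ys) * (hit_rate - base_rate) ** 2

    return {
        "brier": sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / n,
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1 - base_rate),
    }


outcomes = [1] * 15 + [0] * 5 + [1] * 5 + [0] * 15  # base rate of 50 percent

# Forecaster A always says 50 percent: calibrated, but no resolution.
always_half = brier_decomposition([0.5] * 40, outcomes)

# Forecaster B says 90 or 10 percent: resolution, but miscalibrated.
bold = brier_decomposition([0.9] * 20 + [0.1] * 20, outcomes)

for name, parts in (("always 50%", always_half), ("90% / 10%", bold)):
    print(name, {key: round(value, 4) for key, value in parts.items()})
```

On this made-up data the bold forecaster earns the lower (better) Brier score despite the miscalibration, because the resolution gained outweighs the reliability penalty. Comparing the two terms shows which problem to address first.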

References

Lichtenstein, S., Fischhoff, B., & Phillips, L.D. (1982). "Calibration of probabilities." In Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press.
Dawid, A.P. (1982). "The well-calibrated Bayesian." Journal of the American Statistical Association, 77(379), 605-610.
DeGroot, M.H. & Fienberg, S.E. (1983). "The comparison and evaluation of forecasters." Journal of the Royal Statistical Society: Series D, 32(1-2), 12-22.