The Tetlock Revolution — Future Studies

In 1984, a young psychologist at the University of California, Berkeley began a study that would take twenty years to complete and would overturn the way we think about expertise, prediction, and the relationship between confidence and accuracy. Philip Tetlock recruited 284 experts, people whose profession involved commenting on or advising about political and economic trends, and asked them to make predictions about the future. He collected roughly 28,000 predictions over two decades. Then he scored them.¹

The results, published in 2005 as Expert Political Judgment, were devastating. The average expert was barely better than a dart-throwing chimpanzee. More precisely, the experts performed slightly better than chance but dramatically worse than simple statistical algorithms. An extrapolation model that assumed nothing would change outperformed most of the experts most of the time. The finding was robust across domains, time horizons, and levels of expertise. More experience did not improve accuracy. More credentials did not improve accuracy. More confidence actively degraded it.
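
To make the scoring step concrete, here is a minimal sketch using the Brier score, the mean squared difference between forecast probabilities and what actually happened. It is a standard rule for scoring probability forecasts, shown here in its single-outcome form (0 is perfect, 1 is worst). The questions, forecasts, and outcomes are invented for illustration and are not Tetlock's data.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical data: five yes/no questions, 1 = the event happened.
outcomes = [0, 0, 1, 0, 0]
expert = [0.7, 0.2, 0.6, 0.6, 0.3]  # confident calls, some badly wrong
chance = [0.5] * 5                  # always says 50 percent
base_rate = [0.2] * 5               # always says the assumed historical frequency

for name, forecasts in [("expert", expert), ("chance", chance), ("base rate", base_rate)]:
    print(f"{name:10s} {brier_score(forecasts, outcomes):.3f}")
```

In this toy setup the expert edges out chance but loses to the simple rule, which is the pattern the study reported.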

The media misread the finding. Headlines declared that experts were useless. Tetlock himself was uncomfortable with this interpretation because it obscured the more important result buried in the data. While the average expert was mediocre, the variance was enormous. Some experts were genuinely terrible, worse than chance across hundreds of predictions. Others were remarkably good, consistently outperforming statistical baselines. The difference was not intelligence, domain knowledge, or access to information. It was cognitive style.

Tetlock borrowed Isaiah Berlin's distinction between foxes and hedgehogs. Hedgehogs know one big thing. They have a master theory, a grand narrative through which they interpret all events. They are articulate, confident, and media-friendly. They make bold predictions grounded in their framework. Foxes know many small things. They are tentative, self-critical, and eclectic. They aggregate information from diverse sources, update their beliefs frequently, and express predictions as probabilities rather than certainties.¹

The foxes dramatically outperformed the hedgehogs. Tetlock's explanation is that hedgehogs are trapped by their frameworks. When evidence contradicts their theory, they reinterpret the evidence rather than updating the theory. Foxes, lacking a master theory, are free to follow the evidence wherever it leads. The hedgehog's confidence, which audiences and policymakers find reassuring, is precisely the cognitive feature that degrades predictive accuracy.

The experts who appeared most frequently on television performed the worst. Confidence and accuracy were inversely correlated. The pundits we trust most to see the future are the ones who are most reliably blind to it.

The IARPA tournament. Tetlock's findings caught the attention of the Intelligence Advanced Research Projects Activity, the research arm of the U.S. intelligence community. In 2011, IARPA launched the Aggregative Contingent Estimation (ACE) program, a forecasting tournament designed to find out whether anyone could consistently beat the intelligence community's own analysts at predicting geopolitical events.²

Tetlock's team, the Good Judgment Project (GJP), entered the tournament alongside four other academic teams. The questions were the same ones intelligence analysts were working on: Will North Korea conduct a nuclear test before a given date? Will the Eurozone lose a member? Will the price of gold exceed a given threshold? The questions were specific, time-bounded, and resolvable.

The GJP recruited volunteer forecasters from the general public. Some were professionals with relevant expertise. Others included a retired pipe installer, a Brooklyn filmmaker, and a former ballroom dance instructor. The volunteers received no classified information. They had access only to public sources: newspapers, government reports, Wikipedia.

The results were extraordinary. The GJP's best forecasters reportedly outperformed intelligence analysts who had access to classified information by approximately 30 percent. They outperformed prediction markets. After two years, the GJP was so far ahead of the other academic teams that IARPA dropped them from the tournament.³

What superforecasters do. The top two percent of GJP forecasters, whom Tetlock and Dan Gardner later dubbed "superforecasters," shared a set of cognitive habits that distinguished them from both experts and ordinary volunteers.⁴

Granular probability estimation. When asked "Will Russia invade eastern Ukraine before January 1?" a typical forecaster might say "probably" or "60 percent." A superforecaster would say "72 percent" and mean it. The granularity was not false precision. It reflected a genuine effort to distinguish between 60 percent and 70 percent, which requires thinking carefully about the base rate, the specific evidence, and the strength of the evidence. The practice of making fine-grained distinctions forced more careful analysis.

Frequent updating. Superforecasters revised their estimates constantly. A new piece of evidence, a speech by a foreign minister, a satellite image, a change in commodity prices, would trigger a reassessment. The updates were typically small, a few percentage points, but they accumulated. The forecaster who started at 72 percent and revised to 68 percent after one piece of evidence and then to 74 percent after another was performing something very close to Bayesian updating, the mathematical framework for incorporating new evidence into probability estimates described in Article 14.
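
The arithmetic behind such small revisions can be sketched with Bayes' rule in odds form: posterior odds equal prior odds times the likelihood ratio of the evidence. The likelihood ratios below (0.8 and 1.4) are assumed values chosen for illustration; in practice, judging how diagnostic a piece of evidence is remains the hard part.

```python
def bayes_update(prior, likelihood_ratio):
    """Return the posterior probability after one piece of evidence.

    likelihood_ratio = P(evidence | event) / P(evidence | no event).
    Values below 1 lower the estimate; values above 1 raise it.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

estimate = 0.72  # starting forecast
for lr in (0.8, 1.4):  # hypothetical evidence: mildly against, then mildly for
    estimate = bayes_update(estimate, lr)
    print(f"likelihood ratio {lr}: estimate is now {estimate:.3f}")
```

With these assumed ratios the estimate moves from 72 percent to roughly 67 percent and then to roughly 74 percent, the same kind of small, cumulative revision described above.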

The outside view. Superforecasters habitually began their analysis with the base rate: how often has this type of event occurred in the past? If 15 percent of countries that experience large-scale protests transition to a new government within two years, that is the starting point. The specific details of the current situation, the identity of the protesters, the weakness of the government, the involvement of external powers, adjust the estimate upward or downward from the base rate. This is the opposite of the typical expert approach, which begins with the specific case and constructs a narrative.
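
A minimal sketch of this two-step procedure follows: compute the base rate from a reference class, then nudge it with case-specific adjustments. The reference class, the adjustment factors, and their sizes are all hypothetical, and applying adjustments additively in log-odds is one convenient convention, not a method the source prescribes.

```python
import math

def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical reference class: 20 past protest episodes,
# True = the government changed within two years.
reference_class = [True] * 3 + [False] * 17
base_rate = sum(reference_class) / len(reference_class)  # 3 of 20

# Hypothetical inside-view adjustments, expressed in log-odds.
adjustments = {
    "weak government": 0.4,
    "fragmented protesters": -0.2,
    "external backing": 0.3,
}

estimate = sigmoid(logit(base_rate) + sum(adjustments.values()))
print(f"base rate {base_rate:.2f}, adjusted estimate {estimate:.2f}")
```

Working in log-odds keeps the adjusted estimate between 0 and 1 no matter how many adjustments are stacked, and it makes the starting point explicit: the narrative details move the number, but they do not choose it.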

Dragonfly eye perspective. Rather than committing to a single analytical framework, superforecasters synthesized multiple perspectives. They would consider the question from a political scientist's viewpoint, then from an economist's, then from a military strategist's. Each perspective produced a different probability. The superforecaster aggregated these into a final estimate, weighting each perspective by its apparent relevance to the specific question.
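
The aggregation step can be sketched as a weighted average. The perspectives, probabilities, and relevance weights below are invented for illustration, and linear weighting is an assumption: the source does not say how superforecasters combined their viewpoints.

```python
def aggregate(perspectives):
    """Weighted average of {name: (probability, relevance_weight)} entries."""
    total_weight = sum(weight for _, weight in perspectives.values())
    if total_weight <= 0:
        raise ValueError("at least one perspective needs a positive weight")
    weighted_sum = sum(prob * weight for prob, weight in perspectives.values())
    return weighted_sum / total_weight

# Hypothetical estimates for one question, each with a judged relevance weight.
perspectives = {
    "political scientist": (0.30, 0.5),
    "economist": (0.20, 0.2),
    "military strategist": (0.45, 0.3),
}

final_estimate = aggregate(perspectives)
print(f"aggregated estimate: {final_estimate:.3f}")
```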

Growth mindset. The single strongest predictor of superforecasting ability was not intelligence, education, or domain expertise. It was commitment to self-improvement. Superforecasters treated forecasting as a skill to be practiced and refined. They reviewed their past predictions, identified systematic errors, and adjusted their methods. This growth mindset, the belief that ability is developed through effort rather than fixed by talent, was more predictive of accuracy than any cognitive measure.

The team multiplier. Individual superforecasters were impressive. Teams of superforecasters were dramatically better. When Tetlock grouped his best forecasters into teams and had them discuss their estimates before submitting, the team estimates outperformed even the best individual. The mechanism was straightforward: team discussion forced explicit articulation of reasoning, exposed hidden assumptions, and provided social accountability for calibration. A forecaster who habitually overestimated risks would be gently corrected by teammates who noticed the pattern.²
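
The pattern a teammate might notice, a forecaster who habitually overestimates, can also be checked mechanically. The sketch below compares the average forecast with the observed frequency of events; the track record is hypothetical, and this single number is a crude stand-in for a full calibration curve.

```python
def mean_bias(forecasts, outcomes):
    """Average forecast minus observed event frequency.

    Positive values mean events were overestimated on average;
    negative values mean they were underestimated.
    """
    n = len(forecasts)
    if n == 0 or n != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum(forecasts) / n - sum(outcomes) / n

# Hypothetical track record: eight resolved questions, 1 = the event happened.
forecasts = [0.6, 0.7, 0.5, 0.8, 0.4, 0.6, 0.7, 0.5]
outcomes = [1, 0, 0, 1, 0, 0, 1, 0]

bias = mean_bias(forecasts, outcomes)
print(f"mean bias: {bias:+.3f}")
```

Here the forecaster averaged 60 percent on events that occurred 37.5 percent of the time, a gap large enough to prompt the kind of correction described above.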

Translation to machines. The superforecaster's cognitive toolkit maps remarkably well to computational systems. Granular probability estimation is trivially implementable. Frequent updating is what Bayesian algorithms do on every new data point. The outside view is reference class forecasting, a technique that can be automated with historical databases. The dragonfly eye is ensemble modeling, running multiple perspectives and aggregating. Growth mindset is the predict-score-learn flywheel that the constraint graph architecture implements by design.

The one thing that does not translate is the superforecaster's domain intuition, the ability to judge which evidence is relevant and which is noise. This is where the human-machine partnership becomes critical. A computational system can maintain perfect calibration, update on every new fact, and aggregate multiple model outputs. But it cannot, at least not yet, exercise the judgment that says "this particular satellite image matters more than that particular economic indicator for this particular question." This is the role of the LLM-as-validator described in Article 10: not generating predictions, but reviewing the predictions that a physical simulation generates and asking whether the evidence warrants them.

Tetlock's work demonstrated that prediction is not a gift. It is a method. The method can be taught, practiced, and measured. The superforecasters showed that people without special credentials can predict dramatically better than experts, pundits, or intelligence analysts when they adopt the right cognitive practices. The question that follows is whether those practices can be embedded in architecture rather than relying on the discipline of individuals. That is the project this knowledge base describes.

References

1. Tetlock, P.E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press.
2. Mellers, B. et al. (2014). "Psychological Strategies for Winning a Geopolitical Forecasting Tournament." Psychological Science, 25(5), 1106-1115.
3. Mellers, B. et al. (2015). "The Psychology of Intelligence Analysis: Drivers of Prediction Accuracy in World Politics." Journal of Experimental Psychology: Applied, 21(1), 1-14.
4. Tetlock, P.E. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
5. Ungar, L. et al. (2012). "The Good Judgment Project: A Large Scale Test of Different Methods of Combining Expert Predictions." AAAI Technical Report.
6. Satopaa, V. et al. (2014). "Combining Multiple Probability Predictions Using a Simple Logit Model." International Journal of Forecasting, 30(2), 344-356.