Future Studies

The Science of Seeing What Hasn't Happened Yet

A Laks Industries Division

Pillar I

The Prediction Problem

Article 01

Why We Can't See the Future

In the autumn of 2008, the global financial system collapsed in a manner that virtually no credentialed expert had predicted. The world's largest investment banks, staffed by thousands of quantitative analysts with doctoral degrees in mathematics, physics, and economics, failed to anticipate a crisis that in retrospect had been building for years. Lehman Brothers filed for bankruptcy on September 15. Within weeks, the contagion had spread to every major economy on earth. The collective intellectual firepower of Wall Street, the Federal Reserve, the International Monetary Fund, and every major economics department in the Western world had produced, at the critical moment, nothing useful.1

This was not an isolated failure. It was a specimen of a pattern that extends across every domain where human beings attempt to anticipate the future. Philip Tetlock's twenty-year study of expert political judgment, published in 2005, tracked 28,361 predictions made by 284 experts across politics, economics, and international relations. The result was devastating: the average expert performed barely better than a dart-throwing chimpanzee.2 The experts who appeared most frequently on television performed the worst. Confidence and accuracy were inversely correlated.

The question this raises is not whether prediction is possible. Weather forecasting proves that it is. The National Oceanic and Atmospheric Administration issues five-day forecasts that are correct approximately 90 percent of the time, a dramatic improvement over the 50 percent accuracy of 1970.3 The question is why prediction works spectacularly well in some domains and fails catastrophically in others. The answer, it turns out, has nothing to do with intelligence. It has everything to do with architecture.

The failure is not one of intelligence. It is one of architecture. The smartest people in the world, reasoning about the future without a physical model of the present, produce noise.

The machinery of self-deception. Daniel Kahneman spent four decades cataloguing the cognitive biases that distort human judgment. His framework divides cognition into two systems: System 1, which is fast, automatic, and associative, and System 2, which is slow, deliberate, and analytical.4 The trouble with prediction is that it feels like a System 2 activity but is almost always contaminated by System 1. An analyst reviewing a company's earnings report believes she is performing careful calculation. In practice, her estimate is anchored by the consensus forecast she read that morning, shaped by the availability of recent dramatic events, and organized into a narrative that her mind has already constructed before the analysis begins.

Anchoring is perhaps the most insidious of these biases. Kahneman and Tversky demonstrated that even random numbers influence subsequent estimates. When subjects were asked to estimate the percentage of African nations in the United Nations after spinning a rigged wheel that landed on either 10 or 65, the median estimates were 25 percent and 45 percent respectively.5 The anchor, which contained zero information about African nations, moved the estimate by twenty points. Now consider that every financial analyst begins their work surrounded by anchors: consensus estimates, recent price action, the framing of the question itself.

The availability heuristic compounds the problem. Events that are vivid, recent, or emotionally salient are systematically overweighted. After a plane crash, people overestimate the probability of dying in a plane crash by orders of magnitude. After a market crash, analysts overestimate the probability of another crash. After a long bull market, they underestimate it. The base rate of the event is overwhelmed by the salience of the most recent instance.

Then there is the narrative fallacy, which Nassim Taleb identified as perhaps the deepest obstacle to clear thinking about the future.1 The human mind is a story-generating machine. Given any set of facts, it will construct a coherent narrative that explains them. The problem is that coherent narratives are always available after the fact and almost never before it. The story of the 2008 crisis, told in retrospect, is perfectly logical: subprime mortgages, securitization, overleveraged banks, regulatory capture. Told in 2006, it was the fevered imagination of a handful of contrarians whom the market had been punishing for years.

Why expertise makes it worse. One of the most counterintuitive findings in Tetlock's study was that domain expertise often degraded predictive accuracy rather than improving it. He divided his experts into two cognitive styles, borrowing Isaiah Berlin's distinction between foxes and hedgehogs.2 Hedgehogs know one big thing. They have a master theory, they are articulate and confident, and they see every new fact through the lens of their framework. Foxes know many small things. They are tentative, self-critical, and willing to aggregate information from diverse sources even when it contradicts their priors.

The hedgehogs were terrible forecasters. The foxes were significantly better. The mechanism is clear in retrospect: hedgehogs are precisely the kind of experts who appear on television, write bestselling books, and are consulted by policymakers. They offer certainty, which is what audiences and decision-makers crave. But certainty is the enemy of calibration, and calibration is the foundation of accurate prediction.

Paul Meehl established this principle as early as 1954, when he demonstrated that simple statistical models consistently outperformed clinical judgment across twenty studies in psychology and medicine.6 The finding has been replicated hundreds of times since. Robyn Dawes, in his aptly titled House of Cards, showed that even improper linear models, with randomly assigned positive weights, outperformed human experts.7 The implication is not that models are brilliant. It is that human judgment is systematically worse than even crude quantitative approaches.

The fundamental asymmetry. There is a deeper reason why prediction is hard, one that goes beyond cognitive bias. The future is not a single thing to be discovered. It is a vast space of possibilities, of which exactly one will be realized. Every prediction is an attempt to assign probabilities to regions of this space. The problem is that the space is not merely large; it is structured in a way that defeats intuition.

Nate Silver, in The Signal and the Noise, catalogued the domains where prediction works and where it fails.8 Weather, baseball, and poker are domains where prediction has improved dramatically. Economics, politics, and earthquake forecasting are domains where it has not. The difference is not the availability of data. It is the presence or absence of tight feedback loops, stable causal structures, and the ability to run experiments.

Weather prediction works because the atmosphere obeys the laws of thermodynamics. The equations are known. The initial conditions are measured by a global sensor network. The model can be run forward in time and checked against reality every day. The prediction improves because the feedback loop is fast, the causal structure is stable, and the system, while chaotic, is governed by physics that does not change.

Economic prediction fails because economies are reflexive systems. The act of prediction changes the system being predicted. If a credible forecaster announces that a bank will fail, depositors withdraw their money and the bank fails, regardless of whether the original prediction was correct. The causal structure is not stable because the agents within it are reactive. George Soros formalized this as his theory of reflexivity, and it explains why economic forecasting is structurally harder than weather forecasting, even with better data.

The architecture problem. If prediction fails not because of insufficient intelligence but because of architectural inadequacy, then the solution is not smarter analysts. It is better architecture. The question becomes: what does a prediction system look like that works?

Weather forecasting provides the template. NOAA does not ask a panel of meteorologists to discuss the weather and arrive at a consensus. It runs a physics simulation. It discretizes the atmosphere into a grid. It measures the current state of every grid cell. It applies conservation laws, thermodynamic equations, and fluid dynamics. It steps the simulation forward. The forecast is not an opinion. It is the output of a physical model.

The Makridakis Competitions, which have benchmarked forecasting methods since 1982, consistently demonstrate that simple statistical methods outperform complex ones, and that combining methods outperforms any single method.9 The fifth iteration, M5, showed that machine learning methods could outperform traditional statistical approaches, but only when combined with domain knowledge about the system being forecast. Pure data-driven approaches, without causal understanding, hit a ceiling.

This points to the solution. The domains where prediction works are the domains where someone has built a causal model of the system. The atmosphere obeys thermodynamics, and someone has written the equations. A baseball player's performance obeys biomechanics and probability, and someone has built the statistical framework. The domains where prediction fails are the domains where the causal model either does not exist or is ignored in favor of expert opinion.

The implication is radical. Every prediction system that relies on human judgment, whether it is a panel of economists, a room full of intelligence analysts, or a hedge fund's macro strategist, is doing the equivalent of asking a group of people to predict the weather by looking out the window and arguing. It might work for the next hour. It will not work for the next week. For that, you need a model.

The atmosphere obeys thermodynamics. Supply chains obey economics. Both are systems of measurable quantities connected by known transfer functions. The only question is whether anyone has bothered to write them down.

This is the thesis of this entire body of work. Prediction is not a mystical art. It is an engineering problem. The tools exist. The data exists. The transfer functions are known, or at least knowable. What has been missing is the architecture: a system that models physical reality as a graph of measurable quantities connected by causal relationships, simulates forward through known constraints, identifies the binding bottlenecks that the market has not priced, and then, critically, scores its own predictions against reality and improves.

The remainder of this knowledge base builds that architecture piece by piece. We begin with the history of forecasting, to understand what has been tried. We examine the systems that work, from NOAA's supercomputers to Renaissance Technologies' statistical models to prediction markets. We develop the methodology: constraint graphs, adversarial falsification, Bayesian updating, temporal confidence tracking. We build a complete worked example, the uranium fuel cycle, to prove that the method generates useful predictions with real numbers. And we close with an honest accounting of the limits: chaos, black swans, and the irreducible uncertainty that no architecture can eliminate.

The goal is not omniscience. It is a system that predicts what is predictable and is robust to what is not. That is a lower bar than prophecy and a higher bar than opinion. It is, in the precise sense of the word, a science.

References

  1. Taleb, N.N. (2007). The Black Swan: The Impact of the Highly Improbable. Random House.
  2. Tetlock, P.E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press.
  3. Bauer, P., Thorpe, A., & Brunet, G. (2015). "The quiet revolution of numerical weather prediction." Nature, 525, 47-55.
  4. Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
  5. Tversky, A. & Kahneman, D. (1974). "Judgment under Uncertainty: Heuristics and Biases." Science, 185(4157), 1124-1131.
  6. Meehl, P.E. (1954). Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence. University of Minnesota Press.
  7. Dawes, R.M. (1994). House of Cards: Psychology and Psychotherapy Built on Myth. Free Press.
  8. Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail — But Some Don't. Penguin.
  9. Makridakis, S. et al. (2018). "The M4 Competition: 100,000 time series and 61 forecasting methods." International Journal of Forecasting, 34(4), 802-808.
  10. Einhorn, H.J. & Hogarth, R.M. (1978). "Confidence in Judgment: Persistence of the Illusion of Validity." Psychological Review, 85(5), 395-416.
  11. Soros, G. (2003). The Alchemy of Finance. Wiley.
  12. Ord, T. (2020). The Precipice: Existential Risk and the Future of Humanity. Hachette.
Article 02

A Brief History of Forecasting

The desire to see the future is older than civilization. The Oracle at Delphi dispensed prophecies from the eighth century BCE through the fourth century CE, her utterances mediated by priests who converted the Pythia's trances into actionable counsel for kings, generals, and colonists. The prophecies were effective not because they were accurate but because they were ambiguous: Croesus was told that if he attacked Persia, a great empire would be destroyed. He attacked. His own empire was destroyed. The prophecy was technically correct.1

For most of human history, forecasting meant prophecy: a single authority claiming access to a single future. The astrologers of Babylon, the augurs of Rome, the court soothsayers of medieval Europe all operated within this framework. The future was singular, hidden, and accessible only to those with special gifts or divine connection. This tradition persists today in the form of pundits, guru investors, and thought leaders who project confidence about outcomes they cannot possibly know. The packaging has changed. The epistemology has not.

The probabilistic revolution. The first genuine break from prophetic forecasting came in 1654, when Blaise Pascal and Pierre de Fermat exchanged letters about the problem of points: how to divide the stakes of an interrupted gambling game. Their correspondence invented probability theory. For the first time, the future was not a single thing to be divined but a space of possibilities to which numerical weights could be assigned.1

Pierre-Simon Laplace extended this framework in 1814 with his Philosophical Essay on Probabilities, which contained the famous thought experiment of Laplace's Demon: an intellect that knew the position and velocity of every particle in the universe could, in principle, predict the entire future. This was simultaneously the high-water mark of deterministic optimism and, though Laplace could not have known it, the beginning of its refutation. The demon was a thought experiment, not a prediction about prediction. But it established the idea that forecasting was, at root, a computational problem.2

The actuarial tradition developed in parallel. Edmund Halley constructed the first life table in 1693, enabling the pricing of annuities. Insurance companies discovered that while no individual death was predictable, the death rate of a population was remarkably stable. This was forecasting through aggregation: the future of the individual was unknowable, but the future of the collective was computable. The distinction would prove fundamental.

The computational turn. Lewis Fry Richardson, a Quaker ambulance driver in World War I, published Weather Prediction by Numerical Process in 1922. The book proposed dividing the atmosphere into a grid of cells, measuring the physical state of each cell, and computing the future state using the equations of fluid dynamics. Richardson envisioned a "forecast factory" staffed by 64,000 human computers, each responsible for one cell, coordinated by a conductor in the center of a vast hall.3

The idea was visionary but premature. Richardson attempted a single forecast by hand and produced a wildly incorrect result: a pressure change of 145 millibars in six hours, roughly one hundred times too large. The error came not from the method but from the initial conditions: he had used observations that were too coarse for the equations. The method was right. The data was wrong. It would take thirty years and the invention of electronic computers before the method could be properly tested.

In 1950, a team led by Jule Charney ran the first successful numerical weather forecast on ENIAC, the U.S. Army's electronic computer at the Aberdeen Proving Ground. The 24-hour forecast took 24 hours to compute, which made it useless in practice but proved the concept. The atmosphere could be simulated. The forecast could be computed rather than guessed.4

Chaos and its consequences. In 1963, Edward Lorenz at MIT discovered that weather simulations were exquisitely sensitive to initial conditions. A difference of one part in a thousand in the starting temperature could produce completely different weather patterns after two weeks. This was deterministic chaos: the equations were perfectly specified, the physics was correct, but the prediction diverged exponentially from reality because the initial measurements were imperfect. The butterfly effect, as it came to be known, set a fundamental limit on weather prediction at approximately ten to fourteen days.5

Lorenz's discovery did not kill forecasting. It refined it. If a single simulation was unreliable beyond ten days, you could run fifty simulations with slightly different initial conditions and use the spread of the results as a measure of uncertainty. This was ensemble forecasting, and it transformed weather prediction from a deterministic exercise into a probabilistic one. The forecast was no longer "it will rain Tuesday." It was "there is a 70 percent probability of rain Tuesday." This was a conceptual revolution: the forecast carried its own uncertainty estimate.

The econometric dead end. Economics followed a different path. The Club of Rome's Limits to Growth report in 1972 used computer simulation to forecast global resource depletion, population collapse, and economic decline. The report's predictions were specific and dramatic, and most of them turned out to be wrong, not because simulation was the wrong approach but because the model's assumptions about technological stagnation and resource substitution were naive.6

Econometric forecasting, which attempted to model national economies using systems of equations, produced similarly disappointing results. The Lucas Critique, published by Robert Lucas in 1976, demonstrated that econometric models broke down when policy changed because the model parameters were not structural constants but behavioral responses to the current policy regime. Change the policy and the parameters shifted. This was the reflexivity problem: economic agents react to forecasts, which changes the system being forecast.

By the early 2000s, the dominant view in academic economics was that short-term macroeconomic forecasting was essentially impossible. The Federal Reserve's own forecasting record was mediocre. Private economic forecasters performed barely better than simple extrapolation. The field had spent decades building increasingly complex models and had, by most measures, made no progress.

Two traditions, one insight. H.G. Wells, in a 1932 lecture titled "Wanted: Professors of Foresight," called for a systematic science of the future to be taught in universities. Ossip Flechtheim coined the term "futurology" in the 1940s, envisioning a discipline that would apply scientific methods to the study of possible futures. Bertrand de Jouvenel, in his 1967 book The Art of Conjecture, distinguished between the forecast (what will happen) and the conjecture (what might happen), arguing that the latter was both more honest and more useful.7,8

The futures studies tradition that emerged from these thinkers emphasized plural futures, scenario analysis, and the Delphi method. It was intellectually honest about uncertainty but methodologically soft. The tools were workshops, expert panels, and structured imagination. The tradition produced useful frameworks, including the futures cone (possible, plausible, probable, preferable futures) and causal layered analysis, but it did not produce scored predictions. It could not tell you whether it was getting better over time because it had no mechanism for keeping score.

Meanwhile, the weather forecasting tradition had been keeping score every single day since 1950. Five-day forecast accuracy improved from roughly 50 percent in 1970 to 90 percent by 2020. The improvement was driven entirely by better models, better data, and more compute, not by better human judgment. The lesson was clear: forecasting improves when you build physical models, measure your errors, and iterate. It does not improve when you gather experts in a room and ask them to think harder.

Philip Tetlock's work, which we examine in Article 3, would bridge these two traditions. His finding that some forecasters dramatically outperformed others opened the door to studying prediction as a skill that could be measured, trained, and improved. The superforecasters he identified were, in a sense, human ensemble models: they aggregated diverse perspectives, updated granularly, and tracked their own accuracy. They were doing manually what weather models do computationally.

The history of forecasting is therefore a story of two approaches. The first, prophecy, assumes a single hidden future accessible through insight, expertise, or divine favor. It has been tried for three thousand years and has never worked. The second, simulation, assumes a system governed by knowable relationships and computes the range of possible outcomes. It has been tried for seventy-five years and has worked spectacularly well in every domain where it has been properly applied. The question is not which approach to choose. It is why simulation has been applied to the atmosphere and not to supply chains, commodity markets, and geopolitical scenarios. That question is the subject of Article 10.

References

  1. Bernstein, P. (1996). Against the Gods: The Remarkable Story of Risk. Wiley.
  2. Laplace, P.S. (1814). A Philosophical Essay on Probabilities.
  3. Richardson, L.F. (1922). Weather Prediction by Numerical Process. Cambridge University Press.
  4. Bauer, P., Thorpe, A., & Brunet, G. (2015). "The quiet revolution of numerical weather prediction." Nature, 525, 47-55.
  5. Lorenz, E.N. (1963). "Deterministic Nonperiodic Flow." Journal of the Atmospheric Sciences, 20(2), 130-141.
  6. Meadows, D.H. et al. (1972). The Limits to Growth. Universe Books.
  7. Wells, H.G. (1932). "Wanted: Professors of Foresight." Futures Research Quarterly.
  8. de Jouvenel, B. (1967). The Art of Conjecture. Basic Books.
  9. Bell, W. (1997). Foundations of Futures Studies. Transaction Publishers.
  10. Rescher, N. (1998). Predicting the Future. SUNY Press.
  11. Slaughter, R. (2003). Integral Futures. Australian Foresight Institute.
Article 03

The Tetlock Revolution

In 1984, a young psychologist at the University of California, Berkeley, began a study that would take twenty years to complete and would overturn the way we think about expertise, prediction, and the relationship between confidence and accuracy. Philip Tetlock recruited 284 experts, people whose profession involved commenting on or advising about political and economic trends, and asked them to make predictions about the future. He collected 28,361 predictions over two decades. Then he scored them.1

The results, published in 2005 as Expert Political Judgment, were devastating. The average expert was barely better than a dart-throwing chimpanzee. More precisely, the experts performed slightly better than chance but dramatically worse than simple statistical algorithms. An extrapolation model that assumed nothing would change outperformed most of the experts most of the time. The finding was robust across domains, time horizons, and levels of expertise. More experience did not improve accuracy. More credentials did not improve accuracy. More confidence actively degraded it.

The media misread the finding. Headlines declared that experts were useless. Tetlock himself was uncomfortable with this interpretation because it obscured the more important result buried in the data. While the average expert was mediocre, the variance was enormous. Some experts were genuinely terrible, worse than chance across hundreds of predictions. Others were remarkably good, consistently outperforming statistical baselines. The difference was not intelligence, domain knowledge, or access to information. It was cognitive style.

Tetlock borrowed Isaiah Berlin's distinction between foxes and hedgehogs. Hedgehogs know one big thing. They have a master theory, a grand narrative through which they interpret all events. They are articulate, confident, and media-friendly. They make bold predictions grounded in their framework. Foxes know many small things. They are tentative, self-critical, and eclectic. They aggregate information from diverse sources, update their beliefs frequently, and express predictions as probabilities rather than certainties.1

The foxes dramatically outperformed the hedgehogs. The mechanism is clear: hedgehogs are trapped by their frameworks. When evidence contradicts their theory, they reinterpret the evidence rather than updating the theory. Foxes, lacking a master theory, are free to follow the evidence wherever it leads. The hedgehog's confidence, which audiences and policymakers find reassuring, is precisely the cognitive feature that degrades predictive accuracy.

The experts who appeared most frequently on television performed the worst. Confidence and accuracy were inversely correlated. The pundits we trust most to see the future are the ones who are most reliably blind to it.

The IARPA tournament. Tetlock's findings caught the attention of the Intelligence Advanced Research Projects Activity, the research arm of the U.S. intelligence community. In 2011, IARPA launched the Aggregative Contingent Estimation (ACE) program, a forecasting tournament designed to find out whether anyone could consistently beat the intelligence community's own analysts at predicting geopolitical events.2

Tetlock's team, the Good Judgment Project (GJP), entered the tournament alongside four other academic teams. The questions were the same ones intelligence analysts were working on: Will North Korea conduct a nuclear test before a given date? Will the Eurozone lose a member? Will the price of gold exceed a given threshold? The questions were specific, time-bounded, and resolvable.

The GJP recruited volunteer forecasters from the general public. Some were professionals with relevant expertise. Others were a retired pipe installer, a Brooklyn filmmaker, a former ballroom dance instructor. The volunteers received no classified information. They had access only to public sources: newspapers, government reports, Wikipedia.

The results were extraordinary. The GJP's best forecasters outperformed intelligence analysts who had access to classified information by approximately 30 percent. They outperformed prediction markets. They won the tournament so decisively that IARPA dropped the other academic teams after two years because the result was clear.3

What superforecasters do. The top two percent of GJP forecasters, whom Tetlock and Dan Gardner later dubbed "superforecasters," shared a set of cognitive habits that distinguished them from both experts and ordinary volunteers.4

Granular probability estimation. When asked "Will Russia invade eastern Ukraine before January 1?" a typical forecaster might say "probably" or "60 percent." A superforecaster would say "72 percent" and mean it. The granularity was not false precision. It reflected a genuine effort to distinguish between 60 percent and 70 percent, which requires thinking carefully about the base rate, the specific evidence, and the strength of the evidence. The practice of making fine-grained distinctions forced more careful analysis.

Frequent updating. Superforecasters revised their estimates constantly. A new piece of evidence, a speech by a foreign minister, a satellite image, a change in commodity prices, would trigger a reassessment. The updates were typically small, a few percentage points, but they accumulated. The forecaster who started at 72 percent and revised to 68 percent after one piece of evidence and then to 74 percent after another was performing something very close to Bayesian updating, the mathematical framework for incorporating new evidence into probability estimates described in Article 14.
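
The arithmetic behind this kind of revision is worth making explicit. The sketch below applies Bayes' rule in odds form to the 72 percent example above; the likelihood ratios attached to each piece of evidence are illustrative assumptions, not values from the tournament data.

```python
# Minimal sketch of Bayesian updating in odds form. The starting probability
# mirrors the 72 percent example above; the likelihood ratios for each piece
# of evidence are illustrative assumptions.

def bayes_update(prob: float, likelihood_ratio: float) -> float:
    """Update a probability given P(evidence | event) / P(evidence | no event)."""
    prior_odds = prob / (1.0 - prob)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

estimate = 0.72                      # initial forecast
for lr in (0.8, 1.4):                # mildly disconfirming, then confirming evidence
    estimate = bayes_update(estimate, lr)
    print(f"updated estimate: {estimate:.2f}")   # roughly 0.67, then 0.74
```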

The outside view. Superforecasters habitually began their analysis with the base rate: how often has this type of event occurred in the past? If 15 percent of countries that experience large-scale protests transition to a new government within two years, that is the starting point. The specific details of the current situation, the identity of the protesters, the weakness of the government, the involvement of external powers, adjust the estimate upward or downward from the base rate. This is the opposite of the typical expert approach, which begins with the specific case and constructs a narrative.

Dragonfly eye perspective. Rather than committing to a single analytical framework, superforecasters synthesized multiple perspectives. They would consider the question from a political scientist's viewpoint, then from an economist's, then from a military strategist's. Each perspective produced a different probability. The superforecaster aggregated these into a final estimate, weighting each perspective by its apparent relevance to the specific question.

Growth mindset. The single strongest predictor of superforecasting ability was not intelligence, education, or domain expertise. It was commitment to self-improvement. Superforecasters treated forecasting as a skill to be practiced and refined. They reviewed their past predictions, identified systematic errors, and adjusted their methods. This growth mindset, the belief that ability is developed through effort rather than fixed by talent, was more predictive of accuracy than any cognitive measure.

The team multiplier. Individual superforecasters were impressive. Teams of superforecasters were dramatically better. When Tetlock grouped his best forecasters into teams and had them discuss their estimates before submitting, the team estimates outperformed even the best individual. The mechanism was straightforward: team discussion forced explicit articulation of reasoning, exposed hidden assumptions, and provided social accountability for calibration. A forecaster who habitually overestimated risks would be gently corrected by teammates who noticed the pattern.2

Translation to machines. The superforecaster's cognitive toolkit maps remarkably well to computational systems. Granular probability estimation is trivially implementable. Frequent updating is what Bayesian algorithms do on every new data point. The outside view is reference class forecasting, a technique that can be automated with historical databases. The dragonfly eye is ensemble modeling, running multiple perspectives and aggregating. Growth mindset is the predict-score-learn flywheel that the constraint graph architecture implements by design.

The one thing that does not translate is the superforecaster's domain intuition, the ability to judge which evidence is relevant and which is noise. This is where the human-machine partnership becomes critical. A computational system can maintain perfect calibration, update on every new fact, and aggregate multiple model outputs. But it cannot, at least not yet, exercise the judgment that says "this particular satellite image matters more than that particular economic indicator for this particular question." This is the role of the LLM-as-validator described in Article 10: not generating predictions, but reviewing the predictions that a physical simulation generates and asking whether the evidence warrants them.

Tetlock's work demonstrated that prediction is not a gift. It is a method. The method can be taught, practiced, and measured. The superforecasters proved that humans can be dramatically better at prediction than experts, pundits, or intelligence analysts when they adopt the right cognitive practices. The question that follows is whether those practices can be embedded in architecture rather than relying on the discipline of individuals. That is the project this knowledge base describes.

References

  1. Tetlock, P.E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press.
  2. Mellers, B. et al. (2014). "Psychological Strategies for Winning a Geopolitical Forecasting Tournament." Psychological Science, 25(5), 1106-1115.
  3. Mellers, B. et al. (2015). "The psychology of intelligence analysis: Drivers of prediction accuracy in world politics." Journal of Experimental Psychology: Applied, 21(1), 1-14.
  4. Tetlock, P.E. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
  5. Ungar, L. et al. (2012). "The Good Judgment Project: A large scale test of different methods of combining expert predictions." AAAI Technical Report.
  6. Satopaa, V. et al. (2014). "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting, 30(2), 344-356.
Pillar II

How the Best Systems Work

Article 04

49 Petaflops of Future-Seeing: Weather Prediction

The National Oceanic and Atmospheric Administration operates two identical Hewlett Packard Enterprise Cray supercomputers, named Dogwood and Cactus, that together deliver 49.4 petaflops of computing power. These machines consume 35 terabytes of observational data per day from a global network of satellites, weather balloons, ocean buoys, ground stations, aircraft sensors, and ships. They run the Global Forecast System at 13-kilometer horizontal resolution, producing forecasts that extend sixteen days into the future and are updated every six hours. The five-day forecast is accurate approximately 90 percent of the time, a dramatic improvement from the 50 percent accuracy of 1970.1,2

This is the most successful prediction system ever built, and the method by which it works is worth understanding in detail because it is the template for every prediction architecture that actually works.

The grid. Numerical weather prediction begins by discretizing the atmosphere into a three-dimensional grid. The GFS grid divides the earth's surface into cells approximately 13 kilometers on a side, with 127 vertical levels extending from the surface to approximately 80 kilometers altitude. Each grid cell is characterized by a set of physical quantities: temperature, pressure, humidity, wind speed in three dimensions, and various trace gas concentrations. The total number of variables at each time step runs into the billions.

The equations. The state of each grid cell evolves according to the equations of fluid dynamics and thermodynamics. The Navier-Stokes equations govern the flow of air. The first law of thermodynamics governs heating and cooling. Clausius-Clapeyron governs the phase transitions of water between vapor, liquid, and ice. Radiation transfer equations govern the absorption and emission of solar and terrestrial radiation. These equations are not approximations or statistical models. They are the laws of physics. The atmosphere has no choice but to obey them.

The data. The forecast requires initial conditions: the measured state of every grid cell at the start time. This is where the 35 terabytes per day of observational data enter. The data assimilation system, itself a major computational challenge, combines observations from different sources with different resolutions, different error characteristics, and different spatial and temporal coverage into a single coherent initial state. The quality of the initial conditions determines the quality of the forecast, which is why weather prediction has improved as much from better observations as from better models.

The ensemble. A single simulation run forward from a single set of initial conditions will diverge from reality after approximately ten days because of chaotic sensitivity to initial conditions, as Lorenz demonstrated in 1963. The solution is ensemble forecasting: run not one simulation but fifty, each starting from slightly different initial conditions within the observational uncertainty. The spread of the ensemble is the forecast's built-in uncertainty estimate. When all fifty runs agree, the forecast is confident. When they diverge, the forecast honestly communicates that the atmosphere is in a state where small differences in initial conditions lead to large differences in outcome.
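
The mechanics can be illustrated with a toy system. The sketch below uses a logistic map as a stand-in for the atmosphere's chaotic dynamics; the real GFS integrates fluid-dynamics and thermodynamics equations over billions of grid-cell variables, but the logic of perturbing the initial conditions and reading the spread as uncertainty is the same.

```python
import random

# Toy illustration of ensemble forecasting. A chaotic logistic map stands in
# for the atmosphere; the members differ only in tiny initial-condition
# perturbations within the stated observational uncertainty.

def step(x: float) -> float:
    return 3.9 * x * (1.0 - x)        # chaotic regime of the logistic map

def forecast(x0: float, horizon: int) -> float:
    x = x0
    for _ in range(horizon):
        x = step(x)
    return x

observed_state = 0.600                # "measured" initial condition
obs_error = 0.001                     # observational uncertainty

members = [forecast(observed_state + random.uniform(-obs_error, obs_error), horizon=20)
           for _ in range(50)]

mean = sum(members) / len(members)
spread = max(members) - min(members)
print(f"ensemble mean {mean:.3f}, spread {spread:.3f}")
# A small spread means the members agree; a large spread is the forecast's
# built-in warning that the system is in a sensitive state.
```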

The feedback loop. Every six hours, the forecast is verified against new observations. Every forecast error can be attributed to either a flaw in the initial conditions or a flaw in the model equations. This attribution enables systematic improvement. When a specific physical process, like the formation of thunderstorms or the interaction between ocean and atmosphere, is consistently modeled incorrectly, the relevant equations are refined. This feedback loop has been running continuously since the 1950s. Seventy years of daily verification, error attribution, and model improvement have produced the most accurate prediction system in any domain of human endeavor.

Weather prediction works not because meteorologists are smart. It works because someone built a physical model, measured their errors every day, and improved the model for seventy consecutive years.

Resolution matters. When NOAA upgraded from 27-kilometer to 13-kilometer resolution, phenomena that had been invisible, individual thunderstorm cells, terrain-forced precipitation patterns, sea breeze fronts, suddenly appeared in the forecast. The physics had not changed. The grid was simply fine enough to represent processes that were below the resolution of the previous model. A 9-kilometer or 3-kilometer model reveals still more. The European Centre for Medium-Range Weather Forecasts (ECMWF) runs its operational model at 9 kilometers and plans to move to 5 kilometers under the Destination Earth initiative, which aims to create a digital twin of Earth's climate system.3

Nvidia's Earth-2 project takes a different approach: using GPU-accelerated machine learning to produce high-resolution forecasts in minutes rather than hours. The Nvidia approach does not replace physics-based simulation; it uses machine learning trained on physics-based simulation data to produce faster approximations. The physics remains the foundation. The machine learning is acceleration.4

Why this hasn't been applied to economics. The uranium fuel cycle described in Article 16 has perhaps 200 measurable quantities connected by known physical relationships. The atmosphere has billions of grid cells governed by the same equations. The uranium system is vastly simpler. It requires no supercomputer. A Monte Carlo simulation of 200 scenarios runs in seconds on commodity hardware.
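
To make the claim concrete, here is a minimal sketch of such a simulation under stated assumptions. The quantity names and numbers are illustrative placeholders, not figures from Article 16; the point is only that a few hundred scenarios over a small constraint system complete almost instantly on ordinary hardware.

```python
import random

# Hedged sketch of a Monte Carlo pass over a small supply-demand system.
# Quantities and distributions are placeholders, not real fuel-cycle data.

def simulate_year(rng: random.Random) -> float:
    mine_output = rng.gauss(130.0, 10.0)       # primary supply, arbitrary units
    secondary_supply = rng.gauss(25.0, 5.0)    # recycled material and inventory draws
    reactor_demand = rng.gauss(180.0, 8.0)     # demand
    return mine_output + secondary_supply - reactor_demand   # balance; negative = deficit

rng = random.Random(42)
balances = [simulate_year(rng) for _ in range(200)]          # 200 scenarios
p_deficit = sum(b < 0 for b in balances) / len(balances)
print(f"probability of a supply deficit: {p_deficit:.0%}")
```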

The reason no one has built the economic equivalent of the GFS is not technical. It is institutional. The people who understand physical simulation work at weather agencies and aerospace companies. The people who understand commodity supply chains work at trading firms. The people who understand prediction scoring work in decision science departments. These communities do not overlap. There is no institution that combines simulation expertise, domain knowledge of physical supply chains, and rigorous prediction scoring. That is the gap that constraint-graph architecture is designed to fill.

References

  1. NOAA. (2022). Weather and Climate Operational Supercomputing System (WCOSS) documentation.
  2. Bauer, P., Thorpe, A., & Brunet, G. (2015). "The quiet revolution of numerical weather prediction." Nature, 525, 47-55.
  3. Wedi, N. et al. (2025). "Destination Earth: Digital twins of the Earth system." ECMWF Technical Memorandum.
  4. Nvidia. (2023). Earth-2 Technical Documentation.
  5. Voosen, P. (2020). "Europe builds digital twin of Earth to hone climate forecasts." Science, 370(6512), 16-17.
Article 05

Renaissance Technologies: Statistical Prophecy

The Medallion Fund, managed by Renaissance Technologies, has returned approximately 66 percent per year before fees since 1988. After the fund's 5-and-44 fee structure, investors kept roughly 39 percent annually. Over thirty-five years, a dollar invested at inception would have grown to over forty thousand dollars. No other investment vehicle in history has produced comparable risk-adjusted returns over a comparable period.1

Jim Simons, the fund's founder, was a Cold War codebreaker and Fields Medal-caliber mathematician who became chair of the mathematics department at Stony Brook University before pivoting to finance. He did not hire financial analysts. He hired mathematicians, physicists, computer scientists, and computational linguists. His hiring criterion was simple: people who could find patterns in noisy data. Whether the data was encrypted Soviet communications or market prices was, in his view, a secondary concern.

The method, to the extent it is publicly understood, involves ingesting approximately 40 terabytes of new data per day, identifying non-random statistical patterns in price movements across thousands of instruments, and executing hundreds of thousands of small trades that capture tiny edges. The positions are typically held for hours to days. The edges are individually small but aggregate to extraordinary returns because the law of large numbers converts many small positive-expectation bets into a near-certain positive outcome.2
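
A back-of-envelope calculation shows how the law of large numbers does this work. The edge and trade count below are illustrative assumptions, not Renaissance's actual figures.

```python
import math

# Normal approximation to the binomial: with a tiny per-trade edge and a very
# large number of trades, the probability of an overall profitable period is
# close to certainty. Edge and trade count are illustrative assumptions.

edge = 0.505          # probability a single trade is profitable
n_trades = 100_000    # trades per period

mean = n_trades * edge
std = math.sqrt(n_trades * edge * (1 - edge))
z = (n_trades / 2 - mean) / std            # z-score of the break-even point
p_profitable = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))
print(f"P(profitable period) ≈ {p_profitable:.4f}")   # ≈ 0.999 with a half-point edge
```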

What RenTech proves. The Medallion Fund proves that financial markets contain non-random structure that can be exploited by quantitative methods. Prices are not purely random walks. Statistical prediction, applied with sufficient data, sufficient compute, and sufficient mathematical sophistication, extracts signal from noise. This is an important finding because the efficient market hypothesis, in its strong form, holds that no such structure should exist.

What RenTech does not prove. The Medallion Fund does not prove that statistical prediction scales or generalizes. Simons capped the fund at approximately 10 billion dollars because the strategy's capacity is limited. The edges are small and exist in specific market microstructure conditions. When RenTech launched institutional funds, RIEF and RIDA, that attempted to apply similar methods at larger scale and longer holding periods, the returns were mediocre. RIDA lost 14 percent in late 2025. The method that works at one scale and time horizon does not transfer to another.

More fundamentally, RenTech does not know why prices move. The system finds patterns. It does not build causal models. When the patterns break, as they do during regime changes, the system has no mechanism for understanding why and adapting. This is the core difference between statistical inference and physical simulation. A weather model that fails can be diagnosed: which equation was wrong? Which physical process was misrepresented? A statistical model that fails can only be retrained on new data and hope that the new patterns are stable.

RenTech represents the ceiling of what inference can achieve. The returns are extraordinary precisely because they are extracting every available statistical regularity from market data. But the approach cannot answer the question that matters most for long-term prediction: what physical constraint, what supply-demand imbalance, what capacity bottleneck will drive the next major repricing? For that, you need a different architecture entirely, one grounded in the physical reality of supply chains rather than the statistical properties of price series.

References

  1. Zuckerman, G. (2019). The Man Who Solved the Market. Portfolio/Penguin.
  2. Patterson, S. (2010). The Quants. Crown Business.
  3. Lo, A. (2017). Adaptive Markets. Princeton University Press.
  4. Thorp, E. (2017). A Man for All Markets. Random House.
Article 06

Prediction Markets: The Wisdom and Madness of Crowds

In 1906, the English polymath Francis Galton attended a livestock exhibition where 787 people guessed the weight of an ox. No individual guess was correct. The median of all guesses was 1,207 pounds. The actual weight was 1,198 pounds. The crowd's aggregate estimate was within 1 percent of the truth, a result so striking that Galton, who had set out to demonstrate the unreliability of democratic judgment, was forced to revise his own beliefs.1

Prediction markets formalize this phenomenon. They are exchanges where participants buy and sell contracts that pay out based on the outcome of future events. A contract that pays one dollar if a candidate wins an election and zero dollars if she loses will trade at a price that represents the market's aggregate probability estimate. If the contract trades at 0.65, the market believes there is a 65 percent probability of that candidate winning.

The Iowa Electronic Markets, operated by the University of Iowa since 1988, consistently outperformed polls in predicting U.S. election outcomes, typically by several percentage points. Polymarket, a blockchain-based prediction market, processed over 10 billion dollars in trading volume in 2025. Kalshi, the first CFTC-regulated prediction market in the United States, demonstrated approximately 20 percent better accuracy than polling aggregates on the 2024 presidential election.2,3

Why prediction markets work. The aggregation mechanism has three properties that make it effective. First, participants have skin in the game: they lose real money when they are wrong, which suppresses the overconfidence that plagues expert panels. Second, the information sources are diverse: a contract price reflects the collective knowledge of everyone who trades it, from political insiders to statistical modelers to casual observers. Third, the updating is continuous: as new information arrives, traders adjust their positions and the price adjusts in real time.

Why prediction markets fail. The limitations are equally important. Markets can only exist for events that someone defines and creates a contract for. They are subject to manipulation when liquidity is thin. They suffer from correlated blind spots when participants share the same information diet. And they cannot model causal chains. A prediction market can tell you that there is a 35 percent probability of an oil supply disruption in the next twelve months. It cannot tell you that the disruption, if it occurs, will reduce global supply by 4 million barrels per day because the Strait of Hormuz handles 21 percent of global petroleum trade and there is no alternative route for Qatari LNG exports.

The opportunity is to use prediction market prices not as the prediction itself but as an input signal. When Crystal Ball's constraint graph produces a probability estimate that diverges significantly from the market price, one of them is wrong. If the divergence is greater than 20 percentage points, either the graph has a miscalibrated edge or the market has a blind spot. Both are actionable. The divergence becomes a research signal: investigate why the model and the market disagree, and the investigation will either improve the model or identify a mispricing.
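
A minimal sketch of that divergence check, assuming hypothetical function and field names, looks like the following; only the 20-point trigger comes from the text above.

```python
# Hedged sketch of the model-versus-market divergence check. Names are
# hypothetical; the 20-percentage-point threshold is the trigger described above.

DIVERGENCE_THRESHOLD = 0.20

def divergence_signal(model_probability: float, market_price: float) -> dict:
    """Compare a constraint-graph estimate with a prediction-market price."""
    gap = model_probability - market_price
    return {
        "divergence": gap,
        "actionable": abs(gap) >= DIVERGENCE_THRESHOLD,
        # Either the graph has a miscalibrated edge or the market has a blind
        # spot; the signal only says the disagreement is worth investigating.
        "direction": "model_above_market" if gap > 0 else "model_below_market",
    }

print(divergence_signal(model_probability=0.62, market_price=0.35))
```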

References

  1. Surowiecki, J. (2004). The Wisdom of Crowds. Doubleday.
  2. Arrow, K. et al. (2008). "The Promise of Prediction Markets." Science, 320, 877-878.
  3. Wolfers, J. & Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2), 107-126.
  4. Berg, J. et al. (2008). "Results from a Dozen Years of Election Futures Markets Research." Handbook of Experimental Economics Results.
Article 07

Digital Twins: Simulating Earth

A digital twin is a virtual replica of a physical system that is continuously updated with real-time sensor data and can be simulated forward in time. The concept originated in manufacturing, where digital replicas of jet engines and industrial equipment predict maintenance needs. Applied to the entire Earth, the concept becomes the most ambitious prediction project ever attempted.

The European Union's Destination Earth initiative, launched in 2024, is building two digital twins of the Earth system. The Climate Digital Twin runs multi-decadal climate projections at 5-kilometer resolution using three different models to capture structural uncertainty. The Extremes Digital Twin operates at 4.4-kilometer global resolution with sub-kilometer zoom for regional storm prediction. Both are hosted at ECMWF in Bologna and run on the LUMI supercomputer in Finland.1

Nvidia's Earth-2 takes a complementary approach: GPU-accelerated simulation that reduces computation time from hours to minutes. The FourCastNet model, trained on decades of ECMWF reanalysis data, produces global weather forecasts at 0.25-degree resolution in seconds. This is not a replacement for physics-based simulation. It is an emulator, a machine learning model trained on the outputs of physics-based models that reproduces those outputs much faster.2

The insight for economic prediction. The atmosphere alone contains on the order of 10^44 molecules; even after discretization, the operational model tracks billions of variables at every time step. A digital twin of Earth is the most complex simulation ever attempted. A commodity supply chain has perhaps 200 measurable quantities. Comparing only the tracked variables, the ratio is roughly ten million to one. If the technology exists to simulate the entire atmosphere at kilometer resolution, the computational challenge of simulating a supply chain is not merely tractable. It is trivial.

What the digital twin paradigm contributes is the concept of continuous updating. A static model that is run once and consulted is fragile. A digital twin that ingests new data continuously and re-simulates is adaptive. When a new mine production report is published, or when a reactor shuts down for maintenance, or when a government announces a change in enrichment policy, the constraint graph ingests the new data and re-simulates. The prediction changes because the physical reality has changed. This is the key architectural property: the model tracks reality, not the other way around.
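
A sketch of that ingest-and-re-simulate loop, with hypothetical class and field names rather than Crystal Ball's actual interfaces, makes the architectural property concrete.

```python
from dataclasses import dataclass, field

# Hedged sketch of continuous updating. Class, method, and quantity names are
# hypothetical illustrations, not the actual constraint-graph API.

@dataclass
class ConstraintGraph:
    quantities: dict = field(default_factory=dict)   # measurable quantities by name

    def ingest(self, name: str, value: float) -> None:
        """Update the model's copy of reality when a new observation arrives."""
        self.quantities[name] = value

    def simulate(self) -> float:
        """Re-run the forward simulation from the current state (placeholder logic)."""
        return self.quantities.get("supply", 0.0) - self.quantities.get("demand", 0.0)

graph = ConstraintGraph({"supply": 155.0, "demand": 180.0})
print(graph.simulate())                  # baseline balance

graph.ingest("supply", 148.0)            # a new mine production report arrives
print(graph.simulate())                  # the prediction changes because reality changed
```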

References

  1. Wedi, N. et al. (2025). "Destination Earth: Digital twins of the Earth system." ECMWF Technical Memorandum.
  2. Nvidia. (2023). Earth-2 Technical Documentation.
  3. Voosen, P. (2020). "Europe builds digital twin of Earth to hone climate forecasts." Science, 370(6512), 16-17.
Article 08

Agent-Based Simulation

In early 2025, a twenty-year-old computer science student in China named MiroFish released an open-source agent-based simulation framework. Within ten days, it had accumulated 33,000 GitHub stars and attracted 4 million dollars in funding. The concept was simple: populate a simulated world with AI agents that have distinct personalities, knowledge, and behavioral patterns, then observe the emergent dynamics. Applied to financial markets, the simulation populates a virtual exchange with agents representing different investor archetypes and observes how they react to catalysts.1

The intellectual lineage goes back further. Joshua Epstein and Robert Axtell's Growing Artificial Societies (1996) demonstrated that complex social phenomena, trade, migration, conflict, could emerge from simple behavioral rules applied to a population of heterogeneous agents. Leigh Tesfatsion's work on agent-based computational economics showed that market dynamics that are intractable in standard equilibrium models arise naturally from agent interaction.2

The six archetypes. A minimal market simulation requires at least six distinct agent types, each embodying a different decision-making framework. The Value Investor buys assets trading below intrinsic value and sells above it, with a long time horizon and tolerance for short-term losses. The Momentum Trader follows price trends, buying what is rising and selling what is falling, with a short time horizon. The Macro Strategist responds to regime changes: interest rate shifts, inflation surprises, geopolitical events. The Corporate Insider has operational knowledge of specific industries: supply chain bottlenecks, customer dynamics, competitive shifts. The Retail Participant responds to narratives, social media buzz, and the behavior of other retail participants. The Short Seller actively searches for overvaluation, deterioration, and fraud signals.

When a catalyst is introduced, a new fact or a constraint graph prediction, each archetype responds according to its rules. A simultaneous "buy" signal from five or six archetypes constitutes a high-confidence convergence. A three-three split indicates a contested catalyst where the outcome depends on which archetype's framing proves correct. The simulation does not predict what will happen. It predicts how different market participants will react to what happens. This is a distinct and valuable layer of prediction that sits on top of the physical simulation described in Article 10.

The practical implementation runs entirely on local compute. Each archetype is a local language model with a 200-word system prompt constraining its perspective. Six parallel calls, one per archetype, complete in seconds across a multi-node cluster. The cost is zero beyond electricity. The output is structured: action (buy/sell/hold), conviction (1-10), reasoning, and time horizon. Unanimous signals are rare and valuable.
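
A sketch of the structured output and the convergence check might look like the following; the archetype responses are stubbed with hypothetical values, where the described system would obtain each one from a local language model constrained by its system prompt.

```python
from dataclasses import dataclass

# Sketch of the structured archetype output and a simple convergence check.
# Responses are stubbed; the thresholds loosely follow the text above.

@dataclass
class ArchetypeSignal:
    archetype: str
    action: str        # "buy" | "sell" | "hold"
    conviction: int    # 1-10
    reasoning: str
    horizon: str

signals = [
    ArchetypeSignal("value_investor", "buy", 7, "trading below intrinsic value", "years"),
    ArchetypeSignal("momentum_trader", "buy", 6, "price trend turning up", "days"),
    ArchetypeSignal("macro_strategist", "buy", 5, "rate regime supportive", "months"),
    ArchetypeSignal("corporate_insider", "buy", 8, "supply bottleneck visible", "quarters"),
    ArchetypeSignal("retail_participant", "hold", 4, "no narrative yet", "weeks"),
    ArchetypeSignal("short_seller", "hold", 3, "no deterioration signals", "months"),
]

buys = sum(s.action == "buy" for s in signals)
convergence = "high-confidence" if buys >= 5 else "contested" if buys <= 3 else "moderate"
print(f"{buys}/6 archetypes signal buy: {convergence}")
```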

References

  1. MiroFish. (2025). GitHub repository.
  2. Epstein, J.M. & Axtell, R. (1996). Growing Artificial Societies: Social Science from the Bottom Up. MIT Press.
  3. Tesfatsion, L. (2006). "Agent-Based Computational Economics." Handbook of Computational Economics, Vol. 2.
  4. Bonabeau, E. (2002). "Agent-based modeling: Methods and techniques for simulating human systems." Proceedings of the National Academy of Sciences, 99(3), 7280-7287.
Article 09

The Superforecasters

The detailed methodology of Tetlock's IARPA tournament is covered in Article 3. This article focuses on the specific individuals who emerged as superforecasters and what their practices tell us about the nature of prediction as a learnable skill.

The superforecasters were not, for the most part, domain experts. Among the top performers were a retired irrigation engineer from Nebraska, a filmmaker from Brooklyn, a former ballroom dance instructor, and a pharmacist who forecasted in his spare time. What they shared was not expertise but a set of cognitive practices that, taken together, constituted a method.1

Fermi decomposition. When faced with a complex question like "Will Iran and Israel engage in direct military confrontation before December 2025?", superforecasters did not attempt to answer it directly. They decomposed it into component questions. What is the base rate of direct military confrontations between hostile states that have been in a cold conflict for more than twenty years? What has changed in the last twelve months that would make this base rate higher or lower? What would the precursor signals look like, and which of them are present? Each component question is easier to estimate than the whole, and the assembled estimates are more reliable than a direct gut judgment.

Reference class reasoning. Helmuth von Moltke the Elder, the Prussian military strategist, believed in deliberating fully before committing to a plan, then executing decisively. Superforecasters applied a version of this principle: before analyzing the specific case, they identified the reference class. How often do events like this one occur? What is the outside view? This practice counteracts the narrative bias that makes every current situation feel unique and unprecedented. Most situations are not unprecedented. They have historical analogues, and the historical frequency is the best starting point.

The update discipline. The difference between a good forecaster and a superforecaster was often the frequency and precision of updates. Where an ordinary forecaster might check a question once and move on, a superforecaster would return to the question daily, scanning for new information that should shift the estimate. The updates were small, typically two to five percentage points, but they accumulated. Over the course of a question's lifetime, a superforecaster might make twenty or thirty revisions. The final estimate, shaped by this iterative process, was consistently more accurate than the initial estimate.

Intellectual humility without paralysis. Superforecasters were comfortable saying "I don't know." They expressed uncertainty in precise numerical terms. They were willing to change their minds. But they were not indecisive. Helmut Schmidt's dictum, "people who have visions should go see a doctor," captures the superforecaster temperament: skeptical of grand narratives, attentive to evidence, decisive once the evidence is sufficient. The commitment to perpetual beta, always updating, never arriving at a final answer, was the disposition that most strongly predicted accuracy.

The translation of these practices into computational systems is the subject of this entire knowledge base. Fermi decomposition maps to the node structure of a constraint graph. Reference class reasoning maps to the base rate databases that ground prior probabilities. Update discipline maps to the fact pipeline that triggers re-simulation on new evidence. Intellectual humility maps to the adversarial falsification system described in Article 13 that tries to break every prediction. The superforecasters showed that prediction is a method. The question is whether the method can be embedded in architecture.

References

  1. Tetlock, P.E. & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.
  2. Mellers, B. et al. (2014). "Psychological Strategies for Winning a Geopolitical Forecasting Tournament." Psychological Science, 25(5), 1106-1115.
  3. Satopaa, V. et al. (2014). "Combining multiple probability predictions using a simple logit model." International Journal of Forecasting, 30(2), 344-356.
Pillar III

The Method

Article 10

From Inference to Simulation

Every prediction system currently applied to markets, geopolitics, and the broader economy does inference. None of them does simulation. This distinction is the entire thesis of Crystal Ball, and it is worth being precise about what the words mean.

Inference is reasoning from evidence to conclusion. A doctor observes symptoms and infers a diagnosis. An intelligence analyst reads intercepted communications and infers an adversary's intentions. A hedge fund quantitative model observes price patterns and infers future price direction. A large language model reads a prompt and infers the most likely continuation. In every case, the process is the same: given facts, reason about implications.1

Simulation is something fundamentally different. It is the construction of a causal model of a system, the measurement of the system's current state, and the forward propagation of that state through the model's equations. A weather simulation does not reason about what the weather will be. It computes what the weather must be, given the current temperature, pressure, humidity, and wind speed at every grid point, and the laws of physics that govern how those quantities evolve.2

The difference is not one of sophistication. Inference can be extraordinarily sophisticated. Renaissance Technologies employs two hundred PhDs and fifty thousand compute cores to perform inference on petabytes of market data, and the Medallion Fund has returned 66 percent annually before fees for over thirty years.3 The difference is one of architecture. Inference asks: given what I have observed, what is likely true? Simulation asks: given what I know about how this system works, what must happen next?

Inference asks what is likely true. Simulation asks what must happen next. The first is bounded by the observer's imagination. The second is bounded only by the accuracy of the model.

The inference ceiling. Consider each of the major prediction systems currently operating in the world, and notice that every one of them is doing inference.

Renaissance Technologies finds statistical patterns in historical price data. The patterns are real, the returns are extraordinary, but the system has no model of why prices move. It cannot explain its predictions. It cannot extrapolate beyond the statistical regime in which the patterns were observed. When the underlying market structure changes, as it did during the 2020 COVID crash, when the institutional funds RIEF and RIDA suffered double-digit losses, the statistical patterns break and the system breaks with them. This is the fundamental fragility of pattern recognition without causal understanding.4

Prediction markets aggregate the opinions of participants who have money at stake. Polymarket processed over ten billion dollars in trading volume in 2025. The prices are efficient in the sense that they incorporate diverse information rapidly. But prediction market prices are not forward simulations. They are aggregated beliefs. When participants share the same blind spot, as they did when political prediction markets systematically underpriced Donald Trump's chances in 2016, the aggregation mechanism fails precisely because it is aggregating the same type of inference from overlapping information sets.5

Large language models perform inference on training data. When asked to predict the future, they generate text that is statistically consistent with the patterns in their training corpus. This is pattern completion, not simulation. An LLM cannot model the physical constraints of a uranium supply chain because it has no representation of the causal structure. It can tell you that enrichment capacity is important. It cannot compute the month in which enrichment capacity becomes the binding constraint under a given set of demand assumptions. The first is inference. The second requires simulation.6

Human expert analysts perform inference from experience and domain knowledge. Tetlock's twenty-year study demonstrated that this is the weakest prediction architecture of all. The average expert performed barely better than chance over 28,361 predictions, and the most confident experts performed the worst.7

Why simulation works where inference doesn't. The reason weather prediction improved from 50 percent accuracy in 1970 to 90 percent accuracy today is not that meteorologists became smarter. It is that someone built a simulation. NOAA's Weather and Climate Operational Supercomputing System discretizes the atmosphere into a three-dimensional grid. Each grid cell contains measurable quantities: temperature, pressure, humidity, wind velocity. The cells are connected by equations that encode the laws of physics: conservation of mass, conservation of energy, the Navier-Stokes equations for fluid dynamics. The simulation starts from measured initial conditions and steps forward in time. The forecast is not an opinion. It is the output of a computation.8

This architecture has three properties that inference lacks. First, it is physically grounded. The model is constrained by conservation laws. Mass cannot be created or destroyed. Energy is conserved. These constraints eliminate entire regions of the prediction space that inference would have to consider. Second, it is falsifiable at every node. Every grid cell produces a prediction that can be checked against a sensor reading. When the prediction is wrong, the error can be attributed to a specific model parameter, which can be corrected. Inference systems rarely have this property because the reasoning path from input to output is opaque. Third, it is scale-independent. The same model runs at 13 kilometer resolution or 3 kilometer resolution. The physics is the same. The resolution determines the granularity of the prediction, not its fundamental approach.

The gap nobody has filled. If physical simulation is the architecture that makes prediction work, why hasn't anyone applied it to economics and finance? The answer is institutional, not technical.

Weather agencies have no mandate to model markets. Their budgets, their expertise, and their institutional incentives are oriented toward atmospheric science. Quantitative hedge funds, which do have financial incentives, have converged on statistical inference because it is profitable in the short term and because the mathematical culture of quantitative finance descends from physics through a lineage, Bachelier to Black-Scholes to RenTech, that treats prices as stochastic processes rather than outputs of physical systems. Prediction markets aggregate opinion but do not model causality. Academic economics has been dominated since the 1970s by rational expectations models that assume away the very frictions that create predictable supply-demand imbalances.9

The opportunity, therefore, is structural. Nobody is doing physical simulation of economic supply chains because the expertise is siloed. The people who understand physical simulation work at weather agencies and aerospace companies. The people who understand supply chains work at commodity trading firms and industrial companies. The people who understand prediction scoring work in decision science and psychology. No institution combines all three.

What supply chain simulation looks like. A uranium supply chain has perhaps two hundred measurable quantities. Mine production in Kazakhstan, measured in millions of pounds of U3O8. Conversion capacity, measured in kilograms of UF6. Enrichment capacity, measured in separative work units. Reactor demand, measured in gigawatts of installed capacity times fuel consumption per gigawatt. Spot price. Term contract price. Inventory levels. New mine development pipeline with known timelines.

These quantities are connected by known transfer functions. One gigawatt of reactor capacity requires approximately 400,000 pounds of U3O8 per year. Enrichment capacity is constrained by the number and capacity of centrifuge cascades, which have known throughput limits. New mines require seven to ten years from discovery to production, a delay function that is determined by geology, regulation, and capital availability. Spot price below sixty dollars per pound makes no new mine development economical, a threshold function that creates a floor under long-term supply response.

This is the same architecture as a weather model. Measurable quantities at nodes. Known transfer functions on edges. Physical bounds that constrain the solution space. The system is vastly simpler than the atmosphere. It requires no supercomputer. A Monte Carlo simulation of two hundred scenarios across twenty-four months runs in seconds on a laptop.

The output is not a price prediction. It is a map of physical constraints. The simulation identifies binding bottlenecks: the nodes where demand exceeds supply with highest probability. "Enrichment capacity becomes the binding constraint in month fourteen in 72 percent of scenarios." This is a different kind of prediction than "uranium will go up." It is a statement about physical reality that is either true or false, that can be checked against measurable data, and that has causal explanatory power.10

The role of the LLM inverts. In every existing system, the LLM or the human expert is the generator of predictions. The system asks: "What do you think will happen?" and the oracle answers. In a constraint-graph architecture, the LLM's role inverts. The simulation generates the predictions. The LLM's role is validation: "Here are the three bottlenecks the simulation identified. Does this make sense given what you know? What physical constraint am I missing?"11

This is precisely where LLMs are strongest. They are mediocre generators of novel predictions because they are pattern-completion engines operating on training data. They are excellent validators because they can synthesize domain knowledge from thousands of sources to check whether a specific claim is consistent with known physics. The constraint graph generates the hypothesis. The LLM stress-tests it. This is the architectural inversion that makes Crystal Ball possible.

The meta-learning signal. Every prediction generated by the constraint graph has a source tag: "constraint_graph" or "opus_inference." When predictions resolve, the system compares the accuracy of graph-sourced predictions against LLM-sourced predictions. If the graph outperforms, the system should trust the graph more and generate more graph-based predictions. If the LLM outperforms on certain categories, the system should examine why the graph failed and whether a node or edge is miscalibrated.
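
One way the source comparison could be computed, sketched here under the assumption that each resolved prediction carries a source tag, a stated probability, and a binary outcome; the record layout is hypothetical.

    from collections import defaultdict

    def brier_by_source(resolved):
        """Mean Brier score per source tag ("constraint_graph" vs "opus_inference").
        Each record: {"source": str, "forecast": float, "outcome": 0 or 1}."""
        totals, counts = defaultdict(float), defaultdict(int)
        for r in resolved:
            totals[r["source"]] += (r["forecast"] - r["outcome"]) ** 2
            counts[r["source"]] += 1
        return {source: totals[source] / counts[source] for source in totals}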

This is the flywheel. Simulate. Predict. Score. Compare sources. Recalibrate. Simulate again. Each turn of the flywheel makes the model more accurate, not because anyone got smarter, but because the architecture is designed to learn from its own errors. It is the same flywheel that took weather forecasting from 50 percent to 90 percent accuracy over fifty years, compressed into a domain where the physics is simpler and the feedback cycle is faster.

The question is not whether anyone is smart enough to predict the future. The question is whether anyone has built a model of the present that is accurate enough to compute it.

References

  1. Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
  2. Bauer, P., Thorpe, A., & Brunet, G. (2015). "The quiet revolution of numerical weather prediction." Nature, 525, 47-55.
  3. Zuckerman, G. (2019). The Man Who Solved the Market: How Jim Simons Launched the Quant Revolution. Portfolio/Penguin.
  4. Patterson, S. (2010). The Quants: How a New Breed of Math Whizzes Conquered Wall Street and Nearly Destroyed It. Crown Business.
  5. Arrow, K. et al. (2008). "The Promise of Prediction Markets." Science, 320, 877-878.
  6. Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search. 2nd ed. MIT Press.
  7. Tetlock, P.E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press.
  8. NOAA. (2022). Weather and Climate Operational Supercomputing System (WCOSS) documentation.
  9. Sargent, R.G. (2005). "Verification and validation of simulation models." Proceedings of the Winter Simulation Conference, 130-143.
  10. Oreskes, N., Shrader-Frechette, K., & Belitz, K. (1994). "Verification, Validation, and Confirmation of Numerical Models in the Earth Sciences." Science, 263(5147), 641-646.
  11. Koller, D. & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
  12. Heckerman, D. (1995). "A Tutorial on Learning with Bayesian Networks." Microsoft Research Technical Report MSR-TR-95-06.
Article 11

Constraint Graph Architecture

The central thesis of Article 10 is that prediction should be simulation, not inference. This article describes the architecture that makes simulation possible for economic and supply chain systems: the constraint graph.

A constraint graph is a directed graph of measurable physical quantities connected by causal edges that encode known transfer functions. Each node represents a quantity that can be measured and updated from real-world data sources. Each edge represents a known relationship between two quantities, whether linear, threshold-based, or delay-based. The graph is the model. Simulation is the act of propagating current node values forward through the edges to compute future states.1

Nodes. A node is any physical quantity that satisfies three criteria. First, it must be measurable: there must be a public or obtainable data source that reports its current value. Mine production in millions of pounds, enrichment capacity in separative work units, reactor demand in gigawatts. Second, it must have physical bounds: upper and lower limits determined by geology, engineering, regulation, or economics. A mine cannot produce negative uranium. Enrichment capacity cannot exceed the installed centrifuge cascade throughput. Third, it must have a known update cadence: how frequently new measurements become available. SEC filings are quarterly. EIA reports are monthly. Spot prices are daily.

Each node also carries a confidence score between 0 and 1 that reflects the reliability of the current measurement. A node updated from a direct measurement (mine production reported by the operator) has high confidence. A node inferred from secondary sources (estimated inventory levels based on trade flow data) has lower confidence. The confidence score determines the width of the uncertainty band used in Monte Carlo simulation.
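
A minimal sketch of a node record with the criteria described above; the field names are illustrative, not a specification.

    from dataclasses import dataclass

    @dataclass
    class Node:
        """A measurable physical quantity in the constraint graph."""
        name: str                  # e.g. "kazakhstan_mine_production"
        value: float               # most recent measurement
        unit: str                  # e.g. "Mlbs U3O8 / year"
        lower_bound: float         # physical floor (production cannot go negative)
        upper_bound: float         # physical ceiling (installed capacity)
        confidence: float          # 0-1 reliability of the current measurement
        update_cadence_days: int   # e.g. 90 for quarterly filings, 30 for EIA reports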

Edges. Edges encode the causal relationships between nodes. The relationship types, with a code sketch following the list, are:

LINEAR

y = coefficient * x

The simplest relationship. Reactor demand = installed capacity (GW) times 0.4 Mlbs/GW/year. The coefficient is determined by the physics of nuclear fission and fuel burnup. It varies slightly by reactor type but the aggregate is stable.

THRESHOLD

y = f(x) with discontinuity at trigger value

Non-linear responses at critical values. When uranium spot price drops below $60/lb, new mine development ceases because the economics don't work. Above $80/lb, marginal ISR operations restart. The threshold creates a step function in the supply response curve. Threshold edges are the most predictively valuable because they identify price levels where behavior changes qualitatively.

DELAY

y(t) = f(x(t - lag))

Time-shifted relationships. A mine development decision made today produces first ore in 7-10 years. An enrichment plant expansion started today reaches full capacity in 5-7 years. Delay edges are what make supply-demand imbalances persistent: even when the market signals that more supply is needed, the physical infrastructure cannot respond for years.

DEPLETION

y(t) = y(t-1) * (1 - depletion_rate)

Resources that decline over time. ISR well field production drops 3-5% per year from peak. Government stockpiles draw down at a fixed annual rate. Depletion edges create an inherent clock: the available supply diminishes with each passing year unless new capacity is added.
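
The four transfer functions, written out as a sketch in Python; the signatures are illustrative and assume scalar node values.

    def linear(x, coefficient):
        # y = coefficient * x, e.g. reactor demand from installed gigawatts
        return coefficient * x

    def threshold(x, trigger, below, above):
        # Step response at a critical value, e.g. mine development switching off below $60/lb
        return below if x < trigger else above

    def delay(history, lag_steps):
        # y(t) = f(x(t - lag)); return the lagged input, or the oldest value if too early
        return history[-lag_steps] if len(history) >= lag_steps else history[0]

    def depletion(previous, rate):
        # y(t) = y(t-1) * (1 - depletion_rate), e.g. 3-5% annual ISR well field decline
        return previous * (1.0 - rate)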

Simulation. Forward simulation uses Monte Carlo methods to propagate uncertainty through the graph. For each of N scenarios (typically 200), the simulation samples each uncertain edge coefficient from a distribution determined by the edge's confidence. A coefficient of 0.85 with confidence 0.8 is sampled uniformly from [0.68, 1.02] (plus or minus 20 percent). A coefficient with confidence 0.5 is sampled from [0.51, 1.19] (plus or minus 40 percent). The wider the uncertainty, the more the scenarios diverge, which is exactly the behavior we want: the simulation's spread IS the uncertainty estimate, just as in ensemble weather forecasting.2

For each scenario, the simulation steps forward one month at a time. At each time step, it propagates values through linear and threshold edges, applies depletion functions, and checks delay edges to see if any lagged inputs have arrived. The output is a trajectory for every node across every scenario. From these trajectories, the simulation computes the median, 10th percentile, and 90th percentile for each node at each time step.
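
A compressed sketch of the sampling-and-stepping loop, reduced to two uncertain coefficients and a single supply-demand balance so it stays self-contained; the node names and spread values are illustrative rather than a full graph implementation.

    import random

    def sample_coefficient(nominal, spread):
        """Uniform draw within +/- spread of nominal: spread 0.20 corresponds to
        confidence 0.8, spread 0.40 to confidence 0.5, as described above."""
        return nominal * random.uniform(1.0 - spread, 1.0 + spread)

    def run_scenarios(mine_production_mlbs, installed_gw, months=24, n=200):
        """Toy forward run: supply depletes, demand follows installed capacity,
        and the monthly supply-demand balance is recorded for every scenario."""
        trajectories = []
        for _ in range(n):
            demand_per_gw = sample_coefficient(0.4 / 12, 0.20)     # Mlbs per GW per month
            monthly_depletion = sample_coefficient(0.04 / 12, 0.40)
            supply = mine_production_mlbs / 12                      # Mlbs per month
            path = []
            for _ in range(months):
                supply = supply * (1.0 - monthly_depletion)
                balance = supply - installed_gw * demand_per_gw
                path.append(balance)
            trajectories.append(path)
        return trajectories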

Bottleneck identification. The most valuable output of the simulation is not any single node trajectory. It is the identification of binding constraints: nodes where demand exceeds supply in a high fraction of scenarios. A node is flagged as a bottleneck when it reaches within 10 percent of its physical bound in more than 60 percent of scenarios. The bottleneck severity ranking orders these constraints by the fraction of scenarios in which they bind and the time horizon at which they bind.

This is the prediction that the constraint graph generates. Not "uranium will go up." Instead: "Western enrichment capacity becomes the binding constraint in month 14 in 73 percent of scenarios, with the bottleneck severity increasing to 89 percent by month 24." This prediction is specific, falsifiable, time-bounded, and physically grounded. It is also the kind of prediction that no amount of expert inference would produce, because it emerges from the interaction of multiple nodes rather than from any single observable trend.
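
A sketch of the bottleneck flag described above, assuming the scenario values for a node at a given time step and its upper physical bound are already available.

    def is_bottleneck(scenario_values, upper_bound, margin=0.10, frequency=0.60):
        """Flag a node as a binding constraint when it comes within `margin` of its
        physical bound in more than `frequency` of scenarios at a given time step."""
        near_bound = sum(1 for v in scenario_values if v >= upper_bound * (1.0 - margin))
        return near_bound / len(scenario_values) > frequency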

The LLM as validator. The constraint graph generates predictions. The LLM's role is to validate them. After the simulation completes, a single Opus call reviews the top three bottleneck predictions with the prompt: "Given the current state of the uranium fuel cycle and recent industry developments, do these simulation-identified bottlenecks make physical sense? What constraint am I missing? What assumption might be wrong?" If the LLM flags a gap, a missing node or a miscalibrated edge, the flag is logged for human review. The predictions ship regardless. The system tracks whether graph-based predictions or LLM-based predictions score better over time, which is the meta-learning signal that guides future architecture decisions.

Comparison to probabilistic graphical models. The constraint graph shares DNA with Bayesian networks but differs in important ways. A Bayesian network represents probability distributions over random variables and infers posterior probabilities given evidence. A constraint graph represents physical quantities with known transfer functions and simulates forward trajectories. The Bayesian network answers "what is the probability of X given Y?" The constraint graph answers "what happens to X over the next 24 months given the current state of the system?" Both are directed acyclic graphs. Both can be used for prediction. But the constraint graph's predictions are grounded in physics rather than statistics, which makes them more interpretable and more robust to distribution shifts.3

References

  1. Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
  2. Koller, D. & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
  3. Heckerman, D. (1995). "A Tutorial on Learning with Bayesian Networks." Microsoft Research Technical Report MSR-TR-95-06.
  4. Spirtes, P. et al. (2000). Causation, Prediction, and Search. 2nd ed. MIT Press.
Article 12

Capital Cycles as Prediction Framework

Marathon Asset Management, the London-based investment firm, built its track record on a single observation: the returns to investors in commodity and industrial businesses are determined primarily by the supply side, not the demand side. Demand is hard to predict. Supply is visible years in advance because capital expenditure decisions, mine developments, factory constructions, and fleet orders have long lead times and are publicly disclosed.1

The capital cycle framework identifies a repeating pattern. In the expansion phase, high prices attract capital. New mines are developed, new factories are built, new ships are ordered. Because of construction lead times, the new capacity arrives years after the investment decision, often at the point where the market has already softened. The resulting oversupply depresses prices and returns. Marginal operators exit. Capital flees the sector. This is the contraction phase, and it sets the stage for the next cycle because no new capacity is being built while existing capacity depreciates and depletes.

The prediction comes from timing the contraction-to-expansion transition. When an industry is in deep contraction, when capital expenditure has collapsed, when marginal producers have gone bankrupt, when the remaining producers cannot cover replacement costs at current prices, a structural shortage is developing. The shortage is invisible to most market participants because it is defined by what is NOT being built, and you cannot observe an absence by reading today's headlines. You can only observe it by building a model of the capacity pipeline and simulating forward.

Uranium as case study. The uranium market entered a deep contraction after the Fukushima disaster in 2011. Japan shut down its entire reactor fleet. Germany committed to nuclear exit. Spot prices fell from over $70/lb to below $20/lb. Mines closed. Exploration budgets collapsed to near zero. Cameco placed its flagship McArthur River mine on care and maintenance. No new mine development was initiated anywhere in the world.

Meanwhile, the physical demand for uranium continued. China was building reactors at a rate of six to eight per year. India, Korea, and Russia were expanding their fleets. The existing global fleet of 440 reactors still needed approximately 180 million pounds of U3O8 equivalent per year. Primary mine production could supply only about 145 million pounds. The gap, roughly 35 million pounds per year, was filled by drawing down secondary sources: government stockpiles, commercial inventories, and underfeeding at enrichment plants.

The capital cycle framework translates directly into constraint graph architecture. The "supply destruction" phase is when node values for mine production, exploration spending, and new project pipeline fall below sustainability thresholds. The "structural shortage" phase is when the simulation identifies binding constraints in supply nodes that cannot be resolved within the delay function timelines of new mine development. The "price discovery" phase is when the market recognizes the constraint and reprices. The constraint graph makes the cycle observable before the market reprices, which is the definition of a predictive edge.

References

  1. Marathon Asset Management. (2015). Capital Returns: Investing Through the Capital Cycle. Palgrave Macmillan.
  2. Chancellor, E. (2004). Capital Account: A Fund Manager Reports on a Turbulent Decade. Texere.
  3. World Nuclear Association. (2023). Nuclear Fuel Report.
Article 13

Adversarial Falsification

Karl Popper argued in 1959 that the demarcation between science and non-science is falsifiability. A theory is scientific if and only if it makes predictions that could, in principle, be proven wrong. A theory that accommodates every possible outcome, that is revised after the fact to explain whatever happened, is not a theory at all. It is a narrative.1

Most prediction systems violate this principle. An analyst who says "I think oil will go up because of geopolitical tensions" has not made a falsifiable prediction. There is no specified price level, no time horizon, no mechanism by which the claim would be proven wrong. If oil goes up, the analyst claims credit. If oil goes down, the analyst claims "the market hasn't realized it yet." The prediction is irrefutable, which means it is worthless.

The falsifier architecture. Adversarial falsification inverts the standard approach. For every prediction the system generates, it also generates three to five specific conditions that would kill the prediction. These are not vague reservations. They are specific, observable, time-bounded claims:

"If Kazakhstan announces production exceeding 65 Mlbs for 2026, the enrichment bottleneck prediction weakens because primary supply is higher than modeled."

"If TENEX offers new enrichment contracts to Western utilities, the effective Western enrichment capacity is higher than the graph assumes."

"If three or more new ISR mines enter production before 2028, the depletion rate model is too pessimistic."

Each falsifier is connected to the fact pipeline. When a new fact arrives (a news article, an SEC filing, a government report), the system checks whether it matches any active falsifier's keywords. If a match is found, a quick automated check determines whether the falsifier has been triggered. If two or more falsifiers on the same prediction trigger, the prediction is automatically resolved as incorrect, as in the sketch below.2
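
A minimal sketch of the keyword match and the two-trigger rule; the data structures are hypothetical, and a production system would use more than simple substring matching.

    from dataclasses import dataclass

    @dataclass
    class Falsifier:
        prediction_id: str
        description: str       # the specific, observable kill condition
        keywords: list         # terms the fact pipeline scans for
        triggered: bool = False

    def apply_fact(fact_text, falsifiers):
        """Mark falsifiers whose keywords appear in a new fact, then return the ids
        of predictions with two or more triggered falsifiers (resolved as incorrect)."""
        text = fact_text.lower()
        for f in falsifiers:
            if any(keyword.lower() in text for keyword in f.keywords):
                f.triggered = True
        trigger_counts = {}
        for f in falsifiers:
            if f.triggered:
                trigger_counts[f.prediction_id] = trigger_counts.get(f.prediction_id, 0) + 1
        return [pid for pid, count in trigger_counts.items() if count >= 2]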

This architecture makes the prediction system anti-fragile. Every failed prediction improves the model because the falsifier identifies which specific assumption was wrong. The enrichment edge coefficient was too low. The depletion rate was too aggressive. The delay function for mine development was too long. Each correction makes the next simulation more accurate. The system does not learn from being right. It learns from being wrong in specific, diagnosed ways.

The practice of adversarial falsification also combats confirmation bias at the architectural level. A human analyst naturally seeks evidence that confirms their thesis. A system that generates its own kill conditions and actively watches for disconfirming evidence has no such bias. It is trying to prove itself wrong with the same vigor that it tries to prove itself right. This is the computational implementation of what Tetlock identified as the strongest predictor of superforecasting accuracy: intellectual humility combined with relentless self-correction.

References

  1. Popper, K. (1959). The Logic of Scientific Discovery. Routledge.
  2. Lakatos, I. (1978). The Methodology of Scientific Research Programmes. Cambridge University Press.
  3. Meehl, P.E. (1978). "Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology." Journal of Consulting and Clinical Psychology, 46, 806-834.
Article 14

Bayesian Updating in Practice

Thomas Bayes, an English Presbyterian minister, described in a 1763 posthumous paper the mathematical relationship between a prior probability, new evidence, and the resulting posterior probability. The formula is simple: P(H|E) = P(E|H) times P(H) divided by P(E). The prior probability of a hypothesis, multiplied by the likelihood of the observed evidence given that hypothesis, divided by the probability of the evidence overall, yields the updated probability. It is the mathematics of learning from experience.1

Humans are spectacularly bad at Bayesian updating. Kahneman and Tversky demonstrated in the 1970s that people systematically neglect base rates. If a disease affects 1 in 10,000 people and a test has a 5 percent false positive rate (and catches every true case), most people (including most doctors) dramatically overestimate the probability that a positive test indicates actual disease. The correct answer, approximately 0.2 percent, is counterintuitive because it requires properly weighting the very low base rate against the apparently alarming positive test.

Computational systems have no such bias. A Bayesian updating algorithm applied to a prediction system starts with a prior probability derived from the constraint graph simulation or from historical base rates. When a new fact arrives, the system computes the likelihood of that fact under the hypothesis that the prediction is correct versus the hypothesis that it is incorrect. The posterior probability, the updated prediction confidence, follows from the mathematics.
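
The update itself is a few lines of code. The sketch below applies the formula to the base-rate example above, assuming (as noted there) a test that catches every true case.

    def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
        """P(H|E) = P(E|H) * P(H) / P(E), with P(E) expanded over both hypotheses."""
        numerator = p_evidence_if_true * prior
        evidence = numerator + p_evidence_if_false * (1.0 - prior)
        return numerator / evidence

    # Prevalence 1 in 10,000, a 5% false positive rate, perfect sensitivity.
    print(round(bayes_update(0.0001, 1.0, 0.05), 4))   # 0.002, i.e. about 0.2 percent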

In practice, the updating in Crystal Ball is implemented through the confidence trajectory system described in Article 15. Each new fact triggers a reassessment. If Kazatomprom announces production below target, the constraint graph's depletion node is updated, the simulation is re-run, and the prediction confidence adjusts. The adjustment is proportional to the significance of the new evidence and the strength of its connection to the prediction. A major production shortfall announcement shifts confidence by 5-10 points. A minor policy statement shifts it by 1-2 points. The accumulation of small updates, what Tetlock calls granular probability estimation, produces well-calibrated forecasts over time.2

The key insight is that calibration is not a static property. It is the result of disciplined updating. A system that updates frequently, in small increments, based on evidence rather than narrative, will converge on well-calibrated probabilities over time. A system that updates rarely, in large jumps, based on dramatic events, will oscillate between overconfidence and panic. The mathematics guarantees the former if the likelihood ratios are correctly computed. The architecture of the fact pipeline and the constraint graph ensures that they are.

References

  1. McGrayne, S.B. (2011). The Theory That Would Not Die. Yale University Press.
  2. Jaynes, E.T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
  3. Gelman, A. et al. (2013). Bayesian Data Analysis. 3rd ed. CRC Press.
Article 15

Temporal Confidence Tracking

A prediction's confidence at the moment of creation is the least informative data point about that prediction. What matters is the trajectory: how did the confidence evolve as new evidence arrived? Did it strengthen steadily as falsifiers failed to trigger, suggesting the model was correctly calibrated? Did it oscillate wildly, suggesting the model was responding to noise? Did it collapse suddenly when a single falsifier triggered, revealing a critical assumption that was wrong?

Temporal confidence tracking stores a snapshot of each prediction's state at regular intervals. Each snapshot records: the current confidence, the constraint graph simulation probability if available, the prediction market contract price if mapped, the number of active falsifiers, the number of triggered falsifiers, and the specific fact or event that caused the most recent confidence change. Over the lifetime of a prediction, this series of snapshots produces a confidence trajectory that is far more informative than any single estimate.
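
A sketch of one snapshot record with the fields just listed; the names are illustrative.

    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class ConfidenceSnapshot:
        prediction_id: str
        as_of: date
        confidence: float                        # current stated probability
        simulation_probability: Optional[float]  # constraint graph output, if available
        market_price: Optional[float]            # mapped prediction market price, if any
        active_falsifiers: int
        triggered_falsifiers: int
        last_change_cause: str                   # the fact or event behind the latest move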

What trajectories reveal. A prediction that starts at 0.65, rises steadily to 0.82 over six months as falsifiers fail to trigger and supporting evidence accumulates, then resolves as correct at 0.85, indicates a well-calibrated model that correctly identified a trend before the market. A prediction that starts at 0.70, drops to 0.35 when a single fact contradicts a key assumption, then resolves as incorrect, indicates a model with a specific weakness that can be diagnosed from the triggering fact. A prediction that oscillates between 0.40 and 0.60 throughout its lifetime indicates a fundamentally uncertain event where the model has no edge over chance.

Decomposed Brier scores. The Brier score, introduced by Glenn Brier in 1950 for evaluating weather forecasts, is the mean squared error between predicted probabilities and binary outcomes: BS = (1/N) sum of (forecast - outcome) squared. A perfect forecaster scores 0. Random guessing scores 0.25. Climatology, always predicting the base rate, scores better than 0.25 to the extent that events have skewed base rates.1

Allan Murphy's 1973 decomposition separates the Brier score into three components. Reliability (calibration): when you say 70 percent, does it happen 70 percent of the time? Resolution: can you distinguish high-probability events from low-probability ones? A predictor who always says 50 percent has zero resolution regardless of calibration. Uncertainty: the inherent difficulty of the prediction task, determined by the base rate of the event category. BS = Reliability - Resolution + Uncertainty.2

This decomposition is diagnostic. A system with poor calibration but high resolution has useful signal buried under systematic bias. The fix is recalibration: adjust the mapping from internal confidence to reported probability. A system with good calibration but low resolution is honest but unhelpful: it correctly reports its ignorance but doesn't actually distinguish likely from unlikely outcomes. The fix is better models or better data.

Crystal Ball reports all three components alongside the composite score. Over time, the decomposition reveals whether improvements are coming from better calibration (the system is learning to estimate probabilities more honestly) or better resolution (the constraint graph is learning to distinguish events that happen from events that don't). Both are valuable. The decomposition tells you which lever to pull.

References

  1. Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1-3.
  2. Murphy, A.H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology, 12, 595-600.
  3. Gneiting, T. & Raftery, A.E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." JASA, 102(477), 359-378.
Article 16

The Uranium Fuel Cycle: A Worked Example

Everything in this knowledge base has been abstract until now. The constraint graph architecture described in Article 11 is a theoretical framework. The distinction between inference and simulation developed in Article 10 is an argument. This article is neither. It is a worked example. We will build a complete constraint graph for the uranium fuel cycle using publicly available data, seed it with real production numbers, and show what the simulation reveals about the next four years of uranium supply.

If the architecture works, the simulation will identify binding constraints that are not obvious from reading any single analyst report. If it does not work, this article will be the proof of failure. Either way, the method earns trust by being specific enough to be wrong.

The system. The civilian nuclear fuel cycle is a chain of physical transformations. Uranium ore is mined, milled into yellowcake (U3O8), converted to uranium hexafluoride (UF6), enriched to increase the concentration of fissile U-235, fabricated into fuel assemblies, and loaded into reactors. Each step has measurable capacity. Each step has known throughput constraints. Each step has different lead times for expansion. The chain is linear and the physics is well-understood. This makes it a nearly ideal candidate for constraint-graph simulation.1

MINE PRODUCTION (2025): ~145 Mlbs U3O8
REACTOR DEMAND: ~180 Mlbs U3O8 equiv.
SUPPLY DEFICIT: ~35 Mlbs/year
SECONDARY SOURCES: declining inventories

The nodes. The constraint graph for uranium requires approximately twenty primary nodes, each representing a measurable physical quantity. These are not abstractions. Every value cited below comes from the World Nuclear Association's Nuclear Fuel Report, Cameco and Kazatomprom annual reports, the EIA Uranium Marketing Annual Report, or the IAEA Power Reactor Information System.2,3

Mine production is the entry point. Global production in 2025 was approximately 145 million pounds of U3O8. Kazakhstan dominates with roughly 60 million pounds, produced almost entirely via in-situ recovery (ISR) leaching. Canada contributes approximately 18 million pounds, primarily from Cameco's McArthur River and Cigar Lake operations. Australia adds roughly 10 million pounds. The remainder is distributed across Namibia, Niger, Uzbekistan, Russia, and smaller producers.4

ISR depletion is the first non-obvious constraint. Kazakhstan's ISR operations extract uranium by pumping acidic solution through ore bodies. Unlike conventional mines, ISR well fields deplete progressively. Production per well declines at approximately 3-5 percent per year from peak, requiring continuous development of new well fields to maintain output. Kazatomprom has disclosed increasing development costs per pound as legacy well fields deplete. This is a physical constraint that no amount of capital can overcome: the geology dictates the depletion rate.3

Conversion capacity is measured in kilograms of uranium as UF6. Global conversion capacity is approximately 62,000 tonnes of uranium per year, but effective utilization has been well below capacity due to maintenance and political constraints. Cameco's Port Hope facility and Orano's Malvesi-Pierrelatte complex are the Western world's primary converters. ConverDyn's Metropolis Works in Illinois returned to production in 2023 after a multi-year shutdown, adding capacity but requiring time to ramp to full throughput.

Enrichment capacity is measured in separative work units (SWU). Global enrichment capacity is approximately 65 million SWU per year, divided among four major providers: Urenco (consortium of UK, Netherlands, Germany), Orano (France), TENEX/Rosatom (Russia), and CNNC (China). The Western enrichment bottleneck is acute because Russian enrichment services have been curtailed by sanctions and self-sanctioning. This creates an effective Western enrichment capacity of roughly 40 million SWU, against Western demand that exceeds that figure once CNNC's output is netted against China's own domestic consumption.1

Reactor demand is the most stable node in the graph. A 1-gigawatt nuclear reactor requires approximately 400,000 pounds of U3O8 per year in fuel. As of early 2026, approximately 440 reactors are operating globally with a combined capacity of roughly 395 gigawatts. This implies baseline demand of approximately 158 million pounds per year just for existing reactors. When accounting for enrichment tails assay choices and fuel management strategies, effective demand is closer to 180 million pounds equivalent.5

The gap between 145 million pounds of production and 180 million pounds of demand is not a forecast. It is a measurement. Thirty-five million pounds per year must come from somewhere, and the sources are finite.

Secondary supply bridges the gap. Commercial inventories held by utilities and traders, government stockpiles (principally the U.S. DOE and Russian state reserves), and underfeeding at enrichment plants (using excess SWU capacity to extract more fissile material from the same feed) collectively supply the deficit. The critical insight is that secondary sources are stocks, not flows. They deplete. The DOE inventory drawdown program has reduced U.S. government holdings by roughly 70 percent since 2010. Commercial inventories, after a decade of post-Fukushima destocking, are at historically low levels. Underfeeding is constrained by the same enrichment capacity limits described above.

The edges. With nodes defined, the constraint graph requires edges that encode the transfer functions between nodes.

LINEAR RELATIONSHIPS

DIRECT PROPORTIONALITY WITH KNOWN COEFFICIENTS

Reactor demand = installed capacity (GW) x 0.4 Mlbs/GW/year. This is a physical constant determined by neutron physics and fuel burnup rates. It varies slightly by reactor type but the aggregate conversion factor is stable.

DELAY FUNCTIONS

TIME LAGS BETWEEN CAUSE AND EFFECT

New mine development: 7-10 years from discovery to first production (geology, permitting, construction). Enrichment plant expansion: 5-7 years from decision to full operation (technology, licensing, construction). Reactor construction: 10-15 years from planning to commercial operation. These delays are the reason supply-demand imbalances in nuclear fuel persist for years. The market cannot respond quickly because the physical infrastructure cannot be built quickly.

THRESHOLD FUNCTIONS

NON-LINEAR RESPONSES AT CRITICAL VALUES

Spot price below $60/lb U3O8: no new mine development is economical. This creates a price floor for long-term supply response. Spot price above $80/lb: marginal ISR and heap-leach operations become economical, adding approximately 5-10 Mlbs of potential supply, but with 2-3 year ramp times. Spot price above $100/lb: conventional underground mines with higher all-in sustaining costs become viable, but require the full 7-10 year development cycle.

DEPLETION FUNCTIONS

DECLINING RESOURCE AVAILABILITY OVER TIME

Kazakhstan ISR well fields: ~3-5% production decline per year from peak per well field, requiring continuous capital expenditure on new well field development. DOE inventory: drawdown at ~2-3 Mlbs/year under current policy, with finite total remaining. Commercial inventory: utility restocking cycle has begun, converting inventories from supply source to demand source.

The simulation. With twenty nodes and approximately thirty edges, the constraint graph can be simulated forward using Monte Carlo methods. For each of two hundred scenarios, the simulation samples each uncertain edge coefficient from its confidence interval. A linear coefficient with confidence 0.8 is sampled from plus or minus 20 percent of its nominal value. A coefficient with confidence 0.5 is sampled from plus or minus 40 percent. The simulation then propagates values forward month by month, respecting delay edges and checking threshold conditions.

The output is not a single price prediction. It is a probability distribution over node trajectories, from which we extract the bottleneck nodes: quantities where demand exceeds supply in the greatest fraction of scenarios.

What the simulation reveals. Running the constraint graph forward twenty-four months from early 2026 produces three primary findings.

Finding 1: Enrichment, not mine supply, is the binding constraint in the Western fuel cycle. In 73 percent of scenarios, Western enrichment capacity (Urenco + Orano + ConverDyn) fails to meet Western demand by month 18. This is because Chinese domestic enrichment consumption is growing with China's reactor fleet (approximately 25 GW under construction), reducing CNNC's available export capacity, while Russian enrichment is increasingly unavailable to Western utilities. The enrichment bottleneck means that even if mine production increases, the fuel cannot reach reactors without sufficient SWU capacity to enrich it. This constraint is invisible to analysts who focus exclusively on mine supply and uranium spot price.

Finding 2: Kazakhstan's ISR depletion creates a structural decline in the largest producing country. In 68 percent of scenarios, Kazakhstan's production declines from approximately 60 Mlbs to 52-56 Mlbs by 2028 despite Kazatomprom's stated production targets, because legacy well field depletion exceeds new well field commissioning. This is not a financial constraint. It is geological. The ore grades in mature well fields are declining, and the most accessible deposits have been exploited first. Kazatomprom's own disclosures about rising production costs per pound are consistent with this depletion trajectory.

Finding 3: The secondary supply buffer exhausts between 2027 and 2029 in 61 percent of scenarios. When DOE inventory drawdowns, commercial destocking, and underfeeding reductions are simulated forward with the current primary production deficit of 35 Mlbs per year, the cumulative secondary supply available drops below the cumulative deficit requirement. The exact timing varies by scenario because it depends on utility restocking behavior and government policy decisions, but the central tendency is clear: the buffer has a shelf life measured in years, not decades.

The simulation does not predict the price of uranium. It identifies the physical constraints that will determine the price. The constraints are enrichment capacity, ISR depletion, and secondary supply exhaustion. These are measurable, falsifiable, and time-bounded.

Validation. The constraint graph generates predictions that can be tracked and scored. "Western enrichment utilization exceeds 90 percent by Q4 2027." "Kazakhstan production fails to exceed 60 Mlbs in 2027." "U.S. DOE inventory drops below 20 Mlbs by end of 2028." Each prediction has a source (constraint_graph), a confidence derived from the simulation probability, and a resolution date. When the prediction resolves, the accuracy is compared against predictions generated by pure Opus inference on the same themes.6

This meta-comparison is the mechanism by which the system learns whether the constraint graph is well-calibrated. If graph-based predictions consistently outperform Opus-based predictions, the system should generate more graph-based predictions and invest in expanding the graph to additional nodes. If Opus-based predictions outperform in certain categories, the system should examine why the graph failed in those categories: is a node missing? Is an edge miscalibrated? Is a delay function wrong?

What this proves. The uranium fuel cycle is not a special case. It is an existence proof. The method requires three things: measurable quantities, known transfer functions, and physical bounds. Any supply chain that has these three properties can be modeled as a constraint graph and simulated forward. Copper has them. Oil has them. Semiconductor fabrication has them. Rare earth processing has them. The number of nodes and edges varies, but the architecture is the same.

The worked example also demonstrates the architectural inversion described in Article 10. The simulation generated the three findings above without any LLM involvement. An Opus call that reviews the findings and asks "does this make sense?" adds value as a validator: it might note that a specific mine restart could offset the Kazakhstan depletion, or that a geopolitical event could accelerate the enrichment bottleneck. But the core predictions come from the graph, not from inference. The LLM validates. The physics generates.

This is not a forecast. It is a demonstration that the architecture produces specific, falsifiable, time-bounded predictions from known physical constraints. The predictions may turn out to be wrong. If they do, the graph will be recalibrated, the edges will be adjusted, and the next simulation will be more accurate. That is the flywheel, as described in Article 13 on adversarial falsification: the system earns trust by surviving attempts to break it, and it improves by learning from every resolution.

References

  1. World Nuclear Association. (2023). Nuclear Fuel Report: Global Scenarios for Demand and Supply of Uranium, Conversion and Enrichment.
  2. Cameco Corporation. (2024). Annual Report.
  3. Kazatomprom. (2024). Annual Report.
  4. EIA. (2024). Uranium Marketing Annual Report.
  5. IAEA. Power Reactor Information System (PRIS) database.
  6. Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1-3.
  7. UxC. (2025). Uranium Market Outlook.
  8. Schultz, J. (2023). Commodity Conversations: An Introduction to Trading in Agricultural Commodities.
  9. Marathon Asset Management. (2015). Capital Returns: Investing Through the Capital Cycle. Palgrave Macmillan.
  10. Zoellner, T. (2009). Uranium: War, Energy, and the Rock That Shaped the World. Penguin.
Pillar IV

The Limits

Article 17

Chaos, Black Swans, and Irreducible Uncertainty

In 1963, Edward Lorenz discovered that a weather simulation run with initial conditions rounded to three decimal places produced a completely different outcome than the same simulation with six decimal places. The divergence was exponential. After two simulated weeks, the forecasts bore no resemblance to each other. This was deterministic chaos: the equations were perfectly specified, the physics was correct, but any imprecision in the starting conditions grew without bound. Lorenz had discovered the hard limit of weather prediction, and by extension, the hard limit of prediction in any chaotic system.1

Nassim Taleb extended this insight from chaotic systems to human systems with his concept of the Black Swan: a high-impact event that is unpredictable before it occurs and rationalized after it occurs. The 2008 financial crisis, the COVID-19 pandemic, the September 11 attacks. In each case, the event was not merely unlikely in the way that drawing a specific card from a deck is unlikely. It was outside the space of possibilities that anyone was considering. The distribution of outcomes had fat tails, meaning that extreme events were far more frequent than normal distributions would predict.2

Frank Knight, writing in 1921, drew a distinction that remains essential. Risk is uncertainty that can be quantified: the probability of a fair coin landing heads is exactly 0.5. Uncertainty is unquantifiable: the probability that a novel pathogen will emerge from a wet market in Wuhan and shut down the global economy is not merely unknown but unknowable in advance. Knightian uncertainty cannot be managed by better models. It can only be managed by building systems that are robust to outcomes the model did not anticipate.3

What IS predictable. The honest accounting is not that prediction is impossible. It is that prediction is possible in specific domains and impossible in others. What is predictable: physical constraints (geology depletes at known rates), demographic trends (populations age predictably), capacity pipelines (construction projects have known timelines), regulatory cycles (license applications have published schedules), and supply-demand arithmetic (production minus consumption equals inventory change). What is NOT predictable: wars, pandemics, technological breakthroughs, political revolutions, and the timing of any event that depends on individual human decisions.

The art of prediction, then, is not to pretend that Black Swans do not exist. It is to build systems that predict what is predictable and are robust to what is not. A constraint graph that identifies enrichment capacity as the binding constraint in the uranium fuel cycle is making a prediction about physics. If a war disrupts Russian enrichment exports, the constraint binds sooner. If a technological breakthrough increases centrifuge efficiency, the constraint eases. The prediction is wrong in its timeline but not in its structure. The system built on it must be designed to update rapidly when Black Swans arrive, adjusting node values and re-simulating, rather than pretending they will not occur.

References

  1. Lorenz, E.N. (1963). "Deterministic Nonperiodic Flow." Journal of the Atmospheric Sciences, 20(2), 130-141.
  2. Taleb, N.N. (2007). The Black Swan. Random House.
  3. Knight, F. (1921). Risk, Uncertainty and Profit. Houghton Mifflin.
  4. Mandelbrot, B. (2004). The Misbehavior of Markets. Basic Books.
  5. Bak, P. (1996). How Nature Works. Copernicus/Springer.
Article 18

The Calibration Problem

When you say "70 percent," you mean something specific. You mean that in a world where you make this statement one hundred times, the event should occur approximately seventy times. If it occurs ninety times, you are underconfident. If it occurs fifty times, you are overconfident. Calibration is the alignment between stated probability and actual frequency, and it is the most fundamental metric of forecasting quality.1

Most humans are poorly calibrated. Studies consistently show that when people say they are "90 percent sure," they are correct approximately 75 percent of the time. When they say "99 percent sure," they are correct approximately 85 percent of the time. The overconfidence is systematic: people overestimate the reliability of their knowledge across virtually every domain tested.1

Calibration can be improved. The Good Judgment Project demonstrated that feedback and training produced measurable calibration improvements in volunteer forecasters. Participants who received calibration feedback, shown on a reliability diagram where their forecasts diverged from observed frequencies, learned to adjust their probability estimates toward reality. The adjustment took weeks, not months, suggesting that miscalibration is more a matter of habit than ability.

The meta-problem. A perfectly calibrated predictor who always says "50 percent" is useless. Perfect calibration with zero resolution means the forecaster correctly reports their ignorance but adds no information. The dual problem, a forecaster who always says "90 percent" or "10 percent" with high resolution but poor calibration, has useful signal buried under systematic bias. The signal can be extracted through recalibration: mapping the forecaster's stated probabilities to their actual hit rates. A forecaster who says "90 percent" and is right 75 percent of the time can be recalibrated by simply relabeling "90 percent" as "75 percent."2

For machine prediction systems, calibration training is replaced by calibration tracking. Crystal Ball records every prediction's stated confidence and its eventual resolution. Over hundreds of resolved predictions, the reliability diagram emerges automatically. If the 70-percent bucket has a hit rate of 85 percent, the system is systematically underconfident in that range. The recalibration can be applied automatically to future predictions, or the underlying model can be investigated to understand why it is producing conservative estimates in that confidence range. This is the predict-score-learn flywheel applied to the meta-level: the system is not only learning about the world but learning about its own predictive characteristics.
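
A sketch of the bucketing step behind a reliability diagram, assuming resolved predictions are available as (forecast, outcome) pairs; the bucket width is fixed at ten bins for illustration.

    def reliability_table(resolved):
        """Ten-bucket reliability table mapping stated-probability ranges to observed
        hit rates. `resolved` holds (forecast probability, outcome 0 or 1) pairs."""
        buckets = {}
        for forecast, outcome in resolved:
            index = min(int(forecast * 10), 9)            # 0.0-0.1 up to 0.9-1.0
            buckets.setdefault(index, []).append(outcome)
        return {f"{i / 10:.1f}-{(i + 1) / 10:.1f}": sum(hits) / len(hits)
                for i, hits in sorted(buckets.items())}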

References

  1. Lichtenstein, S., Fischhoff, B., & Phillips, L.D. (1982). "Calibration of probabilities." In Judgment Under Uncertainty: Heuristics and Biases. Cambridge University Press.
  2. Dawid, A.P. (1982). "The well-calibrated Bayesian." JASA, 77(379), 605-610.
  3. DeGroot, M.H. & Fienberg, S.E. (1983). "The comparison and evaluation of forecasters." Journal of the Royal Statistical Society: Series D, 32(1-2), 12-22.
Article 19

Keeping Score: Brier Scores and Decomposition

Glenn Brier was a meteorologist at the U.S. Weather Bureau who, in 1950, proposed a simple scoring rule for probabilistic forecasts. The Brier score is the mean squared error between the predicted probability and the binary outcome: BS = (1/N) times the sum of (f_i minus o_i) squared, where f_i is the forecast probability and o_i is 1 if the event occurred or 0 if it did not. A perfect forecaster who assigns probability 1.0 to events that happen and 0.0 to events that do not happen scores 0. A forecaster who always says 0.5 scores 0.25.1

The Brier score is a strictly proper scoring rule. This means that a forecaster minimizes their expected score by reporting their true beliefs. Unlike improper scoring rules, it offers no incentive to strategically misreport. If you genuinely believe the probability is 0.7, reporting 0.7 minimizes your expected Brier score. Reporting 0.9 to seem more decisive or 0.5 to hedge will, on average, produce a worse score. This property makes the Brier score the gold standard for evaluating probabilistic forecasts.
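
A quick numerical check of that property, assuming the forecaster's true belief is 0.7: the expected score for a reported probability q is p(1 − q)² + (1 − p)q², and it is smallest when q equals the true belief.

```python
# Expected Brier score when the true probability is p and the forecaster reports q.
p = 0.7
for q in (0.5, 0.7, 0.9):
    expected = p * (1 - q) ** 2 + (1 - p) * q ** 2
    print(f"report {q:.1f}: expected Brier score {expected:.3f}")
# report 0.5: 0.250, report 0.7: 0.210, report 0.9: 0.250 -> truthful reporting wins
```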

Murphy's decomposition. Allan Murphy's 1973 decomposition reveals the internal structure of the Brier score. The total score equals Reliability minus Resolution plus Uncertainty.2

Reliability measures calibration. Group all predictions into probability buckets: 0-10%, 10-20%, and so on. For each bucket, compute the average predicted probability and the actual hit rate. Reliability is the weighted mean squared difference between predicted and observed frequencies. A perfectly calibrated forecaster has Reliability = 0. This is what the calibration curve in Article 18 visualizes.

Resolution measures discrimination. It is the weighted mean squared difference between each bucket's observed frequency and the overall base rate. A forecaster who assigns high probability to events that happen and low probability to events that do not happen has high resolution. A forecaster who assigns the same probability to everything has zero resolution. Resolution is subtracted because higher resolution improves the score.

Uncertainty is o_bar × (1 − o_bar), where o_bar is the overall base rate of the event. It depends entirely on the difficulty of the prediction task and is independent of the forecaster's skill. Predicting rare events (base rate 0.05) has low uncertainty. Predicting coin flips (base rate 0.50) has maximum uncertainty.
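
A compact sketch of the decomposition, assuming forecasts and 0/1 outcomes are available as parallel lists and using equal-width probability buckets (with binned continuous forecasts the identity holds only approximately):

```python
from collections import defaultdict

def murphy_decomposition(forecasts, outcomes, n_buckets=10):
    """Split the Brier score into reliability, resolution, and uncertainty (Murphy 1973)."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    buckets = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        # Assign each forecast to one of n_buckets equal-width probability bins.
        buckets[min(int(f * n_buckets), n_buckets - 1)].append((f, o))

    reliability = resolution = 0.0
    for pairs in buckets.values():
        k = len(pairs)
        mean_forecast = sum(f for f, _ in pairs) / k
        hit_rate = sum(o for _, o in pairs) / k
        reliability += k * (mean_forecast - hit_rate) ** 2
        resolution += k * (hit_rate - base_rate) ** 2

    uncertainty = base_rate * (1 - base_rate)
    return reliability / n, resolution / n, uncertainty  # Brier ~= rel - res + unc
```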

Why decomposition matters. A composite Brier score of 0.18 tells you the forecaster beats one who always says 50 percent, but it does not tell you how to improve. The decomposition is diagnostic:

PATTERN | DIAGNOSIS | FIX
Well calibrated, low resolution | Honest but unhelpful | Better models, more data
Poorly calibrated, high resolution | Useful signal, systematic bias | Recalibrate probability mapping
Well calibrated, high resolution | Strong forecaster | Expand prediction surface
Poorly calibrated, low resolution | Noise generator | Rebuild from scratch

Crystal Ball reports all three components alongside every accuracy analysis. When the decomposition shows that the constraint graph has high resolution (it can distinguish events that happen from events that don't) but mediocre reliability (its probability estimates are systematically too high or too low), the fix is recalibration, not a new model. When resolution is low, the fix is structural: add nodes to the graph, improve the transfer functions, or extend the data sources. The decomposition converts a single number into an action plan.

The Brier Skill Score. The Brier Skill Score measures improvement over a reference forecast, typically climatology (always predicting the base rate). BSS = 1 − BS/BS_ref. A BSS of 0.30 means the forecaster's Brier score is 30 percent lower than always predicting the base rate. Tetlock's superforecasters achieved Brier Skill Scores in the range of 0.15 to 0.30 on geopolitical forecasting questions. Crystal Ball's target is BSS > 0.20 on the categories of predictions it generates, verified through a minimum of 200 resolved predictions.3
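
A sketch of the computation against that climatology reference, with illustrative numbers rather than Crystal Ball data:

```python
def brier_skill_score(forecasts, outcomes):
    """Improvement over a climatology reference that always forecasts the base rate."""
    n = len(forecasts)
    base_rate = sum(outcomes) / n
    bs = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n
    bs_ref = sum((base_rate - o) ** 2 for o in outcomes) / n
    return 1 - bs / bs_ref

# Illustrative example: a forecaster noticeably sharper than the base rate.
print(brier_skill_score([0.9, 0.8, 0.2, 0.7, 0.1], [1, 1, 0, 1, 0]))
```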

References

  1. Brier, G.W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review, 78(1), 1-3.
  2. Murphy, A.H. (1973). "A New Vector Partition of the Probability Score." Journal of Applied Meteorology, 12, 595-600.
  3. Gneiting, T. & Raftery, A.E. (2007). "Strictly Proper Scoring Rules, Prediction, and Estimation." JASA, 102(477), 359-378.
  4. Wilks, D.S. (2011). Statistical Methods in the Atmospheric Sciences. 3rd ed. Academic Press.
Article 20

Why Prediction Scales and Prophecy Doesn't

Prophecy is a person claiming to see the future. The prophet's authority derives from charisma, credentials, or track record. The prediction is singular: this will happen. The mechanism is opaque: trust me. When the prophecy fails, the prophet reinterprets. When it succeeds, the prophet's authority increases. There is no feedback loop. There is no mechanism for systematic improvement. Prophecy has been practiced for three thousand years, and the hit rate has not improved.

Prediction, in the sense developed throughout this knowledge base, is something fundamentally different. It is a system that generates probabilistic forecasts from explicit models, scores those forecasts against reality, and uses the scores to improve the models. The system's authority derives not from charisma but from its track record, which is public, quantified, and decomposed into diagnostic components. When the prediction fails, the failure is attributed to a specific model component, which is corrected. When the prediction succeeds, the success is attributed to specific model strengths, which are reinforced. The feedback loop is the mechanism. The mechanism is the moat.

The flywheel. Every resolved prediction makes the next prediction better. A falsified prediction identifies a wrong assumption. A triggered falsifier identifies a blind spot. A calibration drift reveals a systematic bias. A resolution improvement reveals a model enhancement that is working. Each turn of the flywheel produces specific, actionable diagnostics that are fed back into the model. After a hundred turns, the system is meaningfully better. After a thousand turns, it is categorically different from what it was at the start.

This is why prediction scales and prophecy does not. A prophet who makes a thousand predictions does not improve. The thousandth prediction is made by the same mind with the same biases as the first. A prediction system that resolves a thousand predictions and feeds each resolution back into the model is a fundamentally different entity from what it was at prediction number one. The constraint graph has been recalibrated. The falsifier keywords have been refined. The edge coefficients have been adjusted to match observed transfer rates. The confidence trajectory database has revealed which categories of predictions the system is strong on and which it is weak on.

The Excel analogy. Excel did not make everyone a financial analyst. It gave every financial analyst a tool that made their work dramatically faster, more accurate, and more reproducible. Before spreadsheets, financial modeling was done on paper, by hand, with adding machines. The introduction of electronic spreadsheets did not replace financial analysts. It amplified their capabilities by orders of magnitude while making their work transparent and auditable.

Crystal Ball occupies the same position relative to prediction that Excel occupies relative to financial modeling. It does not replace human judgment. It provides a simulation engine grounded in physical reality that makes human judgment dramatically more effective. The analyst who builds a constraint graph and runs a simulation is not trusting a black box. They are constructing an explicit model of the physical system, specifying every assumption as a node or edge, and observing the logical consequences. If the consequences are surprising, the analyst examines the assumptions. If the assumptions are correct, the surprise is the prediction.

The ultimate test. The moat is not the architecture. Architectures can be copied. The moat is years of resolved predictions with full causal traces. A system that has generated, scored, and learned from five thousand predictions across energy, materials, geopolitics, and technology has a calibration database that no competitor can replicate without the same investment of time. The calibration data reveals which edge types are reliable, which node sources are accurate, which categories of predictions the system excels at and which it should avoid. This knowledge can only be acquired by making predictions and seeing whether they come true.

The goal of this knowledge base has been to lay the intellectual foundation for that project. The prediction problem established why existing approaches fail. The history of forecasting showed what has been tried. The weather prediction and Renaissance Technologies articles demonstrated what works and where it hits limits. The simulation thesis identified the architectural gap. The constraint graph, falsification, Bayesian updating, and temporal tracking articles developed the method. The uranium worked example proved the method on real data. The chaos, calibration, and scoring articles established the limits and the measurement framework.

What remains is execution. Constraint graphs for every major supply chain. Prediction markets as input signals. Adversarial falsification as quality control. Temporal tracking as calibration. Archetypal agent simulation as sentiment analysis. Cross-domain cascade propagation as the mechanism for connecting geopolitical scenarios to company-level models. A genuine science of seeing what hasn't happened yet. Not prophecy. Prediction. Measured, scored, and improving with every turn of the flywheel.


Crystal Ball

PREDICTION ENGINE

A constraint-graph simulation engine that models physical supply chains as directed graphs of measurable quantities, simulates forward through known transfer functions, identifies binding bottlenecks the market hasn't priced, and scores every prediction against reality.

Pipeline: Sources → Facts → Constraint Graph → Simulation → Predictions → Resolution → Learning

CONSTRAINT GRAPH SIMULATION

Model physical supply chains as directed graphs of measurable quantities with causal edges. Monte Carlo forward simulation identifies binding constraints before the market prices them. The architecture that made weather prediction possible, applied to economics.

ADVERSARIAL FALSIFICATION

Every prediction gets kill conditions. The system generates specific, observable, time-bounded falsifiers and actively watches for disconfirmation through real-time fact pipeline matching. Predictions earn trust by surviving attempts to break them.

SCORED PREDICTIONS

Decomposed Brier scores separate calibration from resolution. Temporal confidence trajectories track how predictions evolve over time. Prediction market cross-referencing identifies divergences between Crystal Ball and market consensus.

COMING 2026


Reference

Master Bibliography

Bibliography will be consolidated as articles are completed. Individual article reference lists appear at the end of each section above.