Outside the physical sciences, every major prediction system in operation today does inference. Not one of them does simulation. This distinction is the entire thesis of Crystal Ball, and it is worth being precise about what the words mean.
Inference is reasoning from evidence to conclusion. A doctor observes symptoms and infers a diagnosis. An intelligence analyst reads intercepted communications and infers an adversary's intentions. A hedge fund's quantitative model observes price patterns and infers future price direction. A large language model reads a prompt and infers the most likely continuation. In every case, the process is the same: given facts, reason about implications.[1]
Simulation is something fundamentally different. It is the construction of a causal model of a system, the measurement of the system's current state, and the forward propagation of that state through the model's equations. A weather simulation does not reason about what the weather will be. It computes what the weather must be, given the current temperature, pressure, humidity, and wind speed at every grid point, and the laws of physics that govern how those quantities evolve.[2]
The difference is not one of sophistication. Inference can be extraordinarily sophisticated. Renaissance Technologies employs two hundred PhDs and fifty thousand compute cores to perform inference on petabytes of market data, and the Medallion Fund has returned 66 percent annually before fees for over thirty years.[3] The difference is one of architecture. Inference asks: given what I have observed, what is likely true? Simulation asks: given what I know about how this system works, what must happen next?
The inference ceiling. Consider each of the major prediction systems currently operating in the world, and notice that every one of them is doing inference.
Renaissance Technologies finds statistical patterns in historical price data. The patterns are real, the returns are extraordinary, but the system has no model of why prices move. It cannot explain its predictions. It cannot extrapolate beyond the statistical regime in which the patterns were observed. When the underlying market structure changes, as it did during the 2020 COVID crash and again in October 2025, when the institutional funds RIEF and RIDA lost about 14 percent, the statistical patterns break and the system breaks with them. This is the fundamental fragility of pattern recognition without causal understanding.[4]
Prediction markets aggregate the opinions of participants who have money at stake. Polymarket processed over ten billion dollars in trading volume in 2025. The prices are efficient in the sense that they incorporate diverse information rapidly. But prediction market prices are not forward simulations. They are aggregated beliefs. When participants share the same blind spot, as they did when political prediction markets systematically underpriced Donald Trump's chances in 2016 and 2024, the aggregation mechanism fails precisely because it is aggregating the same type of inference from overlapping information sets.[5]
Large language models perform inference on training data. When asked to predict the future, they generate text that is statistically consistent with the patterns in their training corpus. This is pattern completion, not simulation. An LLM cannot model the physical constraints of a uranium supply chain because it has no representation of the causal structure. It can tell you that enrichment capacity is important. It cannot compute the month in which enrichment capacity becomes the binding constraint under a given set of demand assumptions. The first is inference. The second requires simulation.[6]
Human expert analysts perform inference from experience and domain knowledge. Tetlock's twenty-year study demonstrated that this is the weakest prediction architecture of all. The average expert performed barely better than chance over 82,361 predictions, and the most confident experts performed the worst.[7]
Why simulation works where inference doesn't. The reason weather prediction improved from 50 percent accuracy in 1970 to 90 percent accuracy today is not that meteorologists became smarter. It is that someone built a simulation. The forecast models that run on NOAA's Weather and Climate Operational Supercomputing System discretize the atmosphere into a three-dimensional grid. Each grid cell contains measurable quantities: temperature, pressure, humidity, wind velocity. The cells are connected by equations that encode the laws of physics: conservation of mass, conservation of energy, the Navier-Stokes equations for fluid dynamics. The simulation starts from measured initial conditions and steps forward in time. The forecast is not an opinion. It is the output of a computation.[8]
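As a toy illustration of stepping a conserved quantity forward on a grid, consider the sketch below. It is not NOAA's model. It is a one-dimensional ring of cells exchanging a single quantity by diffusion, and the names `step` and `simulate` and the exchange coefficient `k` are assumptions made for the example. Because every unit that leaves one cell enters its neighbor, the total is conserved by construction, which is the kind of guarantee the conservation laws give the real model.

```python
def step(cells, k=0.1):
    """Advance a periodic ring of cells by one time step (assumed toy diffusion)."""
    n = len(cells)
    # Flux from cell i into cell i+1 is proportional to their difference.
    flux = [k * (cells[i] - cells[(i + 1) % n]) for i in range(n)]
    # Each cell loses its outgoing flux and gains the flux leaving cell i-1,
    # so the sum over all cells cannot change.
    return [cells[i] - flux[i] + flux[i - 1] for i in range(n)]


def simulate(initial, steps, k=0.1):
    """Propagate the measured initial state forward through `steps` time steps."""
    state = list(initial)  # copy so the caller's measurements are untouched
    for _ in range(steps):
        state = step(state, k)
    return state
```

Starting from `[10.0, 0.0, 0.0, 0.0]`, repeated calls to `step` spread the quantity evenly around the ring while the total stays at 10. The result is computed from the state and the update rule, not estimated from past outcomes.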
This architecture has three properties that inference lacks. First, it is physically grounded. The model is constrained by conservation laws. Mass cannot be created or destroyed. Energy is conserved. These constraints eliminate entire regions of the prediction space that inference would have to consider. Second, it is falsifiable at every node. Every grid cell produces a prediction that can be checked against a sensor reading. When the prediction is wrong, the error can be attributed to a specific model parameter, which can be corrected. Inference systems rarely have this property because the reasoning path from input to output is opaque. Third, it is scale-independent. The same model runs at 13 kilometer resolution or 3 kilometer resolution. The physics is the same. The resolution determines the granularity of the prediction, not its fundamental approach.
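The second property can be made concrete with a small sketch. It assumes that predictions and sensor readings are both available as mappings from a node name to a value. The function name `node_errors` and the node names are hypothetical. Ranking nodes by absolute error shows where to start looking for a miscalibrated parameter.

```python
def node_errors(predicted, observed):
    """Return (node, absolute error) pairs, largest error first.

    Only nodes present in both mappings are compared, since a node
    without a sensor reading cannot be checked.
    """
    shared = predicted.keys() & observed.keys()
    errors = {node: abs(predicted[node] - observed[node]) for node in shared}
    return sorted(errors.items(), key=lambda item: item[1], reverse=True)
```

An inference system that emits a single opaque answer offers no equivalent of this table, because it has no intermediate quantities to compare against measurements.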
The gap nobody has filled. If physical simulation is the architecture that makes prediction work, why hasn't anyone applied it to economics and finance? The answer is institutional, not technical.
Weather agencies have no mandate to model markets. Their budgets, their expertise, and their institutional incentives are oriented toward atmospheric science. Quantitative hedge funds, which do have financial incentives, have converged on statistical inference because it is profitable in the short term and because the mathematical culture of quantitative finance descends from physics through a lineage, Bachelier to Black-Scholes to RenTech, that treats prices as stochastic processes rather than outputs of physical systems. Prediction markets aggregate opinion but do not model causality. Academic economics has been dominated since the 1970s by rational expectations models that assume away the very frictions that create predictable supply-demand imbalances.[9]
The opportunity, therefore, is structural. Nobody is doing physical simulation of economic supply chains because the expertise is siloed. The people who understand physical simulation work at weather agencies and aerospace companies. The people who understand supply chains work at commodity trading firms and industrial companies. The people who understand prediction scoring work in decision science and psychology. No institution combines all three.
What supply chain simulation looks like. A uranium supply chain has perhaps two hundred measurable quantities. Mine production in Kazakhstan, measured in millions of pounds of U3O8. Conversion capacity, measured in kilograms of UF6. Enrichment capacity, measured in separative work units. Reactor demand, measured in gigawatts of installed capacity times fuel consumption per gigawatt. Spot price. Term contract price. Inventory levels. New mine development pipeline with known timelines.
These quantities are connected by known transfer functions. One gigawatt of reactor capacity requires approximately 400,000 pounds of U3O8 per year. Enrichment capacity is constrained by the number and capacity of centrifuge cascades, which have known throughput limits. New mines require seven to ten years from discovery to production, a delay function that is determined by geology, regulation, and capital availability. At spot prices below sixty dollars per pound, no new mine development is economical, a threshold function that creates a floor under the long-term supply response.
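The three kinds of transfer function named above (a linear conversion, a delay, and a threshold) can be sketched in a few lines. The constants come from the figures in the text. The function names, the argument names, and the choice of seven years as the default lead time (the low end of the seven-to-ten-year range) are assumptions for illustration.

```python
LB_U3O8_PER_GW_YEAR = 400_000  # approximate annual requirement per gigawatt (from the text)
MINE_INCENTIVE_PRICE = 60.0    # USD per pound; threshold quoted in the text


def reactor_demand_lb(installed_gw):
    """Linear transfer: installed reactor capacity to annual U3O8 demand."""
    return installed_gw * LB_U3O8_PER_GW_YEAR


def new_mine_supply_lb(year, discovery_year, capacity_lb, lead_time_years=7):
    """Delay function: a mine delivers nothing until the lead time has elapsed."""
    if year >= discovery_year + lead_time_years:
        return capacity_lb
    return 0.0


def development_is_economical(spot_price):
    """Threshold function: below the incentive price, no new mine is developed."""
    return spot_price >= MINE_INCENTIVE_PRICE
```

For example, `reactor_demand_lb(10)` gives 4,000,000 pounds per year, and a mine discovered in 2025 contributes nothing in 2030 under the assumed seven-year lead time.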
This is the same architecture as a weather model. Measurable quantities at nodes. Known transfer functions on edges. Physical bounds that constrain the solution space. The system is vastly simpler than the atmosphere. It requires no supercomputer. A Monte Carlo simulation of two hundred scenarios across twenty-four months runs in seconds on a laptop.
The output is not a price prediction. It is a map of physical constraints. The simulation identifies binding bottlenecks: the nodes where demand exceeds supply with highest probability. "Enrichment capacity becomes the binding constraint in month fourteen in 72 percent of scenarios." This is a different kind of prediction than "uranium will go up." It is a statement about physical reality that is either true or false, that can be checked against measurable data, and that has causal explanatory power.[10]
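A minimal sketch of how such a statement could be produced is given below. It is not the Crystal Ball implementation. It collapses the supply chain to three stages, and every number in it (base capacities, growth rates, the sampling ranges) is a hypothetical placeholder in an arbitrary common unit. The names `run_scenario` and `binding_constraints` are likewise assumptions. Each scenario samples its uncertain inputs once, steps forward month by month, and records the first stage whose capacity falls short of demand.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical placeholders in a common unit (fuel-equivalent per month),
# not measured data.
BASE_CAPACITY = {"mining": 110.0, "conversion": 108.0, "enrichment": 105.0}
CAPACITY_GROWTH = {"mining": 0.002, "conversion": 0.001, "enrichment": 0.0005}


def run_scenario(rng, months=24):
    """Return (month, stage) of the first shortfall, or None if demand is always met."""
    # Sample uncertain starting capacities and a demand growth rate once per scenario.
    capacity = {s: c * rng.uniform(0.95, 1.05) for s, c in BASE_CAPACITY.items()}
    demand_growth = rng.uniform(0.0, 0.01)
    demand = 100.0
    for month in range(1, months + 1):
        demand *= 1 + demand_growth
        for stage in capacity:
            capacity[stage] *= 1 + CAPACITY_GROWTH[stage]
        # The binding stage is the one with the least capacity this month.
        stage = min(capacity, key=capacity.get)
        if demand > capacity[stage]:
            return month, stage
    return None


def binding_constraints(n_scenarios=200, months=24, seed=0):
    """Map each stage to (share of scenarios in which it binds first, median month)."""
    rng = random.Random(seed)
    months_by_stage = defaultdict(list)
    for _ in range(n_scenarios):
        result = run_scenario(rng, months)
        if result is not None:
            month, stage = result
            months_by_stage[stage].append(month)
    return {
        stage: (len(ms) / n_scenarios, statistics.median(ms))
        for stage, ms in months_by_stage.items()
    }
```

Calling `binding_constraints()` returns a dictionary keyed by stage, from which a sentence of the form "stage X binds first in N percent of scenarios, with a median month of M" can be read off directly. Fixing the seed makes a run reproducible, which matters when the output is later scored against what actually happened.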
The role of the LLM inverts. In every existing system, the LLM or the human expert is the generator of predictions. The system asks: "What do you think will happen?" and the oracle answers. In a constraint-graph architecture, the LLM's role inverts. The simulation generates the predictions. The LLM's role is validation: "Here are the three bottlenecks the simulation identified. Does this make sense given what you know? What physical constraint am I missing?"[11]
This is precisely where LLMs are strongest. They are mediocre generators of novel predictions because they are pattern-completion engines operating on training data. They are excellent validators because they can synthesize domain knowledge from thousands of sources to check whether a specific claim is consistent with known physics. The constraint graph generates the hypothesis. The LLM stress-tests it. This is the architectural inversion that makes Crystal Ball possible.
The meta-learning signal. Every prediction generated by the constraint graph has a source tag: "constraint_graph" or "opus_inference." When predictions resolve, the system compares the accuracy of graph-sourced predictions against LLM-sourced predictions. If the graph outperforms, the system should trust the graph more and generate more graph-based predictions. If the LLM outperforms on certain categories, the system should examine why the graph failed and whether a node or edge is miscalibrated.
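The comparison by source tag can be sketched as follows. The text does not name a scoring rule, so the use of the Brier score (mean squared error between the stated probability and the 0-or-1 outcome, where lower is better) is an assumption, as are the function name `brier_by_source` and the record layout. The two source tags are the ones given above.

```python
from collections import defaultdict


def brier_by_source(predictions):
    """Mean Brier score per source tag over resolved predictions.

    Each prediction is a dict with keys 'source', 'probability' (0..1),
    and 'outcome' (1 if the event happened, 0 if not, None if unresolved).
    """
    totals = defaultdict(lambda: [0.0, 0])  # source -> [sum of squared errors, count]
    for p in predictions:
        if p["outcome"] is None:
            continue  # unresolved predictions cannot be scored yet
        totals[p["source"]][0] += (p["probability"] - p["outcome"]) ** 2
        totals[p["source"]][1] += 1
    return {source: total / count for source, (total, count) in totals.items()}
```

If the score for "constraint_graph" is lower than the score for "opus_inference" in a given category, the graph is the better-calibrated source there. If it is higher, that category is where to inspect nodes and edges for miscalibration.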
This is the flywheel. Simulate. Predict. Score. Compare sources. Recalibrate. Simulate again. Each turn of the flywheel makes the model more accurate, not because anyone got smarter, but because the architecture is designed to learn from its own errors. It is the same flywheel that took weather forecasting from 50 percent to 90 percent accuracy over fifty years, compressed into a domain where the physics is simpler and the feedback cycle is faster.
References
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd ed. Cambridge University Press.
- Bauer, P., Thorpe, A., & Brunet, G. (2015). "The quiet revolution of numerical weather prediction." Nature, 525, 47-55.
- Zuckerman, G. (2019). The Man Who Solved the Market: How Jim Simons Launched the Quant Revolution. Portfolio/Penguin.
- Patterson, S. (2010). The Quants: How a New Breed of Math Whizzes Conquered Wall Street and Nearly Destroyed It. Crown Business.
- Arrow, K. et al. (2008). "The Promise of Prediction Markets." Science, 320, 877-878.
- Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search. 2nd ed. MIT Press.
- Tetlock, P.E. (2005). Expert Political Judgment: How Good Is It? How Can We Know? Princeton University Press.
- NOAA. (2022). Weather and Climate Operational Supercomputing System (WCOSS) documentation.
- Sargent, R.G. (2005). "Verification and validation of simulation models." Proceedings of the Winter Simulation Conference, 130-143.
- Oreskes, N., Shrader-Frechette, K., & Belitz, K. (1994). "Verification, Validation, and Confirmation of Numerical Models in the Earth Sciences." Science, 263(5147), 641-646.
- Koller, D. & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Heckerman, D. (1995). "A Tutorial on Learning with Bayesian Networks." Microsoft Research Technical Report MSR-TR-95-06.