LLMs Play Games Better When They Write Simulators First

Wolfgang Lehrach · Stanford HAI · Friday, June 5, 2026 · 17 min read

DeepMind research scientist Wolfgang Lehrach argues that language models should not be asked to play games directly when their outputs are slow, strategically weak, or illegal. In a Stanford HAI seminar, he presents Code World Models, which use LLMs to translate natural-language rules and play traces into executable game simulators that planners such as Monte Carlo Tree Search or reinforcement learning can use. He also describes Autoharness, a narrower system that synthesizes code to check action legality, as part of the same broader case for turning LLM knowledge into executable structure rather than immediate moves.

The central move is to make the LLM write the game, not play it

Wolfgang Lehrach’s main distinction is between using a language model as a policy and using it to synthesize something the rest of an AI system can reason with. In the standard setup, the environment state is passed to an LLM and the LLM emits an action: a chess move, a blackjack decision, a text command. A related “code-as-policy” setup asks the LLM to infer a strategy from trajectories and write a function that directly chooses actions.

Lehrach’s first alternative, Code World Models, changes the target. The LLM is not asked to say what to do. It is asked to write executable code that replicates the game’s dynamics. Once that model exists, conventional planning methods—Monte Carlo Tree Search, reinforcement learning, or other game-solving algorithms—can operate on the code.

His analogy was deliberately mundane: if an alien demanded that humanity win a game or be destroyed, the natural response would not be to ask a language model for one move at a time. A human team would implement the game and run a planner. Code World Models are an attempt to give an LLM that tool: synthesize a model of the game, then let search and learning do the parts they are good at.

If an alien landed and said, “It is imperative that you win at this game no matter what, otherwise the earth will be destroyed,” then what would you do? You would immediately sit down and implement the game and then use a planner.

Wolfgang Lehrach

The motivation is not that LLMs have no useful game knowledge. Lehrach’s account depends on the opposite: LLMs often know rule language, common tactical concepts, and programming idioms. The problem is that direct policy generation is a poor way to use that knowledge. For unfamiliar or out-of-distribution games, LLMs tend to make low-quality moves. In adversarial multiplayer settings, they struggle with strategic reasoning. They are slow at test time if they rely on long chain-of-thought planning. Training an LLM directly with reinforcement learning is, in his words, “insanely expensive” and sample inefficient for this purpose.

Code-as-policy is broader and can sometimes work in simple environments, but Lehrach said it breaks down as games become more complicated. A fixed policy function has to absorb strategic reasoning, non-stationarity, and feedback from adversarial play. In principle, a sufficiently capable code-as-policy system could decide to recreate the game and run MCTS itself. In practice, the work he presented separates those responsibilities explicitly: synthesize the world model first, then plan over it.

Approach	What the LLM produces	What happens at evaluation time
LLM-as-policy	An action directly from the current environment state	The environment state is sent to the LLM; the LLM returns a move
Code-as-policy	A function that maps environment state to an action	The environment state is passed to the learned function; the function returns a move
Code World Model	A function that replicates the game dynamics	A planner such as MCTS uses the synthesized model to search before selecting a move
Code harness	A partial helper function, such as an action verifier	The LLM proposes actions while synthesized code filters or assists them

The core distinction in Lehrach’s framing: whether code is the policy, the world model, or a runtime harness

Lehrach’s baseline flow was “Env → LLM → Action.” Code-as-policy inserted a learned function between environment and action. The Code World Model flow routed the environment through “MCTS + F(...).” The harness flow used “LLM + F(...)” at evaluation time. The difference is not merely notation. In the CWM case, F is meant to reproduce the environment. In the harness case, F may only answer a narrower question, such as whether a proposed action is valid.

Rules are not enough, and trajectories are not enough either

The Code World Model system uses two kinds of evidence: natural-language rules and game trajectories. Lehrach emphasized that natural-language rules are ambiguous in the way board-game rules are ambiguous when people first sit down to play. Corner cases and interactions often become clear only through play. Trajectories, meanwhile, provide concrete examples of state transitions.

The synthesis pipeline uses the rules to prompt the LLM to generate code, then turns trajectories into unit tests. A candidate world model is evaluated by whether, given a state and an action from a trajectory, it produces the next state that actually occurred. In fully observed games, the goal is to learn a Python function F such that F(s_t, a_t) = s_(t+1).

The scoring rule is the number of trajectory transitions the candidate function reproduces: for each trajectory, and each observed state-action-next-state triple in it, the test checks whether F(s_t, a_t) equals s_(t+1). The selected model is the candidate with the highest transition score.
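The scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the names `transition_score`, `good_model`, and `buggy_model`, the toy counter game, and the format of the failure messages are all assumptions made for the example.

```python
from typing import Any, Callable, Iterable, List, Tuple

State = Any
Action = Any
Transition = Tuple[State, Action, State]  # (s_t, a_t, s_{t+1})


def transition_score(
    model: Callable[[State, Action], State],
    trajectories: Iterable[List[Transition]],
) -> Tuple[int, int, List[str]]:
    """Count how many observed transitions the candidate model reproduces.

    Returns (passed, total, failure_messages). The messages stand in for the
    textual feedback a refiner could read; their format is an assumption.
    """
    passed, total, failures = 0, 0, []
    for traj_idx, trajectory in enumerate(trajectories):
        for step, (state, action, expected) in enumerate(trajectory):
            total += 1
            try:
                predicted = model(state, action)
            except Exception as exc:  # a crash counts as a failed test
                failures.append(f"traj {traj_idx} step {step}: raised {exc!r}")
                continue
            if predicted == expected:
                passed += 1
            else:
                failures.append(
                    f"traj {traj_idx} step {step}: "
                    f"expected {expected!r}, got {predicted!r}"
                )
    return passed, total, failures


# Toy game (hypothetical): the state is a counter and an action adds to it.
def good_model(state: int, action: int) -> int:
    return state + action


def buggy_model(state: int, action: int) -> int:
    return state + abs(action)  # wrong whenever the action is negative


toy_trajectories = [[(0, 1, 1), (1, 2, 3)], [(5, -2, 3)]]
good_result = transition_score(good_model, toy_trajectories)
buggy_result = transition_score(buggy_model, toy_trajectories)
```

Returning the failure messages alongside the count reflects the point made in the talk that a scalar score alone is weak feedback; the messages are what a refiner would actually read.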

The choice of Python and unit tests is not incidental. Lehrach said the system is deliberately kept close to what LLMs have seen in training: large quantities of Python code and many examples of fixing Python unit-test failures. A scalar pass/fail score alone would be too weak. Code execution produces traces, error messages, assertion failures, and opportunities for debugging. He described the desired optimizer as a form of “text gradient descent”: given code, scores, and textual failure feedback, improve the code.

The search over code variants is structured as a tree. Each node contains an attempted implementation of the game engine. A refiner uses failed unit tests and score feedback to generate a new child implementation. Thompson sampling decides which promising node to expand, balancing refinement of good candidates against exploration of alternatives. Lehrach noted that other code-evolution systems, including AlphaEvolve-like mutation-based search, are relevant to the same class of problem, though the work he described used a method with stronger support for textual feedback at the time.
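One way to picture the node-selection step is the sketch below. The talk did not specify how Thompson sampling was parameterized, so the Beta posterior over each node's test pass rate, the `Node` fields, and the helper names are assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    code: str                      # candidate implementation of the game engine
    passed: int                    # unit tests passed by this candidate
    failed: int                    # unit tests failed by this candidate
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def select_node(nodes: List[Node], rng: random.Random) -> Node:
    """Thompson sampling: draw a plausible pass rate per node, pick the best draw."""
    def draw(node: Node) -> float:
        # Beta(1 + passed, 1 + failed) is an assumed posterior, not the talk's.
        return rng.betavariate(1 + node.passed, 1 + node.failed)
    return max(nodes, key=draw)


def expand(node: Node, new_code: str, passed: int, failed: int) -> Node:
    """Attach a refined child; in a real system new_code would come from the LLM."""
    child = Node(new_code, passed, failed, parent=node)
    node.children.append(child)
    return child


rng = random.Random(0)
root = Node("v0", passed=2, failed=8)
better = expand(root, "v1", passed=9, failed=1)
picks = [select_node([root, better], rng).code for _ in range(1000)]
```

Because each selection is a random draw, the stronger candidate is expanded most of the time while weaker ones still get occasional attention, which is the refinement-versus-exploration balance described above.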

The broader point is that the system can check consistency between representations. Natural-language rules should match trajectories. Code-generated trajectories should match both. An LLM can be asked to examine whether a code model appears consistent with the rules. This consistency checking is one of the recurring themes in the work: different representations of the same environment can supervise and constrain one another.

In fully observed games, synthesized models were accurate enough to plan over

In the fully observed setting, every player can see the complete game state, as in chess or tic-tac-toe. Stochasticity is represented by a special “chance player” that performs actions such as dealing cards or rolling dice. Once chance events are expressed as actions, the state transition function itself can be deterministic: given the current state and the action, produce the next state.
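A small sketch makes the chance-player idea concrete. The dice game, the `CHANCE` player id, and the `State` fields below are hypothetical; the point is only that `step` contains no randomness once the die roll is treated as an action.

```python
import random
from typing import NamedTuple, Tuple

CHANCE = -1  # hypothetical id for the chance player


class State(NamedTuple):
    current_player: int          # CHANCE, 0, or 1
    next_player: int             # who moves once the chance node resolves
    die: int                     # last roll, 0 if none
    scores: Tuple[int, int]


def legal_actions(state: State) -> list:
    if state.current_player == CHANCE:
        return [1, 2, 3, 4, 5, 6]        # chance "chooses" a die face
    return ["bank", "pass"]


def step(state: State, action) -> State:
    """Deterministic transition: randomness lives only in which action chance takes."""
    if state.current_player == CHANCE:
        return state._replace(current_player=state.next_player, die=action)
    scores = list(state.scores)
    if action == "bank":
        scores[state.current_player] += state.die
    return State(CHANCE, 1 - state.current_player, 0, tuple(scores))


def sample_chance(state: State, rng: random.Random):
    """The environment driver, not step(), samples the chance action."""
    return rng.choice(legal_actions(state))


start = State(CHANCE, 0, 0, (0, 0))
after_roll = step(start, 4)              # chance action: roll a 4
after_bank = step(after_roll, "bank")    # player 0 banks the roll
```

Keeping the sampling outside `step` is what lets a trajectory, including its dice rolls, be replayed exactly as a unit test.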

The work tested games including backgammon, connect four, tic-tac-toe, generalized tic-tac-toe, and generalized chess. The generalized games were included because known games risk contamination: an LLM may already have seen them in training. Lehrach reported transition accuracy above 99% for all five perfect-information games.

>99% synthesis transition accuracy reported for all five perfect-information games

The next step is to plan with the synthesized model. Lehrach used Monte Carlo Tree Search as the main example. MCTS expands a tree of possible moves, selects promising branches, simulates continuations, and backpropagates outcomes. Because the world model is code, expansion and rollout are computationally cheap compared with repeatedly asking an LLM to reason about future moves in text.
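The loop described above can be sketched as a compact UCT-style MCTS with random rollouts. The one-pile Nim game stands in for a synthesized model and is purely illustrative; the function names, the exploration constant, and the iteration count are assumptions rather than details from the talk.

```python
import math
import random

# Toy stand-in for a synthesized model: one-pile Nim, take 1-3 stones,
# whoever takes the last stone wins. State is (stones_left, player_to_move).
def legal_actions(state):
    stones, _ = state
    return [n for n in (1, 2, 3) if n <= stones]

def step(state, action):
    stones, player = state
    return (stones - action, 1 - player)

def is_terminal(state):
    return state[0] == 0

def winner(state):
    return 1 - state[1]  # the player who just moved took the last stone


class Node:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = legal_actions(state)
        self.visits = 0
        self.wins = 0.0  # from the viewpoint of the player who moved into this node


def uct_child(node, c=1.4):
    return max(
        node.children,
        key=lambda ch: ch.wins / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )


def mcts(root_state, iterations, rng):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection: descend while the node is fully expanded and has children.
        while not node.untried and node.children:
            node = uct_child(node)
        # Expansion: add one unexplored child, if any remain.
        if node.untried:
            action = node.untried.pop(rng.randrange(len(node.untried)))
            child = Node(step(node.state, action), parent=node, action=action)
            node.children.append(child)
            node = child
        # Rollout: random play to the end, using only the model.
        state = node.state
        while not is_terminal(state):
            state = step(state, rng.choice(legal_actions(state)))
        won_by = winner(state)
        # Backpropagation: credit nodes whose mover won the rollout.
        while node is not None:
            node.visits += 1
            if node.parent is not None and node.parent.state[1] == won_by:
                node.wins += 1
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).action


best = mcts((5, 0), iterations=3000, rng=random.Random(0))
```

Every call inside the loop is a plain function call on the model, which is why thousands of simulated continuations cost far less than a single round of text-based reasoning.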

For leaf evaluation, the system can either use random rollouts or ask an LLM to propose value functions. Lehrach’s example was intuitive: a language model may know that losing a queen in chess is bad or that a connect-four position with three in a row and open ends is promising. Those intuitions can be distilled into heuristic value functions, which are then selected through tournament-style evaluation rather than regressed against a ground-truth value label.

The empirical contrast with direct LLM play was strongest where the baseline model either forfeited or planned poorly. Against Gemini 1.5 Pro, CWM-MCTS was evaluated as both player 0 and player 1 across win, draw, loss, win-by-forfeit, and loss-by-forfeit outcomes, with additional comparisons to ground-truth MCTS and random valid-action play. Lehrach emphasized backgammon: Gemini 1.5 Pro “could just never finish a single game without messing up,” because at some point it made an illegal move and lost. In connect four, illegal moves were less the issue; the model’s strategy was poor. The synthesized-code approach could explore many futures in the time a direct LLM policy spent reasoning only a move or two ahead.

Lehrach also compared MCTS using the synthesized model with MCTS using the ground-truth game model. He did not dwell on the details, but said the gap was small, as one would expect: once the synthesized model is accurate, planning over it behaves similarly to planning over the true implementation.

Partial observability turns world modeling into inference over hidden histories

The more difficult case is imperfect-information play: blackjack, poker-like games, card games where hidden hands matter. In those settings, a planner cannot simply assume access to the true full state. Seeing an opponent’s cards in poker would make planning easier, but it would be cheating.

Lehrach introduced two additional components. The first is an observation mapping M, which maps the full state to what a given player can observe. In blackjack, for example, it filters out hidden dealer information. The second is an imputation model I, which samples hidden histories compatible with a player’s observations and actions. Information Set MCTS then plans by repeatedly resampling plausible hidden states and averaging over what is unknown.

In practical terms, if a blackjack player sees a 2 and a 10, the imputation model must imagine possible dealer cards and chance events that would be consistent with that observation. It must infer actions for all players and chance, not just the observing player. The system then replays those hypothesized actions through the transition model F and validates that applying M produces the observations the player actually saw.
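The interplay of F, M, and I can be sketched on a toy game in which chance deals one card to each of two players and each player sees only their own card. The game, the four-card deck, and the helper `replay_matches` are hypothetical; only the roles of the three functions follow the description above.

```python
import random
from typing import List, Tuple

DECK = [1, 2, 3, 4]          # toy deck, each card exists exactly once
State = Tuple[int, ...]      # cards dealt so far, in player order


def F(state: State, action: int) -> State:
    """Transition: a chance action deals the named card to the next player."""
    if action not in DECK or action in state:
        raise ValueError(f"card {action} cannot be dealt")
    return state + (action,)


def M(state: State, player: int) -> int:
    """Observation mapping: each player sees only their own card."""
    return state[player]


def I(observation: int, player: int, rng: random.Random) -> List[int]:
    """Imputation: sample a chance history consistent with what the player saw."""
    hidden = rng.choice([card for card in DECK if card != observation])
    return [observation, hidden] if player == 0 else [hidden, observation]


def replay_matches(history: List[int], observation: int, player: int) -> bool:
    """Validate a hypothesized history by replaying it through F and applying M."""
    state: State = ()
    try:
        for action in history:
            state = F(state, action)
    except ValueError:
        return False
    return len(state) == 2 and M(state, player) == observation


rng = random.Random(0)
samples = [I(2, 0, rng) for _ in range(200)]      # player 0 holds a 2
all_valid = all(replay_matches(h, 2, 0) for h in samples)
opponent_cards = {h[1] for h in samples}
```

A planner in the Information Set MCTS style would draw histories like `samples` repeatedly and search from each replayed state, so its move choice averages over the cards it cannot see.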

Lehrach distinguished two learning regimes. In “open-deck” learning, hidden state is visible during training, as when people learning a card game reveal their hands so everyone can understand the rules. At test time the game is still closed-deck, but training is easier because the system sees the latent state. In “closed-deck” learning, the model sees only the player’s observations and actions from the beginning. It must infer both the hidden state and the rules governing it.

Closed-deck learning is substantially harder. The model must construct entire plausible histories under constraints: a card cannot be drawn twice, turns must occur in the correct order, hidden actions must explain visible observations. Lehrach said the chain-of-thought traces for these models become long because the imputation problem is so involved.

Setting	What the learner sees	Main difficulty
Open-deck training	Hidden state during training, closed-deck at test time	Learning game dynamics and observation mappings with full latent information available during learning
Closed-deck training	Only each player’s observations and actions	Inferring hidden histories, chance events, and latent state consistent with observations

Lehrach’s distinction between open-deck and closed-deck learning in imperfect-information games

The results reflected that difficulty. For five imperfect-information games (Bargaining, Gin Rummy, Hand of War, Leduc poker, and Quadranto), the system improved as more LLM calls were used. Lehrach reported that three of the five imperfect-information games reached the transition-accuracy threshold, and he singled out Gin Rummy as the notable failure to reach 90%. Inference accuracy was generally lower than transition accuracy for open-deck games, and closed-deck-learned models were typically less accurate than open-deck-learned models.

In gameplay, Lehrach described the imperfect-information results as generally working but harder to make clearly superior to LLM-as-policy in the closed-deck setting. For Quadranto, Hand of War, and Bargaining, he said the system outperformed the base LLM. Other games were more ambiguous.

Closed-deck learning resembles a heavily regularized autoencoder

Lehrach treated closed-deck world-model learning as more than a games trick. He described it as “basically a heavily regularized autoencoder.” The encoder is the imputation model I, which reconstructs a hidden history of actions and chance events from observations. The decoder is the deterministic replay through F and M, which must regenerate the observations.

That formulation matters because it suggests a broader modeling pattern: given data from a system, infer hidden mechanisms and stochastic events that could have produced it. He gave a non-game example: election data over the last decade. With no meaningful actions, or with actions treated as no-ops, the model could be asked to infer a generative structure that explains the observed changes. The point was not that the presented system solved election modeling, but that the same code-world-model framing could apply to datasets where latent causes and chance events matter.

The explicit extraction of stochasticity is what enables counterfactual reasoning. In a card game, one can ask what would have happened if the dealer drew a 10 instead of a jack. In an election model, one could ask what would have happened if a different person had been nominated. Lehrach was careful to qualify this: “how well it works is up to the model.” But he framed the representation as causal in spirit because chance events and interventions become explicit objects in code.

This also exposes under-specification. Many different models can explain the same observations. Lehrach illustrated this with a simple dice example. Suppose the true game uses a four-sided die, but the learned model uses a six-sided die. If the imputation model simply never samples 5 or 6 for the observed data, the model may pass every reconstruction test. It is still worse: probability mass leaks into impossible outcomes, so the model needs more support than necessary to explain the same observations.

The remedy he described is a log-likelihood lower bound or related complexity criterion: prefer models that explain the data with less unnecessary probability mass. For simple games this may not matter much, but Lehrach said it becomes useful for harder probabilistic models.
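The dice example can be worked through numerically. The sketch below uses an exact log-likelihood for a fair die, which is a simplification of the lower bound Lehrach mentioned, and the observed rolls are made up for illustration.

```python
import math


def log_likelihood(rolls, sides):
    """Log-probability of the rolls under a fair die with the given number of sides."""
    if any(r < 1 or r > sides for r in rolls):
        return float("-inf")       # the model cannot produce this roll at all
    return -len(rolls) * math.log(sides)


observed = [1, 4, 2, 3, 3, 1, 4, 2]   # hypothetical rolls, all between 1 and 4
ll_d4 = log_likelihood(observed, 4)
ll_d6 = log_likelihood(observed, 6)
gap_per_roll = (ll_d4 - ll_d6) / len(observed)   # equals log(6 / 4)
```

Both models reconstruct every observed roll, yet the six-sided model pays log(6/4) in log-likelihood on each one, so the penalty grows with the amount of data and a likelihood-based criterion eventually separates the two.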

Planning can be amortized into a policy, but search remained stronger in the simple implementation

MCTS is powerful but can be slow at runtime because the system uses the world model during action selection. Lehrach also tried replacing online search with reinforcement learning. The idea is to synthesize the CWM, generate imaginary rollouts from it, and train a reactive policy with PPO. At runtime, the policy acts directly; the CWM is no longer invoked.

The PPO setup trained policies for 10 million steps against a randomly playing opponent, separately for each game and player ID. The training trajectories were generated by imaginary rollouts from the CWM, not real environment steps. Lehrach’s informal point was that code makes large-scale rollout cheap: he said they could unroll the model “a billion times or whatever,” and that in JAX this could take minutes, or maybe around an hour.

10M PPO training steps per game and player ID in the CWM rollout setup

The results were mixed. PPO trained on the CWM matched or beat LLM-as-policy for all tested games, according to Lehrach’s summary, but was less consistently strong against CWM plus search. Lehrach said the naive PPO implementation “did okay” but was not as good as MCTS. His expectation was that strong systems would be hybrid, combining amortized learning with search and planning, as many strong game-playing systems do.

Autoharness narrows the task: do not model the world, just stop illegal actions

The second piece of work, Autoharness, is less ambitious than a full Code World Model but more immediately practical for text-based LLM agents. Instead of synthesizing a complete simulator, the LLM synthesizes a partial harness that helps a base LLM avoid bad actions—especially illegal moves.

Lehrach motivated the system with a failure mode that appeared repeatedly: many direct LLM losses were not strategic losses but forfeits. In a TextArena chess competition, he reported that 78% of Gemini 2.5 Flash’s losses were due to illegal moves. His framing was pragmatic: legal play does not solve strong play, but there is no point optimizing a strong policy if the agent cannot reliably make legal moves.

Autoharness learns two functions:

Function	Purpose
propose_action(obs: str) -> str	Propose a valid random action from a text observation
eval_action(obs: str, action: str) -> bool	Return whether an action string is valid for the observation

The two code functions learned by the Autoharness system

At evaluation time, one simple policy uses the LLM as the action proposer and the learned code as a verifier. The LLM proposes an action; the harness checks whether it is valid. If not, the LLM receives feedback and tries again, up to a maximum number of attempts. Lehrach described this as a guardrail: the code is a partial world model that predicts a validity bit rather than the full next state.
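The propose-verify-retry loop might look like the following sketch. Only the `eval_action(obs, action) -> bool` signature comes from the talk; the `act_with_harness` wrapper, the feedback wording, the scripted stand-in for the LLM, and the toy board format are assumptions.

```python
from typing import Callable, Iterable, Optional


def act_with_harness(
    obs: str,
    llm_propose: Callable[[str, Optional[str]], str],
    eval_action: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask the LLM for an action, retrying with feedback until the harness accepts."""
    feedback = None
    for _ in range(max_attempts):
        action = llm_propose(obs, feedback)
        if eval_action(obs, action):
            return action
        feedback = f"Action {action!r} was rejected as invalid; propose another."
    return None  # the caller decides the fallback, e.g. a code-proposed action


# Toy harness (hypothetical): the observation lists nine cells, "." means empty,
# and a valid action is the index of an empty cell.
def eval_action(obs: str, action: str) -> bool:
    cells = obs.split()
    return action.isdecimal() and int(action) < len(cells) and cells[int(action)] == "."


def make_scripted_llm(answers: Iterable[str]):
    """Stand-in for an LLM call that replays a fixed list of answers."""
    it = iter(answers)

    def propose(obs: str, feedback: Optional[str]) -> str:
        return next(it)

    return propose


board = "X . O . X . . . O"
chosen = act_with_harness(board, make_scripted_llm(["0", "center", "3"]), eval_action)
gave_up = act_with_harness(board, make_scripted_llm(["0", "0", "0"]), eval_action)
```

Returning `None` after the attempt budget leaves the fallback policy open, which matches the observation that the harness guarantees legality checks rather than good moves.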

Other combinations are possible. The code can propose many legal random actions and ask the LLM to choose among them. The code can be used more like a policy, proposing and verifying. But the central result focused on action verification.

The evaluation used TextArena, which provides more than 140 games, including multiplayer poker, board, and card games, with heterogeneous action formats and complex text observations. Lehrach said the team removed legal-action listings from the environment to make the benchmark emphasize the problem they cared about.

The main legality result was simple: the learned harness achieved 100% legal move accuracy on all 145 evaluated two-player and one-player games, across 10,000 steps for 10 random seeds per game. Lehrach showed examples where Gemini 2.5 Flash alone had substantially lower legal-action accuracy—Stratego at 76.12%, chess at 83.98%, Othello at 88.08%, Tak at 92.04%—while the harnessed version reached 100%.

100% legal move accuracy reported for the learned harness on all 145 evaluated TextArena 2P/1P games

Lehrach also presented performance comparisons, not just legality. On a selected set of 16 relatively long two-player TextArena games, Gemini 2.5 Flash plus harness beat Gemini 2.5 Pro without the harness with a reported player reward of 0.563. He characterized the point as a cost and quota advantage: a smaller, cheaper model with a little infrastructure can beat a larger model without it. Against base Gemini 2.5 Flash, the harnessed version improved by roughly 15 percentage points in the comparison he described.

For one-player games, he said Gemini 2.5 Flash plus harness outperformed both Gemini 2.5 Flash and Gemini 2.5 Pro. He also highlighted a comparison in which Gemini 2.5 Flash plus harness outperformed “GPT-5.2-High” on TextArena one-player games. The cost figures he gave for that run were approximately $0.0 for the harnessed system versus $640 for GPT-5.2-High.

The two systems differ most in where the supervision comes from

The Q&A clarified an important distinction between the two systems. Code World Models are built from rules plus a small number of trajectories; Lehrach said the CWM experiments generally used five random gameplays plus the rules. Without rules, the LLM can sometimes infer the system, but it takes longer and overfits more easily because less information is available.

Autoharness, by contrast, assumes many interactions with the real game engine. It is not restricted to learning from limited interaction. During training, the synthesized code proposes moves and checks legality while interacting repeatedly with the environment. The true game engine determines whether a move is legal. The system then uses contradictions (moves it proposed that were rejected, or moves it judged illegal that were accepted) as feedback for further code synthesis.
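The contradiction-gathering step can be sketched as a comparison between the draft harness and the real engine over a batch of probes. The function `collect_contradictions`, the toy stone-taking game, and the deliberately flawed `draft_eval_action` are hypothetical; how the actual system samples probes and formats feedback was not described.

```python
from typing import Callable, List, Tuple

Probe = Tuple[str, str]  # (observation text, action text)


def collect_contradictions(
    probes: List[Probe],
    eval_action: Callable[[str, str], bool],
    engine_accepts: Callable[[str, str], bool],
) -> Tuple[List[Probe], List[Probe]]:
    """Return probes where the harness and the real engine disagree."""
    false_accepts, false_rejects = [], []
    for obs, action in probes:
        predicted = eval_action(obs, action)
        actual = engine_accepts(obs, action)
        if predicted and not actual:
            false_accepts.append((obs, action))    # harness said legal, engine refused
        elif actual and not predicted:
            false_rejects.append((obs, action))    # harness said illegal, engine accepted
    return false_accepts, false_rejects


# Toy engine (hypothetical): take 1 to 3 stones, never more than remain.
def engine_accepts(obs: str, action: str) -> bool:
    stones = int(obs.split(":")[1])
    return action in {"1", "2", "3"} and int(action) <= stones


# Flawed draft harness: forgets that "1" is legal and ignores the pile size.
def draft_eval_action(obs: str, action: str) -> bool:
    return action in {"2", "3"}


probes = [("stones: 2", "3"), ("stones: 5", "1"), ("stones: 5", "2"), ("stones: 5", "7")]
false_accepts, false_rejects = collect_contradictions(
    probes, draft_eval_action, engine_accepts
)
```

Each list would be turned into text for the next synthesis iteration, so the checker is hardened exactly where it disagreed with the engine.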

Lehrach said that for TextArena, learning a harness involved on the order of tens of synthesis iterations and large numbers of cheap game interactions, roughly hundreds of thousands per game. The LLM was not playing each training game as a policy; it was synthesizing code that could interrogate the game engine. That is why the supervision structure is so different from CWM: the first system tries to infer a compact simulator from limited demonstrations and rules; the second can cheaply query the real environment many times and use those queries to harden a narrower checker.

System	Primary supervision	Interaction assumption	Main failure mode
Code World Model	Natural-language rules plus a small set of trajectories	Can operate with limited demonstrations; Lehrach said the experiments generally used five random gameplays plus rules	The synthesized simulator may miss dynamics not exposed by rules or trajectories
Autoharness	Repeated interaction with the real game engine and legality feedback	Assumes many cheap environment interactions during harness learning	The harness may cover only part of the legal action space or fail to improve strategic choice

The Q&A clarified that CWM and Autoharness use very different supervision regimes

This difference also shapes the coverage problem. In CWM, the risk is that the synthesized model misses some part of the dynamics because the rules and trajectories did not expose it. In Autoharness, the risk is coverage of the action space. An audience member asked whether the harness might learn only a restricted subset of legal moves—for example, ordinary chess moves but not castling or pawn promotion—thereby achieving legal play while crippling strategy.

Lehrach’s answer was that there is no guarantee of coverage. The system can learn from rules or from seeing an opponent make such a move. Replay buffers can incorporate those examples. The consistency-checking methods can search for trajectories that violate the model, and LLMs can be asked to look for moves the system might be missing. But in large combinatorial action spaces, exhaustive enumeration is impossible. If the model plays a restricted legal subset, the signal may be indirect: it loses.

That answer is also the boundary of Autoharness’s claim. It improves legality and can improve performance, but it does not solve strong play. Lehrach said ranking legal moves with an LLM produced mixed results, and code-as-policy was “hit and miss.” If the model has seen something similar in training, it may do well; if not, credit assignment in two-player games becomes difficult.

The broader ambition is model-based reasoning, not games per se

When asked what inspired the work, Lehrach pointed first to model-based reinforcement learning. His ideal target was not merely benchmark games but environments like NetHack or Crafter, where an agent could play and solve a game “in a single life” by quickly building a model. He contrasted that with putting a transformer on raw pixels and asking it to learn a new game from scratch. The promise, for him, is to distill prior knowledge from the LLM into an explicit model that can be searched, simulated, debugged, and manipulated.

He also connected the work to probabilistic modeling and Judea Pearl-style counterfactual reasoning. In his view, a successful system could build causal world models from data, pull out stochastic effects, and support counterfactual inference. He described it as akin to building probabilistic graphical models, but at a different level of abstraction: instead of a human hand-authoring the model, the human provides biases and the LLM builds the model and performs inference.

Enterprise applications came up in that context. Lehrach did not claim a specific business use case. He said many business systems are effectively models—spreadsheets, for example—and that understanding them causally, identifying stochastic effects, and reasoning about interventions could have value. But he also said he had not found a good concrete enterprise application yet.

On harnesses as runtime control layers for agents, Lehrach’s answer was that harnesses are already common. Coding tools and other LLM applications are extensively wrapped because raw model outputs need structure. Autoharness is one particular way to synthesize such a wrapper.

Asked about environments beyond TextArena, he mentioned Crafter and the possibility of simplified modeling for 3D games, such as inferring bounding boxes or simulating motion. He also speculated that science applications could fit the same pattern if the synthesized code were generalized to differential equations, JAX programs, differentiable simulators, or other model classes. He did not present those as completed results, calling them more a bias than an answer.
