AlphaGo Shows How Search Can Turn RL Into Supervised Learning
Eric Jang rebuilds AlphaGo as a way to examine why its combination of search, value learning and self-play still matters for modern AI. His central claim is that AlphaGo’s Monte Carlo Tree Search turns each move into a better supervised-learning target, avoiding the long-horizon credit-assignment problem that makes much reinforcement learning for language models inefficient. Jang also argues that current LLM research assistants can already help execute and optimize experiments, but still struggle with the harder judgment of choosing which research paths are worth pursuing.

AlphaGo’s relevance is the density of its learning signal
Eric Jang rebuilt AlphaGo because it remains one of the cleanest examples of a hard problem made tractable by combining search, learned evaluation, and self-play. The important point is not nostalgia for Go. It is that AlphaGo can turn local search into dense supervised labels, while much modern reinforcement learning still has to infer which parts of a long trajectory caused a final success.
Go looked hostile to brute-force search. A 19-by-19 board begins with roughly 361 legal moves, games can run on the order of 250 to 300 moves, and the naive game tree branches so widely that exhaustive search is out of the question. Yet a trained network can stand in for much of that search in a single forward pass.
That compression is what Jang found mysterious. In robotics, he said, neural-network decisions often feel more intuitive: the network maps observations to actions in a way that can be inspected against physical behavior. In Go, the decision appears to amortize “a very, very deep search” through the game tree. A network with perhaps ten layers can acquire a board-level judgment that stands in for an enormous number of possible futures.
AlphaGo was the first paper that kind of like really showed this profound level of simulation being compressed into a small amount of compute.
The practical context has also changed. Jang pointed to KataGo, David Wu’s open-source Go engine, as achieving roughly a 40x reduction in the compute needed to train a strong bot from scratch compared with earlier systems. He said he was not certain whether KataGo is stronger than AlphaGo Zero, AlphaZero, or MuZero, but described it as “very, very strong” and the bot most Go practitioners now train against. With modern rented compute and LLM coding support, he said, work that once required a DeepMind research team and millions of dollars can now be attempted for a few thousand dollars. That figure was his project-level estimate, not a settled benchmark for reproducing AlphaZero from scratch.
The deeper lesson is that AlphaGo uses search not merely to choose moves, but to manufacture better supervised-learning targets. On each move, the raw policy network proposes a distribution over moves. MCTS improves that distribution by looking ahead, guided by the network’s own policy and value estimates. The improved distribution then becomes a target for the network itself. Over time, the network learns to produce in one forward pass what previously required many simulations.
That structure avoids a central difficulty in naive reinforcement learning. Rather than waiting until the end of a 300-move game and trying to infer which move caused the win, AlphaGo can produce a better training target for every move along the way. Jang repeatedly contrasted this with model-free policy-gradient learning, where a final reward must be assigned backward across a long trajectory.
AlphaGo is not trying to do credit assignment on wins. It’s trying to improve the label for any given action you took.
At the implementation level, this makes the system look surprisingly like supervised learning. The value head is trained to predict the eventual winner from a board state. The policy head is trained to imitate the improved move distribution produced by search. Jang described this as the elegance of AlphaGo: the system hill-climbs on improved labels, rather than trying to escape a flat reward landscape where almost every sampled trajectory fails.
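At that level, the training step can be written as two supervised losses. The sketch below assumes a two-headed `net` and per-move records of (state, MCTS visit distribution, final outcome); the names and exact loss choices are illustrative, not Jang's code:

```python
import torch.nn.functional as F

def alphago_zero_step(net, states, mcts_dists, outcomes):
    """Two supervised targets per position:
    - policy head imitates the search-improved move distribution
    - value head predicts the eventual winner (1.0 = current player won)
    """
    policy_logits, value_logit = net(states)

    # Cross-entropy against the full MCTS visit distribution, not a one-hot move.
    policy_loss = -(mcts_dists * F.log_softmax(policy_logits, dim=-1)).sum(dim=-1).mean()

    # Binary classification of the final game outcome from this board state.
    value_loss = F.binary_cross_entropy_with_logits(value_logit.squeeze(-1), outcomes)

    return policy_loss + value_loss
```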
The value function is the shortcut humans already use
Go’s local rules are simple enough to implement quickly. Players place black and white stones on intersections and try to control territory. Black moves first. Stones are captured when all of their orthogonally adjacent intersections (their liberties), not diagonals, are filled by the opponent. Jang used the analogy of a stone being cut off from oxygen.
The simplicity of the rules hides the difficulty of evaluation. A stone or group can be threatened if it has only one remaining liberty. Local fights can force immediate responses. But losing a local group may be rational if it buys influence elsewhere. Jang described this as the beauty of Go: one can “lose the battle but win the war,” and as board size increases, the micro-versus-macro dynamics become more interesting.
The endgame is the useful bridge to value functions. Humans often stop playing well before every territory is mechanically resolved. They agree that the game is over, agree which stones are dead, and score from that mutual judgment. Jang framed this as two humans’ implicit value functions reaching consensus. If they disagree, they keep playing.
Computer Go needs unambiguous rules. Jang therefore described Tromp-Taylor scoring, which is designed to be algorithmically resolvable. Under Tromp-Taylor, a game ends when a player resigns or both players pass consecutively. Scoring counts a player’s stones on the board plus the empty intersections that reach only that player’s stones. This can diverge from human intuition: positions a human knows are dead may still count as territory under the mechanical scoring rule until they are fully resolved.
That distinction matters for training. Humans can glance at a board and decide that an apparently unresolved group has no chance of life. A program using Tromp-Taylor scoring may need the game played out to remove ambiguity. The value network’s job is to learn the human-like shortcut: given a board state, estimate the probability of winning without expanding the full tree to terminal resolution.
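A rough sketch of Tromp-Taylor area scoring as described: count each player's stones, then flood-fill empty regions and award a region only if it reaches exactly one color. The board encoding is an assumption, and komi is omitted:

```python
def tromp_taylor_score(board):
    """board[r][c] in {'B', 'W', '.'}. Returns (black_points, white_points)."""
    n = len(board)
    score = {'B': 0, 'W': 0}
    for row in board:
        for cell in row:
            if cell in score:
                score[cell] += 1

    seen = set()
    for r in range(n):
        for c in range(n):
            if board[r][c] != '.' or (r, c) in seen:
                continue
            # Flood-fill one empty region and record which colors it reaches.
            region, reaches, stack = [], set(), [(r, c)]
            while stack:
                y, x = stack.pop()
                if (y, x) in seen:
                    continue
                seen.add((y, x))
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < n and 0 <= nx < n:
                        if board[ny][nx] == '.':
                            stack.append((ny, nx))
                        else:
                            reaches.add(board[ny][nx])
            # Empty points count only if they reach exactly one color.
            if reaches == {'B'}:
                score['B'] += len(region)
            elif reaches == {'W'}:
                score['W'] += len(region)
    return score['B'], score['W']
```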
On the blackboard, Jang wrote the value function as:

$$v_\theta(s) \approx \Pr(\text{win} \mid s),$$

the estimated probability that the current player wins from board state $s$.
This is how AlphaGo attacks the depth of the tree. Without a value function, a search process would need to roll positions much farther forward before assigning reliable values. With a value function, a leaf in the partial search tree can be evaluated immediately.
The complementary network is the policy:

$$p_\theta(a \mid s),$$

a distribution over legal moves $a$ given the board state $s$.
The policy addresses the breadth problem. It says which moves are worth considering. The value network reduces how far the system must search; the policy network reduces how many branches it should search. Jang summarized the high-level division as two problems: the breadth of the tree and the depth of the tree. AlphaGo shrinks both.
MCTS builds a sparse tree, then turns visit counts into a policy
AlphaGo’s search is an iterative tree-building process. On a given turn, the AI encodes the current board state and treats it as the root node. It then runs a fixed number of simulations—Jang mentioned ranges such as 200 to 2048 during training, while AlphaGo Lee used far more at match time—to expand and evaluate promising branches.
The search object is a tree of states. Because Go is deterministic and fully observable, an action can be inferred from the child state: moving from a parent board to a child board corresponds to placing a stone at the differing location. Each node stores visit counts, a mean action value, a prior probability from the policy network, and pointers to children.
| Quantity | Meaning in Jang’s MCTS explanation |
|---|---|
| N_a | Visit count for taking action a from the parent |
| Q_a or Q(s,a) | Mean action value estimated from downstream evaluations |
| P_a | Prior probability for action a from the policy network |
| children | The child states reached by legal moves from the current node |
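A plausible node structure matching the table, written as a sketch rather than a reproduction of any particular implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object                 # the board position this node represents
    prior: float                  # P_a: policy-network prior for the move that led here
    visit_count: int = 0          # N_a: simulations that passed through this node
    value_sum: float = 0.0        # running total of backed-up evaluations
    children: dict = field(default_factory=dict)   # move -> child Node

    @property
    def q_value(self) -> float:
        # Q_a: mean action value over all simulations through this node
        return self.value_sum / self.visit_count if self.visit_count else 0.0
```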
The basic problem is an exploration-exploitation tradeoff. If the system knew the exact value of every child, it could always choose the highest-value action. But because it is building the tree while searching it, it must sometimes explore actions it has not yet evaluated enough. Jang introduced the UCB1 bandit criterion as a predecessor and then described AlphaGo’s PUCT rule, which incorporates the neural network’s policy prior:
$$a^{*} = \arg\max_a \left[\, Q(s,a) + c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{N(s)}}{1 + N(s,a)} \,\right]$$

Here, $Q(s,a)$ is the mean action value; $P(s,a)$ is the policy network’s prior probability for action $a$; $N(s)$ is the parent visit count; and $N(s,a)$ is the action’s visit count. The $1 + N(s,a)$ denominator makes frequently visited actions less attractive for exploration. The prior means the search is not uniform: it is guided toward moves the network already regards as plausible.
Dwarkesh Patel restated the intuition: $Q(s,a)$ represents how often simulations downstream of a node lead to wins, while the exploration term asks whether a branch has been sampled enough relative to alternatives. Jang agreed, with one clarification. Go itself is deterministic, so probabilities enter not because the game is stochastic, but because the search samples from a distribution over possible futures. The value $Q(s,a)$ is the expected action value under the random distribution induced by the search process.
Each MCTS simulation has four steps: selection, expansion, evaluation, and backup. Selection walks down the existing tree using PUCT. Expansion adds children when the simulation reaches a node not yet fully expanded. Evaluation uses the value network to estimate whether the position is winning. Backup propagates that evaluation back up the tree, updating visit counts and mean action values.
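A single simulation might look like the sketch below, reusing the `Node` structure above. `policy_value(state)` (move priors plus a win estimate), `legal_moves`, and `apply_move` are hypothetical helpers, and the sign convention in the backup is one of several used in practice:

```python
import math

C_PUCT = 1.5   # exploration constant; published systems tune this differently

def puct_score(parent: Node, child: Node) -> float:
    # Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
    u = C_PUCT * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
    return child.q_value + u

def simulate(root: Node, policy_value, legal_moves, apply_move):
    """One MCTS simulation: selection, expansion, evaluation, backup."""
    path = [root]

    # 1. Selection: descend through existing children by the PUCT criterion.
    while path[-1].children:
        parent = path[-1]
        _, child = max(parent.children.items(), key=lambda kv: puct_score(parent, kv[1]))
        path.append(child)

    # 2. Expansion and 3. Evaluation: the network supplies priors for new children
    #    and a value estimate for the leaf position (no rollout to game end).
    leaf = path[-1]
    priors, value = policy_value(leaf.state)
    for move in legal_moves(leaf.state):
        leaf.children[move] = Node(state=apply_move(leaf.state, move), prior=priors[move])

    # 4. Backup: update counts and mean values along the path. The sign flip
    #    reflects alternating players; exact conventions vary by implementation.
    for node in reversed(path):
        node.visit_count += 1
        node.value_sum += value
        value = -value
```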
AlphaGo Lee included an extra grounding step: it averaged the value-network estimate with a real rollout in which the policy played the game to Tromp-Taylor completion. Jang explained the motivation: in late-game positions, an actual playout can anchor the value estimate to reality. But he also said subsequent systems removed this playout, and his implementation did as well, because eliminating it speeds up training considerably.
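The AlphaGo Lee blend is a simple interpolation between the two estimates; the sketch below treats the mixing weight as a parameter (the original paper weighted the two roughly equally):

```python
def leaf_value(value_net_estimate: float, rollout_outcome: float, lam: float = 0.5) -> float:
    """AlphaGo Lee-style leaf evaluation: blend the value network's estimate with
    the outcome of a fast rollout played to Tromp-Taylor completion. Later systems
    (and Jang's implementation) drop the rollout term entirely."""
    return (1 - lam) * value_net_estimate + lam * rollout_outcome
```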
The tree produced by MCTS is sparse and unbalanced. It is not an exhaustive tree like tic-tac-toe’s full search space. Many branches terminate early because their value appears too low. A smaller number of branches receive many visits. The final visit-count distribution at the root becomes the improved policy: not just the move selected, but the full distribution over moves after search.
For each move in a self-play game, the system can store the board state, the MCTS visit-count distribution from that state, and, once the game ends, the final outcome. The state plus final outcome trains the value head. The state plus visit-count distribution trains the policy head. The actual move matters for producing the trajectory, but the important label is the improved distribution discovered by search.
On the next turn, the process starts again from the new board state, with a new root. The search tree itself need not be the permanent artifact. The permanent artifact is the per-move training data generated by the search.
The network learns to replace its own search
The network architecture Jang described is conceptually simple. The board is encoded as channels, analogous to an RGB image: one channel for black stones, one for white stones, and perhaps one for empties or masks. A shared trunk—Jang used a ResNet in his own work—branches into two heads. The value head outputs a single logit estimating win probability. The policy head outputs a vector over legal board positions, such as 361 entries for a 19-by-19 board.
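A minimal two-headed network in that spirit, using a small convolutional trunk rather than a full ResNet; the channel counts and layer sizes are arbitrary illustrations, not Jang's architecture:

```python
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared trunk with a policy head and a value head for an N x N board."""
    def __init__(self, board_size: int = 19, in_channels: int = 3, width: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_channels, width, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Policy head: one logit per intersection (pass move omitted for brevity).
        self.policy_head = nn.Sequential(
            nn.Conv2d(width, 2, kernel_size=1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board_size * board_size, board_size * board_size),
        )
        # Value head: a single win-probability logit.
        self.value_head = nn.Sequential(
            nn.Conv2d(width, 1, kernel_size=1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board_size * board_size, 1),
        )

    def forward(self, x):   # x: (batch, in_channels, board_size, board_size)
        h = self.trunk(x)
        return self.policy_head(h), self.value_head(h)
```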
Jang emphasized that the exact architecture is not the main point. Transformers work; ResNets work. In small-data, low-budget regimes, his experience was that ResNets performed better, but he did not present that as a settled general result. He attributed the advantage he saw to the inductive bias of local convolutions: Go contains local spatial patterns, and convolutional networks encode local structure naturally. Transformers can aggregate global context more directly, but require more data to learn local invariances.
KataGo, he noted, found it useful to aggregate global features throughout the network so information from one side of the board could influence value judgments elsewhere. On a large Go board, local fights may be spatially separated but strategically connected. A convolutional network’s receptive field must grow through layers to connect them; attention can do that more directly. Still, Jang said he tried hard to make Transformers outperform ResNets for this project and had not succeeded so far.
The original AlphaGo Lee used supervised learning from expert human games to initialize the policy. Jang recommended the same style of warm start for practitioners. His general research advice was to initialize a project “as close to success as possible”: start with something that works, then improve it, rather than beginning with a system that fails everywhere.
Training from expert games is straightforward. For the policy, train the model to predict expert moves, especially moves from winning games. For the value, train it to predict the eventual winner from each board state. Early board states should converge toward roughly 0.5 win probability, because many games can branch from them into either outcome. As the game progresses, value predictions become more decisive.
Even this supervised policy can be strong. Jang said a raw policy network trained on expert data—choosing the argmax move without search—would likely beat most human players. This is already striking in his telling: a small network, in his example perhaps around ten layers and under roughly three million parameters, can play Go well by “shooting from the hip.” But MCTS makes it substantially stronger.
At test time, search improves the raw policy. During training, the system then distills that improvement back into the network. Jang drew this as a test-time scaling curve: more simulations monotonically improve performance, and once the network learns to imitate the result of many simulations, it begins the next round of search from a better point. Search labor gets amortized into the model.
Patel asked whether the shape of this scaling curve was sigmoid-like and whether gains diminish similarly after distillation. Jang cautioned that he did not know the precise test-time scaling behavior of MCTS simulations and was only drawing a monotonic curve. The claim was not about the exact curve shape, but about the direction: more simulations improve the policy, and distilling search into the policy shifts the starting point upward.
Search is only as good as the values it backs up
MCTS is not an unconditional policy-improvement guarantee. Jang treated it as a heuristic that works in practice when the value estimates are good enough. If the value function is wrong, search can make the policy worse.
He gave a concrete failure mode. Suppose a policy’s move recommendations are good, but the value network has poor late-game estimates because the self-play data contains too many resignations and too few fully resolved Tromp-Taylor endings. The system may forget how to evaluate positions humans would normally resign from. If the leaf values are wrong, those errors propagate upward through backup and distort the search distribution. Low simulation counts can also introduce variance: the search may not explore enough to overcome a bad early estimate.
This is why grounding matters. Jang suspected AlphaGo Lee’s real playouts helped anchor the value estimates. In modern implementations, another approach is to force some fraction of games—he suggested 10% as an example—to play all the way to resolution instead of allowing resignation, so the replay buffer contains terminal examples for positions that would otherwise be absent.
Cold-starting AlphaZero-style training is therefore delicate. Early in training, the policy may be useless; what matters first is learning a value function that can predict winners from states. Patel summarized this as the early epochs functioning mainly as value training: play full games, label previous states by the eventual winner, and train the value head until search has something meaningful to back up. Jang agreed.
Jang’s own practical trick, which he explicitly described as not peer-reviewed, was to ensure the value function is good before spending many cycles on MCTS. Expert human games can do this. So can games generated by an open-source Go bot playing itself. On smaller boards, even random play can generate useful terminal data. He suggested that 50,000 random games on a 9-by-9 board can teach a decent value function because common local patterns appear frequently; with an architecture that can train across board sizes, some of that value knowledge can transfer to 19-by-19.
The beginning of a Go game is easy for the value function: it should be near 0.5. The end is also relatively easy: there is less uncertainty, and the board is closer to mechanical resolution. The hard part is the midgame. That is where a value function must learn to recognize “a healthy board state versus a not healthy board state.”
Jang also explained why later AlphaGo-family systems use a shared trunk with policy and value heads, while AlphaGo Lee used separate networks. A shared representation should save compute because the two heads are learning related structure. Policy and value should be consistent: it would be strange for the policy to strongly recommend an action that the value head regards as bad. But rigorously quantifying the compute savings, he said, would require real experimental work.
The exact future can be chaotic while the useful value remains predictable
AlphaGo’s value function led Jang to a broader point about learned simulation. The exact future board state in Go can be highly sensitive to a single stone. Predicting the precise board 100 moves ahead is analogous to predicting the exact microscopic state of a chaotic system. But predicting who will win is a macroscopic question. It averages over many possible futures.
Patel raised weather as a possible counterexample to the idea that neural networks can compress arbitrary simulation: weather is chaotic, and small perturbations affect predictions over time. Jang accepted the analogy but distinguished between predicting an exact state and predicting a useful aggregate. In weather, one may not care about the exact wind velocity at a particular latitude, longitude, and altitude; one may care where the hurricane is. Similarly, in Go, one may not know the exact future board but still estimate the win probability.
He compared this to a Lorenz attractor: the exact trajectory may be unpredictable, but the global structure is visible. Some chaotic systems have macroscopic patterns that can be learned. Patel contrasted this with a hash function, where sensitivity to initial conditions is deliberately designed not to expose useful macrostructure. Jang agreed intuitively, while cautioning that cryptography and computational complexity were outside his expertise.
The point for AlphaGo is that value functions do not have to predict every detail. They need to predict the quantity that matters for decision-making. That is why a network can substitute for much of a game-tree rollout. It is not predicting the exact playout; it is predicting the expected outcome under plausible continuations.
Jang connected this to other domains, such as protein folding or weather, where a worst-case simulation may seem intractable but a learned model may capture the macroscopic quantities that matter. His point was exploratory rather than a proof: AlphaGo made vivid that a neural network can compress an enormous amount of search-like computation into a small number of forward-pass operations.
Naive RL learns from winning trajectories; AlphaGo learns from improved moves
The most consequential comparison was between AlphaGo’s per-move relabeling and the kind of trajectory-level reinforcement learning often used for language models.
A naive self-play algorithm would run games between two roughly equal policies, reinforce all actions from winning games, and ignore or downweight losing games. Suppose policy A and policy B are evenly matched, with true win rates around 50%. Across 100 games of roughly 300 moves each, A might win 51 and B might win 49. If the difference came from one critical exploratory move in one game, then there is only one truly useful supervision signal among tens of thousands of actions. The rest are neutral: imitating them gives you essentially the same policy as before.
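In Jang’s illustration, the useful-signal fraction works out to:

$$\frac{\text{decisive moves}}{\text{total actions}} \approx \frac{1}{100 \times 300} = \frac{1}{30{,}000}$$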
That is the credit-assignment problem. Winning tells you something happened, but not which move caused it. Advantage estimation, baselines, TD learning, and related RL machinery all try to subtract away the average and isolate which actions were better than expected. But this requires a good estimate of value or average performance from a state.
MCTS changes the problem. It does not simply say “this trajectory won, do more of it.” It locally searches at each state and produces a better distribution over moves. Even in a game the agent eventually loses, MCTS can still give a better action target at each move. Jang compared this to DAgger in imitation learning and robotics: even if a self-driving car has drifted off the road, there is still a corrective expert action that can bring it back.
Patel connected this to LLM policy-gradient training, citing Andrej Karpathy’s phrase “sucking supervision through a straw.” In an LLM setting, a model may generate a long chain of tokens and receive a reward only at the end: a unit test passes, a math answer is correct, or a task succeeds. The reward must then be assigned back across many tokens, most of which may have had little causal relationship to success.
Jang explained that current LLM RL often treats an entire generated sequence as one action: the log probability of the sequence is the sum of token log probabilities, but the reward is assigned at the sequence level. If one instead decomposes the trajectory into many actions with rewards at each step, interaction effects between rewards and log-prob terms can increase variance. The question becomes how to ascribe credit to every action in the episode.
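The two framings can be contrasted in a toy policy-gradient sketch, assuming a tensor `logp` of per-token log probabilities for one generated sequence; this is vanilla REINFORCE-style code, not any lab's training objective:

```python
import torch

def sequence_level_loss(logp: torch.Tensor, reward: float) -> torch.Tensor:
    """Whole generation as one action: the sequence log-prob is the sum of token
    log-probs, and a single terminal reward scales all of it."""
    return -reward * logp.sum()

def per_step_loss(logp: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Per-token actions, each weighted by the reward-to-go from that step onward.
    Finer-grained credit, but higher variance unless per-step rewards or
    advantages are well estimated."""
    reward_to_go = torch.flip(torch.cumsum(torch.flip(rewards, dims=[0]), dim=0), dims=[0])
    return -(reward_to_go * logp).sum()
```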
Patel then offered an information-efficiency framing. In supervised learning, every labeled token gives a dense signal. If the prompt is “the sky is” and the correct token is “blue,” cross-entropy tells the model exactly how far its distribution is from the target. In naive RL, an untrained model samples wrong tokens—“holycon,” “toll,” and so on—until it happens to sample “blue.” With a vocabulary around 100,000 tokens, the chance of stumbling on the correct token is tiny, and most samples communicate little beyond “not that.”
He framed learning efficiency as bits of supervision extracted per unit of compute:

$$\text{learning efficiency} \;\approx\; \frac{\text{bits}}{\text{sample}} \times \frac{\text{samples}}{\text{FLOP}}$$
Long-horizon RL reduces samples per FLOP because each sample requires a long rollout. Naive RL also reduces bits per sample because most failed attempts are low-information, especially at low pass rates. Jang added that the sampling distribution is the policy’s own distribution: if the policy has no chance of sampling the right answer, it may never receive a useful signal.
Jang then connected this to distillation. A one-hot label contains less information than a soft target distribution. If a teacher provides logits or a full probability distribution, the entropy of the label can be much higher. That is why distillation is sample-efficient. AlphaGo exploits the same principle: the policy is trained not merely to imitate the move MCTS selected, but the full MCTS distribution. Jang suggested that one could experimentally test the importance of this “dark knowledge” by training on the selected MCTS action instead of the soft distribution.
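The contrast is visible in the loss itself: a one-hot target only constrains the probability of the selected move, while the full MCTS distribution constrains every move the search considered. A sketch, with hypothetical names:

```python
import torch.nn.functional as F

def one_hot_loss(student_logits, selected_move):
    # Only the single chosen move carries gradient signal.
    return F.cross_entropy(student_logits, selected_move)

def soft_target_loss(student_logits, mcts_distribution):
    # Every move with nonzero visit probability constrains the student,
    # which is why distilling the full search distribution carries more bits.
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    mcts_distribution, reduction="batchmean")
```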
The advantage of AlphaGo, in Jang’s framing, is that it never needs to start from a zero-percent success regime and solve exploration from scratch. Every iteration provides cleaner, more stable supervised targets: value classification and policy KL minimization. The system is always trying to improve the policy relative to its current state, rather than relying on rare successful trajectories to break a flat reward landscape.
Language models lack the conditions that make Go search easy
AlphaGo-style MCTS does not automatically apply to LLM reasoning. Jang did not treat this as impossible; he said the jury is still out, and he expects forward search and simulation to reappear in some form. But Go has properties that make MCTS unusually well matched.
First, value estimation is concrete. A Go position can be resolved, eventually, into a win or loss under clear rules. A value network can be trained against those outcomes and used to truncate search. Second, the action space is discrete and bounded. Even though 361 legal moves is large, it is manageable enough for PUCT-style action selection. Third, the game state is fully observable and deterministic.
Language does not share those conveniences. Although LLMs emit discrete tokens, the effective action space is extremely broad. Jang pointed specifically to the PUCT exploration term:

$$U(s,a) = c_{\text{puct}} \, P(s,a) \, \frac{\sqrt{N(s)}}{1 + N(s,a)}$$
In an LLM reasoning tree, one may almost never sample the same child twice because the space of possible continuations is so large. A heuristic designed to revisit and compare discrete Go moves may not guide language search well. It may also be too locally greedy, producing obvious correct thoughts without solving the final problem.
Patel suggested that LLMs may already learn an implicit version of MCTS: they try an approach, notice it fails, back up, and try another direction. Jang agreed that LLMs can do something resembling human reasoning without an explicit tree. But he still saw value in forward simulation: thinking along multiple parallel tracks may reveal which paths are valuable, even if the final mechanism is not AlphaGo’s exact MCTS.
The domain matters. Mathematics may be more tree-like because proof search has a rigid logical structure. A business negotiation may be less tree-like and less amenable to discrete branch evaluation. Robotics and continuous control also complicate MCTS, though Jang noted that many researchers have explored MCTS-like successors such as MuZero in higher-dimensional settings.
The contrast with Go clarifies why AlphaGo’s learning signal is special. In Go, search can locally improve the next move without solving the entire problem from scratch, because the value function provides a reliable estimate at the frontier. In language-model RL, one often does not know whether a local step was good until the full problem is solved.
Off-policy data helps when it teaches recovery, not when it teaches irrelevant states
AlphaGo can use a replay buffer, but not because off-policy training is harmless. Jang framed the issue through distribution mismatch. MCTS relabels states with improved actions. If the replay buffer contains states the current policy would never visit, training on them can waste capacity or actively harm the policy. In the extreme, a model trained entirely on unreachable states learns to act well in places it will never be, while failing on its actual state distribution.
But off-policy data can also be useful. In a DAgger-like view, the ideal data distribution contains the states on the optimal trajectory plus a “tube” of nearby states the agent might drift into. For those nearby states, labels that funnel the policy back toward the optimal trajectory are valuable. Jang used robotics examples: wind may push a vehicle off course, or tire friction may differ, and the policy needs recovery behavior. In games, the opponent is always perturbing the trajectory.
| Replay-buffer state | Effect in Jang’s framing |
|---|---|
| States on the current or optimal trajectory | Directly useful training data |
| Nearby off-trajectory states | Useful if labels teach recovery back toward good trajectories |
| States the current policy would never reach | Potentially wasted capacity or harmful distribution mismatch |
So off-policy training is not categorically bad. It is bad when the data is far from the policy’s reachable distribution. It is helpful when it covers plausible deviations and teaches correction.
Jang described an experiment he tried: instead of playing full games and running MCTS at each move, he sampled random states from a dataset and reran MCTS on those states using the current network. This allowed more parallelism and better GPU utilization, because board states at different depths could be relabeled independently. It worked “moderately” well, but he considered it too complex to open-source and noted the same distribution issue: if the current model relabels states it would never reach, capacity is wasted.
He compared this to off-policy robotic learning systems. A replay buffer stores transition tuples $(s, a, r, s')$. A Bellman updater computes targets such as:

$$y = r + \gamma \max_{a'} Q(s', a')$$
The trainer then fits the network’s current $Q(s,a)$ to the target $y$. Patel called this “daydreaming”: the system revisits old experience and asks what could have been done better. Jang agreed. In Go, an MCTS relabeler can play a similar role: take an old state, ignore the action that was actually taken, run current MCTS, and produce a new policy label.
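Side by side, the Bellman updater and the MCTS relabeler play the same role of refreshing targets from stored states; `target_q` and `run_mcts` below are hypothetical stand-ins:

```python
import torch

def bellman_target(reward, next_state, target_q, gamma=0.99, done=False):
    """Off-policy value target: y = r + gamma * max_a' Q(s', a')."""
    if done:
        return torch.as_tensor(reward)
    return reward + gamma * target_q(next_state).max()

def mcts_relabel(state, run_mcts, num_simulations=256):
    """The Go analog: revisit a stored state, ignore the action originally taken,
    and let the current network's search produce a fresh policy label."""
    return run_mcts(state, num_simulations)   # improved visit-count distribution
```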
The reason many RL systems converge toward more on-policy setups, Jang said, is stability. Off-policy estimates may still be used to shape rewards or compute advantages, but directly training objectives from stale or mismatched data can blow up more easily.
Scaling laws help after the artifact works
Dwarkesh Patel brought up Andy Jones’s scaling-laws work on board games, which showed tradeoffs between train-time compute and test-time compute. In such settings, more MCTS at inference can substitute for more training compute, at least to some extent. Patel described this as an early anticipation of inference-time scaling in LLMs.
The chart shown on screen plotted test-time compute against train-time compute, both in FLOP-seconds, with dotted contours marking the minimum train-test compute needed to reach Elo levels from -1500 to -250 on a 9-by-9 board, and a fitted relation overlaid. The editorial point of the chart was the same one Jang emphasized verbally: inference compute and training compute can trade against each other, and board games provide a controlled setting for studying that tradeoff.
Jang agreed that the interaction among test-time search, model size, and training compute is profound. One question is how much reasoning must remain explicit search and how much can be packed into the forward pass. Another is how scaling behavior changes as board size grows, since Go can range from small boards to arbitrarily large ones.
Jang initially hoped to use scaling-law thinking to build a compute-optimal Go bot with fewer KataGo-style tricks. He was not successful in that form. His lesson was methodological: scaling laws are most useful once the recipe already works, the data is good, and the system is bug-free. If the system is broken or the data is bad, the apparent scaling curves may simply describe the wrong artifact.
He said he made this mistake early when bugs in MCTS labeling led him to collect expert-policy data and study supervised-learning scaling on it. The plots could look scaling-law-like, but they were not necessarily telling him how to build a stronger self-play system. Scientific understanding often follows engineering success: get the artifact working, then use scaling laws to understand and extrapolate.
Patel also asked whether DeepMind had simply done a poor job training AlphaGo Zero, given that Jang could now attempt related Go-bot work with around $10,000 in donated compute from Prime Intellect. Jang rejected that interpretation. His own work used bootstraps that AlphaZero did not: best-response training against KataGo models, modern open-source baselines, modern hardware, and modern LLM coding assistance. He was explicit that being first is much more expensive than catching up. Once a system exists, later researchers can use distillation, strong opponents, better baselines, and many “crutches” to bootstrap progress.
Jang’s own bot used best-response training against KataGo models to reach strong performance. AlphaZero, by design, did not train against an existing strong policy; it was tabula rasa. First attempts also optimize for time-to-result and capability demonstration, not necessarily compute optimality. Jang said this pattern appears in robotics too: early frontier models are often trained to make a capability appear, not to sit on a clean compute-optimal Pareto frontier.
Modern hardware and tooling also change which tricks matter. KataGo was trained on V100s; Jang said one can now train on fewer modern desktop-class GPUs. Some auxiliary supervision objectives become unnecessary if the model starts from a strong initialization. Infrastructure can be simplified: instead of a distributed asynchronous RL system with replay buffers, pushers, and collectors, one can sometimes run a simpler synchronous loop that collects, trains, and collects again.
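A synchronous loop of that kind can be sketched in a few lines; `self_play_game` and `train_on_examples` are placeholders, not autogo's API:

```python
def synchronous_training_loop(net, num_iterations: int, games_per_iter: int = 128):
    """Collect self-play games with the current network, train on them, repeat.
    No asynchronous collectors, pushers, or long-lived replay infrastructure."""
    for _ in range(num_iterations):
        # 1. Collect: each move stores (state, MCTS visit distribution); each
        #    state additionally gets the game's final outcome once it ends.
        games = [self_play_game(net) for _ in range(games_per_iter)]
        examples = [example for game in games for example in game]

        # 2. Train: fit the policy head to search distributions and the value
        #    head to outcomes (see the loss sketch earlier), then loop again.
        train_on_examples(net, examples)
```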
Some tricks still help. Jang found 9-by-9 board training useful for resolving endgame value functions and transferring to larger boards. But other details, such as varying the number of simulations between episodes, seemed less sensitive in his experiments. His broader view was that many algorithmic compute multipliers are transitory: as hardware improves or initialization changes, tricks that once mattered may become redundant. He presented that as a preliminary judgment from his own experiments, not as a peer-reviewed result.
Automated research already executes experiments; taste remains the bottleneck
Eric Jang used LLM coding assistants throughout the AlphaGo project, and his observations split cleanly into what the systems already did well and what they did not.
The useful part was execution. Jang said he mostly used Opus 4.6 and 4.7. The models were effective at implementing experiments and doing open-ended hyperparameter optimization. Instead of searching a fixed grid over learning rates, weight decay, and layer counts, the model could inspect gradients, alter layers, modify data loaders, add augmentations, and generally “grind a performance metric” in a more grad-student-like way.
He built a Claude skill called “experiment.” He could specify a question, an x-axis, and a y-axis; the system would run experiments, compile plots, write a report, and suggest possible explanations. For fixed datasets and fixed time budgets, he said, these systems can squeeze out substantial performance on classification-style problems such as Go or LLM tasks.
The unsolved part was research taste. Current public models, in Jang’s account, were not good at choosing the next experiment in a research track, nor at stepping back and deciding that the track itself was misguided. He described research as a tree of experiments: nodes may be failed, successful, or mixed, and each leads to follow-on experiments. A researcher may spend time down a rabbit hole—such as off-policy MCTS relabeling—then decide it is not worth continuing and jump laterally to a different line of attack. The models could help answer questions once prompted, but Jang often had to identify infrastructure bugs himself or ask the right diagnostic question.
This distinction matters for automated science. Execution is improving quickly. But the ability to ask whether the current objective is the right one, whether a failure is a bug or a bad idea, and whether to keep pushing or abandon a direction remains much less automated.
Dwarkesh Patel asked about local verifiability. Some research ideas look bad for a long time before paying off. Deep learning itself required faith through decades of skepticism. Patel said Ilya Sutskever has described research ability partly as knowing when an idea is right and failures are due to bugs rather than the idea itself. Jang agreed that this creates a hard long-horizon RL problem: many intermediate signals may tell you the idea is bad, but the final result may vindicate it.
Jang suggested that Go could be a useful environment for training automated researchers because the outer loop is quickly verifiable while the inner loop still contains real research engineering. A Go project can ask whether an agent plays better, predicts scaling behavior, or improves a compute multiplier. The inner work includes distributed systems, training dynamics, architecture choices, and predictions about whether a modification will work. Skills learned there might transfer to harder domains.
That transfer claim was explicitly speculative. Jang’s non-rigorous analogy was DeepMind: the lab began with games, and its researchers plausibly transferred skills from Atari, Go, and StarCraft into later LLM work—coding, research judgment, project management, and system-building. If humans can transfer from quick-to-verify environments to harder domains, automated researchers may be able to do the same. Patel pushed back that an RL-heavy heritage may also hinder LLM work, and Jang conceded the jury is still out.
The stackability of improvements is another unresolved problem. Patel said he had heard of labs where individually promising ideas failed to combine cleanly; training runs can fail because two locally good modifications interact badly. Jang attributed this partly to redundancy among compute multipliers. Many tricks buy correlated benefits. As hardware improves, some stop mattering or stack less well. Research taste includes knowing how much the bitter lesson can buy at the present moment and where heuristics are still needed because compute, parameters, and initialization are finite.
The outer loop is easier in Go than in general AI, but not trivial even there. Win rate against a strong bot is a clear metric. Discovering a new phenomenon, such as a useful scaling law, is harder to reward automatically. For general AI self-improvement, benchmarks measure only part of what matters: the broader target is economically useful work, which is harder to verify.
Jang pointed readers to his website, evjang.com, an interactive tutorial linked from his blog, and the autogo repository under his GitHub username ericjang. Patel also recommended Jang’s blog post “As Rocks May Think,” which argues more broadly about what happens when thinking becomes a primitive in computer science. Jang’s final point was not that LLMs should literally put trees inside their reasoning, but that Go, MCTS, and modern reasoning systems have an underexplored duality that can still be studied on small budgets.




