In 2016 a machine beat Lee Sedol at Go and it felt like the future arriving. Ten years on, Eric Jang rebuilt that machine over a weekend for the price of a used laptop — and in doing so laid bare the three primitives of intelligence: search, learning from experience, and self-play. This is the walk-through, built up from what a neural network actually is.
Source: Dwarkesh Patel × Eric Jang · Format: 3-hr chalkboard interview · Built: ~$7K rented compute · Read for: zero RL background
00
Why step backwards to go forwards
The version you watched, AlphaGo against Lee Sedol in March 2016, is the one the source still calls a watershed (⌞jang⌟ "When I saw the early breakthroughs on AlphaGo in 2014, 2015, 2016 and so forth, it was profound to see how smart AI systems could become and the computational complexity class they could tackle with deep learning."). A problem long believed intractable for any computer fell to deep learning, and it did so in a way nobody fully expected. Jang's bet is that this old machine still teaches the future better than today's chatbots do.
His framing is worth holding onto for the whole read. AlphaGo is the cleanest worked example of the primitives of intelligence: search (reading ahead), learning from experience (training on what happened), and self-play (getting better by playing yourself) (⌞jang⌟ "AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn."). Every modern system, whether the LLM you talk to or a robot learning to walk, is some recombination of these three. Go just shows them in their purest form.
And the punchline that frames the second half of the piece: the way AlphaGo learns is almost the opposite of how a large language model learns from reinforcement — and arguably much closer to how you learn. Getting to why that's true is the journey. But we can't talk about networks beating Go until we're honest about what a neural network even is. So we start there.
· · ·
01
What a neural network actually is
You already have the concept: a thing that learns patterns from data. Let's swap the concept for the machinery, because the rest of the piece leans on it.
The machinery — a network in three plain ideas
1. It's a giant tunable function. A neural network is just a mathematical function: numbers go in, numbers come out. What makes it special is that it has millions of little dials inside it — called weights or parameters — and turning those dials changes what the function computes. AlphaGo's network had under 3 million of these dials.
2. Running it once is a "forward pass." You feed the input (a Go board) in one end, it flows through a stack of layers — each layer a simple transformation — and an answer comes out the other end. Ten layers means ten transformations in sequence. That single front-to-back run is one forward pass. It's fast: milliseconds.
3. "Training" is just nudging the dials. You show it an example, compare its answer to the right answer (the gap is the loss), and nudge every dial a hair in the direction that would have made the gap smaller. Do that across millions of examples and the dials settle into a configuration that, as Jang puts it, holds crystallized knowledge and experience. That nudging procedure is gradient descent; you don't need the calculus, just the picture of millions of dials slowly settling.
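The three ideas above can be sketched in a few lines. This is a toy with a single dial rather than millions, and the data, the learning rate and the function names are all assumptions made for illustration, not anything from the interview.

```python
# Toy "network": one dial (weight) w, and the prediction is w * x.
# Hypothetical data generated by the rule y = 3x, so the ideal dial setting is 3.0.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

def forward(w, x):
    """One forward pass: input in, answer out."""
    return w * x

def loss(w):
    """Mean squared gap between the network's answers and the right answers."""
    return sum((forward(w, x) - y) ** 2 for x, y in examples) / len(examples)

def grad(w):
    """Slope of the loss with respect to the dial (derivative of the mean squared gap)."""
    return sum(2 * (forward(w, x) - y) * x for x, y in examples) / len(examples)

w = 0.0                 # start with the dial in an arbitrary position
learning_rate = 0.05    # size of each nudge (an assumed value)
for step in range(100):
    w -= learning_rate * grad(w)   # nudge the dial in the direction that shrinks the gap

print(round(w, 3), round(loss(w), 6))
```

A real network repeats exactly this nudge for every dial at once; the only thing that changes is how the slope is computed.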
One design choice matters later. AlphaGo's network is a ResNet, a style built from convolutions, which look at small local patches of the board rather than the whole thing at once. Jang found ResNets beat the trendier transformers here, because Go fights are spatially local (what matters is usually near the stones in question), and that local bias is baked into convolutions for free ⌞jang⌟. The lesson generalises: in a low-data regime, a network that already "assumes" the right shape of the problem learns faster than a blank-slate one.
That's the whole toolkit. A board goes in; a few million dials, arranged in ten local-pattern layers, transform it; an answer comes out. Now — what answer do we want it to give?
· · ·
02
The two impossible problems — and the two networks
Go is hard for a brutally simple reason. The board is 19×19, so on the first move there are 361 places to play; the game runs 250–300 moves. Follow every branch and you get roughly 361^300 possible games, far more than there are atoms in the universe (⌞jang⌟ "This is something on the order of 361^300, which is far more than the number of atoms in the universe."). No computer can search that. Two impossibilities sit inside that number, and AlphaGo defeats each with one network.
361
moves available on the opening board (19×19 intersections)
~300
moves in a typical game — the depth problem
361^300
possible games — more than atoms in the universe
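To get a feel for the size of 361^300, it helps to count its digits. The sketch below works in logarithms; the figure of roughly 10^80 atoms in the observable universe is a commonly quoted estimate brought in here as an assumption, not a number from the interview.

```python
import math

BOARD_POINTS = 19 * 19     # 361 intersections
GAME_LENGTH = 300          # upper end of the 250-300 move range in the text

# log10(361^300) = 300 * log10(361); logs keep the number readable
log10_games = GAME_LENGTH * math.log10(BOARD_POINTS)

# Rough, commonly quoted estimate for atoms in the observable universe: ~10^80
LOG10_ATOMS = 80

print(f"361^300 is about 10^{log10_games:.0f}")
print(f"that is about 10^{log10_games - LOG10_ATOMS:.0f} times the atom estimate")
```

So the count of games has several hundred more digits than the count of atoms, which is why "far more" undersells it.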
The breadth problem → the policy network
Too many moves to consider at each turn. The fix: a policy network that looks at the board and outputs a hunch — a probability over all 361 points saying "these few are worth considering." It narrows 361 down to a handful, the way a strong human barely notices most of the board.
The depth problem → the value network
Games are too long to play out to the end for every branch. The fix is the move Jang calls the most profound in all of AlphaGo: a value network that looks at a board mid-game and outputs a single number — the probability of winning from here — without playing the rest of the game at all.
The analogy that makes the value network click
Think about a strong Go player glancing at a board and instantly muttering "I'm probably going to lose." They didn't simulate the remaining 150 moves. They just know. Jang's framing: that human is running a value function in their head, a network that amortizes a huge number of possible playouts into a single glance (⌞jang⌟ "The human glances at the board and knows, 'I'm probably going to lose.' They're essentially running a neural network that looks at a board and implicitly amortizes a huge number of possible game playouts."). The value network is that intuition, made of silicon.
Figure · one trunk, two heads
board → ResNet trunk
→
Policy head → 361 numbers
"Which moves are worth a look?" — a hunch over the board. Solves breadth.
Value head → 1 number
"What's my win probability from here?" — no playout needed. Solves depth.
Both heads share one trunk reading a 3-channel board (black / white / empty). The original Lee Sedol version used two separate networks; every version since merges them into one trunk with two heads.
Here is why Jang calls it profound, and it's the single deepest idea in the piece. A ten-layer network — ten steps of computation — collapses a search problem with more branches than the universe has atoms into one forward pass.
A 10-layer neural network pass… is able to amortize and approximate to very high fidelity a nearly intractable search problem.
⌞jang⌟ — the same phenomenon underwriting AlphaFold (protein folding) and AlphaTensor
This isn't unique to Go. The same trick, a smallish network amortizing an enormous simulation into a glance, is what powers AlphaFold folding proteins and AlphaTensor discovering algorithms. Jang admits it genuinely unsettles his sense of what's computationally hard: what felt like an impossibly hard problem keeps falling to a surprisingly simple, "macroscopic" solution (⌞jang⌟ "It actually makes me wonder if our understanding of problems like P=NP, or these fundamental computational hardness problems, is incomplete."). Not a proof of anything, but a real crack in the intuition that hard problems must stay hard.
· · ·
03
Search — how Monte Carlo Tree Search actually turns
The policy network's snap hunch already plays a shockingly strong game. Jang notes you can just take its top suggestion and "shoot from the hip" for a formidable opponent (⌞jang⌟ "it'll be a very fast Go player that doesn't think in terms of reasoning steps. It just shoots from the hip, and it'll be a very strong Go player."). But to reach superhuman, you add thinking, and thinking, here, means search. The search procedure is Monte Carlo Tree Search (MCTS), and it runs fresh on every single move.
"Monte Carlo" just means using guided random sampling instead of exhaustive enumeration — you can't read every branch, so you sample the promising ones cleverly and average what you learn. MCTS builds a little tree of possible futures, four steps at a time:
Figure · one MCTS simulation, four steps
1 SELECT
Walk down
From the current board, follow the most promising line deeper into the tree, using the PUCT rule (below) to decide where to go.
2 EXPAND
Grow a leaf
Hit a board you've never seen? Add it to the tree and ask the policy head for its hunch about the moves from there.
3 EVALUATE
Score it
Ask the value head for the win probability of that new board — a single forward pass, no playing to the end.
4 BACKUP
Propagate
Carry that score back up the line you walked, updating the running average for every move along the way.
↻ repeat hundreds to thousands of times per move · then play the move that got visited most
Run that loop a few hundred times and the tree has quietly concentrated its attention on the strongest lines. The move chosen isn't the highest-scoring one — it's the most-visited one, because a move the search kept returning to is a move it trusts. The original Lee Sedol machine ran tens of thousands of these simulations per move on a room of TPUs; a modern trained bot needs far fewer, because, as we'll see, the network has absorbed the search into itself.
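The four steps can be sketched on a game small enough to check by hand. Everything here is a stand-in chosen for illustration: the game is a one-pile Nim (take 1 to 3 stones, whoever takes the last stone wins) rather than Go, the "policy head" is a uniform prior, the "value head" returns a neutral 0.0 for unfinished positions, and the names `Node`, `simulate`, `best_move` and the constant `C_PUCT` are hypothetical.

```python
import math

C_PUCT = 1.5  # exploration constant (an assumed value, not from the interview)

class Node:
    def __init__(self, prior):
        self.prior = prior      # hunch for the move leading here (policy-head stand-in)
        self.visits = 0         # how many simulations passed through this node
        self.value_sum = 0.0    # summed results, seen by the player who moved into this node
        self.children = {}      # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct(parent, child):
    explore = C_PUCT * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + explore

def simulate(root, stones):
    node, path = root, [root]
    # 1 SELECT: walk down while the current node has already been expanded
    while node.children:
        parent = node
        move, node = max(parent.children.items(), key=lambda kv: puct(parent, kv[1]))
        stones -= move
        path.append(node)
    # 2 EXPAND and 3 EVALUATE (value is from the view of the player to move)
    if stones == 0:
        value = -1.0            # the previous player took the last stone, so the mover has lost
    else:
        moves = [m for m in (1, 2, 3) if m <= stones]
        for m in moves:
            node.children[m] = Node(prior=1.0 / len(moves))  # uniform policy stand-in
        value = 0.0             # neutral value-head stand-in
    # 4 BACKUP: flip the sign at each step, because the players alternate
    for n in reversed(path):
        value = -value
        n.visits += 1
        n.value_sum += value

def best_move(stones, simulations=800):
    root = Node(prior=1.0)
    for _ in range(simulations):
        simulate(root, stones)
    # play the most-visited move, not the highest-scoring one
    return max(root.children, key=lambda m: root.children[m].visits)

print(best_move(5))
```

From five stones the winning idea is to take one and leave the opponent a multiple of four. With a real policy and value network in place of the two stand-ins, the same loop needs far fewer simulations to find it.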
The one rule worth seeing — PUCT
Step 1 hides the only formula in this piece, and it's worth meeting because its three parts are the intuition of search: balance exploiting what looks good against exploring what you haven't tried.
Figure · what "pick the best branch" actually weighs
Exploit
How good it looks so far
The running average result (Q) from every simulation that has gone through this move. High when the move keeps leading to wins.
+
Explore
How much it deserves a fresh look
Large when the policy's hunch rated this move highly but the search has barely tried it. Shrinks every time the move gets visited — so attention spreads instead of fixating.
MCTS walks toward whichever move maximises exploit + explore. The policy network seeds the explore term with a prior; the value network feeds the exploit term through backup. The two networks are the search's eyes.
That interplay is the whole engine: the policy network proposes, the value network judges, and PUCT arbitrates between trusting a good track and checking a neglected one. Crucially, the tree is thrown away after each move — only one thing is kept, and that one thing is what makes AlphaGo learn.
· · ·
04
Self-play — the improvement operator
Here is where the three primitives lock together, and where your RL nouns — agent, state, action, reward, policy, value — finally become a loop. The thing MCTS keeps from each move is its final visit-count distribution: a sharper, search-improved opinion about what to play than the policy's raw hunch was. That improved opinion is the training signal.
Figure · the self-play training loop
PLAY
Play yourself
The agent plays a full game against a copy of itself, running MCTS on every move. Record, for each move: the board (state), the search's distribution (improved action), and eventually who won.
LABEL
Relabel every move
Each move now has a strictly better target than what was actually played — the search distribution. And each board gets labelled with the final game outcome (won / lost).
TRAIN
Nudge both heads
Train the policy head to predict the search distribution; train the value head to predict the outcome. Then play again with the improved network. ↻
No "reward signal" in the textbook RL sense — no TD error, no dynamic programming. Jang's striking claim: it reduces to supervised learning on improved labels.
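The LABEL step is simple enough to write down. In this sketch the record layout, the move names and the 1.0/0.0 win label are assumptions made for illustration; the source only says that each move is relabelled with the search distribution and each board with the final outcome.

```python
def policy_target(visit_counts):
    """Normalise MCTS visit counts into the distribution the policy head is trained to predict."""
    total = sum(visit_counts.values())
    return {move: n / total for move, n in visit_counts.items()}

def value_targets(num_moves, winner):
    """Label every position with the final result, from the view of the player to move.

    Players alternate and player 0 moves first; winner is 0 or 1.
    1.0 means "the player to move here went on to win", 0.0 means they lost.
    """
    return [1.0 if (i % 2) == winner else 0.0 for i in range(num_moves)]

# Hypothetical visit counts from one search: the raw hunch may have liked D4,
# but the search kept coming back to Q16, so Q16 dominates the training target.
counts = {"D4": 30, "Q16": 60, "K10": 10}
print(policy_target(counts))
print(value_targets(4, winner=1))
```

Both outputs are ordinary supervised labels, which is the sense in which the loop reduces to supervised learning.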
Sit with what step two is doing, because it's the heart of the whole machine. Even in a game the agent lost, MCTS hands back a better move for every single position along the way:
You played this game where you eventually lost, but on every single action, I'm going to give you a strictly better action that you should have taken instead.
⌞jang⌟ — MCTS as a per-move improvement operator
The analogy — learning to drive by being corrected
Jang ties this to DAgger, a trick from teaching robots by imitation. Picture a self-driving car that drifts toward the shoulder. Even in that bad spot, there exists a correct action: steer back (⌞jang⌟ "Even if you're in a not-great state, for example a self-driving car that veers off the side of the road, there is still a valid action that corrects you and brings you back."). Whether the car ultimately crashes is beside the point; every moment still has a right answer you can learn from. AlphaGo manufactures that right answer at every move, win or lose. That's why it never has to untangle the maddening question "which of my 300 moves actually lost me the game?"
And notice the elegance of the starting conditions. Most reinforcement learning begins at a 0% success rate and has to stumble onto its first win by luck before it has anything to learn from: the dreaded exploration problem. AlphaGo sidesteps it: because MCTS always produces a better label, there's always a gradient to climb (⌞jang⌟ "you never have to initialize at a zero percent success rate and solve the exploration problem of how to get to a non-zero success rate. This is what allows you to hill-climb this beautiful supervised learning signal."). This is also why Jang insists initialization is everything: he warm-started his bot against the open-source KataGo rather than from a blank slate, skipping the most expensive part of the original effort.
· · ·
05
Why this is not how a chatbot learns
Now the payoff promised at the start. When an LLM is trained with reinforcement learning, it does not get a better label for every step. It writes a whole answer — thousands of tokens — gets told only "right" or "wrong" at the very end, and must somehow figure out which tokens deserve the credit. This is the credit assignment problem, and AlphaGo simply doesn't have it.
Figure · where the learning signal lands
AlphaGo
a target on every move
Every move carries its own improved label from search. Hundreds of dense supervision signals per game.
LLM reinforcement
one verdict at the end
one trajectory · ~100,000 tokens
✓ / ✗
A single right/wrong lands on the whole sequence. Which token earned it? Unknown. Karpathy: "sucking supervision through a straw."
Same idea of "learn from outcomes," opposite density of signal. In a 300-move game one decisive move may sit among ~299 neutral ones, yet MCTS labels each individually, so the needle never hides in the haystack.
There's a deeper way to say why this matters, and it's about information per example. Supervised learning (predict this specific move) delivers a rich, detailed signal every time. Reinforcement learning delivers only a single bit, won or lost, and worse, when the model rarely succeeds, even that bit carries almost no information. Most of training is spent in exactly that starved regime (⌞jang⌟ "naive RL delivers only the entropy of a binary variable, and because most of training is spent at very low pass rates, RL spends virtually all its time in the near-zero information regime."). Jang adds a killer caveat: if your policy would never even try the right answer, RL gives you no signal at all, whereas a supervised label always teaches you something (⌞jang⌟ "If your policy has no chance of sampling 'blue,' then you will never get a signal.").
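The "entropy of a binary variable" in that quote can be put into numbers. A rough sketch follows: the pass rates are arbitrary examples, and treating a move label as worth up to log2(361) bits is a simplifying assumption (it is the ceiling for picking one point out of 361, not a measured figure).

```python
import math

def binary_entropy_bits(p):
    """Information carried by one pass/fail outcome when the pass rate is p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome tells you nothing new
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Ceiling on what one "play exactly here" label can carry: one point out of 361
move_label_bits = math.log2(361)

for pass_rate in (0.5, 0.1, 0.01, 0.001):
    print(pass_rate, round(binary_entropy_bits(pass_rate), 4))
print("move label ceiling:", round(move_label_bits, 2))
```

The outcome bit is worth a full bit only at a 50% pass rate and shrinks toward zero as successes get rare, while the per-move label keeps its ceiling of more than eight bits regardless.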
This is also, quietly, why distillation works so well. AlphaGo doesn't train the policy to copy the single best move; it trains it to copy the whole search distribution — a "soft" target carrying far more information per example than a hard yes/no. The same "dark knowledge" reason a small model learns efficiently by imitating a big model's full distribution rather than just its top answer.
So why not just bolt MCTS onto LLMs?
People have tried; Jang is skeptical it transfers cleanly, for two concrete reasons grounded in the mechanics we just built:
Value estimation is harder. Go hands you a crisp, cheatproof outcome — you won or you didn't. Mid-way through a chain of reasoning or a piece of code, "how good is this so far?" is genuinely hard to score, and the value network was the load-bearing piece.
The action space is too wide. PUCT's explore term only works because a good Go move gets revisited: its visit count climbs and the math responds. In language, the "moves" are tokens, and the space is so vast you'll essentially never sample the same continuation twice, so that visit count never accumulates and the heuristic goes dead (⌞jang⌟ "In an LLM, you're most likely never going to sample the same child more than once… a discrete set of actions is not really an appropriate choice for an LLM."). Jang doesn't rule out a future MCTS-flavoured mechanism for rigidly logical domains like mathematics, but calls the jury still out.
The bridge to your own project
This is exactly the line your six qui prend work already walked. 2-player 6 nimmt! has fully known, low-dimensional rules: the game is a perfect simulator, so there's nothing to learn about its dynamics. That puts it squarely on the AlphaGo side of this divide: the right shape is lightweight MCTS over the known rules, with a policy/value net guiding the search ⌞six-qui⌟, not the heavier learned-world-model machinery built for environments whose rules are unknown. The credit-assignment elegance in this movement is precisely what your agent gets to exploit.
· · ·
06
$7K, and a robot doing the research
The original AlphaGo was a flagship effort: a team of DeepMind scientists and, for the Lee Sedol match, AlphaGo Zero-era training runs on the order of 10^23 floating-point operations, a scale comparable to a frontier language model. Jang rebuilt a competitive bot for roughly $7,000 of rented compute (funded by a Prime Intellect donation). The gap isn't magic; it's a clean illustration of a general law:
The compute required to be the first to do something is always much larger than the compute it takes to catch up.
⌞jang⌟ — once a thing exists, you get to use crutches: distillation, a strong initialization, a decade of hardware
The catch-up crutches were exactly the ones this piece has named: warm-starting against KataGo instead of from zero; Blackwell-class GPUs making many of KataGo's 2020 compute tricks irrelevant; and a neat shortcut of co-training on small 9×9 boards alongside full 19×19 ones, so the value network learns endgames quickly before scaling up. Jang is careful to flag these as honest "vibe guesses" from his own runs, not peer-reviewed claims ⌞jang⌟.
The coda — Go as a proving ground for automating research itself
Throughout the rebuild, Jang ran an autoresearch loop: frontier models (Opus 4.6 and 4.7) driving parts of the research process. The split he found is the most quietly important thing in the interview, because it's a real datapoint on the "intelligence explosion" question — what happens when AI starts doing AI research.
Figure · what the AI researcher could and couldn't do
Already good at
the inner loop
Hyperparameter search across open-ended options · inspecting gradient norms and reacting · rewriting data loaders · running experiments and compiling the plots and reports. Give it an axis to explore and it executes.
Still bad at
the outer loop
Choosing which experiment to run next when a track is failing · the lateral step of "wait, this whole assumption is wrong, go back to first principles." It optimises harder inside the frame instead of escaping it.
Jang visualises his project as a tree of experiments — nodes are outcomes, branches are follow-ups. The models grow the branches well; a human still has to prune the dead tracks and start new ones.
Why is Go the ideal arena to study this? Because it closes the loop honestly: a Go game is fast to verify and hard to reward-hack (win rate against KataGo is an unambiguous external judge), while the inner work (distributed systems, architecture choices, hyperparameter tuning) maps cleanly onto general ML research (⌞jang⌟ "Go captures a lot of very interesting research problems… Yet it's very quick to verify. The outer loop is ultimately: does the agent do what I think it does?"). Whether competence on that inner loop ever transfers to the truly hard part (choosing the right question, discovering a new phenomenon like scaling laws) is left open. Which lands us, fittingly, back where we started: AlphaGo as the cleanest mirror we have for how the next intelligence might learn, and where it might still be stuck.
Glossary — the hero's toolkit
Parameter (weight)
One of the millions of tunable numbers inside a network. Training = finding the configuration of all of them that maps inputs to good outputs. AlphaGo's net had under 3 million.
Forward pass
One front-to-back run of the network: input in, answer out. Fast (milliseconds). The whole point of the value network is doing useful work in a single one.
Policy network (the hunch)
Outputs a probability over all 361 moves — "which are worth considering." Solves the breadth problem. On its own it's already a strong, instinctive player.
Value network (the glance)
Outputs one number — win probability from this board — without playing it out. Solves the depth problem. Jang's "most profound" piece of AlphaGo.
MCTS (search)
Monte Carlo Tree Search. Builds a tree of likely futures by repeating Select → Expand → Evaluate → Backup hundreds of times per move, then plays the most-visited move. "Monte Carlo" = guided random sampling instead of brute enumeration.
PUCT
The rule MCTS uses to choose where to look next: exploit (how good a move looks so far) + explore (how under-tried a promising move is). Balances trusting vs checking.
Self-play
The agent plays itself, MCTS improves every move into a better label, the network trains on those labels, repeat. The "improvement operator." Reduces to supervised learning on improved labels.
Credit assignment
The problem of knowing which action in a long sequence deserves the credit/blame for the outcome. LLM reinforcement suffers it badly; AlphaGo dodges it entirely by labelling every move.
Policy gradient / REINFORCE
The standard "learn from outcomes" RL family used in LLMs: push up the probability of actions from winning trajectories. High variance because the signal is sparse; the lineage (baselines → actor-critic → PPO) is one long campaign to tame that variance ⌞weng⌟.
Distillation
Training a network on another's full distribution (a soft target) rather than its single top answer — far more information per example. Why AlphaGo trains on the whole search distribution, not just the chosen move.
KataGo
The strong open-source Go engine (2020) Jang warm-started against — the "strong initialization" that turned a multi-million-dollar effort into a $7K one.
the deep dive · play-by-play
Inside the conversation
The told story melded everything into one clean zero-to-hero arc. This page does the opposite: it walks Patel and Jang's three-hour chalkboard interview in its own order — every detour, number, correction, and aside the synthesis had to drop. It assumes the told story's foundation (what a network is, the self-play loop, what AlphaGo did) and builds up from there.
Source: Dwarkesh Patel × Eric Jang · Mode: faithful to the transcript · Device: hover ⌞cites⌟, unfold the math · Read: after the told story
The interview's own chapter markers are kept as anchors. Dense machinery — real formulas, variance algebra, the bits-per-sample math — is tucked into unfold blocks so the conversation keeps moving; open them when you want the receipts.
· · ·
00:00:00 · Basics of Go
The rules already contain a value function
They start at the board, not the algorithm, and the reason becomes clear an hour later. Jang has Patel put down stones and learn capture by playing into it: surround all four orthogonal neighbours of a stone and it dies, cut off from "oxygen" (⌞jang⌟ "for every intersection, if you can surround all four of its neighbors with your stones, then it's cut off from oxygen, if you will, and it's a dead stone."). Patel guesses it's the diagonals that matter; Jang corrects him: the cross-section, not the diagonals. A small beat, but the whole game is built from that one local rule.
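The capture rule is small enough to check in code. This is a sketch under assumptions: the board is a list of strings using 'B', 'W' and '.', the function names are hypothetical, and the count is taken over the whole connected group (the standard generalisation of the single-stone rule described above).

```python
def neighbours(r, c, size):
    """Orthogonal neighbours only: up, down, left, right. Diagonals do not count."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < size and 0 <= nc < size:
            yield nr, nc

def liberties(board, r, c):
    """Count the empty points ("oxygen") next to the group of stones containing (r, c)."""
    colour = board[r][c]
    size = len(board)
    group, frontier, libs = {(r, c)}, [(r, c)], set()
    while frontier:
        cr, cc = frontier.pop()
        for nr, nc in neighbours(cr, cc, size):
            if board[nr][nc] == ".":
                libs.add((nr, nc))
            elif board[nr][nc] == colour and (nr, nc) not in group:
                group.add((nr, nc))
                frontier.append((nr, nc))
    return len(libs)

surrounded = [".B.",
              "BWB",
              ".B."]
diagonals_only = ["B.B",
                  ".W.",
                  "B.B"]
print(liberties(surrounded, 1, 1), liberties(diagonals_only, 1, 1))
```

The first white stone has no liberties left and would be captured; the second, hemmed in only on the diagonals, still has all four, which is the correction Jang makes to Patel's guess.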
The strategic idea Jang wants Patel to feel is that capture is sometimes worth conceding: "In Go, it's actually okay to let an opponent capture some stones if… it lets you position to capture more stones somewhere else." That is the macro-versus-micro tension that makes the game deep: you can lose the battle but win the war, and it gets richer as the board grows (⌞jang⌟ "This is what makes Go a beautiful game: you can lose the battle but win the war. As the board size increases, the complexity of these micro versus macro dynamics gets more interesting.").
Where the value function is hiding
Then comes the move that justifies starting here. Jang sets up a surrounded white group and asks Patel who controls it. Patel says white; it's actually black, because the white group is dead — it just hasn't been captured yet. How do humans know it's dead without playing it out? They simply agree. One says "I think the game is done," the other has to agree, and if they don't, they keep playing. Jang's aside is the seed of the entire neural-network half of the talk:
Once two humans — their so-called value function — agree on a consensus, then the Chinese rules resolve that.
⌞jang⌟ — the human "this position is lost" judgment is a value network, stated 50 minutes before he draws one
This is why he refuses to start with the algorithm. The thing a value network learns — "glance at a board, output who's winning, don't play it out" — is something two strong humans already do when they resolve a game by agreement. The machine just makes that glance explicit.
Figure · the same board, scored two ways
How humans score
consensus + judgment
"This white group is dead — agreed?" Both players nod, the dead stones come off, and the surrounded territory counts for black. Ambiguous positions get talked out; if there's disagreement, you keep playing to settle it.
How a computer scores
Tromp–Taylor, unambiguous
No judgment calls. Play continues to the literal end, then count: stones you control, plus empty points touched only by your stones. Points touched by both count for nobody. Deterministic, so a program can resolve it.
All Go AIs train and resolve under Tromp–Taylor rules precisely because they remove the human consensus step. The gap between these two columns is exactly the gap a value network has to learn to bridge.
unfold Tromp–Taylor, and why suicide is legal
Human Go has soft rules; computer Go needs hard ones. Under Tromp–Taylor, a move humans would forbid as "instant suicide" (playing into a fully surrounded point) is perfectly legal: you put the stone down and it instantly resolves to death (⌞jang⌟ "In typical Go, when humans play, you're actually not allowed to put this white stone down here. It would be instant suicide. In Tromp-Taylor, it's actually fine. You put it down, and it immediately resolves to death, so the outcome is the same."). Same outcome, zero ambiguity.
Scoring is mechanical: count the stones you control, then count empty intersections touched only by your stones. Points adjacent to both colours score for neither. This produces the one quirk Jang flags — Tromp–Taylor will sometimes award a player points a human would recognise as already lost, because it has no notion of a group being "obviously dead." That mismatch is the price of being decidable by a program.
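The mechanical count can be sketched as a flood fill. Assumptions to flag: the board is a list of strings using 'B', 'W' and '.', the function name `area_score` is hypothetical, and "touched only by your stones" is implemented at the level of whole empty regions (a connected empty region scores for a colour only if every stone bordering it is that colour), which is how the Tromp–Taylor rules define it.

```python
def area_score(board):
    """Return (black, white): stones on the board plus empty regions bordered by one colour only."""
    size = len(board)
    score = {"B": 0, "W": 0}
    seen = set()
    for r in range(size):
        for c in range(size):
            cell = board[r][c]
            if cell in score:
                score[cell] += 1
            elif (r, c) not in seen:
                # Flood-fill this empty region and record which colours border it
                region, borders, stack = 0, set(), [(r, c)]
                seen.add((r, c))
                while stack:
                    cr, cc = stack.pop()
                    region += 1
                    for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                        if 0 <= nr < size and 0 <= nc < size:
                            neighbour = board[nr][nc]
                            if neighbour == ".":
                                if (nr, nc) not in seen:
                                    seen.add((nr, nc))
                                    stack.append((nr, nc))
                            else:
                                borders.add(neighbour)
                if len(borders) == 1:          # bordered by both colours scores for nobody
                    score[borders.pop()] += region
    return score["B"], score["W"]

walls = [".B.W."] * 5
lone_white_stone = ["WB.W."] + [".B.W."] * 4
print(area_score(walls), area_score(lone_white_stone))
```

In the second board a single white stone sits in what a human would call black's corner. A human would treat it as dead; the mechanical count sees an empty region touched by both colours and gives black nothing for it, which is the quirk described above.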
And the game ends the simple way: both players pass consecutively, or one resigns (⌞jang⌟ "The game ends when either a player chooses to resign or both players pass consecutively. Those are the rules.").
· · ·
00:08:17 · Monte Carlo Tree Search
The tree no one can hold, searched while it's built
First, the scale of the problem, stated precisely. Roughly 361 legal moves at the start, ~250–300 moves per game, branching factor dropping by one each turn: follow it all and you get about 361^300 games, "far more than the number of atoms in the universe" ⌞jang⌟. This is why computer scientists spent decades calling Go intractable for this century.
Two subtleties the told story skipped. First, the real tree is smaller than 361^300 because of merging children: two different move-orders can reach the identical board, so that board is a shared node, not two. Second, and this is the engineering crux, you never build the tree and then search it. The tree is too big to store, so you grow it and search it at the same time, expanding only the leaves that look worth it (⌞jang⌟ "Because Go is such a combinatorially complex game, you cannot afford to build the tree in advance and then search it. You must search while building the tree.").
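The "merging children" point can be shown with a lookup table keyed by the board itself. This is only an illustration: the `play` helper and the table layout are hypothetical, captures are ignored, and a real engine would also have to fold in details such as whose turn it is and ko, commonly by hashing the position.

```python
def play(moves):
    """Place alternating black/white stones on an empty board (assumes no captures occur)."""
    stones = {}
    for i, point in enumerate(moves):
        stones[point] = "B" if i % 2 == 0 else "W"
    return frozenset(stones.items())   # hashable, and independent of move order

# Black plays the same two points in a different order; White's reply is the same
order_1 = play([(3, 3), (15, 15), (3, 15)])
order_2 = play([(3, 15), (15, 15), (3, 3)])

tree = {}   # board -> shared node statistics
for board in (order_1, order_2):
    tree.setdefault(board, {"visits": 0})["visits"] += 1

print(len(tree), tree[order_1]["visits"])
```

Both move-orders land on one entry, so whatever the search learns through one path is available to the other.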
Figure · what an MCTS tree actually looks like
Most expanded leaves are visited once and abandoned (grey); visits pile up along the line the search trusts (vermillion). Contrast a tic-tac-toe tree, which you can enumerate exhaustively — 9!, fully dense. Go's tree is almost entirely unbuilt.
The node, and the trap that catches RL people
Each node is a small data structure. Jang flags the one thing that trips up anyone coming from robotics RL: there are no "action" fields. Because Go is deterministic, every child is an action: moving to that child is the action that produces it (⌞jang⌟ "One thing that's easy to trip on if you come from robotics or other kinds of reinforcement learning is, where are the actions? I'm only talking about nodes. Nodes here represent states, and because this is a perfectly deterministic game with no randomness, you actually can just infer the action based on the child."). (He notes the exact layout is "chef's choice"; this is the one Claude 4.6 wrote when he asked it to, and a reasonable one.)
Figure · one MCTS node
node ≈ a board state (reached by one action from its parent)
N_a · visit count: how often we chose this child
Q_a · mean action value: average win-rate through here
P_a · prior: the policy net's P(this move)
children · dict → more nodes (a reference tree)
No "action" entry: in a deterministic game the child node is the action. That single simplification is why Go's tree code is so much cleaner than a robotics MDP's.
PUCT, in full
The told story showed only "exploit + explore." Here is the actual selection rule AlphaGo uses: PUCT, usually expanded as "Predictor + Upper Confidence bounds applied to Trees". On every step down the tree you take the move that maximises:
$$a^{*} = \arg\max_{a}\left[\, Q(s,a) + c_{\text{PUCT}} \cdot P_a \cdot \frac{\sqrt{N}}{1 + N_a} \,\right]$$
Q(s,a): Exploit. The running mean result through this move, your current best guess of its win-rate. If you knew the whole tree, this alone would be enough.
c_PUCT: A constant knob setting how much exploration matters relative to exploitation.
P_a: The policy network's prior for this move. Search trusts the hunch about where to look first.
√N: Square root of the total visits to the parent. Grows as the parent is explored and keeps nudging attention outward.
1 + N_a: Visits to this child, plus one. As a move gets picked, this denominator grows and its explore bonus collapses.
Early on N_a = 0 everywhere, so selection follows the policy prior P_a. As visits accumulate the explore term shrinks and Q takes over: the search settles onto what it has learned is strong.
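A couple of plugged-in numbers show the handover from prior to Q. The values below are invented for illustration (two candidate moves, a constant of 1.5, and Q treated as a win-rate that starts at 0 for an unvisited move); only the shape of the formula comes from the text.

```python
import math

def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Q + c_puct * P * sqrt(N) / (1 + N_a). The default c_puct is an assumed value."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)

# Early: nothing has been tried, so the policy prior decides where to look first
early_hunch = puct_score(q=0.0, prior=0.6, parent_visits=1, child_visits=0)
early_other = puct_score(q=0.0, prior=0.1, parent_visits=1, child_visits=0)

# Later: the favoured move has been tried 300 times and looks mediocre,
# while a low-prior move has quietly built up a better average
late_hunch = puct_score(q=0.45, prior=0.6, parent_visits=400, child_visits=300)
late_other = puct_score(q=0.60, prior=0.1, parent_visits=400, child_visits=100)

print(round(early_hunch, 3), round(early_other, 3))
print(round(late_hunch, 3), round(late_other, 3))
```

Early on the high-prior move wins the comparison; after a few hundred visits the explore bonus on both moves has shrunk to a few hundredths and the better running average decides.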
Jang traces the shape to the UCB1 bandit algorithm ("optimism in the face of uncertainty"), which bounds how wrong you can be (your regret) by adding an exploration bonus to the mean (⌞jang⌟ "There are some early algorithms in the bandit literature like UCB1, which is not exactly appropriate for a sequential game like Go, but very much inspired the action selection algorithm used in AlphaGo."). Patel adds the intuition that lands it: UCB1's bonus uses ln(N), which grows slower than N, so over time the argmax slides from being dominated by exploration to being dominated by Q. PUCT swaps ln(N) for √N precisely because Go has far more actions per move than a textbook bandit, so the terms have to grow differently.
Where does "probability" even come from in a deterministic game?
Patel catches the contradiction: Go is deterministic, so with infinite compute there are no probabilities — you'd just compute the true mean value of each move. So why all the talk of distributions?
Jang's answer: the probability is an artefact of sampling. Pre-AlphaGo computer Go used Monte Carlo methods: estimate a move's value by averaging over a randomly sampled tree. The randomness of which tree you sample is where probability enters. Q becomes "the expected value under the distribution induced by your random search process," and P_a is what shapes that distribution. As Jang puts it: "The interpretation of Q is: what is the expected action value under the random distribution induced by some random search process?"
If you sampled uniformly (every legal move equally likely), the average would be correct but uselessly slow: you'd spend almost all your samples on low-value lines. Jang names it: it's essentially an importance-sampling problem, where only a few paths carry high value and everything else is near-zero. The policy prior P_a is what concentrates sampling on the paths that matter.
The last step, backup, is just bookkeeping: when a simulation reaches a leaf and the value head scores it, that value is folded into a running mean for every node on the path back to the root. Do this a few hundred times and the visit counts themselves become the answer — the move you play is the most-visited child, the one the search kept returning to. Jang's reminder for the next chapter: you throw the whole tree away after each move — except for one thing, which is what makes the network learn.
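The backup and the final move choice can be sketched as follows. This is a minimal illustration, not Jang's implementation: the `Node` class is a stripped-down stand-in, and the sign flip at each level (values in [-1, 1], seen from the player who just moved) is one common convention assumed here rather than something the interview specifies.

```python
class Node:
    def __init__(self):
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}

    @property
    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0


def backup(path, leaf_value):
    """Fold one leaf evaluation into the running mean of every node on the path.

    Assumption: leaf_value is from the viewpoint of the player who moved into
    the leaf, and the sign flips at each step up because the players alternate.
    """
    value = leaf_value
    for node in reversed(path):
        node.visits += 1
        node.value_sum += value
        value = -value


def most_visited(root):
    """The move actually played: the child the search returned to most often."""
    return max(root.children, key=lambda m: root.children[m].visits)


root = Node()
root.children = {"A": Node(), "B": Node()}
# Three made-up simulations: two through A, one through B.
for move, leaf_value in [("A", 1.0), ("B", -1.0), ("A", 0.5)]:
    backup([root, root.children[move]], leaf_value)
```

After these three simulations `A` has two visits with a mean of 0.75, `B` has one, and `most_visited(root)` picks `A`.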
· · ·
00:32:04 · What the neural network does
Two heads, a local-vision trunk, and a strong opinion about architecture
The told story gave you "policy head solves breadth, value head solves depth." Here are the engineering specifics Jang dwells on — both heads are just classifiers (value: a binary win/lose logit; policy: a categorical distribution over the 361 points), trainable with ordinary cross-entropy. The board enters as an RGB-like image: three channels — black, white, and empty (the empty channel doubling as a mask if you train on several board sizes at once).
Figure · the input, a 3-channel board: black (1 where black), white (1 where white), empty (1 where empty, doubling as a mask)
Stacked, these feed one shared ResNet trunk, which splits into a value head (→ ℝ^1, one win-probability logit) and a policy head (→ ℝ^361, a move distribution). The told story's "one trunk, two heads": here's the tensor shape behind it.
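A small sketch of the three-channel input encoding, in plain Python lists so it runs without any array library. The function name `encode_board` and the `'B'`/`'W'`/`'.'` board notation are assumptions for illustration; a real pipeline would produce a tensor instead.

```python
def encode_board(board):
    """board: list of rows holding 'B', 'W' or '.'; returns [black, white, empty] planes."""
    def plane(symbol):
        return [[1 if cell == symbol else 0 for cell in row] for row in board]
    return [plane("B"), plane("W"), plane(".")]


# A made-up 3x3 position, small enough to read by eye.
tiny = ["B..",
        ".W.",
        "..."]
black, white, empty = encode_board(tiny)
```

Every intersection is 1 in exactly one of the three planes, so the planes together carry the same information as the board itself.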
Why ResNets beat transformers here — and what KataGo adds back
Jang tried hard to make transformers win and couldn't. In Go's low-data regime, ResNets give more bang for the buck because convolutions carry a free inductive bias: nearby intersections matter most. Jang: "For small data regimes, my experience is that ResNets still outperform transformers and give you more bang for the buck at lower budgets." Transformers' global attention only starts to pay once you have enough data to learn local structure from scratch, and Go fights are spatially local, so the bias is usually right.
But "usually local" isn't "always local," and that's the gap KataGo's global feature pooling fills. A convolution can read two separate battles on opposite sides of the board, but it struggles to connect them. KataGo periodically pools features across the whole board so value can flow side to side, and the local-vision trunk gets just enough global awareness to judge whole-board position. Jang on the KataGo authors: "they found it quite useful to pool together and aggregate global features throughout the network, to give the network a global sense of how to connect value from one side of the board to the other."
Why AlphaGo only needs the current board: perfect information & Nash
Patel asks the sharp question: Go reads only the current board, no history — fine for a deterministic full-information game, but what about poker or Diplomacy, where an old bluff matters now?
Jang's answer turns on perfect information. Such games have a Nash equilibrium strategy that can be played from the current state alone and can do no worse than any alternative. AlphaGo's design choice (condition only on the present board) works because that equilibrium turns out to be superhuman: no human style beats it, so there's nothing to gain from modelling opponent history. Jang: "to counter any given strategy, there exists a single Nash equilibrium that can be decided solely using the current state. That is a design choice AlphaGo made, which in hindsight turned out to work very well because the Nash equilibrium seems to be superhuman. No human strategy seems to be able to beat it."
The moment you break perfect information — 2-versus-2 Go where you must model an unseen partner, or imperfect-information games — you genuinely need temporal context, and the architecture question reopens. Jang flags this as fertile, under-explored research and points at his repo.
"Initialization is everything"
The original AlphaGo (the Lee Sedol version) didn't learn from nothing — it was warm-started on a big dataset of expert human games, and only later did DeepMind remove that crutch and learn tabula rasa. Jang turns this into his single loudest piece of practical advice, repeated several times across the interview:
In deep learning, initialization is everything. Always pick something that works and then get it to do something better, rather than start from something that doesn't work at all and try to make it work.
⌞jang⌟ — the philosophy behind warm-starting his own bot against KataGo instead of from zero
A nice consequence falls out of training the value head on whole games: early boards, where the outcome is genuinely a coin-flip, drive the predicted win probability toward 0.5, and it only sharpens toward 0 or 1 as the game progresses. Jang: "it's expected that once you train the model, a starting board state will look like 0.5, and then as you progress towards the end of the game, the win probability will either go up or down." And the raw policy head, even before any search, is already a formidable player: argmax its output and you get a strong "shoot from the hip" bot from under 3M parameters. Search is what takes it from strong to superhuman.
The rollout AlphaGo Lee used — and everyone dropped
One detail the told story folded away: in the evaluate step, the original AlphaGo Lee didn't fully trust the value network. It averaged the value-head estimate with an actual played-out game (a "rollout"), using the policy network as both players to play cheaply to the Tromp–Taylor end:
V_θ(leaf): The value network's single-pass guess of win probability.
rollout(leaf): A real game played from the leaf to the end, policy-net vs itself; a cheap 0/1 sample of who'd actually win.
α: The mixing weight between the network's intuition and a grounded playout.
The rollout kept estimates "tethered to reality," especially in the endgame. Every paper after AlphaGo Lee removed it — once the value net is good enough, it's pure overhead, and dropping it speeds training a lot. Jang dropped it too.
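Written out, the mix described above is a convex combination of the two estimates. This is a sketch only: the text says α is the mixing weight but not which term it multiplies, so putting α on the rollout (and 1 − α on the value network) is an assumption made here.

```latex
V(\text{leaf}) = (1 - \alpha)\, V_\theta(\text{leaf}) + \alpha \cdot \mathrm{rollout}(\text{leaf}),
\qquad 0 \le \alpha \le 1
```

Under that convention, α = 0 is the rollout-free evaluation that later versions use, and α = 1 would trust the playout alone.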
Could you delete the policy head entirely?
Jang poses it as a test of understanding. When you expand a node you evaluate the value of each child — so you already know which children look good. Why not just normalise those values into a distribution and call that your policy? Drop the policy head.
It would basically work, he says, but you'd pay for it. To score every move you'd need up to 361 forward passes (one per child) instead of one policy pass that hands you the whole distribution at once. The single pass is how breadth gets pruned cheaply. Jang: "having a single forward pass that gives you a pretty good guess is how the breadth is pruned out."
There's also a consistency argument — policy and value are linked; it would be a red flag if the policy loved a move the value head rated as losing. But the real reason to keep an explicit policy is the next chapter: it's the thing MCTS improves and trains against, the surface self-play writes its lessons onto. An implicit value-normalisation has nowhere to put that signal.
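To make the thought experiment concrete, here is a hedged sketch of a policy derived purely from child values, plus the forward-pass count it implies. Softmax with a `temperature` is one assumed way to "normalise those values into a distribution"; the interview does not say which normalisation Jang has in mind, and both function names are hypothetical.

```python
import math


def implicit_policy(child_values, temperature=1.0):
    """Softmax over per-child value estimates (an assumed normalisation)."""
    m = max(child_values.values())  # subtract the max for numerical stability
    exps = {a: math.exp((v - m) / temperature) for a, v in child_values.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def forward_passes(num_legal_moves, use_policy_head):
    """Network evaluations needed to rank every move at one node."""
    return 1 if use_policy_head else num_legal_moves


policy = implicit_policy({"A": 0.6, "B": 0.4})  # made-up child win-probabilities
cost_without_head = forward_passes(361, use_policy_head=False)
cost_with_head = forward_passes(361, use_policy_head=True)
```

The distribution comes out sensible, but on an empty 19×19 board it costs 361 evaluations where the explicit policy head costs one.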
· · ·
01:00:33 · Self-play
Amortizing the search into the network
The one thing kept from each move's discarded tree is its visit-count distribution — a sharper opinion than the raw policy produced. Train the policy to predict that sharper opinion and something remarkable happens: the search you paid for this turn gets baked into the network, so next time the policy starts from where the search ended.
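A minimal sketch of that training target: normalise the visit counts into a distribution and score the policy against it with ordinary cross-entropy. The counts and policy numbers are invented for the example, and real systems often apply a temperature to the counts first, which is omitted here.

```python
import math


def visit_distribution(visit_counts):
    """Normalise MCTS visit counts into the policy's training target."""
    total = sum(visit_counts.values())
    return {move: n / total for move, n in visit_counts.items()}


def cross_entropy(target, predicted, eps=1e-12):
    """-sum target * log(predicted): the usual classification loss, with a soft target."""
    return -sum(t * math.log(predicted.get(m, 0.0) + eps) for m, t in target.items())


visits = {"A": 120, "B": 60, "C": 20}         # made-up counts from one search
target = visit_distribution(visits)           # roughly A 0.6, B 0.3, C 0.1
raw_policy = {"A": 0.4, "B": 0.4, "C": 0.2}   # made-up pre-search policy output
loss_before = cross_entropy(target, raw_policy)
loss_after = cross_entropy(target, target)    # what a perfectly distilled policy would score
```

Gradient steps that push `loss_before` toward `loss_after` are what move the raw policy toward the search's sharper opinion.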
Figure · test-time scaling, and what distillation does to it
Both curves rise monotonically with search (Jang is careful that the exact shape is unknown — don't read the sigmoid too literally). Distillation lifts the whole curve's starting point: the first ~1,000 sims of search become "free," folded into the policy, so the same budget now buys a stronger move. That lift, iterated, is how the bot climbs.
This is also the precise mechanism behind a told-story claim: a modern trained bot needs far less test-time compute than the Lee Sedol machine's tens of thousands of sims, because the network has absorbed the TPU pod's work into a forward pass. Jang: "over time the raw network actually takes all of the burden of that big TPU pod and just pushes it into the network. You can do all of that work with one neural network forward pass." The pod only ever added "extra oomph on top" for the match.
Is MCTS guaranteed to beat the policy? No.
Patel presses the assumption everything rests on: is search always an improvement? Jang is candid — it's a heuristic, not a guarantee, and he draws the failure case. If the value network is wrong, bad leaf values propagate up through backup and corrupt the PUCT selection, so the search can land on a worse distribution than the raw policy. The concrete way this happens: if the bot has learned to resign rather than play losing endgames to the bitter end, its replay buffer never contains late-stage positions, so it never learns to value them:
If the terminal values of the leaves are not good, then this will propagate all the way up and cause your PUCT selection criteria and your backups to be off.
⌞jang⌟ — why MCTS is only an improvement operator if the value estimates are sound
Two fixes, both pragmatic. He suspects AlphaGo Lee kept its rollouts (last chapter) precisely to ground value in real playouts. And his own trick: for ~10% of self-play games, forbid resignation and force a full Tromp–Taylor resolution, so the buffer collects the late-game positions the value head would otherwise never see. Jang: "for 10% of the games, prevent the bots from resigning and just say, 'Resolve it to the end.' That way you get some training data in your replay buffer to really resolve those late-stage playouts that normal human players would not play to."
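The no-resign trick amounts to a per-game coin flip. A hedged sketch follows; the function name and the way the flag would be consumed by a self-play loop are assumptions, and only the 10% figure comes from the interview.

```python
import random


def allow_resignation(rng, no_resign_fraction=0.10):
    """Decide, once per self-play game, whether the bots may resign."""
    return rng.random() >= no_resign_fraction


rng = random.Random(0)  # seeded so the example is repeatable
flags = [allow_resignation(rng) for _ in range(10_000)]
forced_full_games = flags.count(False)  # games that must be played out to the end
```

Roughly one game in ten ends up in `forced_full_games`, and those are the games that feed late-stage positions into the replay buffer.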
What's really happening in the first epochs of a cold start
Patel reframes AlphaZero's cold start cleanly and Jang confirms it: at the very start the policy is useless, so what those first epochs actually do is train the value head — play full games, label every preceding board with who eventually won, and learn to predict that. Only once the value function is decent does the policy start meaningfully improving. The hardest part to learn, Jang adds, is neither the opening (obviously ~0.5) nor the endgame (obviously decided) but the midgame, where who's winning is genuinely subtle. Patel spots the resemblance to TD learning — bootstrapping value estimates from later, more-certain ones — and Jang agrees there's a "beautiful connection," picked up later.
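The labelling step described here ("label every preceding board with who eventually won") is simple enough to sketch directly. The data layout and the choice to label from the viewpoint of the player to move are assumptions for illustration.

```python
def value_targets(positions, winner):
    """Label every position in a finished game with the eventual result.

    positions: list of (board, player_to_move); winner: 'B' or 'W'.
    Returns (board, player_to_move, label) with label 1 if the player to move
    went on to win, else 0.
    """
    return [(board, player, 1 if player == winner else 0) for board, player in positions]


# Placeholder strings stand in for real board states.
game = [("board0", "B"), ("board1", "W"), ("board2", "B")]
labels = [label for _, _, label in value_targets(game, winner="B")]
```

Every position in the game gets the same final outcome, which is why opening positions average out to about 0.5 across many games while late positions become sharply decided.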
Figure · two ways to get a value function off the ground
Borrow expert data (the fast shortcut): Train on human games, or self-play games from an existing open-source bot. Late-stage positions are easy to score when the board is nearly full, so a good endgame value function falls out quickly. Jang's recommended start.
Small-board random play (the tabula-rasa route): Play ~50,000 random games on a 9×9 board; at that size random play already produces human-looking endgames you can score. Co-train 9×9 with 19×19 and the value head transfers up, since Go has no new pieces at larger sizes.
Either way the lesson is the same: "MCTS will fall apart if you don't have a grounding function for the value." Ground the value first, then let search do its work.
Why merge the value and policy into one network?
Patel asks whether sharing a trunk roughly halves the compute. Jang: AlphaGo Lee used two separate networks; every version since merged them into one trunk with two heads, which presumably saves compute since they share representations — but he's honest that proving the exact saving rigorously "takes quite a bit of work to really resolve." The intuition is sound (policy and value should agree, so they should share features); the precise number is the kind of clean-sounding question that's expensive to actually answer.
· · ·
≈ 01:18 · interlude
Does understanding it make it less impressive?
Patel floats a provocation: the more he understands AlphaGo — the hand-built tree, the explicit explore/exploit tuning, all the bias you're injecting — the less impressive the 2017 result feels, the way simple RLVR producing genius code feels more magical. Jang firmly disagrees, and his reason is the deepest idea in the interview.
10 steps of neural network parallelized distributed-representation thinking is able to amortize and approximate to very high fidelity a nearly intractable search problem.
⌞jang⌟ — a ~10-layer network collapsing a 361^300 search into one forward pass
A ten-layer network is, by construction, ten sequential steps of computation. That ten steps can stand in for a search with more branches than the universe has atoms is, to Jang, "a breakthrough that most people don't even fully comprehend." It's the same phenomenon under AlphaFold (protein folding) and AlphaTensor (discovering algorithms): a hard-feeling, NP-flavoured problem falling to a small macroscopic solution.
the crack in P = NP
It genuinely unsettles Jang's sense of computational hardness. Jang: "It actually makes me wonder if our understanding of problems like P=NP, or these fundamental computational hardness problems, is incomplete." Not a proof of anything, but a real dent in the intuition that hard problems stay hard. Patel sharpens it: these problems are NP-hard in the worst case, but we rarely face the worst case, and real instances carry exploitable structure. Jang's reframe: maybe the mistake is formulating these solutions in worst-case complexity at all. In the limit it suggests simulating something enormous (weather, even "do we live in a simulation") might need far less compute than we assume, because so much can be amortized into a forward pass.
The thread runs into chaos. A single stone, like a single air molecule, can flip the exact future — Go is sensitive to initial conditions. Yet the macrostructure (who wins; where the hurricane goes) is predictable even when the microstate isn't — Jang reaches for the Lorenz attractor: you can't say where you'll land on it, but you know the shape. Patel contrasts a hash function: also wildly sensitive to inputs, but deliberately built to have no exploitable macrostructure — no value function to learn. They note Reiner Pope's observation that cryptographic protocols and neural nets are structurally similar (sequential layers jumbling information so the output depends on everything), and Jang closes on Jascha Sohl-Dickstein's "edge of chaos": a network has maximum expressive power right at that boundary, where chaos is not noise but useful.
The payoff for the RL discussion that follows: MCTS is doing something subtly different from what it looks like. It is not upweighting winning actions and downweighting losing ones. Jang: "Importantly, what it is doing is saying: for every action we took, we did a pretty exhaustive search on MCTS to see if we could do better, and we're going to make every action that we took better by having the policy network predict that outcome instead." It is handing back a better label for every action, which is why its learning signal has such low variance compared to the alternative everyone else uses.
· · ·
01:25:38 · Alternative RL approaches
The variance disaster MCTS quietly avoids
To feel why MCTS's per-move labelling matters, Jang builds the naive alternative — the one that looks like modern LLM RL. Take a league of agent checkpoints, play them against each other, and for the games someone wins, reinforce the winner's actions. Then he does the arithmetic that kills it.
Take two evenly matched policies, true win-rate 50/50. Play 100 games of 300 moves. Say policy A wins 51, B wins 49 — pure noise, except in one game A happened to stumble onto a genuinely better move. So across 100 × 300 = 30,000 move-decisions, exactly one carries real signal; the other 29,999 are neutral actions you'd reinforce toward exactly the policy you already had.
Figure · where the signal lives in naive league RL
One real learning signal (vermillion) in a sea of neutral moves. This is Karpathy's "sucking supervision through a straw" — and Jang notes it does work at the scale of millions of samples, "so long as you find a way to mask out the supervision" from the neutral ones. That masking is the entire subject of advantage estimation.
Worse, it isn't just dilution: the variance actively grows with episode length. Jang sketches the gradient of naive policy-gradient RL and shows a term that scales quadratically with T (the trajectory length), because decomposing a long episode into per-step actions introduces correlations between them. Jang: "You end up with a term that grows quadratically with T. When you have a setup like this, this thing acts as a coupling effect on top of these terms here." A 300-move game doesn't just hide the signal; it amplifies the noise.
Why this is exactly why LLM RL treats a whole answer as one action
This maps directly onto a puzzle in LLM RL: why do they do one-step RL, where the entire generated sequence is a single action a_t with T = 1, rather than multi-step per-token RL?
Because transformers factor the sequence probability as a product of per-token conditionals, the log-prob of "hello world" is just a sum of per-token terms: log p(hel) + log p(lo | hel) + log p(world | hel, lo). If you handed each token its own reward, you'd reintroduce exactly the cross-multiplication of per-token log-prob terms and reward terms, which is the coupling that magnifies variance. Collapsing the whole sequence to a single (log-prob, reward) pair with T = 1 sidesteps it. The single-action term still contributes some variance, but you've refused to multiply it by trajectory length.
The fix everyone reaches for is advantage estimation: instead of a raw 0/1 return, subtract a baseline so the multiplier is ~0 for neutral actions and ~1 only for genuinely-better-than-average ones. Jang: "Ideally, what you really want to do in RL is push up the actions that make you better than average and push down the actions that make you worse than average. They call this advantage." Jang points at John Schulman's Generalized Advantage Estimation as the canonical treatment, and notes the optimal case is to put a gradient on only the one move that actually got better, discarding the rest. Estimating that baseline well, of course, sends you right back to needing a good value function.
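The idea of subtracting a baseline can be shown with the textbook definition A(s, a) = Q(s, a) − V(s). This is a sketch with invented numbers, not Generalized Advantage Estimation itself, which adds discounting and a bias/variance knob on top.

```python
def advantage(action_value, state_value):
    """A(s, a) = Q(s, a) - V(s): how much better this action is than the policy's average."""
    return action_value - state_value


# Hypothetical (name, Q(s, a), V(s)) triples for three moves from similar positions.
moves = [
    ("neutral move", 0.50, 0.50),
    ("genuinely better move", 0.58, 0.50),
    ("blunder", 0.41, 0.50),
]
weights = {name: advantage(q, v) for name, q, v in moves}
```

The neutral move gets a weight of exactly zero, so it contributes no gradient; only the better move is pushed up and only the blunder is pushed down. Getting a trustworthy V(s) is the hard part, as the paragraph above notes.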
So the contrast crystallises. Naive RL is stuck solving a credit-assignment problem: guessing which actions among thousands earned the win. MCTS never plays that game: it improves the label of every action directly. Jang: "Monte Carlo tree search is doing something very fundamentally different. It's not trying to do credit assignment on wins. It's trying to improve the label for any given action you took." But what do you do when you can't build a search tree at all?
· · ·
01:25:38 · Alternative RL approaches (cont.)
When you can't build a tree: neural fictitious self-play
Go is perfectly observable, so you can construct a deep tree that captures the whole game. StarCraft isn't — you don't control the binary, it may not even be deterministic — so MCTS-style search is off the table. The trick used in AlphaStar and OpenAI Five keeps MCTS's goal (a better label per state) while swapping out how you get it.
Figure · best-response, then distill
Fix an opponent (pin π_b): Freeze a strong opponent from the league. The self-play problem collapses into an ordinary fixed-environment RL problem: reward 1 if you beat π_b, else 0.
Best response (hill-climb with PPO): Use any model-free RL algorithm (PPO, SAC, V-MPO) to train a policy that beats that fixed opponent. This best-response is your "teacher," the stand-in for MCTS's search.
Distill (average the league): Train best-responses against a whole league (π_b, π_c, π_d…) and distill them into one mixed strategy: a policy that does no worse than an averagely-chosen opponent.
The model-free RL algorithm is the teacher here, playing the role MCTS plays in Go. Jang: "fundamentally it's still about relabeling your states with better actions so that they improve your policy." Different machinery, identical principle: relabel each state with a better action.
The Q-learning connection, and why you never strictly need a policy
MCTS backs a value up a tree of futures the agent hasn't visited. Model-free Q-learning does the mirror image — it plans over trajectories the agent has visited, via the Bellman backup:
Q(s,a) = r + γ · max_{a′} Q(s′, a′)
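A tabular sketch of that backup, with a greedy policy read off the Q table at the end. Everything here is illustrative: the two-state chain, the step size `alpha`, the discount `gamma`, and the function names are assumptions, and the interview discusses the neural-network version rather than a lookup table.

```python
def bellman_target(reward, next_q_values, gamma=0.99):
    """r + gamma * max_a' Q(s', a'); next_q_values is empty at a terminal state."""
    return reward + (gamma * max(next_q_values) if next_q_values else 0.0)


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.99):
    """Move Q(s, a) a step of size alpha toward the Bellman target."""
    next_qs = [q.get((next_state, a), 0.0) for a in actions] if next_state is not None else []
    target = bellman_target(reward, next_qs, gamma)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)


def greedy_policy(q, state, actions):
    """Recover a policy from Q by taking the argmax over actions."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Made-up chain: s0 --right--> s1 --right--> terminal win (reward 1).
q, actions = {}, ["left", "right"]
for _ in range(20):
    q_update(q, "s1", "right", 1.0, None, actions)   # terminal transition
    q_update(q, "s0", "right", 0.0, "s1", actions)   # value flows back one step
best_first_move = greedy_policy(q, "s0", actions)
```

The reward only ever appears at the end, yet after a few sweeps `Q(s0, right)` has picked it up through the backup, and the argmax at `s0` is `right` without any separate policy being trained.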
"The best you can do here equals the reward for this action plus the best you can do next." It's dynamic programming over a Markov decision process: knowing a future Q tells you something about the present Q, and you train a network to enforce that consistency. Crucially, this resolves the question Patel kept circling: why model a policy at all? You don't strictly have to: a policy can be recovered by taking the argmax over Q. Jang: "you can recover the policy distribution by doing argmax over your Q values." Q-learning mattered historically because for high-dimensional problems like robotics we couldn't search a tree, so we collected trajectories and planned with respect to the only thing that's certain: reward.
· · ·
01:45:47 · Why doesn't MCTS work for LLMs
The two things Go has that language doesn't
People have tried bolting tree search onto reasoning — Jang points to Google's 2023–24 tree-of-thoughts work — and "the jury is still out." His skepticism is specific and grounded in the machinery built above. Go hands MCTS two gifts language can't: a concrete value (to truncate depth) and a determined breadth (small enough for PUCT to be meaningful). Language gives neither.
Figure · why PUCT's explore term goes dead in language
Go (N_a accumulates): A good move gets revisited across simulations, so its visit count climbs, √N/(1+N_a) actually responds, and the search homes in. The exploration heuristic has signal to work with.
LLM reasoning (N_a never moves): "Moves" are tokens in a space so vast you'll essentially never sample the same continuation twice, so N_a stays at 0 or 1 and the visit-count term carries no information. Jang: "In an LLM, you're most likely never going to sample the same child more than once. If you have multiple steps of thinking, because language is so broad and open-ended, a discrete set of actions is not really an appropriate choice for an LLM."
Add the harder problem — mid-trajectory value estimation is concrete in Go (you won or you didn't) but genuinely hard for half-written code or a partial proof — and the two pillars MCTS leans on both crumble.
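The visit-count half of that argument can be checked with a toy simulation. This is a sketch under loud assumptions: children are sampled uniformly (a real policy prior is peaked), 800 samples stands in for one search, and an LLM "move" is modelled as a 3-token step over a ~100K vocabulary.

```python
import random
from collections import Counter


def max_child_visits(num_actions, num_samples, rng):
    """Sample children uniformly and report the largest visit count any child reached."""
    counts = Counter(rng.randrange(num_actions) for _ in range(num_samples))
    return max(counts.values())


rng = random.Random(0)  # seeded so the example is repeatable
go_like = max_child_visits(num_actions=361, num_samples=800, rng=rng)
# 100_000 ** 3 = 10^15 possible three-token continuations.
llm_like = max_child_visits(num_actions=100_000 ** 3, num_samples=800, rng=rng)
```

With 361 actions some child is necessarily visited at least three times, so the 1 + N_a denominator has something to respond to. With 10^15 continuations a repeat is vanishingly unlikely, so every sampled child sits at a count of one.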
He won't say "no way," though. LLMs already do something that looks like search (try an approach, back up, try another) without an explicit tree, and Jang suspects forward search and simulation "might make a comeback, even if not in exactly the same instantiation as AlphaGo." His bet on where: domains with rigid logical structure. Jang: "If you think about mathematics, it often occupies more of a logical search procedure where you can back up and see which paths seem good or not. There's more of a rigid structure there, whereas in a business negotiation it's less of a tree." Mathematics is tree-shaped in a way a business negotiation isn't, and that's where a successor to MCTS might re-enter.
· · ·
≈ 01:53 · seated — scaling
Andy Jones predicted inference scaling in 2021 — on Go
Years before "test-time compute" became an LLM buzzword, Andy Jones's 2021 paper Scaling Scaling Laws with Board Games showed it on Go: you can trade search compute against training compute and get the same playing strength either way. Spend more on MCTS at test time, or spend more on training — the curve says they're interchangeable.
Figure · the test-time / training-time trade-off
Each curve is one playing strength; move along it to swap search for training and back. Outer curve = stronger. The same paper went further — it could predict how much compute a larger board would need, scaling-laws for the problem size itself. The LLM world rediscovered this shape years later.
The mistake: studying scaling laws on bad data
Jang's actual motivation was a Bitter-Lesson bet: could you build a strong bot without KataGo's clever tricks, just by leaning on scale and scaling laws? He hasn't pulled it off, and the reason is a sharp lesson. Early on, with bugs in his MCTS labelling, he'd gather expert data, treat it as supervised learning, and plot scaling curves. They looked fine. But if the underlying system is broken, you're studying the scaling laws of a broken system. Jang: "if you're in a regime where your policy is not working well, you might just be studying scaling laws on bad data."
You don't necessarily want to jump into the science of studying your man-made artifact before your man-made artifact is interesting enough to be studied.
⌞jang⌟ — get the system working and bug-free first; then the scaling laws mean something
The ordering is the whole point: you first build something that works, then use it to collect the data that reveals how things scale. Trying to discover scaling laws while you're still figuring out the recipe gets you a confident plot of noise.
· · ·
≈ 01:57 · seated — compute
The aberration on the curve, and the $7K rebuild
Plot the compute used to train the best AI model each year and you get a smooth exponential, a straight line in log space, except for one spike. AlphaGo Zero was trained on ~3×10^23 FLOPs, wildly more than anything else of its era, in the ballpark of a frontier LLM years ahead of schedule. A decade later Jang rebuilt a competitive bot for the price of a used car's down payment.
Figure · AlphaGo Zero, the outlier
A frontier-LLM-scale training run, but in 2017 and for a board game. Being first is what cost so much.
~$10K: Prime Intellect donation funding the whole rebuild
$4K + $3K: exploratory research, then the final run (rest for serving)
3×10^23: FLOPs in the original AlphaGo Zero, for contrast
The gap isn't that DeepMind did it badly. It's a law Jang states flatly and ties straight to LLMs:
The compute required to be the first to do something is always much larger than the compute it takes to catch up.
⌞jang⌟ — once it exists, you get distillation, a strong init, and a decade of hardware as crutches
The pioneer pays a tax: no policy to warm-start from (AlphaZero had to go tabula rasa), no recipe to trust, every FLOP spent on getting it to work at all rather than on the compute-optimal frontier. Jang's own catch-up crutches were concrete. He warm-started with best-response training against KataGo, so his model needed none of KataGo's auxiliary-supervision tricks. Jang: "If you're initializing best-response training against KataGo itself, your own model needs none of the tricks that KataGo needs." Hardware did the rest: KataGo trained on V100s; Jang got equivalent results on half the number of desktop Blackwell GPUs. And the 9×9 co-training shortcut cut the warm-up: AlphaGo Zero spent its first ~30 hours just catching up to the supervised baseline, time you can skip by pre-learning endgames on a small board.
His honest summary of what actually mattered: architecture choice barely moved the needle (ResNet vs transformer is a wash at this scale), you can replace the whole distributed-async RL apparatus with a dumb synchronous loop (collect → train → collect), and many of KataGo's 2020 compute multipliers are simply transitory, since faster GPUs made them irrelevant. Jang: "The core thing is getting as quickly as possible to some strong opponents. That matters a lot more than the specific architectural innovations." The one thing that mattered above all: get to strong opponents fast.
· · ·
02:01:09 · Off-policy training
Why AlphaGo's replay buffer is fine — and when it isn't
Patel raises a real tension: every AI researcher warns that being off-policy is dangerous, yet AlphaGo trains from a replay buffer full of moves its current network didn't make. Why is that okay? Jang's answer reframes off-policy data from a liability into the very thing that makes a policy robust — through the DAgger lens from chapter four.
The danger is real but specific: if your buffer is full of states the current policy would never reach, you waste capacity teaching it good moves in situations it'll never face, and it can get genuinely worse. Jang: "In the extreme limit, imagine the distribution of states in your training buffer are all states you would never visit. Then you're supervising them to take good actions on states you would never achieve. Therefore, your policy can get really bad. This is where off-policy can really hurt AlphaGo." That's the failure mode. The healthy version is the opposite of dangerous: it's what stops small mistakes from compounding.
Figure · the DAgger tube — why a little off-policy data is a feature
Optimal control assumes you stay on the line. Reality knocks you off — "the other player is always trying to do some shit." A buffer with mostly on-trajectory states plus a thin shell of drift-states (each labelled with the move that funnels you back) makes the policy robust. Too far outside the tube and it's wasted capacity.
The experiment: relabelling random states
Jang ran a telling experiment. To saturate the GPU, instead of running MCTS move-by-move through real games, he grabbed random board states from the dataset and re-ran MCTS on them with his current network, ignoring move causality entirely and re-labelling old states afresh. Jang: "I ignore the causality of moves and pick random board states, and I label those with my current network... In practice, this actually does work." It works, and it converges on a setup that looks exactly like off-policy robotics learning (the Google QT-Opt days): a replay buffer of transitions, and a background "Bellman updater" constantly re-planning what the better action would have been.
"it's sort of like daydreaming"
Patel's phrase for the Bellman updater, and Jang takes it: a background process going back through everything you've seen and asking, in hindsight, "was there a better move I could have made here?" Jang: "You can think about it like you're going back in hindsight. Given what I've seen in the historical buffer, was there a better action I could have taken?" Jang tried wiring an MCTS relabeller into exactly this slot, replacing the robotics target-network computation with a tree search on each stored transition. It was "moderately successful, but too complex to open source." A real benefit: you stop blocking on the Go game to feed you states, so the GPU stays saturated.
So why has the field mostly converged on staying on-policy? Jang's answer is unglamorous: it's just more stable. Modern setups still use off-policy data, but defensively — to estimate advantages and reduce variance, not to drive the objective directly — so a glitch in the dynamics doesn't blow up the loss. Off-policy as seasoning, not the main ingredient.
· · ·
02:12:02 · RL is even more information-inefficient than you thought
The plot that explains why RL crawls
Patel brings his own framing (from his "bits per sample" essay) and Jang sharpens it. Think of learning speed as bits per FLOP = samples per FLOP × bits per sample. The well-known inefficiency is that the first factor shrinks as tasks get long-horizon — you must unroll two days of agent work to get one signal. The less-appreciated one is that naive RL is also miserable on the second factor.
The "sky is…" example makes it concrete. A fresh model, vocab of ~100K. Supervised learning is told the answer is "blue" and, via cross-entropy, learns exactly how far off it was: a rich signal every time. RL just samples: "the sky is halcyon" is wrong; "the sky is told" is wrong; it might guess ~100,000 times before stumbling on "blue" and getting any signal. And the amount supervised learning teaches you is −log(p) bits (p = your pass rate); the amount RL teaches you is merely the entropy of a coin flip with bias p (never more than one bit, and close to zero when p is tiny), and only when you happen to succeed.
Figure · bits learned per sample, against pass rate
Exactly where a fresh model lives, the low-pass-rate band, supervised learning is at its most generous and RL is flat against zero. As Jang says, "arguably you spend all your time here, potentially never even getting a single success. It's a depressing plot in the sense that once you're here, it's not at all obvious how you get to there." Worse, if the policy can't even sample the right token, RL's signal is exactly zero, whereas supervised learning always teaches something.
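The two curves in that plot can be reproduced numerically. The sketch below assumes that "the entropy of a coin flip" means the binary entropy of a pass/fail outcome with pass rate p; the function names are made up for illustration.

```python
import math


def supervised_bits(p):
    """Surprise, in bits, of being shown the right answer you gave probability p."""
    return -math.log2(p)


def rl_bits(p):
    """Binary entropy, in bits, of a pass/fail outcome with pass rate p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


for p in (1e-5, 1e-3, 0.1, 0.5, 0.9):
    print(f"pass rate {p:g}: supervised {supervised_bits(p):.4f} bits, "
          f"RL {rl_bits(p):.4f} bits")
```

At p = 1e-5, a uniform guess over a ~100K vocabulary, the supervised signal is roughly 16.6 bits while the pass/fail signal is well under a thousandth of a bit. The two only meet at p = 0.5, where both are exactly one bit.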
Why soft targets — and why AlphaGo is elegant
This is the deepest reason distillation works, and it ties the whole talk together. A one-hot label ("the answer is blue") has zero entropy. The full distribution of logits, the soft target, has far higher entropy, so it carries far more bits per sample. In Jang's words, "if you have access to the soft targets the entropy of this distribution is far higher than the one-hot. There's way more information in bits per sample in a soft label." That's why distillation is so effective per sample. It is also precisely the design choice in AlphaGo flagged earlier in this walk-through: it trains the policy to imitate the whole MCTS visit distribution, not just the single chosen move. That is "dark knowledge" distillation, maximising information per sample.
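To make the one-hot versus soft-target gap concrete, the sketch below compares the entropy of the two labels for a single position. The visit counts are invented for illustration and the helper names are hypothetical.

```python
import math


def entropy_bits(dist):
    """Shannon entropy of a probability distribution, in bits."""
    return sum(-q * math.log2(q) for q in dist if q > 0)


def visits_to_target(visit_counts):
    """Normalise MCTS visit counts into a soft policy target."""
    total = sum(visit_counts)
    return [n / total for n in visit_counts]


visit_counts = [400, 200, 100, 100]      # made-up counts for four candidate moves
soft = visits_to_target(visit_counts)    # [0.5, 0.25, 0.125, 0.125]
best = visit_counts.index(max(visit_counts))
one_hot = [1.0 if i == best else 0.0 for i in range(len(visit_counts))]

print(entropy_bits(one_hot))  # 0.0
print(entropy_bits(soft))     # 1.75
```

The one-hot label only says which move was picked; the soft target also says how much search effort the runner-up moves deserved, which is the extra information the policy gets to imitate.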
So Jang's closing characterisation of why AlphaGo is a beautiful RL algorithm:
You never have to initialize at a zero percent success rate and solve the exploration problem of how to get to a non-zero success rate. This is what allows you to hill-climb this beautiful supervised learning signal.
⌞jang⌟ — and there's no TD error, no dynamic programming: just value-classification + policy-KL on improved labels
Every step is dense supervised learning on a strictly-better label, so the MCTS-vs-raw-network gap never collapses to zero (unless search has nothing left to add). Training is stable, scales to any network size, needs no complex distributed on-policy machinery — you just keep retraining on improved targets. That, mechanically, is why AlphaGo climbs smoothly where naive RL stalls in the dark.
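A rough sketch of the two supervised terms named above, value classification plus policy KL against the search's visit distribution. The source names only the two terms; the equal weighting, the two-class [loss, win] value head, and all function names here are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_kl(target, policy_logits):
    """KL(target || policy) in nats: gap between raw policy and search target."""
    policy = softmax(policy_logits)
    return sum(t * math.log(t / p) for t, p in zip(target, policy) if t > 0)


def value_cross_entropy(outcome_class, value_logits):
    """Cross-entropy of the value head against the observed outcome class."""
    return -math.log(softmax(value_logits)[outcome_class])


def combined_loss(target, policy_logits, outcome_class, value_logits):
    """Sum of both terms, equally weighted (an assumption, not from the source)."""
    return (policy_kl(target, policy_logits)
            + value_cross_entropy(outcome_class, value_logits))


# Usage: an untrained network (all-zero logits) against a search target.
target = [0.5, 0.25, 0.125, 0.125]   # made-up MCTS visit distribution
loss = combined_loss(target, [0.0, 0.0, 0.0, 0.0], outcome_class=1,
                     value_logits=[0.0, 0.0])
print(f"loss for an untrained network: {loss:.4f} nats")
```

The policy term reaches zero only when the raw network already reproduces the search's distribution, which is the "search has nothing left to add" case described above.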
· · ·
02:22:16 · Automated AI researchers
What the AI researcher could and couldn't do
Jang ran much of this project through an automated LLM coding loop — mostly Opus 4.6 and 4.7 — which makes his observations a rare concrete datapoint on the "intelligence explosion" question. The split between what worked and what didn't is the quietly important payload of the whole interview.
Figure · the inner loop vs the outer loop
Already strong
the inner loop
Open-ended hyperparameter search — not just grids: "the gradients are small in this layer, let me change it"; rewriting data loaders; inventing augmentations.
A "grad-student-like ability to just grind a performance metric." Give it an axis and it runs the experiment, compiles the plot, writes the report.
Still weak
the outer loop
Choosing which experiment to run next when a track is failing. The lateral step: "wait, this whole assumption is wrong — back to first principles."
It optimises harder inside the frame instead of escaping it. Jang still had to catch the infra bugs himself: "Often I had to catch infra bugs myself by prompting the right question to Claude to investigate what's causing the discrepancy, and then it'll answer the question."
He drives the inner loop with a Claude Skill he wrote called Experiment: describe the x-axis, the y-axis, the question — it runs everything and reports back. The models grow the branches; a human still prunes the dead tracks.
He visualises the whole project as a tree of experiments — horizontal "rows" are parallel research tracks, each node a failed / mixed / successful outcome branching into its follow-up. Rabbit-hole down a track (like off-policy MCTS relabelling), realise it's not worth it, jump to a fresh row.
Figure · the tree of experiments (rows = tracks, nodes = outcomes). Tracks shown: architecture, off-policy relabel, 9×9 cotrain, scaling laws. Node legend: success, mixed, failed.
Schematic of the shape Jang draws in his blog. Models are good at extending a row; the human decides when to abandon one and start another — the move that's still missing.
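One way to hold such a tree in code is sketched below. It is an illustration only: the class and field names are hypothetical, the follow-up nodes are placeholders rather than real results, and nothing here reflects how Jang actually stores his experiments.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    MIXED = "mixed"
    FAILED = "failed"


@dataclass
class Experiment:
    question: str
    outcome: Outcome
    follow_ups: list = field(default_factory=list)  # child Experiment nodes


@dataclass
class Track:
    name: str                 # one horizontal row in the figure
    root: Experiment
    abandoned: bool = False   # set by the human, per the text above


def frontier(node):
    """Return the leaf experiments under `node`: where a track could be extended."""
    if not node.follow_ups:
        return [node]
    leaves = []
    for child in node.follow_ups:
        leaves.extend(frontier(child))
    return leaves


# Usage with placeholder follow-ups (not real results).
relabel = Experiment("Does off-policy MCTS relabelling work?", Outcome.MIXED)
relabel.follow_ups.append(Experiment("placeholder follow-up A", Outcome.FAILED))
relabel.follow_ups.append(Experiment("placeholder follow-up B", Outcome.MIXED))
track = Track("off-policy relabel", relabel)
track.abandoned = True  # the pruning decision that stays with the human
print([leaf.question for leaf in frontier(track.root)])
```

In this framing, extending `frontier` nodes is the inner-loop work the models already do well, while flipping `abandoned` is the outer-loop judgement the text says is still missing.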
Go as a cheatproof outer loop — and the questions that follow
Why build a Go environment to study automated research? Because it closes the loop honestly: win-rate against KataGo is fast to verify and hard to reward-hack, while the inner work (distributed systems, architecture, hyperparameter tuning) is general ML research skill. As Jang puts it: "Go captures a lot of very interesting research problems, often overlapping with LLMs or robotics. Yet it's very quick to verify." The outer loop is ultimately: does the agent do what I think it does? Train an automated scientist on that, and maybe the skill transfers to biosciences or, the real prize, to automating AI research itself. (He notes the Mythos-class models coming online might just dissolve today's weaknesses through scale; or it might take RL environments deliberately built to reward lateral thinking.)
Patel and Jang then circle the hard parts, and neither pretends they're solved:
three problems they leave open
Local verifiability (inner loop). Would the agent know whether a tweak is a real improvement or a bug? Jang invokes Ilya Sutskever: a great researcher holds a strong prior belief that "this idea should work, so there must be a bug" — and perseveres through bugs that would make a weaker conviction quit. Deep learning itself was a decades-long idea sustained against a committee saying it was wrong. How do you build an RL environment that rewards that kind of faith?
Stackability. Rumour from real labs: individually good ideas fail to stack. Two sound improvements interact badly and the training run breaks. In Jang's words: "These heuristics are probably somewhat redundant. That's probably why you see this effect where a lot of these compute multipliers don't necessarily stack." Many compute multipliers are redundant, correlated, and transitory, and they stack even worse as GPUs improve. A single top-down vision matters more than a pile of local wins.
Outer-loop verifiability. Win-rate is a clean backstop, but discovering a new phenomenon — like scaling laws — isn't checkable that way. Back in 2015, no automated procedure could have flagged "this is the scaling-laws paper" versus another random plot. We measure what we can measure and improve that; the economically important stuff resists measurement until it's already done.
His one intuitive argument for optimism: DeepMind used games as their outer loop, their researchers learned from solving Atari and Go and StarCraft, and that experience plausibly transferred to making good LLMs. If humans got positive transfer from quick-to-verify games to ambitious real problems, why wouldn't an automated researcher? Patel pushes back — until recently people said Google was hobbled by being too tied to the old approach — and Jang concedes the jury's still out, even noting Google's "late start" may just have been a long bet on TPUs paying off. Even with today's data, humans can't say what the optimal research strategy is.
· · ·
02:50 · where to go next
The duality worth sitting with
Jang closes not with trees-for-LLMs but with a softer claim: there's a duality between MCTS/search and how LLMs reason that's profound and under-explored, precisely because Go got so little attention next to the LLM boom.
It's not to say that I think we should have trees in our LLMs, but there is some very interesting duality between them. You can actually do a lot of research on Go, MCTS, and reasoning with very small budgets.
⌞jang⌟ — the invitation, and the reason this old machine still teaches the new one
The whole interview is an argument that thinking-as-search and thinking-as-forward-pass are two faces of one thing — and that the cleanest place to study the relationship is a board game you can iterate on for a few thousand dollars. To go further: his write-up is at evjang.com with an interactive tutorial, the forkable code is github.com/ericjang/autogo, and the grander thesis — thinking as a primitive in computer science — is his essay "As Rocks May Think."
The bridge to your own table
This lands exactly where six qui prend already pointed. 2-player 6 nimmt! has fully known, low-dimensional rules (the game is its own simulator), which puts it squarely on the AlphaGo side of the divide this deep dive kept drawing: lightweight MCTS over known rules, with a policy/value net guiding the search, not a learned world model. Everything granular here (PUCT's prior, the visit-count distribution as a soft target, grounding the value before you trust the search, the DAgger tube of recovery states) is now a concrete dial you can reach for. And Jang's closing fact is the permission slip: you can do real research on this with very small budgets.