tenko · debrief · zero to hero

Building AlphaGo
from scratch

In 2016 a machine beat Lee Sedol at Go and it felt like the future arriving. Ten years on, Eric Jang rebuilt that machine over a weekend for the price of a used laptop — and in doing so laid bare the three primitives of intelligence: search, learning from experience, and self-play. This is the walk-through, built up from what a neural network actually is.

00

Why step backwards to go forwards

The version you watched (AlphaGo against Lee Sedol, March 2016) is the one the source still calls a watershed (⌞jang⌟: "When I saw the early breakthroughs on AlphaGo in 2014, 2015, 2016 and so forth, it was profound to see how smart AI systems could become and the computational complexity class they could tackle with deep learning."). A problem long believed intractable for any computer fell to deep learning, and it did so in a way nobody fully expected. Jang's bet is that this old machine still teaches the future better than today's chatbots do.

His framing is worth holding onto for the whole read. AlphaGo is the cleanest worked example of the primitives of intelligence: search (reading ahead), learning from experience (training on what happened), and self-play (getting better by playing yourself) (⌞jang⌟: "AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn."). Every modern system, whether the LLM you talk to or a robot learning to walk, is some recombination of these three. Go just shows them in their purest form.

And the punchline that frames the second half of the piece: the way AlphaGo learns is almost the opposite of how a large language model learns from reinforcement — and arguably much closer to how you learn. Getting to why that's true is the journey. But we can't talk about networks beating Go until we're honest about what a neural network even is. So we start there.

· · ·
01

What a neural network actually is

You already have the concept: a thing that learns patterns from data. Let's swap the concept for the machinery, because the rest of the piece leans on it.

The machinery — a network in three plain ideas

1. It's a giant tunable function. A neural network is just a mathematical function: numbers go in, numbers come out. What makes it special is that it has millions of little dials inside it — called weights or parameters — and turning those dials changes what the function computes. AlphaGo's network had under 3 million of these dials.

2. Running it once is a "forward pass." You feed the input (a Go board) in one end, it flows through a stack of layers — each layer a simple transformation — and an answer comes out the other end. Ten layers means ten transformations in sequence. That single front-to-back run is one forward pass. It's fast: milliseconds.

3. "Training" is just nudging the dials. You show it an example, compare its answer to the right answer (the gap is the loss), and nudge every dial a hair in the direction that would have made the gap smaller. Do that across millions of examples and the dials settle into a configuration that, as Jang puts it, holds crystallized knowledge and experience. That nudging procedure is gradient descent; you don't need the calculus, just the picture of millions of dials slowly settling.
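The three ideas above can be sketched in a few lines of plain Python. This is a toy with two dials fitting the line y = 2x + 1, not AlphaGo's network; the function names, the dataset, and the learning rate are all illustrative assumptions.

```python
# Toy illustration: a "network" with two dials (w, b) tuned by gradient descent.
# Every name and constant here is an illustrative assumption, not from the source.

def forward(w, b, x):
    """One forward pass: a number goes in, a number comes out."""
    return w * x + b

def loss(w, b, data):
    """Mean squared gap between the network's answers and the right answers."""
    return sum((forward(w, b, x) - y) ** 2 for x, y in data) / len(data)

def train_step(w, b, data, lr=0.05):
    """Nudge each dial a hair in the direction that shrinks the loss."""
    n = len(data)
    grad_w = sum(2 * (forward(w, b, x) - y) * x for x, y in data) / n
    grad_b = sum(2 * (forward(w, b, x) - y) for x, y in data) / n
    return w - lr * grad_w, b - lr * grad_b

data = [(x, 2 * x + 1) for x in (-2, -1, 0, 1, 2)]  # examples with known answers
w, b = 0.0, 0.0                                      # dials start untuned
initial_loss = loss(w, b, data)
for _ in range(500):
    w, b = train_step(w, b, data)
final_loss = loss(w, b, data)

print(f"w={w:.3f}, b={b:.3f}, loss {initial_loss:.3f} -> {final_loss:.6f}")
```

A real network replaces the two dials with millions and the hand-written gradients with automatic differentiation, but the loop has the same shape.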

One design choice matters later. AlphaGo's network is a ResNet, a style built from convolutions, which look at small local patches of the board rather than the whole thing at once. Jang found ResNets beat the trendier transformers here, because Go fights are spatially local (what matters is usually near the stones in question), and that local bias is baked into convolutions for free ⌞jang⌟. The lesson generalises: in a low-data regime, a network that already "assumes" the right shape of the problem learns faster than a blank-slate one.

That's the whole toolkit. A board goes in; a few million dials, arranged in ten local-pattern layers, transform it; an answer comes out. Now — what answer do we want it to give?

· · ·
02

The two impossible problems — and the two networks

Go is hard for a brutally simple reason. The board is 19×19, so on the first move there are 361 places to play; the game runs 250–300 moves. Follow every branch and you get roughly 361^300 possible games, far more than there are atoms in the universe (⌞jang⌟: "This is something on the order of 361^300, which is far more than the number of atoms in the universe."). No computer can search that. Two impossibilities sit inside that number, and AlphaGo defeats each with one network.

361
moves available on the opening board (19×19 intersections)
~300
moves in a typical game — the depth problem
361^300
possible games — more than atoms in the universe
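To get a feel for the size of 361^300, it helps to count digits. The sketch below works in base-10 logarithms; the figure of roughly 10^80 atoms is a commonly quoted rough estimate that is assumed here, not taken from the source.

```python
import math

BOARD_POINTS = 19 * 19   # 361 intersections
GAME_LENGTH = 300        # rough move count used in the text
ATOMS_EXPONENT = 80      # assumption: ~10^80 atoms, a common rough estimate

# log10(361^300) = 300 * log10(361); logs keep the numbers readable.
games_exponent = GAME_LENGTH * math.log10(BOARD_POINTS)

# Python integers are arbitrary precision, so the exact digit count is available too.
exact_digits = len(str(BOARD_POINTS ** GAME_LENGTH))

print(f"361^300 is about 10^{games_exponent:.0f} ({exact_digits} digits)")
print(f"ratio to 10^{ATOMS_EXPONENT}: about 10^{games_exponent - ATOMS_EXPONENT:.0f}")
```

The count treats every point as playable on every move, so it is a loose upper bound in the same spirit as the text's "roughly".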

The breadth problem → the policy network

Too many moves to consider at each turn. The fix: a policy network that looks at the board and outputs a hunch — a probability over all 361 points saying "these few are worth considering." It narrows 361 down to a handful, the way a strong human barely notices most of the board.

The depth problem → the value network

Games are too long to play out to the end for every branch. The fix is the move Jang calls the most profound in all of AlphaGo: a value network that looks at a board mid-game and outputs a single number — the probability of winning from here — without playing the rest of the game at all.

The analogy that makes the value network click

Think about a strong Go player glancing at a board and instantly muttering "I'm probably going to lose." They didn't simulate the remaining 150 moves. They just know. Jang's framing: that human is running a value function in their head, a network that amortizes a huge number of possible playouts into a single glance (⌞jang⌟: "The human glances at the board and knows, 'I'm probably going to lose.' They're essentially running a neural network that looks at a board and implicitly amortizes a huge number of possible game playouts."). The value network is that intuition, made of silicon.

Figure · one trunk, two heads

board → ResNet trunk

Policy head → 361 numbers

"Which moves are worth a look?" — a hunch over the board. Solves breadth.

Value head → 1 number

"What's my win probability from here?" — no playout needed. Solves depth.

Both heads share one trunk reading a 3-channel board (black / white / empty). The original Lee Sedol version used two separate networks; every version since merges them into one trunk with two heads.
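The figure's shape (one shared representation, two outputs) can be sketched without any ML library. Everything below is a stand-in: the trunk output is a random 8-number vector instead of a real ResNet, the weights are untrained, and the names `two_heads`, `policy_weights`, and `value_weights` are hypothetical.

```python
import math
import random

NUM_POINTS = 19 * 19   # 361 board intersections
FEATURES = 8           # assumption: a tiny trunk output, just for illustration

rng = random.Random(0)

def linear(weights, features):
    """Dot each weight row with the shared feature vector."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical, untrained weights standing in for the two heads.
policy_weights = [[rng.gauss(0, 1) for _ in range(FEATURES)] for _ in range(NUM_POINTS)]
value_weights = [[rng.gauss(0, 1) for _ in range(FEATURES)]]

def two_heads(trunk_features):
    """Shared trunk output in; (move probabilities, win probability) out."""
    policy = softmax(linear(policy_weights, trunk_features))
    value = sigmoid(linear(value_weights, trunk_features)[0])
    return policy, value

trunk_features = [rng.gauss(0, 1) for _ in range(FEATURES)]  # stand-in for the trunk
policy, value = two_heads(trunk_features)
print(len(policy), round(sum(policy), 6), 0.0 < value < 1.0)
```

Sharing the trunk means the expensive board-reading work is done once per position and both questions are answered from the same features.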

Here is why Jang calls it profound, and it's the single deepest idea in the piece. A ten-layer network — ten steps of computation — collapses a search problem with more branches than the universe has atoms into one forward pass.

A 10-layer neural network pass… is able to amortize and approximate to very high fidelity a nearly intractable search problem.
⌞jang⌟ — the same phenomenon underwriting AlphaFold (protein folding) and AlphaTensor

This isn't unique to Go. The same trick, a smallish network amortizing an enormous simulation into a glance, is what powers AlphaFold folding proteins and AlphaTensor discovering algorithms. Jang admits it genuinely unsettles his sense of what's computationally hard (⌞jang⌟: "It actually makes me wonder if our understanding of problems like P=NP, or these fundamental computational hardness problems, is incomplete."): what felt like an impossibly hard problem keeps falling to a surprisingly simple, "macroscopic" solution. That is not a proof of anything, but it is a real crack in the intuition that hard problems must stay hard.

· · ·
03

Search — how Monte Carlo Tree Search actually turns

The policy network's snap hunch already plays a shockingly strong game. Jang notes you can just take its top suggestion and "shoot from the hip" for a formidable opponent (⌞jang⌟: "it'll be a very fast Go player that doesn't think in terms of reasoning steps. It just shoots from the hip, and it'll be a very strong Go player."). But to reach superhuman, you add thinking, and thinking, here, means search. The search procedure is Monte Carlo Tree Search (MCTS), and it runs fresh on every single move.

"Monte Carlo" just means using guided random sampling instead of exhaustive enumeration — you can't read every branch, so you sample the promising ones cleverly and average what you learn. MCTS builds a little tree of possible futures, four steps at a time:

Figure · one MCTS simulation, four steps

1 SELECT

Walk down

From the current board, follow the most promising line deeper into the tree, using the PUCT rule (below) to decide where to go.

2 EXPAND

Grow a leaf

Hit a board you've never seen? Add it to the tree and ask the policy head for its hunch about the moves from there.

3 EVALUATE

Score it

Ask the value head for the win probability of that new board — a single forward pass, no playing to the end.

4 BACKUP

Propagate

Carry that score back up the line you walked, updating the running average for every move along the way.

↻ repeat hundreds to thousands of times per move · then play the move that got visited most

Run that loop a few hundred times and the tree has quietly concentrated its attention on the strongest lines. The move chosen isn't the highest-scoring one — it's the most-visited one, because a move the search kept returning to is a move it trusts. The original Lee Sedol machine ran tens of thousands of these simulations per move on a room of TPUs; a modern trained bot needs far fewer, because, as we'll see, the network has absorbed the search into itself.
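The four steps can be sketched end to end on a game small enough to check by hand. The assumptions are heavy and deliberate: the game is a toy take-1-or-2 stone game (whoever takes the last stone wins) rather than Go, `policy_value` is a hypothetical stand-in for the network that returns a uniform hunch and a neutral value, values run from -1 (loss) to +1 (win) instead of a win probability, and the constant `c_puct=1.5` is arbitrary.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior      # policy hunch for the move leading here
        self.visits = 0
        self.value_sum = 0.0    # from the view of the player who moved INTO this node
        self.children = {}      # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def legal_moves(stones):
    return [take for take in (1, 2) if take <= stones]

def policy_value(stones):
    """Hypothetical stand-in for the network: uniform hunch, neutral value."""
    moves = legal_moves(stones)
    return {m: 1.0 / len(moves) for m in moves}, 0.0

def puct(parent, child, c_puct=1.5):
    """Exploit (running average) plus explore (prior, shrinking with visits)."""
    explore = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + explore

def simulate(root, stones):
    node, path = root, [root]
    # 1 SELECT: walk down while the current node has already been expanded.
    while node.children:
        parent = node
        move, node = max(parent.children.items(), key=lambda kv: puct(parent, kv[1]))
        stones -= move
        path.append(node)
    # 2 EXPAND + 3 EVALUATE: value is from the view of the player to move at the leaf.
    if stones == 0:
        value = -1.0  # the player to move has nothing to take: the opponent won
    else:
        priors, value = policy_value(stones)
        node.children = {m: Node(p) for m, p in priors.items()}
    # 4 BACKUP: flip the sign at each step so every node sees its own side's result.
    for visited in reversed(path):
        value = -value
        visited.visits += 1
        visited.value_sum += value

def best_move(stones, simulations=400):
    root = Node(prior=1.0)
    for _ in range(simulations):
        simulate(root, stones)
    # Play the most-visited move, not the highest-scoring one.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print(best_move(4), best_move(5))
```

In this toy game a position with a multiple of three stones is lost for the player to move, so the search should learn to leave the opponent on three: take one from four, take two from five.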

The one rule worth seeing — PUCT

Step 1 hides the only formula in this piece, and it's worth meeting because its three parts are the intuition of search: balance exploiting what looks good against exploring what you haven't tried.

Figure · what "pick the best branch" actually weighs

Exploit

How good it looks so far

The running average result (Q) from every simulation that has gone through this move. High when the move keeps leading to wins.

+

Explore

How much it deserves a fresh look

Large when the policy's hunch rated this move highly but the search has barely tried it. Shrinks every time the move gets visited — so attention spreads instead of fixating.

MCTS walks toward whichever move maximises exploit + explore. The policy network seeds the explore term with a prior; the value network feeds the exploit term through backup. The two networks are the search's eyes.
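Written out, the rule in the figure takes the following form. This is the version published for AlphaGo Zero; the source does not spell out which variant Jang's rebuild uses, so treat the exact shape of the explore term and the constant c_puct as assumptions. Here Q(s,a) is the running average result for move a from board s, P(s,a) is the policy head's prior, and N(s,a) is the visit count.

```latex
a^{*} = \arg\max_{a} \left[ \underbrace{Q(s,a)}_{\text{exploit}}
      + \underbrace{c_{\text{puct}} \, P(s,a) \,
        \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)}}_{\text{explore}} \right]
```

The denominator is what makes the explore term shrink each time a move is visited, and the prior in the numerator is how the policy network steers the search toward moves it rates highly.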

That interplay is the whole engine: the policy network proposes, the value network judges, and PUCT arbitrates between trusting a good track and checking a neglected one. Crucially, the tree is thrown away after each move — only one thing is kept, and that one thing is what makes AlphaGo learn.

· · ·
04

Self-play — the improvement operator

Here is where the three primitives lock together, and where your RL nouns — agent, state, action, reward, policy, value — finally become a loop. The thing MCTS keeps from each move is its final visit-count distribution: a sharper, search-improved opinion about what to play than the policy's raw hunch was. That improved opinion is the training signal.

Figure · the self-play training loop

PLAY

Play yourself

The agent plays a full game against a copy of itself, running MCTS on every move. Record, for each move: the board (state), the search's distribution (improved action), and eventually who won.

LABEL

Relabel every move

Each move now has a strictly better target than what was actually played — the search distribution. And each board gets labelled with the final game outcome (won / lost).

TRAIN

Nudge both heads

Train the policy head to predict the search distribution; train the value head to predict the outcome. Then play again with the improved network. ↻

No "reward signal" in the textbook RL sense — no TD error, no dynamic programming. Jang's striking claim: it reduces to supervised learning on improved labels.
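The relabelling step in the loop above can be made concrete with a small sketch. The record format, the move names, and the +1/-1 outcome encoding are assumptions for illustration; the source only says that the policy head is trained toward the search distribution and the value head toward the game result.

```python
import math

def search_target(visit_counts):
    """Normalise MCTS visit counts into the policy head's training target."""
    total = sum(visit_counts.values())
    return {move: n / total for move, n in visit_counts.items()}

def policy_loss(target, predicted, eps=1e-12):
    """Cross-entropy between the search distribution and the network's hunch."""
    return -sum(p * math.log(predicted.get(move, 0.0) + eps)
                for move, p in target.items())

def value_loss(outcome, predicted_value):
    """Squared gap between the final result and the value head's guess."""
    return (outcome - predicted_value) ** 2

def label_game(records, winner):
    """records: (board, player_to_move, visit_counts) tuples saved during self-play."""
    examples = []
    for board, player, visit_counts in records:
        outcome = 1.0 if player == winner else -1.0  # assumed encoding
        examples.append((board, search_target(visit_counts), outcome))
    return examples

# Hypothetical two-move record; boards are placeholder strings.
records = [
    ("board_0", "black", {"D4": 60, "Q16": 30, "K10": 10}),
    ("board_1", "white", {"Q4": 80, "C3": 20}),
]
examples = label_game(records, winner="black")

target = examples[0][1]
uniform_hunch = {move: 1.0 / len(target) for move in target}
print(policy_loss(target, uniform_hunch), policy_loss(target, target))
```

Both losses are ordinary supervised losses, which is the sense in which the loop reduces to supervised learning on improved labels: every position yields a full distribution to imitate, whether the game was won or lost.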

Sit with what step two is doing, because it's the heart of the whole machine. Even in a game the agent lost, MCTS hands back a better move for every single position along the way:

You played this game where you eventually lost, but on every single action, I'm going to give you a strictly better action that you should have taken instead.
⌞jang⌟ — MCTS as a per-move improvement operator

The analogy — learning to drive by being corrected

Jang ties this to DAgger, a trick from teaching robots by imitation. Picture a self-driving car that drifts toward the shoulder. Even in that bad spot, there exists a correct action: steer back (⌞jang⌟: "Even if you're in a not-great state, for example a self-driving car that veers off the side of the road, there is still a valid action that corrects you and brings you back."). Whether the car ultimately crashes is beside the point; every moment still has a right answer you can learn from. AlphaGo manufactures that right answer at every move, win or lose. That's why it never has to untangle the maddening question "which of my 300 moves actually lost me the game?"

And notice the elegance of the starting conditions. Most reinforcement learning begins at a 0% success rate and has to stumble onto its first win by luck before it has anything to learn from: the dreaded exploration problem. AlphaGo sidesteps it: because MCTS always produces a better label, there's always a gradient to climb (⌞jang⌟: "you never have to initialize at a zero percent success rate and solve the exploration problem of how to get to a non-zero success rate. This is what allows you to hill-climb this beautiful supervised learning signal."). This is also why Jang insists initialization is everything: he warm-started his bot against the open-source KataGo rather than from a blank slate, skipping the most expensive part of the original effort.

· · ·
05

Why this is not how a chatbot learns

Now the payoff promised at the start. When an LLM is trained with reinforcement learning, it does not get a better label for every step. It writes a whole answer — thousands of tokens — gets told only "right" or "wrong" at the very end, and must somehow figure out which tokens deserve the credit. This is the credit assignment problem, and AlphaGo simply doesn't have it.

Figure · where the learning signal lands

AlphaGo

a target on every move

Every move carries its own improved label from search. Hundreds of dense supervision signals per game.

LLM reinforcement

one verdict at the end

one trajectory · ~100,000 tokens ✓ / ✗

A single right/wrong lands on the whole sequence. Which token earned it? Unknown. Karpathy: "sucking supervision through a straw."

Same idea of "learn from outcomes," opposite density of signal. In a 300-move game one decisive move may sit among ~299 neutral ones, yet MCTS labels each individually, so the needle never hides in the haystack.

There's a deeper way to say why this matters, and it's about information per example. Supervised learning (predict this specific move) delivers a rich, detailed signal every time. Reinforcement learning delivers at most a single bit per episode (won or lost), and worse, when the model rarely succeeds, even that bit carries almost no information (⌞jang⌟: "naive RL delivers only the entropy of a binary variable, and because most of training is spent at very low pass rates, RL spends virtually all its time in the near-zero information regime."). Most of training is spent in exactly that starved regime. Jang adds a killer caveat: if your policy would never even try the right answer, RL gives you no signal at all, whereas a supervised label always teaches you something (⌞jang⌟: "If your policy has no chance of sampling 'blue,' then you will never get a signal.").
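The "entropy of a binary variable" has a closed form, so the starved regime can be put in numbers. The sketch below compares the information in a pass/fail verdict at several pass rates against the ceiling for a single move label on a 361-point board; the pass rates are illustrative assumptions, and the ceiling is an upper bound that a real label only reaches if every point is equally likely.

```python
import math

def binary_entropy(p):
    """Bits of information in a pass/fail outcome that succeeds with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Ceiling on what one supervised move label can carry: choosing 1 of 361 points.
max_bits_per_move_label = math.log2(19 * 19)

for pass_rate in (0.5, 0.1, 0.01, 0.001):
    bits = binary_entropy(pass_rate)
    print(f"pass rate {pass_rate:>5}: {bits:.4f} bits per episode")
print(f"one move label: up to {max_bits_per_move_label:.2f} bits")
```

The verdict is worth a full bit only when the model succeeds half the time; at a 1% pass rate it is worth well under a tenth of a bit, and that single verdict is spread over the whole trajectory.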

This is also, quietly, why distillation works so well. AlphaGo doesn't train the policy to copy the single best move; it trains it to copy the whole search distribution — a "soft" target carrying far more information per example than a hard yes/no. The same "dark knowledge" reason a small model learns efficiently by imitating a big model's full distribution rather than just its top answer.

So why not just bolt MCTS onto LLMs?

People have tried; Jang is skeptical it transfers cleanly, for two concrete reasons grounded in the mechanics we just built:

Value estimation is harder. Go hands you a crisp, cheatproof outcome — you won or you didn't. Mid-way through a chain of reasoning or a piece of code, "how good is this so far?" is genuinely hard to score, and the value network was the load-bearing piece.

The action space is too wide. PUCT's explore term only works because a good Go move gets revisited: its visit count climbs and the math responds. In language, the "moves" are tokens, and the space is so vast you'll essentially never sample the same continuation twice, so that visit count never accumulates and the heuristic goes dead (⌞jang⌟: "In an LLM, you're most likely never going to sample the same child more than once… a discrete set of actions is not really an appropriate choice for an LLM."). Jang doesn't rule out a future MCTS-flavoured mechanism for rigidly logical domains like mathematics, but calls the jury still out.
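A quick experiment shows why visit counts stop accumulating. The sketch samples actions uniformly, which is a deliberate simplification (a trained policy is peaked and would repeat itself more often), and the budget of 800 simulations, the 50,000-token vocabulary, and the 4-token continuation length are all assumed numbers chosen for illustration.

```python
import random
from collections import Counter

def max_visit_count(num_actions, simulations, seed=0):
    """Sample actions uniformly; report how often the most-sampled one recurs."""
    rng = random.Random(seed)
    visits = Counter(rng.randrange(num_actions) for _ in range(simulations))
    return max(visits.values())

SIMULATIONS = 800            # assumption: an illustrative per-move search budget
GO_ACTIONS = 19 * 19         # 361 points on the board
TEXT_ACTIONS = 50_000 ** 4   # assumption: 4-token continuations, 50k-token vocabulary

go_max = max_visit_count(GO_ACTIONS, SIMULATIONS)
text_max = max_visit_count(TEXT_ACTIONS, SIMULATIONS)

print(f"Go: busiest action visited {go_max} times")
print(f"Text: busiest action visited {text_max} time(s)")
```

With 800 samples over 361 actions some action must recur at least three times, so the count in PUCT's denominator has something to work with. Over a space of about 6 × 10^18 continuations, a repeat is vanishingly unlikely and every count stays at one.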

The bridge to your own project

This is exactly the line your six qui prend work already walked. 2-player 6 nimmt! has fully known, low-dimensional rules: the game is a perfect simulator, so there's nothing to learn about its dynamics. That puts it squarely on the AlphaGo side of this divide: the right shape is lightweight MCTS over the known rules, with a policy/value net guiding the search ⌞six-qui⌟, not the heavier learned-world-model machinery built for environments whose rules are unknown. The credit-assignment elegance in this movement is precisely what your agent gets to exploit.

· · ·
06

$7K, and a robot doing the research

The original AlphaGo was a flagship effort: a team of DeepMind scientists and, by the AlphaGo Zero era, training runs on the order of 10^23 floating-point operations, a scale comparable to a frontier language model. Jang rebuilt a competitive bot for roughly $7,000 of rented compute (funded by a Prime Intellect donation). The gap isn't magic; it's a clean illustration of a general law:

The compute required to be the first to do something is always much larger than the compute it takes to catch up.
⌞jang⌟ — once a thing exists, you get to use crutches: distillation, a strong initialization, a decade of hardware

The catch-up crutches were exactly the ones this piece has named: warm-starting against KataGo instead of from zero; Blackwell-class GPUs making many of KataGo's 2020 compute tricks irrelevant; and a neat shortcut of co-training on small 9×9 boards alongside full 19×19 ones, so the value network learns endgames quickly before scaling up. Jang is careful to flag these as honest "vibe guesses" from his own runs, not peer-reviewed claims ⌞jang⌟.

The coda — Go as a proving ground for automating research itself

Throughout the rebuild, Jang ran an autoresearch loop: frontier models (Opus 4.6 and 4.7) driving parts of the research process. The split he found is the most quietly important thing in the interview, because it's a real datapoint on the "intelligence explosion" question — what happens when AI starts doing AI research.

Figure · what the AI researcher could and couldn't do

Already good at

the inner loop

Hyperparameter search across open-ended options · inspecting gradient norms and reacting · rewriting data loaders · running experiments and compiling the plots and reports. Give it an axis to explore and it executes.

Still bad at

the outer loop

Choosing which experiment to run next when a track is failing · the lateral step of "wait, this whole assumption is wrong, go back to first principles." It optimises harder inside the frame instead of escaping it.

Jang visualises his project as a tree of experiments — nodes are outcomes, branches are follow-ups. The models grow the branches well; a human still has to prune the dead tracks and start new ones.

Why is Go the ideal arena to study this? Because it closes the loop honestly: a Go game is fast to verify and hard to reward-hack (win rate against KataGo is an unambiguous external judge), while the inner work (distributed systems, architecture choices, hyperparameter tuning) maps cleanly onto general ML research (⌞jang⌟: "Go captures a lot of very interesting research problems… Yet it's very quick to verify. The outer loop is ultimately: does the agent do what I think it does?"). Whether competence on that inner loop ever transfers to the truly hard part (choosing the right question, discovering a new phenomenon like scaling laws) is left open. Which lands us, fittingly, back where we started: AlphaGo as the cleanest mirror we have for how the next intelligence might learn, and where it might still be stuck.

Glossary — the hero's toolkit

Parameter (weight)
One of the millions of tunable numbers inside a network. Training = finding the configuration of all of them that maps inputs to good outputs. AlphaGo's net had under 3 million.
Forward pass
One front-to-back run of the network: input in, answer out. Fast (milliseconds). The whole point of the value network is doing useful work in a single one.
Policy network (the hunch)
Outputs a probability over all 361 moves — "which are worth considering." Solves the breadth problem. On its own it's already a strong, instinctive player.
Value network (the glance)
Outputs one number — win probability from this board — without playing it out. Solves the depth problem. Jang's "most profound" piece of AlphaGo.
MCTS (search)
Monte Carlo Tree Search. Builds a tree of likely futures by repeating Select → Expand → Evaluate → Backup hundreds of times per move, then plays the most-visited move. "Monte Carlo" = guided random sampling instead of brute enumeration.
PUCT
The rule MCTS uses to choose where to look next: exploit (how good a move looks so far) + explore (how under-tried a promising move is). Balances trusting vs checking.
Self-play
The agent plays itself, MCTS improves every move into a better label, the network trains on those labels, repeat. The "improvement operator." Reduces to supervised learning on improved labels.
Credit assignment
The problem of knowing which action in a long sequence deserves the credit/blame for the outcome. LLM reinforcement suffers it badly; AlphaGo dodges it entirely by labelling every move.
Policy gradient / REINFORCE
The standard "learn from outcomes" RL family used in LLMs: push up the probability of actions from winning trajectories. High variance because the signal is sparse; the lineage (baselines → actor-critic → PPO) is one long campaign to tame that variance ⌞weng⌟.
Distillation
Training a network on another's full distribution (a soft target) rather than its single top answer — far more information per example. Why AlphaGo trains on the whole search distribution, not just the chosen move.
KataGo
The strong open-source Go engine (2020) Jang warm-started against — the "strong initialization" that turned a multi-million-dollar effort into a $7K one.