@alex_prompter: Everyone says “LLMs are black ...

1

Everyone says “LLMs are black boxes.”

This paper, “How Do LLMs Use Their Depth?”, just opened one and showed how intelligence forms layer by layer.

They follow a “Guess → Refine” strategy:

• Early layers make statistical guesses using frequent tokens (“the”, “of”, “and”)
• Middle layers pull in context to test those guesses
• Later layers refine them into precise, context-aware predictions

Across GPT-2-XL, Llama-2-7B, Llama-3-8B, and Pythia-6.9B, ~80% of early guesses get replaced before the final layer.

Even cooler: models use depth dynamically. Easy tasks (like punctuation or determiners) finish early, while hard ones (like fact recall or reasoning) go deeper.

In short:

LLMs aren’t just deep networks. They’re layered thinkers: early guessers, late reasoners.

Paper: arxiv.org/abs/2510.18871

2

This single diagram captures the paper’s entire idea.

LLMs act like early guessers and late reasoners.

Early layers throw out high-frequency guesses (“the”, “is”, “and”).

Later layers refine them into meaningful, context-aware answers. Think intuition → reasoning → conclusion.

3

Wild stat: In the first layer of Pythia-6.9B, 75% of top predictions are just the 10 most common words.

By the final layer, that number drops to ~30%.

Early layers rely purely on word frequency: they’re guessing from statistical priors before context even forms.
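
Here’s a rough way to compute a stat like that yourself. This is a minimal sketch, not the paper’s code: the function name `top_frequent_share`, the toy corpus, and the toy per-layer predictions are all made up for illustration, and it assumes you already have each layer’s top-1 token per position (for example from a lens probe).

```python
from collections import Counter

def top_frequent_share(layer_top1, corpus_tokens, k=10):
    """Fraction of a layer's top-1 predictions that fall among the
    k most common tokens of a reference corpus."""
    frequent = {tok for tok, _ in Counter(corpus_tokens).most_common(k)}
    hits = sum(1 for tok in layer_top1 if tok in frequent)
    return hits / len(layer_top1)

# Toy data, invented for illustration (not results from the paper).
corpus = "the cat sat on the mat and the dog sat by the rug".split()
early_layer = ["the", "the", "cat", "the"]   # hypothetical first-layer top-1 tokens
final_layer = ["cat", "sat", "on", "mat"]    # hypothetical final-layer top-1 tokens

# k=2 here only because the toy corpus is tiny; the thread's stat uses the top 10.
early_share = top_frequent_share(early_layer, corpus, k=2)
final_share = top_frequent_share(final_layer, corpus, k=2)
print(early_share, final_share)
```

A high share in early layers that shrinks toward the final layer is the pattern described above.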

4

Here’s the proof that early guesses aren’t permanent.

~80% of first-layer predictions get overturned by the end.

Even frequent tokens (“the”, “and”) are refined 70%+ of the time.
The model doesn’t decide once — it debates itself layer by layer.
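
One simple way to put a number on “overturned”: compare each position’s first-layer top-1 token with its final-layer top-1 token and count the mismatches. A minimal sketch under that assumption; `overturn_rate` and the toy lists are hypothetical, and the paper may define the metric differently.

```python
def overturn_rate(first_layer_top1, final_layer_top1):
    """Fraction of positions whose first-layer top-1 token differs
    from the final-layer top-1 token."""
    if len(first_layer_top1) != len(final_layer_top1):
        raise ValueError("both layers must cover the same positions")
    changed = sum(1 for a, b in zip(first_layer_top1, final_layer_top1) if a != b)
    return changed / len(first_layer_top1)

# Toy predictions, invented for illustration (not results from the paper).
first = ["the", "the", "and", "the", "of"]
final = ["the", "cat", "sat", "mat", "of"]

rate = overturn_rate(first, final)
print(rate)
```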

5

LLMs literally use depth based on word type.

Function words (DET, ADP, PUNCT) stabilize around layer 5,
while content-heavy words (NOUN, VERB, ADJ) take 15–20 layers.

Easy = shallow, hard = deep.
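
“Stabilizes around layer N” can be read as: the earliest layer from which the top-1 prediction already equals the final answer and never changes again. Here’s a minimal sketch of that reading, averaged per part-of-speech tag. The function names, the 4-layer toy trajectories, and the definition itself are assumptions for illustration, not the paper’s exact procedure.

```python
from collections import defaultdict

def stabilization_layer(top1_by_layer):
    """Earliest layer index from which the top-1 token equals the
    final-layer token at every later layer."""
    final = top1_by_layer[-1]
    layer = len(top1_by_layer) - 1
    # Walk backwards while the previous layer already agrees with the final answer.
    while layer > 0 and top1_by_layer[layer - 1] == final:
        layer -= 1
    return layer

def mean_stabilization_by_pos(examples):
    """Average stabilization layer per POS tag.
    `examples` is a list of (pos_tag, top1_by_layer) pairs."""
    buckets = defaultdict(list)
    for pos, trajectory in examples:
        buckets[pos].append(stabilization_layer(trajectory))
    return {pos: sum(v) / len(v) for pos, v in buckets.items()}

# Toy 4-layer trajectories, invented for illustration.
examples = [
    ("DET",  ["the", "the", "the", "the"]),   # stable from layer 0
    ("DET",  ["and", "the", "the", "the"]),   # stable from layer 1
    ("NOUN", ["the", "the", "dog", "cat"]),   # stable only at layer 3
    ("NOUN", ["the", "cat", "the", "cat"]),   # flips back, so stable only at layer 3
]

means = mean_stabilization_by_pos(examples)
print(means)
```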

6

Multi-token answers like “New York City” expose how reasoning compounds.

The first token (“New”) needs 25+ layers of compute.

Later tokens (“York”, “City”) appear much earlier (~12–20). That’s depth scaling with complexity in real time.

7

To prove these insights aren’t probe artifacts, they compared TunedLens vs LogitLens.

Only TunedLens matched the final-layer probability distribution.

Meaning: the “guess → refine” behavior is real, not a decoding illusion.
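
The thread doesn’t say how “matched” was measured. One common choice is the KL divergence between the model’s final-layer distribution and the distribution a lens decodes; that choice is an assumption here. A minimal sketch with made-up logits, where `lens_a_logits` and `lens_b_logits` stand in for two hypothetical probes:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy logits over a 3-token vocabulary, invented for illustration.
final_logits = [2.0, 0.5, -1.0]    # the model's actual final-layer output
lens_a_logits = [1.9, 0.6, -1.0]   # hypothetical probe that tracks the model closely
lens_b_logits = [-1.0, 0.5, 2.0]   # hypothetical probe that drifts from the model

final_probs = softmax(final_logits)
kl_a = kl_divergence(final_probs, softmax(lens_a_logits))
kl_b = kl_divergence(final_probs, softmax(lens_b_logits))
print(kl_a, kl_b)
```

The lower the divergence, the more faithfully that probe reflects what the model actually outputs.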

8

They even masked high-frequency words (“the”) 1000× less during training.

They still appeared as early top predictions.

That means early layers genuinely encode frequency priors, not probe bias.

9

LLMs don’t think in one pass.

They guess, test, refine, and decide across their depth.

Each layer isn’t just computation; it’s a thought step.

We’re literally watching models reason in slow motion.

github.com/akshat57/how-d…
