Alex Prompter

@alex_prompter

Everyone says “LLMs are black boxes.” The paper “How Do LLMs Use Their Depth?” just opened one and showed how intelligence forms layer by layer.

They follow a “Guess → Refine” strategy:
• Early layers make statistical guesses using frequent tokens (“the”, “of”, “and”)
• Middle layers pull in context to test those guesses
• Later layers refine them into precise, context-aware predictions

Across GPT-2-XL, Llama-2-7B, Llama-3-8B, and Pythia-6.9B, ~80% of early guesses get replaced before the final layer.

Even cooler: models use depth dynamically. Easy tasks (like punctuation or determiners) finish early, while hard ones (like fact recall or reasoning) go deeper.

In short: LLMs aren’t just deep networks. They’re layered thinkers: early guessers, late reasoners.

Paper: arxiv.org/abs/2510.18871

This single diagram captures the paper’s entire idea. LLMs act like early guessers and late reasoners. Early layers throw out high-frequency guesses (“the”, “is”, “and”). Later layers refine them into meaningful, context-aware answers. Think intuition → reasoning → conclusion.

Wild stat: In the first layer of Pythia-6.9B, 75% of top predictions are just the 10 most common words. By the final layer, that number drops to ~30%. Early layers rely purely on word frequency: they’re guessing from statistical priors before context even forms.
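
A statistic like this can be measured by counting, for each layer, how many top-1 predictions fall among the k most frequent tokens. The sketch below is a minimal illustration on made-up toy data. The function name `frequent_token_share` and every value in it are assumptions for the example, not the paper's code or numbers.

```python
from collections import Counter

def frequent_token_share(top1_by_layer, corpus_tokens, k=10):
    """For each layer, return the fraction of top-1 predictions that are
    among the k most frequent tokens in corpus_tokens."""
    frequent = {tok for tok, _ in Counter(corpus_tokens).most_common(k)}
    return [
        sum(tok in frequent for tok in preds) / len(preds)
        for preds in top1_by_layer
    ]

# Toy data, invented for illustration only.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
top1_by_layer = [
    ["the", "the", "on", "the"],   # early layer: mostly frequent words
    ["the", "cat", "on", "rug"],   # middle layer
    ["dog", "cat", "mat", "rug"],  # final layer: context-specific words
]

# With k=3 the frequent set is {"the", "sat", "on"}.
shares = frequent_token_share(top1_by_layer, corpus, k=3)
print(shares)  # [1.0, 0.5, 0.0]
```

On real data, `top1_by_layer` would come from decoding each layer's hidden state with a lens, and the share would be expected to fall with depth in the way the post describes.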

Here’s the proof that early guesses aren’t permanent. ~80% of first-layer predictions get overturned by the end. Even frequent tokens (“the”, “and”) are refined 70%+ of the time. The model doesn’t decide once — it debates itself layer by layer.
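
One way to compute an overturn rate of this kind is to compare each position's first-layer top-1 prediction with its final-layer top-1 prediction. This is a rough sketch under that assumption. The helper `overturn_rate`, its `only_token` filter, and the toy predictions are hypothetical and are not taken from the paper's code.

```python
def overturn_rate(first_preds, final_preds, only_token=None):
    """Fraction of positions whose first-layer top-1 prediction differs
    from the final-layer top-1 prediction.

    If only_token is given, score only positions where the first-layer
    guess was that token."""
    if len(first_preds) != len(final_preds):
        raise ValueError("prediction lists must have the same length")
    pairs = [
        (early, late)
        for early, late in zip(first_preds, final_preds)
        if only_token is None or early == only_token
    ]
    if not pairs:
        raise ValueError("no positions to score")
    return sum(early != late for early, late in pairs) / len(pairs)

# Toy predictions, invented for illustration only.
first = ["the", "the", "and", "the", "of"]
final = ["cat", "the", "sat", "mat", "rug"]

overall = overturn_rate(first, final)                  # 4 of 5 changed
for_the = overturn_rate(first, final, only_token="the")  # 2 of 3 changed
print(overall, for_the)
```

The `only_token` filter mirrors the per-token breakdown in the post, where even guesses of frequent tokens are checked for later revision.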

LLMs literally use depth based on word type. Function words (DET, ADP, PUNCT) stabilize around layer 5, while content-heavy words (NOUN, VERB, ADJ) take 15–20 layers. Easy = shallow, hard = deep.
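
To make "stabilize around layer N" concrete, one possible definition is the earliest layer from which the top-1 prediction equals the final prediction and never changes again. The sketch below averages that layer per part-of-speech tag. The definition, the function names, and the six-layer toy traces are all assumptions for illustration, and the paper may define stabilization differently.

```python
from collections import defaultdict
from statistics import mean

def settle_layer(preds_by_layer):
    """Earliest layer index from which the top-1 prediction matches the
    final-layer prediction and stays unchanged through the last layer."""
    final = preds_by_layer[-1]
    layer = len(preds_by_layer) - 1
    while layer > 0 and preds_by_layer[layer - 1] == final:
        layer -= 1
    return layer

def mean_settle_layer_by_tag(examples):
    """examples: iterable of (pos_tag, per-layer top-1 predictions)."""
    by_tag = defaultdict(list)
    for tag, preds in examples:
        by_tag[tag].append(settle_layer(preds))
    return {tag: mean(layers) for tag, layers in by_tag.items()}

# Toy traces over 6 layers, invented for illustration only.
examples = [
    ("DET",  ["the", "the", "the", "the", "the", "the"]),      # settles at 0
    ("DET",  ["of", "a", "a", "a", "a", "a"]),                 # settles at 1
    ("NOUN", ["the", "the", "dog", "cat", "cat", "cat"]),      # settles at 3
    ("NOUN", ["the", "a", "the", "a", "river", "river"]),      # settles at 4
]

result = mean_settle_layer_by_tag(examples)
print(result)  # {'DET': 0.5, 'NOUN': 3.5}
```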

Multi-token answers like “New York City” expose how reasoning compounds. The first token (“New”) needs 25+ layers of compute. Later tokens (“York”, “City”) appear much earlier (~12–20). That’s depth scaling with complexity in real time.

To prove these insights aren’t probe artifacts, they compared TunedLens vs LogitLens. Only TunedLens matched the final-layer probability distribution. Meaning: the “guess → refine” behavior is real, not a decoding illusion.
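
Whether a lens "matches" the final-layer distribution can be quantified with a divergence such as KL. The sketch below compares two hypothetical lens outputs against a final-layer distribution. The logits are invented, and both the choice of KL and its direction are assumptions, so this is not a reproduction of the paper's measurement or of actual TunedLens or LogitLens output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a 3-token vocabulary, invented for illustration.
final_probs = softmax([2.0, 0.5, -1.0])   # model's final-layer output
close_lens = softmax([1.9, 0.6, -1.0])    # a lens that tracks the final output
far_lens = softmax([-1.0, 0.5, 2.0])      # a lens that does not

kl_close = kl_divergence(final_probs, close_lens)
kl_far = kl_divergence(final_probs, far_lens)
print(kl_close < kl_far)  # True: the closer lens has the lower divergence
```

A lens whose last-layer readout has near-zero divergence from the model's own output is a more trustworthy probe for the earlier layers, which is the point the comparison in the post relies on.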

They even masked high-frequency words (“the”) 1000× less during training. Those words still appeared as top predictions in early layers. That means early layers genuinely encode frequency priors, not probe bias.

LLMs don’t think in one pass. They guess, test, refine, and decide across their depth. Each layer isn’t just computation; it’s a thought step. We’re literally watching models reason in slow motion.

Code: github.com/akshat57/how-do-llms-use-their-depth