| Thread Navigator

Canvas & Ratio

Choose your destination platform format

Layout Template

Choose a content structure for your slides

Preset Themes

Typography & Sizing

Font Family

Title Font Size36px

Body Font Size18px

Header & Footer Size12px

Brand Kit Customization

AGENCY

Configure brand assets for headers & footers

MULTI-PROFILES (AGENCY)

Active Brand Profile

Show Brand Watermark

Brand Watermark Text

Social Handle

Brand Logo URL (PNG) AGENCY

SAVE PRESETS (AGENCY)

Save current as Preset

Outro Slide CTA

Customize your closing call-to-action slide

CTA Title

CTA Message & Emojis

Custom CTA Buttons

Background Pattern

Source Content

Build Your Carousel

Drag and drop any post card below onto a slide, or use the quick buttons to insert content/images instantly!

Drag Post #1

Rohan Paul

@rohanpaul_ai

Absolutely classic @GoogleResearch paper on In-Context-Learning by LLMs. Shows the mechanisms of how LLMs learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change. 💡The mechanism they reveal for in-context-learning. When the model reads a few examples in your prompt, it figures out a pattern (like a small rule or function). Instead of permanently changing its stored weights, it forms a temporary adjustment that captures this pattern. That adjustment can be written mathematically as a rank-1 matrix, meaning it only adds one simple direction of change to the existing weights. This rank-1 update is “low-rank”, so it is very cheap and compact. But it still lets the model shift its behavior to fit the examples in the prompt. Once the prompt is gone, that temporary rank-1 tweak also disappears. So, in simple terms: The paper shows that in-context learning happens because the model internally applies a temporary rank-1 (very simple) weight update based on your examples, instead of permanently retraining itself. --- That behavior looks impossible if learning always means gradient descent. The authors ask whether the transformer’s own math hides an update inside the forward pass. They show, each prompt token writes a rank 1 tweak onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a 1‑step finetune. Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt. --- Shows that the attention part can take what it found in your prompt and package it into a tiny “instruction” that, for this 1 forward pass, acts exactly like a small temporary change to the MLP’s weights. Nothing is saved to disk, yet the block behaves as if the MLP just got a low-rank tweak computed from your examples. Remove the prompt, the tweak disappears, the saved weights stay the same. As the model reads your examples token by token, it keeps refining that temporary tweak. Each new token nudges the MLP a bit more toward the rule implied by your examples, similar to taking small gradient steps, again only for this pass. When the examples have done their job, those nudges shrink toward 0, which is what you want when the pattern has been “locked in” for the current answer. 🧵 Read on 👇

Apply Image

Drag Post #2

Rohan Paul

@rohanpaul_ai

🧵2/n. ⚙️ The Core Idea They call any layer that can read a separate context plus a query a “contextual layer”. Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”. For that block, the context acts exactly like a rank 1 additive patch on the first weight matrix, no matter what shape the attention takes.

Apply Image

Drag Post #3

Rohan Paul

@rohanpaul_ai

🧵3/n. 🛠️ Temporary rank 1 patch A transformer block first runs the self‑attention layer and gets two things for the query token: the usual activation and the tiny difference between “with context” and “without context”. It multiplies that difference by the frozen weight matrix, then projects the result back through the query activation. The outcome is a one‑column times one‑row outer product, so the whole tweak has rank 1 and adds almost no storage overhead. In the very next instruction the block behaves exactly as if the real weight matrix had been replaced by that patch plus the original weights, even though nothing on disk changed . 🌀 Why the change vanishes after each run The patch lives only inside the forward pass. Once the model finishes processing the current token, the computation graph is cleared and the base weights revert to their untouched state. Because the next token builds its own patch from scratch, no cumulative edit sticks around in memory, yet during the pass the effect is the same as a quick one‑step fine‑tune . Put simply, each prompt token writes a throw‑away sticky note on top of the first weight matrix, lets the model read that note to answer the query, then tosses it out before the weights ever hit the file system.

Apply Image

Drag Post #4

Rohan Paul

@rohanpaul_ai

🧵4/n. 🧩 How the Patch Works Theorem 2.2 shows a formula: multiply the base weights by the context change vector, then project it with the query representation, boom, you get the patch. Because the patch is rank 1, it stores almost no extra parameters yet still carries the full prompt signal. So the network behaves as if it fine‑tuned itself, even though no optimizer ran.

Apply Image

Drag Post #5

Rohan Paul

@rohanpaul_ai

🧵5/n. 📐 Hidden Gradient Descent Feeding tokens one by one stacks these tiny patches. Proposition 3.1 proves each added token shifts the weights the same way online gradient descent would, with a step size tied to the query vector length. The shift shrinks as soon as a token stops adding new info, matching the feel of a converging optimizer.

Apply Image

Drag Post #6

Rohan Paul

@rohanpaul_ai

🧵6/n. 🔬 Testing on Simple Linear Tasks They train a small transformer to map x→w·x using 50 prompt pairs plus 1 query. When they swap the prompt for its equivalent rank 1 patch and feed only the query, the loss curve overlaps the full‑prompt run almost perfectly. That overlap stays tight through 100 training steps.

Drag Post #7

Rohan Paul

@rohanpaul_ai

🧵7/n. 🤝 Finetune vs. Implicit Patch They compare classic gradient finetuning on the same examples to the single‑shot patch strategy. Both methods cut test loss in a similar pattern, yet the patch avoids any real back‑prop and keeps the rest of the network frozen.

Drag Post #8

Rohan Paul

@rohanpaul_ai

🧵8/n. 🔎 Limits They Admit Results cover only the first generated token and one transformer block without MLP skip, so full‑stack models need more work. Still, the finding hints that many in‑context tricks come from weight geometry rather than quirky attention rules.

Drag Post #9

Rohan Paul

@rohanpaul_ai

Paper – <a target="_blank" href="https://arxiv.org/abs/2507.16003" color="blue">arxiv.org/abs/2507.16003</a> Paper Title: "Learning without training: The implicit dynamics of in-context learning"