Thread Truncated (Cap Enforced)

Only the first 20 tweets are unrolled into slides to ensure reliable PDF exporting and high server performance.

Jeffrey Emanuel

@doodlestein

I sometimes find myself preferring to read the thinking traces that lead to a response from an LLM over the final, polished response.

I think this relates to the distinction in mathematics between “analysis” and “synthesis.”

The mathematician Gauss famously tried to remove all traces of the analysis that allowed him to arrive at the answer from his mathematical works.

He wanted the final “synthetic” results to flow perfectly from the definitions, axioms, and pure logic, as if handed down by God.

While this is certainly “cleaner” and more polished and dense, what some people call elegance in math, it’s often not particularly illuminating for the reader, especially if they’re not a super genius like Gauss.

I find myself most engaged by technical writing when I can understand the motivation at each step and effectively put myself in the position of the discoverer and see how I could have come to the same conclusion when presented with the same facts and framing.

I’ll give you an example using Claude Opus 4.8 with max reasoning effort. I asked it the following:

—

With the benefit of hindsight, what would you say were the things that made Transformers “obvious” and unsurprisingly effective for modeling very complex relationships in optimal ways that generalize well? To what extent is this related to them being, in effect, a “fully differentiable Turing machine”?

—

Below is the full reasoning trace. Sure, some of it is logistical in nature, but seeing how it approached the question, the various threads it pursued, and then finally how it wove things together, is a lot more illuminating for me than the final response, which I’ll also post below the trace:

—

The user is asking a conceptual question about why transformer architectures seem obvious in retrospect and how they relate to the idea of transformers as fully differentiable Turing machines. This is a deep theoretical ML question that doesn't require any external research—it's about foundational concepts I can reason through, and based on their background in mathematical physics and probability, they're looking for a substantive theoretical explanation.

I should be careful here because the "differentiable Turing machine" framing is actually somewhat contested and needs precision. Let me think through the genuine structural reasons why transformers are effective—starting with attention as a soft, content-based lookup mechanism, which is essentially a differentiable kernel smoother that learns its own representations through the query-key-value structure.
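To make the "differentiable kernel smoother" reading concrete, here is a minimal sketch in plain Python. It is not taken from the trace: the function names `softmax` and `attend` and the toy keys and values are assumptions chosen for illustration, and a real transformer would first compute queries, keys, and values through learned projections. The point is only that the output is a weighted average of the values, with weights set by query-key similarity.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability; the result sums to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product similarity between the query and every key.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Similarities become normalized, non-negative smoothing weights.
    weights = softmax(scores)
    # The output is a convex combination (weighted average) of the values.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy data (illustrative only): three stored items, one lookup.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
output, weights = attend([2.0, 0.0], keys, values)
```

Because every step is a smooth function of the query, keys, and values, gradients can flow through the lookup, which is what distinguishes it from a hard dictionary fetch.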

The real power comes from input-dependent connectivity: unlike fixed-weight architectures, attention computes routing weights directly from the data itself, creating multiplicative interactions that are far more expressive per parameter. This connects to ideas like fast weights and hypernetworks—the computation graph itself adapts based on the input. Then there's the inductive bias: treating sequences as sets with explicit position encoding decouples "what" information from "where," which is a clean separation that works well for sequential data.

The architecture also solves long-range dependencies elegantly—any two tokens are just one attention hop apart, giving constant-distance paths compared to the O(n) sequential bottleneck of RNNs or the O(log n) to O(n/k) depth of stacked CNNs. This structural solution to credit assignment is more fundamental than gating tricks like LSTMs. And perhaps most crucially, the whole thing parallelizes beautifully across positions, which is why it dominates modern hardware despite being slightly less sample-efficient than some alternatives—the hardware lottery matters enormously when an architecture aligns perfectly with GPU/TPU matmul operations.

The residual stream architecture is elegant too: each layer reads from and writes to a shared communication channel, making the network an iterative refinement process where attention heads and MLPs contribute additively. This compositional view (mechanistic interpretability) makes very deep stacks trainable and gives clean semantics. And the uniformity of just repeating the same block with minimal architectural priors is a strength—it's simple, general, and lets the learned weights do the heavy lifting.
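The "read from and write to a shared channel" picture can be sketched in a few lines of plain Python. Everything here is hypothetical: `run_residual_stack`, `toy_mixing_block`, and `toy_scaling_block` are stand-ins invented for illustration, not real attention or MLP layers. The sketch only shows the additive bookkeeping: each block sees the current stream and contributes an update that is summed in.

```python
def run_residual_stack(x, blocks):
    # The stream starts as the input and is never overwritten, only added to.
    stream = list(x)
    updates = []
    for block in blocks:
        update = block(stream)  # read: the block sees the current stream
        updates.append(update)
        stream = [s + u for s, u in zip(stream, update)]  # write: add the update
    return stream, updates

def toy_mixing_block(stream):
    # Hypothetical stand-in for an attention-like block: pull toward the mean.
    mean = sum(stream) / len(stream)
    return [mean - s for s in stream]

def toy_scaling_block(stream):
    # Hypothetical stand-in for an MLP-like block: an elementwise update.
    return [0.5 * s for s in stream]

x = [1.0, 2.0, 3.0]
final, updates = run_residual_stack(x, [toy_mixing_block, toy_scaling_block])
# The final stream is the input plus the sum of every block's contribution.
total_update = [sum(u[i] for u in updates) for i in range(len(x))]
```

One consequence of this design is that a block that outputs zeros leaves the stream untouched, so the identity path through a deep stack is always available.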

The key insight is the separation of mixing operations: attention handles token-to-token mixing across the sequence, while MLPs handle feature mixing position-wise. This factorization turns out to be the essential decomposition, as later work like MLP-Mixer showed you can even swap out attention for another token-mixing layer and still get reasonable results.
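A small plain-Python sketch may help separate the two kinds of mixing. The names `mix_tokens` and `mix_channels`, the fixed uniform mixing weights, and the toy matrices are all assumptions for illustration; in a real transformer the token-mixing weights would come from attention rather than being fixed. The input is a list of positions, each holding a list of channel values.

```python
def mix_tokens(x, weights):
    # Token mixing: each output position is a weighted sum over ALL positions,
    # computed independently for each channel.
    n, d = len(x), len(x[0])
    return [[sum(weights[i][j] * x[j][c] for j in range(n)) for c in range(d)]
            for i in range(n)]

def mix_channels(x, w):
    # Channel mixing: the same linear map plus ReLU is applied to each
    # position on its own; positions never see each other.
    return [[max(0.0, sum(w[o][c] * row[c] for c in range(len(row))))
             for o in range(len(w))]
            for row in x]

x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]  # illustrative fixed weights
w = [[1.0, 0.0], [1.0, -1.0]]                  # illustrative channel weights

# Perturb only position 0 and compare what happens at the other positions.
x_edit = [[100.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
channel_rows_unchanged = mix_channels(x_edit, w)[1:] == mix_channels(x, w)[1:]
token_row_changed = mix_tokens(x_edit, uniform)[1] != mix_tokens(x, uniform)[1]
```

Editing one position leaves the channel-mixing output at every other position untouched, while the token-mixing output at other positions shifts, which is the factorization the paragraph above describes.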

Beyond the architecture itself, there's something deeper about how transformers scale. The loss follows a smooth power law across orders of magnitude of compute, data, and parameters—no weird discontinuities or walls. This graceful degradation and smooth improvement is itself evidence the architecture is fundamentally sound. And empirically, attention heads compose into reusable algorithmic circuits: induction heads, for instance, implement a copy-and-complete mechanism by looking back to previous token occurrences and copying what followed, which is in-context learning at its most basic level.
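The copy-and-complete behavior attributed to induction heads can be written as an ordinary algorithm. This is a hedged sketch of the behavior only, not of how a trained model implements it with attention heads; the function name `induction_predict` is hypothetical.

```python
def induction_predict(tokens):
    # Look back for the most recent earlier occurrence of the current token
    # and return whatever followed it; return None if there is no match.
    if not tokens:
        return None
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

# Having seen "A B" earlier, a repeated "A" is completed with "B".
prediction = induction_predict(["A", "B", "C", "A"])
no_match = induction_predict(["x", "y"])
```

In this toy form the pattern is a pure lookup over the context, which is why it is often described as the simplest case of in-context learning: nothing about "A is followed by B" needs to be stored in the weights.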

The real insight is that transformers aren't magic—they're the conjunction of several individually well-understood principles: input-dependent routing for expressiveness, constant path length for trainability at scale, parallelism for hardware efficiency, minimal inductive bias to let scale discover structure, a clean residual-stream compositional design, and factorization between token-mixing and channel-mixing. Each ingredient was known to be good in isolation, but the combination—and the bet that low inductive bias plus scale beats high inductive bias plus careful design—is what makes them work.

That said, I should be honest about the hindsight bias here. Several things genuinely weren't obvious: that this would scale as far as it has, that in-context learning would emerge, that removing recurrence entirely rather than augmenting it would succeed, that you could stack hundreds of layers. The original paper was framed narrowly around machine translation and efficiency, not as a universal substrate for intelligence. So the honest answer is: some of it looks obvious now, but the empirical payoff of that particular bet was a real surprise.

Now I'm thinking through the "fully differentiable Turing machine" question, which requires separating out several distinct claims that often get tangled together—there's the question of universality and Turing completeness, which has theoretical results under certain assumptions.