Thinking Traces are Better for Understanding

I sometimes find myself preferring to read the thinking traces that lead to a response from an LLM over the final, polished response.

I think this relates to the distinction in mathematics between “analysis” and “synthesis.”

The mathematician Gauss famously tried to remove all traces of the analysis that allowed him to arrive at the answer from his mathematical works.

He wanted the final “synthetic” results to flow perfectly from the definitions, axioms, and pure logic, as if handed down by God.

While this is certainly “cleaner” and more polished and dense, what some people call elegance in math, it’s often not particularly illuminating for the reader, especially if they’re not a super genius like Gauss.

I find myself most engaged by technical writing when I can understand the motivation at each step and effectively put myself in the position of the discoverer and see how I could have come to the same conclusion when presented with the same facts and framing.

I’ll give you an example using Claude Opus 4.8 with max reasoning effort. I asked it the following:

—

With the benefit of hindsight, what would you say were the things that made Transformers “obvious” and unsurprisingly effective for modeling very complex relationships in optimal ways that generalize well? To what extent is this related to them being, in effect, a “fully differentiable Turing machine”?

—

Below is the full reasoning trace. Sure, some of it is logistical in nature, but seeing how it approached the question, the various threads it pursued, and then finally how it wove things together, is a lot more illuminating for me than the final response, which I’ll also post below the trace:

—

The user is asking a conceptual question about why transformer architectures seem obvious in retrospect and how they relate to the idea of transformers as fully differentiable Turing machines. This is a deep theoretical ML question that doesn't require any external research—it's about foundational concepts I can reason through, and based on their background in mathematical physics and probability, they're looking for a substantive theoretical explanation.

I should be careful here because the "differentiable Turing machine" framing is actually somewhat contested and needs precision. Let me think through the genuine structural reasons why transformers are effective—starting with attention as a soft, content-based lookup mechanism, which is essentially a differentiable kernel smoother that learns its own representations through the query-key-value structure.

The real power comes from input-dependent connectivity: unlike fixed-weight architectures, attention computes routing weights directly from the data itself, creating multiplicative interactions that are far more expressive per parameter. This connects to ideas like fast weights and hypernetworks—the computation graph itself adapts based on the input. Then there's the inductive bias: treating sequences as sets with explicit position encoding decouples "what" information from "where," which is a clean separation that works well for sequential data.

The architecture also solves long-range dependencies elegantly—any two tokens are just one attention hop apart, giving constant-distance paths compared to the O(n) sequential bottleneck of RNNs or the O(log n) to O(n/k) depth of stacked CNNs. This structural solution to credit assignment is more fundamental than gating tricks like LSTMs. And perhaps most crucially, the whole thing parallelizes beautifully across positions, which is why it dominates modern hardware despite being slightly less sample-efficient than some alternatives—the hardware lottery matters enormously when an architecture aligns perfectly with GPU/TPU matmul operations.

The residual stream architecture is elegant too: each layer reads from and writes to a shared communication channel, making the network an iterative refinement process where attention heads and MLPs contribute additively. This compositional view (mechanistic interpretability) makes very deep stacks trainable and gives clean semantics. And the uniformity of just repeating the same block with minimal architectural priors is a strength—it's simple, general, and lets the learned weights do the heavy lifting.

The key insight is the separation of mixing operations: attention handles token-to-token mixing across the sequence, while MLPs handle feature mixing position-wise. This factorization turns out to be the essential decomposition, as later work like MLP-Mixer showed you can even swap out attention for another token-mixing layer and still get reasonable results.

Beyond the architecture itself, there's something deeper about how transformers scale. The loss follows a smooth power law across orders of magnitude of compute, data, and parameters—no weird discontinuities or walls. This graceful degradation and smooth improvement is itself evidence the architecture is fundamentally sound. And empirically, attention heads compose into reusable algorithmic circuits: induction heads, for instance, implement a copy-and-complete mechanism by looking back to previous token occurrences and copying what followed, which is in-context learning at its most basic level.

The real insight is that transformers aren't magic—they're the conjunction of several individually well-understood principles: input-dependent routing for expressiveness, constant path length for trainability at scale, parallelism for hardware efficiency, minimal inductive bias to let scale discover structure, a clean residual-stream compositional design, and factorization between token-mixing and channel-mixing. Each ingredient was known to be good in isolation, but the combination—and the bet that low inductive bias plus scale beats high inductive bias plus careful design—is what makes them work.

That said, I should be honest about the hindsight bias here. Several things genuinely weren't obvious: that this would scale as far as it has, that in-context learning would emerge, that removing recurrence entirely rather than augmenting it would succeed, that you could stack hundreds of layers. The original paper was framed narrowly around machine translation and efficiency, not as a universal substrate for intelligence. So the honest answer is: some of it looks obvious now, but the empirical payoff of that particular bet was a real surprise.

Now I'm thinking through the "fully differentiable Turing machine" question, which requires separating out several distinct claims that often get tangled together—there's the question of universality and Turing completeness, which has theoretical results under certain assumptions.

There are papers showing transformers are Turing complete, but only under unrealistic conditions like arbitrary precision arithmetic and unbounded decoding steps. The theoretical result is vacuous in practice. But a fixed-depth transformer is actually quite limited and definitely not Turing complete. Turing machine—it's a shallow parallel circuit instead. The key insight is that autoregressive decoding changes everything: when the transformer generates tokens and feeds them back as context, it creates an unbounded sequential loop with external memory (the growing context window), which is where real computational power emerges. This is what chain-of-thought results actually demonstrate—the expressive power comes from the iterative process, not a single forward pass.

The mapping to Turing machine anatomy becomes clear: the transformer weights act as the finite control, while the context window serves as both the readable tape and the writable output, with each decoding step functioning like a machine transition.

The universal Turing machine analogy is particularly elegant here—the weights remain fixed like a universal machine's core logic, while the prompt and context supply the "program" that gets executed. This is really what in-context learning amounts to: the model reading instructions off its tape and executing them. The differentiability aspect matters for training, allowing the transition function itself to be learned through gradient descent.

But there's a crucial asymmetry: while the network is differentiable end-to-end during training, the actual inference loop—where discrete tokens are sampled and fed back—breaks differentiability at that sampling step. So you get the TM-like unbounded computation power from a discrete, non-differentiable process, even though the underlying function was learned differentiably. This contrasts sharply with earlier work like the Neural Turing Machine and Differentiable Neural Computer, which were explicitly designed to be fully differentiable throughout, including memory operations using soft attention. Those were the literal differentiable machines, but they proved difficult to scale despite their theoretical elegance.

The irony is that transformers abandoned the fully-differentiable-memory ambition in favor of fixed-size attention and discrete autoregression—and that's precisely what scaled. So the architecture that only later got reinterpreted as an approximate differentiable computer actually outperformed the ones explicitly engineered to be differentiable from the ground up.

There's also another lens worth considering: transformers as in-context learners that implement learned circuits. Recent work on induction heads and transformers performing gradient descent in-context suggests the forward pass can encode actual learning algorithms—one step of gradient descent, ridge regression, and so on—directly in the activations. Within the bounded-depth circuit, it's running little optimization and inference procedures.

To directly answer whether this relates to being a fully differentiable Turing machine: the intuition captures something genuinely important. The system does have the anatomy of a universal Turing machine—a fixed learned transition function, external readable/writable memory via context, content-addressable access through attention, and an unbounded sequential loop via autoregression. That structure is why CoT and agentic loops unlock so much capability. But it's also importantly imprecise when applied to just the forward pass alone. A fixed-depth transformer is a bounded-depth parallel circuit, provably not Turing complete, and can't even handle some basic computational tasks.

The real power lives in the loop and the tape, not the network itself. The "differentiable" part refers to trainability of the control function, not the computation loop, which is discrete. Historically, the literal fully differentiable Turing machines—the NTM and DNC—actually lost to transformers, so calling a transformer one conflates two different things and gets the history backwards. The actual success isn't primarily because it's a Turing machine. Turing-completeness is cheap and common; what matters is the combination of expressiveness, trainability, parallelism, and low inductive bias. Turing-completeness alone doesn't explain generalization or learning.

The real insight is that universality is almost beside the point—what matters is the learnability of useful programs and the inductive bias toward compositional, reusable circuits, plus the optimization-friendliness and scaling behavior. People over-index on "Turing complete equals powerful," but the truth is that expressivity is necessary but radically insufficient. The magic is in the learnability, the loss landscape, and how well it aligns with SGD. A brainfuck interpreter is Turing complete and useless for learning. The transformer's advantage is that its space of learnable programs is well-suited to the problems we care about.

I should also connect this to circuit complexity and why chain-of-thought helps: it's not just "more thinking," it's trading circuit depth (which transformers are fundamentally limited in) for sequential computation steps. CoT converts problems that need more serial depth than the model has into ones solvable by a sequence of shallow steps, each within TC⁰. This gives a rigorous, principled explanation for why intermediate reasoning works.

Now I'm thinking through the nuances of soft versus hard attention and how precision assumptions affect the exact complexity class. The key distinction is that soft attention with infinite precision could theoretically be Turing complete, but that's unrealistic—with fixed precision and either soft or hard attention, we're looking at something closer to TC⁰ or logspace. I want to be precise about the exact results here, particularly Hahn's work showing that hard-attention transformers can't recognize certain formal languages like PARITY or DYCK-2.

The Merrill and Sabharwal results are central: log-precision transformers can be simulated by constant-depth threshold circuits, which means they're bounded by TC⁰ and logspace. The constant depth is the limiting factor—it severely restricts serial computation. More recent work by Merrill and Sabharwal on chain-of-thought shows that adding intermediate reasoning steps gives transformers additional computational power, though the thinking cuts off before fully developing that point. ursive computation captures serial power. Given the sophistication of this question and the asker's technical background, I should aim for substantive depth in prose rather than relying on headers or bullet points. I can use precise terminology like TC⁰, NC¹, and reference key papers directly since they'll appreciate the technical precision and the contrarian-but-correct framing. The key ingredients that make transformers feel inevitable in retrospect—expressiveness through input-dependent routing, efficient training via short gradient paths, natural hardware alignment through parallelism, and the compositionality enabled by the residual stream and factorized token-channel mixing. But I should be honest that "obvious" is partly hindsight; the scaling laws and emergence phenomena were genuine surprises at the time. On the differentiable Turing machine analogy, there's something real there when you map the system-level architecture—fixed weights as control, context as tape, attention as the read head, autoregression as the loop, in-context learning as the universal machine property—but the forward pass itself is fundamentally bounded-depth circuitry, which provably can't simulate full Turing computation.

Chain-of-thought reasoning is where the two halves actually connect: you're trading circuit depth for sequential steps, and that's the rigorous bridge between the computational model and what the system can actually learn to do. The deeper point is that Turing-completeness itself is almost cheap and doesn't explain the success at all; what matters is learnability, generalization, and optimization-friendliness. And there's a historical irony worth noting: the fully differentiable computers like Neural Turing Machines and Differentiable Neural Computers were the literal answer to this question, yet transformers won by being less explicitly a differentiable computer—the framing in the question is seductive but ultimately misleading about what's actually driving the performance. Attention's real power comes from computing mixing weights directly from the input data rather than applying fixed transformations—this makes interactions multiplicative and far more parameter-efficient than traditional architectures. The constant path length between any two positions eliminates the vanishing gradient problem that plagued RNNs, making long-range dependencies trainable without special tricks. And crucially, the fully parallelizable structure maps perfectly onto modern hardware, which is the unglamorous but decisive advantage that actually enabled scaling.

The minimal inductive bias—stacking the same block with position information kept separate—lets the architecture avoid baking in assumptions like locality that constrain learning. Instead of fighting the data with architectural priors, transformers let scale itself discover structure, which is the Bitter Lesson playing out in practice. The residual stream acts as a shared linear bus that each component refines iteratively, allowing information to flow and accumulate across layers.

The factorization of mixing—attention across tokens, MLPs across channels—turns out to be the core insight; MLP-Mixer proved you can replace attention with other token-mixers and still work well, suggesting the factorization matters more than attention specifically. The honest truth is that each piece seemed reasonable in 2017, but the real bet was that low-bias-plus-scale would beat clever architectural priors, combined with smooth power-law scaling and the emergence of unexpected capabilities.

Now I'm questioning the "differentiable Turing machine" framing because it conflates two different things. On a single forward pass, a transformer is really just a bounded-depth parallel circuit with surprisingly low complexity—complexity-theory work shows it sits around TC⁰, and hard-attention transformers can't even reliably solve problems like PARITY or balanced parentheses.

But where the analogy actually works is when you look at the whole autoregressive system unfolding over time. The frozen weights act like a fixed control function applied repeatedly, the context window and generated tokens form the tape, attention provides content-addressable read access, and the sequential generation loop creates the unbounded computation.

This is genuinely a Turing machine structure, and the universal version is even cleaner: the weights stay fixed like a universal machine, while the prompt supplies the program on the tape—which is exactly what in-context learning does. Chain-of-thought exploits this by trading the circuit depth the network lacks for sequential steps the loop provides, decomposing deep problems into shallow TC⁰-sized steps executed serially.

So chain-of-thought isn't just "more thinking"—it's escaping the constant-depth constraint by spending tokens. But I need to clarify what "fully differentiable" actually means here: the control function is learned end-to-end through gradient descent, but the inference-time loop itself is discrete (argmax, sampling, feedback) and not differentiable. And historically, the literal differentiable Turing machines—Neural Turing Machines and Differentiable Neural Computers—had soft, differentiable memory operations but ultimately didn't win out.

The real irony is that transformers succeeded by being less of an explicit differentiable computer: they dropped the differentiable memory for a fixed context window and discrete autoregression, only later getting reinterpreted as approximate universal computers.

The key insight I want to emphasize is that Turing-completeness isn't what makes transformers work. Universality is cheap—Rule 110 is Turing complete, Brainfuck is Turing complete, and neither learns or generalizes. What actually matters is that the computations transformers can express are smoothly searchable by gradient descent and biased toward solutions that generalize well.

The attention mechanism creates a space where gradient descent reliably discovers compositional, reusable algorithms—the kind of in-context learning where a forward pass can implement gradient descent or ridge regression steps. The magic is learnability and a friendly loss landscape, not raw expressive power. In fact, the constraints—bounded depth, homogeneous architecture—are features, not bugs; they're what make transformers learnable and predictable to scale.

The differentiable Turing machine intuition works well for understanding the system as a whole: weights as control, context as tape, attention as the read head, autoregression as the loop. But it's misleading as an explanation for why they actually work.

I'm reconsidering the length here. The question is genuinely sophisticated and deserves a substantive answer, so I don't want to oversimplify. But I should tighten any redundancy and make sure each sentence pulls its weight. I used some bullet points earlier, and while lists can work for multifaceted content, prose flows better for this kind of conceptual argument. I'm double-checking the technical claims I made—verifying that the Merrill & Sabharwal result about log-precision transformers fitting within logspace-uniform TC⁰ is accurate, and confirming Hahn's 2020 findings on why transformers fail to robustly recognize languages like PARITY and DYCK through perturbation sensitivity arguments. Now I need to verify what TC⁰ itself cannot simulate to make sure my claims about its limitations are sound. you can compute parity with threshold gates by summing the bits and checking the low bit. So if transformers are in TC⁰, that doesn't separate them from parity since TC⁰ can handle it. But Hahn's result about hard-attention transformers failing on parity is really about a robustness and Lipschitz constraint specific to that architecture, not a general circuit-class argument. I need to be precise here: saying "transformers are TC⁰ and TC⁰ can't do parity" is wrong because TC⁰ actually can do parity.

The circuit-complexity result from Merrill-Sabharwal shows that log-precision soft-attention transformers fit within uniform TC⁰. Now, what can't TC⁰ do? Under standard conjectures, TC⁰ is strictly contained in NC¹, so it can't handle NC¹-complete problems like evaluating arbitrary Boolean formulas or simulating general finite automata over non-solvable monoids. That's where the real separation is—not with parity, which TC⁰ handles fine. I should pick a different example to illustrate what TC⁰ can't do.

Let me separate the two limitations cleanly: Hahn's robustness result is architecture-specific and applies to parity and Dyck languages, while the Merrill-Sabharwal containment in TC⁰ is the broader complexity bound—and TC⁰ itself is conjectured to be strictly weaker than NC¹. I should use "iterated composition of permutations" as my example since it's intuitive and directly connects to Barrington's theorem. Let me verify the in-context learning citations: von Oswald et al showed linear self-attention can implement gradient descent steps, Akyürek et al demonstrated transformers can perform ridge regression, and Garg et al provided empirical validation. I also want to confirm the Pérez et al 2019 paper on Turing completeness of modern neural architectures.

Now I'm refining the complexity bounds section—soft-attention transformers with log-precision fall within TC⁰ according to Merrill & Sabharwal, while Hahn showed hard-attention transformers can't reliably recognize parity or balanced parentheses at arbitrary lengths, which means a single forward pass can't handle tasks requiring unbounded sequential composition like the S₅ word problem. I'm checking my language choices now—trimming down words like "genuinely" and "genuine" where they're redundant, replacing them with simpler alternatives like "real" or just removing the modifier entirely. I want to keep the tone sharp and direct without the filler phrases.

Given the depth this question deserves, I'm aiming for around 900–1100 words to match the PhD-seminar level of inquiry while cutting any unnecessary elaboration.

=================================

OK, so that was the complete reasoning trace. Compare that to the final response below. Which one do you prefer? Which one helped you understand things better? I'm not claiming that the final response isn't worth reading, but I think that most people would get a lot more out of reading BOTH of them:

—

Short version: transformers look obvious in hindsight because they're the conjunction of several independently-good ideas — but the "fully differentiable Turing machine" framing credits the wrong property for their success, and is only half-right even as a description.

Why they look obvious now

A handful of structural choices, each individually motivated, whose combination was the actual bet:

Input-dependent routing. The defining move is that attention computes its mixing weights from the data itself (softmax(QKᵀ/√d)·V). Every prior workhorse — MLP, CNN, RNN — applies a fixed weight matrix regardless of input. Attention is a soft content-addressable dictionary lookup, equivalently a kernel smoother over learned representations, generating "fast weights" on the fly. Data-dependent multiplicative interactions are wildly more expressive per parameter than fixed additive ones. That's the expressiveness story.

Constant path length. Any two positions are one attention hop apart in the computation graph, versus O(n) for an RNN. Long-range credit assignment stops being a vanishing-gradient battle and becomes structural — which is why LSTM-style gating tricks became unnecessary.

Parallelism / hardware fit. RNNs are serial in time; transformers process all positions at once and reduce to dense matmuls. This is the unglamorous decisive factor — Sara Hooker's "hardware lottery." An architecture that's more sample-efficient but unparallelizable loses to one that maps onto the most optimized operation in computing and therefore scales.

Minimal inductive bias. Same block stacked, with position injected separately so "what" is decoupled from "where." CNNs bake in locality and translation invariance — great in the small-data regime, a ceiling at scale (ViT overtakes them given enough data). The near-absence of priors is the Bitter Lesson in architectural form: don't fight the data, let scale find the structure.

The residual stream. Residual + LayerNorm turns the stack into iterative refinement over a shared linear "bus" that each head and MLP reads from and writes to additively. That's what makes hundred-layer depth trainable and gives the clean compositional semantics the interpretability work exploits.

Factorized mixing. Attention mixes across tokens, the MLP mixes across channels, position-wise. MLP-Mixer later showed you can swap attention for another token-mixer and still do fine — evidence that the factorization is the essential thing, and attention is just an unusually good, input-adaptive instance of it.

The honest caveat: "obvious" does a lot of hindsight work. Each ingredient was reasonable in 2017, but the bet that low-bias-plus-scale beats clever-priors, the smooth power-law scaling, and the emergence of in-context learning and reusable circuits (induction heads, Olsson et al.) were real empirical surprises. The original paper framed this as a faster machine-translation model, not a universal substrate.

The differentiable-Turing-machine question

Here I'd push back, because the framing conflates two different objects.

Applied to a single forward pass, it's basically wrong. A fixed-depth, finite-precision transformer is a bounded-depth parallel circuit. The complexity work places it shockingly low: log-precision soft-attention transformers sit inside (uniform) TC⁰ (Merrill & Sabharwal 2023), and Hahn (2020) separately showed hard-attention transformers can't even robustly recognize PARITY or balanced brackets at arbitrary length. Since TC⁰ is conjectured strictly below NC¹, one pass provably can't do things requiring unbounded sequential composition — iterating a long chain of permutations (the S₅ word problem, NC¹-complete by Barrington) or evaluating deeply nested formulas. So a forward pass is not a Turing machine; it's a shallow, very wide, learned circuit. Turing-completeness reappears only under unphysical assumptions like infinite-precision activations (Pérez et al. 2019).

Where the analogy does bite is at the level of the whole autoregressive system, and there it's illuminating. Map the anatomy:

fixed control / transition function = the frozen weights, applied identically each step

tape = the context window (read) plus the generated tokens (written back)

read head with content-addressable access = attention

unbounded sequential loop = autoregressive decoding

That is the skeleton of a Turing machine, and the universal version is tighter still: the weights are fixed like a universal machine, and the program is supplied on the tape — the prompt. In-context learning is then exactly "read your program off the tape." That's the strongest defensible version of your intuition, and it's why chain-of-thought and agentic loops unlock so much.

And the link between the two halves is now rigorous. Chain-of-thought (Merrill & Sabharwal 2024; Feng et al.) lets the model trade circuit depth — which it structurally lacks — for sequential steps, which the loop supplies. A problem needing more serial depth than the network has gets decomposed into a sequence of shallow TC⁰-sized steps. CoT isn't vaguely "more thinking"; it's literally climbing out of the constant-depth box by spending tokens.

Two corrections to "fully differentiable":

First, the differentiability is about training — the control function is learned end-to-end by gradient descent. The inference-time loop that confers the TM-like power is discrete (argmax/sampling, feedback) and not differentiable through. It's a differentiably-trained controller driving a non-differentiable computation loop.

Second, the literal "fully differentiable Turing machines" already existed and lost: the Neural Turing Machine and Differentiable Neural Computer (Graves et al., 2014/2016) had soft, differentiable read/write over external memory by explicit design. They worked and didn't scale. The irony is that the transformer won by being less of an explicit differentiable computer — dropping differentiable memory for a fixed window plus discrete autoregression — and only got reinterpreted as an approximate one afterward.

The deflationary point I'd actually stress

Don't credit the success to Turing-completeness. Universality is cheap and nearly irrelevant to why transformers work. Rule 110 is Turing complete; a Brainfuck interpreter is Turing complete; neither learns or generalizes. Computational universality says nothing about sample efficiency, optimization, or generalization.

What makes transformers special isn't that they can express arbitrary computation — it's that the programs they can express are smoothly searchable by SGD and biased toward solutions that generalize. The space of attention circuits is one where gradient descent reliably finds compositional, reusable algorithms (the von Oswald and Akyürek results, that a forward pass can implement gradient-descent or ridge-regression steps in-context, are a nice concretization). The magic is learnability and the friendliness of the loss landscape, not the expressive ceiling. If anything, the limitations — bounded depth, low-bias homogeneity — are part of why they're learnable and why they scale predictably.

So: the differentiable-TM picture is a good intuition for the system (weights as control, context as tape, attention as read head, autoregression as the loop, prompt as program) and a misleading one for the network. And even where it's right, it explains the model's reach, not its effectiveness. The effectiveness comes from the boring-sounding stuff in the first half.

I get a few messages like this per week and it’s one of the best parts of doing this kind of work.

I certainly never heard stuff like this when I was working as a hedge fund analyst, that’s for sure!

My skills site really is the very highest ROI thing you can buy for $20/mo!

This is the skill he was talking about btw:

jeffreys-skills.md/skills/interac…

Thinking Traces are Better for Understanding

Actions

What You Can Do