Peter Wang
@BrainsAndTennis
How do you build an agent that actually performs in a domain — one customers pick because it's better?
The basics have been standardized over the past year: an agent is a while-loop around a model that calls tools until the task is done. Give it a filesystem, give it a shell, and let it do most things through that. You can write it in an afternoon, and most people have. Everyone can build an agent — it really isn't that hard, and, as I'll spell out, it isn't that deep either. What separates a good one from a toy isn't cleverness; it's a real understanding of your domain and the patience to do some tedious, careful work in the few places that matter.
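The while-loop described above can be sketched in a few lines. This is a minimal illustration, not the Shortcut agent's code: `runAgent`, `callModel`, the message shape, and the scripted fake model are all hypothetical names chosen for the example, and a real model client would replace the fake.

```javascript
// Minimal agent loop: ask the model, run the tool it requests, feed the result
// back, and stop when the model answers without requesting a tool.
// `callModel` and `tools` are hypothetical stand-ins, not a real SDK.
async function runAgent(task, callModel, tools, maxSteps = 20) {
  const messages = [{ role: "user", content: task }];
  for (let step = 0; step < maxSteps; step++) {
    const reply = await callModel(messages);
    messages.push({ role: "assistant", content: reply.text, toolCall: reply.toolCall });
    if (!reply.toolCall) return reply.text; // no tool requested: the task is done
    const { name, args } = reply.toolCall;
    const tool = tools[name];
    // Unknown tools are reported back to the model instead of crashing the loop.
    const result = tool ? await tool(args) : `error: unknown tool "${name}"`;
    messages.push({ role: "tool", name, content: String(result) });
  }
  throw new Error(`agent did not finish within ${maxSteps} steps`);
}

// Scripted fake model so the sketch runs without network access:
// turn 1 requests a shell command, turn 2 summarizes the tool output.
function makeFakeModel() {
  let turn = 0;
  return async (messages) => {
    turn += 1;
    if (turn === 1) {
      return { text: "", toolCall: { name: "shell", args: { cmd: "ls" } } };
    }
    const lastMessage = messages[messages.length - 1];
    return { text: `Files: ${lastMessage.content}`, toolCall: null };
  };
}

// Fake shell tool that only understands `ls` (assumption for the demo).
const demoTools = {
  shell: async ({ cmd }) => (cmd === "ls" ? "a.csv b.csv" : ""),
};
```

Calling `runAgent("list files", makeFakeModel(), demoTools)` resolves to `Files: a.csv b.csv`. The step cap is a design choice of this sketch: without it, a model that never stops requesting tools would loop forever.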
I've spent almost a year now building the Shortcut agent, which is widely considered the most accurate spreadsheet agent around. It's deployed inside three of the four largest multistrategy hedge funds, where being wrong is expensive and nobody grades on a curve. We don't have Microsoft's or Anthropic's distribution. What we have is that the agent is right more often, and in this domain that has been the single most compelling reason customers pick us. So agent performance is the question I think about all day.
And here's the gap I keep running into: plenty is written about building agents, but little about building good ones. Look at how much the field varies on something as basic as tool count: Codex and Claude Code ship ~30 tools each; Pi ships 7. When popular agents disagree 4x on the most basic design question, it's a tell: there's no agreed-on principle. So I'm sharing mine, from a year of building one, to demystify the process for anyone writing their own.
Here it is: a good agent is a faithful compression of its task distribution. The rest of this is just what that means, and what it forces you to build.
## Context as a layered cache
Assume you don't own the environment and you didn't train the model. Then three things are yours to design — the system prompt, the tools, and the artifacts (skills, curated docs, references) — and they're all the same thing: the agent's context.
So the game is simple to state. With the model fixed, accuracy is a function of context quality: bloated context buries the signal, missing context forces guessing, and both cost you accuracy. And accuracy is what you're selling. Its value isn't linear: an agent that gets a task right 99% of the time is worth 10x more than one that gets it right 95% of the time.
But your users don't bring you a uniform distribution of problems to solve. They bring you a long tail:
how often
|
| ████
| ████
| ████
| ████
| ████
| ████
| ████
| ████
| ████
| ████ ▓▓▓▓
| ████ ▓▓▓▓ ░░░░ ░░░░ ░░░░ ░░░░ ░░░░ ░░░░ ░░░░
+----------------------------------------------------> task variety

████  bread-and-butter        the bulk of every session
▓▓▓▓  crucial-but-occasional  a handful of times a session
░░░░  the long tail           each one rare, but there are many,
                              and each still has to work
The agent has to handle all of it. But it cannot hold the union of everything in context at once — that's the bloated-prompt failure mode. So the real objective is sharper than "have everything available": minimize the context spent per task, averaged over the task distribution.
This is exactly the problem a CPU faces. A program might touch gigabytes of data, but the storage right next to the processor is tiny — so computers stack memory in tiers: a small, instant cache (L1), bigger-and-slower ones below it (L2, L3), then main memory and disk. It works because access is long-tailed too: keep the hot set in the fast tier, reach down to the slow tiers only for the rare stuff. A "cache miss" is when what you need isn't in the fast tier and you pay to fetch it from a slower one — exactly the cost you're avoiding on the common path.
Agents should have the same structure. Build your context as L1 / L2 / L3.
+-------------------------------------------------+
| L1  ALWAYS RESIDENT - tiny, instant.            |
|     The 80%. Lives in the system prompt.        |
+-------------------------------------------------+
        | miss -> one cheap call
        v
+-------------------------------------------------+
| L2  ON DEMAND - curated English specs.          |
|     The next ~15%. One discovery step to load.  |
+-------------------------------------------------+
        | miss -> read the skill, then search
        v
+-------------------------------------------------+
| L3  ESCAPE HATCH - the raw API tome.            |
|     The long tail. 3-6 grep calls to mine.      |
+-------------------------------------------------+
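The fall-through between tiers can be sketched as a plain lookup. Everything below is illustrative: the tier contents, the skill path, and the two reference entries are hypothetical placeholders, and only `getCellRange` appears in the thread itself. In a real agent the model performs these steps through tool calls; here they are collapsed into one function to show the order of the fallbacks.

```javascript
// L1: always in the system prompt, so a hit costs no extra tool call.
const L1_PROMPT = new Map([
  ["read a range", "sheet.getCellRange(address)"],
]);

// L2: curated specs loaded on demand; the prompt only holds a pointer to each.
// The path is a made-up example.
const L2_SKILLS = new Map([
  ["pivot tables", "skills/pivot-tables.md"],
]);

// L3: the raw reference, searched line by line (a stand-in for grep).
// Both entries are invented for the demo.
const L3_REFERENCE = [
  "setConditionalFormat(range, rule): applies a formatting rule",
  "setDataValidation(range, rule): restricts allowed input",
];

// Try the cheapest tier first and fall through on a miss.
function lookup(topic) {
  if (L1_PROMPT.has(topic)) {
    return { tier: "L1", hit: L1_PROMPT.get(topic) };
  }
  if (L2_SKILLS.has(topic)) {
    return { tier: "L2", hit: L2_SKILLS.get(topic) };
  }
  const needle = topic.toLowerCase();
  const line = L3_REFERENCE.find((entry) => entry.toLowerCase().includes(needle));
  return { tier: "L3", hit: line ?? null };
}
```

With these placeholder contents, `lookup("read a range")` is answered from L1, `lookup("pivot tables")` from L2, and `lookup("validation")` falls through to the L3 search.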
Almost every optimization trades compression of information against speed of discovery. Put something in L1 and it's instant, but it costs prompt tokens on every single task whether it's used or not. Push it to L3 and it costs nothing until needed — but then it costs several tool calls to find. Your job is to place each capability at the tier that minimizes total cost across the distribution. That's the whole craft. Let me make it concrete with the domain I know best.
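One way to make that placement decision concrete is a small expected-cost model. This is a sketch under loud assumptions, not a measured result: the token cost of a tool call, the size of an L2 pointer line, and the choice of 4 grep calls (inside the 3-6 range mentioned for L3) are invented constants, and cost is counted in tokens only, ignoring latency and accuracy effects.

```javascript
const CALL_OVERHEAD = 400;  // assumed token-equivalent cost of one tool call
const L2_INDEX_TOKENS = 20; // assumed size of the pointer line L2 keeps in the prompt
const L3_CALLS = 4;         // assumed midpoint of the 3-6 grep calls cited for L3

// Expected tokens spent per task for one capability at each tier.
// `tokens` is the capability's documentation size; `frequency` is the
// fraction of tasks that need it.
function expectedCosts({ tokens, frequency }) {
  return {
    L1: tokens,                                                 // paid on every task
    L2: L2_INDEX_TOKENS + frequency * (tokens + CALL_OVERHEAD), // pointer always, body on use
    L3: frequency * (tokens + L3_CALLS * CALL_OVERHEAD),        // nothing resident, costly to find
  };
}

// Pick the tier with the lowest expected cost.
function placeCapability(capability) {
  const costs = expectedCosts(capability);
  return Object.keys(costs).reduce((best, tier) =>
    costs[tier] < costs[best] ? tier : best
  );
}
```

Under these assumed constants, a 150-token capability needed in 80% of tasks lands in L1, a 600-token capability needed in 5% of tasks lands in L2, and a 2000-token capability needed in 0.1% of tasks lands in L3. The exact thresholds move with the constants; the shape of the trade-off is the point.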
## Aside: one tool, not thirty
Before the hierarchy, the substrate. Every spreadsheet capability I'm about to describe — every read, every write, every curated lookup — is code executed under a single tool.
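// Code the agent writes for the single tool; `sheet` is provided by the runtime.
async function execute() {
  // Read a block of cells in one call.
  const data = await sheet.getCellRange("Sheet1!A1:D200");
  // ...read, compute, write...
}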
The agent writes code; the code calls our functions; the functions touch the sheet. There is no read_range tool, no write_range tool, no make_chart tool. There is one tool, and the API lives inside the code.
Why? Because model accuracy degrades as you add tools; that has been consistent in our own experiments. Every tool you add is more schema in the prompt, more surface area for confusion, and more ways to pick the wrong one, especially when tools have overlapping responsibilities. A single execute_code tool collapses all of that into one decision (write code) and lets the model compose capabilities with the full expressive power of a programming language or DSL instead of stitching together rigid tool calls (more on this in a future post).
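A rough sketch of what a one-tool surface could look like follows. It is an assumption-heavy illustration, not Shortcut's implementation: the schema shape, `executeCode`, `makeSheet`, and `setCellRange` are hypothetical, and only `execute_code`, `execute()`, and `getCellRange` are named in the thread. The `Function` constructor used here provides no isolation, so a real deployment would need a proper sandbox.

```javascript
// The entire tool list the model sees: one entry.
const TOOLS = [
  {
    name: "execute_code",
    description:
      "Run JavaScript that defines `async function execute()`. `sheet` is in scope.",
    parameters: {
      type: "object",
      properties: { code: { type: "string" } },
      required: ["code"],
    },
  },
];

// Hypothetical in-memory stand-in for the spreadsheet API.
function makeSheet(cells) {
  return {
    async getCellRange(address) {
      return cells[address] ?? [];
    },
    async setCellRange(address, values) {
      cells[address] = values;
    },
  };
}

// Run model-written code with `sheet` injected as a parameter.
// WARNING: no sandboxing here; this only demonstrates the control flow.
async function executeCode(code, sheet) {
  const run = new Function("sheet", `${code}\nreturn execute();`);
  return run(sheet);
}
```

A model-written snippet that reads `Sheet1!A1:A3`, sums it, and writes the total to another cell goes through the same `executeCode` path as any other capability, which is what keeps the tool count at one.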