
How do you build an agent that actually performs in a domain — one customers pick because it's better?


The basics have been standardized over the past year: an agent is a while-loop around a model that calls tools until the task is done. Give it a filesystem, give it a shell, and let it do most things through that. You can write it in an afternoon, and most people have. Everyone can build an agent — it really isn't that hard, and, as I'll spell out, it isn't that deep either. What separates a good one from a toy isn't cleverness; it's a real understanding of your domain and the patience to do some tedious, careful work in the few places that matter.
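The while-loop described above fits in a few lines. The sketch below is a minimal illustration, not Shortcut's implementation: the `Model`, `Tools`, and `runAgent` names, the transcript format, and the `maxSteps` cap are all assumptions, and the model is a scripted, synchronous stand-in so the code runs without a network call (a real model call would be async).

```typescript
// Hypothetical types for illustration; none of these names come from a real SDK.
type ToolCall = { tool: string; input: string };
type ModelTurn =
  | { kind: "tool_call"; call: ToolCall }
  | { kind: "done"; answer: string };

type Model = (transcript: string[]) => ModelTurn;
type Tools = Record<string, (input: string) => string>;

// The loop: ask the model, run the tool it picked, append the result, repeat.
function runAgent(model: Model, tools: Tools, task: string, maxSteps = 20): string {
  const transcript: string[] = [`task: ${task}`];
  let steps = 0;
  while (steps < maxSteps) {
    const turn = model(transcript);
    if (turn.kind === "done") return turn.answer;
    const tool = tools[turn.call.tool];
    const result = tool
      ? tool(turn.call.input)
      : `error: unknown tool ${turn.call.tool}`;
    transcript.push(`${turn.call.tool}(${turn.call.input}) -> ${result}`);
    steps++;
  }
  // Assumed safety cap so a confused model cannot loop forever.
  throw new Error(`no answer after ${maxSteps} steps`);
}

// Scripted stand-in for the model: one tool call, then a final answer.
const scriptedModel: Model = (transcript) =>
  transcript.length === 1
    ? { kind: "tool_call", call: { tool: "shell", input: "wc -l data.csv" } }
    : { kind: "done", answer: `saw: ${transcript[1]}` };

// Fake shell tool that only echoes the command it was given.
const fakeTools: Tools = { shell: (cmd) => `ran ${cmd}` };

const answer = runAgent(scriptedModel, fakeTools, "count rows");
```

Nothing in the loop knows about the domain. Everything that makes the agent good or bad arrives through the transcript and the tools.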

I've spent almost a year building the Shortcut agent, which is widely considered the most accurate spreadsheet agent around. It's deployed inside three of the four largest multistrategy hedge funds, where being wrong is expensive and nobody grades on a curve. We don't have Microsoft's or Anthropic's distribution. What we have is an agent that is right more often, and in this domain that has been the single most compelling reason customers pick us. So agent performance is the question I think about all day.

And here's the gap I keep running into: plenty is written about building agents, but little about building good ones. Look at how much the field varies on something as basic as tool count: Codex and Claude Code ship ~30 tools each; Pi ships 7. When popular agents disagree by roughly 4x on the most basic design question, it's a tell: there's no agreed-on principle. So I'm sharing mine, from a year of building one, to demystify the process for anyone writing their own.

Here it is: a good agent is a faithful compression of its task distribution. The rest of this is just what that means, and what it forces you to build.

## <b>Context as a layered cache</b>

Assume you don't own the environment and you didn't train the model. Then three things are yours to design — the system prompt, the tools, and the artifacts (skills, curated docs, references) — and they're all the same thing: the agent's context.

So the game is simple to state. With the model fixed, accuracy is a function of context quality: bloated context buries the signal, missing context forces guessing, and both cost you accuracy. And accuracy is what you're selling. Its value isn't linear: a task that scores 99% is worth 10x more than one that scores 95%.

But your users don't bring you a uniform distribution of problems to solve. They bring you a long tail:

<pre><code lang="markdown"> how often
    |
    | ████
    | ████
    | ████
    | ████
    | ████
    | ████
    | ████
    | ████
    | ████
    | ████ ▓▓▓▓
    | ████ ▓▓▓▓ ░░░░ ░░░░ ░░░░ ░░░░ ░░░░ ░░░░ ░░░░
    +----------------------------------------------------> task variety

  ████ bread-and-butter         the bulk of every session
  ▓▓▓▓ crucial-but-occasional   a handful of times a session
  ░░░░ the long tail            each one rare, but there are many, and each still has to work</code></pre>

The agent has to handle all of it. But it cannot hold the union of everything in context at once — that's the bloated-prompt failure mode. So the real objective is sharper than "have everything available": minimize the context spent per task, averaged over the task distribution.

This is exactly the problem a CPU faces. A program might touch gigabytes of data, but the storage right next to the processor is tiny — so computers stack memory in tiers: a small, instant cache (L1), bigger-and-slower ones below it (L2, L3), then main memory and disk. It works because access is long-tailed too: keep the hot set in the fast tier, reach down to the slow tiers only for the rare stuff. A "cache miss" is when what you need isn't in the fast tier and you pay to fetch it from a slower one — exactly the cost you're avoiding on the common path.

Agents should have the same structure. Build your context as L1 / L2 / L3.

<pre><code lang="markdown">     +---------------------------------------------+
 L1  |  ALWAYS RESIDENT - tiny, instant.           |
     |  The 80%. Lives in the system prompt.       |
     +---------------------------------------------+
                      |  miss -> one cheap call
                      v
     +---------------------------------------------+
 L2  |  ON DEMAND - curated English specs.         |
     |  The next ~15%. One discovery step to load. |
     +---------------------------------------------+
                      |  miss -> read the skill, then search
                      v
     +---------------------------------------------+
 L3  |  ESCAPE HATCH - the raw API tome.           |
     |  The long tail. 3-6 grep calls to mine.     |
     +---------------------------------------------+</code></pre>
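The fall-through in the diagram can be sketched as an ordered lookup. This is an illustration under stated assumptions, not the Shortcut implementation: the `TierStore` shape, the capability keys, and the figure of 4 tool calls for L3 (picked from the 3-6 range in the diagram) are all hypothetical.

```typescript
// Hypothetical tier model; capability keys and call counts are illustrative.
type Tier = "L1" | "L2" | "L3";

interface TierStore {
  tier: Tier;
  toolCallsToReach: number; // tool calls paid when a lookup lands in this tier
  docs: Map<string, string>;
}

// Ordered fastest to slowest, mirroring the diagram.
const tiers: TierStore[] = [
  { tier: "L1", toolCallsToReach: 0, docs: new Map([["formulas", "resident in the system prompt"]]) },
  { tier: "L2", toolCallsToReach: 1, docs: new Map([["pivot-tables", "curated spec, loaded on demand"]]) },
  { tier: "L3", toolCallsToReach: 4, docs: new Map([["sparkline-axes", "found by searching the raw API reference"]]) },
];

interface Resolution {
  tier: Tier;
  toolCalls: number;
  doc: string;
}

// Walk down the tiers; each miss falls through to the next, slower one.
function resolve(capability: string): Resolution | undefined {
  for (const store of tiers) {
    const doc = store.docs.get(capability);
    if (doc !== undefined) {
      return { tier: store.tier, toolCalls: store.toolCallsToReach, doc };
    }
  }
  return undefined; // documented nowhere: the agent would have to guess
}

const common = resolve("formulas");
const occasional = resolve("pivot-tables");
const rare = resolve("sparkline-axes");
```

In this toy setup `common` costs no tool calls, `occasional` costs one, and `rare` costs four, which is the shape of the trade the tiers are meant to make.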

Almost every optimization trades compression of information against speed of discovery. Put something in L1 and it's instant, but it costs prompt tokens on every single task whether it's used or not. Push it to L3 and it costs nothing until needed — but then it costs several tool calls to find. Your job is to place each capability at the tier that minimizes total cost across the distribution. That's the whole craft. Let me make it concrete with the domain I know best.
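Before the spreadsheet specifics, the placement trade-off can be put in rough numbers. The sketch below is a toy cost model, and every constant in it is an assumption rather than a measurement: a discovery call is priced at 500 tokens, an L2 entry keeps a 20-token index line resident in the prompt, and L3 takes 4 calls to reach. The `Capability` shape and function names are hypothetical.

```typescript
// Toy cost model. All constants are assumptions for illustration only.
interface Capability {
  name: string;
  frequency: number; // fraction of tasks that need this capability
  docTokens: number; // size of its documentation
}

const TOKENS_PER_TOOL_CALL = 500; // assumed cost of one discovery call
const L2_INDEX_TOKENS = 20;       // assumed one-line index entry kept in the prompt
const CALLS = { L1: 0, L2: 1, L3: 4 } as const;
type Placement = keyof typeof CALLS;

// Expected tokens per task for one capability at a given tier.
function expectedCost(cap: Capability, tier: Placement): number {
  // L1 pays for the full documentation on every task, used or not.
  if (tier === "L1") return cap.docTokens;
  // L2 and L3 pay for the docs plus discovery calls only when the capability is needed.
  const onDemand = cap.frequency * (cap.docTokens + CALLS[tier] * TOKENS_PER_TOOL_CALL);
  return tier === "L2" ? L2_INDEX_TOKENS + onDemand : onDemand;
}

// Pick the tier with the lowest expected cost for this capability.
function bestTier(cap: Capability): Placement {
  const options: Placement[] = ["L1", "L2", "L3"];
  return options.reduce((best, t) =>
    expectedCost(cap, t) < expectedCost(cap, best) ? t : best
  );
}

const breadAndButter = bestTier({ name: "formulas", frequency: 0.8, docTokens: 300 });
const occasionalUse = bestTier({ name: "pivot-tables", frequency: 0.1, docTokens: 300 });
const longTail = bestTier({ name: "sparkline-axes", frequency: 0.005, docTokens: 300 });
```

Under these assumed numbers the frequent capability lands in L1, the occasional one in L2, and the rare one in L3. The exact thresholds move with the constants; the ordering by frequency is the point.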

## <b>Aside: one tool, not thirty</b>

Before the hierarchy, the substrate. Every spreadsheet capability I'm about to describe — every read, every write, every curated lookup — is code executed under a single tool.

<pre><code lang="typescript">async function execute() {
  const data = await sheet.getCellRange("Sheet1!A1:D200");
  // ...read, compute, write...
}</code></pre>

The agent writes code; the code calls our functions; the functions touch the sheet. There is no read_range tool, no write_range tool, no make_chart tool. There is one tool, and the API lives inside the code.

Why? Because model accuracy degrades as you add tools. That's been consistent in our own experiments. Every tool you add is more schema in the prompt, more surface to confuse, and more ways to pick the wrong one, especially if the tools have overlapping responsibilities. A single execute_code tool collapses all of that into one decision: write code. It also lets the model compose capabilities with the full expressive power of a programming language or DSL instead of stitching together rigid tool calls (more on this in a future post).
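To make the single-tool idea concrete, here is a minimal sketch of one `executeCode` entry point over an in-memory sheet. It is an illustration under assumptions, not the Shortcut runtime: only `getCellRange` appears in the article, while `setCell`, the storage layout, and the single-column range parser are invented for the example. The API is kept synchronous so the sketch runs standalone, and `new Function` is not a sandbox; a real deployment would need proper isolation.

```typescript
// In-memory stand-in for the spreadsheet. Only getCellRange is named in the
// article; setCell and the storage layout are assumptions for this sketch.
const cells = new Map<string, number>([["A1", 2], ["A2", 3], ["A3", 5]]);

const sheet = {
  getCellRange(range: string): number[] {
    // Supports only single-column ranges such as "A1:A3" to keep the sketch short.
    const match = /^([A-Z])(\d+):\1(\d+)$/.exec(range);
    if (!match) throw new Error(`unsupported range: ${range}`);
    const [, col, start, end] = match;
    const values: number[] = [];
    for (let row = Number(start); row <= Number(end); row++) {
      values.push(cells.get(`${col}${row}`) ?? 0);
    }
    return values;
  },
  setCell(address: string, value: number): void {
    cells.set(address, value);
  },
};

// The single tool: the model submits code, and the code calls the API.
// WARNING: new Function offers no isolation; this is for illustration only.
function executeCode(code: string): unknown {
  const body = new Function("sheet", code);
  return body(sheet);
}

// What a model-written program might look like: read, compute, and write
// composed in one call instead of three separate tool invocations.
const program = `
  const values = sheet.getCellRange("A1:A3");
  const total = values.reduce((sum, v) => sum + v, 0);
  sheet.setCell("A4", total);
  return total;
`;

const result = executeCode(program);
```

The read, the sum, and the write happen inside one tool call, and the only schema the model has to choose from is "write code".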