Peter Wang

@BrainsAndTennis

How do you build an agent that actually performs in a domain — one customers pick because it's better?

The basics have been standardized over the past year: an agent is a while-loop around a model that calls tools until the task is done. Give it a filesystem, give it a shell, and let it do most things through that. You can write it in an afternoon, and most people have. Everyone can build an agent — it really isn't that hard, and, as I'll spell out, it isn't that deep either. What separates a good one from a toy isn't cleverness; it's a real understanding of your domain and the patience to do some tedious, careful work in the few places that matter.
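The while-loop described above fits in a few lines. This is a minimal sketch, not the Shortcut implementation: the `Model` and `Tools` types, the `runAgent` name, and the stub model are all hypothetical, and the model call is synchronous here only to keep the example self-contained (a real one would be an async API call).

```typescript
// Hypothetical types for illustration; none of these names come from a real SDK.
type ToolCall = { tool: string; input: string };
type ModelReply =
  | { kind: "tool_call"; call: ToolCall }
  | { kind: "done"; answer: string };

type Model = (transcript: string[]) => ModelReply;
type Tools = Record<string, (input: string) => string>;

// The loop: ask the model, run the tool it asks for, feed the result back,
// and stop when the model says the task is done (or the step budget runs out).
function runAgent(model: Model, tools: Tools, task: string, maxSteps = 20): string {
  const transcript: string[] = [`task: ${task}`];
  let steps = 0;
  while (steps < maxSteps) {
    steps += 1;
    const reply = model(transcript);
    if (reply.kind === "done") return reply.answer;

    const tool = tools[reply.call.tool];
    const result = tool
      ? tool(reply.call.input)
      : `error: unknown tool ${reply.call.tool}`;
    transcript.push(`call ${reply.call.tool}(${reply.call.input})`, `result: ${result}`);
  }
  throw new Error(`no answer after ${maxSteps} steps`);
}

// Stub model: calls the shell once, then reports the tool result as its answer.
const stubModel: Model = (transcript) => {
  const last = transcript[transcript.length - 1] ?? "";
  return last.startsWith("result: ")
    ? { kind: "done", answer: last.slice("result: ".length) }
    : { kind: "tool_call", call: { tool: "shell", input: "echo hi" } };
};

// Stub shell tool: pretends to run the command.
const stubTools: Tools = { shell: (cmd) => `ran ${cmd}` };

const agentAnswer = runAgent(stubModel, stubTools, "say hi");
```

Nothing in the loop knows anything about the domain, which is the point: the difference between a good agent and a toy has to come from what goes into the context.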

I've spent almost a year now building the Shortcut agent, which is widely considered the most accurate spreadsheet agent around. It's deployed inside three of the four largest multistrategy hedge funds, where being wrong is expensive and nobody grades on a curve. We don't have Microsoft's or Anthropic's distribution. What we have is that the agent is right more often, and in this domain that has been the single most compelling reason customers pick us. So agent performance is the question I think about all day.

And here's the gap I keep running into: plenty is written about building agents, but little about building good ones. Look at how much the field varies on something as basic as tool count: Codex and Claude Code ship ~30 tools each; Pi ships 7. When popular agents disagree 4x on the most basic design question, it's a tell: there's no agreed-on principle. So I'm sharing mine, from a year of building one, to demystify the process for anyone writing their own.

Here it is: a good agent is a faithful compression of its task distribution. The rest of this is just what that means, and what it forces you to build.

## Context as a layered cache

Assume you don't own the environment and you didn't train the model. Then three things are yours to design — the system prompt, the tools, and the artifacts (skills, curated docs, references) — and they're all the same thing: the agent's context.

So the game is simple to state. With the model fixed, accuracy is a function of context quality: bloated context buries the signal, missing context forces guessing, and both cost you accuracy. And accuracy is what you're selling. Its value isn't linear: an agent that gets a task right 99% of the time is worth 10x more than one that gets it right 95% of the time.

But your users don't bring you a uniform distribution of problems to solve. They bring you a long tail:

<pre><code lang="markdown">
how often
   |
   | ████
   | ████
   | ████
   | ████
   | ████
   | ████
   | ████
   | ████
   | ████
   | ████ ▓▓▓▓
   | ████ ▓▓▓▓ ░░░░ ░░░░ ░░░░ ░░░░ ░░░░ ░░░░ ░░░░
   +----------------------------------------------------> task variety

   ████ bread-and-butter        the bulk of every session
   ▓▓▓▓ crucial-but-occasional  a handful of times a session
   ░░░░ the long tail           each one rare, but there are many, and each still has to work
</code></pre>

The agent has to handle all of it. But it cannot hold the union of everything in context at once — that's the bloated-prompt failure mode. So the real objective is sharper than "have everything available": minimize the context spent per task, averaged over the task distribution.

This is exactly the problem a CPU faces. A program might touch gigabytes of data, but the storage right next to the processor is tiny — so computers stack memory in tiers: a small, instant cache (L1), bigger-and-slower ones below it (L2, L3), then main memory and disk. It works because access is long-tailed too: keep the hot set in the fast tier, reach down to the slow tiers only for the rare stuff. A "cache miss" is when what you need isn't in the fast tier and you pay to fetch it from a slower one — exactly the cost you're avoiding on the common path.

Agents should have the same structure. Build your context as L1 / L2 / L3.

<pre><code lang="markdown">
     +---------------------------------------------+
 L1  | ALWAYS RESIDENT - tiny, instant.            |
     | The 80%. Lives in the system prompt.        |
     +---------------------------------------------+
                     | miss -> one cheap call
                     v
     +---------------------------------------------+
 L2  | ON DEMAND - curated English specs.          |
     | The next ~15%. One discovery step to load.  |
     +---------------------------------------------+
                     | miss -> read the skill, then search
                     v
     +---------------------------------------------+
 L3  | ESCAPE HATCH - the raw API tome.            |
     | The long tail. 3-6 grep calls to mine.      |
     +---------------------------------------------+
</code></pre>
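The tiers in the diagram behave like a fall-through lookup: check what is already resident, then the curated specs, then search the raw docs. The sketch below is only an illustration of that shape. The store contents, the `lookupContext` name, and the fixed count of 4 tool calls for L3 (picked from the 3-6 range in the diagram) are all assumptions.

```typescript
// Hypothetical three-tier context store; names and contents are illustrative.
type Hit = { tier: "L1" | "L2" | "L3"; toolCalls: number; text: string };

// L1: always in the system prompt, so a hit costs zero tool calls.
const l1Prompt = new Map<string, string>([
  ["read range", "how to read a block of cells"],
]);

// L2: curated specs, loaded on demand with one discovery call.
const l2Specs = new Map<string, string>([
  ["pivot table", "curated spec for building a pivot table"],
]);

// L3: raw reference text that has to be searched line by line.
const l3RawDocs: string[] = [
  "conditional formatting: raw reference entry",
  "named range: raw reference entry",
];

// Assumed cost of mining L3, chosen from the 3-6 grep calls in the diagram.
const L3_GREP_CALLS = 4;

function lookupContext(topic: string): Hit | undefined {
  const resident = l1Prompt.get(topic);
  if (resident !== undefined) return { tier: "L1", toolCalls: 0, text: resident };

  const spec = l2Specs.get(topic);
  if (spec !== undefined) return { tier: "L2", toolCalls: 1, text: spec };

  // Escape hatch: a case-insensitive substring search over the raw docs.
  const needle = topic.toLowerCase();
  const line = l3RawDocs.find((doc) => doc.toLowerCase().includes(needle));
  return line === undefined
    ? undefined
    : { tier: "L3", toolCalls: L3_GREP_CALLS, text: line };
}
```

The `toolCalls` field is the cache-miss penalty: zero on the common path, and growing only as the request gets rarer.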

Almost every optimization trades compression of information against speed of discovery. Put something in L1 and it's instant, but it costs prompt tokens on every single task whether it's used or not. Push it to L3 and it costs nothing until needed — but then it costs several tool calls to find. Your job is to place each capability at the tier that minimizes total cost across the distribution. That's the whole craft. Let me make it concrete with the domain I know best.
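A toy cost model makes the placement rule mechanical. Everything numeric below is an assumption for illustration, not a measurement from Shortcut: the discovery overheads, the per-task cost of listing an L2 spec in the prompt so it can be discovered, and the three example capabilities.

```typescript
type Tier = "L1" | "L2" | "L3";

// tokens: size of the capability's documentation.
// useRate: fraction of tasks that actually need it.
type Capability = { name: string; tokens: number; useRate: number };

// Assumed cost (in token-equivalents) of the tool calls needed to find something.
const DISCOVERY_COST: Record<Tier, number> = { L1: 0, L2: 500, L3: 3000 };

// Assumed always-paid cost of the L2 index entry that makes a spec discoverable.
const L2_INDEX_TOKENS = 30;

// Expected context cost per task of keeping one capability at a given tier.
function expectedCost(cap: Capability, tier: Tier): number {
  if (tier === "L1") return cap.tokens; // paid on every task, used or not
  const onUse = cap.useRate * (cap.tokens + DISCOVERY_COST[tier]); // paid only when needed
  return tier === "L2" ? L2_INDEX_TOKENS + onUse : onUse;
}

// Pick the tier with the lowest expected cost.
function bestTier(cap: Capability): Tier {
  const tiers: Tier[] = ["L1", "L2", "L3"];
  return tiers.reduce((best, tier) =>
    expectedCost(cap, tier) < expectedCost(cap, best) ? tier : best,
  );
}

const breadAndButter: Capability = { name: "read a range", tokens: 200, useRate: 0.8 };
const occasional: Capability = { name: "build a pivot table", tokens: 1500, useRate: 0.05 };
const longTail: Capability = { name: "obscure formatting rule", tokens: 800, useRate: 0.002 };

const placements = [breadAndButter, occasional, longTail].map(bestTier);
```

Under these made-up numbers the frequent capability lands in L1, the occasional one in L2, and the rare one in L3. The exact thresholds matter less than the shape: resident cost scales with every task, discovery cost scales only with use.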

## Aside: one tool, not thirty

Before the hierarchy, the substrate. Every spreadsheet capability I'm about to describe — every read, every write, every curated lookup — is code executed under a single tool.

<pre><code lang="typescript">async function execute() {
  const data = await sheet.getCellRange("Sheet1!A1:D200");
  // ...read, compute, write...
}</code></pre>

The agent writes code; the code calls our functions; the functions touch the sheet. There is no read_range tool, no write_range tool, no make_chart tool. There is one tool, and the API lives inside the code.

Why? Because model accuracy degrades as you add tools. That's been consistent in our own experiments. Every tool you add is more schema in the prompt, more surface to confuse, and more ways to pick the wrong one, especially if the tools have overlapping responsibilities. A single execute_code tool collapses all of that into one decision (write code) and lets the model compose capabilities with the full expressive power of a programming language or DSL instead of stitching together rigid tool calls (more on this in a future post).
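To make the single-tool shape concrete, here is a rough sketch of what one `execute_code` tool could look like. Only the tool name and `sheet.getCellRange` come from the text above. The schema layout (a JSON-Schema-style object, as many tool-calling APIs accept), the synchronous `Sheet` interface, `setCell`, and the in-memory `demoSheet` are assumptions for illustration.

```typescript
// The only schema the model sees: one tool with one string argument.
const executeCodeTool = {
  name: "execute_code",
  description: "Run a script against the spreadsheet API exposed as `sheet`.",
  input_schema: {
    type: "object",
    properties: { code: { type: "string" } },
    required: ["code"],
  },
} as const;

// Hypothetical, synchronous stand-in for the sheet API the script can call.
interface Sheet {
  getCellRange(range: string): number[][];
  setCell(cell: string, value: number): void;
}

// Runs model-written code with `sheet` in scope.
// new Function provides no isolation; a real deployment would need a sandbox.
function executeCode(code: string, sheet: Sheet): unknown {
  const script = new Function("sheet", code);
  return script(sheet);
}

// In-memory fake so the example runs without a spreadsheet.
const written = new Map<string, number>();
const demoSheet: Sheet = {
  getCellRange: () => [[1, 2], [3, 4]],
  setCell: (cell, value) => {
    written.set(cell, value);
  },
};

// What the model might send: read, compute, and write in one call.
const modelWrittenCode = `
  const rows = sheet.getCellRange("Sheet1!A1:B2");
  const total = rows.reduce((sum, row) => sum + row.reduce((a, b) => a + b, 0), 0);
  sheet.setCell("Sheet1!C1", total);
  return total;
`;

const scriptResult = executeCode(modelWrittenCode, demoSheet);
```

With separate read, compute, and write tools, the same task would take three tool selections and three round trips. Here the model makes one choice and composes the steps in code.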