> I had a swarm running. 80 agents on the same task, the kind where you can check the answer at the end. About a third of them were quietly garbage.

I did what everyone does. Averaged all 80. Throw a pile of agents at it, average, the mess washes out. Error came back at 0.99. Useless.
So I tried something else. I let the agents grade each other against a small set of questions where I already knew the answer, and fire the worst. Cut the bad ones, average who's left.
0.135.
86% of the error, gone. Same agents. I didn't add anything. I removed.
## Why more agents was never the answer
If your agents are wrong in random, independent ways, adding more cancels the wrongness out. That's the whole pitch, and it's true.
But they all came off the same model. So they miss together. Same hallucinated convention, same misread of the spec, all leaning the same way. Averaging a stack of numbers that lean the same way doesn't move the lean.
Agent 300, agent 400, doesn't matter. The agent count on the slide is the most worthless number in the system, and nobody wants to hear it.
## So you cut instead
Stop trying to drown the bad agents. Remove them.
You need a verify gate. A few questions where you know the truth. Tests, anchors, whatever you have. Score every agent, cut the worst, average the survivors. 0.99 to 0.135.
A plain median on the same dirty swarm gives 0.56. A 20% trimmed mean, 0.82. The firing, 0.135.

Median and trim are blind. They cut a fixed amount and hope. Firing isn't blind. Same idea as trimming, except it knows where the bodies are buried.
## But you can't just crank it
Firing is not a slider you push to 100.
I pushed it. Error dropped, bottomed out, then climbed straight back up. 128% above the bottom by the time I'd gutted nearly everyone. Cut too deep and four agents are holding the whole answer, and four agents is loud and shaky.
The bottom sits further out than your gut says. 30% of my agents were bad. The best cut was 70%.

Generated by Thread Navigator
Press ⌘ + S to quick-export
