Thread Truncated

Only the first 20 tweets are shown to ensure high-quality rendering and prevent image size issues.

gemchanger
@gemchange_ltd
I had a swarm running. 80 agents on the same task, the kind where you can check the answer at the end. About a third of them were quietly garbage.
I did what everyone does. Averaged all 80. Throw a pile of agents at it, average, the mess washes out. Error came back at 0.99. Useless.
So I tried something else. I let the agents grade each other against a small set of questions where I already knew the answer, then fired the worst. Cut the bad ones, average who's left.
0.135.
86% of the error, gone. Same agents. I didn't add anything. I removed.
## Why more agents was never the answer
If your agents are wrong in random, independent ways, adding more cancels the wrongness out. That's the whole pitch, and it's true.
But they all came off the same model. So they miss together. Same hallucinated convention, same misread of the spec, all leaning the same way. Averaging a stack of numbers that lean the same way doesn't move the lean.
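A minimal sketch of that point, on made-up numbers rather than the swarm from this thread. It assumes each agent reports the truth plus its own independent noise, plus (in the second column) a bias every agent shares. `swarm_mean_error`, `shared_bias` and `noise_sd` are hypothetical names for illustration.

```python
import random
import statistics

def swarm_mean_error(n_agents, shared_bias, noise_sd, truth=0.0, seed=0):
    """Absolute error of the plain average over n_agents synthetic agents."""
    rng = random.Random(seed)
    # Each agent reports: truth + a bias common to ALL agents + its own independent noise.
    answers = [truth + shared_bias + rng.gauss(0.0, noise_sd) for _ in range(n_agents)]
    return abs(statistics.fmean(answers) - truth)

for n in (10, 100, 1000, 10000):
    independent = swarm_mean_error(n, shared_bias=0.0, noise_sd=1.0)
    leaning = swarm_mean_error(n, shared_bias=0.5, noise_sd=1.0)
    print(f"n={n:>5}  independent noise only: {independent:.3f}  shared lean of 0.5: {leaning:.3f}")
```

The independent-noise column should shrink as `n` grows, while the shared-lean column should settle near 0.5 no matter how many agents are added.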
Agent 300, agent 400, doesn't matter. The agent count on the slide is the most worthless number in the system, and nobody wants to hear it.
## So you cut instead
Stop trying to drown the bad agents. Remove them.
You need a verify gate. A few questions where you know the truth. Tests, anchors, whatever you have. Score every agent, cut the worst, average the survivors. 0.99 to 0.135.
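One way the gate could look in code. This is a sketch under assumptions, not the setup used in this thread: every agent gives a numeric answer, and each agent already has an error score on the known-answer questions (lower is better). `fire_and_average`, `anchor_errors` and `cut_fraction` are hypothetical names.

```python
import statistics

def fire_and_average(answers, anchor_errors, cut_fraction):
    """Average the answers of the agents that survive a verify gate.

    answers:       one final answer per agent on the real task
    anchor_errors: one error score per agent on the known-answer questions
    cut_fraction:  share of agents to fire, worst anchor score first
    """
    if len(answers) != len(anchor_errors):
        raise ValueError("need exactly one anchor score per agent")
    n_fire = round(len(answers) * cut_fraction)
    n_keep = max(1, len(answers) - n_fire)  # always keep at least one agent
    # Rank agent indices from best (lowest anchor error) to worst.
    ranked = sorted(range(len(answers)), key=lambda i: anchor_errors[i])
    survivors = [answers[i] for i in ranked[:n_keep]]
    return statistics.fmean(survivors)

# Toy usage: the last two agents score badly on the anchors and get fired.
answers = [1.0, 1.1, 0.9, 5.0, 4.0]
anchor_errors = [0.1, 0.2, 0.1, 3.0, 2.5]
print(fire_and_average(answers, anchor_errors, cut_fraction=0.4))
```

The gate never looks at the final answers when deciding who to fire. It only uses the anchor scores, so it cannot cherry-pick the result it wants.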
A plain median on the same dirty swarm gives 0.56. A 20% trimmed mean, 0.82. The firing, 0.135.
Median and trim are blind. They cut a fixed amount and hope. Firing isn't blind. Same idea as trimming, except it knows where the bodies are buried.
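A rough way to see the difference between blind and informed cutting, on a synthetic swarm. Everything here is assumed for illustration: 80 agents, 24 of them sharing a one-sided offset of 3.0, five anchor questions, and a "20% trimmed mean" that drops 20% from each end. It will not reproduce the 0.99 / 0.56 / 0.82 / 0.135 figures above, since those came from a different setup.

```python
import random
import statistics

def trimmed_mean(values, trim_fraction):
    """Drop trim_fraction of the values from EACH end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    return statistics.fmean(ordered[k:len(ordered) - k])

def compare(seed=0, n_agents=80, n_bad=24, n_anchors=5, truth=0.0):
    """Absolute error of four aggregators on one synthetic swarm."""
    rng = random.Random(seed)
    answers, anchor_errors = [], []
    for i in range(n_agents):
        offset = 3.0 if i < n_bad else 0.0  # assumed shared lean of the bad agents
        answers.append(truth + offset + rng.gauss(0.0, 1.0))
        # Anchor score: mean absolute error on questions with a known answer.
        anchor_errors.append(statistics.fmean(
            [abs(offset + rng.gauss(0.0, 1.0)) for _ in range(n_anchors)]))
    # Informed cut: fire the n_bad agents with the worst anchor scores.
    ranked = sorted(range(n_agents), key=lambda i: anchor_errors[i])
    survivors = [answers[i] for i in ranked[:n_agents - n_bad]]
    estimates = {
        "mean": statistics.fmean(answers),
        "median": statistics.median(answers),
        "trimmed 20%": trimmed_mean(answers, 0.2),
        "fire worst 30%": statistics.fmean(survivors),
    }
    return {name: abs(value - truth) for name, value in estimates.items()}

errors = compare()
for name, err in errors.items():
    print(f"{name:>15}: {err:.3f}")
```

The median and the trimmed mean only see the final answers, so they cut by position. The firing rule uses outside information (the anchor scores) to decide who goes.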
## But you can't just crank it
Firing is not a slider you push to 100.
I pushed it. Error dropped, bottomed out, then climbed straight back up. 128% above the bottom by the time I'd gutted nearly everyone. Cut too deep and four agents are holding the whole answer, and four agents is loud and shaky.
The bottom sits further out than your gut says. 30% of my agents were bad. The best cut was 70%.
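The sweep itself is easy to sketch. The version below runs it on a synthetic swarm with assumed parameters (80 agents, 24 bad with a shared offset of 3.0, five anchor questions, error averaged over 100 random swarms). It is not tuned to reproduce the 70% or 128% figures above, because where the bottom lands depends on how noisy the anchor scores are and how the bad agents fail.

```python
import random
import statistics

def swarm_error(cut_fraction, rng, n_agents=80, n_bad=24, n_anchors=5, truth=0.0):
    """Absolute error of the survivor average for one synthetic swarm."""
    agents = []
    for i in range(n_agents):
        offset = 3.0 if i < n_bad else 0.0  # assumed shared lean of the bad agents
        anchor_error = statistics.fmean(
            [abs(offset + rng.gauss(0.0, 1.0)) for _ in range(n_anchors)])
        answer = truth + offset + rng.gauss(0.0, 1.0)
        agents.append((anchor_error, answer))
    agents.sort()  # best anchor score first
    n_keep = max(1, n_agents - round(n_agents * cut_fraction))
    return abs(statistics.fmean([answer for _, answer in agents[:n_keep]]) - truth)

def sweep(cut_fractions, trials=100, seed=0):
    """Mean error at each cut fraction, averaged over `trials` random swarms."""
    rng = random.Random(seed)
    return {c: statistics.fmean([swarm_error(c, rng) for _ in range(trials)])
            for c in cut_fractions}

cuts = [i / 20 for i in range(20)]  # 0.00, 0.05, ..., 0.95
curve = sweep(cuts)
best = min(curve, key=curve.get)
for c in cuts:
    print(f"cut {c:.2f}: error {curve[c]:.3f}")
print(f"lowest error at cut {best:.2f}")
```

At the deepest cut only four agents survive, so the average rests on very few noisy answers. That is the variance cost that pushes the curve back up.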