@lihanc02: An agent that beats Claude Myt...
@lihanc02
Apr 10, 2026
1
An agent that beats Claude Mythos on Terminal Bench and SWE-bench Verified?
🎉 We are excited to share Terminator-1, our newest agent with @MogicianTony, which achieved 95+% on SWE-bench Verified and Terminal-Bench!
We show that, beyond model capabilities, a well-designed harness can boost accuracy on coding tasks by 3x.
Well, if you really wanted to, you could get 100% accuracy without solving a single task.
The actual finding is that most AI benchmarks can be easily reward-hacked with simple exploits. Read more about the same 7 design flaws that almost every evaluation has ⬇️
2
It is just a hack
