Hi,👋 we have updated the app and fixed multiple bugs. We are lacking funds, request to free user not to use Adblock. Ads are non intrusive. 😊

@lihanc02: An agent that beats Claude Myt...

@lihanc02
11 views Apr 10, 2026
Advertisement
1
An agent that beats Claude Mythos on Terminal Bench and SWE-bench Verified?

🎉We are excited to share Terminator-1, our newest agent that achieved 95+% on SWE-bench Verified and Terminal-Bench with @MogicianTony!

We show that besides model capabilities, well-designed harness could actually boost the accuracy by 3x in coding tasks.

Well if you really wanted you could get 100% accuracy without solving a single task.

The actual finding is that most AI benchmarks can be easily reward-hacked with simple exploits. Read more about the same 7 design flaws that almost every evaluation has ⬇️
Media image
2
It is just a hack
Actions
Visual Editor Carousel Maker NEW
Update Thread
What You Can Do
  • Download as PDF
  • Save to Notion
  • Export as Markdown
  • Visual Editor
  • LinkedIn & Instagram Carousel Maker
Create Free Account

Includes 7-day Premium trial

Advertisement