Hi,👋 we have updated the app and fixed multiple bugs. We are lacking funds, request to free user not to use Adblock. Ads are non intrusive. 😊

✨ Visual Editor

close

palette Canvas & Background

Gradient:arrow_forward
Text Color:
135°

style Card Style

40px
16px

text_fields Typography

16px
Hanchen Li
@lihanc02
An agent that beats Claude Mythos on Terminal Bench and SWE-bench Verified?

🎉We are excited to share Terminator-1, our newest agent that achieved 95+% on SWE-bench Verified and Terminal-Bench with @MogicianTony!

We show that besides model capabilities, well-designed harness could actually boost the accuracy by 3x in coding tasks.

Well if you really wanted you could get 100% accuracy without solving a single task.

The actual finding is that most AI benchmarks can be easily reward-hacked with simple exploits. Read more about the same 7 design flaws that almost every evaluation has ⬇️
Thread image
Hanchen Li
@lihanc02
It is just a hack
Generated by Thread Navigator
100%
view_carousel Carousel Studio NEW
Press + S to quick-export