Visualize Thread by @omarsar0

elvis

@omarsar0

Introducing... Agent Leaderboard!

Many devs ask me which LLMs work best for AI agents.

The new Agent Leaderboard (by @rungalileo) was built to provide insights and evaluate LLMs on real-world tool-calling tasks—crucial for building AI agents.

Let's go over the results:

1️⃣ Leader

After evaluating 17 leading LLMs across 14 diverse datasets, here are the key findings:

Google's Gemini-2.0-flash leads with a 0.94 score at a remarkably low cost.

2️⃣ Pricing

The top 3 models span a 10x price difference with only a 4% performance gap. Many of you might be overpaying.

3️⃣ Open-source

Mistral AI's mistral-small-2501 leads open-source options, matching GPT-4o-mini at 0.83. Smaller models tuned for tool calling have a lot of potential.

4️⃣ Reasoning models

Reasoning models like o1 and o3-mini integrated well with function calling, but DeepSeek-R1 didn't make the rankings because it doesn't support native function calling (yet).

5️⃣ Edge cases

Claude-sonnet achieves standout performance in tool miss detection (0.92). In general, current models still struggle with edge cases.

6️⃣ Architecture trade-offs

Long context vs. parallel execution shows architectural limits: o1 leads long context (0.98) but fails parallel tasks (0.43), while GPT-4o shows the opposite pattern.

More results are in the links below:

Leaderboard: huggingface.co/spaces/galileo…

Blog: galileo.ai/blog/agent-lea…
