elvis

@omarsar0

Introducing... Agent Leaderboard! Many devs ask me which LLMs work best for AI agents. The new Agent Leaderboard (by @rungalileo) was built to provide insights and evaluate LLMs on real-world tool-calling tasks—crucial for building AI agents. Let's go over the results:

1️⃣ Leader
After evaluating 17 leading LLMs across 14 diverse datasets, here are the key findings: Google's Gemini-2.0-flash leads with a 0.94 score at a remarkably low cost.

2️⃣ Pricing
The top 3 models span a 10x price difference with only a 4% performance gap. Many of you might be overpaying.

3️⃣ Open-source
Mistral AI's mistral-small-2501 leads open-source options, matching GPT-4o-mini at 0.83. Smaller models tuned for tool calling have a lot of potential.

4️⃣ Reasoning models
Reasoning models like o1 and o3-mini integrated well with function calling, but DeepSeek-R1 didn't make the rankings because it doesn't support native function calling (yet).

5️⃣ Edge cases
Claude-sonnet achieves standout performance in tool miss detection (0.92). In general, current models still struggle with edge cases.

6️⃣ Architecture trade-offs
Long context vs. parallel execution exposes architectural limits: o1 leads on long context (0.98) but fails parallel tasks (0.43), while GPT-4o shows the opposite pattern. More results are in the links below:

Leaderboard: https://huggingface.co/spaces/galileo-ai/agent-leaderboard
Blog: http://galileo.ai/blog/agent-leaderboard