Carousel Studio

Repurpose X Threads into LinkedIn & Instagram Carousels

Canvas & Ratio

Choose your destination platform format


Layout Template

Choose a content structure for your slides


Preset Themes


Typography & Sizing

Title Font Size: 36px
Body Font Size: 18px
Header & Footer Size: 12px

Brand Kit Customization

AGENCY

Configure brand assets for headers & footers

MULTI-PROFILES (AGENCY)
SAVE PRESETS (AGENCY)

Outro Slide CTA

Customize your closing call-to-action slide

#1
#2
#3

Background Pattern

Source Content

Build Your Carousel

Drag and drop any post card below onto a slide, or use the quick buttons to insert content/images instantly!

Drag Post #1
elvis
@omarsar0

Introducing... Agent Leaderboard! Many devs ask me which LLMs work best for AI agents. The new Agent Leaderboard (by @rungalileo) was built to provide insights and evaluate LLMs on real-world tool-calling tasks—crucial for building AI agents. Let's go over the results:

Apply Image
Drag Post #2
elvis
@omarsar0

1️⃣ Leader After evaluating 17 leading LLMs across 14 diverse datasets, here are the key findings: Google's Gemini-2.0-flash leads with a 0.94 score at a remarkably low cost.

Drag Post #3
elvis
@omarsar0

2️⃣ Pricing The top 3 models span a 10x price difference with only 4% performance gap. Many of you might be overpaying.

Drag Post #4
elvis
@omarsar0

3️⃣ Open-source Mistral AI's mistral-small-2501 leads open-source options, matching GPT-4o-mini at 0.83. Smaller models tuned for tool calling have a lot of potential.

Drag Post #5
elvis
@omarsar0

4️⃣ Reasoning models While reasoning models like o1 and o3-mini demonstrated excellent integration with function calling capabilities, DeepSeek-R1 didn't make the rankings as it doesn't support native function calling (yet).

Drag Post #6
elvis
@omarsar0

5️⃣ Edge cases Claude-sonnet achieves standout performance in tool miss detection (0.92). In general, current models still struggle with edge cases.

Drag Post #7
elvis
@omarsar0

6️⃣ Architecture trade-offs Long context vs. parallel execution shows architectural limits: o1 leads long context (0.98) but fails parallel tasks (0.43), while GPT-4o shows the opposite pattern. More results are in the links below:

Drag Post #8
elvis
@omarsar0

Leaderboard: https://huggingface.co/spaces/galileo-ai/agent-leaderboard Blog: http://galileo.ai/blog/agent-leaderboard