@ggerganov: Let me demonstrate the true po...
@ggerganov
Apr 05, 2026
1
Let me demonstrate the true power of llama.cpp:
- Running on Mac Studio M2 Ultra (3 years old)
- Gemma 4 26B A4B Q8_0 (full quality)
- Built-in WebUI (ships with llama.cpp)
- MCP support out of the box (web-search, HF, github, etc.)
- Prompt speculative decoding
The result: 300t/s
(realtime video)
2
Of course, this is a trivial example of prompt-based speculative decoding, because the model recites sections from what is already in the prompt (so don't get too excited 😉).
Still, it's a nice and quick showcase of some of llama.cpp's capabilities.
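For anyone wondering why reciting text from the prompt is the easy case, here is a minimal sketch of the general idea behind prompt-based (n-gram lookup) drafting. This is not llama.cpp's actual code: the names `propose_draft` and `count_accepted`, the `ngram_size` and `max_draft` parameters, and the use of plain word strings as tokens are all assumptions made for illustration. The draft is whatever followed the most recent earlier occurrence of the last few tokens, and the target model then keeps the longest prefix it agrees with.

```python
def propose_draft(tokens, ngram_size=3, max_draft=8):
    """Propose draft tokens by copying what followed an earlier match.

    Hypothetical helper: looks for the most recent earlier occurrence of
    the trailing n-gram and returns up to max_draft tokens that came
    right after it. Returns an empty list when there is no match.
    """
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Walk backwards so the most recent earlier match wins; the start
    # index stops one short of the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            after = start + ngram_size
            return tokens[after:after + max_draft]
    return []


def count_accepted(draft, target_next):
    """Count how many leading draft tokens match the target model's tokens.

    target_next stands in for what the large model would have produced
    at the same positions (an assumption: a real system gets these from
    a single batched forward pass over the draft).
    """
    accepted = 0
    for drafted, expected in zip(draft, target_next):
        if drafted != expected:
            break
        accepted += 1
    return accepted


if __name__ == "__main__":
    context = "the quick brown fox jumps . the quick brown".split()
    draft = propose_draft(context, ngram_size=3, max_draft=2)
    print(draft)
    print(count_accepted(draft, ["fox", "jumps"]))
```

When the model is quoting the prompt, nearly every drafted token is accepted, so many tokens are confirmed per forward pass. On text that does not repeat the context, the draft is usually empty or rejected early and the speedup mostly disappears.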
3
llama.cpp with its integrated WebUI is effectively the most lightweight and self-contained agent that you can run locally.
Here are a few more examples of using @huggingface MCP to search for models