how we implemented Moondream inference on Apple Silicon (spoiler: we don't use MLX) ⬇️ (1/N) <a target="_blank" href="https://twitter.com/mayfer/status/2050323883950313980" color="blue">x.com/mayfer/status/…</a>

Photon, our inference engine, isn't fast just because of GPU kernels. A lot of the speedup comes from engine-level work: request scheduling, prefix caching, image processing, all tuned to keep the GPU saturated. <a target="_blank" href="https://moondream.ai/p/photon" color="blue">moondream.ai/p/photon</a>

Our engine is tightly coupled to PyTorch. ~15k lines of Python and Rust... scheduler, KV manager, radix tree prefix caching, LoRA, image pipeline, skill state machines. Porting all of that to MLX would've meant maintaining two parallel runtimes forever... ouch.
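
To give a feel for one piece of that engine-level work, here is a minimal sketch of prefix caching. It is not Photon's code: it uses a plain token trie (one token per edge) rather than a compressed radix tree, and the `PrefixCache` class and its method names are hypothetical. The idea it illustrates is just that a new request can find how many leading tokens it shares with earlier requests, so the KV entries for that prefix could be reused instead of recomputed.

```python
class PrefixCache:
    """Toy prefix index: a trie keyed by token id (hypothetical, for illustration)."""

    def __init__(self):
        # Each node is a dict mapping token id -> child node.
        self.root = {}

    def insert(self, tokens):
        """Record a token sequence that has already been processed."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def longest_prefix(self, tokens):
        """Return how many leading tokens of `tokens` are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])          # e.g. a shared system prompt plus one query
reused = cache.longest_prefix([1, 2, 3, 9])
print(reused)                       # number of tokens whose KV state could be reused
```

A real radix tree would collapse single-child chains into one edge and attach KV block references to nodes; this sketch only shows the lookup.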

So we tried using the MPS backend in PyTorch. Unfortunately, PyTorch-on-Metal is painfully slow. Every op is a Metal dispatch with ~100µs of host overhead, and our decode steps fire thousands of ops across all layers. That adds up fast.
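
A back-of-the-envelope version of that math, as a sketch. The ~100µs per-op figure is from above; the op count of 2,000 per decode step is an assumed stand-in for "thousands", and the model assumes dispatches are issued one after another with no overlap, so it is a rough upper bound on the overhead rather than a measurement.

```python
def host_overhead_ms(ops_per_step, overhead_us_per_op=100.0):
    """Host-side dispatch cost for one decode step, in milliseconds."""
    return ops_per_step * overhead_us_per_op / 1000.0


def max_tokens_per_second(ops_per_step, overhead_us_per_op=100.0):
    """Throughput ceiling if dispatch overhead were the only cost."""
    return 1000.0 / host_overhead_ms(ops_per_step, overhead_us_per_op)


# Assumed op count per decode step (hypothetical, not a measured value).
ASSUMED_OPS_PER_STEP = 2000

print(host_overhead_ms(ASSUMED_OPS_PER_STEP))       # 200.0 ms of pure dispatch overhead
print(max_tokens_per_second(ASSUMED_OPS_PER_STEP))  # 5.0 tokens/s ceiling
```

Under those assumptions the GPU could be infinitely fast and decoding would still be capped by dispatch cost, which is why cutting the number of ops matters more than speeding up any single one.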

Fortunately, PyTorch lets you create custom Metal ops. We wrote ~1,600 lines of MSL kernels custom-built for our model shapes. Metal is similar to CUDA: threadgroups for blocks, simdgroups for warps etc. If you know CUDA, a lot of the skills translate. <a target="_blank" href="https://x.com/vikhyatk/status/2047942358214607284?s=20" color="blue">x.com/vikhyatk/statu…</a>

A good example: token sampling. The path from raw model output to a sampled token was 14 separate torch ops: temperature, softmax, top-K, top-P, etc. We fused it into one Metal kernel: 687µs → 130µs per token. Significant when you're decoding hundreds per response.
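
For reference, the stages being fused look roughly like this. This is a plain-Python sketch of a standard temperature / softmax / top-K / top-P sampler, not the Metal kernel and not necessarily the same stage ordering Photon uses; the function name and default values are assumptions. In the unfused version each stage is its own op (or several); a fused kernel does all of it in one launch.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=None):
    """Sample a token index from raw logits (illustrative reference only)."""
    rng = rng or random.Random()

    # Greedy decoding when temperature is zero or negative.
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # 1. Temperature scaling.
    scaled = [x / temperature for x in logits]

    # 2. Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-K: keep the k most probable tokens (at least one).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:max(1, top_k)]

    # 4. Top-P: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # 5. Draw from the surviving tokens, weighted by their probabilities.
    threshold = rng.random() * sum(probs[i] for i in kept)
    running = 0.0
    for i in kept:
        running += probs[i]
        if threshold < running:
            return i
    return kept[-1]


token = sample_token([1.0, 3.0, 0.5, 2.0], temperature=0.8, top_k=3, top_p=0.95,
                     rng=random.Random(0))
print(token)
```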

Running on Apple Silicon will never be as fast as an H100. But for interactive workloads like computer use, wall-clock latency is dominated by the network, not the accelerator. Skipping a large image upload buys you more than the H100 buys back. <a target="_blank" href="https://x.com/mayfer/status/2050328884374388953" color="blue">x.com/mayfer/status/…</a>
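
A rough way to sanity-check that tradeoff for your own setup. Every number below is a hypothetical placeholder (image size, uplink speed, round-trip time, and both compute times), not a benchmark; the point is only the shape of the comparison: remote latency pays for the upload before any compute starts.

```python
def upload_ms(size_mb, uplink_mbps, rtt_ms):
    """Time to ship an image to a remote server: one round trip plus transfer."""
    return rtt_ms + size_mb * 8 * 1000 / uplink_mbps


def remote_latency_ms(size_mb, uplink_mbps, rtt_ms, remote_compute_ms):
    """End-to-end latency when inference runs in a datacenter."""
    return upload_ms(size_mb, uplink_mbps, rtt_ms) + remote_compute_ms


# Hypothetical inputs, chosen only to illustrate the comparison.
IMAGE_MB = 2              # screenshot size
UPLINK_MBPS = 20          # client upload bandwidth
RTT_MS = 50               # network round trip
REMOTE_COMPUTE_MS = 100   # assumed datacenter GPU inference time
LOCAL_COMPUTE_MS = 400    # assumed on-device inference time

remote = remote_latency_ms(IMAGE_MB, UPLINK_MBPS, RTT_MS, REMOTE_COMPUTE_MS)
print(remote, LOCAL_COMPUTE_MS)
```

With these made-up inputs the local path wins even though its compute is assumed to be 4x slower; plug in your own numbers to see where the crossover sits.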

Vision inference belongs on the edge. Images are large, latency is interactive, the data is often private. The right place to run a vision model is close to where the input lives. Photon is built to run there, on whatever hardware that means. <a target="_blank" href="https://moondream.ai/blog/photon-1-2-0-update" color="blue">moondream.ai/blog/photon-1-…</a>