vik
@vikhyatk
how we implemented Moondream inference on Apple Silicon (spoiler: we don't use MLX)

⬇️ (1/N)
vik
@vikhyatk
Photon, our inference engine, isn't fast just because of GPU kernels. A lot of the speedup comes from engine-level work: request scheduling, prefix caching, image processing, all tuned to keep the GPU saturated. moondream.ai/p/photon
vik
@vikhyatk
Our engine is highly coupled with PyTorch. ~15k lines of Python and Rust... scheduler, KV manager, radix tree prefix caching, LoRA, image pipeline, skill state machines.

Porting all of that to MLX would've meant maintaining two parallel runtimes forever... ouch.
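
For readers who haven't met prefix caching: a minimal sketch of the idea, not Photon's code. It uses a plain token-level trie (a real radix tree would compress single-child chains), and `PrefixNode`, `PrefixCache`, and the `cached` flag are hypothetical names. The lookup returns how many leading tokens of a new request already have cached state, so only the remainder needs prefill.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}    # token id -> PrefixNode
        self.cached = False   # True if state for the prefix ending here is stored


class PrefixCache:
    """Toy token trie; stands in for a radix tree in this sketch."""

    def __init__(self):
        self.root = PrefixNode()

    def insert(self, tokens):
        # Mark every prefix of `tokens` as having cached state.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())
            node.cached = True

    def longest_prefix(self, tokens):
        # Count how many leading tokens match a cached prefix.
        node = self.root
        matched = 0
        for tok in tokens:
            node = node.children.get(tok)
            if node is None or not node.cached:
                break
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                   # e.g. a shared system prompt
reused = cache.longest_prefix([1, 2, 3, 9])  # new request sharing 3 tokens
print(reused)
```

A production cache would also need eviction and reference counting, which this sketch leaves out.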
vik
@vikhyatk
So we tried using the MPS backend in PyTorch. Unfortunately, PyTorch-on-Metal is painfully slow. Every op is a Metal dispatch with ~100µs of host overhead, and our decode steps fire thousands of ops across all layers. That adds up fast.
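
A back-of-the-envelope sketch of why that adds up. The ~100µs per-op figure is from the thread; the op counts are assumed values picked to match "thousands", not measurements. The function names are hypothetical.

```python
def dispatch_overhead_ms(ops_per_step, overhead_us_per_op=100.0):
    # Host-side overhead alone, ignoring the time the GPU spends computing.
    return ops_per_step * overhead_us_per_op / 1000.0


def max_tokens_per_second(ops_per_step, overhead_us_per_op=100.0):
    # Upper bound on decode speed if each step produces one token
    # and dispatch overhead is the only cost.
    return 1000.0 / dispatch_overhead_ms(ops_per_step, overhead_us_per_op)


for ops in (1000, 2000, 4000):  # assumed op counts per decode step
    print(ops, dispatch_overhead_ms(ops), max_tokens_per_second(ops))
```

Under these assumptions, 2,000 ops per step costs 200 ms of pure dispatch overhead, which caps decoding at 5 tokens per second before any actual compute happens.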
vik
@vikhyatk
Fortunately, PyTorch lets you create custom Metal ops. We wrote ~1,600 lines of MSL kernels custom-built for our model shapes. Metal is similar to CUDA: threadgroups for blocks, simdgroups for warps, etc. If you know CUDA, a lot of the skills translate.
vik
@vikhyatk
A good example: token sampling. The path from raw model output to a sampled token was 14 separate torch ops: temperature, softmax, top-K, top-P, etc. We fused it into one Metal kernel: 687µs → 130µs per token. Significant when you're decoding hundreds per response.
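
To make the sampling path concrete, here is a plain-Python reference for the steps named above (temperature, softmax, top-K, top-P, then a draw). It is a sketch of the unfused logic only, not the Metal kernel and not Photon's implementation; `sample_token` and its default parameters are hypothetical, and `temperature` is assumed to be greater than zero.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=random):
    # 1. Temperature scaling (assumes temperature > 0).
    scaled = [x / temperature for x in logits]

    # 2. Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-K: keep the k most probable token ids.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:top_k]

    # 4. Top-P: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # 5. Draw from the surviving tokens; choices() normalizes the weights.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


token = sample_token([0.1, 2.5, 0.3, 1.7], temperature=0.8, rng=random.Random(0))
print(token)
```

Each numbered step maps to one or more separate tensor ops in an unfused pipeline, which is where the per-op dispatch cost comes from. At the thread's numbers, fusing saves 557µs per token, roughly a 5.3x speedup on this stage.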
vik
@vikhyatk
Running on Apple Silicon will never be as fast as an H100. But for interactive workloads like computer use, wall-clock latency is dominated by the network, not the accelerator. Skipping a large image upload buys you more than the H100 buys back.
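
A rough way to sanity-check that tradeoff. Every number below is a hypothetical input chosen for illustration (image size, uplink speed, round-trip time, and both compute times); none of them come from the thread or from benchmarks. The helper names are hypothetical too.

```python
def upload_ms(image_bytes, uplink_mbps):
    # Time to push the image over the uplink, ignoring protocol overhead.
    return image_bytes * 8 * 1000.0 / (uplink_mbps * 1_000_000)


def total_latency_ms(compute_ms, image_bytes=0, uplink_mbps=None, rtt_ms=0.0):
    # Wall-clock latency = accelerator time + network time (if any).
    network = rtt_ms
    if image_bytes and uplink_mbps:
        network += upload_ms(image_bytes, uplink_mbps)
    return compute_ms + network


# Hypothetical scenario: 2 MB screenshot, 20 Mbps uplink, 50 ms round trip.
remote = total_latency_ms(compute_ms=100.0, image_bytes=2_000_000,
                          uplink_mbps=20, rtt_ms=50.0)
local = total_latency_ms(compute_ms=400.0)  # slower chip, no upload
print(remote, local)
```

With these made-up inputs the remote path spends 800 ms on upload alone, so the local path wins even with 4x slower compute. On a fast uplink or with small images the balance can tip the other way.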
vik
@vikhyatk
Vision inference belongs on the edge. Images are large, latency is interactive, the data is often private. The right place to run a vision model is close to where the input lives. Photon is built to run there, on whatever hardware that means. moondream.ai/blog/photon-1-…