Visualize Thread by @vikhyatk

vik

@vikhyatk

how we implemented Moondream inference on Apple Silicon (spoiler: we don't use MLX)

⬇️ (1/N)

Photon, our inference engine, isn't fast just because of GPU kernels. A lot of the speedup comes from engine-level work: request scheduling, prefix caching, image processing, all tuned to keep the GPU saturated. moondream.ai/p/photon

Our engine is highly coupled with PyTorch. ~15k lines of Python and Rust... scheduler, KV manager, radix tree prefix caching, LoRA, image pipeline, skill state machines.

Porting all of that to MLX would've meant maintaining two parallel runtimes forever... ouch.

So we tried using the MPS backend in PyTorch. Unfortunately, PyTorch-on-Metal is painfully slow. Every op is a Metal dispatch with ~100µs of host overhead, and our decode steps fire thousands of ops across all layers. That adds up fast.
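
To see how quickly that adds up, here is a rough back-of-the-envelope sketch. Only the ~100µs figure comes from the thread; the op count per decode step and the tokens per response are assumed round numbers, and the result is an upper bound that treats every dispatch as fully serialized on the host.

```python
# Back-of-the-envelope cost of per-op dispatch overhead during decoding.
# The ~100 µs figure is from the thread; the op and token counts are assumed.

HOST_OVERHEAD_US = 100        # per-op host overhead quoted in the thread
OPS_PER_DECODE_STEP = 2_000   # assumption: "thousands of ops" taken as 2,000
TOKENS_PER_RESPONSE = 200     # assumption: a couple hundred tokens per response


def dispatch_overhead_ms(ops: int, overhead_us: float) -> float:
    """Host-side dispatch overhead for `ops` ops, in milliseconds."""
    return ops * overhead_us / 1_000


per_step_ms = dispatch_overhead_ms(OPS_PER_DECODE_STEP, HOST_OVERHEAD_US)
per_response_s = per_step_ms * TOKENS_PER_RESPONSE / 1_000

print(f"overhead per decode step: {per_step_ms:.0f} ms")
print(f"overhead per response:    {per_response_s:.0f} s")
```

Under these assumptions the dispatch overhead alone dwarfs any useful GPU work, which is why cutting the number of dispatches matters more than speeding up any single op.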

Fortunately, PyTorch lets you create custom Metal ops. We wrote ~1,600 lines of MSL kernels custom-built for our model shapes. Metal is similar to CUDA: threadgroups for blocks, simdgroups for warps, etc. If you know CUDA, a lot of the skills translate.

A good example: token sampling. The path from raw model output to a sampled token was 14 separate torch ops: temperature, softmax, top-K, top-P, etc. We fused it into one Metal kernel: 687µs → 130µs per token. Significant when you're decoding hundreds per response.
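
For readers who haven't written a sampler, here is a minimal pure-Python reference of the stages named above (temperature, softmax, top-K, top-P, then a draw). It is a sketch, not Photon's kernel: the function name, the stage order, and the default parameters are all assumptions. In an unfused setup each numbered step would be one or more separate ops, and a fused kernel would do the same work in a single launch.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=random):
    """Reference sampler: temperature -> softmax -> top-k -> top-p -> draw.

    Hypothetical helper for illustration; defaults are assumed values.
    """
    # 1. Temperature scaling.
    scaled = [x / temperature for x in logits]

    # 2. Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 3. Top-k: keep the k most likely token ids, most likely first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ranked = ranked[:top_k]

    # 4. Top-p: keep the shortest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # 5. Draw from the survivors, weighting by their (unnormalised) mass.
    r = rng.random() * mass
    running = 0.0
    for i in kept:
        running += probs[i]
        if r < running:
            return i
    return kept[-1]  # guard against floating-point round-off


# With top_k=1 the sampler degenerates to greedy argmax decoding.
print(sample_token([1.0, 3.0, 2.0], top_k=1))
```

Every intermediate list here corresponds to a buffer that an unfused pipeline would write out and read back between ops; fusing keeps that data on-chip and pays the dispatch cost once.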

Running on Apple Silicon will never be as fast as an H100. But for interactive workloads like computer use, wall-clock latency is dominated by the network, not the accelerator. Skipping a large image upload buys you more than the H100 buys back.
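
A rough way to reason about that trade-off, as a sketch: model the remote path as one round trip plus the image transfer plus server compute, and compare it with local compute alone. The function names and every number below (image size, uplink speed, round-trip time, compute times) are assumptions chosen for illustration, not measurements from the thread.

```python
def upload_ms(image_bytes: int, uplink_mbit_s: float, rtt_ms: float) -> float:
    """Time to ship one image to a remote server: one round trip plus transfer."""
    transfer_ms = image_bytes * 8 * 1_000 / (uplink_mbit_s * 1_000_000)
    return rtt_ms + transfer_ms


def remote_latency_ms(image_bytes, uplink_mbit_s, rtt_ms, remote_compute_ms):
    """End-to-end latency when the image has to travel to a datacenter GPU."""
    return upload_ms(image_bytes, uplink_mbit_s, rtt_ms) + remote_compute_ms


# All values below are assumed for illustration.
IMAGE_BYTES = 2_000_000      # a ~2 MB screenshot
UPLINK_MBIT_S = 20           # consumer uplink
RTT_MS = 50                  # round trip to the datacenter
REMOTE_COMPUTE_MS = 100      # fast server-side inference
LOCAL_COMPUTE_MS = 500       # slower on-device inference

remote = remote_latency_ms(IMAGE_BYTES, UPLINK_MBIT_S, RTT_MS, REMOTE_COMPUTE_MS)
print(f"remote: {remote:.0f} ms, local: {LOCAL_COMPUTE_MS} ms")
```

With these assumed numbers the local path wins even though its compute is several times slower, because the upload alone costs more than the local inference.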

Vision inference belongs on the edge. Images are large, latency is interactive, the data is often private. The right place to run a vision model is close to where the input lives. Photon is built to run there, on whatever hardware that means. moondream.ai/blog/photon-1-…
