vik

@vikhyatk

how we implemented Moondream inference on Apple Silicon (spoiler: we don't use MLX) ⬇️ (1/N) https://twitter.com/mayfer/status/2050323883950313980

Photon, our inference engine, isn't fast just because of GPU kernels. A lot of the speedup comes from engine-level work: request scheduling, prefix caching, image processing, all tuned to keep the GPU saturated. https://moondream.ai/p/photon

Our engine is highly coupled with PyTorch. ~15k lines of Python and Rust... scheduler, KV manager, radix tree prefix caching, LoRA, image pipeline, skill state machines. Porting all of that to MLX would've meant maintaining two parallel runtimes forever... ouch.

So we tried using the MPS backend in PyTorch. Unfortunately, PyTorch-on-Metal is painfully slow. Every op is a Metal dispatch with ~100µs of host overhead, and our decode steps fire thousands of ops across all layers. That adds up fast.
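
To put rough numbers on that overhead, here is a back-of-envelope sketch. It takes the ~100µs per-dispatch figure from the post and assumes a hypothetical 2,000 ops per decode step (the post only says "thousands"). It also assumes dispatches are issued serially and that none of the host overhead is hidden behind GPU work, so treat the result as an illustrative bound, not a measurement.

```python
# Back-of-envelope estimate of host-side dispatch overhead per decode step.
DISPATCH_OVERHEAD_US = 100   # ~100 µs per Metal dispatch (figure quoted in the post)
OPS_PER_DECODE_STEP = 2_000  # ASSUMPTION: hypothetical count; the post says "thousands"


def host_overhead_ms(ops: int, overhead_us: float = DISPATCH_OVERHEAD_US) -> float:
    """Total dispatch overhead for one decode step, in milliseconds."""
    return ops * overhead_us / 1000.0


def max_tokens_per_second(ops: int, overhead_us: float = DISPATCH_OVERHEAD_US) -> float:
    """Upper bound on decode rate if dispatch overhead were the only cost."""
    return 1000.0 / host_overhead_ms(ops, overhead_us)


if __name__ == "__main__":
    print(f"{host_overhead_ms(OPS_PER_DECODE_STEP):.0f} ms of overhead per token")
    print(f"<= {max_tokens_per_second(OPS_PER_DECODE_STEP):.1f} tokens/s")
```

Under these assumptions the overhead alone is 200 ms per token, which caps decoding at 5 tokens per second before any actual compute happens.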

Fortunately, PyTorch lets you create custom Metal ops. We wrote ~1,600 lines of MSL kernels custom-built for our model shapes. Metal is similar to CUDA: threadgroups for blocks, simdgroups for warps, etc. If you know CUDA, a lot of the skills translate. https://x.com/vikhyatk/status/2047942358214607284?s=20

A good example: token sampling. The path from raw model output to a sampled token was 14 separate torch ops: temperature, softmax, top-K, top-P, etc. We fused it into one Metal kernel: 687µs → 130µs per token. Significant when you're decoding hundreds per response.
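
For readers unfamiliar with the steps being fused, here is a minimal, unfused reference of a typical sampling path in plain Python. This is not the Photon kernel, and the post does not list all 14 ops or their exact order. The function name `sample_token`, its parameter defaults, and the temperature → softmax → top-K → top-P ordering are assumptions chosen for illustration.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=random):
    """Unfused reference: temperature -> softmax -> top-K -> top-P -> draw.

    ASSUMPTION: illustrative ordering and defaults; requires temperature > 0.
    """
    # 1. Temperature scaling.
    scaled = [x / temperature for x in logits]
    # 2. Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3. Top-K: keep the K most probable token ids, most probable first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # 4. Top-P (nucleus): keep the shortest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # 5. Renormalise over the survivors and draw one token id.
    weights = [probs[i] / mass for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

In a framework like PyTorch, each numbered step here expands into one or more separately dispatched ops, which is what makes fusing them into a single kernel attractive. The quoted 687µs → 130µs works out to roughly a 5.3× reduction per token.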

Running on Apple Silicon will never be as fast as an H100. But for interactive workloads like computer use, wall-clock latency is dominated by the network, not the accelerator. Skipping a large image upload buys you more than the H100 buys back. https://x.com/mayfer/status/2050328884374388953
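
The trade-off can be sketched with a simple latency model. Every number in the usage lines is a hypothetical placeholder (image size, uplink speed, round-trip time, and both inference times); none come from the thread. The point is only to show how upload time enters the comparison.

```python
def upload_ms(image_bytes: int, uplink_mbps: float) -> float:
    """Time to push an image over the uplink, in milliseconds."""
    return image_bytes * 8 * 1000 / (uplink_mbps * 1_000_000)


def remote_latency_ms(image_bytes: int, uplink_mbps: float,
                      rtt_ms: float, server_infer_ms: float) -> float:
    """Wall-clock latency when the image is sent to a remote accelerator."""
    return rtt_ms + upload_ms(image_bytes, uplink_mbps) + server_infer_ms


def local_latency_ms(local_infer_ms: float) -> float:
    """Wall-clock latency when inference runs where the image already lives."""
    return local_infer_ms


if __name__ == "__main__":
    # ASSUMPTION: all values below are hypothetical, for illustration only.
    remote = remote_latency_ms(image_bytes=2_000_000, uplink_mbps=20.0,
                               rtt_ms=40.0, server_infer_ms=50.0)
    local = local_latency_ms(local_infer_ms=400.0)
    print(f"remote: {remote:.0f} ms, local: {local:.0f} ms")
```

With these placeholder values the upload alone takes longer than the slower local inference, so the remote path loses even though its accelerator is much faster. With a small image or a fast link the comparison can flip.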

Vision inference belongs on the edge. Images are large, latency is interactive, the data is often private. The right place to run a vision model is close to where the input lives. Photon is built to run there, on whatever hardware that means. https://moondream.ai/blog/photon-1-2-0-update