Visualize Thread by @banana_baeee

Banana

@banana_baeee

Finally broke the 3k tokens-per-second input/prompt processing barrier for Qwen 3.5 27B on Spark/GB10, thanks to FlashQLA!

Results and steps to reproduce are up on @LottoLabs LocalMaxxing here: localmaxxing.com/runs/cmouqgx9q…

3130 t/s at pp2048 is close to 4x faster than the fastest M5 Max number I could find on Reddit.

For long-running agents, input token processing can be at least as important as output token processing, and Spark shines for that!
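
To put the prefill number in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes throughput stays constant at the pp2048 figure, which will not hold exactly at longer contexts. The 100,000-token context and the "baseline" rate (simply the Spark figure divided by 4, standing in for "close to 4x") are illustrative assumptions, not measured results.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to process a prompt at a constant prefill throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


SPARK_PP = 3130.0            # t/s, the pp2048 figure quoted in the thread
BASELINE_PP = SPARK_PP / 4   # assumption: stand-in for a machine ~4x slower
CONTEXT = 100_000            # hypothetical agent context size, not from the runs

for name, rate in [("spark", SPARK_PP), ("baseline", BASELINE_PP)]:
    print(f"{name}: {prefill_seconds(CONTEXT, rate):.1f} s to ingest {CONTEXT} tokens")
```

Under these assumptions the gap is roughly half a minute versus about two minutes each time an agent has to re-read a large context, which is why prefill speed adds up over a long run.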

Banana

@banana_baeee

My DFlash decode-optimized numbers for 3.6 are here. They are quite variable, but the optimization can make a big difference. I am hoping to combine the decode and prefill optimizations into one fast 27B dense solution and get the best of both!

localmaxxing.com/runs/cmomgvsoo…

Banana

@banana_baeee

My reproduction repositories are here if you want to try this yourself!

(Though I hope that a lot of these sorts of optimizations ultimately become vLLM defaults.)

github.com/my-other-githu…

Banana

@banana_baeee

There is still lots of room for optimization here. I’m still using generic CUTLASS NVFP4 kernels instead of something GB10-optimized, and I crudely hacked FlashQLA in, so I’m positive there’s headroom once someone smart lands better, official vLLM support for GB10.

GB10 has a lot of potential if the software can catch up!

Generated by Thread Navigator
