Banana
@banana_baeee
Finally broke the 3k tokens-per-second input/prompt processing barrier for Qwen 3.5 27B on Spark/GB10, thanks to FlashQLA!

Results and steps to reproduce are up on @LottoLabs LocalMaxxing here: localmaxxing.com/runs/cmouqgx9q…

3130 t/s at pp2048 is close to 4x faster than the fastest M5 Max number I could find on Reddit.

For long-running agents, input token processing can be at least as important as output token processing, and Spark shines for that!
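
To put the prefill number in perspective, here is a rough back-of-the-envelope sketch. It assumes throughput stays constant at the reported pp2048 rate, which will probably not hold exactly at longer contexts, and the 32768-token prompt is a hypothetical agent context, not something measured in the run.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate the time to process a prompt at a constant prefill rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


PP_RATE = 3130.0  # t/s, the pp2048 figure reported in this thread

# 2048 is the benchmark prompt length; 32768 is a hypothetical agent context.
for tokens in (2048, 32768):
    print(f"{tokens} tokens -> {prefill_seconds(tokens, PP_RATE):.2f} s")
```

An agent that re-reads a large context on every step pays this cost each time, which is why prefill speed matters so much for that workload.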
Banana
@banana_baeee
My DFlash decode-optimized numbers for 3.6 are here. They are quite variable, but the optimization can make a big difference. I am hoping to combine the decode and prefill optimizations into one fast 27B dense solution and get the best of both!

localmaxxing.com/runs/cmomgvsoo…
Banana
@banana_baeee
My reproduction repositories are here if you want to try this yourself!

(Though I hope that ultimately a lot of these sorts of optimizations become vLLM defaults in the future)

github.com/my-other-githu…
Banana
@banana_baeee
Still lots of room for optimization here. I'm still using generic CUTLASS NVFP4 kernels instead of something GB10-optimized, and I crudely hacked FlashQLA in, so I'm positive there's headroom once someone smart gets better, official vLLM support for GB10 in there.

GB10 has a lot of potential if the software can catch up!