@banana_baeee: Finally broke the 3k token per...

10 views May 07, 2026

Finally broke the 3k tokens per second input/prompt processing barrier for Qwen 3.5 27B on Spark/GB10 thanks to FlashQLA!

Results and steps to reproduce up on @LottoLabs LocalMaxxing here: localmaxxing.com/runs/cmouqgx9q…

3130 t/s at pp2048 is close to 4x faster than the fastest M5 Max number I could find on Reddit.

For long-running agents, input token processing can be at least as important as output token processing, and Spark shines for that!
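A rough sketch of what that prefill rate means for an agent that keeps re-reading a growing context. It assumes the 3130 t/s pp2048 figure holds at other prompt lengths, which it may not, and the context sizes and the `prefill_seconds` helper are purely illustrative:

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to process a prompt, assuming a constant prefill rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


# pp2048 figure from this post; treating it as constant is an assumption.
PP_RATE = 3130.0

# Hypothetical context sizes an agent might feed in on each turn.
for context in (2048, 8192, 32768):
    print(f"{context:>6} tokens -> {prefill_seconds(context, PP_RATE):.2f} s")
```

Every turn that resends the context pays this cost before the first output token appears, which is why prefill speed adds up quickly over a long agent run.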

My DFlash decode-optimized numbers for 3.6 are here. They are quite variable, but DFlash can make a big difference. I am hoping to combine the decode and prefill optimizations into one fast 27B dense solution and get the best of both!

localmaxxing.com/runs/cmomgvsoo…

My reproduction repositories are here if you want to try this yourself!

(Though I hope that ultimately a lot of these sorts of optimizations become vLLM defaults in the future)

github.com/my-other-githu…

Still lots of room for optimization here. I’m still using generic CUTLASS NVFP4 kernels instead of something GB10-optimized, and I crudely hacked FlashQLA in, so I’m positive there’s headroom once someone smart gets better, official vLLM support for GB10 in there.

GB10 has a lot of potential if the software can catch up!
