@banana_baeee: Finally broke the 3k token per...

@banana_baeee
10 views May 07, 2026
1
Finally broke the 3k tokens-per-second input/prompt processing barrier for Qwen 3.5 27B on Spark/GB10, thanks to FlashQLA!

Results and steps to reproduce up on @LottoLabs LocalMaxxing here: localmaxxing.com/runs/cmouqgx9q…

3130 t/s at pp2048 is close to 4x the fastest M5 Max number I could find on Reddit.

For long-running agents, input token processing can be at least as important as output token processing, and Spark shines for that!
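
To put rough numbers on that, here is a back-of-the-envelope sketch of how one agent turn splits between prefill and decode time. Only the 3130 t/s prefill rate comes from this thread. The decode rate, prompt size, and output size are hypothetical placeholders, and the sketch assumes the pp2048 rate holds at longer contexts, which it may not.

```python
def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) for one agent turn."""
    return prompt_tokens / prefill_tps, output_tokens / decode_tps


PREFILL_TPS = 3130.0     # pp2048 figure quoted in this thread
DECODE_TPS = 10.0        # hypothetical decode rate, not a measured number
PROMPT_TOKENS = 30_000   # hypothetical: tool output and history fed back in
OUTPUT_TOKENS = 300      # hypothetical: a short tool call or reply

prefill_s, decode_s = turn_seconds(PROMPT_TOKENS, OUTPUT_TOKENS,
                                   PREFILL_TPS, DECODE_TPS)
# Same turn with prefill at a quarter of the rate, for comparison.
slow_prefill_s, _ = turn_seconds(PROMPT_TOKENS, OUTPUT_TOKENS,
                                 PREFILL_TPS / 4, DECODE_TPS)

print(f"prefill: {prefill_s:.1f}s  decode: {decode_s:.1f}s")
print(f"prefill at 1/4 rate: {slow_prefill_s:.1f}s")
```

Swap in your own measured rates and token counts. The point is only that when prompts are much larger than outputs, prefill speed can make up a large share of each turn.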
2
My DFlash decode-optimized numbers for 3.6 are here. They're quite variable, but the optimization can make a big difference. I am hoping to combine the decode and prefill optimizations into one fast 27B dense solution and get the best of both!

localmaxxing.com/runs/cmomgvsoo…
3
My reproduction repositories are here if you want to try this yourself!

(Though I hope that ultimately a lot of these sorts of optimizations become vLLM defaults in the future)

github.com/my-other-githu…
4
Still lots of room for optimization here. I’m still using generic CUTLASS NVFP4 kernels instead of something GB10-optimized, and I crudely hacked FlashQLA in, so I’m positive there’s headroom once someone smart gets better, official vLLM support for GB10 in there.

GB10 has a lot of potential if the software can catch up!