@above_spec: "You need a 24 GB GPU for seri...
@above_spec
7 views
May 02, 2026
Advertisement
2
How to make it work:
MoE offload. Qwen3.6-35B activates only 3 B params per token. Keep attention + shared weights on GPU, push the cold expert FFNs to system RAM. In llama.cpp: -ngl 99 -ncmoe 99.
q8_0 KV cache. ~10 KB/token. 200k fits in 2 GB VRAM with FlashAttention on.
MoE offload. Qwen3.6-35B activates only 3 B params per token. Keep attention + shared weights on GPU, push the cold expert FFNs to system RAM. In llama.cpp: -ngl 99 -ncmoe 99.
q8_0 KV cache. ~10 KB/token. 200k fits in 2 GB VRAM with FlashAttention on.
4
Why RTX 3070 8GB owners shouldn't write the card off:
The real bottleneck is host-RAM bandwidth for the MoE experts (~3 B active × Q4 ≈ 1.5 GB/token of streaming reads from DDR5), not GPU compute. The 3070 actually has higher memory bandwidth than the 4060 Ti (448 vs 288 GB/s).
The real bottleneck is host-RAM bandwidth for the MoE experts (~3 B active × Q4 ≈ 1.5 GB/token of streaming reads from DDR5), not GPU compute. The 3070 actually has higher memory bandwidth than the 4060 Ti (448 vs 288 GB/s).
5
If you have 64 GB RAM and a half-decent CPU, MoE + 8GB is arguably the new home-LLM sweet spot.
What setup are you running?
What setup are you running?


