@leftcurvedev_: Anyone with 8GB or 12GB VRAM s...
@leftcurvedev_
May 08, 2026
1
Anyone with an 8GB or 12GB VRAM setup needs to understand that "-ncmoe" is the key flag to boost performance on llama.cpp.
Here are my results for Qwen3.6 35B A3B, with 64k q8_0 context on an 8GB RTX 3070Ti:
⚪️ no flag → 8.7 tok/s
RAM: 13.6GB & VRAM: 7.8GB
🔴 -ncmoe 35 → 27.5 tok/s
RAM: 12.1GB & VRAM: 4.3GB
🟢 -ncmoe 30 → 32.5 tok/s
RAM: 12GB & VRAM: 5.6GB
🔵 -ncmoe 25 → 40.9 tok/s
RAM: 12GB & VRAM: 6.9GB
Note that the RAM and VRAM figures are total usage for the whole Windows PC with the model running. My friend's setup: 8GB VRAM and 16GB RAM. You may get more performance by switching to Linux, just something to keep in mind.
Basically, this flag keeps the MoE expert weights of the first X layers on your CPU + RAM instead of eating all your VRAM straight away. It's a smart hybrid offload that lets you run bigger models without OOM while keeping everything else on your GPU for speed.
As the data shows, there's a sweet spot. Lowering it from 35 to 25 bumps speed by about 50% (27.5 to 40.9 tok/s) because the experts of more layers sit on your GPU (look at the VRAM usage). The key is to play around with the number and fit as much as possible in VRAM; aim for roughly 800MB to 1GB of headroom so you don't run out.
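To put numbers on the sweet spot, here's a quick sketch that takes the results above and prints the speedup over the no-flag run plus the VRAM left on the card. Assumptions: the card's full 8GB is the budget (`total_vram=8`), and since the VRAM figures are whole-system usage, the headroom is only approximate. The `summarize` function name is made up for this example.

```shell
#!/bin/sh
# Results copied from the thread: -ncmoe value, tok/s, total VRAM in use (GB).
# "none" is the run without the flag and must stay on the first line.
results='none 8.7 7.8
35 27.5 4.3
30 32.5 5.6
25 40.9 6.9'

# summarize (hypothetical helper): speedup vs. the no-flag run and
# approximate headroom, assuming an 8 GB card.
summarize() {
  printf '%s\n' "$results" | awk -v total_vram=8 '
    NR == 1 { base = $2 }   # first row (no flag) is the baseline speed
    {
      printf "%-5s %5.1f tok/s  x%.2f vs no flag  %.1f GB VRAM headroom\n",
             $1, $2, $2 / base, total_vram - $3
    }'
}

summarize
```

The `-ncmoe 25` row comes out at about 4.7x the no-flag speed with roughly 1.1GB to spare, which lines up with the headroom target.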
Server flags below.
2
llama.cpp built from source
CUDA drivers 13.0
UD-IQ3_XXS GGUF from Unsloth
server command with flags:
/llama.cpp/build/bin/llama-server \
-m Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
-ngl 99 \
-np 1 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--ctx-size 65536 \
--host 0.0.0.0 \
-ncmoe 25
huggingface.co/unsloth/Qwen3.…
3
Btw, I left host at 0.0.0.0, but don't do that, boys. Bind it locally or use your Tailscale IP directly.
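A minimal sketch of the safer variant: the same server command from the thread with only `--host` changed. `127.0.0.1` keeps the server reachable from the local machine only. To reach it over Tailscale, put that machine's Tailscale IP there instead (not shown, since it's specific to your setup).

```shell
# Same flags as the thread's command; only --host differs.
# 127.0.0.1 = loopback, so nothing else on the network can connect.
/llama.cpp/build/bin/llama-server \
  -m Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf \
  -ngl 99 \
  -np 1 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --ctx-size 65536 \
  --host 127.0.0.1 \
  -ncmoe 25
```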
4
More testing