@andthatto: Qwen 3.6 is frontier for local...

@andthatto
4 views Apr 29, 2026
1
Qwen 3.6 is frontier for local.

It also thinks forever.

I tried a dumb inference-time trick: make its <think> block obey a tiny grammar.

Result:
- HumanEval+: 22x fewer think tokens, no accuracy loss
- LiveCodeBench public slice: +14 points pass@1 (50% → 64%), ~5x fewer total tokens
2
No finetuning.
Just GBNF-constrained decoding.

The constraint is applied only to the reasoning block, not the final answer/code.
3
On HumanEval+ with Qwen3.6-35B-A3B:

Free-form thinking:

92.1% pass@1
3087 mean think tokens

Grammar:

92.7% pass@1
138 mean think tokens

Same accuracy band.
~22x fewer thinking tokens.
4
Then I tried a recent LiveCodeBench v6 LeetCode slice.

Free-form: 50% pass@1 and 11553 mean think tokens
Grammar: 64% pass@1 and 267 mean think tokens
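A quick check of the arithmetic behind these numbers, using only the figures quoted above. The variable names are mine, and pass@1 differences are reported in percentage points.

```python
# Figures copied from the thread: (pass@1 in %, mean think tokens).
results = {
    "HumanEval+": {"free": (92.1, 3087), "grammar": (92.7, 138)},
    "LiveCodeBench v6 slice": {"free": (50.0, 11553), "grammar": (64.0, 267)},
}


def summarize(free: tuple[float, int], grammar: tuple[float, int]) -> dict[str, float]:
    """Compare a free-form run with a grammar-constrained run."""
    free_acc, free_tokens = free
    gram_acc, gram_tokens = grammar
    return {
        # How many times fewer think tokens the constrained run used.
        "think_token_reduction": free_tokens / gram_tokens,
        # Absolute change in pass@1, in percentage points.
        "pass_at_1_delta_points": gram_acc - free_acc,
    }


for name, runs in results.items():
    s = summarize(runs["free"], runs["grammar"])
    print(
        f"{name}: {s['think_token_reduction']:.1f}x fewer think tokens, "
        f"{s['pass_at_1_delta_points']:+.1f} points pass@1"
    )
```

On these figures the think-token reduction is about 22x on HumanEval+ and about 43x on the LiveCodeBench slice.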
5
This is not “reasoning disappeared.”

On harder tasks, some reasoning moved into comments / post-think answer text.

Results are also sensitive to how the grammar is constructed.
I suspect task-specific grammars could be discovered through @DSPyOSS-style prompt optimization.
6
My insight is that a lot of verbose CoT is scaffolding, not essential computation.

Constrained decoding can force a denser interface to the model’s latent reasoning.

But if the task really needs more deliberation, it leaks somewhere else.
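One rough way to see where the reasoning leaks is to count words in the think block versus words in code comments after it. This is a hypothetical sketch, not the measurement used in the writeup: it assumes `<think>...</think>` tags and Python-style `#` comments, and it will miscount `#` characters inside string literals.

```python
def leakage_profile(output: str) -> dict[str, int]:
    """Count words in the think block, in '#' comments, and in code after it."""
    think, sep, answer = output.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole output as answer text.
        think, answer = "", output
    think = think.replace("<think>", "")

    comment_words = 0
    code_words = 0
    for line in answer.splitlines():
        # Naive split: everything after the first '#' counts as a comment.
        code, _, comment = line.partition("#")
        code_words += len(code.split())
        comment_words += len(comment.split())

    return {
        "think_words": len(think.split()),
        "comment_words": comment_words,
        "code_words": code_words,
    }


example = (
    "<think>\nPLAN: two pointers\n</think>\n"
    "# sort first so duplicates are adjacent\n"
    "xs.sort()  # in place\n"
)
print(leakage_profile(example))
```

If `comment_words` grows as `think_words` shrinks across grammars, that would be consistent with deliberation moving into the answer rather than disappearing.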
7
I think this is a useful middle ground between:

- verbose CoT at inference
- training models to reason in latent space

Just constrain the text interface.

Full writeup + results:

andthattoo.dev/blog/structure…

and repo: github.com/andthattoo/str…