@andthatto: Qwen 3.6 is frontier for local...

1

Qwen 3.6 is frontier for local.

It also thinks forever.

I tried a dumb inference-time trick: make its <think> block obey a tiny grammar.

Result:
- HumanEval+: 22x fewer think tokens, no accuracy loss
- LiveCodeBench public slice: +14 points pass@1 (50% to 64%), ~5x fewer total tokens

2

No finetuning.
Just GBNF-constrained decoding.

The constraint is applied only to the reasoning block, not the final answer/code.
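To make the idea concrete, here is a minimal sketch of what a grammar of this kind could look like. The rule names, the bullet format, and the 6-line / 80-character limits are all assumptions for illustration, not the grammar from the writeup. The regex is only an offline mirror of the same shape, handy for checking outputs without loading a model.

```python
import re

# Hypothetical GBNF grammar (llama.cpp syntax); NOT the author's actual grammar.
# Only the <think> block is constrained: 1 to 6 short bullet lines.
# Everything after </think> (the answer/code) is left free-form.
THINK_GBNF = r'''
root   ::= think answer
think  ::= "<think>\n" step{1,6} "</think>\n"
step   ::= "- " [^\n]{1,80} "\n"
answer ::= ( [^\n] | "\n" )*
'''

# Regex mirror of the same shape, used here only to sanity-check text offline.
THINK_RE = re.compile(
    r"<think>\n(?:- [^\n]{1,80}\n){1,6}</think>\n.*",
    re.DOTALL,
)

def obeys_grammar(text: str) -> bool:
    """Return True if text is a constrained think block followed by any answer."""
    return THINK_RE.fullmatch(text) is not None

sample = (
    "<think>\n"
    "- goal: sum even numbers\n"
    "- plan: filter then sum\n"
    "</think>\n"
    "def f(xs):\n"
    "    return sum(x for x in xs if x % 2 == 0)\n"
)
print(obeys_grammar(sample))
```

With llama-cpp-python, a string like THINK_GBNF would typically be loaded via LlamaGrammar.from_string and passed as the grammar argument at generation time. Whether the opening <think> tag belongs in the grammar or in the prompt depends on the chat template, so treat that part as an assumption too.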

3

On HumanEval+ with Qwen3.6-35B-A3B:

Free-form thinking:

92.1% pass@1
3087 mean think tokens

Grammar:

92.7% pass@1
138 mean think tokens

Same accuracy band.
~22x fewer thinking tokens.

4

Then I tried a recent LiveCodeBench v6 LeetCode slice.

Free-form: 50% pass@1 and 11553 mean think tokens
Grammar: 64% pass@1 and 267 mean think tokens
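A quick arithmetic check on the numbers above, using only the figures quoted in this thread. It separates the think-token reduction factor from the pass@1 change, and shows the pass@1 change both in percentage points and as a relative change, since the two are easy to mix up.

```python
# Figures copied from the thread: (pass@1 in %, mean think tokens).
results = {
    "HumanEval+": {"free": (92.1, 3087), "grammar": (92.7, 138)},
    "LiveCodeBench v6 slice": {"free": (50.0, 11553), "grammar": (64.0, 267)},
}

def compare(free, grammar):
    """Return (think-token reduction factor, pass@1 change in points, relative change in %)."""
    (acc_f, tok_f), (acc_g, tok_g) = free, grammar
    return tok_f / tok_g, acc_g - acc_f, 100 * (acc_g - acc_f) / acc_f

for name, r in results.items():
    factor, points, rel = compare(r["free"], r["grammar"])
    print(f"{name}: {factor:.1f}x fewer think tokens, "
          f"{points:+.1f} points ({rel:+.1f}% relative)")
```

Note that this covers think tokens only. The ~5x figure for total tokens cannot be recomputed from the numbers in the thread, because answer-token counts are not given here.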

5

This is not “reasoning disappeared.”

On harder tasks, some reasoning moved into comments / post-think answer text.

The results are sensitive to how the grammar is constructed.
I suspect task-specific grammars could be discovered through @DSPyOSS-style prompt optimization.

6

My insight is that a lot of verbose CoT is scaffolding, not essential computation.

Constrained decoding can force a denser interface to the model’s latent reasoning.

But if the task really needs more deliberation, it leaks somewhere else.
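One rough way to look for that leak, assuming the post-think answer is tokenizable Python source: measure what share of the answer sits inside comments, then compare free-form runs against grammar runs. The function name and the metric itself are hypothetical, not taken from the repo.

```python
import io
import tokenize

def comment_share(code: str) -> float:
    """Fraction of non-whitespace characters in `code` that sit inside # comments."""
    comment_chars = 0
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.COMMENT:
            # Count comment text without whitespace, matching the denominator.
            comment_chars += len("".join(tok.string.split()))
    total = len("".join(code.split()))
    return comment_chars / total if total else 0.0

answer = (
    "# add two numbers\n"
    "def add(a, b):\n"
    "    return a + b  # sum\n"
)
print(f"{comment_share(answer):.2f}")
```

A higher share under the grammar than under free-form thinking would be consistent with reasoning moving into comments. It would miss reasoning that lands in docstrings or in prose around the code, so it is a lower bound at best.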

7

I think this is a useful middle ground between:

- verbose CoT at inference
- training models to reason in latent space

Just constrain the text interface.

Full writeup + results:

andthattoo.dev/blog/structure…

and repo: github.com/andthattoo/str…
