andthattoo
@andthatto
Qwen 3.6 is frontier for local.

It also thinks forever.

I tried a dumb inference-time trick: make its <think> block obey a tiny grammar.

Result:
- HumanEval+: 22x fewer think tokens, no accuracy loss
- LiveCodeBench public slice: +14 points pass@1 (50% → 64%), ~5x fewer total tokens
andthattoo
@andthatto
No finetuning.
Just GBNF-constrained decoding.

The constraint is applied only to the reasoning block, not the final answer/code.
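To make the idea concrete, here is a minimal sketch of what such a grammar could look like. It is not the grammar from the writeup: the field names (GOAL, APPROACH, EDGE), the 80-character line cap, and the helper name `build_think_grammar` are all hypothetical. The think block gets a fixed shape; the answer rule accepts any text, so the final answer/code stays unconstrained.

```python
# Hypothetical field names; the thread does not show the real grammar.
FIELDS = ("GOAL", "APPROACH", "EDGE")


def build_think_grammar(fields=FIELDS, max_chars=80):
    """Return GBNF text: a fixed-shape <think> block, then a free-form answer."""
    # One short labeled line per field, e.g. GOAL: <1..max_chars chars>\n
    field_rules = " ".join(f'"{name}: " line' for name in fields)
    rules = [
        "root ::= think answer",
        # Raw strings keep \n as a two-character GBNF escape, not a real newline.
        r'think ::= "<think>\n" ' + field_rules + r' "</think>\n"',
        # Bounded repetition caps each reasoning line (assumed line budget).
        r'line ::= [^\n]{1,' + str(max_chars) + r'} "\n"',
        # Anything goes after the think block: the answer is not constrained.
        r'answer ::= ( [^\n] | "\n" )*',
    ]
    return "\n".join(rules) + "\n"


if __name__ == "__main__":
    print(build_think_grammar())
```

The resulting string can be handed to a GBNF-aware runtime, for example via `LlamaGrammar.from_string(...)` in llama-cpp-python. Two caveats under these assumptions: the `{1,80}` bounded repetition needs a reasonably recent llama.cpp build, and if the chat template already opens the `<think>` tag in the prompt, the literal opening tag should be dropped from the `think` rule.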
andthattoo
@andthatto
On HumanEval+ with Qwen3.6-35B-A3B:

Free-form thinking:

92.1% pass@1
3087 mean think tokens

Grammar:

92.7% pass@1
138 mean think tokens

Same accuracy band.
~22x fewer thinking tokens.
andthattoo
@andthatto
Then I tried a recent LiveCodeBench v6 LeetCode slice.

Free-form: 50% pass@1, 11553 mean think tokens
Grammar: 64% pass@1, 267 mean think tokens
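For reference, the reduction factors follow directly from the numbers quoted in the thread. A small sketch (the helper name `reduction` is just for illustration):

```python
def reduction(free_form_tokens, grammar_tokens):
    """How many times fewer think tokens the grammar run used."""
    return free_form_tokens / grammar_tokens


humaneval = reduction(3087, 138)     # HumanEval+ mean think tokens
livecodebench = reduction(11553, 267)  # LiveCodeBench v6 slice mean think tokens

print(f"HumanEval+:    {humaneval:.1f}x fewer think tokens")      # 22.4x
print(f"LiveCodeBench: {livecodebench:.1f}x fewer think tokens")  # 43.3x
print(f"pass@1 change: +{64 - 50} points")                        # +14 points
```

On the LiveCodeBench slice the think-token reduction (about 43x) is much larger than the ~5x total-token reduction quoted at the top of the thread, which would be consistent with part of the output budget moving out of the think block and into the answer.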
andthattoo
@andthatto
This is not “reasoning disappeared.”

On harder tasks, some reasoning moved into comments / post-think answer text.

The effect is also sensitive to how the grammar is constructed.
I suspect task-specific grammars could be discovered through @DSPyOSS-style prompt optimization.
andthattoo
@andthatto
My insight is that a lot of verbose CoT is scaffolding, not essential computation.

Constrained decoding can force a denser interface to the model’s latent reasoning.

But if the task really needs more deliberation, it leaks somewhere else.
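One rough way to check for that kind of leak is to split each completion at the end of the think block and see how much of the answer is comments. This is a sketch under loud assumptions: the function names are hypothetical, whitespace-separated words stand in for real tokenizer counts, and only Python-style `#` comments are detected.

```python
def split_completion(text, close_tag="</think>"):
    """Split a completion into (think, answer) at the first closing tag."""
    think, sep, answer = text.partition(close_tag)
    if not sep:
        # No think block found: treat the whole completion as answer text.
        return "", text.strip()
    return think.replace("<think>", "", 1).strip(), answer.strip()


def leak_report(text):
    """Word counts per part, plus the share of answer lines that are comments."""
    think, answer = split_completion(text)
    lines = [ln.strip() for ln in answer.splitlines() if ln.strip()]
    comment_lines = [ln for ln in lines if ln.startswith("#")]
    return {
        "think_words": len(think.split()),    # proxy for think tokens
        "answer_words": len(answer.split()),  # proxy for post-think tokens
        "comment_lines": len(comment_lines),
        "comment_share": len(comment_lines) / len(lines) if lines else 0.0,
    }


if __name__ == "__main__":
    sample = (
        "<think>\nGOAL: sum evens\n</think>\n"
        "# filter evens first\n# then add them up\n"
        "def f(xs):\n    return sum(x for x in xs if x % 2 == 0)\n"
    )
    print(leak_report(sample))
```

Comparing `comment_share` and `answer_words` between free-form and grammar-constrained runs on the same problems would give a first indication of how much deliberation moved out of the think block.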
andthattoo
@andthatto
I think this is a useful middle ground between:

verbose CoT at inference
training models to reason in latent space

Just constrain the text interface.

Full writeup + results:

andthattoo.dev/blog/structure…

and repo: github.com/andthattoo/str…