@TheTuringPost: Absolute Zero is a new paradig...
@TheTuringPost
May 13, 2025
Absolute Zero is a new paradigm from @Tsinghua_Uni that encourages models to learn without human-labeled data.
It's a self-play process, where the model is both a proposer and a solver.
- A model creates its own tasks to learn from.
- It solves these tasks on its own, using feedback from an environmental tool.
Based on this, researchers also built the Absolute Zero Reasoner (AZR) system.
This paradigm shows that you don't need thousands of outside data examples or human guidance to get SOTA results.
Details 🧵
1. Roles and rewards in Absolute Zero:
The model plays 2 roles:
- A proposer: It invents a new reasoning task.
- A solver: It tries to solve that task.
An environment tool checks if the task makes sense and provides the right answer. The model then tries to answer the task. If it does well, it gets rewarded.
There are 2 types of feedback:
- One for coming up with a good, learnable task.
- Another for solving it correctly.
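A minimal sketch of how the two rewards could fit together. The thread doesn't spell out a formula, so the learnability rule below is an assumption: a task the solver always or never gets right teaches nothing, so it scores zero. The function names `solve_reward` and `propose_reward` are hypothetical.

```python
from statistics import mean


def solve_reward(answer, reference):
    """Solver feedback: 1.0 if the answer matches the environment's reference."""
    return 1.0 if answer == reference else 0.0


def propose_reward(solve_rewards):
    """Proposer feedback (assumed rule, for illustration only).

    A task that every attempt solves (too easy) or no attempt solves
    (too hard) scores 0. Otherwise, the lower the solve rate, the higher
    the reward.
    """
    rate = mean(solve_rewards)
    if rate == 0.0 or rate == 1.0:
        return 0.0
    return 1.0 - rate


# Toy example: the environment says the right answer is 42,
# and the solver made four attempts at the same task.
attempts = [42, 41, 42, 7]
rewards = [solve_reward(a, 42) for a in attempts]
print(rewards, propose_reward(rewards))  # [1.0, 0.0, 1.0, 0.0] 0.5
```

Under this assumed rule the proposer is nudged toward tasks at the edge of what the solver can currently do.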
3. AZR uses code problems as its main learning tool.
- It creates and solves a set of coding tasks based on past tasks it already made and solved, and the type of reasoning it wants to practice (deduction, abduction, or induction).
- Python is used to check that the tasks are valid and then that the model's answers are correct.
- AZR uses 2 scores for training: for proposing good tasks and for solving them.
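A rough sketch of how Python can act as the checker, not AZR's actual code. It assumes each task is a small program defining `f(x)`, and that the three reasoning types map to: deduction = predict the output, abduction = find an input that gives a known output, induction = write a program that fits input/output examples. All helper names are hypothetical, and `exec` is used only for brevity; real use would need a sandbox.

```python
def run_program(program: str, arg):
    """Run a task program that defines f(x) and return f(arg)."""
    namespace = {}
    exec(program, namespace)  # illustration only: unsafe on untrusted code
    return namespace["f"](arg)


def validate_task(program: str, arg):
    """Assumed validity check: the program runs and gives the same result twice.

    Returns (is_valid, reference_output).
    """
    try:
        first = run_program(program, arg)
        second = run_program(program, arg)
    except Exception:
        return False, None
    return first == second, first


def check_deduction(program, arg, predicted_output):
    # Deduction: given the program and an input, predict the output.
    return run_program(program, arg) == predicted_output


def check_abduction(program, expected_output, proposed_input):
    # Abduction: given the program and an output, find an input that produces it.
    return run_program(program, proposed_input) == expected_output


def check_induction(candidate_program, examples):
    # Induction: given input/output examples, write a program that fits them all.
    try:
        return all(run_program(candidate_program, x) == y for x, y in examples)
    except Exception:
        return False


task = "def f(x):\n    return sorted(x)[-1] - sorted(x)[0]"
ok, reference = validate_task(task, [3, 9, 4])
print(ok, reference)  # True 6
```

Because the checker actually runs the code, the "right answer" comes from execution rather than from a human label.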
4. Even though AZR was trained without any human-written data, it has impressive results:
- It beat the best "zero-data" models by +1.8%
- AZR improved its math score by +15.2% vs. +0.65% of other top models
Also:
- Bigger models learn more
- Code helps general reasoning (even in math)
- Emergent planning: AZR starts writing step-by-step explanations as comments.