Visualize Thread by @TheTuringPost

✨ Visual Editor

palette Canvas & Background

Presets

Custom Colors

Gradient:arrow_forward

Text Color:

Gradient Angle135°

Background Pattern

Grain Texture

Aspect Ratio

style Card Style

Preset

Padding40px

Card Radius16px

Enable Card Shadow

Glassmorphism Effect

Show Watermark AGENCY

Show Timestamps

Show X Logo

text_fields Typography

Font Family

Font Size16px

Turing Post

@TheTuringPost

Good answers follow good reasoning

VeriFree is a new method that keeps the benefits of reinforcement learning (RL) but gets rid of a verifier model and rule-based checking.

It trains the model to get closer to a known good answer, called a reference answer.

Benefits:

• It's faster and simpler
• Requires less compute
• Is more stable

Here's how VeriFree works🧵

Turing Post

@TheTuringPost

1. Step-by-step VeriFree workflow:

• The model generates a reasoning trace.
• The final answer isn't checked directly.
• Instead, it's checked how likely the model is to generate the correct answer based on its reasoning.
• That likelihood becomes the reward. It's higher if the model seems confident in the correct answer.

Turing Post

@TheTuringPost

2. Smart tokenization:

Splitting the model’s response into <reasoning> and <answer> needs to be done carefully.

Instead of splitting at "<answer>", researchers decided stop at "<answer" without the closing bracket. This avoids token mismatches and keeps training stable.

Turing Post

@TheTuringPost

3. VeriFree's working process also allows to train the model using just one correct answer per question, without needing to generate and verify each possible answer during training.

Turing Post

@TheTuringPost

4. Why is VeriFree more stable?

It skips sampling the final answer — the system just calculates the chance the model would say the right thing. This removes randomness and makes the learning process more efficient.

Turing Post

@TheTuringPost

VeriFree reinforces answers that follow good reasoning. If the reasoning is off, the training signal gets weaker. This helps the model learn to reason better, not just guess answers.

Paper: arxiv.org/abs/2505.21493
Code: github.com/sail-sg/VeriFr…

Generated by Thread Navigator

100%

view_carousel Carousel Studio NEW

Press ⌘ + S to quick-export