@rohanpaul_ai: 🛠️ @AnthropicAI researchers ju...

@rohanpaul_ai
61 views Jun 15, 2025
🛠️ @AnthropicAI researchers just found a way for language models to fine-tune themselves.

In short, the model fine-tunes itself by rating its own answers.

Self-grading removes the human-labeling bottleneck.

Internal Coherence Maximization (ICM) lets the model generate its own labels and then train on them, so no human labels are needed.

🧩 The Core Concepts

Pretrained language models already store rich notions of truth, correctness, and preference; unsupervised elicitation tries to surface those notions without outside help.

ICM defines a score that combines how predictable each label is from the others and whether all labels obey simple logic.

Fine-tuning on labels that maximize this score steers the model toward the buried concept it already understands.

🔗 Mutual Predictability

For each example, the model predicts that example's label while conditioning on all the other labeled examples; the sum of these log probabilities is the mutual predictability term.

A coherent label set drives these probabilities up because each label completes the same underlying idea.
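
A minimal sketch of the leave-one-out sum described above, not the paper's implementation. The `log_prob` callable is a hypothetical stand-in for a language-model call, and `toy_log_prob` is an invented toy scorer (a smoothed vote among context examples with the same prefix) used only so the example runs without a model.

```python
import math
from typing import Callable, Sequence, Tuple

Example = Tuple[str, int]  # (input text, tentative label)
# Hypothetical scorer: log P(label | input, other labeled examples).
LogProbFn = Callable[[str, int, Sequence[Example]], float]


def mutual_predictability(data: Sequence[Example], log_prob: LogProbFn) -> float:
    """Sum over examples of the log probability of each label given all the others."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        context = [ex for j, ex in enumerate(data) if j != i]  # hide example i
        total += log_prob(x, y, context)
    return total


def toy_log_prob(x: str, y: int, context: Sequence[Example]) -> float:
    """Toy stand-in for a model: Laplace-smoothed vote among same-prefix examples."""
    topic = x.split(":")[0]
    same_topic = [cy for cx, cy in context if cx.split(":")[0] == topic]
    agree = sum(1 for cy in same_topic if cy == y)
    return math.log((agree + 1) / (len(same_topic) + 2))


coherent = [("even:2", 1), ("even:4", 1), ("odd:3", 0), ("odd:5", 0)]
scrambled = [("even:2", 1), ("even:4", 0), ("odd:3", 0), ("odd:5", 1)]

print(mutual_predictability(coherent, toy_log_prob))
print(mutual_predictability(scrambled, toy_log_prob))
```

Under this toy scorer the coherent labeling gets the higher sum, which is the behavior the term is meant to reward.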

⚖️ Logical Consistency

Assigning every example the same label would also boost predictability, so ICM subtracts a penalty whenever two labels break an obvious rule, for example tagging both “A beats B” and “B beats A” as true.

Even coarse rules like โ€œdifferent numeric answers cannot both be trueโ€ are enough to block degenerate solutions.
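
Here is a rough sketch of how such a penalty could be counted and combined with the predictability term. The claim encoding, the function names, and the weight `alpha` are assumptions made for illustration; the thread only says the penalty is subtracted.

```python
from itertools import combinations
from typing import Sequence, Tuple

# (winner, loser, label): the claim "winner beats loser" is tagged True or False.
Claim = Tuple[str, str, bool]


def count_inconsistencies(claims: Sequence[Claim]) -> int:
    """Count pairs where "A beats B" and "B beats A" are both tagged True."""
    conflicts = 0
    for (a1, b1, y1), (a2, b2, y2) in combinations(claims, 2):
        if y1 and y2 and a1 == b2 and b1 == a2:
            conflicts += 1
    return conflicts


def icm_style_score(mutual_pred: float, n_conflicts: int, alpha: float = 1.0) -> float:
    """Weighted predictability minus the inconsistency count (alpha is assumed)."""
    return alpha * mutual_pred - n_conflicts


all_true = [("A", "B", True), ("B", "A", True), ("A", "C", True)]
print(count_inconsistencies(all_true))  # the all-True labeling contains one clash
```

The all-True labeling is exactly the degenerate solution the penalty is there to block: it may be easy to predict, but it pays for every contradictory pair.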

🛠️ The ICM (Internal Coherence Maximization) Search Procedure

Exhaustive search over label assignments is infeasible, so ICM starts with K randomly labeled examples, proposes one new label at a time, repairs any logical clashes, and accepts changes that raise the score. Early in the search it also sometimes accepts worse moves to escape local optima.

In the accompanying figure, the upper panel shows how Internal Coherence Maximization measures whether a batch of labels hangs together.

Three claims already carry tentative true or false tags. The model hides one tag at a time, guesses it from the remaining two, and records the log-likelihood of guessing correctly. Adding these three numbers gives a single coherence score: the higher the sum, the more the labels fit the model’s own sense of consistency.

The lower panel shows the search loop that improves those tags. The system begins with a small arithmetic dataset that has provisional labels. It samples a fresh statement, invents several alternative labelings that obey basic arithmetic logic, and scores each candidate with the same coherence metric.

When a candidate scores better, or occasionally when it is only slightly worse early in the search, the system adopts the new labels and repeats the process. Over many iterations the dataset moves toward labels that the model can predict from one another, letting the model supervise itself without human input.

Algorithm 1 Internal Coherence Maximization (ICM)

The routine begins by picking a handful of unlabeled examples at random, assigning them rough labels, and then running a quick pass that flips any tags which clash with obvious logic so that the starting set is self-consistent.

Next it enters a loop that repeats a set number of times. At the start of each round it lowers a control variable called the temperature; high temperature at the beginning encourages exploration, and steady cooling later on encourages caution.

The algorithm then selects one still-unlabeled item and asks the pretrained model which label looks most plausible given everything it has labeled so far. It adds this tentative label to the working dataset and again repairs any logical conflicts that pop up.

It calculates how much the internal coherence score changes after this addition. If the score has improved, the new label is kept. If the score is worse, the label might still be kept, but only with a probability that shrinks as the temperature drops.

By repeating these steps the method slowly grows a dataset whose labels are mutually predictable and logically consistent, letting the model teach itself without human grading.
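
The loop above is a simulated-annealing search, and a toy version fits in a few lines. This is a sketch under stated assumptions rather than the paper's code: the `score` and `propose` callables are hypothetical stand-ins for model calls, the cooling constants are illustrative, items are sampled from the whole pool (so labeled ones can be revisited), and the conflict-repair step is left out.

```python
import math
import random
from typing import Callable, Dict, List

Labels = Dict[str, int]                   # example -> tentative label
ScoreFn = Callable[[Labels], float]       # hypothetical coherence score
ProposeFn = Callable[[str, Labels], int]  # hypothetical: model's most plausible label


def temperature(step: int, t0: float = 10.0, t_min: float = 0.01, beta: float = 0.99) -> float:
    """Logarithmic cooling schedule; the constants are illustrative assumptions."""
    return max(t_min, t0 / (1.0 + beta * math.log(step)))


def accept(delta: float, temp: float, rng: random.Random) -> bool:
    """Keep improvements; keep a worse move with probability exp(delta / temp)."""
    return delta > 0 or rng.random() < math.exp(delta / temp)


def icm_search(pool: List[str], score: ScoreFn, propose: ProposeFn,
               k: int = 2, n_steps: int = 50, seed: int = 0) -> Labels:
    rng = random.Random(seed)
    # Start from k randomly chosen examples with random binary labels.
    labels = {x: rng.randint(0, 1) for x in rng.sample(pool, k)}
    for step in range(1, n_steps + 1):
        temp = temperature(step)
        x = rng.choice(pool)  # simplification: may revisit a labeled item
        candidate = dict(labels)
        candidate[x] = propose(x, labels)
        # A fuller version would repair logical conflicts in `candidate` here.
        delta = score(candidate) - score(labels)
        if accept(delta, temp, rng):
            labels = candidate
    return labels


# Toy usage: a lookup table stands in for the model's latent knowledge.
truth = {"1+1=2": 1, "1+1=3": 0, "2+2=4": 1, "2+2=5": 0}
toy_score = lambda labels: float(sum(labels[x] == truth[x] for x in labels))
toy_propose = lambda x, labels: truth[x]
final = icm_search(list(truth), toy_score, toy_propose)
print(final)
```

Dividing the score change by the temperature is what makes early rounds exploratory: a small drop is likely to be accepted while the temperature is high and almost never once it has cooled.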

Fix Inconsistencies.

The paper adds a clean-up step because chasing a high coherence score alone can leave many labels that break simple logic.

ConsistencyFix scans the current dataset for any pair of items that now clash with each other, such as one claim saying a statement is true while another says the opposite.

It picks one conflicting pair, lists every way their two tags could be made consistent, and asks the model which option would give the best overall coherence score.

If this new combination lifts the score, the algorithm keeps the change; otherwise it leaves the tags untouched.

The process repeats for a fixed number of tries or until no contradictions remain, so each new label can trigger repairs to old mistakes instead of being thrown away.
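
The repair step can be sketched as follows. This is a toy illustration, not the paper's routine: `find_conflicts` and `score` are hypothetical callables, labels are binary, and the example score (invented per-claim priors minus a large clash penalty) exists only to make the code runnable.

```python
import math
from itertools import product
from typing import Callable, Dict, List, Tuple

Labels = Dict[str, int]
ConflictFn = Callable[[Labels], List[Tuple[str, str]]]  # hypothetical: clashing pairs
ScoreFn = Callable[[Labels], float]                     # hypothetical coherence score


def consistency_fix(labels: Labels, find_conflicts: ConflictFn,
                    score: ScoreFn, max_tries: int = 10) -> Labels:
    labels = dict(labels)
    for _ in range(max_tries):
        conflicts = find_conflicts(labels)
        if not conflicts:
            break  # no contradictions remain
        a, b = conflicts[0]
        best, best_score = labels, score(labels)
        # Try every way of tagging the pair, skipping options that still clash.
        for ya, yb in product((0, 1), repeat=2):
            candidate = {**labels, a: ya, b: yb}
            if (a, b) in find_conflicts(candidate):
                continue
            s = score(candidate)
            if s > best_score:  # keep a change only if it lifts the score
                best, best_score = candidate, s
        labels = best
    return labels


# Toy usage: two opposite claims are both tagged true.
prior = {"A>B": 0.9, "B>A": 0.2}  # invented stand-in for model beliefs


def toy_conflicts(labels: Labels) -> List[Tuple[str, str]]:
    return [("A>B", "B>A")] if labels["A>B"] == 1 and labels["B>A"] == 1 else []


def toy_score(labels: Labels) -> float:
    fit = sum(math.log(prior[x] if y == 1 else 1 - prior[x]) for x, y in labels.items())
    return fit - 10 * len(toy_conflicts(labels))


fixed = consistency_fix({"A>B": 1, "B>A": 1}, toy_conflicts, toy_score)
print(fixed)
```

In the toy run the clash is resolved by keeping the claim the stand-in prior favors and flipping the other, rather than discarding either example.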

📊 Experimental Findings

With Llama 3.1 models, ICM matches golden-label fine-tuning on TruthfulQA and GSM8K-verification and beats crowdsourced Alpaca supervision.

On an author-gender task where models are superhuman, ICM hits 80 % accuracy while humans stay near 60 %.

Labeling needs only about 2 to 4 forward passes per data point, keeping costs modest.

🚀 Frontier Model Application

The team trained a reward model for Claude 3.5 Haiku on 400 000 prompt pairs using only ICM labels; it scored 75 % on RewardBench versus 72.2 % for the human-labeled baseline.

Reinforcement learning with the unsupervised reward model produced an assistant that wins 60 % of head-to-head comparisons against the baseline trained with human supervision.

In the accompanying chart, the bars on the left compare two reward models. One is fine-tuned on human labels, the other on labels produced by Internal Coherence Maximization. On Alpaca data the unsupervised model reaches about 68 % accuracy, ten points higher than the human-labeled version. On a larger production set it still stays a few points ahead, around 75 % versus 72 %.

The bars on the right track how often a chatbot trained with each reward model wins direct face-offs when judged by a stronger reference system. The bot guided by the unsupervised reward model wins roughly 60 % of matches, while the bot trained with human supervision wins about 50 %.

Together these two plots show that labels generated by the model itself can train both the reward model and the final assistant at least as well as human annotation.

🔒 Limitations

ICM fails when the desired concept is absent from the base model: on a poem-quality task where quality was keyed solely to the word “sun”, its labels collapsed to random guessing.

Computing the score requires fitting many labeled examples into the context window, so tasks with very long inputs remain out of reach.

🧠 Practical Takeaways

Internal coherence is a powerful supervision signal already inside the network, and simple logic checks keep the search on track.

When the target skill is latent, the model can refine itself without a single human-written label.

The @AnthropicAI Paper

arxiv.org/pdf/2506.10139