@rohanpaul_ai: 🛠️ @AnthropicAI researchers ju...

Jun 15, 2025

🛠️ @AnthropicAI researchers have found a way to teach language models to fine-tune themselves.

So basically, models now fine-tune themselves by rating their own answers.

In other words, self-grading frees models from the human-labeling bottleneck.

Internal Coherence Maximization (ICM) lets the model generate and trust its own labels, so no human-written labels are needed.

🧩 The Core Concepts

Pretrained language models already store rich notions of truth, correctness, and preference; unsupervised elicitation tries to surface those notions without outside help.

ICM defines a score that combines how predictable each label is from the others and whether all labels obey simple logic.

Fine-tuning on labels that maximize this score steers the model toward the buried concept it already understands.
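One way to write that score down, as a sketch: the symbols below (the weight α, the label set D, and the inconsistency count I) are chosen here for illustration and may not match the paper's notation exactly.

```latex
U(D) = \alpha \cdot P_\theta(D) - I(D),
\qquad
P_\theta(D) = \sum_{i=1}^{N} \log P_\theta\bigl(y_i \mid x_i,\; D \setminus \{(x_i, y_i)\}\bigr)
```

Here D = {(x_i, y_i)} is the current set of labeled examples, P_θ(D) is the mutual predictability term, and I(D) counts the label pairs that break a logic rule.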

🔗 Mutual Predictability

For every example the model predicts its own label while seeing the remaining labeled examples, and the summed log probabilities form the mutual predictability term.

A coherent label set drives these probabilities up because each label completes the same underlying idea.
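A minimal sketch of the leave-one-out sum described above. The function `toy_log_prob` is a hypothetical stand-in for a real language model: it guesses a label from context examples that share the same first word, which is enough to show that a coherent labeling scores higher than a scrambled one.

```python
import math

def mutual_predictability(data, log_prob):
    """Sum of log P(label | example, all other labeled examples)."""
    total = 0.0
    for i, (x, y) in enumerate(data):
        context = data[:i] + data[i + 1:]  # hide the current example's label
        total += log_prob(x, y, context)
    return total

def toy_log_prob(x, y, context):
    # Hypothetical stand-in for a model call (an assumption, not the paper's
    # setup): look at context examples sharing x's first word and apply
    # add-one smoothing to the fraction that carry label y.
    same = [label for text, label in context if text.split()[0] == x.split()[0]]
    agree = sum(1 for label in same if label == y)
    return math.log((agree + 1) / (len(same) + 2))

coherent = [("cat a", 1), ("cat b", 1), ("dog a", 0), ("dog b", 0)]
scrambled = [("cat a", 1), ("cat b", 0), ("dog a", 0), ("dog b", 1)]

coherent_score = mutual_predictability(coherent, toy_log_prob)
scrambled_score = mutual_predictability(scrambled, toy_log_prob)
```

In a real run, `log_prob` would be a forward pass of the pretrained model with the other labeled examples placed in the prompt.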

⚖️ Logical Consistency

Assigning every example the same label would also boost predictability, so ICM subtracts a penalty whenever two labels violate an obvious rule; for example, “A beats B” and “B beats A” cannot both be labeled true.

Even coarse rules like “different numeric answers cannot both be true” are enough to block degenerate solutions.
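As a sketch of the "different numeric answers cannot both be true" rule, the snippet below counts clashing pairs. The tuple layout `(question, answer, label)` is an assumption made for this example.

```python
from itertools import combinations

def count_inconsistencies(items):
    """Count pairs that answer the same question differently yet are both labeled True."""
    return sum(
        1
        for (q1, a1, l1), (q2, a2, l2) in combinations(items, 2)
        if q1 == q2 and a1 != a2 and l1 and l2
    )

# Degenerate labeling: every candidate answer marked True.
all_true = [("2+3", 5, True), ("2+3", 6, True), ("2+3", 7, True)]
# Sensible labeling: at most one answer per question marked True.
one_true = [("2+3", 5, True), ("2+3", 6, False), ("2+3", 7, False)]

penalty_all_true = count_inconsistencies(all_true)
penalty_one_true = count_inconsistencies(one_true)
```

The all-True labeling picks up a penalty for every pair of conflicting answers, which is what keeps the "label everything the same" shortcut from winning.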

🛠️ The ICM (Internal Coherence Maximization) Search Procedure

Exhaustive search over all label assignments is infeasible, so ICM starts with K randomly labeled examples and proposes one new label at a time. After each proposal it fixes any logical clashes, then accepts the change if the score rises; early in the search it sometimes accepts a worse move to escape local optima.

The upper panel shows how Internal Coherence Maximization measures whether a batch of labels hangs together.

Three claims already carry tentative true or false tags. The model hides one tag at a time, guesses it from the remaining two, and records the log-likelihood of guessing correctly. Adding these three numbers gives a single coherence score: the higher the sum, the more the labels fit the model’s own sense of consistency.

The lower panel shows the search loop that improves those tags. The system begins with a small arithmetic dataset that has provisional labels. It samples a fresh statement, invents several alternative labelings that obey basic arithmetic logic, and scores each candidate with the same coherence metric.

When a candidate scores better, or occasionally when it is only slightly worse early in the search, the system adopts the new labels and repeats the process. Over many iterations the dataset moves toward labels that the model can predict from one another, letting the model supervise itself without human input.

Algorithm 1 Internal Coherence Maximization (ICM)

The routine begins by picking a handful of unlabeled examples at random, assigning them rough labels, and then running a quick pass that flips any tags which clash with obvious logic so that the starting set is self-consistent.

Next it enters a loop that repeats a set number of times. At the start of each round it lowers a control variable called the temperature; high temperature at the beginning encourages exploration, and steady cooling later on encourages caution.

The algorithm then selects one still-unlabeled item and asks the pretrained model which label looks most plausible given everything it has labeled so far. It adds this tentative label to the working dataset and again repairs any logical conflicts that pop up.

It calculates how much the internal coherence score changes after this addition. If the score has improved, the new label is kept. If the score is worse, the label might still be kept, but only with a probability that shrinks as the temperature drops.
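The accept-or-reject step reads like simulated annealing, and can be sketched as follows. The logarithmic cooling schedule and its default constants (`t0`, `t_min`, `beta`) are illustrative assumptions, not values confirmed by this thread.

```python
import math
import random

def temperature(step, t0=10.0, t_min=0.01, beta=0.99):
    """Assumed cooling schedule: starts at t0 and decays with log(step); step >= 1."""
    return max(t_min, t0 / (1.0 + beta * math.log(step)))

def accept_probability(delta, temp):
    """Chance of keeping a proposal whose score change is delta."""
    if delta > 0:
        return 1.0                    # improvements are always kept
    return math.exp(delta / temp)     # worse moves survive less often as temp drops

def accept(delta, temp, rng):
    """Draw the accept/reject decision using the supplied random generator."""
    return rng.random() < accept_probability(delta, temp)

early = accept_probability(-1.0, temperature(1))     # hot: worse move often kept
late = accept_probability(-1.0, temperature(1000))   # cool: worse move rarely kept
```

With these assumed constants, the same slightly worse move is far more likely to be kept at step 1 than at step 1000, which is the explore-then-settle behavior described above.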

By repeating these steps the method slowly grows a dataset whose labels are mutually predictable and logically consistent, letting the model teach itself without human grading.

Fix Inconsistencies.

The paper adds a clean-up step because chasing a high coherence score alone can leave many labels that break simple logic.

ConsistencyFix scans the current dataset for any pair of items that now clash with each other, such as one claim saying a statement is true while another says the opposite.

It picks one conflicting pair, lists every way their two tags could be made consistent, and asks the model which option would give the best overall coherence score.

If this new combination lifts the score, the algorithm keeps the change; otherwise it leaves the tags untouched.

The process repeats for a fixed number of tries or until no contradictions remain, so each new label can trigger repairs to old mistakes instead of being thrown away.
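A rough sketch of that repair loop, under stated assumptions: `pairs` lists the index pairs linked by a logic rule, `consistent` decides whether two tags can coexist, and `toy_score` is a hypothetical scorer (fixed per-claim beliefs minus a clash penalty) standing in for the model-based coherence score.

```python
import math
from itertools import product

def consistency_fix(labels, pairs, consistent, score, max_tries=10):
    """Repair clashing label pairs, keeping a repair only if the score improves."""
    labels = dict(labels)
    for _ in range(max_tries):
        clash = next(((i, j) for i, j in pairs
                      if not consistent(labels[i], labels[j])), None)
        if clash is None:
            break  # no contradictions remain
        i, j = clash
        # Every way to tag the pair that satisfies the rule.
        options = [(a, b) for a, b in product([True, False], repeat=2)
                   if consistent(a, b)]
        best = max(options, key=lambda ab: score({**labels, i: ab[0], j: ab[1]}))
        candidate = {**labels, i: best[0], j: best[1]}
        if score(candidate) > score(labels):
            labels = candidate
    return labels

# Toy setup: claim 0 is "A beats B", claim 1 is "B beats A"; both cannot be True.
PAIRS = [(0, 1)]

def consistent(a, b):
    return not (a and b)

def toy_score(labels):
    # Hypothetical beliefs: the "model" thinks claim 0 is likely true, claim 1 unlikely.
    belief_true = {0: 0.9, 1: 0.2}
    log_p = sum(math.log(belief_true[k] if v else 1 - belief_true[k])
                for k, v in labels.items())
    clashes = sum(1 for i, j in PAIRS if not consistent(labels[i], labels[j]))
    return log_p - 5.0 * clashes

fixed = consistency_fix({0: True, 1: True}, PAIRS, consistent, toy_score)
```

In this toy case the clash is resolved by keeping the claim the scorer finds more plausible and flipping the other one.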

📊 Experimental Findings

With Llama 3.1 models, ICM matches golden-label fine-tuning on TruthfulQA and GSM8K-verification and beats crowdsourced Alpaca supervision.

On an author-gender task where models are superhuman, ICM hits 80 % accuracy while humans stay near 60 %.

Labeling needs only about 2 – 4 forward passes per data point, keeping costs modest.

🚀 Frontier Model Application

The team trained a reward model for Claude 3.5 Haiku on 400 000 prompt pairs using only ICM labels; it scored 75 % on RewardBench versus 72.2 % for the human-labeled baseline.

Reinforcement learning with the unsupervised reward model produced an assistant that wins 60 % of head-to-head comparisons against the baseline trained with human supervision.

The bars on the left compare two reward models. One is fine-tuned on human labels, the other on labels produced by Internal Coherence Maximization. On Alpaca data the unsupervised model reaches about 68 % accuracy, ten points higher than the human-labeled version. On a larger production set it still stays a few points ahead, around 75 % versus 72 %.

The bars on the right track how often a chatbot trained with each reward model wins direct face-offs when judged by a stronger reference system. The bot guided by the unsupervised reward model wins roughly 60 % of matches, while the bot trained with human supervision wins about 50 %.

Together these two plots show that labels generated by the model itself can train both the reward model and the final assistant at least as well as human annotation.

🔒 Limitations

ICM fails when the desired concept is absent in the base model; on a contrived poem-quality task where a poem counts as good only if it mentions the word “sun”, ICM did no better than random guessing.

Computing the score requires fitting many labeled examples into the model's context window, so tasks with very long inputs remain out of reach.

🧠 Practical Takeaways

Internal coherence is a powerful supervision signal already inside the network, and simple logic checks keep the search on track.

When the target skill is latent, the model can refine itself without a single human-written label.

The @AnthropicAI Paper

arxiv.org/pdf/2506.10139
