@MatthieuWyart: LLMs learn by predicting token...

1

LLMs learn by predicting tokens. World models (JEPA, data2vec) learn by predicting their own abstractions. Which needs more data? For data with hidden hierarchy, we prove the gap is exponential. arxiv.org/pdf/2605.27734

2

Why? A network discovers a latent variable from its correlation with a prediction target. Correlations between latents at the same level of abstraction are far stronger than those between a latent and raw tokens. Token prediction dilutes the signal that latent prediction amplifies.
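
A toy calculation makes the dilution concrete. This is a minimal sketch under an assumed model, not the grammar or the proof from the paper: binary ±1 variables sit on a tree, and each child copies its parent but flips with probability `flip_prob` (a hypothetical parameter). In such a chain the correlation between two nodes is the per-edge correlation raised to the number of edges on the path between them, so a latent's correlation with a sibling latent stays fixed while its correlation with a distant token shrinks geometrically with depth.

```python
def path_correlation(flip_prob: float, n_edges: int) -> float:
    """Correlation of two +/-1 nodes joined by n_edges noisy copy steps."""
    per_edge = 1.0 - 2.0 * flip_prob  # correlation across a single edge
    return per_edge ** n_edges


def compare_targets(flip_prob: float, depth: int) -> tuple:
    """Signal available for discovering a top-level latent A (toy model).

    Latent target: A's sibling latent B, 2 edges away (A -> root -> B).
    Token target: a leaf under B, depth + 1 edges away
                  (A -> root -> B -> ... -> leaf).
    """
    latent_signal = path_correlation(flip_prob, 2)
    token_signal = path_correlation(flip_prob, depth + 1)
    return latent_signal, token_signal


if __name__ == "__main__":
    # Assumed illustrative values; nothing here is taken from the paper.
    for depth in (2, 4, 8):
        latent, token = compare_targets(flip_prob=0.2, depth=depth)
        print(f"depth={depth}: latent target {latent:.3f}, token target {token:.4f}")
```

In this toy model the latent-target signal does not depend on `depth` at all, while the token-target signal decays exponentially in it.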

3

We make this precise on simple context-free grammars. Token-level SSL needs a sample size exponential in the depth of the latent tree. Learning from your own latents is nearly independent of depth. We show that data2vec implicitly does exactly this hierarchical latent prediction.
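
To picture what "data with hidden hierarchy" can look like, here is a hedged sketch of a toy grammar sampler. The thread does not spell out the paper's grammar, so everything below is an assumption made for illustration: the names `make_grammar` and `sample_sentence`, binary branching, and the parameters `vocab_size` and `n_rules` are all hypothetical.

```python
import random


def make_grammar(depth, vocab_size=4, n_rules=2, seed=0):
    """Build one random rule table per tree level (requires depth >= 1).

    Each table maps a symbol to n_rules candidate expansions,
    and each expansion is a pair of child symbols.
    """
    rng = random.Random(seed)
    return [
        {
            symbol: [
                (rng.randrange(vocab_size), rng.randrange(vocab_size))
                for _ in range(n_rules)
            ]
            for symbol in range(vocab_size)
        }
        for _ in range(depth)
    ]


def sample_sentence(grammar, rng):
    """Expand a random root through every level of the grammar.

    Returns (latents, tokens): latents[k] holds the hidden symbols at
    level k of the tree, and tokens is the observed leaf sequence.
    """
    vocab_size = len(grammar[0])
    level = [rng.randrange(vocab_size)]  # root latent
    latents = []
    for table in grammar:
        latents.append(level)
        # Replace every symbol by one of its allowed pairs of children.
        level = [child for symbol in level for child in rng.choice(table[symbol])]
    return latents, level


if __name__ == "__main__":
    grammar = make_grammar(depth=3)
    latents, tokens = sample_sentence(grammar, random.Random(1))
    print("tokens:", tokens)
    print("latent level sizes:", [len(level) for level in latents])
```

Only `tokens` would be visible to a learner. The `latents` lists are the hidden variables of the tree, and `len(grammar)` plays the role of its depth.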

4

A consequence: if a single latent-prediction module (data2vec) is already implicitly multi-scale, then explicitly stacking them (e.g. H-JEPA) is to some extent redundant. Work led by @DanKorchinski & @alesfav.

5

See the excellent threads by @DanKorchinski and @alesfav.
