Thread by @MatthieuWyart

Matthieu wyart

@MatthieuWyart

LLMs learn by predicting tokens. World models (JEPA, data2vec) learn by predicting their own abstractions. Which needs more data? For data with hidden hierarchy, we prove the gap is exponential. arxiv.org/pdf/2605.27734

05:21 AM · Jun 01, 2026

Matthieu wyart

@MatthieuWyart

Why? A network discovers a latent variable from its correlation with a prediction target. Correlations between latents at the same level of abstraction are far stronger than between a latent and raw tokens. Token prediction dilutes the signal that latent prediction amplifies.

05:22 AM · Jun 01, 2026
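
A minimal sketch of the correlation argument above, under assumptions that are not from the thread: symbols are ±1, and along each edge of the latent tree a child copies its parent with probability `p` and is otherwise resampled uniformly. This toy broadcast process stands in for the grammars studied in the paper. The helper names `broadcast` and `estimate_correlations` are hypothetical. The sketch compares how strongly a latent correlates with a sibling latent at the same level versus with a raw token several levels below that sibling.

```python
import random


def broadcast(symbol, steps, p, rng):
    """Pass a +/-1 symbol down `steps` tree edges.

    On each edge the child copies its parent with probability p,
    otherwise it is resampled uniformly from {-1, +1}.
    (Toy assumption, not the paper's grammar.)
    """
    for _ in range(steps):
        if rng.random() >= p:
            symbol = rng.choice((-1, 1))
    return symbol


def estimate_correlations(depth, p, n_samples, seed=0):
    """Monte Carlo estimate of two correlations with the same latent.

    Returns (latent_latent, latent_token):
      latent_latent: correlation with a sibling latent at the same level
      latent_token:  correlation with a token `depth` edges below that sibling
    """
    rng = random.Random(seed)
    sum_latent = 0
    sum_token = 0
    for _ in range(n_samples):
        parent = rng.choice((-1, 1))
        latent = broadcast(parent, 1, p, rng)       # latent to be discovered
        sibling = broadcast(parent, 1, p, rng)      # same level of abstraction
        token = broadcast(sibling, depth, p, rng)   # raw token under the sibling
        # Symbols have zero mean, so the average product is the correlation.
        sum_latent += latent * sibling
        sum_token += latent * token
    return sum_latent / n_samples, sum_token / n_samples


if __name__ == "__main__":
    p = 0.8
    for depth in (1, 2, 4, 8):
        ll, lt = estimate_correlations(depth, p, n_samples=20000)
        print(f"depth={depth}: latent-latent={ll:.3f} (theory {p**2:.3f}), "
              f"latent-token={lt:.3f} (theory {p**(depth + 2):.3f})")
```

In this toy model the latent-latent correlation stays near `p**2` whatever the depth, while the latent-token correlation shrinks as `p**(depth + 2)`, which is the dilution described above.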

Matthieu wyart

@MatthieuWyart

We make this precise on simple context-free grammars. Token-level SSL needs a sample size exponential in the depth of the latent tree. Learning from your own latents is nearly independent of depth. We show that data2vec implicitly does exactly this hierarchical latent prediction.

05:29 AM · Jun 01, 2026
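
A back-of-the-envelope version of this scaling, as a sketch and not the paper's proof. Assume a toy tree in which each ±1 symbol copies its parent with probability $p$ and is otherwise resampled uniformly, so the correlation across $k$ edges is $p^{k}$. Assume also the standard heuristic that detecting a correlation $c$ takes about $1/c^{2}$ samples. The symbols $p$, $d$, $n_{\text{latent}}$ and $n_{\text{token}}$ are introduced here for illustration only.

```latex
\begin{align*}
\operatorname{corr}(\text{latent}, \text{sibling latent}) &= p^{2}, \\
\operatorname{corr}(\text{latent}, \text{token } d \text{ levels below the sibling}) &= p^{d+2}, \\
n_{\text{latent}} \sim p^{-4}, \qquad
n_{\text{token}} &\sim p^{-2(d+2)} = n_{\text{latent}} \, p^{-2d}.
\end{align*}
```

Under these assumptions, with $p = 0.8$ and $d = 10$ the token-level target needs about $0.8^{-20} \approx 87$ times more samples than the latent-level target, and the ratio grows exponentially with $d$.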

Matthieu wyart

@MatthieuWyart

A consequence: if a single latent-prediction module (data2vec) is already implicitly multi-scale, then explicitly stacking them (e.g. H-JEPA) is to some extent redundant. Work led by @DanKorchinski & @alesfav.

05:33 AM · Jun 01, 2026

Matthieu wyart

@MatthieuWyart

@DanKorchinski @alesfav see excellent threads by @DanKorchinski and @alesfav.

05:37 AM · Jun 01, 2026
