@alesfav: AI needs vastly more data than...

1

AI needs vastly more data than we do. One idea might close the gap: don't predict raw signals (tokens), predict your own abstract latent representation (JEPA, data2vec).

With @DanKorchinski and @MatthieuWyart, we prove on a toy model how much that helps: the gap is exponential.

🧵

2

We study recovering the hidden latent tree of a hierarchical grammar.

Token-level SSL pays a depth tax: the data it needs grows exponentially with the tree's depth. We prove that iteratively supervising on latents escapes it, recovering the tree with constant-in-depth data!
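To make "hidden latent tree of a hierarchical grammar" concrete, here is a minimal sketch of such a generator. It is not the paper's exact model: the function names, the random rule tables, and the parameters (`vocab_size`, `branching`, `rules_per_symbol`) are all assumptions for illustration. A root symbol is expanded level by level, and only the last level (the tokens) would be observed.

```python
import random

def make_grammar(vocab_size, branching, depth, rules_per_symbol, seed=0):
    """Build one random rule table per level: symbol -> list of child tuples.

    Assumption: rules are drawn uniformly at random, so two symbols may
    share a production. A real toy model would likely forbid that.
    """
    rng = random.Random(seed)
    grammar = []
    for _ in range(depth):
        level = {
            sym: [tuple(rng.randrange(vocab_size) for _ in range(branching))
                  for _ in range(rules_per_symbol)]
            for sym in range(vocab_size)
        }
        grammar.append(level)
    return grammar

def sample_tokens(grammar, root, rng):
    """Expand the root level by level; return leaf tokens and hidden levels."""
    levels = [[root]]
    for level_rules in grammar:
        expanded = []
        for sym in levels[-1]:
            # Pick one of the symbol's productions at random.
            expanded.extend(rng.choice(level_rules[sym]))
        levels.append(expanded)
    return levels[-1], levels[:-1]

grammar = make_grammar(vocab_size=4, branching=2, depth=3, rules_per_symbol=2)
tokens, latents = sample_tokens(grammar, root=0, rng=random.Random(1))
print(len(tokens))                # 8 observed tokens (2**3)
print([len(l) for l in latents])  # [1, 2, 4] hidden symbols per level
```

The learner sees only `tokens`; `latents` is the tree it has to recover, and the number of tokens grows as `branching**depth`.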

3

Surprisingly, we found data2vec already does this with a single module. Through its teacher, it implicitly supervises on latents at every level, reaching the same constant-in-depth scaling. 🤯

The hierarchy unfolds during training rather than being stacked into the architecture.
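For readers unfamiliar with the teacher mechanism: in data2vec the teacher is an exponential moving average (EMA) of the student, and the student regresses onto the teacher's latent representations. A minimal sketch of those two pieces, with plain Python lists standing in for weights and latents (the function names and the value of `tau` are assumptions for illustration, not the paper's setup):

```python
def ema_update(teacher, student, tau):
    """Move each teacher weight toward the student: t <- tau*t + (1 - tau)*s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

def latent_regression_loss(student_pred, teacher_target):
    """Mean squared error between student predictions and teacher latents."""
    squared = [(p - y) ** 2 for p, y in zip(student_pred, teacher_target)]
    return sum(squared) / len(squared)

teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    # tau=0.5 is chosen only to make the drift visible in three steps;
    # in practice tau is typically much closer to 1.
    teacher = ema_update(teacher, student, tau=0.5)
print(teacher)  # [0.875, -0.875]
```

Because the teacher lags the student, its targets change as training proceeds, which is the sense in which the hierarchy can unfold over time with a single module.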

4

This result also suggests that explicit stacking, like H-JEPA, may be redundant.

Many open questions!

📄 Our paper: arxiv.org/abs/2605.27734

5

@TMoldwin @DanKorchinski @MatthieuWyart Latent prediction avoids that bottleneck by learning one level, then using that learned level as the target/context for the next.

We may write a more accessible blog post version at some point!
