AI needs vastly more data than we do. One idea might close the gap: don't predict raw signals (tokens), predict your own abstract latent representation (JEPA, data2vec).
With @DanKorchinski and @MatthieuWyart, we prove on a toy model how much that helps: the gap is exponential.
🧵

We study recovering the hidden latent tree of a hierarchical grammar.
Token-level SSL pays a depth tax: the data it needs grows exponentially with the tree's depth. We prove that iteratively supervising on latents escapes it, recovering the tree with constant-in-depth data!
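
For readers who want something concrete, here is a minimal sketch of what sampling from a toy hierarchical grammar could look like. The rule format, vocabulary size, and the names `make_rules` and `sample_tokens` are assumptions for illustration, not the exact model from the paper. A root symbol is expanded level by level; only the final level (the tokens) is observed, and the intermediate levels are the hidden latent tree.

```python
import random

def make_rules(vocab_size, branching, rules_per_symbol, rng):
    """Draw a few hypothetical production rules (tuples of children) per symbol."""
    return {
        s: [tuple(rng.randrange(vocab_size) for _ in range(branching))
            for _ in range(rules_per_symbol)]
        for s in range(vocab_size)
    }

def sample_tokens(root, depth, rules_by_level, rng):
    """Expand the root `depth` times; return observed tokens and hidden levels."""
    levels = [[root]]
    for level in range(depth):
        expanded = []
        for sym in levels[-1]:
            # Each symbol picks one of its rules at random and emits its children.
            expanded.extend(rng.choice(rules_by_level[level][sym]))
        levels.append(expanded)
    return levels[-1], levels[:-1]

rng = random.Random(0)
depth, branching, vocab = 3, 2, 4  # illustrative sizes only
rules_by_level = [make_rules(vocab, branching, 2, rng) for _ in range(depth)]
tokens, latents = sample_tokens(rng.randrange(vocab), depth, rules_by_level, rng)
```

With branching 2 and depth 3, each sample has 8 tokens sitting under 3 hidden levels of 1, 2, and 4 latent symbols. The learner only ever sees `tokens`.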

Surprisingly, we found data2vec already does this with a single module. Through its teacher, it implicitly supervises on latents at every level, reaching the same constant-in-depth scaling. 🤯
The hierarchy unfolds during training rather than being stacked into the architecture.
This result also suggests that explicit stacking, like H-JEPA, may be redundant.
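
For context on the teacher mentioned above: in data2vec the teacher's weights are an exponential moving average of the student's, and the student is trained to predict the teacher's representations. Here is a minimal sketch of just the averaging step, with plain lists standing in for weights. The name `ema_update` and the decay value are illustrative assumptions, not taken from the paper.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight a small step toward the matching student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    # A low decay is used here only so the drift is visible in a few steps.
    teacher = ema_update(teacher, student, decay=0.5)
```

Because the teacher trails the student, the targets it provides change as training progresses, which is one way to picture the hierarchy unfolding over training time rather than across stacked modules.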
Many open questions!
Our paper: arxiv.org/abs/2605.27734
@TMoldwin @DanKorchinski @MatthieuWyart Latent prediction avoids that bottleneck by learning one level, then using that learned level as the target/context for the next.
We may write a more accessible blog post version at some point!
