
@alesfav: AI needs vastly more data than...

@alesfav
9 views Jun 22, 2026
1
AI needs vastly more data than we do. One idea might close the gap: don't predict raw signals (tokens), predict your own abstract latent representation (JEPA, data2vec).

With @DanKorchinski @MatthieuWyart, on a toy model, we prove how much that helps: the gap is exponential.

🧵
2
We study recovering the hidden latent tree of a hierarchical grammar.

Token-level SSL pays a depth tax: the data it needs grows exponentially with the tree's depth. We prove that iteratively supervising on latents escapes it, recovering the tree with constant-in-depth data!
3
Surprisingly, we found data2vec already does this with a single module. Through its teacher, it implicitly supervises on latents at every level, reaching the same constant-in-depth scaling. 🤯

The hierarchy unfolds during training rather than being stacked into the architecture.
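
For readers unfamiliar with the teacher mechanism, here is a minimal sketch of the general idea, not the toy model from the paper: a student predicts the latent that a slowly moving teacher assigns to the unmasked input, and the teacher is an exponential moving average (EMA) of the student. The function names, the linear encoder, and the `decay` value are all illustrative assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the student weight."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def encode(weights, x):
    """Toy linear encoder (an assumption): one latent value per input."""
    return sum(w * xi for w, xi in zip(weights, x))


def latent_loss(student, teacher, masked_x, full_x):
    """Squared error between the student's latent on the masked input
    and the teacher's latent on the full input."""
    target = encode(teacher, full_x)  # teacher sees the unmasked input
    pred = encode(student, masked_x)  # student sees the masked input
    return (pred - target) ** 2
```

The target here is a latent rather than a raw token, which is the distinction the thread draws; in a real setup only the student is trained by gradient descent and the teacher is updated with `ema_update` after each step.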
4
This result also suggests that explicit stacking, like H-JEPA, may be redundant.

Many open questions!

📄 Our paper: arxiv.org/abs/2605.27734
5
@TMoldwin @DanKorchinski @MatthieuWyart Latent prediction avoids that bottleneck by learning one level, then using that learned level as the target/context for the next.

We may write a more accessible blog post version at some point!