
Vatsal Baherwani
@vatsalbaherwani

Scaling laws predict an LLM's pretraining loss, but not its capabilities. Abilities like in-context learning emerge abruptly and only past a certain scale. Our new paper traces this to one bottleneck: learning which tokens attention should focus on. 🧵 https://arxiv.org/abs/2606.25010

Consider 10 LMs with an identical training setup, differing only in their random initialization. We evaluate each on multiple tasks; some models solve them early on while others never do. Even among those that succeed, the step at which each one "gets it" varies significantly.

For many training steps the model cannot solve a given task, then performance jumps abruptly. What changes at that step? We see the jump occurs when one or more attention heads discover a task-relevant pattern. Learning this pattern is the bottleneck for the capability to emerge.
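
One way to make "the step at which performance jumps" concrete is to record the first evaluation step where accuracy crosses a threshold. The sketch below is only an illustration: the helper name `emergence_step`, the 0.5 threshold, and the toy curves are assumptions made here, not the paper's procedure or data.

```python
from typing import Optional, Sequence


def emergence_step(steps: Sequence[int], accuracy: Sequence[float],
                   threshold: float = 0.5) -> Optional[int]:
    """Return the first step whose accuracy reaches `threshold`, or None.

    Hypothetical helper: the threshold criterion is an assumption.
    """
    for step, acc in zip(steps, accuracy):
        if acc >= threshold:
            return step
    return None


# Toy curves for three seeds (made-up numbers, for illustration only).
steps = [0, 1000, 2000, 3000, 4000]
curves = {
    "seed_0": [0.01, 0.02, 0.95, 0.97, 0.98],
    "seed_1": [0.01, 0.01, 0.02, 0.03, 0.96],
    "seed_2": [0.01, 0.01, 0.02, 0.02, 0.03],  # never solves the task
}
jump_steps = {name: emergence_step(steps, acc) for name, acc in curves.items()}
```

Comparing `jump_steps` across seeds shows the kind of spread the thread describes: some runs jump early, some late, and some never do.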

To study this bottleneck directly, we design synthetic tasks where we know the ground-truth attention map. The loss curve consists of abrupt jumps; in each one the model learns a portion of the correct pattern. The model is effectively searching for the right attention patterns.
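
To illustrate what "a synthetic task with a known ground-truth attention map" can look like, here is a minimal sketch. It is a hypothetical toy task invented for this example, not necessarily one of the paper's tasks: the target is the mean of the inputs at a hidden set of positions, so uniform attention over those positions solves it exactly.

```python
import random
from typing import List, Tuple


def make_example(context_len: int, relevant: List[int],
                 rng: random.Random) -> Tuple[List[float], float]:
    """Sample one (inputs, target) pair for the toy selection task.

    The target is the mean of the inputs at the `relevant` positions.
    """
    x = [rng.gauss(0.0, 1.0) for _ in range(context_len)]
    y = sum(x[i] for i in relevant) / len(relevant)
    return x, y


def ground_truth_attention(context_len: int, relevant: List[int]) -> List[float]:
    """Attention weights that solve the toy task: uniform over `relevant`."""
    weight = 1.0 / len(relevant)
    return [weight if i in relevant else 0.0 for i in range(context_len)]


rng = random.Random(0)
relevant = [1, 4, 6]  # hidden positions the model would have to discover
x, y = make_example(8, relevant, rng)
attn = ground_truth_attention(8, relevant)

# Applying the ground-truth attention to the inputs reproduces the target.
prediction = sum(a * v for a, v in zip(attn, x))
```

Because the correct map is known in advance, a learned attention pattern can be compared against `attn` directly at every training step.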

The larger the search space of possible attention patterns, the longer the model plateaus. Medium-sparsity patterns are the hardest to find. Also, simply scaling the context length can make a task more difficult to learn, or even unlearnable.
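
A rough counting argument shows why medium sparsity could be the hardest case. Assume, purely for illustration and not as the paper's definition, that a pattern is a choice of k relevant positions out of n context positions. The number of candidates is then the binomial coefficient C(n, k), which peaks at k = n/2 and grows with n for fixed k.

```python
from math import comb


def num_patterns(context_len: int, num_relevant: int) -> int:
    """Count subsets of `num_relevant` positions out of `context_len`."""
    return comb(context_len, num_relevant)


n = 16
counts = {k: num_patterns(n, k) for k in range(1, n + 1)}
hardest_k = max(counts, key=counts.get)  # k with the most candidate patterns

# Doubling the context length at fixed k also enlarges the search space.
growth = num_patterns(32, 4) / num_patterns(16, 4)
```

Under this counting assumption, very sparse and very dense patterns both have few candidates, while lengthening the context multiplies the number of candidates at any fixed k.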

So how do we make the search easier? Adding attention heads helps, since each head gives the model another chance to find the correct pattern. Changing the mixing mechanism can also help, as MLP-Mixer beats transformers by almost an order of magnitude on our linear map task.
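
The "another chance" intuition can be written as a toy probability model. Suppose each head independently finds the correct pattern with probability p; both the independence and the shared value of p are assumptions made here for illustration, not claims from the paper. Then at least one of H heads succeeds with probability 1 - (1 - p)^H.

```python
def prob_any_head_succeeds(p: float, num_heads: int) -> float:
    """Probability that at least one of `num_heads` independent heads succeeds.

    Toy model: each head succeeds independently with probability `p`.
    """
    return 1.0 - (1.0 - p) ** num_heads


# Success probability as the head count grows, with an arbitrary p = 0.1.
probs = {h: prob_any_head_succeeds(0.1, h) for h in (1, 2, 4, 8, 16)}
```

In this model the benefit of extra heads is largest when p is small, and it saturates as the probability approaches 1.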

Making this attention pattern search easier is a direct path to more capable, sample-efficient language models. 📄 http://arxiv.org/abs/2606.25010 ✍️ http://vatsal0.github.io/blog/emergence.html Thank you to my collaborators who made this work possible! @charllechen @ShikaiQiu @andrewgwils @Pavel_Izmailov