@_avichawla: The growth of LLM context leng...
@_avichawla
Aug 23, 2025
1
The growth of LLM context length with time:
- GPT-3.5-turbo → 4k tokens
- GPT-4 → 8k tokens
- Claude 2 → 100k tokens
- Llama 3.1 → 128k tokens
- Gemini → 1M tokens
Let's understand how they extend the context length of LLMs:
4
A similar idea was used in ModernBERT.
It is an upgraded version of BERT with:
- 16x larger sequence length (8,192 tokens vs. BERT's 512)
- Much better downstream performance, and
- The most memory-efficient encoder
They used alternating attention.
Check this 👇
5
Here's the idea:
- Use full global attention in every third layer.
- Use local attention otherwise, where each token attends only to a window of 128 nearby tokens.
This allows ModernBERT to process longer sequences, while also being significantly faster than other encoder models.
Check this 👇
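The layer schedule above can be sketched in a few lines of plain Python. This is a minimal illustration, not ModernBERT's actual code: the helper names (`is_global_layer`, `can_attend`, `attended_tokens`) are hypothetical, and two details are assumptions: that layers 0, 3, 6, ... are the global ones, and that the local window is centered on the token (64 positions on each side plus the token itself).

```python
def is_global_layer(layer_idx: int, global_every: int = 3) -> bool:
    # Assumption: layers 0, 3, 6, ... use full global attention.
    return layer_idx % global_every == 0


def can_attend(i: int, j: int, layer_idx: int,
               window: int = 128, global_every: int = 3) -> bool:
    """Return True if token i may attend to token j in the given layer."""
    if is_global_layer(layer_idx, global_every):
        # Global layer: every token sees every other token.
        return True
    # Local layer (assumed centered window): only tokens within
    # window // 2 positions on either side are visible.
    return abs(i - j) <= window // 2


def attended_tokens(i: int, seq_len: int, layer_idx: int,
                    window: int = 128, global_every: int = 3) -> int:
    """Count how many positions token i can attend to in this layer."""
    return sum(
        can_attend(i, j, layer_idx, window, global_every)
        for j in range(seq_len)
    )


if __name__ == "__main__":
    seq_len = 8192          # example length: 16 x 512
    middle = seq_len // 2   # a token far from both sequence edges
    for layer_idx in range(6):
        kind = "global" if is_global_layer(layer_idx) else "local"
        print(layer_idx, kind, attended_tokens(middle, seq_len, layer_idx))
```

Under these assumptions, a token in the middle of an 8,192-token sequence sees all 8,192 positions in a global layer but only 129 (itself plus 64 on each side) in a local layer. That gap is where the speed and memory savings come from.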
6
Here's an intuitive explanation taken from the paper:
Picture yourself reading a book. For every sentence you read, do you need to be fully aware of the entire plot to understand most of it (full global attention)?
Or is awareness of the current chapter enough (local attention), as long as you occasionally think back on its significance to the main plot (global attention)?
In the vast majority of cases, it’s the latter.
11
That's a wrap!
If you found it insightful, reshare it with your network.
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.