
Akshay 🚀
@akshay_pachaar
How LLMs understand relative positions of input words, clearly explained:
RoPE (Rotary Positional Embeddings) revolutionised the way positional information is encoded in LLMs and it's widely used by models like Llama-3.

Today, I'll clearly explain what they are & how positional embeddings evolved over time.

Let's go! 🚀
Thread image
Why Positional embeddings❓

Let's first understand the concept of positional embeddings and why they are essential in the first place.

Without positional embeddings, a transformer has no notion of word order: self-attention on its own treats the input as an unordered set of tokens, so it can't tell "dog bites man" from "man bites dog".
Thread image
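To make this concrete, here is a minimal sketch of dot-product self-attention with no positional information. It is a simplification, not a full transformer layer: the 2-d embeddings are made up, and queries, keys and values are all just the raw embeddings (no learned projections, no masking).

```python
import math

def attention(queries, keys, values):
    """Plain dot-product self-attention with no positional information."""
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]   # softmax numerator
        total = sum(weights)
        # Weighted average of the value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, values)) / total
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Hypothetical 2-d embeddings for three tokens.
emb = {"I": [1.0, 0.0], "love": [0.5, 0.5], "tennis": [0.0, 1.0]}

def encode(sentence):
    x = [emb[w] for w in sentence]
    # Assumption: queries = keys = values = raw embeddings.
    return dict(zip(sentence, attention(x, x, x)))

a = encode(["I", "love", "tennis"])
b = encode(["tennis", "love", "I"])   # same words, reversed order
```

Each word ends up with the same output vector in `a` and `b` (up to floating-point rounding), so in this sketch the model has no way to tell the two word orders apart.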
Absolute Positional embeddings

So, let's start with the simplest way to encode the position of tokens (words) in an input: give each position its own embedding and add it to the token embedding.
Thread image
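A rough sketch of the idea, assuming the fixed sinusoidal flavour from the original Transformer paper (the thread doesn't say which variant the image shows; learned position tables are the other common choice). The 4-d token embedding and the helper names are made up for illustration.

```python
import math

def sinusoidal_position(pos, dim):
    """Fixed position vector: a (sin, cos) pair per frequency."""
    vec = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        vec.append(math.sin(pos * freq))
        vec.append(math.cos(pos * freq))
    return vec[:dim]

def add_position(token_embedding, pos):
    """Absolute scheme: token embedding + position vector, element-wise."""
    p = sinusoidal_position(pos, len(token_embedding))
    return [x + y for x, y in zip(token_embedding, p)]

love = [0.5, 0.5, 0.1, 0.9]          # hypothetical 4-d token embedding
at_pos_1 = add_position(love, 1)     # "love" as the 2nd token
at_pos_5 = add_position(love, 5)     # "love" as the 6th token
```

The same word gets a different input vector at each position, which is how the model can tell positions apart.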
Problem with absolute positional embeddings! ⚠️

Each positional embedding is independent of the others, so they don't capture how tokens are arranged relative to one another.

Check this out👇
Thread image
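One way to see the issue in code: take two words that sit next to each other, add absolute position vectors, and compare their dot product at two different places in the sequence. This is only an illustrative sketch; the 2-d token embeddings and the single sin/cos position vector are assumptions, not what a real model uses.

```python
import math

def position_vector(pos):
    # Hypothetical 2-d absolute position vector (a single sinusoidal pair).
    return [math.sin(pos), math.cos(pos)]

def with_position(token, pos):
    # Absolute scheme: add the position vector to the token embedding.
    return [t + p for t, p in zip(token, position_vector(pos))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

love, tennis = [1.0, 0.0], [0.0, 1.0]   # made-up token embeddings

# Same pair of words, one position apart, at two places in the sequence.
early = dot(with_position(love, 1), with_position(tennis, 2))
late = dot(with_position(love, 5), with_position(tennis, 6))
print(early, late)
```

The two scores differ even though the words are the same distance apart in both cases, so the relative arrangement is not something the model gets for free.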
Next up, let's talk about relative positional embeddings!

They definitely overcome the limitations of absolute positional embeddings, but they come with the drawback of increased parameters.

For a sequence of length N, the offset between two tokens ranges from -(N-1) to N-1, so we need one embedding per offset: 2N-1 in total.
Thread image
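A small sketch of the bookkeeping, assuming one embedding per distinct offset between a query position and a key position. The function name and the tiny sequence length are just for illustration.

```python
def relative_offsets(seq_len):
    """Offset (key position - query position) for every pair of positions."""
    return [[j - i for j in range(seq_len)] for i in range(seq_len)]

seq_len = 4
offsets = relative_offsets(seq_len)

# Every distinct offset would need its own positional embedding.
distinct = sorted({o for row in offsets for o in row})
print(distinct)        # offsets from -3 to 3 for seq_len = 4
print(len(distinct))   # 7 distinct offsets for seq_len = 4
```

Note that the lookup also has to happen for every (query, key) pair, so the offset matrix itself grows with the square of the sequence length.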
Introducing Rotary Positional Embeddings (RoPE), the best of both worlds!

Instead of adding a positional embedding to the token embedding, we rotate the token embedding by an angle that depends on its position in the sequence: a token at position m is rotated by m × theta, where theta is a fixed base angle.

Imagine our token embeddings are 2-dimensional!
Thread image
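In the 2-d case the rotation is just the standard rotation matrix applied with angle position × theta. A minimal sketch, with a made-up embedding and an arbitrary base angle:

```python
import math

def rotate(vec, position, theta):
    """Rotate a 2-d vector counter-clockwise by position * theta radians."""
    angle = position * theta
    x, y = vec
    return [x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle)]

theta = math.pi / 8            # hypothetical base angle
token = [1.0, 0.0]             # made-up 2-d token embedding

for pos in range(5):
    print(pos, rotate(token, pos, theta))
```

The vector's length never changes; only its direction does, and tokens further into the sequence are rotated further.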
RoPE preserves relative position.

For example, in the scenario below, the embedding vectors for the words 'love' and 'Tennis' will have the same cosine similarity as long as the offset between them stays the same, regardless of their absolute positions or the sequence length.

Check this out👇
Thread image
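This property is easy to check numerically. The sketch below rotates two made-up 2-d embeddings for 'love' and 'Tennis' and compares their cosine similarity at two different places in the sequence; the vectors and the base angle are assumptions picked for illustration.

```python
import math

def rotate(vec, position, theta):
    """Rotate a 2-d vector by position * theta radians."""
    angle = position * theta
    x, y = vec
    return [x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

love, tennis = [1.0, 0.5], [0.3, 1.0]   # made-up token embeddings
theta = 0.1                              # hypothetical base angle

# 'love' and 'Tennis' one position apart, early and late in the sequence.
near_start = cosine(rotate(love, 1, theta), rotate(tennis, 2, theta))
further_in = cosine(rotate(love, 5, theta), rotate(tennis, 6, theta))
print(near_start, further_in)
```

Both values agree up to floating-point rounding, because rotating both vectors by the same extra angle leaves the angle between them unchanged.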
Mathematical representation.

This generalises easily to d dimensions: split the embedding into pairs of dimensions and rotate each pair separately, with its own theta.

(I have also shared my detailed blog on self-attention at the end)

Check this out👇
Thread image
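A hedged sketch of the pairwise version, assuming the frequency schedule from the RoFormer paper (theta for pair i is base^(-2i/d) with base 10000) and interleaved pairs (x0, x1), (x2, x3), and so on. Some implementations pair the dimensions differently, and in practice the rotation is applied to the query and key vectors inside attention; the vectors below are made up.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... each at its own frequency."""
    dim = len(vec)
    assert dim % 2 == 0, "needs an even number of dimensions"
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)      # frequency for this pair
        angle = position * theta
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [1.0, 0.5, -0.3, 0.8]    # hypothetical query vector
k = [0.2, 0.9, 0.4, -0.1]    # hypothetical key vector

# Same offset (4 positions apart) at two different absolute positions.
score_a = dot(rope(q, 3), rope(k, 7))
score_b = dot(rope(q, 103), rope(k, 107))
print(score_a, score_b)
```

The two scores match up to rounding, so the attention score depends on the offset between the tokens rather than on where they sit in the sequence.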
If you're interested in:

- Python 🐍
- ML/AI Engineering ⚙️

1. Find me → @akshay_pachaar ✔️
2. Subscribe to our Newsletter → @DailyDoseOfDS_ and get a FREE eBook, covering 150+ core DS/ML lessons!

Cheers!