Visualize Thread by @akshay_pachaar

Akshay 🚀

@akshay_pachaar

How LLMs work, clearly explained:

Before diving into LLMs, we must understand conditional probability.

Let's consider a population of 14 individuals:

- Some of them like Tennis 🎾
- Some like Football ⚽️
- A few like both 🎾 ⚽️
- And a few like neither

Here's how it looks 👇

So what is conditional probability ⁉️

It's a measure of the probability of an event given that another event has occurred.

If the events are A and B, we denote this as P(A|B).

This reads as "probability of A given B"

Check this illustration 👇
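To make the definition concrete, here is a minimal sketch using the 14-person population from earlier. The thread doesn't give the exact split, so the counts below are invented purely for illustration. It applies P(A|B) = P(A and B) / P(B), with A = "likes Tennis" and B = "likes Football".

```python
from fractions import Fraction

# Hypothetical split of the 14 people (invented for illustration).
tennis_only = 4
football_only = 5
both = 3
neither = 2
total = tennis_only + football_only + both + neither  # 14

# P(A|B) = P(A and B) / P(B), with A = likes Tennis, B = likes Football.
p_football = Fraction(football_only + both, total)
p_both = Fraction(both, total)
p_tennis_given_football = p_both / p_football

print(p_tennis_given_football)  # 3/8
```

With these assumed counts, 8 people like Football and 3 of them also like Tennis, so P(Tennis|Football) = 3/8.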

For instance, if we're predicting whether it will rain today (event A), knowing that it's cloudy (event B) might impact our prediction.

As it's more likely to rain when it's cloudy, we'd say the conditional probability P(A|B) is high.

That's conditional probability for you! 🎉

Now, how does this apply to LLMs like GPT-4❓

These models are tasked with predicting the next word in a sequence.

This is a question of conditional probability: given the words that have come before, what is the most likely next word?

To predict the next word, the model calculates the conditional probability for each possible next word, given the previous words (context).

The word with the highest conditional probability is chosen as the prediction.
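Here's a rough sketch of that idea with a toy "model" that only looks at one previous word and estimates probabilities by counting. The corpus and the `next_word_probs` helper are made up for illustration. A real LLM uses a neural network and a much longer context, but the prediction step is the same in spirit: score every candidate, pick the most probable.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real LLM conditions on far more context than one word.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows each previous word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Return P(next word | previous word) estimated from the counts."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

probs = next_word_probs("the")
prediction = max(probs, key=probs.get)  # word with the highest probability
print(probs)
print(prediction)
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so "cat" gets probability 0.5 and is chosen as the prediction.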

The LLM learns a high-dimensional probability distribution over sequences of words.

And the parameters of this distribution are the trained weights!

The training, or rather pre-training**, is supervised (more precisely, self-supervised: each next word in the training text serves as the label).

I'll talk about the different training steps next time!**

Check this 👇
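One common way to write the distribution over word sequences mentioned above is the chain rule of probability. The notation is assumed here and not taken from the thread: w_t is the t-th word and θ stands for the trained weights.

```latex
P_\theta(w_1, \dots, w_n) = \prod_{t=1}^{n} P_\theta(w_t \mid w_1, \dots, w_{t-1})
```

Each factor is exactly the "next word given the previous words" conditional probability, which is why next-word prediction is enough to define a distribution over whole sequences.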

But there's a problem❗️

If we always pick the word with the highest probability, we end up with repetitive outputs, making LLMs almost useless and stifling their creativity.

This is where temperature comes into the picture.

Check this before we understand more about it...👇

However, a high temperature value produces gibberish.

Let's understand what's going on...👇

So, instead of selecting the best token (for simplicity, let's think of tokens as words), LLMs "sample" the prediction.

So even if “Token 1” has the highest score, it may not be chosen since we are sampling.
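A small sketch of the difference between picking the best token and sampling. The token names and probabilities are hypothetical and don't come from a real model.

```python
import random

# Hypothetical next-token probabilities (not from a real model).
tokens = ["Token 1", "Token 2", "Token 3"]
probs = [0.6, 0.3, 0.1]

# Greedy: always take the highest-probability token.
greedy = tokens[probs.index(max(probs))]

# Sampling: draw tokens in proportion to their probabilities.
rng = random.Random(0)  # seeded so the run is repeatable
samples = rng.choices(tokens, weights=probs, k=1000)

print(greedy)
print({t: samples.count(t) for t in tokens})
```

Greedy selection returns "Token 1" every time, while the sampled draws include the other tokens too, roughly in proportion to their probabilities.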

Now, temperature introduces the following tweak in the softmax function, which, in turn, influences the sampling process:
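The tweak is usually written as follows. The symbols are assumed notation, not taken from the thread: z_i is the raw score (logit) of token i and T > 0 is the temperature.

```latex
P(i) = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)}
```

With T = 1 this is the ordinary softmax. Dividing by a small T widens the gaps between scores, and dividing by a large T shrinks them.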

Let's take a code example!

At low temperature, probabilities concentrate around the most likely token, resulting in nearly greedy generation.

At high temperature, probabilities become more uniform, producing much more random outputs.

Check this out👇
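A minimal sketch of temperature-scaled softmax in plain Python. The `softmax_with_temperature` helper and the logits are made up for illustration and are not the thread's original code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores (logits) to probabilities, scaled by temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens

for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At T = 0.1 nearly all of the probability lands on the first token, while at T = 10 the three probabilities end up close to one another.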

That's a wrap!

Hopefully, this guide has demystified some of the magic behind LLMs.

And, if you enjoyed this breakdown:

Find me → @akshay_pachaar ✔️
For more insights and tutorials on AI and Machine Learning.
