@karpathy: New (2h13m 😅) lecture: "Let's ...

1

New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"

Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.

2

We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.

3

Also, releasing new repository on GitHub: minbpe
Minimal, clean, code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
github.com/karpathy/minbpe

In the video we essentially build minbpe from scratch.
Don't miss the exercise.md to build your own

4

The actual link to the lecture:
youtube.com/watch?v=zduSFx…

(at the end of the thread here (sorry) otherwise X really really dislikes external links and would bury this post. I could eventually upload here too, for now X is missing a lot of very nice features, chapters especially)

@karpathy: New (2h13m 😅) lecture: "Let's ...

Actions

What You Can Do