@karpathy: New (2h13m 😅) lecture: "Let's ...
@karpathy
67 views
Feb 20, 2024
1
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"
Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
3
Also, releasing new repository on GitHub: minbpe
Minimal, clean, code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
github.com/karpathy/minbpe
In the video we essentially build minbpe from scratch.
Don't miss the exercise.md to build your own
Minimal, clean, code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
github.com/karpathy/minbpe
In the video we essentially build minbpe from scratch.
Don't miss the exercise.md to build your own
4
The actual link to the lecture:
youtube.com/watch?v=zduSFx…
(at the end of the thread here (sorry) otherwise X really really dislikes external links and would bury this post. I could eventually upload here too, for now X is missing a lot of very nice features, chapters especially)
youtube.com/watch?v=zduSFx…
(at the end of the thread here (sorry) otherwise X really really dislikes external links and would bury this post. I could eventually upload here too, for now X is missing a lot of very nice features, chapters especially)

