

@karpathy
Feb 20, 2024
1
New (2h13m 😅) lecture: "Let's build the GPT Tokenizer"

Tokenizers are a completely separate stage of the LLM pipeline: they have their own training set, training algorithm (Byte Pair Encoding), and after training implement two functions: encode() from strings to tokens, and decode() back from tokens to strings. In this lecture we build from scratch the Tokenizer used in the GPT series from OpenAI.
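To make the train / encode() / decode() split concrete, here is a minimal byte-level Byte Pair Encoding sketch. It is an illustration only, not the code from the lecture or from minbpe: the function names (`train`, `encode`, `decode`, `merge`, `pair_counts`) are made up for this example, and it leaves out the regex pre-splitting and special tokens that the GPT tokenizers use.

```python
from collections import Counter


def pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges from the tokenizer's own training text."""
    ids = list(text.encode("utf-8"))  # start from raw bytes: ids 0..255
    merges = {}  # (id, id) -> new id, kept in the order the merges were learned
    for k in range(num_merges):
        counts = pair_counts(ids)
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = max(counts, key=counts.get)  # most frequent adjacent pair
        new_id = 256 + k
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def encode(text, merges):
    """String -> token ids, replaying the learned merges in training order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Token ids -> string, by expanding each id back into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges = train("aaabdaaabac", num_merges=3)
ids = encode("aaabdaaabac", merges)
print(ids, decode(ids, merges))
```

Note that the merges are learned from the tokenizer's own training text, which is why the tokenizer can be treated as a separate stage from the LLM itself.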
2
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely.
3
Also, releasing a new repository on GitHub: minbpe
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
github.com/karpathy/minbpe

In the video we essentially build minbpe from scratch.
Don't miss the exercise.md to build your own
4
The actual link to the lecture:
youtube.com/watch?v=zduSFx…

(It's at the end of the thread here (sorry); otherwise X really, really dislikes external links and would bury this post. I could eventually upload here too, but for now X is missing a lot of very nice features, chapters especially.)