How To Train Your Own ChatGPT from Scratch (Complete Builder's Guide)
You don't need billions to train the next ChatGPT
All you need is $100 and Andrej Karpathy's nanochat
I used it for the last week. Here is what I found.
Disclaimer: the cost of compute is expected to go down over the next decade. My headline is hyperbolic, but you can get a usable model for less than $100. This is a build guide, not a bold claim. I agree that the capex to train frontier AI models is insanely high right now, but I expect a time will come when we can train awesome frontier models at really economical prices.
I spent ~$100 and one weekend training a ChatGPT-style model from scratch on my own notes, writing, and exported AI chats.
It now answers in my voice and recalls my own ideas, with no API and no rented brain.
This guide is the version I wish I'd had: every command, every code change, and plain-English explanations of the jargon so you don't get stuck.
If you've never trained a model before, you're the target reader. Take it one step at a time.
Read this first (what you're signing up for)
What you'll end up with: a small GPT, roughly as capable as OpenAI's original GPT-2 (2019), fine-tuned on your own data so it sounds like you and knows your stuff. You can chat with it in a ChatGPT-style web page.
Honest expectations: this is not GPT-4. It's "a kindergartener with your memories": charming, useful for recall and drafting, and confidently wrong sometimes. The magic isn't raw IQ; it's that it's yours, it's private, and you understand every part of it.
What it costs: about $48–$100 in rented GPU time for the full run. You can learn the entire pipeline for ~$0 first (more on that below).
Skills you need:
Time: budget a weekend. The actual training is ~3 hours; the rest is setup and preparing your data.
The 60-second mental model
Training a chatbot happens in two big phases. Keep these straight and everything else makes sense.
Analogy: Pretraining is sending the model to school to learn language and the world. Fine-tuning is the apprenticeship where it learns a specific job: in our case, being your second brain.
There's an optional third phase, RL (reinforcement learning), that sharpens specific skills like math. We'll skip it for the main build and mention it at the end.
Words you'll keep seeing (quick hits)
You don't need to memorize these (there's a full glossary at the bottom), but here are the big ones up front:
The tool: nanochat
We're using nanochat, an open-source project by Andrej Karpathy (the same person behind nanoGPT and a lot of foundational LLM teaching).
It's ~8,000 lines of clean, readable PyTorch that covers the entire pipeline: tokenizer, pretraining, fine-tuning, RL, inference, and a web chat UI, designed to run start-to-finish on one machine.
Why it's perfect for a first-time builder:
In plain English: "scaling laws" are well-measured rules of thumb that say, for a model of a given size, how much data and how big a batch you need for best results. nanochat bakes these in so you don't have to know them.
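To make the rule of thumb concrete, here is a minimal sketch of the data-budget arithmetic that a flag like --target-param-data-ratio (used in Step 6) suggests. It assumes the budget is simply ratio × parameter count; the one-billion-parameter figure is a made-up round number for illustration, not nanochat's actual model size.

```python
def target_tokens(num_params: int, tokens_per_param: float) -> int:
    """Training-token budget implied by a tokens-per-parameter ratio."""
    return int(num_params * tokens_per_param)

# Hypothetical model size (a round number for illustration only).
params = 1_000_000_000

# Ratio 8 is the value passed in Step 6; 20 is shown only for comparison.
print(f"{target_tokens(params, 8):,} tokens at ratio 8")
print(f"{target_tokens(params, 20):,} tokens at ratio 20")
```

A lower ratio means fewer training tokens for the same model, which is what "lightly undertrained" refers to later in this guide.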
Before you spend a dollar: do a free dry run
I strongly recommend this. Run the whole pipeline once on a tiny model on your own laptop before renting any GPUs. You'll learn the commands and catch mistakes for free.
nanochat includes a script for exactly this. On a Mac (Apple Silicon) or any computer:
git clone https://github.com/karpathy/nanochat.git
cd nanochat
bash runs/runcpu.sh
What this does:
Why bother: the #1 way beginners waste money is renting an expensive GPU and then fumbling setup for an hour while the meter runs. Do the fumbling for free first.
What you need before the real run
A short checklist:
What's an "8×H100 node"? A single rented computer ("node") that has eight H100 GPUs in it. nanochat is tuned to split the work across all eight, which is why training is only ~3 hours instead of ~24.
Step 1: Rent and connect to a GPU box
This is the part that intimidates newcomers. It's just a few clicks and one command.
1a. Launch the instance.
What's SSH? A way to securely control another computer from your terminal. You "SSH in" to the rented box and type commands as if you were sitting at it.
1b. Connect from your laptop's terminal:
ssh ubuntu@209.20.xxx.xxx # use your instance's username + IP
If it connects, you're now typing commands on the rented machine. Everything from here happens on the box.
1c. Check the GPUs are there:
nvidia-smi
You should see a table listing 8 GPUs. If you see eight H100s, you're good.
⚠️ The money rule: this box bills every minute it's on (~$24/hour). When you're done, terminate it (not just "stop") or you'll keep paying. More on shutdown in Step 12.
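A quick sketch of the money rule as arithmetic, assuming the flat ~$24/hour rate quoted above (your provider's rate and billing granularity may differ):

```python
def rental_cost(hours: float, rate_per_hour: float = 24.0) -> float:
    """Dollar cost of keeping the box on for `hours` at a flat hourly rate."""
    return round(hours * rate_per_hour, 2)

print(rental_cost(3))    # the ~3-hour pretraining run on its own
print(rental_cost(1))    # an hour spent fumbling with setup
print(rental_cost(24))   # a box left running for a full day
```

The last line is the one to remember: an idle box costs exactly as much as a busy one.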
Step 2: Install nanochat
On the box, get the code and install dependencies.
# get the code
git clone https://github.com/karpathy/nanochat.git
cd nanochat
# install 'uv', a fast Python package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# create an isolated Python environment and install everything
uv venv
uv sync --extra gpu
# 'activate' the environment so 'python' uses it
source .venv/bin/activate
What just happened, in plain English:
--extra gpu vs --extra cpu: tells the installer to grab the GPU (CUDA) build of PyTorch. On the H100 box you want gpu. On a Mac dry-run you'd use cpu.
Step 3: Know where things live
nanochat writes everything to one folder: ~/.cache/nanochat. Knowing this saves confusion later.
Inside it, over the course of the run, you'll get:
Tip: d26 is just the folder name nanochat uses for a depth-26 model. If you train a different depth, the folder name changes to match.
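The naming rule is easy to sketch. This helper is hypothetical (it is not part of nanochat); it only assumes the d<depth> convention described above and the chatsft_checkpoints folder that Step 12 copies from.

```python
from pathlib import Path

# nanochat's base folder, as stated at the top of this step.
BASE_DIR = Path.home() / ".cache" / "nanochat"

def sft_checkpoint_dir(depth: int, base_dir: Path = BASE_DIR) -> Path:
    """Folder where fine-tuned checkpoints for a given depth would live."""
    return base_dir / "chatsft_checkpoints" / f"d{depth}"

print(sft_checkpoint_dir(26))  # the main build in this guide
print(sft_checkpoint_dir(12))  # a smaller debug model
```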
Step 4: Download the training data
The base model learns from a big public text dataset called ClimbMix (a 400-billion-token web corpus, cleaned and shuffled). You download it in shards (chunks).
# download 8 shards now (enough to train the tokenizer)
python -m nanochat.dataset -n 8
# download ~240 shards in the background for pretraining
python -m nanochat.dataset -n 240 &
Explanation:
What's a "shard"? Just a numbered file holding a slice of the dataset. Splitting a huge dataset into shards lets you download and stream it piece by piece instead of all at once.
Disk space: 240 shards ≈ 24GB, plus checkpoints. An 8×H100 box normally has plenty of disk, but if you ever see "no space left," download fewer shards.
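If you want to check disk space before downloading, here is a rough sketch. The per-shard size is derived from the "240 shards ≈ 24GB" figure above, so it is an approximation, and the 50 GB reserve for checkpoints is a guess rather than a measured number.

```python
import shutil

MB_PER_SHARD = 100  # approximation: 24 GB / 240 shards

def shards_size_gb(num_shards: int) -> float:
    """Approximate download size for a number of shards."""
    return num_shards * MB_PER_SHARD / 1000

def max_shards_for(free_gb: float, reserve_gb: float = 50.0) -> int:
    """How many shards fit after setting aside `reserve_gb` for checkpoints."""
    usable_mb = max(0.0, (free_gb - reserve_gb) * 1000)
    return int(usable_mb // MB_PER_SHARD)

free_gb = shutil.disk_usage("/").free / 1e9
print(f"free: {free_gb:.0f} GB; 240 shards need about {shards_size_gb(240):.0f} GB")
print(f"shards that fit with the reserve: {max_shards_for(free_gb)}")
```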
Step 5: Train the tokenizer
Before the model can read text, you need a tokenizer to chop text into tokens.
python -m scripts.tok_train # trains it (vocab of 32,768 tokens)
python -m scripts.tok_eval # shows how efficiently it compresses text
What's happening:
Why train your own tokenizer? A smaller, custom vocabulary is more efficient for a small model than reusing a giant one built for huge models.
Special tokens: the tokenizer also reserves a few special markers like <|user_start|> and <|assistant_start|>. These are how the model will later tell who's talking in a conversation. You don't touch these, just know they exist.
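"How efficiently it compresses text" can be made concrete as bytes per token: the more bytes each token covers, the fewer tokens the model has to process. The sketch below shows one common way to measure that; the two stand-in tokenizers (one token per character, one per word) are toys for illustration, not nanochat's tokenizer, and the metric is not necessarily the exact one tok_eval reports.

```python
from typing import Callable, List

def bytes_per_token(text: str, tokenize: Callable[[str], List]) -> float:
    """Average number of UTF-8 bytes covered by each token (higher = better compression)."""
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer returned no tokens")
    return len(text.encode("utf-8")) / len(tokens)

sample = "good tools have strong opinions"

# Toy tokenizers: list() splits into characters, str.split into words.
print(bytes_per_token(sample, list))
print(bytes_per_token(sample, str.split))
```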
Step 6: Pretrain the base model (the expensive step)
This is the big one: ~3 hours, most of your cost. You're creating the base model, the general "internet brain" you'll later personalize.
6a. Start it in a "screen" session first
Because it runs for hours, run it inside screen so it survives if your connection drops.
screen -S train # opens a persistent session named "train"
(If your SSH disconnects, reconnect and type screen -r train to reattach. To leave it running and detach: press Ctrl-A then D.)
Why screen? If you just ran the command normally and your laptop slept or WiFi blipped, the training would die. screen keeps it alive on the server regardless.
6b. Run pretraining
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=26 \
    --target-param-data-ratio=8 \
    --device-batch-size=16 \
    --fp8 \
    --run=base_d26
Let's decode that command piece by piece, because beginners get tripped up here:
Single GPU instead of 8? Run python -m scripts.base_train --depth=26 --device-batch-size=16 (no torchrun, no --). nanochat automatically compensates with "gradient accumulation." It produces the same result but takes ~8× longer.
What's "OOM"? "Out of memory": the GPU ran out of room. The fix is almost always lowering --device-batch-size. The math still works out because nanochat keeps the effective batch size constant behind the scenes.
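Here is a minimal sketch of the bookkeeping behind "keeps the effective batch size constant". The total batch size and sequence length below are illustrative assumptions, not values read from nanochat's config; the point is only that fewer GPUs or a smaller per-device batch means more accumulation steps per optimizer update.

```python
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, num_gpus: int) -> int:
    """Micro-batches to accumulate so every optimizer step sees the same token count."""
    tokens_per_micro_step = device_batch_size * seq_len * num_gpus
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of the per-micro-step tokens")
    return total_batch_tokens // tokens_per_micro_step

TOTAL = 524_288  # hypothetical target batch size, in tokens
SEQ = 2_048      # hypothetical sequence length

# (gpus, device batch size): the 8-GPU run, a single GPU, a single GPU after an OOM fix
for gpus, dbs in [(8, 16), (1, 16), (1, 4)]:
    print(f"{gpus} GPU(s), device batch {dbs}: "
          f"{grad_accum_steps(TOTAL, dbs, SEQ, gpus)} accumulation steps")
```

Each accumulation step is an extra forward and backward pass, which is why the single-GPU route is slower but lands on the same effective batch.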
6c. What you'll see while it runs
The screen fills with lines like step 00500/16704 | loss: 2.81 | .... Reading them:
If you set up Weights & Biases, open its web page to watch two key graphs:
6d. When it finishes, grade it
torchrun --standalone --nproc_per_node=8 -m scripts.base_eval -- --device-batch-size=16
This prints a final CORE score. If it's above 0.256525, your base brain is GPT-2-grade.
What if loss goes up or becomes "NaN"? "NaN" means the math blew up (rare). Usually it's a too-high setting; for a first run, just use the exact command above, which is known-good. If it happens, restart the step.
Step 7: Prepare YOUR data (the part that makes it yours)
This is 90% of the real work. The model is only as "you" as the data you feed it. The base model gives it language; your data gives it you.
You're going to convert your personal stuff into the format nanochat fine-tunes on: conversations, saved as a .jsonl file (one conversation per line).
I used three sources, each teaching a different thing:
Why all three? Each fixes a different failure. Skip the chats and it doesn't know how you talk; skip the notes and it doesn't know what you know; skip the identity and it has no sense of self.
The exact format nanochat wants
nanochat has a built-in loader called CustomJSON. Its rule: a .jsonl file where every line is one conversation, written as a JSON list of messages that alternate user, assistant, user, assistant…
One line looks like this:
[{"role":"user","content":"What's my thesis on good software?"},{"role":"assistant","content":"My core thesis is that good tools have strong opinions and few settings..."}]
What's JSONL? "JSON Lines", a text file where each line is its own valid JSON value (here, a list of messages). Easy to generate and append to. The roles must start with user and strictly alternate.
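Before copying a file to the box, it is worth checking it against that rule locally. The sketch below is a standalone validator written for this guide, not nanochat's own loader code; the "at least 2 messages" and "content must be a string" checks are assumptions that match the converter script in 7a.

```python
import json

def check_conversation(messages) -> None:
    """Raise ValueError unless messages alternate user/assistant, starting with user."""
    if not isinstance(messages, list) or len(messages) < 2:
        raise ValueError("expected a list of at least 2 messages")
    for i, msg in enumerate(messages):
        expected = "user" if i % 2 == 0 else "assistant"
        if not isinstance(msg, dict) or msg.get("role") != expected:
            raise ValueError(f"message {i}: expected role {expected!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError(f"message {i}: content must be a string")

def validate_jsonl(path: str) -> int:
    """Validate every non-empty line of a .jsonl file; return the conversation count."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            try:
                check_conversation(json.loads(line))
            except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
                raise ValueError(f"{path}:{lineno}: {err}") from err
            count += 1
    return count

# Usage (run where the file lives):
#   print(validate_jsonl("mybrain_chats.jsonl"), "conversations look valid")
```

Catching a malformed line here costs nothing; catching it when the fine-tune crashes costs GPU minutes.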
Data hygiene (do this: it matters)
Small models memorize at this scale, so clean your data first:
Getting your data onto the box
Build the .jsonl files (next sub-steps), then copy them from your laptop to the rented box with scp:
# run this on your LAPTOP, not the box
scp mybrain_chats.jsonl ubuntu@209.20.xxx.xxx:~/.cache/nanochat/
Or run the converter scripts directly on the box if your raw data is there. Either works.
7a. ChatGPT/Claude export → JSONL
Your AI history is already conversations, so this is the easiest source. Export your data from ChatGPT (Settings → Data Controls → Export); you'll get a conversations.json. Then run this script (tools/convert_chat_export.py):
import json, sys

# ChatGPT's export stores each conversation as a tree of message "nodes".
# This flattens each into a clean alternating [user, assistant, ...] list.
def flatten_chatgpt(path):
    with open(path, encoding="utf-8") as f:
        convos = json.load(f)
    for convo in convos:
        msgs = []
        nodes = [n for n in convo["mapping"].values() if n.get("message")]
        nodes.sort(key=lambda n: n["message"].get("create_time") or 0)  # chronological
        for n in nodes:
            m = n["message"]
            role = m["author"]["role"]
            parts = (m.get("content") or {}).get("parts") or []
            text = "".join(p for p in parts if isinstance(p, str)).strip()
            if not text or role == "system":
                continue
            role = "user" if role == "user" else "assistant"
            # roles must alternate: merge consecutive same-role turns
            if msgs and msgs[-1]["role"] == role:
                msgs[-1]["content"] += "\n\n" + text
            else:
                msgs.append({"role": role, "content": text})
        if len(msgs) >= 2 and msgs[0]["role"] == "user":  # must start with user
            yield msgs

with open("mybrain_chats.jsonl", "w", encoding="utf-8") as out:
    for msgs in flatten_chatgpt(sys.argv[1]):
        out.write(json.dumps(msgs) + "\n")
Run it:
python tools/convert_chat_export.py conversations.json
That gave me ~3,800 real conversations in my own question style.
7b. Your writing → question/answer pairs
Raw notes aren't conversations, so turn each note into Q&A pairs using a bigger LLM, the same "self-instruct" trick nanochat uses internally. This script (tools/notes_to_qa.py) calls an API to do it:
import os, glob, json, requests

PROMPT = """Turn my personal note into training data for MY assistant.
Write 3 natural user questions and the answer I would give, in MY voice.
Keep my opinions and phrasing. Output JSON: a list of
[{{"role":"user","content":...}},{{"role":"assistant","content":...}}] pairs.
NOTE:
{note}"""

def llm(note):
    r = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": "google/gemini-3-flash-preview",
              "messages": [{"role": "user", "content": PROMPT.format(note=note)}],
              "response_format": {"type": "json_object"}},
        timeout=120,
    )
    r.raise_for_status()  # fail loudly on HTTP errors instead of a confusing KeyError
    data = json.loads(r.json()["choices"][0]["message"]["content"])
    # JSON mode may wrap the list in an object; if so, unwrap the first list value
    if isinstance(data, dict):
        data = next((v for v in data.values() if isinstance(v, list)), [])
    return data

with open("mybrain_notes.jsonl", "w", encoding="utf-8") as out:
    for path in glob.glob("obsidian/**/*.md", recursive=True):  # point at your notes folder
        with open(path, encoding="utf-8") as f:
            note = f.read().strip()
        if len(note) < 200:  # skip tiny stub notes
            continue
        for convo in llm(note):  # -> list of [user, assistant] pairs
            out.write(json.dumps(convo) + "\n")
Before running it, set your API key:
export OPENROUTER_API_KEY="your-key-here"
python tools/notes_to_qa.py
The key idea: I let the LLM invent the questions, but the answers stay grounded in my real text. That's what keeps my voice instead of turning everything into generic AI-speak.
No OpenRouter? Swap the URL/model for any LLM API you have (OpenAI, Anthropic, a local model). The script just needs something that turns a note into Q&A JSON.
7c. Your identity β persona data
So the model knows it's your second brain, reuse nanochat's identity generator with your own bio. Edit the knowledge file knowledge/self_knowledge.md:
- You are "AvidBrain", a personal AI trained on Avid's notes and writing.
- You were built by Avid by fine-tuning a nanochat model on personal data.
- You speak in Avid's voice: direct, concrete, slightly informal.
- You know Avid's projects: a newsletter about building with AI, and a few open-source side tools.
- The thesis Avid keeps coming back to: good tools have strong opinions and few settings.
- When unsure, you say so, you do not invent facts about Avid's life.
Then generate the conversations:
python -m dev.gen_synthetic_data --num 1500 --output mybrain_identity.jsonl
This produces 1,500 varied little conversations (different topics, personas, and phrasings) all teaching the model who it is.
Step 8: Inject your data into fine-tuning (the actual code edit)
Here's the single most important modification. Open scripts/chat_sft.py and find the train_tasks list. This list defines the mixture of data the model fine-tunes on.
The trick: the mixture is just a Python list. Listing a dataset multiple times trains on it multiple times ("oversampling"). That's how you make your small personal dataset count against the huge general one.
This is the stock list:
train_tasks = [
    SmolTalk(split="train"), # 460K general conversations
    CustomJSON(filepath=identity_conversations_filepath), # nanochat's own identity
    CustomJSON(filepath=identity_conversations_filepath),
    *[MMLU(subset="all", split="auxiliary_train") for _ in range(args.mmlu_epochs)],
    *[GSM8K(subset="main", split="train") for _ in range(args.gsm8k_epochs)],
    SimpleSpelling(size=200000, split="train"),
    SpellingBee(size=80000, split="train"),
]
Change it to point at your three files and oversample them (this is my version):
base_dir = get_base_dir()
mybrain_chats = os.path.join(base_dir, "mybrain_chats.jsonl")
mybrain_notes = os.path.join(base_dir, "mybrain_notes.jsonl")
mybrain_identity = os.path.join(base_dir, "mybrain_identity.jsonl")
train_tasks = [
    SmolTalk(split="train"), # KEEP this (see below)
    *[CustomJSON(filepath=mybrain_chats) for _ in range(3)], # my chats ×3
    *[CustomJSON(filepath=mybrain_notes) for _ in range(4)], # my notes ×4
    *[CustomJSON(filepath=mybrain_identity) for _ in range(2)], # who it is ×2
    *[MMLU(subset="all", split="auxiliary_train") for _ in range(args.mmlu_epochs)],
    *[GSM8K(subset="main", split="train") for _ in range(args.gsm8k_epochs)],
    SimpleSpelling(size=200000, split="train"),
    SpellingBee(size=80000, split="train"),
]
Three decisions to understand, because they're the heart of personalization:
How I knew the ratio was right: two signals. My held-out questions (does it sound like me or a parrot?) and val_bpb (is it still generalizing?). When recall improved but answers got stiff and repetitive, I dialed the multiplier down.
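To see what a multiplier actually buys you, it helps to compute each dataset's share of the mixture. This is a back-of-the-envelope sketch: the SmolTalk, chat, and identity row counts come from this guide, the notes count is a hypothetical placeholder, and the MMLU, GSM8K, and spelling tasks are left out for simplicity.

```python
def mixture_shares(rows: dict, repeats: dict) -> dict:
    """Fraction of the mixture each dataset takes up after oversampling."""
    effective = {name: n * repeats.get(name, 1) for name, n in rows.items()}
    total = sum(effective.values())
    return {name: n / total for name, n in effective.items()}

rows = {
    "smoltalk": 460_000,  # general conversations
    "my_chats": 3_800,    # from the chat export
    "my_notes": 5_000,    # hypothetical count; depends on your notes
    "identity": 1_500,    # generated persona conversations
}
repeats = {"my_chats": 3, "my_notes": 4, "identity": 2}

for name, share in mixture_shares(rows, repeats).items():
    print(f"{name:10s} {share:6.1%}")
```

Even after oversampling, personal data stays a small slice of the total, which is why the multiplier is worth tuning rather than guessing.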
One more thing nanochat handles for you, worth knowing: during fine-tuning it only trains on the assistant's replies, not the user's questions (this is called loss masking). So the model learns to produce your answers, never to predict the prompts. Exactly what you want, and you don't have to do anything to get it.
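The idea behind loss masking fits in a few lines. This is a simplified illustration, not nanochat's implementation: it uses whitespace splitting as a stand-in tokenizer and ignores the special tokens that mark who is speaking.

```python
def build_loss_mask(messages, tokenize):
    """Return (tokens, mask), where mask is 1 only for assistant tokens."""
    tokens, mask = [], []
    for msg in messages:
        ids = tokenize(msg["content"])
        tokens.extend(ids)
        mask.extend([1 if msg["role"] == "assistant" else 0] * len(ids))
    return tokens, mask

def masked_mean_loss(per_token_loss, mask):
    """Average the loss over unmasked (assistant) positions only."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0

convo = [
    {"role": "user", "content": "what is my thesis"},
    {"role": "assistant", "content": "strong opinions few settings"},
]
tokens, mask = build_loss_mask(convo, str.split)
print(list(zip(tokens, mask)))
```

The user's words are still fed to the model as context; they just contribute nothing to the training signal.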
Step 9: Run the fine-tune
Make sure your three .jsonl files are in the base folder, then fine-tune:
cp mybrain_*.jsonl ~/.cache/nanochat/ # if not already there
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- \
    --device-batch-size=16 \
    --run=mybrain_sft
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft
What to expect:
Then run your own test: ask it those 50 held-out questions and judge whether the answers actually sound like you. That's the real measure of success.
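The held-out questions only work as a test if those conversations never reach the fine-tuning files. Here is a minimal sketch of carving them off before Step 8; the function and file names are hypothetical, and it assumes one conversation per line as in the format described in Step 7.

```python
import random

def split_holdout(in_path, train_path, holdout_path, n_holdout=50, seed=0):
    """Move n_holdout random conversations into a separate file; return (train, held) counts."""
    with open(in_path, encoding="utf-8") as f:
        # Normalize line endings so a missing final newline cannot glue two lines together.
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]
    if len(lines) <= n_holdout:
        raise ValueError("not enough conversations to hold out that many")
    random.Random(seed).shuffle(lines)  # fixed seed keeps the split reproducible
    held, train = lines[:n_holdout], lines[n_holdout:]
    with open(holdout_path, "w", encoding="utf-8") as f:
        f.writelines(held)
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(train)
    return len(train), len(held)

# Usage (hypothetical file names):
#   split_holdout("mybrain_notes_all.jsonl", "mybrain_notes.jsonl", "heldout.jsonl")
```

Only the training file goes into the mixture; the held-out file stays on your laptop as the question list.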
Step 10 (optional, advanced): bake in facts before fine-tuning
Skip this on your first build. Come back if your model gets your voice right but fuzzes facts (misremembers your projects, dates, specifics).
The fix is a short continued-pretraining pass on your raw notes before fine-tuning, so the facts soak into the weights.
⚠️ Don't try to train the base model from scratch on only your data. A few MB of notes against 400B web tokens just gets memorized. Personal data belongs in fine-tuning (and this light pass), not from-scratch pretraining.
Convert your raw notes into nanochat's data format (tools/notes_to_shards.py):
import os, glob, pyarrow as pa, pyarrow.parquet as pq

docs = [open(p, encoding="utf-8").read() for p in glob.glob("obsidian/**/*.md", recursive=True)]
out_dir = os.path.expanduser("~/.cache/nanochat/base_data_mybrain")
os.makedirs(out_dir, exist_ok=True)
table = pa.Table.from_pydict({"text": docs})  # the column MUST be named "text"
pq.write_table(table, os.path.join(out_dir, "shard_00000.parquet"),
               row_group_size=1024, compression="zstd")
Then point the data loader at that folder and resume from the base checkpoint at a low learning rate for a few hundred steps. It nudges your vocabulary and facts into the model before fine-tuning polishes behavior.
This is a power-user squeeze; SFT alone already gets you a convincing second brain.
Step 11: Talk to your second brain
The fun payoff. Two ways.
Command line (quickest test):
python -m scripts.chat_cli -i sft -p "Summarize my argument about RAG versus fine-tuning."
Web UI (the ChatGPT-style experience):
python -m scripts.chat_web
Serving runs on one GPU: that's fine; inference is cheap. You don't need all eight to chat.
Your model can even do little calculations: when it needs arithmetic, nanochat lets it call a built-in calculator tool and feeds the answer back, so it doesn't fumble numbers.
Step 12: Save your model, then SHUT DOWN the box
Your trained model lives in ~/.cache/nanochat/chatsft_checkpoints/d26/ on the rented box. If you terminate the box without copying it, it's gone.
Download it to your laptop first (run on your laptop):
scp -r ubuntu@209.20.xxx.xxx:~/.cache/nanochat/chatsft_checkpoints ./my-second-brain
scp -r ubuntu@209.20.xxx.xxx:~/.cache/nanochat/tokenizer ./my-second-brain-tokenizer
Then terminate the instance in your provider's dashboard. Terminate, don't just "stop." On many providers a "stopped" instance still charges for storage, and a running one you forgot about is how people get surprise $500 bills. When you're truly done, terminate/destroy it. Double-check the dashboard shows no running instances.
The calls I made, and why (quick reference)
| Decision | Why |
|---|---|
| Train from scratch, not LoRA a 7B model | Ownership, transparency, a small private model, and no built-in "assistant personality" fighting my voice |
| depth-26, lightly undertrained | Smallest clearly GPT-2-grade size; undertraining a bigger model beats overtraining a smaller one; even depth keeps the math clean |
| Debug on a tiny d12/d6 first | A 5-minute run catches broken data before a 3-hour run wastes $40 |
| Keep SmolTalk in the mixture | Without it the model forgets how to hold a normal conversation |
| Oversample my data ×3–4 | Enough to matter against 460K rows; ×6 memorized, ×2 stayed generic |
| Let an LLM write questions, keep my real answers | Preserves my voice instead of laundering it into generic-AI-speak |
| Hold out 50 of my own Q&As | The only test that proves it learned *me*, not just fluent English |
Troubleshooting (the stuff that actually goes wrong)
Glossary (plain English)
The whole thing, copy-paste
# === ON THE RENTED 8xH100 BOX ===
# 0. setup
git clone https://github.com/karpathy/nanochat.git && cd nanochat
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv && uv sync --extra gpu && source .venv/bin/activate
# 1. data + tokenizer
python -m nanochat.dataset -n 8
python -m nanochat.dataset -n 240 &
python -m scripts.tok_train && python -m scripts.tok_eval
# 2. base model (~3 hrs): run inside: screen -S train
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
--depth=26 --target-param-data-ratio=8 --device-batch-size=16 --fp8 --run=base_d26
torchrun --standalone --nproc_per_node=8 -m scripts.base_eval -- --device-batch-size=16
# 3. build your data (your scripts) and copy into the base folder
python tools/convert_chat_export.py conversations.json # -> mybrain_chats.jsonl
python tools/notes_to_qa.py # -> mybrain_notes.jsonl
python -m dev.gen_synthetic_data --num 1500 --output mybrain_identity.jsonl
cp mybrain_*.jsonl ~/.cache/nanochat/
# 4. fine-tune on you (after editing scripts/chat_sft.py train_tasks)
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- --device-batch-size=16 --run=mybrain_sft
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft
# 5. talk to it
python -m scripts.chat_web # open http://<YOUR_PUBLIC_IP>:8000/
# === ON YOUR LAPTOP: save the model, then TERMINATE the box ===
# scp -r ubuntu@<IP>:~/.cache/nanochat/chatsft_checkpoints ./my-second-brain
Built on karpathy/nanochat. The ~8,000 lines that make the model are his. The data that makes it you is the part you write.
Stuck on a step? The nanochat Discussions and its DeepWiki are good places to ask questions about the repo.
This article was written from my own notes and edited by Claude Opus 4.8.
github.com/karpathy/nanocβ¦
This is the GitHub repo.
