How To Train Your Own ChatGPT from Scratch (Complete Builder's Guide)

You don't need billions to train the next ChatGPT

All you need is $100 and Andrej Karpathy's nanochat

I used it for the last week. Here is what I found.

Disclaimer: the cost of compute is expected to keep falling over the next decade. My headline is hyperbolic, but you really can get a usable model for under $100. This is a build, not a bold statement. I agree that the capex to train frontier AI models is insanely high right now, but I expect a time will come when we can train awesome frontier models at really economical prices.

I spent ~$100 and one weekend training a ChatGPT-style model from scratch on my own notes, writing, and exported AI chats.

It now answers in my voice and recalls my own ideas, with no API and no rented brain.

This guide is the version I wish I'd had: every command, every code change, and plain-English explanations of the jargon so you don't get stuck.

If you've never trained a model before, you're the target reader. Take it one step at a time.

Read this first (what you're signing up for)

What you'll end up with: a small GPT, roughly as capable as OpenAI's original GPT-2 (2019), fine-tuned on your own data so it sounds like you and knows your stuff. You can chat with it in a ChatGPT-style web page.

Honest expectations: this is not GPT-4. It's "a kindergartener with your memories": charming, useful for recall and drafting, and confidently wrong sometimes. The magic isn't raw IQ; it's that it's yours, it's private, and you understand every part of it.

What it costs: about $48–$100 in rented GPU time for the full run. You can learn the entire pipeline for ~$0 first (more on that below).

Skills you need:

Comfort typing commands into a terminal (copy-paste is fine).

Basic Python literacy helps for the data step, but I'll give you working scripts.

No machine-learning background required. I'll explain the concepts as we go.

Time: budget a weekend. The actual training is ~3 hours; the rest is setup and preparing your data.

The 60-second mental model

Training a chatbot happens in two big phases. Keep these straight and everything else makes sense.

Pretraining → produces the base model. The model reads a huge pile of internet text and learns one skill: predict the next word. This is where it learns grammar, facts, and reasoning. It's expensive (this is the ~3 hours of GPU time). The result talks like the internet: it can complete text but can't chat.

Fine-tuning (SFT) → produces the chat model. You show the base model thousands of example conversations so it learns to answer like an assistant. This is cheap and fast (minutes). This is where your personal data goes in.

🔑 Analogy: Pretraining is sending the model to school to learn language and the world. Fine-tuning is the apprenticeship where it learns a specific job, in our case, being your second brain.

There's an optional third phase, RL (reinforcement learning), that sharpens specific skills like math. We'll skip it for the main build and mention it at the end.

Words you'll keep seeing (quick hits)

You don't need to memorize these, there's a full glossary at the bottom, but here are the big ones up front:

GPU: the specialized chip that does the math. Training needs powerful ones (H100s). You rent them by the hour.

Token: a chunk of text (a word or word-piece). Models read and write tokens, not letters.

Tokenizer: the tool that chops text into tokens. You train one first.

Parameters: the model's "knobs" (numbers it learns). More parameters = bigger, smarter, slower model.

Depth: nanochat's single dial for model size: the number of layers. Bigger depth = bigger model.

Batch / batch size: how many examples the model looks at before updating itself.

Checkpoint: a saved copy of the model's weights on disk.

Loss: a number measuring how wrong the model is. Training drives it down.

The tool: nanochat

We're using nanochat, an open-source project by Andrej Karpathy (the same person behind nanoGPT and a lot of foundational LLM teaching).

It's ~8,000 lines of clean, readable PyTorch that covers the entire pipeline: tokenizer, pretraining, fine-tuning, RL, inference, and a web chat UI, designed to run start-to-finish on one machine.

Why it's perfect for a first-time builder:

One dial. You set --depth (how many layers) and it automatically computes every other setting (width, learning rate, batch size, training length) using scaling laws. You don't tune dozens of knobs.

Readable. No giant configuration system or framework magic. Each step is a script you can open and understand.

Cheap. GPT-2-grade capability for ~$100 instead of the ~$43,000 it cost in 2019.

🔑 In plain English: "scaling laws" are well-measured rules of thumb that say, for a model of a given size, how much data and how big a batch you need for best results. nanochat bakes these in so you don't have to know them.

Before you spend a dollar: do a free dry run

I strongly recommend this. Run the whole pipeline once on a tiny model on your own laptop before renting any GPUs. You'll learn the commands and catch mistakes for free.

nanochat includes a script for exactly this. On a Mac (Apple Silicon) or any computer:

git clone https://github.com/karpathy/nanochat.git
cd nanochat
bash runs/runcpu.sh

What this does:

Installs everything, downloads a little data, trains a tokenizer, then trains a depth-6 toy model (tiny) and fine-tunes it.

Takes ~30–60 minutes on a laptop. The model will be dumb: that's expected and fine.

The point is to see every stage run, end to end, and get comfortable.

🔑 Why bother: the #1 way beginners waste money is renting an expensive GPU and then fumbling setup for an hour while the meter runs. Do the fumbling for free first.

What you need before the real run

A short checklist:

A GPU cloud account. I used Lambda; RunPod, Vast.ai, Paperspace, AWS, etc. also work. You want an 8×H100 node (a single machine with eight H100 GPUs).

~$100 of credit on that account.

Your data, gathered: exported AI chats, notes, anything you want it to learn (we'll prep this in Step 7).

(Optional) A Weights & Biases account (free) to watch nice training graphs.

(Optional) An OpenRouter API key + a few dollars of credit, used to generate synthetic "identity" data. You can substitute any LLM API.

🔑 What's an "8×H100 node"? A single rented computer ("node") that has eight H100 GPUs in it. nanochat is tuned to split the work across all eight, which is why training is only ~3 hours instead of ~24.

Step 1: Rent and connect to a GPU box

This is the part that intimidates newcomers. It's just a few clicks and one command.

1a. Launch the instance.

Log into your GPU provider and start a new instance. Choose the 8×H100 option (sometimes labeled "8x H100 SXM" or "H100 80GB ×8").

Pick a region close to you. Add your SSH key when prompted (the provider walks you through this; it's how your laptop proves it's you).

Launch it. You'll get a public IP address like 209.20.xxx.xxx.

🔑 What's SSH? A way to securely control another computer from your terminal. You "SSH in" to the rented box and type commands as if you were sitting at it.

1b. Connect from your laptop's terminal:

ssh ubuntu@209.20.xxx.xxx      # use your instance's username + IP

If it connects, you're now typing commands on the rented machine. Everything from here happens on the box.

1c. Check the GPUs are there:

nvidia-smi

You should see a table listing 8 GPUs. If you see eight H100s, you're good.

⚠️ The money rule: this box bills every minute it's on (~$24/hour). When you're done, terminate it (not just "stop") or you'll keep paying. More on shutdown in Step 12.
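To make the money rule concrete, here's a back-of-the-envelope sketch. The ~$24/hour rate and the ~3-hour pretraining run come from this guide; the extra hour of setup and fine-tuning and the "forgotten for two days" case are illustrative assumptions, and estimate_cost is a hypothetical helper, not part of nanochat.

```python
def estimate_cost(hours_on: float, dollars_per_hour: float = 24.0) -> float:
    """Cost of keeping the box on. Billing doesn't care whether it's training or idle."""
    return hours_on * dollars_per_hour

# ~3 hours of pretraining at the rate quoted in this guide
pretraining_only = estimate_cost(3.0)

# assumption: about one more hour for setup, fine-tuning and chatting
with_overhead = estimate_cost(3.0 + 1.0)

# assumption: you forgot to terminate and the box sat idle for two days
forgotten_weekend = estimate_cost(48.0)

print(f"pretraining only:   ${pretraining_only:,.0f}")
print(f"with overhead:      ${with_overhead:,.0f}")
print(f"forgotten for 48 h: ${forgotten_weekend:,.0f}")
```

Plug in your own provider's hourly rate; the point is that idle hours cost exactly as much as training hours.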

Step 2: Install nanochat

On the box, get the code and install dependencies.

# get the code
git clone https://github.com/karpathy/nanochat.git
cd nanochat

# install 'uv', a fast Python package manager
curl -LsSf https://astral.sh/uv/install.sh | sh

# create an isolated Python environment and install everything
uv venv
uv sync --extra gpu

# 'activate' the environment so 'python' uses it
source .venv/bin/activate

What just happened, in plain English:

git clone downloads the nanochat code.

uv is a tool that installs the exact Python libraries nanochat needs (PyTorch, etc.).

uv venv makes a virtual environment, a sandboxed Python just for this project, so it doesn't collide with anything else.

source .venv/bin/activate switches your terminal into that sandbox. You'll re-run this line every time you open a new terminal on the box.

🔑 --extra gpu vs --extra cpu: tells the installer to grab the GPU (CUDA) build of PyTorch. On the H100 box you want gpu. On a Mac dry-run you'd use cpu.

Step 3: Know where things live

nanochat writes everything to one folder: ~/.cache/nanochat. Knowing this saves confusion later.

Inside it, over the course of the run, you'll get:

base_data_climbmix/: the downloaded training text (parquet files).

tokenizer/: your trained tokenizer.

base_checkpoints/d26/: saved base-model weights.

chatsft_checkpoints/d26/: saved chat-model weights (the final thing you talk to).

report/: an auto-generated report.md summarizing the whole run.

🔑 Tip: d26 is just the folder name nanochat uses for a depth-26 model. If you train a different depth, the folder name changes to match.

Step 4: Download the training data

The base model learns from a big public text dataset called ClimbMix (a 400-billion-token web corpus, cleaned and shuffled). You download it in shards (chunks).

# download 8 shards now (enough to train the tokenizer)
python -m nanochat.dataset -n 8

# download ~240 shards in the background for pretraining
python -m nanochat.dataset -n 240 &

Explanation:

-n 8 = download 8 shards. Each shard is ~100MB, ~250 million characters.

The & at the end runs that second command in the background so you can keep working while ~24GB downloads.

~150 shards is enough for GPT-2-grade capability; 240 gives comfortable headroom.

nanochat automatically reserves the last shard as a validation set (data it never trains on, used to measure honest progress).

🔑 What's a "shard"? Just a numbered file holding a slice of the dataset. Splitting a huge dataset into shards lets you download and stream it piece by piece instead of all at once.

🔑 Disk space: 240 shards ≈ ~24GB, plus checkpoints. An 8×H100 box normally has plenty of disk, but if you ever see "no space left," download fewer shards.
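Since the big download runs in the background, it helps to check how far along it is before you start pretraining. Here's a minimal sketch that counts the shards on disk. It assumes finished shards are the *.parquet files inside ~/.cache/nanochat/base_data_climbmix/ and uses the ~100MB-per-shard figure from above; download_progress is a hypothetical helper, not a nanochat command.

```python
from pathlib import Path

SHARD_MB = 100  # approximate size of one shard, per this guide

def download_progress(data_dir, wanted: int) -> dict:
    """Count parquet shards in data_dir and compare against how many you asked for."""
    shards = sorted(Path(data_dir).expanduser().glob("*.parquet"))
    on_disk_gb = sum(p.stat().st_size for p in shards) / 1e9
    return {
        "have": len(shards),
        "wanted": wanted,
        "on_disk_gb": round(on_disk_gb, 2),
        "expected_total_gb": wanted * SHARD_MB / 1000,
    }

if __name__ == "__main__":
    # assumption: this is where the ClimbMix shards land (see Step 3)
    print(download_progress("~/.cache/nanochat/base_data_climbmix", wanted=240))
```

If "have" is still well below "wanted", give the download more time before launching Step 6.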

Step 5: Train the tokenizer

Before the model can read text, you need a tokenizer to chop text into tokens.

python -m scripts.tok_train      # trains it (vocab of 32,768 tokens)
python -m scripts.tok_eval       # shows how efficiently it compresses text

What's happening:

The tokenizer learns the 32,768 most useful text chunks ("tokens") from a couple billion characters of the data you just downloaded. Common words become single tokens; rare words get split into pieces.

This takes a few minutes. It saves to ~/.cache/nanochat/tokenizer/.

tok_eval just reports a "compression ratio" (how many characters per token); higher is more efficient. You don't need to act on it.

🔑 Why train your own tokenizer? A smaller, custom vocabulary is more efficient for a small model than reusing a giant one built for huge models.

🔑 Special tokens: the tokenizer also reserves a few special markers like <|user_start|> and <|assistant_start|>. These are how the model will later tell who's talking in a conversation. You don't touch these, just know they exist.
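If you're curious how a tokenizer "learns" its chunks, here's a toy sketch of byte-pair encoding (BPE), the general technique behind this kind of tokenizer: repeatedly merge the most frequent adjacent pair into a new token. This is only an illustration on characters with a couple of merges; nanochat's real tokenizer works on bytes, is far faster, and learns tens of thousands of merges. All function names here are hypothetical.

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Return the most common adjacent token pair across all sequences (or None)."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    top = counts.most_common(1)
    return top[0][0] if top else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_toy_bpe(words, num_merges):
    """Start from single characters and apply `num_merges` greedy merges."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges.append(pair)
        seqs = [merge_pair(s, pair) for s in seqs]
    return merges, seqs

def chars_per_token(seqs):
    """The 'compression ratio' idea: more characters per token = more efficient."""
    n_chars = sum(len(tok) for seq in seqs for tok in seq)
    n_tokens = sum(len(seq) for seq in seqs)
    return n_chars / n_tokens

merges, seqs = train_toy_bpe(["abab", "abc"], num_merges=1)
print(merges, seqs, chars_per_token(seqs))
```

Each merge makes common chunks cheaper to represent, which is why the characters-per-token ratio goes up as the vocabulary grows.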

Step 6: Pretrain the base model (the expensive step)

This is the big one: ~3 hours, most of your cost. You're creating the base model, the general "internet brain" you'll later personalize.

6a. Start it in a "screen" session first

Because it runs for hours, run it inside screen so it survives if your connection drops.

screen -S train      # opens a persistent session named "train"

(If your SSH disconnects, reconnect and type screen -r train to reattach. To leave it running and detach: press Ctrl-A then D.)

🔑 Why screen? If you just ran the command normally and your laptop slept or WiFi blipped, the training would die. screen keeps it alive on the server regardless.

6b. Run pretraining

torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=26 \
    --target-param-data-ratio=8 \
    --device-batch-size=16 \
    --fp8 \
    --run=base_d26

Let's decode that command piece by piece, because beginners get tripped up here:

torchrun: launches the training across multiple GPUs at once.

--nproc_per_node=8: use 8 GPUs (one process each). On a 1-GPU machine you'd set this to 1, or skip torchrun entirely (see below).

--standalone: this is all on one machine (not a cluster).

-m scripts.base_train: run nanochat's pretraining script.

--: a separator. Everything before it configures torchrun; everything after it configures the training script. Forgetting this -- is a common mistake.

--depth=26: the one dial: a 26-layer model (GPT-2-grade). Everything else (width, learning rate, batch size, training length) is auto-derived from this.

--target-param-data-ratio=8: how long to train. 8 means "train on 8 tokens per parameter," which lightly undertrains a d26 to land right at GPT-2 capability efficiently.

--device-batch-size=16: how many sequences each GPU processes at once. If you run out of GPU memory (an "OOM" error), lower this to 8, then 4, etc. (powers of two).

--fp8: use ultra-fast 8-bit math (H100s support it). If your GPU doesn't, just remove this flag; it'll use slightly slower 16-bit math and still work.

--run=base_d26: a name for this run (used by Weights & Biases if you set it up). You can leave it as is; to get the graphs, run wandb login first.
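To see what "8 tokens per parameter" means in practice, here's a rough sketch of the arithmetic. The parameter count (1 billion) and the tokens-per-step figure (524,288) are hypothetical round numbers chosen for illustration, not values read out of nanochat; the script computes its own and prints them when training starts.

```python
def training_plan(num_params: int, ratio: int = 8, tokens_per_step: int = 524_288):
    """Token budget = ratio x parameters; steps = budget / tokens consumed per step."""
    total_tokens = ratio * num_params
    steps = total_tokens // tokens_per_step
    return total_tokens, steps

# assumption: a model with ~1B parameters (hypothetical, for illustration only)
tokens, steps = training_plan(num_params=1_000_000_000)
print(f"token budget: {tokens / 1e9:.1f}B tokens over about {steps:,} steps")

# doubling the ratio doubles the training length (and the bill)
tokens_16, steps_16 = training_plan(num_params=1_000_000_000, ratio=16)
print(f"at ratio 16:  {tokens_16 / 1e9:.1f}B tokens over about {steps_16:,} steps")
```

This is why the ratio is the "how long to train" flag: the model size is fixed by --depth, so the ratio directly sets how many tokens (and GPU hours) the run consumes.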

🔑 Single GPU instead of 8? Run python -m scripts.base_train --depth=26 --device-batch-size=16 (no torchrun, no --). nanochat automatically compensates with "gradient accumulation." It produces the same result but takes ~8× longer.

🔑 What's "OOM"? "Out of memory", the GPU ran out of room. The fix is almost always lowering --device-batch-size. The math still works out because nanochat keeps the effective batch size constant behind the scenes.
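Here's a small sketch of why lowering --device-batch-size (or using fewer GPUs) doesn't change the result: the total batch is held fixed, and the gap is filled by running more small "micro-steps" before each update. The total batch of 524,288 tokens and the sequence length of 2,048 are assumed values for illustration, and grad_accum_steps is a hypothetical helper that mirrors the idea rather than nanochat's actual code.

```python
def grad_accum_steps(total_batch_tokens: int, device_batch_size: int,
                     seq_len: int, num_gpus: int) -> int:
    """How many micro-steps to accumulate before one optimizer update."""
    tokens_per_micro_step = device_batch_size * seq_len * num_gpus
    if total_batch_tokens % tokens_per_micro_step != 0:
        raise ValueError("total batch must be a multiple of the per-micro-step tokens")
    return total_batch_tokens // tokens_per_micro_step

TOTAL = 524_288   # assumed total batch size in tokens
SEQ_LEN = 2_048   # assumed sequence length

for gpus, dbs in [(8, 16), (8, 8), (8, 4), (1, 16)]:
    steps = grad_accum_steps(TOTAL, dbs, SEQ_LEN, gpus)
    print(f"{gpus} GPU(s), device batch {dbs:>2}: accumulate {steps:>2} micro-steps per update")
```

Halving the device batch doubles the micro-steps, and going from 8 GPUs to 1 multiplies them by 8, which is where the "~8× longer" on a single GPU comes from.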

6c. What you'll see while it runs

The screen fills with lines like step 00500/16704 | loss: 2.81 | .... Reading them:

step X/Y: progress (current step out of total).

loss: how wrong the model is. It should trend down over time (starts ~10, falls toward ~3). Down = learning.

ETA / time: how long until it finishes.
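Why does the loss start around 10? A quick sketch, assuming the printed loss is the usual cross-entropy measured in nats (PyTorch's default): a model that guesses uniformly over a 32,768-token vocabulary has a loss of ln(32768). The helper names are hypothetical.

```python
import math

VOCAB_SIZE = 32_768  # the tokenizer vocabulary from Step 5

def uniform_guess_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a model that thinks every token is equally likely."""
    return math.log(vocab_size)

def perplexity(loss: float) -> float:
    """Roughly: how many tokens the model is still 'choosing between' on average."""
    return math.exp(loss)

print(f"untrained model: loss ~ {uniform_guess_loss(VOCAB_SIZE):.2f}")
print(f"at loss 3.0:     perplexity ~ {perplexity(3.0):.0f} tokens")
```

So falling from ~10 to ~3 means the model went from choosing among tens of thousands of equally likely tokens to, loosely speaking, a couple dozen plausible ones.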

If you set up Weights & Biases, open its web page to watch two key graphs:

val_bpb: validation loss in "bits per byte." Lower = better. The clean progress signal.

core_metric: the CORE score, an industry benchmark. Your goal: beat 0.256525, which is GPT-2's score. Cross that line and you've matched a model that cost $43K in 2019.
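If you want to relate the loss on your screen to val_bpb, here's a rough conversion sketch. It assumes the loss is an average per token in nats, and the 4.8 bytes-per-token figure is a made-up example value (your tok_eval output gives the real one). bits_per_byte is a hypothetical helper, not nanochat's implementation.

```python
import math

def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    """Convert per-token loss in nats to bits per byte of original text."""
    bits_per_token = loss_nats_per_token / math.log(2)   # nats -> bits
    return bits_per_token / bytes_per_token              # spread over the bytes covered

# assumption: each token covers ~4.8 bytes of text on average (illustrative)
print(f"{bits_per_byte(3.0, 4.8):.3f} bits per byte")
```

Dividing by bytes is what makes the number comparable across tokenizers: a tokenizer with bigger chunks has a higher per-token loss but covers more text per token.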

6d. When it finishes, grade it

torchrun --standalone --nproc_per_node=8 -m scripts.base_eval -- --device-batch-size=16

This prints a final CORE score. If it's above 0.256525, your base brain is GPT-2-grade. 🎉

🔑 What if loss goes up or becomes "NaN"? "NaN" means the math blew up (rare). Usually it's a too-high setting; for a first run, just use the exact command above, which is known-good. If it happens, restart the step.

Step 7: Prepare YOUR data (the part that makes it yours)

This is 90% of the real work. The model is only as "you" as the data you feed it. The base model gives it language; your data gives it you.

You're going to convert your personal stuff into the format nanochat fine-tunes on: conversations, saved as a .jsonl file (one conversation per line).

I used three sources, each teaching a different thing:

Exported AI chat history (ChatGPT + Claude): teaches the questions you ask and the answer style you like. Already in conversation form, so it's the easiest.

Your writing (notes, blog posts, journals): teaches your knowledge and voice.

Your bio/identity facts: teaches it who it is and whose brain it carries.

🔑 Why all three? Each fixes a different failure. Skip the chats and it doesn't know how you talk; skip the notes and it doesn't know what you know; skip the identity and it has no sense of self.

The exact format nanochat wants

nanochat has a built-in loader called CustomJSON. Its rule: a .jsonl file where every line is one conversation, written as a JSON list of messages that alternate user, assistant, user, assistant…

One line looks like this:

[{"role":"user","content":"What's my thesis on good software?"},{"role":"assistant","content":"My core thesis is that good tools have strong opinions and few settings..."}]

🔑 What's JSONL? "JSON Lines", a text file where each line is its own valid JSON value (here, a list of messages). Easy to generate and append to. Each conversation must start with a user message, and the roles must strictly alternate.
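Before copying anything to the box, it's worth checking your files against the rules above. Here's a minimal validator sketch. It only checks what this guide states (valid JSON, a list of messages, starts with user, strictly alternating, non-empty string content); nanochat's loader may be stricter or looser, and check_conversation / check_file are hypothetical helpers.

```python
import json

def check_conversation(line: str):
    """Return None if the line looks like a valid conversation, else a short error."""
    try:
        msgs = json.loads(line)
    except json.JSONDecodeError as e:
        return f"not valid JSON: {e.msg}"
    if not isinstance(msgs, list) or len(msgs) < 2:
        return "expected a list of at least 2 messages"
    for i, m in enumerate(msgs):
        expected = "user" if i % 2 == 0 else "assistant"   # user, assistant, user, ...
        if not isinstance(m, dict) or m.get("role") != expected:
            return f"message {i} should have role {expected!r}"
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            return f"message {i} has empty or non-string content"
    return None

def check_file(path: str):
    """Return a list of (line_number, error) for every bad line in a .jsonl file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue                      # ignore blank lines
            err = check_conversation(line)
            if err:
                problems.append((n, err))
    return problems

good = '[{"role":"user","content":"hi"},{"role":"assistant","content":"hello"}]'
bad = '[{"role":"assistant","content":"hello"},{"role":"user","content":"hi"}]'
print(check_conversation(good), "|", check_conversation(bad))
```

Run check_file on each of your three files; an empty list means every line passed these checks.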

Data hygiene (do this: it matters)

Small models memorize at this scale, so clean your data first:

Deduplicate near-identical notes (they cause overfitting).

Scrub secrets and others' private info: passwords, API keys, anything you wouldn't want the model to blurt out. It will memorize them.

Hold out ~50 of your own Q&As as a private test set you don't train on, so you can later check it actually learned you.
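The three hygiene steps can be sketched in a few lines. This is a starting point under clear assumptions, not a complete privacy tool: the dedupe only catches notes that are identical after normalizing case and whitespace (true near-duplicate detection needs something fuzzier), the secret patterns are illustrative examples you should extend for your own data, and all names are hypothetical.

```python
import hashlib
import random
import re

# illustrative patterns only; add ones that match the secrets in YOUR notes
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9_-]{20,}"),                                   # API-key-like strings
    re.compile(r"(?i)(password|passwd|api[_-]?key|token)\s*[:=]\s*\S+"),    # "password: ..." style
]

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe(notes):
    """Keep the first copy of each note, comparing by normalized content."""
    seen, kept = set(), []
    for note in notes:
        key = hashlib.sha256(normalize(note).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(note)
    return kept

def redact(text: str) -> str:
    """Replace anything matching a secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def split_holdout(items, n_holdout: int = 50, seed: int = 0):
    """Shuffle reproducibly, then set aside n_holdout items you never train on."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return items[n_holdout:], items[:n_holdout]

notes = ["My thesis on tools.", "my  thesis on tools.", "login password: hunter2"]
clean = [redact(n) for n in dedupe(notes)]
print(clean)
```

Still skim the output by eye afterwards; regexes miss things, and the model will memorize whatever slips through.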

Getting your data onto the box

Build the .jsonl files (next sub-steps), then copy them from your laptop to the rented box with scp:

# run this on your LAPTOP, not the box
scp mybrain_chats.jsonl ubuntu@209.20.xxx.xxx:~/.cache/nanochat/

Or run the converter scripts directly on the box if your raw data is there. Either works.

7a. ChatGPT/Claude export → JSONL

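Your AI history is already conversations, so this is the easiest source. Export your data from ChatGPT (Settings → Data Controls → Export); you'll get a conversations.json. Then run this script (tools/convert_chat_export.py), which handles ChatGPT's export format; a Claude export is structured differently and would need its own converter: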

import json, sys

# ChatGPT's export stores each conversation as a tree of message "nodes".
# This flattens each into a clean alternating [user, assistant, ...] list.
def flatten_chatgpt(path):
    with open(path, encoding="utf-8") as f:
        convos = json.load(f)
    for convo in convos:
        msgs = []
        nodes = [n for n in convo["mapping"].values() if n.get("message")]
        # chronological; note this also interleaves edited/regenerated branches
        nodes.sort(key=lambda n: n["message"].get("create_time") or 0)
        for n in nodes:
            m = n["message"]
            role = m["author"]["role"]
            parts = (m.get("content") or {}).get("parts") or []
            text = "".join(p for p in parts if isinstance(p, str)).strip()
            # keep only real user/assistant turns (drop system and tool messages)
            if not text or role not in ("user", "assistant"):
                continue
            # roles must alternate: merge consecutive same-role turns
            if msgs and msgs[-1]["role"] == role:
                msgs[-1]["content"] += "\n\n" + text
            else:
                msgs.append({"role": role, "content": text})
        if len(msgs) >= 2 and msgs[0]["role"] == "user":   # must start with user
            yield msgs

with open("mybrain_chats.jsonl", "w", encoding="utf-8") as out:
    for msgs in flatten_chatgpt(sys.argv[1]):
        out.write(json.dumps(msgs) + "\n")

Run it:

python tools/convert_chat_export.py conversations.json

That gave me ~3,800 real conversations in my own question style.

7b. Your writing → question/answer pairs

Raw notes aren't conversations, so turn each note into Q&A pairs using a bigger LLM, the same "self-instruct" trick nanochat uses internally. This script (tools/notes_to_qa.py) calls an API to do it:

import os, glob, json, requests

PROMPT = """Turn my personal note into training data for MY assistant.
Write 3 natural user questions and the answer I would give, in MY voice.
Keep my opinions and phrasing. Output a JSON object of the form
{{"conversations": [[{{"role":"user","content":...}},{{"role":"assistant","content":...}}], ...]}}
with one [user, assistant] pair per question.

NOTE:
{note}"""

def llm(note):
    r = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={"model": "google/gemini-3-flash-preview",
              "messages": [{"role": "user", "content": PROMPT.format(note=note)}],
              # json_object mode returns a JSON *object*, hence the "conversations" wrapper key
              "response_format": {"type": "json_object"}},
        timeout=120,
    )
    r.raise_for_status()   # fail loudly on a bad key or a rate limit
    data = json.loads(r.json()["choices"][0]["message"]["content"])
    return data["conversations"]

with open("mybrain_notes.jsonl", "w") as out:
    for path in glob.glob("obsidian/**/*.md", recursive=True):   # point at your notes folder
        note = open(path, encoding="utf-8").read().strip()
        if len(note) < 200:        # skip tiny stub notes
            continue
        for convo in llm(note):    # -> list of [user, assistant] pairs
            out.write(json.dumps(convo) + "\n")

Before running it, set your API key:

export OPENROUTER_API_KEY="your-key-here"
python tools/notes_to_qa.py

🔑 The key idea: I let the LLM invent the questions, but the answers stay grounded in my real text. That's what keeps my voice instead of turning everything into generic AI-speak.

🔑 No OpenRouter? Swap the URL/model for any LLM API you have (OpenAI, Anthropic, a local model). The script just needs something that turns a note into Q&A JSON.

7c. Your identity → persona data

So the model knows it's your second brain, reuse nanochat's identity generator with your own bio. Edit the knowledge file:

knowledge/self_knowledge.md:
- You are "AvidBrain", a personal AI trained on Avid's notes and writing.
- You were built by Avid by fine-tuning a nanochat model on personal data.
- You speak in Avid's voice: direct, concrete, slightly informal.
- You know Avid's projects: a newsletter about building with AI, and a few open-source side tools.
- The thesis Avid keeps coming back to: good tools have strong opinions and few settings.
- When unsure, you say so, you do not invent facts about Avid's life.

Then generate the conversations:

python -m dev.gen_synthetic_data --num 1500 --output mybrain_identity.jsonl

This produces 1,500 varied little conversations (different topics, personas, and phrasings) all teaching the model who it is.

Step 8: Inject your data into fine-tuning (the actual code edit)

Here's the single most important modification. Open scripts/chat_sft.py and find the train_tasks list. This list defines the mixture of data the model fine-tunes on.

🔑 The trick: the mixture is just a Python list. Listing a dataset multiple times trains on it multiple times ("oversampling"). That's how you make your small personal dataset count against the huge general one.

This is the stock list:

train_tasks = [
    SmolTalk(split="train"),                                  # 460K general conversations
    CustomJSON(filepath=identity_conversations_filepath),     # nanochat's own identity
    CustomJSON(filepath=identity_conversations_filepath),
    *[MMLU(subset="all", split="auxiliary_train") for _ in range(args.mmlu_epochs)],
    *[GSM8K(subset="main", split="train") for _ in range(args.gsm8k_epochs)],
    SimpleSpelling(size=200000, split="train"),
    SpellingBee(size=80000, split="train"),
]

Change it to point at your three files and oversample them (this is my version):

base_dir = get_base_dir()
mybrain_chats    = os.path.join(base_dir, "mybrain_chats.jsonl")
mybrain_notes    = os.path.join(base_dir, "mybrain_notes.jsonl")
mybrain_identity = os.path.join(base_dir, "mybrain_identity.jsonl")

train_tasks = [
    SmolTalk(split="train"),                                  # KEEP this (see below)
    *[CustomJSON(filepath=mybrain_chats)    for _ in range(3)],   # my chats   ×3
    *[CustomJSON(filepath=mybrain_notes)    for _ in range(4)],   # my notes   ×4
    *[CustomJSON(filepath=mybrain_identity) for _ in range(2)],   # who it is  ×2
    *[MMLU(subset="all", split="auxiliary_train") for _ in range(args.mmlu_epochs)],
    *[GSM8K(subset="main", split="train") for _ in range(args.gsm8k_epochs)],
    SimpleSpelling(size=200000, split="train"),
    SpellingBee(size=80000, split="train"),
]

Three decisions to understand, because they're the heart of personalization:

Keep SmolTalk. It's 460K general conversations that keep the model able to chat normally. Remove it and the model forgets how to hold a conversation: a real phenomenon called catastrophic forgetting. Your data should season the mix, not replace it.

Oversample your data (the ×3, ×4). A few thousand of your conversations would otherwise be a rounding error next to 460K. Repeating them raises their effective weight so the model actually picks up your voice.

Don't overdo it (I capped around ×4). Push the multiplier too high and the model memorizes: it recites your notes word-for-word instead of learning your style (and it can leak private text). I found ×2 too faint, ×6 memorized, ×3–×4 was the sweet spot.

🔑 How I knew the ratio was right: two signals. My held-out questions (does it sound like me or a parrot?) and val_bpb (is it still generalizing?). When recall improved but answers got stiff and repetitive, I dialed the multiplier down.
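To get a feel for what the multipliers do, here's a sketch that computes each dataset's share of the mixture. The SmolTalk (460K), chat (~3,800) and identity (1,500) sizes come from this guide; the notes count of 2,000 is a hypothetical placeholder, and the MMLU/GSM8K/spelling tasks are left out to keep the example short, so the real shares will be somewhat smaller.

```python
def mixture_shares(sizes: dict, multipliers: dict) -> dict:
    """Fraction of the training mixture each dataset takes up after oversampling."""
    effective = {name: rows * multipliers.get(name, 1) for name, rows in sizes.items()}
    total = sum(effective.values())
    return {name: rows / total for name, rows in effective.items()}

sizes = {
    "smoltalk": 460_000,
    "my_chats": 3_800,
    "my_notes": 2_000,     # assumption: hypothetical number of note Q&A conversations
    "my_identity": 1_500,
}
personal = ("my_chats", "my_notes", "my_identity")

for label, mult in [("no oversampling", {}),
                    ("x3 / x4 / x2", {"my_chats": 3, "my_notes": 4, "my_identity": 2})]:
    shares = mixture_shares(sizes, mult)
    print(f"{label:>16}: personal data = {sum(shares[k] for k in personal):.1%} of the mix")
```

Even with oversampling, personal data stays a small slice of the mixture; the multiplier moves it from "rounding error" to "noticeable", which fits the idea of seasoning the mix rather than replacing it.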

One more thing nanochat handles for you, worth knowing: during fine-tuning it only trains on the assistant's replies, not the user's questions (this is called loss masking). So the model learns to produce your answers, never to predict the prompts. Exactly what you want, and you don't have to do anything to get it.
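Here's a toy illustration of loss masking. It splits on whitespace instead of using the real tokenizer and ignores the special tokens, so treat it as a picture of the idea rather than nanochat's code; both function names are hypothetical.

```python
def build_loss_mask(conversation):
    """Return (tokens, mask) where mask is 1 only on the assistant's tokens."""
    tokens, mask = [], []
    for msg in conversation:
        words = msg["content"].split()          # toy tokenizer: one token per word
        tokens.extend(words)
        mask.extend([1 if msg["role"] == "assistant" else 0] * len(words))
    return tokens, mask

def masked_mean_loss(per_token_loss, mask):
    """Average the loss over masked-in positions only; user tokens contribute nothing."""
    kept = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(kept) / len(kept)

convo = [
    {"role": "user", "content": "what is two plus two"},
    {"role": "assistant", "content": "four"},
]
tokens, mask = build_loss_mask(convo)
print(list(zip(tokens, mask)))

# pretend the model was terrible at predicting the question but good at the answer
print(masked_mean_loss([9.0, 9.0, 9.0, 9.0, 9.0, 1.0], mask))
```

The model still reads the user's question as context; it just isn't graded on predicting it.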

Step 9: Run the fine-tune

Make sure your three .jsonl files are in the base folder, then fine-tune:

cp mybrain_*.jsonl ~/.cache/nanochat/      # if not already there

torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- \
    --device-batch-size=16 \
    --run=mybrain_sft
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft

What to expect:

Fine-tuning is fast: tens of minutes, not hours. It starts from your base model and gently adjusts it.

chat_eval -i sft grades the chat model on standard tests (it prints accuracy numbers). Useful as a sanity check.

The final chat model is saved to ~/.cache/nanochat/chatsft_checkpoints/d26/.

Then run your own test: ask it those 50 held-out questions and judge whether the answers actually sound like you. That's the real measure of success.

Step 10 (optional, advanced): bake in facts before fine-tuning

Skip this on your first build. Come back if your model gets your voice right but fuzzes facts (misremembers your projects, dates, specifics).

The fix is a short continued-pretraining pass on your raw notes before fine-tuning, so the facts soak into the weights.

⚠️ Don't try to train the base model from scratch on only your data. A few MB of notes is a speck next to the 400B-token web corpus; a model trained on that alone just memorizes it instead of learning language. Personal data belongs in fine-tuning (and this light pass), not from-scratch pretraining.

Convert your raw notes into nanochat's data format (tools/notes_to_shards.py):

import os, glob, pyarrow as pa, pyarrow.parquet as pq

docs = [open(p, encoding="utf-8").read() for p in glob.glob("obsidian/**/*.md", recursive=True)]
out_dir = os.path.expanduser("~/.cache/nanochat/base_data_mybrain")
os.makedirs(out_dir, exist_ok=True)
table = pa.Table.from_pydict({"text": docs})          # the column MUST be named "text"
pq.write_table(table, os.path.join(out_dir, "shard_00000.parquet"),
               row_group_size=1024, compression="zstd")

Then point the data loader at that folder and resume from the base checkpoint at a low learning rate for a few hundred steps. It nudges your vocabulary and facts into the model before fine-tuning polishes behavior.

This is a power-user squeeze; SFT alone already gets you a convincing second brain.

Step 11: Talk to your second brain

The fun payoff. Two ways.

Command line (quickest test):

python -m scripts.chat_cli -i sft -p "Summarize my argument about RAG versus fine-tuning."

-i sft = use the fine-tuned chat model. -p "..." = a one-shot prompt (leave it off to chat interactively).

Web UI (the ChatGPT-style experience):

python -m scripts.chat_web

It starts a web server and prints a URL. On a cloud box, open http://<YOUR_PUBLIC_IP>:8000/ in your laptop's browser (use the box's public IP and the port shown).

If the page won't load, your provider may be blocking the port. Open (allow) port 8000 in the instance's firewall/security settings.

🔑 Serving runs on one GPU: that's fine; inference is cheap. You don't need all eight to chat.

Your model can even do little calculations: when it needs arithmetic, nanochat lets it call a built-in calculator tool and feeds the answer back, so it doesn't fumble numbers.

Step 12: Save your model, then SHUT DOWN the box

Your trained model lives in ~/.cache/nanochat/chatsft_checkpoints/d26/ on the rented box. If you terminate the box without copying it, it's gone.

Download it to your laptop first (run on your laptop):

scp -r ubuntu@209.20.xxx.xxx:~/.cache/nanochat/chatsft_checkpoints ./my-second-brain
scp -r ubuntu@209.20.xxx.xxx:~/.cache/nanochat/tokenizer ./my-second-brain-tokenizer

Then terminate the instance in your provider's dashboard. Terminate, don't just "stop." On many providers a "stopped" instance still charges for storage, and a running one you forgot about is how people get surprise $500 bills. When you're truly done, terminate/destroy it. Double-check the dashboard shows no running instances.

The calls I made, and why (quick reference)

| Decision | Why |
|---|---|
| Train from scratch, not LoRA a 7B model | Ownership, transparency, a small private model, and no built-in "assistant personality" fighting my voice |
| depth-26, lightly undertrained | Smallest clearly GPT-2-grade size; undertraining a bigger model beats overtraining a smaller one; even depth keeps the math clean |
| Debug on a tiny d12/d6 first | A 5-minute run catches broken data before a 3-hour run wastes $40 |
| Keep SmolTalk in the mixture | Without it the model forgets how to hold a normal conversation |
| Oversample my data ×3–4 | Enough to matter against 460K rows; ×6 memorized, ×2 stayed generic |
| Let an LLM write questions, keep my real answers | Preserves my voice instead of laundering it into generic-AI-speak |
| Hold out 50 of my own Q&As | The only test that proves it learned *me*, not just fluent English |

Troubleshooting (the stuff that actually goes wrong)

"CUDA out of memory" / OOM → lower --device-batch-size (16 → 8 → 4). Effective batch size stays the same; it just runs a touch slower.

torchrun: command not found → you forgot source .venv/bin/activate in this terminal.

The training died when my laptop slept → you didn't use screen. Restart inside screen -S train and detach with Ctrl-A then D.

My flags are being ignored → you probably dropped the -- separator in the torchrun ... -- --depth=... command.

No dataset parquet files found → the data download (python -m nanochat.dataset -n ...) didn't finish. Re-run it; it skips files already downloaded.

The web UI won't load → open port 8000 in your instance's firewall, and use the public IP, not localhost.

Loss became NaN → math blew up; just restart the step with the known-good command above.

It sounds generic, not like me → increase your oversample multiplier (×4 → ×5) and add more personal data. Data quantity/quality is the ceiling.

It recites my notes word-for-word → you oversampled too hard; lower the multiplier and add more variety.

Glossary (plain English)

Base model: the model after pretraining; good at completing text, can't chat yet.

Batch size: how many examples the model processes before each update. Bigger = smoother but more memory.

bpb (bits per byte): a loss measurement that's fair to compare across tokenizers. Lower = better.

Catastrophic forgetting: when fine-tuning too hard on new data makes the model forget old skills (e.g., normal conversation).

Checkpoint: a saved snapshot of the model's weights on disk.

CORE score: a standard benchmark (ensemble of 22 tests). Beating 0.256525 = GPT-2-grade.

DDP / distributed: splitting training across multiple GPUs at once.

Depth: number of transformer layers; nanochat's single size dial.

Fine-tuning / SFT: training the base model on example conversations so it becomes a helpful chat assistant.

FP8 / bf16: number formats. Fewer bits = faster, slightly less precise. FP8 needs an H100-class GPU.

Gradient accumulation: a trick to simulate a big batch on limited memory by adding up several small batches before updating.

Loss: how wrong the model is right now; training drives it down.

Loss masking: training only on certain tokens (here, the assistant's replies).

OOM: "out of memory"; lower the batch size.

Oversampling: repeating a dataset in the mix so it counts for more.

Parameters: the model's learnable numbers; more = bigger model.

Pretraining: the long first phase where the model learns language from web text.

Shard: one numbered file holding a slice of the dataset.

Token / tokenizer: text chunks the model reads/writes, and the tool that makes them.

torchrun: the launcher that runs training across multiple GPUs.

uv / venv: a fast package installer, and the isolated Python environment it sets up.

wandb (Weights & Biases): a website for watching live training graphs.

The whole thing, copy-paste

# === ON THE RENTED 8xH100 BOX ===

# 0. setup
git clone https://github.com/karpathy/nanochat.git && cd nanochat
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv && uv sync --extra gpu && source .venv/bin/activate

# 1. data + tokenizer
python -m nanochat.dataset -n 8
python -m nanochat.dataset -n 240 &
python -m scripts.tok_train && python -m scripts.tok_eval

# 2. base model (~3 hrs): run inside: screen -S train
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=26 --target-param-data-ratio=8 --device-batch-size=16 --fp8 --run=base_d26
torchrun --standalone --nproc_per_node=8 -m scripts.base_eval -- --device-batch-size=16

# 3. build your data (your scripts) and copy into the base folder
python tools/convert_chat_export.py conversations.json   # -> mybrain_chats.jsonl
python tools/notes_to_qa.py                              # -> mybrain_notes.jsonl
python -m dev.gen_synthetic_data --num 1500 --output mybrain_identity.jsonl
cp mybrain_*.jsonl ~/.cache/nanochat/

# 4. fine-tune on you (after editing scripts/chat_sft.py train_tasks)
torchrun --standalone --nproc_per_node=8 -m scripts.chat_sft -- --device-batch-size=16 --run=mybrain_sft
torchrun --standalone --nproc_per_node=8 -m scripts.chat_eval -- -i sft

# 5. talk to it
python -m scripts.chat_web        # open http://<YOUR_PUBLIC_IP>:8000/

# === ON YOUR LAPTOP: save the model, then TERMINATE the box ===
# scp -r ubuntu@<IP>:~/.cache/nanochat/chatsft_checkpoints ./my-second-brain

Built on karpathy/nanochat. The ~8,000 lines that make the model are his. The data that makes it you is the part you write.

Stuck on a step? The nanochat Discussions and its DeepWiki are good places to ask questions about the repo.

This article was written from my own notes and edited by Claude Opus 4.8.
