Alex Prompter
@alex_prompter
what are large language models actually doing?

i read the 2025 textbook "Foundations of Large Language Models" by tong xiao and jingbo zhu and for the first time, i truly understood how they work.

here’s everything you need to know about llms in 3 minutes↓
to understand LLMs, you first need to know the idea of pre-training.

instead of teaching a model to solve one task with labeled data (like classifying tweets), we train it on massive unlabeled text and let it "figure out" language patterns by itself.

this is called self-supervised learning.
there are 3 pre-training strategies:

→ unsupervised: models learn patterns without any labels
→ supervised: models learn from labeled tasks
→ self-supervised: models generate their own labels from unlabeled data (e.g., predicting masked words)

LLMs use the third one. it scales best, because raw text is abundant and needs no human labeling.
self-supervised pretraining works like this:

take a sentence like
“the early bird catches the worm.”

mask some words:
“the [MASK] bird catches the [MASK]”

ask the model to fill in the blanks.
no labels needed: the text itself is the supervision.
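a minimal python sketch of that masking step. `mask_tokens`, `mask_prob`, and `seed` are names made up for illustration, and the tokenizer is a plain whitespace split. BERT-style pipelines mask roughly 15% of tokens and add refinements this sketch skips.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels maps position -> original token."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)   # hide the token from the model
            labels[i] = tok       # remember it as the training target
        else:
            masked.append(tok)
    return masked, labels

tokens = "the early bird catches the worm".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked, labels)
```

the labels come straight from the hidden tokens, so no human annotation is involved.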
this idea leads to three main architecture types:

→ encoder-only (BERT): reads and understands text
→ decoder-only (GPT): generates the next word
→ encoder-decoder (T5, BART): reads input, generates output

each has strengths. for example:

• GPT is great at generation
• BERT is great at classification
• T5 can do both via a "text-to-text" framework
let’s dive into each.

decoder-only (GPT-style):

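trained to predict the next token given the previous ones:

“the cat sat on the” → “mat”

this is called causal language modeling. there is no [MASK] token here: the model only ever sees the tokens to the left.

loss is the cross-entropy between the model's predicted next-token distribution and the actual next token.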
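a small sketch of how one sentence turns into next-token training pairs, plus the cross-entropy for a single prediction. the helper names and the probability values are invented for illustration; a real model outputs a distribution over its whole vocabulary.

```python
import math

def next_token_pairs(tokens):
    """Each prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def cross_entropy(predicted_probs, target):
    """Negative log-probability the model assigned to the true next token."""
    return -math.log(predicted_probs[target])

tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
# pairs[-1] == (["the", "cat", "sat", "on", "the"], "mat")

# Hypothetical model output for the last prefix, over a tiny vocabulary.
probs = {"mat": 0.7, "sofa": 0.2, "moon": 0.1}
loss = cross_entropy(probs, "mat")  # equals -log(0.7)
print(loss)
```

the more probability the model puts on the true next token, the lower the loss.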
encoder-only (BERT-style):

looks at the full sentence at once, masks random tokens, and tries to reconstruct them.

this is masked language modeling. it gives the model bidirectional context: it uses the tokens on both the left and the right of each [MASK].

better for tasks like sentence classification.
example:

original: "the early bird catches the worm"
masked: "the [MASK] bird catches the [MASK]"

the model learns to predict "early" and "worm" using all surrounding context.

it's like learning to read by guessing missing words in a book.
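to make "all surrounding context" concrete, here is a sketch that contrasts a bidirectional attention mask with the causal one a GPT-style decoder uses. `causal_mask` and `bidirectional_mask` are illustrative helpers, not any library's API.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

tokens = ["the", "[MASK]", "bird", "catches"]
n = len(tokens)
causal = causal_mask(n)
full = bidirectional_mask(n)

# Which tokens are visible from the [MASK] position (index 1)?
visible_causal = [t for t, ok in zip(tokens, causal[1]) if ok]
visible_full = [t for t, ok in zip(tokens, full[1]) if ok]
print(visible_causal)  # ['the', '[MASK]']
print(visible_full)    # ['the', '[MASK]', 'bird', 'catches']
```

with the bidirectional mask, "bird" and "catches" help predict the hidden word; with the causal mask they are invisible.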
encoder-decoder (T5-style):

treats every NLP task as text-to-text.

examples:

“translate English to German: hello” → “hallo”
“classify sentiment: i hate this” → “negative”

the same model can be used for translation, summarization, QA, etc.
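a tiny sketch of the text-to-text idea: every task becomes a (source string, target string) pair, and only the prefix changes. `to_text_to_text` is a hypothetical helper, and the prefixes are the ones from the examples above.

```python
def to_text_to_text(task_prefix, text):
    """Format an input as a task prefix followed by the raw text."""
    return f"{task_prefix}: {text}"

examples = [
    (to_text_to_text("translate English to German", "hello"), "hallo"),
    (to_text_to_text("classify sentiment", "i hate this"), "negative"),
]

for source, target in examples:
    print(f"{source!r} -> {target!r}")
```

since inputs and outputs are both plain strings, one model and one loss can cover all of these tasks.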
so, what happens after pre-training?

we can either:

→ fine-tune: train the model further on labeled task-specific data
→ prompt: write smart inputs to steer the model

prompting is the reason LLMs feel magical. let's talk about that next.
prompting is just carefully crafting the input to make the model behave how you want.

example:

"i love this movie. sentiment:"

the model will likely complete: “positive”.

add a few labeled examples before the input, and the model picks up the pattern from the prompt alone, with no weight updates.
this is in-context learning.
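a sketch of how such a few-shot prompt could be assembled. `build_few_shot_prompt` and the demo sentences are made up for illustration; the resulting string would be sent to whatever model you use.

```python
def build_few_shot_prompt(demos, query, label_name="sentiment"):
    """Put labeled demonstrations first, then the unlabeled query."""
    lines = [f"{text} {label_name}: {label}" for text, label in demos]
    lines.append(f"{query} {label_name}:")
    return "\n".join(lines)

demos = [
    ("i love this movie.", "positive"),
    ("this was a waste of time.", "negative"),
]
prompt = build_few_shot_prompt(demos, "the plot was gripping.")
print(prompt)
```

the prompt ends right after "sentiment:", so the most natural continuation is a label in the same style as the demonstrations.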
prompting can go deep.

strategies include:

• chain of thought: "let's think step by step…"
• problem decomposition: break big problems into smaller ones
• self-refinement: ask the model to critique and revise its own output
• RAG (retrieval-augmented generation): retrieve relevant documents and add them to the prompt
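as one example from the list, here is a toy sketch of the RAG pattern. the word-overlap retriever is a deliberately crude stand-in (real systems typically use embeddings or a search index), and `retrieve`, `build_rag_prompt`, and the documents are all hypothetical.

```python
def retrieve(query, documents, k=1):
    """Rank documents by how many query words they share (toy retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_rag_prompt(query, documents, k=1):
    """Prepend the retrieved text so the model can answer from it."""
    context = "\n".join(retrieve(query, documents, k))
    return f"context:\n{context}\n\nquestion: {query}\nanswer:"

docs = [
    "BERT is an encoder-only model trained with masked language modeling.",
    "GPT is a decoder-only model trained to predict the next token.",
]
query = "which model is trained to predict the next token?"
prompt = build_rag_prompt(query, docs)
print(prompt)
```

the model itself is unchanged; only the prompt gains the retrieved context.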
this is all possible because of the way these models are trained: predict the next word over and over until they internalize language structure, reasoning patterns, and world knowledge.

it's not magic. it's scale.
but LLMs aren’t trained just to be smart — they need to be aligned with human values.

how?

→ supervised fine-tuning (SFT): train the model on human-written responses
→ RLHF (reinforcement learning from human feedback): train a reward model on human preference rankings, then optimize the LLM with RL to score well under it

this is how ChatGPT was aligned.
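a sketch of the reward-model step, assuming the common pairwise setup where annotators pick the better of two responses. the loss below is one widely used formulation (Bradley-Terry style), not necessarily the exact one any particular system uses, and the reward values are invented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the chosen response gets the higher reward."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

good_ranking = reward_model_loss(2.0, -1.0)  # reward model agrees with the human
bad_ranking = reward_model_loss(-1.0, 2.0)   # reward model disagrees
print(good_ranking, bad_ranking)
```

minimizing this pushes the reward model to score human-preferred responses higher; the RL stage then tunes the LLM to produce responses that earn high reward.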
alignment is hard.

newer methods like Direct Preference Optimization (DPO) skip the separate reward model and RL loop and train directly on preference pairs, which avoids much of RL's instability. they are becoming more popular.

the goal is the same: steer models toward helpful, harmless, honest responses.
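a sketch of the DPO loss for a single preference pair, written on plain floats. the inputs are the summed log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model; the argument names, the numbers, and `beta=0.1` are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Policy identical to the reference: no preference learned yet.
unchanged = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raised the chosen response and lowered the rejected one.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(unchanged, improved)
```

it is an ordinary supervised loss on log-probabilities, which is why no reward model or RL loop is needed.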
finally, inference matters.

how do you actually run these massive models efficiently?

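→ choose a decoding strategy (top-k, nucleus sampling) to trade off quality and diversity
→ cache the keys and values of earlier tokens (the KV cache) so they aren't recomputed at every step
→ batch multiple requests to keep the hardware busy
→ extend context with more memory-efficient attention and position interpolation

caching and batching are what get you fast, low-latency responses.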
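a sketch of the two decoding strategies named above, applied to a made-up next-token distribution. the function names and probabilities are illustrative; production decoders do the same filtering on logits over the full vocabulary.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def nucleus_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def sample(probs, rng):
    """Draw one token according to the filtered distribution."""
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"mat": 0.5, "sofa": 0.3, "floor": 0.15, "moon": 0.05}
top2 = top_k_filter(probs, k=2)              # keeps mat, sofa
nucleus = nucleus_filter(probs, top_p=0.9)   # keeps mat, sofa, floor
next_token = sample(nucleus, random.Random(0))
print(top2, nucleus, next_token)
```

both filters cut off the unlikely tail ("moon" here) before sampling; top-k uses a fixed count, nucleus sampling a fixed probability mass.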
in short:

LLMs work because they:

1. learn from massive text via self-supervision
2. use Transformers to model token sequences
3. can be prompted/fine-tuned for a wide range of tasks
4. are aligned with human preferences
5. are optimized for fast inference

they're general-purpose text reasoning machines.
this was based on the brilliant textbook:

"Foundations of Large Language Models" by Tong Xiao and Jingbo Zhu (NiuTrans Research Lab)

arxiv: arxiv.org/abs/2501.09223…

highly recommend it if you're serious about understanding LLMs deeply.