Avi Chawla
@_avichawla
Every generate() call to an LLM runs two distinct computational phases on the same GPU:
• prefill (processing the prompt) is compute-bound
• decode (generating tokens one at a time) is memory-bound.
Most inference optimizations target one phase or the other, and diagnosing which phase is the bottleneck is the first step in making a deployment faster.
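A rough way to see why the two phases behave differently is arithmetic intensity: the number of floating-point operations performed per byte moved from memory. The sketch below estimates it for a single square weight matrix. It assumes fp16 weights and activations (2 bytes each), a batch of one sequence, and counts only the weight read plus the input and output activations. The function name and the 512-token prompt are illustrative choices, not figures from any particular model.

```python
def arithmetic_intensity(num_tokens, hidden_dim=4096, bytes_per_value=2):
    """Approximate FLOPs per byte for multiplying [num_tokens, hidden_dim]
    activations by a [hidden_dim, hidden_dim] weight matrix."""
    # Each output element needs hidden_dim multiply-adds (2 FLOPs each).
    flops = 2 * num_tokens * hidden_dim * hidden_dim
    # The weights are read once, regardless of how many tokens are processed.
    weight_bytes = hidden_dim * hidden_dim * bytes_per_value
    # Input activations are read and output activations are written.
    activation_bytes = 2 * num_tokens * hidden_dim * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


# Prefill: the whole prompt goes through the matrix in one pass (assumed 512 tokens).
prefill_intensity = arithmetic_intensity(num_tokens=512)
# Decode: one new token per step.
decode_intensity = arithmetic_intensity(num_tokens=1)

print(f"prefill: {prefill_intensity:.1f} FLOPs/byte")
print(f"decode:  {decode_intensity:.1f} FLOPs/byte")
```

Under these assumptions the prefill matmul does a few hundred FLOPs per byte, while a decode step does about one, so decode spends most of its time waiting on weight reads rather than on arithmetic.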
In this article, I'll walk through the full pipeline, from tokenized input to streamed output, and look at where the time goes in each phase.
---
# Tokenization and embedding
Tokenization algorithms like Byte Pair Encoding (BPE) convert raw text into integer IDs drawn from a fixed vocabulary: roughly 50,000 tokens for GPT-2-class models, and 100,000 or more for many newer ones.
prompt = "How does inference work?"
ids = tokenizer.encode(prompt)
# ids -> [2437, 1374, 32278, 670, 30]
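To make the merge idea behind BPE concrete, here is a minimal sketch of encoding a single word. The merge table and vocabulary are toy, hypothetical values made up for illustration. Real tokenizers learn tens of thousands of merges from data and usually operate on bytes rather than characters.

```python
def bpe_encode(word, merges, vocab):
    """Encode one word with a toy BPE merge table.

    merges: list of symbol pairs, ordered from highest to lowest priority.
    vocab:  mapping from symbol string to integer ID.
    """
    symbols = list(word)  # start from individual characters
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the highest merge priority (lowest rank).
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [vocab[s] for s in symbols]


# Hypothetical merge table and vocabulary, for illustration only.
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
TOY_VOCAB = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6, "er": 7}

toy_ids = bpe_encode("lower", TOY_MERGES, TOY_VOCAB)
print(toy_ids)  # [6, 7], i.e. "low" + "er"
```

Words made of frequent fragments collapse into a few IDs, while rare strings fall back to smaller pieces, which is why token counts do not map one-to-one onto words.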
Each ID maps to a row in the embedding table, a learned matrix of shape [vocab_size, hidden_dim]. For a model with a hidden dimension of 4,096, each token becomes a 4,096-dimensional vector.
# embedding_table has shape [vocab_size, hidden_dim]
vectors = embedding_table[ids] # shape: [num_tokens, 4096]
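The same lookup can be run end to end with a tiny table. This is a sketch in plain Python with made-up sizes (8 tokens, 4 dimensions) and random values standing in for learned weights. A real model would use a tensor library and the dimensions quoted above.

```python
import random

rng = random.Random(0)

VOCAB_SIZE = 8   # toy value; real vocabularies have tens of thousands of entries
HIDDEN_DIM = 4   # toy value; the text above uses 4,096

# One row per token ID. Random numbers stand in for learned weights here.
embedding_table = [
    [rng.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(ids, table):
    """Return one embedding row per token ID: shape [num_tokens, hidden_dim]."""
    return [table[token_id] for token_id in ids]


example_ids = [6, 7, 1]
vectors = embed(example_ids, embedding_table)
print(len(vectors), len(vectors[0]))  # num_tokens, hidden_dim
```

The lookup is pure indexing with no arithmetic, so its cost is negligible next to the transformer layers that follow.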
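Position information also has to be injected, since the embedding lookup alone carries no notion of token order.
Most modern architectures use Rotary Position Embeddings (RoPE), which encode position by rotating vectors through a position-dependent angle rather than adding a separate positional vector. In practice, the rotation is applied to the query and key vectors inside each attention layer, not to the embeddings themselves.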
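A minimal sketch of the rotation follows. It assumes the interleaved-pair convention, where dimensions (0, 1), (2, 3), and so on are each rotated as a 2D pair, and a frequency base of 10,000. Some implementations pair dimensions differently, and production code typically applies this to query and key vectors with a tensor library. The function names are illustrative.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by a position-dependent angle."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of dimensions"
    rotated = []
    for i in range(0, dim, 2):
        # Lower-index pairs rotate quickly, higher-index pairs slowly.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The dot product depends only on the relative offset (2 in both cases).
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_b = dot(rope_rotate(q, 7), rope_rotate(k, 5))
print(abs(score_a - score_b) < 1e-9)  # True
```

Because a rotation preserves vector length and the score between two rotated vectors depends only on how far apart their positions are, relative position falls out of the dot product without any extra added vector.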
---
# Transformer layers
The embedded sequence passes through a stack of transformer layers (typically 32 to 80+, depending on model size).
Each layer applies two operations in sequence:
1) Self-attention computes three projections per token (query Q, key K, value V) via learned weight matrices.
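The three projections can be sketched as three matrix multiplications over the same input. This is a plain-Python illustration with toy sizes and random stand-ins for the learned weights. The helper names are hypothetical, and a real implementation would use a tensor library and usually split the result into multiple heads.

```python
import random


def matmul(a, b):
    """Multiply an [n, m] matrix by an [m, p] matrix (lists of lists)."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def random_matrix(rows, cols, rng):
    """Random values standing in for learned weights or activations."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def qkv_projections(x, w_q, w_k, w_v):
    """Project every token vector into query, key, and value space."""
    return matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)


rng = random.Random(0)
num_tokens, hidden_dim = 5, 8  # toy sizes

x = random_matrix(num_tokens, hidden_dim, rng)    # embedded sequence
w_q = random_matrix(hidden_dim, hidden_dim, rng)  # stand-in query weights
w_k = random_matrix(hidden_dim, hidden_dim, rng)  # stand-in key weights
w_v = random_matrix(hidden_dim, hidden_dim, rng)  # stand-in value weights

q, k, v = qkv_projections(x, w_q, w_k, w_v)
print(len(q), len(q[0]))  # num_tokens, hidden_dim
```

Each token gets its own Q, K, and V row, but all tokens share the same three weight matrices, which is what lets prefill process the whole prompt in one batched multiplication.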