Visualize Thread by @_avichawla

Avi Chawla

@_avichawla

Every generate() call to an LLM runs two distinct computational phases on the same GPU:

• Prefill (processing the prompt) is compute-bound.
• Decode (generating tokens one at a time) is memory-bound.

Most inference optimizations target one phase or the other, and diagnosing which phase is the bottleneck is the first step in making a deployment faster.
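
One rough way to see why the two phases behave differently is arithmetic intensity: the number of floating-point operations performed per byte moved from memory. The sketch below estimates it for a single weight matrix multiply and compares it against a hardware "ridge point". The peak throughput and bandwidth figures are illustrative assumptions for a hypothetical accelerator, not measurements, and the helper names are made up for this example.

```python
# Roofline-style estimate for one weight matrix multiply (a sketch, not a profiler).
# Hardware numbers are illustrative assumptions for a hypothetical accelerator.

def arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_param=2):
    """FLOPs per byte moved for x[num_tokens, d_in] @ W[d_in, d_out]."""
    flops = 2 * num_tokens * d_in * d_out                       # multiply + add per weight per token
    weight_bytes = d_in * d_out * bytes_per_param               # weights are read once
    act_bytes = num_tokens * (d_in + d_out) * bytes_per_param   # read inputs, write outputs
    return flops / (weight_bytes + act_bytes)


def classify(num_tokens, peak_flops, mem_bandwidth, hidden_dim=4096):
    """Label a phase by comparing its intensity with the hardware ridge point."""
    ridge = peak_flops / mem_bandwidth   # FLOPs per byte needed to saturate compute
    intensity = arithmetic_intensity(num_tokens, hidden_dim, hidden_dim)
    return "compute-bound" if intensity >= ridge else "memory-bound"


PEAK_FLOPS = 300e12      # assumed: 300 TFLOP/s
MEM_BANDWIDTH = 2e12     # assumed: 2 TB/s

prefill_label = classify(2048, PEAK_FLOPS, MEM_BANDWIDTH)  # whole prompt processed at once
decode_label = classify(1, PEAK_FLOPS, MEM_BANDWIDTH)      # one new token per step

print("prefill:", prefill_label)
print("decode: ", decode_label)
```

With a single token per step, every weight is loaded to do only two FLOPs, so intensity stays near one FLOP per byte regardless of the hardware. Batching several decode requests raises `num_tokens` and moves decode toward the ridge point.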

In this article, I'll walk through the full pipeline, from tokenized input to streamed output, and look at where the time goes in each phase.

---

# Tokenization and embedding

Tokenizers built on algorithms like Byte Pair Encoding (BPE) convert raw text into integer IDs from a fixed vocabulary (roughly 50,000 tokens for GPT-2; many newer models use 100,000 or more).

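# `tokenizer` is assumed to be an already-loaded BPE tokenizer
prompt = "How does inference work?"
ids = tokenizer.encode(prompt)
# ids -> e.g. [2437, 1374, 32278, 670, 30] (exact IDs depend on the tokenizer)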
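
To make the encoding step less of a black box, here is a minimal sketch of how a BPE encoder applies its learned merge rules to one word. The merge list and vocabulary are invented for illustration and do not correspond to any real tokenizer; production tokenizers also handle bytes, whitespace, and special tokens, which this sketch ignores.

```python
# Toy BPE encoder. TOY_MERGES and TOY_VOCAB are made up for this example.

def bpe_encode(word, merges, vocab):
    """Greedily apply the highest-priority merge until none applies, then map to IDs."""
    symbols = list(word)                                   # start from single characters
    ranks = {pair: i for i, pair in enumerate(merges)}     # earlier merge = higher priority
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break                                          # no learned merge applies
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1]) # fuse the pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [vocab[s] for s in symbols]


TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
TOY_VOCAB = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6, "er": 7}

toy_ids = bpe_encode("lower", TOY_MERGES, TOY_VOCAB)
print(toy_ids)  # [6, 7], i.e. "low" + "er"
```

Frequent character sequences collapse into single IDs, which is why common words cost one token while rare words are split into several.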

Each ID maps to a row in the embedding table, a learned matrix of shape [vocab_size, hidden_dim]. For a model with a hidden dimension of 4,096, each token becomes a 4,096-dimensional vector.

# embedding_table has shape [vocab_size, hidden_dim]
vectors = embedding_table[ids]   # shape: [num_tokens, 4096]

The model also needs to know where each token sits in the sequence.

Most modern architectures use Rotary Position Embeddings (RoPE), which encode position by rotating the query and key vectors inside each attention layer rather than adding a separate positional vector to the token embeddings.
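
The rotation itself is simple enough to sketch in a few lines. The version below assumes the adjacent-pair formulation with a base of 10,000; some implementations pair dimension i with i + d/2 instead, and in practice the rotation is applied to query and key vectors. The vectors `q` and `k` are arbitrary toy values.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)     # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # standard 2D rotation
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]   # toy query vector
k = [0.2, -1.0, 0.7, 0.4]   # toy key vector

# Same relative offset (2), different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(score_a - score_b) < 1e-9)  # True
```

Because each pair is only rotated, vector norms are unchanged, and the query-key dot product depends only on the distance between the two positions, not on where they sit in the sequence.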

---

# Transformer layers

The embedded sequence passes through a stack of transformer layers (typically 32 to 80+, depending on model size).

Each layer applies two operations in sequence:

1) Self-attention computes three projections per token (query Q, key K, value V) via learned weight matrices.
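
The projection step, followed by the standard scaled dot-product that consumes Q, K, and V, can be sketched as below. This is a single-head toy version with a causal mask, as used in decoder-only models; the input and weight values are arbitrary, and real layers add multiple heads, an output projection, and residual connections.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)                                   # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head attention: each token attends to itself and earlier tokens only."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)   # the three projections
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs, weights = [], []
    for t in range(len(x)):
        scores = [scale * sum(a * b for a, b in zip(q[t], k[j])) for j in range(t + 1)]
        w = softmax(scores)                                    # attention over positions 0..t
        weights.append(w)
        outputs.append([sum(w[j] * v[j][c] for j in range(t + 1)) for c in range(len(v[0]))])
    return outputs, weights, v


# Toy sizes: 3 tokens, hidden_dim 4, head_dim 2. All values are arbitrary.
x = [[1.0, 0.0, 0.5, -1.0], [0.2, 0.3, -0.4, 0.1], [0.0, 1.0, 1.0, 0.5]]
w_q = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1], [-0.2, 0.4]]
w_k = [[0.2, 0.0], [0.1, 0.1], [-0.1, 0.3], [0.0, 0.2]]
w_v = [[0.5, -0.5], [0.1, 0.2], [0.0, 0.3], [0.4, 0.0]]

attn_out, attn_weights, values = causal_self_attention(x, w_q, w_k, w_v)
print(len(attn_out), len(attn_out[0]))  # 3 2
```

The K and V rows computed here for earlier tokens do not change when a new token arrives, which is the property that lets inference engines cache them between decode steps.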
