Avi Chawla

@_avichawla

Every generate() call to an LLM runs two distinct computational phases on the same GPU:

• prefill (processing the prompt) is compute-bound

• while decode (generating tokens one at a time) is memory-bound.

Most inference optimizations target one phase or the other, and diagnosing which phase is the bottleneck is the first step in making a deployment faster.
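The compute-bound versus memory-bound split can be illustrated with a rough arithmetic-intensity estimate (FLOPs per byte moved) for a single weight matrix multiply. This is a back-of-the-envelope sketch, not a profiler: the function name, the square `[hidden_dim, hidden_dim]` weight, the 2-byte (fp16) values, and the 2,048-token prompt are all assumptions chosen for illustration.

```python
def matmul_arithmetic_intensity(num_tokens, hidden_dim, bytes_per_value=2):
    """FLOPs per byte for activations [num_tokens, hidden_dim] @ weights [hidden_dim, hidden_dim].

    Hypothetical helper: assumes weights are read once and activations are
    read once and written once, with no caching effects.
    """
    flops = 2 * num_tokens * hidden_dim * hidden_dim          # one multiply + one add per term
    weight_bytes = hidden_dim * hidden_dim * bytes_per_value  # read the weight matrix
    activation_bytes = 2 * num_tokens * hidden_dim * bytes_per_value  # read input, write output
    return flops / (weight_bytes + activation_bytes)


# Prefill: the whole prompt goes through the matmul at once (assumed 2,048 tokens).
prefill = matmul_arithmetic_intensity(num_tokens=2048, hidden_dim=4096)

# Decode: one new token per step, but the full weight matrix is still read.
decode = matmul_arithmetic_intensity(num_tokens=1, hidden_dim=4096)

print(f"prefill: {prefill:.1f} FLOPs/byte")  # 1024.0
print(f"decode:  {decode:.3f} FLOPs/byte")   # just under 1
```

Under these assumptions, prefill does roughly a thousand times more arithmetic per byte loaded than decode, which is why the same weights can keep the compute units busy in one phase and leave them waiting on memory in the other.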

In this article, I'll walk through the full pipeline, from tokenized input to streamed output, and look at where the time goes in each phase.

---

# Tokenization and embedding

Tokenizers built on algorithms like Byte Pair Encoding (BPE) convert raw text into integer IDs drawn from a fixed vocabulary: roughly 50,000 tokens for GPT-2, and larger for many recent models.

<pre><code lang="python">prompt = "How does inference work?"
ids = tokenizer.encode(prompt)
# ids -> [2437, 1374, 32278, 670, 30]</code></pre>
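To make the merge idea behind BPE concrete, here is a toy encoder. It is only a sketch: the `merges` list and `vocab` table are made up for illustration, and real tokenizers typically work on bytes, pre-split the text, and learn tens of thousands of merges from data.

```python
def bpe_encode(word, merges, vocab):
    """Greedy BPE on one word: repeatedly apply the highest-priority merge that is present."""
    symbols = list(word)                                # start from single characters
    rank = {pair: i for i, pair in enumerate(merges)}   # lower index = higher priority
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: rank.get(p, float("inf")))
        if best not in rank:
            break                                       # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [vocab[s] for s in symbols]


# Hypothetical merge rules (in priority order) and vocabulary.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
vocab = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6, "er": 7}

print(bpe_encode("lower", merges, vocab))  # [6, 7]  -> "low" + "er"
print(bpe_encode("owl", merges, vocab))    # [1, 2, 0] -> no merge applies
```

Frequent character sequences collapse into single IDs, while rare ones fall back to smaller pieces, so any input can be encoded with a bounded vocabulary.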

Each ID maps to a row in the embedding table, a learned matrix of shape [vocab_size, hidden_dim]. For a model with a hidden dimension of 4,096, each token becomes a 4,096-dimensional vector.

<pre><code lang="python"># embedding_table has shape [vocab_size, hidden_dim]
vectors = embedding_table[ids]
# shape: [num_tokens, 4096]</code></pre>
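A runnable version of the lookup, using NumPy with a randomly initialised table, might look like the sketch below. The sizes are assumptions picked to keep memory small (`hidden_dim` is 8 here rather than 4,096), and a trained model would load learned weights instead of random ones.

```python
import numpy as np

vocab_size, hidden_dim = 50_000, 8      # toy hidden size; the text's example uses 4,096
rng = np.random.default_rng(0)

# Stand-in for the learned embedding matrix.
embedding_table = rng.standard_normal((vocab_size, hidden_dim))

ids = np.array([2437, 1374, 32278, 670, 30])   # token IDs from the example above
vectors = embedding_table[ids]                  # row gather, one row per token

print(vectors.shape)  # (5, 8)

# The lookup is equivalent to multiplying one-hot rows by the table,
# just without materialising the one-hot matrix.
one_hot = np.zeros((len(ids), vocab_size))
one_hot[np.arange(len(ids)), ids] = 1.0
print(np.allclose(one_hot @ embedding_table, vectors))  # True
```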

Position information gets injected at this stage.

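Most modern architectures use Rotary Position Embeddings (RoPE), which encode position by rotating vectors through position-dependent angles rather than adding a separate positional vector. In practice, the rotation is applied to the query and key vectors inside each attention layer.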
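The rotation can be sketched in a few lines of NumPy. This version pairs adjacent features (even, odd) and uses the commonly cited base of 10,000; both are assumptions, since implementations differ in how they pair dimensions. The function name `rope` is illustrative.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair of x by a position-dependent angle.

    x: float array of shape [num_tokens, dim] with dim even; row p is position p.
    """
    num_tokens, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)      # one frequency per pair, [dim/2]
    angles = np.outer(np.arange(num_tokens), inv_freq)    # [num_tokens, dim/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin              # standard 2D rotation
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Place the same q and k vectors at every position, then rotate.
rq = rope(np.tile(q, (6, 1)))
rk = rope(np.tile(k, (6, 1)))

# The dot product depends only on the distance between positions.
print(np.isclose(rq[1] @ rk[3], rq[2] @ rk[4]))  # True
```

Because a rotation preserves vector length and the dot product of two rotated vectors depends only on the difference of their angles, attention scores computed this way carry relative position without any added positional vector.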

---

# Transformer layers

The embedded sequence passes through a stack of transformer layers (typically 32 to 80+, depending on model size).

Each layer applies two operations in sequence:

<b>1) Self-attention</b> computes three projections per token (query Q, key K, value V) via learned weight matrices.
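The three projections amount to three matrix multiplies over the same input. The sketch below uses random weights and toy sizes as stand-ins for learned parameters; the names `W_q`, `W_k`, and `W_v` are illustrative, and the split into multiple attention heads is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, hidden_dim = 5, 16          # toy sizes for illustration

# One hidden vector per token, as produced by the previous layer.
x = rng.standard_normal((num_tokens, hidden_dim))

# Stand-ins for the learned projection matrices.
W_q, W_k, W_v = (
    rng.standard_normal((hidden_dim, hidden_dim)) / np.sqrt(hidden_dim)
    for _ in range(3)
)

Q = x @ W_q   # queries
K = x @ W_k   # keys
V = x @ W_v   # values

print(Q.shape, K.shape, V.shape)  # (5, 16) (5, 16) (5, 16)
```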
