Thread Truncated (Cap Enforced)
Only the first 20 tweets are unrolled into slides to ensure reliable PDF exporting and high server performance.
Canvas & Ratio
Choose your destination platform format
Layout Template
Choose a content structure for your slides
Preset Themes
Typography & Sizing
Brand Kit Customization
AGENCYConfigure brand assets for headers & footers
Outro Slide CTA
Customize your closing call-to-action slide
Background Pattern
Build Your Carousel
Drag and drop any post card below onto a slide, or use the quick buttons to insert content/images instantly!

Every generate() call to an LLM runs two distinct computational phases on the same GPU:


• prefill (processing the prompt) is compute-bound

• while decode (generating tokens one at a time) is memory-bound.

Most inference optimizations target one phase or the other, and diagnosing which phase is the bottleneck is the first step in making a deployment faster.

In this article, I'll walk through the full pipeline, from tokenized input to streamed output, and look at where the time goes in each phase.

---

# Tokenization and embedding

Tokenizers like Byte Pair Encoding (BPE) convert raw text into integer IDs from a vocabulary of roughly 50,000 tokens.

<pre><code lang="python">prompt = "How does inference work?" ids = tokenizer.encode(prompt) # ids -> [2437, 1374, 32278, 670, 30]</code></pre>

Each ID maps to a row in the embedding table, a learned matrix of shape [vocab_size, hidden_dim]. For a model with a hidden dimension of 4,096, each token becomes a 4,096-dimensional vector.

<pre><code lang="python"># embedding_table has shape [vocab_size, hidden_dim] vectors = embedding_table[ids] # shape: [num_tokens, 4096]</code></pre>



Position information gets injected at this stage.

Most modern architectures use Rotary Position Embeddings (RoPE), which encode position by rotating the embedding vectors rather than adding a separate positional vector.

---

# Transformer layers

The embedded sequence passes through a stack of transformer layers (typically 32 to 80+, depending on model size).

Each layer applies two operations in sequence:

<b>1) Self-attention</b> computes three projections per token (query Q, key K, value V) via learned weight matrices.

