Avi Chawla
@_avichawla

Every generate() call to an LLM runs two distinct computational phases on the same GPU:

• prefill (processing the prompt) is compute-bound

• while decode (generating tokens one at a time) is memory-bound.
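One rough way to see the difference between the two phases is arithmetic intensity: the number of floating-point operations performed per byte of weights read from memory. The sketch below is a back-of-the-envelope model, not a measurement. It assumes a single square weight matrix stored in 16-bit precision and ignores activations, attention, and the KV cache; the function name and the 2,048-token prompt length are illustrative choices.

```python
def arithmetic_intensity(num_tokens: int, hidden_dim: int = 4096, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of weights read for one [num_tokens, d] x [d, d] matmul."""
    # Each weight contributes one multiply and one add per token.
    flops = 2 * num_tokens * hidden_dim * hidden_dim
    # The weight matrix is read once, regardless of how many tokens use it.
    weight_bytes = hidden_dim * hidden_dim * bytes_per_weight
    return flops / weight_bytes

prefill_intensity = arithmetic_intensity(num_tokens=2048)  # whole prompt in one pass
decode_intensity = arithmetic_intensity(num_tokens=1)      # one new token per step
print(prefill_intensity, decode_intensity)
```

Under these assumptions, a decode step does very little work per byte of weights it loads, so memory bandwidth tends to set the pace, while prefill amortizes the same read over every prompt token and is more likely to be limited by compute.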

Most inference optimizations target one phase or the other, and diagnosing which phase is the bottleneck is the first step in making a deployment faster.

In this article, I'll walk through the full pipeline, from tokenized input to streamed output, and look at where the time goes in each phase.

---

# Tokenization and embedding

Tokenizers based on algorithms like Byte Pair Encoding (BPE) convert raw text into integer IDs drawn from a fixed vocabulary, typically tens of thousands of tokens or more (GPT-2, for example, uses roughly 50,000).

<pre><code lang="python">prompt = "How does inference work?"
ids = tokenizer.encode(prompt)
# ids -&gt; [2437, 1374, 32278, 670, 30]</code></pre>
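To make the text-to-ID step more concrete, here is a simplified sketch of how a BPE-style encoder might apply its merge rules. The merge table and vocabulary are made up for this example, and production tokenizers add details this omits (byte-level handling, pre-tokenization, special tokens).

```python
def bpe_encode(text, merges, vocab):
    """Apply merge rules in priority order, then map each symbol to its ID."""
    symbols = list(text)  # start from individual characters
    for left, right in merges:  # earlier merges have higher priority
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair into one symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return [vocab[s] for s in symbols]

# Hypothetical merge table and vocabulary, invented for illustration only.
toy_merges = [("l", "o"), ("lo", "w")]
toy_vocab = {"l": 0, "o": 1, "w": 2, "e": 3, "r": 4, "lo": 5, "low": 6}

toy_ids = bpe_encode("lower", toy_merges, toy_vocab)
print(toy_ids)
```

Here "lower" is first split into characters, then "l" + "o" and "lo" + "w" are merged in turn, so the frequent fragment "low" ends up as a single ID while the rarer suffix stays as separate characters.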

Each ID maps to a row in the embedding table, a learned matrix of shape [vocab_size, hidden_dim]. For a model with a hidden dimension of 4,096, each token becomes a 4,096-dimensional vector.

<pre><code lang="python"># embedding_table has shape [vocab_size, hidden_dim]
vectors = embedding_table[ids]  # shape: [num_tokens, 4096]</code></pre>
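A self-contained version of the same lookup, using plain Python lists so it runs without any libraries. The tiny sizes and random values are placeholders for a real learned table; the names `toy_table` and `toy_vectors` are assumptions for this sketch.

```python
import random

# Tiny stand-ins for a real vocabulary size and hidden dimension.
toy_vocab_size, toy_hidden_dim = 10, 4

rng = random.Random(0)
# One row per token ID; in a trained model these values are learned.
toy_table = [[rng.gauss(0.0, 0.02) for _ in range(toy_hidden_dim)]
             for _ in range(toy_vocab_size)]

token_ids = [3, 7, 3]
# The "embedding" step is just row selection: no arithmetic is involved.
toy_vectors = [toy_table[i] for i in token_ids]

print(len(toy_vectors), len(toy_vectors[0]))  # num_tokens, hidden_dim
```

Because the lookup is pure indexing, the same token ID always yields the same vector regardless of where it appears in the sequence, which is why position has to be supplied separately.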

Position information also has to be supplied, since the embedding lookup itself ignores token order.

Most modern architectures use Rotary Position Embeddings (RoPE), which encode position by rotating the query and key vectors inside each attention layer rather than adding a separate positional vector to the embeddings.
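The rotation idea can be sketched in a few lines. This is a minimal illustration, assuming the common formulation that rotates adjacent pairs of dimensions with a frequency base of 10,000; real implementations differ in how they pair dimensions and batch the computation, and the helper names here are hypothetical.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by an angle that grows with position."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Later pairs rotate more slowly: frequency is base ** (-i / d).
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]

# Same relative offset (3 positions apart) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

The property worth noticing is that the dot product between two rotated vectors depends only on how far apart their positions are, not on the absolute positions, which is what makes the rotation a useful way to carry relative position into attention scores.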

---

# Transformer layers

The embedded sequence passes through a stack of transformer layers (typically 32 to 80+, depending on model size).

Each layer applies two operations in sequence:

<b>1) Self-attention</b> computes three projections per token (query Q, key K, value V) via learned weight matrices.
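As a rough sketch of those three projections, the snippet below multiplies each token vector by three separate weight matrices. It assumes a single attention head, square weight matrices, and random values in place of learned weights; the names `project`, `W_q`, `W_k`, and `W_v` are illustrative.

```python
import random

def project(x_rows, weight):
    """Multiply [num_tokens, d_in] by [d_in, d_out] using plain lists."""
    d_in, d_out = len(weight), len(weight[0])
    return [[sum(row[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
            for row in x_rows]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
hidden_dim, num_tokens = 4, 3  # tiny sizes for illustration

x = random_matrix(num_tokens, hidden_dim, rng)    # stand-in for the token vectors
W_q = random_matrix(hidden_dim, hidden_dim, rng)  # stand-ins for learned weights
W_k = random_matrix(hidden_dim, hidden_dim, rng)
W_v = random_matrix(hidden_dim, hidden_dim, rng)

# Every token gets its own query, key, and value vector.
Q, K, V = project(x, W_q), project(x, W_k), project(x, W_v)
print(len(Q), len(Q[0]))
```

Each of the three outputs has one row per token, so the cost of this step scales with the number of tokens processed at once.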
