
Avi Chawla
@_avichawla

KV caching in LLMs, clearly explained (with visuals):

KV caching is a technique used to speed up LLM inference. Before understanding the internal details, look at the inference speed difference in the video:
- with KV caching → 9 seconds
- without KV caching → 42 seconds (~5x slower)
Let's dive in!

[Video: inference speed with and without KV caching]

To understand KV caching, we must know how LLMs output tokens:
- The transformer produces hidden states for all tokens.
- The hidden states are projected to vocabulary space.
- The logits of the last token are used to generate the next token.
- This repeats for subsequent tokens.
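
The loop above can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not a real model: `EMBED`, `UNEMBED`, and `hidden_states` are hypothetical stand-ins (a running mean of embeddings replaces the transformer), and the next token is picked greedily. What it does show is the shape of the loop: hidden states for all tokens, logits from the last one only, then repeat.

```python
import random

random.seed(0)
VOCAB, D = 6, 4  # hypothetical toy vocabulary size and hidden size

# Hypothetical stand-ins for a trained model's weights.
EMBED = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(VOCAB)]
UNEMBED = [[random.uniform(-1, 1) for _ in range(VOCAB)] for _ in range(D)]

def hidden_states(tokens):
    """Stand-in for the transformer: one hidden state per token.

    Each state is the running mean of the embeddings up to that position,
    so position i depends only on tokens 0..i (as with causal attention).
    """
    states, total = [], [0.0] * D
    for i, t in enumerate(tokens):
        total = [a + b for a, b in zip(total, EMBED[t])]
        states.append([x / (i + 1) for x in total])
    return states

def logits(h):
    """Project one hidden state to vocabulary space."""
    return [sum(h[k] * UNEMBED[k][v] for k in range(D)) for v in range(VOCAB)]

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        states = hidden_states(tokens)   # hidden states for all tokens
        last = logits(states[-1])        # only the last state is projected
        tokens.append(max(range(VOCAB), key=last.__getitem__))  # greedy pick
    return tokens

out = generate([1, 2, 3], n_new=4)
print(out)
```

Note that every iteration recomputes the hidden states of the whole sequence even though only `states[-1]` is used.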

Thus, to generate a new token, we only need the hidden state of the most recent token. None of the other hidden states are required. Next, let's see how the last hidden state is computed within the transformer layer from the attention mechanism.

During attention, the last row of the query-key product involves:
- the last query vector.
- all key vectors.
Also, the last row of the final attention result involves:
- the last query vector.
- all key & value vectors.
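
A quick numerical check of this claim, as a sketch: the code below computes full causal attention for a small random `Q`, `K`, `V` (hypothetical toy data, single head) and compares its last row with attention computed from the last query vector alone.

```python
import math
import random

random.seed(0)
N, D = 5, 4  # hypothetical sequence length and head dimension

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

Q, K, V = rand_matrix(N, D), rand_matrix(N, D), rand_matrix(N, D)

# Full causal attention: every row of the N x N score matrix is computed.
scores = [[dot(Q[i], K[j]) / math.sqrt(D) if j <= i else -math.inf
           for j in range(N)] for i in range(N)]
probs = [softmax(row) for row in scores]
full = [[sum(probs[i][j] * V[j][c] for j in range(N)) for c in range(D)]
        for i in range(N)]

# Last row only: one query vector against all keys and values.
last_weights = softmax([dot(Q[-1], k) / math.sqrt(D) for k in K])
last_only = [sum(w * v[c] for w, v in zip(last_weights, V)) for c in range(D)]

same = all(math.isclose(a, b, abs_tol=1e-12)
           for a, b in zip(full[-1], last_only))
print(same)
```

The other rows of `Q` never enter the computation of `last_only`, which is the point of the observation above.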

The above insight suggests that to generate a new token, every attention operation in the network only needs:
- query vector of the last token.
- all key & value vectors.
But, there's one more key insight here.

As we generate new tokens:
- The KV vectors used for ALL previous tokens do not change.
Thus, we just need to generate a KV vector for the token generated one step before. The rest of the KV vectors can be retrieved from a cache to save compute and time.
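
To see why the earlier KV vectors stay fixed, here is a small sketch with two causal attention layers. It is an illustration under assumptions: the sizes and random weights are hypothetical, and the same `W_Q`, `W_K`, `W_V` are reused in both layers to keep it short. It computes the keys and values for a 4-token prefix, then for the same prefix with a fifth token appended, and checks that the first four rows agree in both layers.

```python
import math
import random

random.seed(0)
D = 4  # hypothetical model width

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def causal_layer(x):
    """One causal self-attention layer: position i sees positions 0..i only."""
    q, k, v = matmul(x, W_Q), matmul(x, W_K), matmul(x, W_V)
    out = []
    for i in range(len(x)):
        w = softmax([sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(D)
                     for j in range(i + 1)])
        out.append([sum(w[j] * v[j][c] for j in range(i + 1))
                    for c in range(D)])
    return out

def keys_and_values(x):
    """K and V for layer 1 (from the inputs) and layer 2 (from layer 1's output)."""
    h = causal_layer(x)
    return matmul(x, W_K), matmul(x, W_V), matmul(h, W_K), matmul(h, W_V)

tokens = rand_matrix(5, D)             # embeddings of 5 tokens
before = keys_and_values(tokens[:4])   # only the first 4 tokens
after = keys_and_values(tokens)        # all 5 tokens

unchanged = all(
    math.isclose(a, b, abs_tol=1e-12)
    for mat_before, mat_after in zip(before, after)
    for row_before, row_after in zip(mat_before, mat_after[:4])
    for a, b in zip(row_before, row_after)
)
print(unchanged)
```

Because of the causal mask, a new token cannot influence the hidden states of earlier positions, so their keys and values are safe to reuse.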

This is called KV caching! To reiterate, instead of redundantly computing KV vectors of all context tokens, cache them. To generate a token:
- Generate the QKV vectors for the token generated one step before.
- Get all other KV vectors from the cache.
- Compute attention.
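
The three steps can be sketched as a minimal single-head, single-layer cache. Everything here is illustrative: `KVCache`, `project`, and the random weights are hypothetical names and data, not any library's API. Each cached step is compared against a version that recomputes K and V for the whole context.

```python
import math
import random

random.seed(0)
D = 4  # hypothetical head dimension

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def project(vec, w):
    """Multiply a row vector by a weight matrix."""
    return [sum(vec[k] * w[k][j] for k in range(len(vec)))
            for j in range(len(w[0]))]

def attend(q, keys, values):
    """Attention output for a single query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(D)]

W_Q, W_K, W_V = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)

def step_without_cache(xs):
    """Recompute K and V for every token, then attend with the last query."""
    keys = [project(x, W_K) for x in xs]
    values = [project(x, W_V) for x in xs]
    return attend(project(xs[-1], W_Q), keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the newest token, store its K/V, attend over the cache."""
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        return attend(project(x, W_Q), self.keys, self.values)

xs = rand_matrix(6, D)  # hypothetical token embeddings, one per step
cache = KVCache()
results_match = True
for t in range(len(xs)):
    cached = cache.step(xs[t])
    recomputed = step_without_cache(xs[: t + 1])
    results_match &= all(math.isclose(a, b, abs_tol=1e-12)
                         for a, b in zip(cached, recomputed))
print(results_match)
```

With the cache, each step performs three projections for one token. Without it, step t projects all t tokens again, and that is the redundant work being avoided.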

KV caching saves time during inference. In fact, this is why ChatGPT takes longer to generate the first token than the subsequent tokens. During that time, it is computing the KV cache of the prompt.

That said, KV cache also takes a lot of memory. Llama3-70B has:
- total layers = 80
- hidden size = 8k
- max output size = 4k
Here:
- Every token takes up ~2.5 MB in KV cache.
- 4k tokens will take up ~10 GB.
More users → more memory. I'll cover KV optimization soon.
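
The arithmetic behind these figures can be reproduced with a short sketch. It rests on two assumptions the post does not state: each value is stored in 2 bytes (16-bit precision), and every layer keeps a full hidden-size key and a full hidden-size value per token. The function name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, hidden_size, bytes_per_value=2):
    """Bytes of K and V stored per token, summed over all layers.

    Assumes full-width K and V in every layer and 2-byte values by default.
    """
    return 2 * n_layers * hidden_size * bytes_per_value  # 2 = one K + one V

per_token = kv_cache_bytes_per_token(n_layers=80, hidden_size=8192)
total = per_token * 4096  # 4k tokens for one sequence

print(per_token / 2**20)  # MiB per token
print(total / 2**30)      # GiB for 4k tokens
```

Under these assumptions the result is 2.5 MiB per token and 10 GiB for 4,096 tokens, close to the figures quoted above. Models that share key/value heads across several query heads would store proportionally less.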

That's a wrap! If you enjoyed this tutorial:
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.