Ashutosh Maheshwari

@asmah2107

A friend of mine recently bombed an MLE interview at NVIDIA. They asked: "We need to deploy a Llama-3 70B model on hardware with limited VRAM. You propose quantization. When is this a bad idea?" Here's how you break it down:

Most candidates say: "Quantization is great, it makes models faster and smaller by using lower-precision numbers like INT8 or FP8. It's a win-win." This answer misses the entire point of the question. Quantization is a trade-off, and if you don't know the risks, you will break production.

The core insight: quantization reduces the "dynamic range" and resolution of the numbers your model can use. This is fine for most weights, but catastrophic when a small subset of values, called outliers, dominates the range. Imagine a single weight is 1000x larger than all the others. When you map everything onto a small number of levels, either the outlier gets clipped, or it stretches the scale so far that the ordinary weights get squashed together. Either way, information is destroyed.
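To make the outlier problem concrete, here is a minimal sketch using symmetric per-tensor absmax INT8 quantization on synthetic weights. The weight distribution, the outlier value, and the helper names are assumptions chosen for illustration; real libraries use more refined schemes (per-channel or per-group scales, for example).

```python
import numpy as np

def quantize_absmax_int8(w):
    """Symmetric per-tensor INT8 quantization: one scale for the whole tensor."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to floating point."""
    return q.astype(np.float32) * scale

def mean_abs_error(w):
    """Average round-trip error introduced by quantizing and dequantizing w."""
    q, scale = quantize_absmax_int8(w)
    return float(np.mean(np.abs(w - dequantize(q, scale))))

rng = np.random.default_rng(0)
# Synthetic "ordinary" weights (assumption: zero-mean, std 0.02).
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

# Same weights plus one hypothetical outlier about 1000x the typical magnitude.
with_outlier = weights.copy()
with_outlier[0] = 20.0

err_clean = mean_abs_error(weights)
err_outlier = mean_abs_error(with_outlier)

# Fraction of weights that collapse to the zero code once the outlier sets the scale.
q_outlier, _ = quantize_absmax_int8(with_outlier)
zero_fraction = float(np.mean(q_outlier == 0))

print(f"mean abs error without outlier: {err_clean:.6f}")
print(f"mean abs error with outlier:    {err_outlier:.6f}")
print(f"fraction of weights rounded to zero: {zero_fraction:.4f}")
```

In this toy setup the single outlier sets the scale, so nearly every ordinary weight rounds to zero. That is the "squashed" failure mode described above.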

Here's the diagnostic framework for when to be scared of quantization:

- Emergent Abilities: Large models (100B+) have capabilities that are highly sensitive to tiny changes in weights. Quantizing a model that's good at math might destroy its ability to reason.
- Fine-tuned Models: Fine-tuning often creates specialized, high-magnitude activation spikes. Naive post-training quantization (PTQ) will crush these, undoing your expensive fine-tuning.
- Multi-lingual Tasks: Different languages can have vastly different activation distributions. A quantization scheme optimized for English might fail spectacularly on Japanese.
- Chain-of-Thought Reasoning: Complex, multi-step reasoning relies on subtle numerical signals propagating through layers. Aggressive quantization adds noise that can derail the entire chain.
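The calibration-mismatch risk in the list above (a scheme tuned on one distribution and applied to another) can be sketched with synthetic activations. Everything here is an assumption for illustration: the two Gaussian distributions are stand-ins and are not derived from real English or Japanese data, and the percentile-based calibration is just one simple way to pick a scale.

```python
import numpy as np

def calibrate_scale(acts, percentile=99.9):
    """Pick an INT8 scale from a percentile of |activation| on calibration data."""
    return float(np.percentile(np.abs(acts), percentile)) / 127.0

def clipped_fraction(acts, scale):
    """Fraction of activations that fall outside the representable INT8 range."""
    return float(np.mean(np.abs(acts) > 127.0 * scale))

rng = np.random.default_rng(0)
# Synthetic stand-in for activations seen during calibration.
calib_acts = rng.normal(0.0, 1.0, size=100_000)
# Synthetic stand-in for a workload with a wider activation distribution.
shifted_acts = rng.normal(0.0, 3.0, size=100_000)

scale = calibrate_scale(calib_acts)
clip_calib = clipped_fraction(calib_acts, scale)
clip_shifted = clipped_fraction(shifted_acts, scale)

print(f"clipped on calibration-like data: {clip_calib:.4f}")
print(f"clipped on shifted data:          {clip_shifted:.4f}")
```

A scale that clips almost nothing on the calibration data clips a large share of the wider distribution, which is why the calibration set should cover the inputs you actually serve.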

The metric that saves you: perplexity. Always measure perplexity on a held-out set before and after quantization. If it spikes significantly, you've lost critical information. Don't rely on task-specific accuracy scores alone; perplexity reflects how well the model predicts text in general, so it catches broad damage that a narrow benchmark can miss.
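As a rough sketch of that check: perplexity is the exponential of the average negative log-likelihood per token. The snippet below assumes you already have per-token natural-log probabilities from each model on the same held-out text. The sample numbers and the 5% threshold are hypothetical, not recommendations.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood) over the evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def relative_increase(ppl_before, ppl_after):
    """Relative change in perplexity after quantization (0.10 means +10%)."""
    return (ppl_after - ppl_before) / ppl_before

def has_regressed(ppl_before, ppl_after, max_rel_increase=0.05):
    """Flag a regression when perplexity rises by more than the chosen budget.

    max_rel_increase is a hypothetical threshold; pick one that suits your model.
    """
    return relative_increase(ppl_before, ppl_after) > max_rel_increase

# Hypothetical per-token log-probs on the same held-out text.
fp16_logprobs = [-1.0, -1.0, -1.0, -1.0]
int8_logprobs = [-1.5, -1.5, -1.5, -1.5]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_int8 = perplexity(int8_logprobs)
print(f"FP16 perplexity: {ppl_fp16:.3f}")
print(f"INT8 perplexity: {ppl_int8:.3f}")
print("regressed:", has_regressed(ppl_fp16, ppl_int8))
```

Both models must be scored on identical tokens, otherwise the two perplexities are not comparable.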

The workflow that separates juniors from seniors:

❌ Junior: Applies a standard quantization library (like bitsandbytes) and calls it a day.

✅ Senior:
- Visualizes weight and activation distributions to check for outliers first.
- Uses calibration-based techniques like GPTQ or AWQ, which use calibration data to limit quantization error on the weights that matter most.
- Implements a mixed-precision scheme, leaving sensitive layers like the attention heads in FP16 while quantizing the larger FFN layers.
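The "check for outliers, then choose precision per layer" idea can be sketched with a simple heuristic. This is not GPTQ or AWQ, only a toy stand-in. The outlier metric, the threshold of 10, and the layer names are all assumptions for illustration.

```python
import numpy as np

def outlier_ratio(w):
    """Ratio of the largest |value| to the 99th-percentile |value| in a tensor."""
    mags = np.abs(np.asarray(w, dtype=np.float64)).ravel()
    p99 = np.percentile(mags, 99)
    if p99 == 0.0:
        # Degenerate tensor: no spread to compare against.
        return 0.0 if mags.max() == 0.0 else float("inf")
    return float(mags.max() / p99)

def plan_precision(layers, max_ratio=10.0):
    """Keep layers with heavy outliers in FP16 and mark the rest for INT8.

    max_ratio is a hypothetical threshold, not a tuned value.
    """
    return {
        name: ("fp16" if outlier_ratio(w) > max_ratio else "int8")
        for name, w in layers.items()
    }

rng = np.random.default_rng(0)
# Synthetic weights; layer names are hypothetical.
ffn = rng.normal(0.0, 0.02, size=(256, 256))
attn = rng.normal(0.0, 0.02, size=(256, 256))
attn[0, 0] = 5.0  # injected outlier to trigger the FP16 path

plan = plan_precision({"ffn.up_proj": ffn, "attn.q_proj": attn})
for name, precision in plan.items():
    print(f"{name}: {precision}")
```

In practice you would run a check like this on activations as well as weights, since activation outliers are often the bigger problem.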

Pro-tip for the interview: Mention the trade-off with speculative decoding. "Aggressive quantization might make the main model so different from the draft model that the acceptance rate for speculative decoding plummets, negating any performance gains. The whole system must be optimized together."
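To see why the acceptance rate is sensitive to drift between the two models: under standard speculative sampling, a draft token drawn from distribution q is accepted with overall probability sum over tokens of min(p, q), where p is the main model's distribution. The sketch below models quantization error as Gaussian noise added to the main model's logits, which is a simplifying assumption rather than a faithful model of quantization. The vocabulary size and noise levels are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def acceptance_rate(p_target, q_draft):
    """Expected acceptance probability for one draft token: sum of min(p, q)."""
    return float(np.minimum(p_target, q_draft).sum())

rng = np.random.default_rng(0)
vocab_size = 1000  # hypothetical vocabulary size
logits = rng.normal(0.0, 2.0, size=vocab_size)

# Draft model distribution for one position.
draft = softmax(logits)

# Main model after mild vs. aggressive quantization, with noise standing in
# for quantization error (assumption).
target_mild = softmax(logits + rng.normal(0.0, 0.1, size=vocab_size))
target_aggressive = softmax(logits + rng.normal(0.0, 2.0, size=vocab_size))

rate_mild = acceptance_rate(target_mild, draft)
rate_aggressive = acceptance_rate(target_aggressive, draft)
print(f"acceptance with mild drift:       {rate_mild:.3f}")
print(f"acceptance with aggressive drift: {rate_aggressive:.3f}")
```

The larger the drift between the two distributions, the fewer draft tokens survive verification, which eats into the speedup speculative decoding was supposed to deliver.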

So the right answer is: "Quantization is not a free lunch. I'd be cautious with models fine-tuned for specialized tasks or those exhibiting complex reasoning. I would start by analyzing activation distributions, use a method like AWQ, and validate with both perplexity and task-specific evals to ensure we haven't silently crippled a key capability."

Stop treating quantization as a simple compression trick. It's a complex surgical procedure on your model's brain. Get it wrong, and you're left with a fast, small, and useless model. Follow @asmah2107 for more deep dives into ML engineering.