Inference Optimization (Quantization, Batching, KV Cache)

Make Your LLMs 10x Faster and 4x Cheaper

Master the techniques that turn expensive, slow LLM inference into blazing-fast, cost-efficient production systems. From quantization to continuous batching to KV cache optimization.

Why Inference Optimization Matters

The Bottleneck is Not Training - It is Serving

The Real Cost of LLMs

Training a model happens once, but inference happens millions of times. As a rough rule of thumb, for every Rs 1 spent on training, companies spend Rs 10-100 on inference over the model's lifetime. By rough estimates, a single GPT-4 query costs ~Rs 5-10 in compute. Multiply that by the request volume of a popular product.

Think of it like building a Swiggy kitchen. Building the kitchen (training) is a one-time cost. But cooking every order (inference) is the ongoing expense that determines profitability. Optimizing how fast and efficiently you cook each order is what makes or breaks the business.

The Three Pillars of Inference Optimization:

  • Memory Optimization - Quantization reduces model size so it fits in cheaper hardware. Fewer GPUs = lower cost.
  • Throughput Optimization - Batching processes multiple requests together, maximizing GPU utilization per second.
  • Latency Optimization - KV cache, speculative decoding, and attention optimizations reduce time-to-first-token and time-per-token.

Key Metrics to Understand:

  • TTFT (Time to First Token) - How long until the first token appears. Critical for user experience.
  • TPS (Tokens Per Second) - How fast tokens are generated. Affects perceived speed.
  • Throughput - Total tokens/second across all concurrent requests. Determines cost per query.
  • GPU Utilization - Percentage of GPU compute actually being used. Low utilization = wasted money.
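
The first three metrics can be derived from per-token timestamps. The following is a minimal sketch, assuming made-up timestamps in seconds; `RequestTrace` and the helper functions are illustrative names, not part of any serving framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Timing record for one request (all times in seconds, hypothetical)."""
    sent_at: float              # when the client sent the request
    token_times: List[float]    # arrival time of each generated token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first token."""
    return trace.token_times[0] - trace.sent_at


def tokens_per_second(trace: RequestTrace) -> float:
    """Decode speed for one request, measured after the first token."""
    decode_time = trace.token_times[-1] - trace.token_times[0]
    return (len(trace.token_times) - 1) / decode_time


def throughput(traces: List[RequestTrace]) -> float:
    """Total tokens per second across all concurrent requests."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


a = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
b = RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0])

print(ttft(a))                  # 0.5
print(tokens_per_second(a))     # 4.0
print(tokens_per_second(b))     # 2.0
print(throughput([a, b]))       # 4.5
```

Note that per-request TPS and system throughput are different numbers: a server can show high throughput while each individual user still sees slow token generation.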

Note: Companies like OpenAI, Anthropic, and Google spend more on inference infrastructure than on training. Inference optimization directly impacts profitability and user experience.

Quantization - Shrink Without Breaking

From 16-bit to 4-bit: 4x Smaller, Nearly Same Quality

What is Quantization?

Models store their knowledge as billions of numbers (weights). Typically, each number is stored with 16 bits of precision (FP16 or BF16). Quantization reduces this precision to 8-bit, 4-bit, or even 2-bit, dramatically shrinking model size and memory usage.

Analogy: Imagine a high-resolution photo (16-bit). You can compress it to JPEG (4-bit equivalent) - file size drops 4x, and unless you zoom in very closely, you cannot tell the difference. That is quantization for model weights.
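
The size reduction is simple arithmetic: parameters times bits per weight. A rough sketch, assuming a 70B-parameter model and ignoring the small overhead of quantization scales and any layers kept at higher precision (`weight_memory_gb` is an illustrative helper name):

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"70B model at {bits}-bit: ~{weight_memory_gb(70e9, bits):.0f} GB")
# 70B model at 16-bit: ~140 GB
# 70B model at 8-bit: ~70 GB
# 70B model at 4-bit: ~35 GB
```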

Types of Quantization:

  • Post-Training Quantization (PTQ) - Apply quantization after training. No retraining needed. Most common approach (GPTQ, AWQ, and the quantized GGUF files used by llama.cpp).
  • Quantization-Aware Training (QAT) - Train the model knowing it will be quantized. Better quality but requires full training run.
  • Dynamic Quantization - Quantize on the fly during inference. Simplest to apply, but the runtime conversion adds overhead, so it is usually the least efficient option.
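
To make the idea concrete, here is a minimal sketch of symmetric (absmax) quantization, the simplest post-training scheme. It is not how GPTQ or AWQ work internally; those methods add calibration data and per-group scales. The weight values are made up for illustration.

```python
from typing import List, Tuple


def quantize_absmax(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric quantization: map floats to signed integers with one scale."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for 8-bit, 7 for 4-bit
    scale = (max(abs(w) for w in weights) / qmax) or 1.0   # avoid a zero scale
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]    # hypothetical weight values
for bits in (8, 4):
    q, scale = quantize_absmax(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit: ints={q}, max error={max_err:.4f}")
```

The rounding error is bounded by half the scale, which is why going from 8-bit to 4-bit (a much coarser grid) costs more accuracy.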

Popular Quantization Methods:

Method | Type | Best For | Quality
GPTQ | PTQ, GPU-focused | GPU inference with vLLM/TGI | Very Good
AWQ | PTQ, activation-aware | GPU inference, better than GPTQ | Excellent
GGUF | PTQ, CPU/GPU | llama.cpp, Ollama, local use | Good-Excellent
bitsandbytes | Dynamic, 4/8-bit | HuggingFace, easy integration | Good
SmoothQuant | PTQ, W8A8 | Both weights and activations | Very Good

Quality Impact by Bit Width:

  • 8-bit (INT8) - less than 1% quality loss on most benchmarks.
  • 4-bit (INT4) - 1-3% loss, very usable.
  • 3-bit - 5-10% loss, noticeable on complex tasks.
  • 2-bit - significant degradation, only for very simple tasks.

Note: AWQ (Activation-Aware Weight Quantization) is often a strong balance of speed and quality for GPU inference. GGUF is the usual choice for CPU/local use with llama.cpp.

Batching Strategies - Maximize GPU Utilization

Process 100 Requests at the Cost of 10

Why Batching Matters

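A GPU serving one request at a time often uses only 10-20% of its compute capacity, because each decoding step spends most of its time reading weights from memory. The rest is wasted. Batching groups multiple requests together so that one read of the weights serves all of them, pushing utilization toward 80-90%.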

Imagine a Flipkart delivery van going to one house at a time vs. batching 20 deliveries in one route. Same fuel cost, 20x more deliveries. That is batching for GPUs.

Types of Batching:

  • Static Batching - Wait for N requests, process all at once. Simple but adds latency (waiting for batch to fill). Old approach.
  • Dynamic Batching - Set a timeout (e.g., 50ms). Batch whatever requests arrive within that window. Better latency than static.
  • Continuous Batching (Iteration-Level) - The game-changer used by vLLM and TGI. New requests join the batch at every token generation step. No waiting, no wasted GPU cycles. As one request finishes, another immediately takes its slot.
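
Dynamic batching can be sketched as grouping requests by arrival time. This is an illustrative toy, assuming arrival times in milliseconds and a hypothetical `max_batch` cap; real servers do this with queues and timers rather than a sorted list.

```python
from typing import List


def dynamic_batches(arrivals_ms: List[float], window_ms: float,
                    max_batch: int) -> List[List[float]]:
    """Group arrival times: a batch closes when its window expires
    (measured from its first request) or when it reaches max_batch."""
    batches: List[List[float]] = []
    for t in sorted(arrivals_ms):
        current = batches[-1] if batches else None
        if current and t - current[0] < window_ms and len(current) < max_batch:
            current.append(t)
        else:
            batches.append([t])
    return batches


print(dynamic_batches([0, 10, 45, 60, 70, 200], window_ms=50, max_batch=8))
# [[0, 10, 45], [60, 70], [200]]
```

The request arriving at 200 ms runs alone: a short window keeps latency low but can produce small batches when traffic is sparse.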

Continuous Batching Explained:

Traditional batching waits for all requests in a batch to finish before starting new ones. If Request A needs 10 tokens and Request B needs 1000 tokens, Request A finishes after 10 steps, but its slot sits idle for the remaining 990 steps while B completes. Wasted GPU time!

Continuous batching lets Request A leave the batch as soon as it finishes, and a new Request C immediately takes its place. The GPU is never idle as long as there are pending requests.
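
The Request A / Request B scenario can be simulated by counting decode steps. This is a toy model, assuming every step generates one token per active request and that output lengths are known in advance (a real scheduler does not know them); the function names are illustrative.

```python
from collections import deque
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: List[int], batch_size: int) -> int:
    """A waiting request takes over a slot as soon as one frees up."""
    waiting = deque(lengths)
    active: List[int] = []      # tokens still to generate per active request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        active = [remaining - 1 for remaining in active]    # one token each
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


lengths = [10, 1000, 10, 1000]      # hypothetical output lengths in tokens
print(static_batching_steps(lengths, batch_size=2))       # 2000
print(continuous_batching_steps(lengths, batch_size=2))   # 1020
```

With two slots and 2020 tokens of total work, the best possible result is 1010 steps, so continuous batching lands close to the ideal while static batching takes nearly twice as long.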

Impact Numbers:

Strategy | GPU Utilization | Throughput vs Single
No Batching | 10-20% | 1x (baseline)
Static Batching (BS=8) | 40-60% | 3-5x
Continuous Batching | 80-95% | 8-20x

Note: Continuous batching is the single biggest throughput optimization for LLM serving. vLLM and TGI implement it by default. If you are serving multiple users, this is non-negotiable.

KV Cache - Memory vs Speed Trade-off

The Hidden Memory Hog That Makes LLMs Fast

What is KV Cache?

When an LLM generates tokens, it computes Key and Value matrices for the attention mechanism at each layer. Without caching, it would recompute these for all previous tokens at every step. The KV cache stores these computed values so they are only calculated once.

Analogy: Imagine writing an exam where at each question, you must re-read the entire question paper from the beginning. That is inference without KV cache. With KV cache, you just remember what you already read and focus on the new part only.
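
The saving can be shown by counting how many Key/Value projections are computed while generating a sequence. This is a toy sketch: `KVProjector` is a hypothetical stand-in for the real per-layer projection, and the returned numbers are placeholders for tensors.

```python
from typing import List, Tuple


class KVProjector:
    """Stand-in for the per-token Key/Value projection; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def project(self, token: int) -> Tuple[float, float]:
        self.calls += 1
        return token * 0.5, token * 2.0     # placeholder for real K and V


def decode_without_cache(tokens: List[int]) -> int:
    """Recompute K/V for every previous token at every step."""
    projector = KVProjector()
    for step in range(1, len(tokens) + 1):
        keys_values = [projector.project(t) for t in tokens[:step]]
    return projector.calls


def decode_with_cache(tokens: List[int]) -> int:
    """Compute K/V once per token and keep it in the cache."""
    projector = KVProjector()
    cache: List[Tuple[float, float]] = []
    for t in tokens:
        cache.append(projector.project(t))  # only the newest token is projected
    return projector.calls


tokens = list(range(100))
print(decode_without_cache(tokens))   # 5050
print(decode_with_cache(tokens))      # 100
```

Without the cache the work grows quadratically with sequence length; with it, the work grows linearly, at the price of keeping every cached entry in memory.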

The KV Cache Memory Problem:

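KV cache memory grows with: 2 (Keys and Values) x batch size x sequence length x number of layers x number of KV heads x head dimension x bytes per value. Without GQA, number of KV heads x head dimension equals the hidden dimension. For a 70B model with 128K context, KV cache alone can use 40+ GB of GPU memory per request!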

  • Llama 3 8B, 4K context - ~0.5 GB KV cache per request
  • Llama 3 70B, 4K context - ~1.25 GB KV cache per request
  • Llama 3 70B, 128K context - ~40 GB KV cache per request
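
Per-request KV cache size follows from the model shape. A small calculator sketch, assuming FP16 values (2 bytes each) and the published Llama 3 shapes: 32 layers for 8B, 80 layers for 70B, and 8 KV heads of dimension 128 in both. `kv_cache_gib` is an illustrative helper name.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, batch_size: int = 1,
                 bytes_per_value: int = 2) -> float:
    """KV cache size in GiB. The leading 2 accounts for Keys and Values."""
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_value)
    return total_bytes / 2**30


llama3_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128)
llama3_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128)

print(kv_cache_gib(**llama3_8b, seq_len=4096))       # 0.5
print(kv_cache_gib(**llama3_70b, seq_len=4096))      # 1.25
print(kv_cache_gib(**llama3_70b, seq_len=131072))    # 40.0
```

Setting `bytes_per_value=1` models an FP8 or INT8 cache and halves every figure.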

KV Cache Optimization Techniques:

  • PagedAttention (vLLM) - Allocate KV cache in pages instead of contiguous blocks. Eliminates memory fragmentation.
  • Grouped Query Attention (GQA) - Share KV heads across multiple query heads. Llama 3 uses this, reducing KV cache by 4-8x.
  • Multi-Query Attention (MQA) - All query heads share one KV head. Maximum savings but slight quality trade-off.
  • KV Cache Quantization - Store KV cache in FP8 or INT8 instead of FP16. Halves memory with minimal quality impact.
  • Sliding Window Attention - Only keep KV cache for recent N tokens. Used by Mistral. Limits context but saves massive memory.
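
The paging idea can be sketched as a block allocator with a per-sequence block table. This is a toy in the spirit of PagedAttention, not vLLM's actual implementation; the class and method names are hypothetical.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Toy KV block allocator: sequences own lists of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size                    # tokens per block
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}    # sequence -> block ids
        self.lengths: Dict[str, int] = {}               # sequence -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, taking a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:      # all owned blocks are full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("req-A")     # 20 tokens occupy 2 blocks
for _ in range(5):
    alloc.append_token("req-B")     # 5 tokens occupy 1 block
print(len(alloc.free_blocks))       # 5
```

Because blocks need not be contiguous, a sequence wastes at most one partially filled block instead of a reservation sized for the maximum context length.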

Prefill vs Decode Phases:

  • Prefill Phase - Process all input tokens at once (compute-bound). This determines TTFT. Highly parallelizable.
  • Decode Phase - Generate tokens one at a time (memory-bound). This determines TPS. Bottleneck is reading the model weights and KV cache from memory at every step.
  • Key Insight - Prefill and decode have opposite bottlenecks. Advanced systems like Splitwise and DistServe separate them onto different hardware.
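
A back-of-the-envelope bound shows why decode is memory-bound and why batching helps. This sketch assumes memory reads are the only cost (compute and communication are ignored), and all hardware and model numbers are hypothetical.

```python
def decode_throughput_bound(weight_bytes: float, kv_bytes_per_request: float,
                            bandwidth_bytes_per_s: float,
                            batch_size: int = 1) -> float:
    """Upper bound on decode tokens/second when memory reads dominate.

    Each decode step reads the weights once (shared by the whole batch) plus
    every request's KV cache, and produces one token per request.
    """
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_request
    step_time_s = bytes_per_step / bandwidth_bytes_per_s
    return batch_size / step_time_s


# Hypothetical setup: 16 GB of FP16 weights, 0.5 GB of KV cache per request,
# and 2 TB/s of memory bandwidth.
for batch in (1, 8, 32):
    bound = decode_throughput_bound(16e9, 0.5e9, 2e12, batch)
    print(f"batch={batch}: at most ~{bound:.0f} tokens/s")
```

Throughput rises with batch size because the weight read is amortized, but the gain flattens as the per-request KV cache reads start to dominate.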

Note: KV cache is often the biggest memory consumer during inference - larger than the model weights themselves for long contexts. Optimizing it is critical for scaling.

Advanced Optimization Techniques

Next-Level Performance Tricks

Speculative Decoding - Draft and Verify

Use a small fast model (draft model, e.g., 1B params) to generate N candidate tokens quickly. Then the large model (e.g., 70B) verifies all N tokens in one forward pass (parallel verification). The large model keeps the longest run of draft tokens it agrees with (draft guesses are typically right 60-80% of the time) and supplies its own token at the first mismatch. When every guess is right, you get N tokens for roughly the cost of one large model forward pass plus the cheap draft passes.

Like a junior dev writing code and a senior dev reviewing it. The junior is fast, the senior only needs to verify, not write from scratch. If the junior is good, the team is faster than the senior working alone.
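
One draft-and-verify round can be sketched with greedy decoding. This is a simplified illustration: the two "models" are toy functions, the verification loop calls the target once per position where a real system would score all positions in a single batched forward pass, and sampling-based variants use a probabilistic acceptance rule instead of exact matching.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken,
                     num_draft: int) -> List[int]:
    """Return the tokens to append after one draft-and-verify round."""
    # 1. The draft model proposes num_draft tokens autoregressively.
    proposed: List[int] = []
    for _ in range(num_draft):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each position in order.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)   # replace the first wrong guess and stop
            return accepted
        accepted.append(token)

    # 3. Every guess matched, so the target also contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def target_model(tokens: List[int]) -> int:
    """Toy target: always counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_model(tokens: List[int]) -> int:
    """Toy draft: agrees with the target except right after token 4."""
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


print(speculative_step([1], draft_model, target_model, num_draft=4))  # [2, 3, 4, 5]
print(speculative_step([5], draft_model, target_model, num_draft=3))  # [6, 7, 8, 9]
```

In both calls the result is exactly what the target model would have generated on its own, which is why the technique does not change output quality.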

Flash Attention

Standard attention computes the full NxN attention matrix, which is slow and memory-hungry. Flash Attention (by Tri Dao) restructures the computation to be IO-aware - it tiles the computation to fit in GPU SRAM (fast cache) instead of going to HBM (slow main memory).

  • 2-4x faster attention computation
  • Memory usage goes from O(N squared) to O(N)
  • Now built into, or enabled by default in, most major frameworks
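
The trick that makes tiling possible is an "online" softmax that processes scores block by block while keeping only a running maximum, sum, and output. The sketch below shows it for a single query with scalar values. It illustrates the math only; the real Flash Attention is a fused GPU kernel operating on tiles of Q, K, and V.

```python
import math
from typing import List


def naive_attention(scores: List[float], values: List[float]) -> float:
    """Softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def tiled_attention(scores: List[float], values: List[float],
                    block_size: int) -> float:
    """Same result, computed one block at a time with running statistics."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator so far
    running_out = 0.0   # unnormalized weighted sum so far
    for start in range(0, len(scores), block_size):
        block_s = scores[start:start + block_size]
        block_v = values[start:start + block_size]
        new_max = max(running_max, max(block_s))
        rescale = math.exp(running_max - new_max)   # fix up earlier partial results
        weights = [math.exp(s - new_max) for s in block_s]
        running_sum = running_sum * rescale + sum(weights)
        running_out = running_out * rescale + sum(
            w * v for w, v in zip(weights, block_v))
        running_max = new_max
    return running_out / running_sum


scores = [0.1, 2.0, -1.0, 3.5, 0.7, 1.2]    # hypothetical query-key scores
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
difference = abs(naive_attention(scores, values)
                 - tiled_attention(scores, values, block_size=2))
print(difference < 1e-9)    # True
```

Because only one block is held at a time, the full row of attention weights never has to be written out, which is where the memory saving comes from.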

Other Key Optimizations:

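  • Tensor Parallelism - Split each layer's weight matrices across GPUs. Each GPU computes part of every layer, then the partial results are combined. Reduces per-GPU memory and latency, at the cost of frequent inter-GPU communication.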
  • Pipeline Parallelism - Different layers on different GPUs. Less communication but higher latency per request.
  • Prefix Caching - Cache KV for common system prompts. If 1000 users share the same system prompt, compute it once.
  • Token Pruning - Skip less important tokens in attention. Reduces compute for long contexts.
  • Structured Outputs - Constrain generation to valid JSON/format. Reduces wasted tokens and retries.
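
Prefix caching in particular is easy to sketch: key the stored state by a hash of the prefix tokens and run prefill only on a miss. This is a toy illustration with a hypothetical `PrefixCache` class; the doubled token list is a placeholder for real K/V tensors, and production systems typically cache at block granularity.

```python
import hashlib
from typing import Dict, List


class PrefixCache:
    """Toy prefix cache: reuse the stored state of a shared prompt prefix."""

    def __init__(self) -> None:
        self.store: Dict[str, List[int]] = {}
        self.prefill_calls = 0

    def _key(self, tokens: List[int]) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def _prefill(self, tokens: List[int]) -> List[int]:
        """Stand-in for the expensive prefill pass over the prefix."""
        self.prefill_calls += 1
        return [t * 2 for t in tokens]      # placeholder for real K/V tensors

    def get_prefix_state(self, prefix_tokens: List[int]) -> List[int]:
        key = self._key(prefix_tokens)
        if key not in self.store:
            self.store[key] = self._prefill(prefix_tokens)
        return self.store[key]


cache = PrefixCache()
system_prompt = [101, 7, 42, 9]             # hypothetical token ids
for _ in range(1000):
    cache.get_prefix_state(system_prompt)   # 1000 users, same system prompt
print(cache.prefill_calls)                  # 1
```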

Note: Speculative decoding can give 2-3x speedup with no loss in output quality, because the large model verifies every token it emits. It is one of the most impactful recent innovations in LLM inference.

Interview Questions

Q: What is the difference between quantization, pruning, and distillation?

Quantization reduces the precision of model weights (FP16 to INT4). Pruning removes less important weights entirely (making the model sparse). Distillation trains a smaller model to mimic a larger one. Quantization is easiest to apply post-training. Pruning requires careful selection of which weights to remove. Distillation requires training a new model but can achieve the best size/quality ratio.

Q: Explain continuous batching and why it is better than static batching.

Static batching waits for all requests in a batch to complete before processing new ones, wasting GPU cycles when shorter requests finish first. Continuous batching operates at the iteration (token) level - as soon as one request finishes generating, a new request immediately takes its slot. This keeps GPU utilization at 80-95% vs 40-60% for static batching, giving 2-4x higher throughput.

Q: What is the KV cache and why does it consume so much memory?

The KV cache stores pre-computed Key and Value tensors from the attention mechanism. Without it, the model would recompute Keys and Values for all previous tokens at each generation step. Memory grows as: 2 x batch_size x seq_length x num_layers x num_kv_heads x head_dim x bytes_per_value. For long contexts (128K tokens), KV cache can exceed model weight memory. Techniques like GQA, MQA, PagedAttention, and KV cache quantization help manage this.

Q: How does speculative decoding work and when would you use it?

A small draft model generates N candidate tokens quickly. The large target model then verifies all N tokens in a single forward pass (since verification is parallel, unlike autoregressive generation). The longest run of matching tokens is kept; at the first mismatch the target model's own token is used and the remaining draft tokens are discarded. This gives 2-3x speedup with no loss in output quality. Best when you need low latency for a large model and have a good smaller draft model available.

Q: What is Flash Attention and why is it important?

Flash Attention restructures the attention computation to be IO-aware, tiling work into GPU SRAM (fast) instead of repeatedly accessing HBM (slow). It reduces attention memory from O(N squared) to O(N) and is 2-4x faster. It enables training and inference with much longer contexts. It is now available in PyTorch, HuggingFace, and most major serving frameworks.
