Context Windows, Parameters & Model Sizes

Understanding the Numbers That Define LLM Capabilities

Learn what context windows, parameter counts, and model sizes mean in practice. Make informed decisions about which model to use for your application.

Context Windows Explained

The Memory Limit of an LLM

What is a Context Window?

The context window is the maximum number of tokens an LLM can process at once, including both input (prompt) and output (response). Think of it as the model's working memory - everything it can see and think about at one time.

Analogy: The context window is like a desk. You can only spread out so many papers on your desk. If you need to work with a 500-page document but your desk only fits 100 pages, you have a problem. Bigger context window = bigger desk.

Current Context Window Sizes:

Model	Context Window	Approx. Pages
GPT-3.5	16K tokens	~20 pages
GPT-4o	128K tokens	~160 pages
Claude 3.5 Sonnet	200K tokens	~250 pages
Gemini 1.5 Pro	1M tokens	~1250 pages
Gemini 2.0 Pro (experimental)	2M tokens	~2500 pages

Important: Context Window is Shared!

If a model has 128K token context:

Your prompt uses 100K tokens (system prompt + user message + context)
Only 28K tokens left for the response
If you ask for a very long response, it gets truncated

Common mistake: Stuffing the entire context window with input, leaving no room for the output.

Note: The context window covers input and output combined. A 128K context window does NOT mean 128K input + 128K output, and many APIs also cap the output at a much smaller number of tokens. Plan your token budget accordingly.
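
The shared budget described above can be sketched as simple arithmetic. This is a minimal illustration, not any provider's API: the function name and the optional `max_output_cap` parameter (modelling a separate provider-side output limit) are assumptions made for the example.

```python
from typing import Optional


def remaining_output_tokens(context_window: int,
                            prompt_tokens: int,
                            max_output_cap: Optional[int] = None) -> int:
    """Tokens left for the response once the prompt has been counted."""
    if prompt_tokens > context_window:
        raise ValueError("prompt alone exceeds the context window")
    remaining = context_window - prompt_tokens
    # Assumption: some providers cap output separately; honor the smaller limit.
    if max_output_cap is not None:
        remaining = min(remaining, max_output_cap)
    return remaining


# The 128K example from the text: a 100K-token prompt leaves 28K for output.
print(remaining_output_tokens(context_window=128_000, prompt_tokens=100_000))  # 28000
```

Running a check like this before each call makes it easier to notice when a prompt has grown so large that the response is likely to be truncated.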

Parameters - What They Are and Why They Matter

The Numbers That Store a Model's Knowledge

What Are Parameters?

Parameters are the learned numerical weights in a neural network. Every weight and bias in every layer is a parameter. When a model is described as having 1.8 trillion parameters (the rumored, unconfirmed figure for GPT-4), it means there are 1.8 trillion numbers that were adjusted during training.

Analogy: Think of parameters as the connections (synapses) in a brain rather than the neurons themselves. More parameters = more capacity to store knowledge and learn complex patterns. For scale, a cockroach brain has ~1 million neurons and a human brain has ~86 billion neurons, with on the order of 100 trillion synapses. GPT-4 is rumored to have ~1.8 trillion parameters.

How Parameter Count Affects Capability:

More parameters = more knowledge capacity: Can store more facts, patterns, and relationships
More parameters = better reasoning: Larger models show emergent abilities like chain-of-thought reasoning
More parameters = more expensive: Need more GPU memory, slower inference, higher cost
Diminishing returns: Going from 7B to 70B is a huge jump. Going from 70B to 700B brings a smaller relative improvement

Parameter Count of Popular Models:

Model	Parameters	GPU Memory (FP16)
Mistral 7B	7.3 billion	~14 GB
LLaMA 3 8B	8 billion	~16 GB
LLaMA 3 70B	70 billion	~140 GB
LLaMA 3.1 405B	405 billion	~810 GB
GPT-4 (rumored)	~1.8T (MoE)	~3.6 TB (estimated from the rumored count)

Rule of thumb: An FP16 model needs ~2 bytes per parameter. So a 7B model needs ~14 GB of GPU memory just for weights, plus more for the KV cache during inference.
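
The rule of thumb is plain arithmetic, so it is easy to sketch. The helper below is illustrative only (its name is made up for this example) and assumes 1 GB = 10^9 bytes. It covers weights only, not the KV cache or activations.

```python
def weight_memory_gb(num_params: float, bits_per_param: int = 16) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


# Parameter counts taken from the table above.
for name, params in [("Mistral 7B", 7.3e9), ("LLaMA 3 8B", 8e9), ("LLaMA 3 70B", 70e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB at FP16")
```

Passing a different `bits_per_param` (for example 8 or 4) gives the corresponding estimate for a lower-precision copy of the same model.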

Note: Parameter count is a rough proxy for capability, but architecture and training data quality matter too. A well-trained 8B model can outperform a poorly trained 70B model on specific tasks.

Model Size, Quantization & Running Models Locally

Making Large Models Fit on Your Hardware

The Memory Challenge:

A 70B parameter model at FP16 (2 bytes per param) needs 140 GB of GPU memory just for weights. A top-end consumer GPU such as the RTX 4090 has 24 GB. How do people run these models locally?

The answer: Quantization - reducing the precision of each parameter to use fewer bits.

Quantization Levels:

Precision	Bits/Param	7B Model Size	Quality Loss
FP32 (full)	32 bits	~28 GB	None (reference)
FP16 / BF16	16 bits	~14 GB	Negligible
INT8 (Q8)	8 bits	~7 GB	Minimal
INT4 (Q4)	4 bits	~3.5 GB	Small but noticeable
INT2 (Q2)	2 bits	~1.75 GB	Significant

Sweet spot: Q4 (4-bit) quantization keeps most of the original quality (roughly 95% by common rules of thumb) at about 1/4 the memory of FP16. Most local setups use this.
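
To see why fewer bits means more quality loss, here is a minimal sketch of symmetric quantization on a handful of made-up weights. It is a simplification: real formats such as GGUF quantize in small blocks with per-block scales, and the function names and sample values here are assumptions for illustration.

```python
from typing import List, Tuple


def quantize(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-(2^(bits-1) - 1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.82, -0.41, 0.07, -0.95, 0.33, 0.002]  # hypothetical sample values
for bits in (8, 4, 2):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: max abs error = {max_err:.4f}")
```

The rounding error is bounded by half the scale, and the scale grows as the number of bits shrinks, which is the mechanism behind the "Quality Loss" column in the table.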

Tools for Running Models Locally:

Ollama: Easiest way to run LLMs locally. One-command install, supports GGUF quantized models
llama.cpp: CPU/GPU inference engine. Supports aggressive quantization. Powers Ollama
vLLM: High-throughput serving for production. PagedAttention for efficient memory use
text-generation-inference (TGI): Hugging Face's production serving solution
LM Studio: Desktop app with GUI for downloading and chatting with local models

What Can Run on What?

Hardware	Max Model (Q4)
8 GB RAM laptop	3B-7B (slow, CPU)
16 GB RAM MacBook	7B-13B (decent on M-series)
RTX 3060 12GB	7B-13B (fast)
RTX 4090 24GB	13B-34B (fast)
2x A100 80GB	70B (production speed)

Note: With 4-bit quantization, you can run a 7B model on most modern laptops and a 13B model on a gaming GPU. Quality is remarkably close to the full-precision model.

Choosing the Right Model for Your Use Case

Practical Decision Framework

When to Use Small Models (7B-13B):

Simple tasks: Summarization, classification, extraction from structured data
Cost-sensitive: High volume, low complexity queries
Latency-critical: Real-time applications needing fast responses
Privacy-required: Run locally, no data leaves your infrastructure
Example: A Swiggy chatbot that answers FAQs about order status, refunds, delivery times

When to Use Large Models (70B+ / Frontier APIs):

Complex reasoning: Multi-step logic, analysis, planning
Code generation: Writing, debugging, explaining complex code
Creative writing: High-quality content that needs nuance
Long context: Processing entire documents, codebases
Example: Code review tool that analyzes entire PRs and suggests improvements

Cost Comparison (API Pricing):

Model	Input (per 1M tokens)	Output (per 1M tokens)
GPT-4o mini	$0.15	$0.60
Claude 3.5 Haiku	$0.80	$4.00
GPT-4o	$2.50	$10.00
Claude 3.5 Sonnet	$3.00	$15.00
GPT-4 (legacy)	$30.00	$60.00
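
A quick way to compare these prices is to compute the cost of a typical request. The sketch below copies the prices from the table. The workload of 2,000 input and 500 output tokens per request is a hypothetical figure chosen only for illustration.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "GPT-4 (legacy)": (30.00, 60.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at the listed per-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2,000 input and 500 output tokens per request.
for model in PRICES:
    per_thousand = request_cost(model, 2_000, 500) * 1_000
    print(f"{model}: ${per_thousand:.2f} per 1,000 requests")
```

Because output tokens are priced several times higher than input tokens, response length often matters more to the bill than prompt length.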

Pro tip: Use a small, fast model for initial filtering/classification, then route complex queries to a larger model. When most of your traffic is simple, this can cut costs by 80% or more.

Note: The best strategy is often a model cascade: cheap model for easy tasks, expensive model for hard tasks. Route based on query complexity to optimize cost without sacrificing quality.
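
A rough sketch of the cascade idea follows. The complexity score, the threshold, and the per-request costs are all hypothetical inputs. In practice the score would come from a classifier or heuristic that you would have to build and evaluate, and running that classifier has its own cost, which this sketch ignores.

```python
def pick_model(complexity: float, threshold: float = 0.5) -> str:
    """Route by a complexity score in [0, 1]; the scoring itself is out of scope here."""
    return "large" if complexity >= threshold else "small"


def cascade_savings(small_cost: float, large_cost: float, easy_fraction: float) -> float:
    """Fraction of spend saved versus sending every request to the large model."""
    blended = easy_fraction * small_cost + (1 - easy_fraction) * large_cost
    return 1 - blended / large_cost


print(pick_model(0.2), pick_model(0.9))  # small large

# Hypothetical per-request costs in USD, with 85% of traffic classed as easy.
saved = cascade_savings(small_cost=0.0006, large_cost=0.01, easy_fraction=0.85)
print(f"Estimated savings: {saved:.0%}")
```

With these made-up numbers the saving comes out at about 80%. The result depends almost entirely on what share of queries the small model can handle well.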

Context Window Limitations and Lost-in-the-Middle

The Hidden Problems with Large Context Windows

Problem 1: Lost in the Middle

Research shows that LLMs tend to use information at the beginning and end of the context window most reliably, while information placed in the middle is more likely to be overlooked. Even with a 128K context, a model may miss critical details buried in the middle.

Implication: When doing RAG, put the most relevant documents first and last, not in the middle. Structure your prompts with key information at the beginning.
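
One way to apply this when assembling a RAG prompt is to interleave ranked documents so the strongest ones sit at the edges. This is a minimal sketch, assuming the input list is already sorted from most to least relevant. The function name is made up for the example.

```python
from typing import List


def reorder_for_long_context(docs_by_relevance: List[str]) -> List[str]:
    """Put the most relevant documents first and last, the weakest in the middle."""
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        # Alternate: ranks 0, 2, 4, ... fill the front; ranks 1, 3, 5, ... fill the back.
        if rank % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


docs = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(reorder_for_long_context(docs))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

The two best documents end up in the first and last positions, and the least relevant one lands in the middle, where an overlooked detail costs the least.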

Problem 2: Longer Context = Higher Cost + Latency

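Self-attention is O(n^2) in context length, so doubling the context quadruples the attention compute. The attention work for one 128K prompt is about 4x that of four separate 32K prompts, even though the total token count is the same, because the per-token attention cost grows with context length. Other parts of the model scale linearly, so the end-to-end gap is smaller, but long prompts are still slower and more expensive to serve.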
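
The quadratic scaling can be checked with a few lines of arithmetic. This sketch counts token-pair interactions only and ignores constants, the linear parts of the model, and optimizations such as KV caching.

```python
def attention_pairs(n_tokens: int) -> int:
    """Token-pair interactions in full self-attention, ignoring constant factors."""
    return n_tokens ** 2


# Doubling the context quadruples the attention work.
print(attention_pairs(64_000) / attention_pairs(32_000))  # 4.0

# One 128K prompt versus four separate 32K prompts (same total tokens).
one_long = attention_pairs(128_000)
four_short = 4 * attention_pairs(32_000)
print(one_long / four_short)  # 4.0
```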

Problem 3: Quality Degrades with Context Length

Even though a model supports 128K tokens, performance on reasoning tasks typically degrades as you approach the limit. Models are most reliable within their "effective context window" which is often smaller than the advertised maximum.

Practical Guidelines:

Use RAG over stuffing: Retrieve relevant chunks instead of dumping entire documents
Prioritize information: Most important context goes first and last
Summarize when possible: Compress long documents before including them
Test at scale: Your model may work great at 4K tokens but fail at 64K
Monitor costs: Large context calls can be surprisingly expensive
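
The "RAG over stuffing" guideline can be sketched as a greedy selection under a token budget. The relevance scores and token counts below are hypothetical. In a real pipeline they would come from your retriever and tokenizer.

```python
from typing import List, Tuple

Chunk = Tuple[str, float, int]  # (text, relevance score, token count)


def select_chunks(chunks: List[Chunk], token_budget: int) -> Tuple[List[str], int]:
    """Greedily keep the most relevant chunks that still fit in the budget."""
    selected: List[str] = []
    used = 0
    for text, _score, tokens in sorted(chunks, key=lambda c: c[1], reverse=True):
        if used + tokens <= token_budget:
            selected.append(text)
            used += tokens
    return selected, used


chunks = [
    ("refund policy", 0.92, 400),
    ("delivery times", 0.75, 700),
    ("company history", 0.20, 900),
    ("order status", 0.88, 300),
]
print(select_chunks(chunks, token_budget=1_500))
# (['refund policy', 'order status', 'delivery times'], 1400)
```

The low-relevance chunk is dropped instead of being squeezed in, which keeps the prompt short and leaves room for the response.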

Note: A model advertising 128K context does not mean it performs equally well at all lengths. Test your specific use case at realistic context lengths. RAG is usually better than context stuffing.

Interview Questions

Q: What is a context window and why does it matter?

The context window is the maximum tokens (input + output) an LLM can process at once. It matters because it limits how much information the model can reason about simultaneously. Longer context = ability to process larger documents, but also higher cost (O(n^2) attention), potential quality degradation, and the "lost in the middle" problem where the model misses information in the middle of long contexts.

Q: What is quantization and when would you use it?

Quantization reduces the bit-precision of model parameters (e.g., FP16 to INT4) to decrease memory requirements and increase inference speed. A 7B model at Q4 needs only ~3.5 GB instead of 14 GB at FP16. Use it when running models locally, on edge devices, or when you need to reduce serving costs. The trade-off is a small quality degradation, with Q4 retaining about 95% of original quality.

Q: How do you choose between a 7B and a 70B model?

Consider: (1) Task complexity - simple tasks (classification, extraction) work with 7B; complex reasoning needs 70B+. (2) Latency - 7B can be roughly 10x faster on the same hardware. (3) Cost - 7B is much cheaper to serve. (4) Privacy - 7B can run locally. Best approach: model cascade - route easy queries to 7B, hard queries to 70B.

Q: What is the "lost in the middle" problem?

LLMs pay more attention to information at the beginning and end of the context window. Critical details placed in the middle are more likely to be overlooked. Mitigations: place important context at start/end, use RAG with reranking to prioritize relevant chunks, summarize long documents, and test retrieval quality at realistic context lengths.

Q: What is Mixture of Experts (MoE) and why is it important?

MoE is an architecture where the model has many "expert" sub-networks but only activates a few for each token. GPT-4 reportedly has 1.8T total parameters but only ~280B are active per token. This gives the knowledge capacity of a huge model with the inference cost of a smaller one. It is key to making large models economically viable to serve.
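
The routing idea can be illustrated with a toy example. This is a sketch of generic top-k routing, not GPT-4's actual design, which is not public. The experts are stand-in scalar functions and the gate logits are hard-coded, whereas a real model computes them from the token with a learned router.

```python
import math
from typing import Callable, List, Tuple


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float,
                gate_logits: List[float],
                experts: List[Callable[[float], float]],
                top_k: int = 2) -> Tuple[float, List[int]]:
    """Run only the top_k experts for this token and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize the gate scores over the chosen experts only.
    weights = softmax([gate_logits[i] for i in chosen])
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen


# Toy experts; the other two are never evaluated for this token.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x ** 2]
gate_logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router output
output, used = moe_forward(3.0, gate_logits, experts, top_k=2)
print(used, output)  # [1, 3] 7.5
```

All four experts count toward the total parameter count, but only two of them do any work per token, which is where the inference savings come from.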

