Context Windows, Parameters & Model Sizes
Understanding the Numbers That Define LLM Capabilities
Learn what context windows, parameter counts, and model sizes mean in practice. Make informed decisions about which model to use for your application.
Context Windows Explained
The Memory Limit of an LLM
What is a Context Window?
The context window is the maximum number of tokens an LLM can process at once, including both input (prompt) and output (response). Think of it as the model's working memory - everything it can see and think about at one time.
Analogy: The context window is like a desk. You can only spread out so many papers on your desk. If you need to work with a 500-page document but your desk only fits 100 pages, you have a problem. Bigger context window = bigger desk.
Context Window Sizes of Popular Models (limits change with new releases):
| Model | Context Window | Approx. Pages |
|---|---|---|
| GPT-3.5 | 16K tokens | ~20 pages |
| GPT-4o | 128K tokens | ~160 pages |
| Claude 3.5 Sonnet | 200K tokens | ~250 pages |
| Gemini 1.5 Pro | 1M tokens | ~1250 pages |
| Gemini 2.0 | 2M tokens | ~2500 pages |
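The page counts in the table follow from a simple ratio. Here is a minimal sketch, assuming about 800 tokens per page (the ratio the table implies: 128K tokens for ~160 pages). The function names are illustrative, and real page density varies with formatting and language.

```python
# Assumption: ~800 tokens per page, the ratio implied by the table above.
TOKENS_PER_PAGE = 800


def approx_pages(context_tokens: int, tokens_per_page: int = TOKENS_PER_PAGE) -> int:
    """Rough number of pages that fit in a context window."""
    return context_tokens // tokens_per_page


def fits_in_context(document_pages: int, context_tokens: int,
                    tokens_per_page: int = TOKENS_PER_PAGE) -> bool:
    """True if a document of the given page count fits entirely in the window."""
    return document_pages * tokens_per_page <= context_tokens


if __name__ == "__main__":
    for name, tokens in [("GPT-4o", 128_000), ("Claude 3.5 Sonnet", 200_000)]:
        print(f"{name}: ~{approx_pages(tokens)} pages")
    # The desk analogy: a 500-page document against a 128K-token window.
    print("500 pages fit in 128K:", fits_in_context(500, 128_000))
```

For anything beyond a rough estimate, count tokens with the tokenizer of the model you plan to use instead of relying on a pages ratio.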
Important: Context Window is Shared!
If a model has 128K token context:
- Your prompt uses 100K tokens (system prompt + user message + context)
- Only 28K tokens left for the response
- If you ask for a very long response, it gets truncated
Common mistake: Stuffing the entire context window with input, leaving no room for the output.
Note: The context window is input and output combined. A 128K context window does NOT mean 128K input + 128K output. Many APIs also cap output tokens separately, often well below the window size, so plan your token budget accordingly.
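The budget arithmetic above can be sketched in a few lines. This is an illustrative helper, not any provider's API: the function names and the optional safety margin are assumptions.

```python
def max_output_tokens(context_window: int, prompt_tokens: int,
                      safety_margin: int = 0) -> int:
    """Tokens left for the response once the prompt has been counted."""
    remaining = context_window - prompt_tokens - safety_margin
    return max(remaining, 0)  # a prompt that overflows leaves nothing


def max_prompt_tokens(context_window: int, reserved_for_output: int) -> int:
    """Largest prompt that still leaves the reserved room for the response."""
    if reserved_for_output > context_window:
        raise ValueError("reserved output exceeds the context window")
    return context_window - reserved_for_output


if __name__ == "__main__":
    # The example from the text: 128K window, 100K-token prompt.
    print(max_output_tokens(128_000, 100_000))
    # Planning the other way round: reserve 4K tokens for the answer.
    print(max_prompt_tokens(128_000, 4_000))
```

Deciding the output reservation first and then trimming the prompt to fit avoids the "stuffed window" mistake described above.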
Parameters - What They Are and Why They Matter
The Numbers That Store a Model's Knowledge
What Are Parameters?
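Parameters are the learned numerical weights in a neural network. Every weight and bias in every layer is a parameter. When a report says GPT-4 has roughly 1.8 trillion parameters (a rumored figure that OpenAI has not confirmed), it means there are about 1.8 trillion numbers that were adjusted during training.
Analogy: Think of parameters as the connections (synapses) in a brain rather than the neurons themselves. More parameters = more capacity to store knowledge and learn complex patterns. For scale: a cockroach brain has ~1 million neurons and a human brain has ~86 billion neurons joined by roughly 100 trillion synapses, while GPT-4 is rumored to have ~1.8 trillion parameters. The comparison is loose, since a biological synapse is not equivalent to a single weight.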
How Parameter Count Affects Capability:
- More parameters = more knowledge capacity: Can store more facts, patterns, and relationships
- More parameters = better reasoning: Larger models tend to show abilities that smaller ones lack, such as reliable chain-of-thought reasoning
- More parameters = more expensive: Need more GPU memory, slower inference, higher cost
- Diminishing returns: Going from 7B to 70B is a huge jump. Going from 70B to 700B usually brings a smaller relative improvement
Parameter Count of Popular Models:
| Model | Parameters | GPU Memory (FP16) |
|---|---|---|
| Mistral 7B | 7.3 billion | ~14 GB |
| LLaMA 3 8B | 8 billion | ~16 GB |
| LLaMA 3 70B | 70 billion | ~140 GB |
| LLaMA 3 405B | 405 billion | ~810 GB |
| GPT-4 (rumored) | ~1.8T (MoE) | ~3.6 TB (estimated, all experts) |
Rule of thumb: An FP16 model needs ~2 bytes per parameter. So a 7B model needs ~14 GB of GPU memory just for weights, plus more for the KV cache during inference.
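The rule of thumb, plus a rough KV cache estimate, can be written out as follows. This is a sketch under stated assumptions: the KV cache formula is the standard "keys plus values per layer" count, and the layer and head numbers used in the example are the publicly documented LLaMA 3 8B configuration, which you should verify against the config file of the model you actually run.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes). FP16 = 2 bytes."""
    return num_params * bytes_per_param / 1e9


def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int = 1,
                bytes_per_value: float = 2.0) -> float:
    """Approximate KV cache size in GB.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    """
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / 1e9


if __name__ == "__main__":
    print(f"7B weights at FP16:  {weight_memory_gb(7e9):.1f} GB")
    print(f"70B weights at FP16: {weight_memory_gb(70e9):.1f} GB")
    # Assumed config for LLaMA 3 8B: 32 layers, 8 KV heads, head dim 128.
    cache = kv_cache_gb(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
    print(f"KV cache at 8192 tokens: {cache:.2f} GB")
```

The KV cache grows linearly with sequence length and batch size, which is why long prompts and concurrent requests need headroom beyond the weights.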
Note: Parameter count is a rough proxy for capability, but architecture and training data quality matter too. A well-trained 8B model can outperform a poorly trained 70B model on specific tasks.
Model Size, Quantization & Running Models Locally
Making Large Models Fit on Your Hardware
The Memory Challenge:
A 70B parameter model at FP16 (2 bytes per param) needs 140 GB of GPU memory just for weights. A high-end consumer GPU such as the RTX 4090 has 24 GB. How do people run these models locally?
The answer: Quantization - reducing the precision of each parameter to use fewer bits.
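To make "reducing the precision" concrete, here is a minimal sketch of symmetric absmax quantization with a single shared scale. It is one simple scheme chosen for illustration; production formats such as GGUF typically use per-block scales and more elaborate methods, and the function names here are hypothetical.

```python
def quantize_absmax(weights: list[float], bits: int = 8) -> tuple[list[int], float]:
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats; the rounding error is the quality loss."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 1.0, -0.07]
    for bits in (8, 4):
        q, scale = quantize_absmax(weights, bits)
        restored = dequantize(q, scale)
        worst = max(abs(a - b) for a, b in zip(weights, restored))
        print(f"{bits}-bit: ints={q} worst error={worst:.4f}")
```

Fewer bits mean a coarser grid of representable values, so the worst-case rounding error grows as the bit width shrinks.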
Quantization Levels:
| Precision | Bits/Param | 7B Model Size | Quality Loss |
|---|---|---|---|
| FP32 (full) | 32 bits | ~28 GB | None (reference) |
| FP16 / BF16 | 16 bits | ~14 GB | Negligible |
| INT8 (Q8) | 8 bits | ~7 GB | Minimal |
| INT4 (Q4) | 4 bits | ~3.5 GB | Small but noticeable |
| INT2 (Q2) | 2 bits | ~1.75 GB | Significant |
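Sweet spot: Q4 (4-bit) quantization keeps most of the original quality (often quoted as ~95%, though the exact loss depends on the model, the quantization method, and the benchmark) at 1/4 the memory of FP16. Most local setups use this.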
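The sizes in the table are just parameters times bits divided by 8. A small sketch that reproduces the 7B column; the names are illustrative, and real quantized files come out slightly larger because they also store scales and other metadata.

```python
BITS_PER_PARAM = {
    "FP32": 32,
    "FP16/BF16": 16,
    "INT8 (Q8)": 8,
    "INT4 (Q4)": 4,
    "INT2 (Q2)": 2,
}


def model_size_gb(num_params: float, bits_per_param: int) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring format overhead."""
    return num_params * bits_per_param / 8 / 1e9


def size_table(num_params: float) -> dict[str, float]:
    """Size at each precision level for a model with num_params parameters."""
    return {name: model_size_gb(num_params, bits)
            for name, bits in BITS_PER_PARAM.items()}


if __name__ == "__main__":
    for name, size in size_table(7e9).items():
        print(f"{name:<10} {size:6.2f} GB")
```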
Tools for Running Models Locally:
- Ollama: Easiest way to run LLMs locally. One-command install, supports GGUF quantized models
- llama.cpp: CPU/GPU inference engine. Supports aggressive quantization. Powers Ollama
- vLLM: High-throughput serving for production. PagedAttention for efficient memory use
- text-generation-inference (TGI): Hugging Face's production serving solution
- LM Studio: Desktop app with GUI for downloading and chatting with local models
What Can Run on What?
| Hardware | Max Model (Q4) |
|---|---|
| 8 GB RAM laptop | 3B-7B (slow, CPU) |
| 16 GB RAM MacBook | 7B-13B (decent on M-series) |
| RTX 3060 12GB | 7B-13B (fast) |
| RTX 4090 24GB | 13B-34B (fast) |
| 2x A100 80GB | 70B (production speed) |
Note: With 4-bit quantization, you can run a 7B model on most modern laptops and a 13B model on a gaming GPU. Quality is remarkably close to the full-precision model.
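A rough way to sanity-check the hardware table is to ask how many parameters fit once some memory is set aside for the KV cache and runtime. The sketch below assumes a 20% overhead, which is a placeholder rather than a measured value, and it ignores that laptop RAM is shared with the operating system.

```python
def max_params_billions(memory_gb: float, bits_per_param: int = 4,
                        overhead_fraction: float = 0.2) -> float:
    """Largest model, in billions of parameters, that fits in memory_gb.

    overhead_fraction is an assumed allowance for KV cache and runtime buffers.
    """
    bytes_per_param = bits_per_param / 8
    usable_gb = memory_gb / (1 + overhead_fraction)
    # 1 billion parameters at b bytes each occupy b GB.
    return usable_gb / bytes_per_param


def fits(params_billions: float, memory_gb: float, bits_per_param: int = 4,
         overhead_fraction: float = 0.2) -> bool:
    """True if a model of the given size fits under the same assumptions."""
    limit = max_params_billions(memory_gb, bits_per_param, overhead_fraction)
    return params_billions <= limit


if __name__ == "__main__":
    for name, mem in [("RTX 3060 12GB", 12), ("RTX 4090 24GB", 24),
                      ("2x A100 80GB", 160)]:
        print(f"{name}: up to ~{max_params_billions(mem):.0f}B at Q4")
    print("70B on a 24 GB card:", fits(70, 24))
```

The estimate is an upper bound on what loads, not on what runs comfortably: long contexts enlarge the KV cache and shrink the model size that fits.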
Choosing the Right Model for Your Use Case
Practical Decision Framework
When to Use Small Models (7B-13B):
- Simple tasks: Summarization, classification, extraction from structured data
- Cost-sensitive: High volume, low complexity queries
- Latency-critical: Real-time applications needing fast responses
- Privacy-required: Run locally, no data leaves your infrastructure
- Example: A Swiggy chatbot that answers FAQs about order status, refunds, delivery times
When to Use Large Models (70B+ / Frontier APIs):
- Complex reasoning: Multi-step logic, analysis, planning
- Code generation: Writing, debugging, explaining complex code
- Creative writing: High-quality content that needs nuance
- Long context: Processing entire documents, codebases
- Example: Code review tool that analyzes entire PRs and suggests improvements
Cost Comparison (API list prices in USD at the time of writing; check provider pricing pages for current rates):
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| GPT-4o | $2.50 | $10.00 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| GPT-4 (legacy) | $30.00 | $60.00 |
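To turn per-million-token prices into a per-request figure, multiply each side by its token count. A minimal sketch using the prices from the table above; the dictionary and function names are illustrative, and the prices themselves will drift over time.

```python
# (input, output) price in USD per 1M tokens, copied from the table above.
PRICES = {
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "GPT-4 (legacy)": (30.00, 60.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request with the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Example request: 10,000 input tokens and 1,000 output tokens.
    for model in PRICES:
        print(f"{model:<18} ${request_cost(model, 10_000, 1_000):.4f}")
```

Output tokens cost several times more than input tokens for every model listed, so long responses dominate the bill faster than long prompts.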
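Pro tip: Use a small, fast model for initial filtering/classification, then route complex queries to a larger model. When most traffic is simple, this can cut costs by 80% or more; the actual saving depends on the share of queries that still need the larger model.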
Note: The best strategy is often a model cascade: cheap model for easy tasks, expensive model for hard tasks. Route based on query complexity to optimize cost without sacrificing quality.
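A model cascade can be sketched as a router plus a cost estimate. Everything here is illustrative: the keyword heuristic is a hypothetical stand-in for a real complexity classifier, and the cost model assumes every query passes through the small model while only the hard fraction also goes to the large one.

```python
# Hypothetical hints that a query needs deeper reasoning.
COMPLEX_HINTS = ("analyze", "debug", "step by step", "explain why", "refactor")


def looks_complex(query: str, max_simple_words: int = 30) -> bool:
    """Crude stand-in for a complexity classifier."""
    text = query.lower()
    long_query = len(text.split()) > max_simple_words
    return long_query or any(hint in text for hint in COMPLEX_HINTS)


def choose_model(query: str) -> str:
    """Route easy queries to the small model, hard ones to the large model."""
    return "large" if looks_complex(query) else "small"


def cascade_savings(small_cost: float, large_cost: float,
                    hard_fraction: float) -> float:
    """Fraction saved versus sending every query to the large model."""
    cascade_cost = small_cost + hard_fraction * large_cost
    return 1 - cascade_cost / large_cost


if __name__ == "__main__":
    print(choose_model("Where is my order?"))
    print(choose_model("Debug this stack trace and explain why it fails"))
    # Per-request costs for a 10K-input, 1K-output call at the table's prices.
    saved = cascade_savings(small_cost=0.0021, large_cost=0.035, hard_fraction=0.10)
    print(f"Estimated saving with 10% hard queries: {saved:.0%}")
```

The saving shrinks quickly as the hard fraction grows, so measuring that fraction on real traffic matters more than the exact routing rule.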
Context Window Limitations and Lost-in-the-Middle
The Hidden Problems with Large Context Windows
Problem 1: Lost in the Middle
Research (Liu et al., 2023, "Lost in the Middle") shows that LLMs tend to use information best when it sits at the beginning or end of the context window. Information placed in the middle is more likely to be overlooked. Even with a 128K context, a model may miss critical details buried in the middle.
Implication: When doing RAG, put the most relevant documents first and last, not in the middle. Structure your prompts with key information at the beginning.
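One way to apply this when assembling a RAG prompt is to reorder retrieved documents so the strongest ones sit at the two ends. The sketch below assumes the input list is already sorted from most to least relevant; the function name is illustrative.

```python
def order_for_long_context(ranked_docs: list[str]) -> list[str]:
    """Place the most relevant documents at the start and end of the prompt.

    ranked_docs must be sorted from most to least relevant. Documents are
    dealt alternately to the front and the back, so the least relevant
    ones end up in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for index, doc in enumerate(ranked_docs):
        if index % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
    print(order_for_long_context(ranked))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

With this ordering the top two documents occupy the first and last positions, and relevance falls off toward the middle of the prompt.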
Problem 2: Longer Context = Higher Cost + Latency
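Self-attention is O(n^2) in context length, so doubling the context roughly quadruples the attention compute (the feed-forward layers scale only linearly). In compute terms, one 128K prompt needs about 4x the attention work of four 32K prompts with the same total tokens, because each token attends to every earlier token. Most APIs still bill per token, so the extra compute shows up mainly as latency, although some providers charge a higher rate for very long prompts.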
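The quadratic growth is easy to check numerically. This sketch counts only the attention term in relative units and ignores constant factors and the linear feed-forward cost, so it shows the scaling rather than real FLOPs.

```python
def attention_cost(context_tokens: int) -> int:
    """Relative attention cost: every token attends to every token, so n * n."""
    return context_tokens * context_tokens


def single_vs_split_ratio(total_tokens: int, num_parts: int) -> float:
    """Attention cost of one long prompt relative to the same tokens split up."""
    single = attention_cost(total_tokens)
    split = num_parts * attention_cost(total_tokens // num_parts)
    return single / split


if __name__ == "__main__":
    # Doubling the context quadruples the attention term.
    print(attention_cost(64_000) / attention_cost(32_000))  # 4.0
    # One 128K prompt versus four 32K prompts with the same total tokens.
    print(single_vs_split_ratio(128_000, 4))  # 4.0
```

Splitting a prompt into k equal parts divides the attention term by k, which is one reason chunked retrieval is cheaper than a single long context.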
Problem 3: Quality Degrades with Context Length
Even though a model supports 128K tokens, performance on reasoning tasks typically degrades as you approach the limit. Models are most reliable within their "effective context window" which is often smaller than the advertised maximum.
Practical Guidelines:
- Use RAG over stuffing: Retrieve relevant chunks instead of dumping entire documents
- Prioritize information: Most important context goes first and last
- Summarize when possible: Compress long documents before including them
- Test at scale: Your model may work great at 4K tokens but fail at 64K
- Monitor costs: Large context calls can be surprisingly expensive
Note: A model advertising 128K context does not mean it performs equally well at all lengths. Test your specific use case at realistic context lengths. RAG is usually better than context stuffing.
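"Retrieve relevant chunks instead of dumping entire documents" comes down to packing the best chunks into a fixed token budget. A minimal greedy sketch follows; the 4-characters-per-token estimate is a rough assumption for English text, and a real pipeline would use the model's tokenizer and its own relevance scores.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate. Assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def select_chunks(scored_chunks: list[tuple[float, str]],
                  token_budget: int) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit in the budget."""
    selected: list[str] = []
    used = 0
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    for _score, text in ranked:
        cost = estimate_tokens(text)
        if used + cost <= token_budget:
            selected.append(text)
            used += cost
    return selected


if __name__ == "__main__":
    chunks = [(0.9, "a" * 400), (0.5, "b" * 400), (0.8, "c" * 200)]
    picked = select_chunks(chunks, token_budget=150)
    print([chunk[0] for chunk in picked])  # ['a', 'c']
```

Fixing the budget well below the advertised maximum keeps prompts inside the range where the model has actually been tested.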
Interview Questions
Q: What is a context window and why does it matter?
The context window is the maximum tokens (input + output) an LLM can process at once. It matters because it limits how much information the model can reason about simultaneously. Longer context = ability to process larger documents, but also higher cost (O(n^2) attention), potential quality degradation, and the "lost in the middle" problem where the model misses information in the middle of long contexts.
Q: What is quantization and when would you use it?
Quantization reduces the bit-precision of model parameters (e.g., FP16 to INT4) to decrease memory requirements and, in most setups, increase inference speed. A 7B model at Q4 needs only ~3.5 GB instead of 14 GB at FP16. Use it when running models locally, on edge devices, or when you need to reduce serving costs. The trade-off is a small quality degradation: Q4 is often quoted as retaining about 95% of the original quality, though the figure varies by model and benchmark.
Q: How do you choose between a 7B and a 70B model?
Consider: (1) Task complexity - simple tasks (classification, extraction) work with 7B; complex reasoning needs 70B+. (2) Latency - 7B is roughly 10x faster per token on comparable hardware, since it has 10x fewer weights to read. (3) Cost - 7B is much cheaper to serve. (4) Privacy - 7B can run locally. Best approach: model cascade - route easy queries to 7B, hard queries to 70B.
Q: What is the "lost in the middle" problem?
LLMs pay more attention to information at the beginning and end of the context window. Critical details placed in the middle are more likely to be overlooked. Mitigations: place important context at start/end, use RAG with reranking to prioritize relevant chunks, summarize long documents, and test retrieval quality at realistic context lengths.
Q: What is Mixture of Experts (MoE) and why is it important?
MoE is an architecture where the model has many "expert" sub-networks but only activates a few for each token. GPT-4 reportedly has 1.8T total parameters but only ~280B are active per token. This gives the knowledge capacity of a huge model with the inference cost of a smaller one. It is key to making large models economically viable to serve.
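The "activates a few experts per token" step can be sketched as top-k routing. This is a simplified illustration, not GPT-4's actual router: it picks the k highest gate scores and normalizes them with a softmax over only the selected experts, which is one common design choice. The 280B and 1.8T figures are the rumored numbers quoted above.

```python
import math


def top_k_route(gate_logits: list[float], k: int = 2) -> dict[int, float]:
    """Pick the k experts with the highest gate scores and weight them.

    Returns a mapping from expert index to mixing weight; weights sum to 1.
    """
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(gate_logits[i] for i in top)  # subtract for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the parameters used for each token."""
    return active_params / total_params


if __name__ == "__main__":
    # Gate scores for 4 experts on one token; only 2 experts will run.
    print(top_k_route([0.1, 2.0, -1.0, 1.0], k=2))
    print(f"Active share: {active_fraction(280e9, 1.8e12):.1%}")
```

Compute per token scales with the active parameters, but memory still has to hold every expert, which is why MoE lowers inference cost more than it lowers hardware requirements.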