Running Local LLMs (Ollama, vLLM, llama.cpp)
Your Own AI, Your Own Hardware, Zero API Bills
Learn how to run powerful language models locally on your machine. From Ollama's simplicity to vLLM's production-grade serving and llama.cpp's raw efficiency - master self-hosted AI.
Why Run LLMs Locally?
Complete Privacy, Zero API Costs, Full Control
The Case for Local LLMs
Running LLMs locally means the model runs on your own hardware - your laptop, server, or GPU cluster. No data leaves your machine, no per-token charges, and you have complete control over which model, version, and configuration you use.
Think of it like cooking at home vs ordering from Swiggy. Swiggy (cloud APIs) is convenient but expensive and you have no control over ingredients. Cooking at home (local LLMs) takes more setup but gives you full control over quality, cost, and privacy.
When to Go Local:
- Data Privacy - Sensitive data (medical records, financial data, legal docs) that cannot leave your infrastructure
- Cost Optimization - High-volume use cases where API costs become prohibitive (100K+ requests/day)
- Latency Requirements - Edge deployments where network latency to cloud APIs is unacceptable
- Offline Access - Systems that must work without internet (defense, remote locations, embedded devices)
- Customization - Need to fine-tune or modify the model for specific tasks
Trade-offs to Consider:
- Hardware Cost - You need GPUs (NVIDIA RTX 4090 ~ Rs 1.5L) or powerful CPUs with lots of RAM
- Model Quality - Local models (7B-70B params) are improving but still behind GPT-4/Claude for complex reasoning
- Maintenance - You manage updates, scaling, monitoring, and hardware failures yourself
- Setup Complexity - CUDA drivers, model downloads, quantization - initial setup can be tricky
Note: Local LLMs are not a replacement for cloud APIs in all cases. The best approach is often hybrid - local for sensitive/high-volume tasks, cloud APIs for complex reasoning where quality matters most.
Ollama - The Docker of Local LLMs
One Command to Run Any Model
What is Ollama?
Ollama is the easiest way to run LLMs locally. It packages models with their runtime into a single binary, similar to how Docker packages applications. One command and you have a model running with an OpenAI-compatible API.
It handles model downloading, quantization selection, GPU detection, memory management, and serving - all automatically. Think of it as the "Homebrew for LLMs."
How Ollama Works:
- Model Library - Curated collection of 100+ models (Llama 3, Mistral, Gemma, Phi, CodeLlama, DeepSeek)
- Modelfile - Like a Dockerfile but for LLMs. Defines base model, system prompt, parameters, and template format
- Automatic GPU Detection - Finds your NVIDIA/AMD/Apple Silicon GPU and uses it automatically
- OpenAI-Compatible API - Serves on localhost:11434 with the same API format as OpenAI, so existing code works
- Model Layers - Only downloads the diff when switching between related models, saving bandwidth
Popular Models on Ollama:
| Model | Size | RAM Needed | Best For |
|---|---|---|---|
| Llama 3.1 8B | 4.7 GB | 8 GB | General chat, coding |
| Mistral 7B | 4.1 GB | 8 GB | Fast responses, reasoning |
| DeepSeek Coder V2 | 8.9 GB | 16 GB | Code generation |
| Llama 3.1 70B | 40 GB | 48 GB | Complex reasoning |
| Phi-3 Mini | 2.3 GB | 4 GB | Lightweight, edge devices |
Ollama Limitations:
- Single Request - Default: processes one request at a time (no batching)
- No Distributed Inference - Cannot split a model across multiple GPUs on different machines
- Production Readiness - Great for dev/prototyping, but lacks production features like auth, rate limiting, load balancing
Note: Ollama is perfect for developers getting started with local LLMs. Install it, run one command, and you have a model serving API requests in minutes.
vLLM - Production-Grade LLM Serving
High-Throughput Inference Engine for Real Workloads
What is vLLM?
vLLM (Virtual LLM) is a high-throughput, memory-efficient LLM serving engine developed at UC Berkeley. It is designed for production deployments where you need to serve thousands of concurrent users with minimum latency.
If Ollama is like a home kitchen, vLLM is like a professional restaurant kitchen - designed for high volume, efficiency, and consistency.
PagedAttention - The Secret Sauce:
vLLM introduced PagedAttention, a revolutionary memory management technique inspired by OS virtual memory:
- Problem - Traditional inference wastes GPU memory by pre-allocating fixed-size KV cache for each request
- Solution - PagedAttention allocates memory in small blocks (pages), only as needed, like how OS manages RAM
- Result - 2-4x higher throughput compared to naive serving. Can serve 2-4x more concurrent requests on same hardware
vLLM Key Features:
- Continuous Batching - New requests join the batch as old ones finish, maximizing GPU utilization
- Tensor Parallelism - Split large models across multiple GPUs seamlessly
- Streaming - Token-by-token streaming for responsive UIs
- OpenAI-Compatible API - Drop-in replacement for OpenAI API endpoints
- Speculative Decoding - Use a small draft model to speed up generation from large models
When to Choose vLLM:
- Serving 50+ concurrent users
- Need maximum throughput per GPU dollar
- Running on dedicated GPU servers (cloud or on-prem)
- Production deployment with SLA requirements
Note: vLLM is the industry standard for production LLM serving. Companies like Anyscale, Databricks, and many startups use vLLM to serve their models.
llama.cpp - Raw C++ Efficiency
Run LLMs on CPU - No GPU Required
What is llama.cpp?
llama.cpp is a pure C/C++ implementation of LLM inference created by Georgi Gerganov. It can run models on CPU-only machines, Apple Silicon, and even phones. No Python, no PyTorch, no CUDA required (though GPU acceleration is supported).
This is the engine that powers Ollama under the hood. llama.cpp gives you direct low-level control, while Ollama wraps it with a user-friendly interface.
GGUF Format - The Universal Model Format:
- GGUF (GPT-Generated Unified Format) - Created by the llama.cpp project, now the standard for local model distribution
- Self-Contained - One file has the model weights, tokenizer, and metadata
- Quantization Built-In - GGUF files come pre-quantized (Q4_K_M, Q5_K_S, Q8_0 etc.)
- HuggingFace - TheBloke and other quantizers upload thousands of GGUF models to HuggingFace
Quantization Levels Explained:
| Level | Bits | Size (7B model) | Quality |
|---|---|---|---|
| Q2_K | 2-bit | ~2.8 GB | Poor - significant degradation |
| Q4_K_M | 4-bit | ~4.1 GB | Good - recommended sweet spot |
| Q5_K_S | 5-bit | ~4.8 GB | Very Good - near original |
| Q8_0 | 8-bit | ~7.2 GB | Excellent - minimal loss |
| F16 | 16-bit | ~14.0 GB | Original - no quantization |
llama.cpp Use Cases:
- Edge Devices - Run on Raspberry Pi, phones, IoT devices
- CPU-Only Servers - When GPU budget is not available
- Maximum Efficiency - Best tokens-per-watt ratio
- Embedding Under the Hood - Powers Ollama, LM Studio, GPT4All, and many other tools
Note: llama.cpp democratized local LLMs. Before it existed, running models locally required expensive GPUs and complex PyTorch setups. Now anyone with a decent laptop can run AI models.
Choosing the Right Tool & Hardware Guide
Practical Decision Framework
Ollama vs vLLM vs llama.cpp:
| Factor | Ollama | vLLM | llama.cpp |
|---|---|---|---|
| Setup Difficulty | Very Easy | Medium | Medium-Hard |
| Best For | Dev, prototyping | Production serving | Edge, CPU inference |
| GPU Required | Optional | Yes (NVIDIA) | No |
| Throughput | Low-Medium | Very High | Low-Medium |
| Concurrent Users | 1-5 | 50-1000+ | 1-10 |
| Multi-GPU | Limited | Yes (tensor parallel) | Limited |
Hardware Requirements Guide:
- Laptop (16GB RAM, no GPU) - Use Ollama with 7B Q4 models. Good for experimentation and simple tasks
- Gaming PC (RTX 4070, 16GB VRAM) - Run 13B models at full speed. Great for coding assistants and local RAG
- Workstation (RTX 4090, 24GB VRAM) - Run 33B models or multiple 7B models. Production-quality for most tasks
- Server (2x A100, 80GB each) - Run 70B models with vLLM. Serve hundreds of concurrent users
- Apple Silicon (M2/M3 Pro/Max) - Unified memory is a game-changer. M3 Max with 96GB can run 70B models on a laptop!
Cost Comparison - Local vs Cloud (Monthly, 100K requests):
| Setup | Monthly Cost | Quality |
|---|---|---|
| OpenAI GPT-4o | ~Rs 80,000 | Excellent |
| Local RTX 4090 (Llama 70B) | ~Rs 3,000 (electricity) | Very Good |
| Cloud GPU (A100 spot) | ~Rs 15,000 | Very Good |
Note: Start with Ollama for learning. Move to vLLM when you need production scale. Use llama.cpp when you need maximum efficiency or CPU-only deployment.
Common Pitfalls & Best Practices
Avoid These Mistakes When Running Local LLMs
Pitfall 1: Wrong Model Size for Your Hardware
The most common mistake. A 70B model on 16GB RAM will either crash or run at 1 token/second. Rule of thumb: model size in GB should be less than your available VRAM/RAM. For CPU inference, you need roughly 1.5x the model file size in RAM.
Pitfall 2: Over-Quantizing
Going below Q4 quantization (like Q2 or Q3) saves memory but destroys model quality. The model will give wrong answers confidently. Q4_K_M is the sweet spot for most use cases - good balance of size and quality.
Best Practices:
- Always Benchmark - Test your model on your specific task before deploying. Generic benchmarks do not tell the full story
- Monitor GPU Memory - Use nvidia-smi or Activity Monitor to track VRAM usage
- Set Context Length Wisely - Longer context = more memory. Do not set 128K context if you only need 4K
- Use Streaming - Always enable streaming for better user experience. First token appears faster
- Keep Models Updated - New quantizations and model versions drop regularly. Stay current
Security Considerations:
- Never expose Ollama to internet - By default it binds to localhost. Keep it that way or add authentication
- Model Provenance - Only download models from trusted sources (Ollama library, HuggingFace official repos)
- Prompt Injection - Local models are equally vulnerable to prompt injection attacks. Sanitize inputs
Note: The biggest mistake beginners make is choosing a model too large for their hardware. Start small (7B), verify it works well, then scale up if needed.
Interview Questions
Q: When would you choose local LLM deployment over cloud APIs?
Local LLMs are ideal when: (1) Data privacy is critical - sensitive data cannot leave your infrastructure. (2) Cost optimization at scale - 100K+ daily requests make cloud APIs prohibitively expensive. (3) Low latency requirements - no network round-trip overhead. (4) Offline operation needed. (5) Full model control for fine-tuning. Cloud APIs are better when model quality is paramount, volume is low, or you lack GPU infrastructure.
Q: What is PagedAttention and why does it matter for LLM serving?
PagedAttention (introduced by vLLM) manages KV cache memory using paging, similar to OS virtual memory. Traditional serving pre-allocates maximum possible KV cache per request, wasting memory. PagedAttention allocates small blocks on-demand, reducing memory waste by 60-80%. This allows 2-4x more concurrent requests on the same GPU hardware, dramatically improving throughput and cost efficiency.
Q: What is model quantization and what are the trade-offs?
Quantization reduces model weight precision from 16-bit (FP16) to lower bit widths (8-bit, 4-bit, 2-bit). This shrinks model size and memory usage proportionally. Trade-offs: lower precision means some quality degradation. Q4_K_M (4-bit) is the sweet spot - reduces size by ~4x with minimal quality loss. Below 4-bit, quality drops significantly. GGUF format is the standard for quantized local models.
Q: Compare Ollama, vLLM, and llama.cpp for different use cases.
Ollama: Best for developers - one-command setup, model library, great for prototyping. Single-user focused. vLLM: Production serving engine - PagedAttention, continuous batching, tensor parallelism. Handles 50-1000+ concurrent users. Requires NVIDIA GPUs. llama.cpp: Raw C++ efficiency - runs on CPU, phones, edge devices. Powers Ollama under the hood. Best for resource-constrained environments.
Q: How do you estimate hardware requirements for a local LLM deployment?
Rule of thumb: VRAM needed is approximately the model file size (for quantized models). A 7B model at Q4 is ~4GB, needing at least 6GB VRAM. For CPU inference, need 1.5x the model size in RAM. Context length matters too - longer contexts need more KV cache memory. For production: multiply by concurrent users. A 70B model at Q4 (~40GB) needs 2x A100 (80GB each) for comfortable serving with vLLM.
Frequently Asked Questions
What is Running Local LLMs?
Learn how to run powerful language models locally on your machine. From Ollama's simplicity to vLLM's production-grade serving and llama.cpp's raw efficiency - master self-hosted AI.
How does Running Local LLMs work?
Complete Privacy, Zero API Costs, Full Control The Case for Local LLMs Running LLMs locally means the model runs on your own hardware - your laptop, server, or GPU cluster. No data leaves your machine, no per-token charges, and you have complete control over which model, version, and configuration you use.
Related topics
Practice this on DevInterviewMaster
Read the full Running Local LLMs (Ollama, vLLM, llama.cpp) breakdown with interactive demos, quizzes, and Hinglish notes.
800+ system-design, LLD, coding, and design-pattern topics. Unlock everything with Pro (₹499, one-time) or Ultimate (₹999, one-time) — lifetime access, no subscription.