AI & AutomationFree to read

Running Local LLMs (Ollama, vLLM, llama.cpp)

Your Own AI, Your Own Hardware, Zero API Bills

Learn how to run powerful language models locally on your machine. From Ollama's simplicity to vLLM's production-grade serving and llama.cpp's raw efficiency - master self-hosted AI.

Why Run LLMs Locally?

Complete Privacy, Zero API Costs, Full Control

The Case for Local LLMs

Running LLMs locally means the model runs on your own hardware - your laptop, server, or GPU cluster. No data leaves your machine, no per-token charges, and you have complete control over which model, version, and configuration you use.

Think of it like cooking at home vs ordering from Swiggy. Swiggy (cloud APIs) is convenient but expensive and you have no control over ingredients. Cooking at home (local LLMs) takes more setup but gives you full control over quality, cost, and privacy.

When to Go Local:

Data Privacy - Sensitive data (medical records, financial data, legal docs) that cannot leave your infrastructure
Cost Optimization - High-volume use cases where API costs become prohibitive (100K+ requests/day)
Latency Requirements - Edge deployments where network latency to cloud APIs is unacceptable
Offline Access - Systems that must work without internet (defense, remote locations, embedded devices)
Customization - Need to fine-tune or modify the model for specific tasks

Trade-offs to Consider:

Hardware Cost - You need GPUs (NVIDIA RTX 4090 ~ Rs 1.5L) or powerful CPUs with lots of RAM
Model Quality - Local models (7B-70B params) are improving but still behind GPT-4/Claude for complex reasoning
Maintenance - You manage updates, scaling, monitoring, and hardware failures yourself
Setup Complexity - CUDA drivers, model downloads, quantization - initial setup can be tricky

Note: Local LLMs are not a replacement for cloud APIs in all cases. The best approach is often hybrid - local for sensitive/high-volume tasks, cloud APIs for complex reasoning where quality matters most.

Ollama - The Docker of Local LLMs

One Command to Run Any Model

What is Ollama?

Ollama is the easiest way to run LLMs locally. It packages models with their runtime into a single binary, similar to how Docker packages applications. One command and you have a model running with an OpenAI-compatible API.

It handles model downloading, quantization selection, GPU detection, memory management, and serving - all automatically. Think of it as the "Homebrew for LLMs."

How Ollama Works:

Model Library - Curated collection of 100+ models (Llama 3, Mistral, Gemma, Phi, CodeLlama, DeepSeek)
Modelfile - Like a Dockerfile but for LLMs. Defines base model, system prompt, parameters, and template format
Automatic GPU Detection - Finds your NVIDIA/AMD/Apple Silicon GPU and uses it automatically
OpenAI-Compatible API - Serves on localhost:11434 with the same API format as OpenAI, so existing code works
Model Layers - Only downloads the diff when switching between related models, saving bandwidth

Popular Models on Ollama:

Model	Size	RAM Needed	Best For
Llama 3.1 8B	4.7 GB	8 GB	General chat, coding
Mistral 7B	4.1 GB	8 GB	Fast responses, reasoning
DeepSeek Coder V2	8.9 GB	16 GB	Code generation
Llama 3.1 70B	40 GB	48 GB	Complex reasoning
Phi-3 Mini	2.3 GB	4 GB	Lightweight, edge devices

Ollama Limitations:

Single Request - Default: processes one request at a time (no batching)
No Distributed Inference - Cannot split a model across multiple GPUs on different machines
Production Readiness - Great for dev/prototyping, but lacks production features like auth, rate limiting, load balancing

Note: Ollama is perfect for developers getting started with local LLMs. Install it, run one command, and you have a model serving API requests in minutes.

vLLM - Production-Grade LLM Serving

High-Throughput Inference Engine for Real Workloads

What is vLLM?

vLLM (Virtual LLM) is a high-throughput, memory-efficient LLM serving engine developed at UC Berkeley. It is designed for production deployments where you need to serve thousands of concurrent users with minimum latency.

If Ollama is like a home kitchen, vLLM is like a professional restaurant kitchen - designed for high volume, efficiency, and consistency.

PagedAttention - The Secret Sauce:

vLLM introduced PagedAttention, a revolutionary memory management technique inspired by OS virtual memory:

Problem - Traditional inference wastes GPU memory by pre-allocating fixed-size KV cache for each request
Solution - PagedAttention allocates memory in small blocks (pages), only as needed, like how OS manages RAM
Result - 2-4x higher throughput compared to naive serving. Can serve 2-4x more concurrent requests on same hardware

vLLM Key Features:

Continuous Batching - New requests join the batch as old ones finish, maximizing GPU utilization
Tensor Parallelism - Split large models across multiple GPUs seamlessly
Streaming - Token-by-token streaming for responsive UIs
OpenAI-Compatible API - Drop-in replacement for OpenAI API endpoints
Speculative Decoding - Use a small draft model to speed up generation from large models

When to Choose vLLM:

Serving 50+ concurrent users
Need maximum throughput per GPU dollar
Running on dedicated GPU servers (cloud or on-prem)
Production deployment with SLA requirements

Note: vLLM is the industry standard for production LLM serving. Companies like Anyscale, Databricks, and many startups use vLLM to serve their models.

llama.cpp - Raw C++ Efficiency

Run LLMs on CPU - No GPU Required

What is llama.cpp?

llama.cpp is a pure C/C++ implementation of LLM inference created by Georgi Gerganov. It can run models on CPU-only machines, Apple Silicon, and even phones. No Python, no PyTorch, no CUDA required (though GPU acceleration is supported).

This is the engine that powers Ollama under the hood. llama.cpp gives you direct low-level control, while Ollama wraps it with a user-friendly interface.

GGUF Format - The Universal Model Format:

GGUF (GPT-Generated Unified Format) - Created by the llama.cpp project, now the standard for local model distribution
Self-Contained - One file has the model weights, tokenizer, and metadata
Quantization Built-In - GGUF files come pre-quantized (Q4_K_M, Q5_K_S, Q8_0 etc.)
HuggingFace - TheBloke and other quantizers upload thousands of GGUF models to HuggingFace

Quantization Levels Explained:

Level	Bits	Size (7B model)	Quality
Q2_K	2-bit	~2.8 GB	Poor - significant degradation
Q4_K_M	4-bit	~4.1 GB	Good - recommended sweet spot
Q5_K_S	5-bit	~4.8 GB	Very Good - near original
Q8_0	8-bit	~7.2 GB	Excellent - minimal loss
F16	16-bit	~14.0 GB	Original - no quantization

llama.cpp Use Cases:

Edge Devices - Run on Raspberry Pi, phones, IoT devices
CPU-Only Servers - When GPU budget is not available
Maximum Efficiency - Best tokens-per-watt ratio
Embedding Under the Hood - Powers Ollama, LM Studio, GPT4All, and many other tools

Note: llama.cpp democratized local LLMs. Before it existed, running models locally required expensive GPUs and complex PyTorch setups. Now anyone with a decent laptop can run AI models.

Choosing the Right Tool & Hardware Guide

Practical Decision Framework

Ollama vs vLLM vs llama.cpp:

Factor	Ollama	vLLM	llama.cpp
Setup Difficulty	Very Easy	Medium	Medium-Hard
Best For	Dev, prototyping	Production serving	Edge, CPU inference
GPU Required	Optional	Yes (NVIDIA)	No
Throughput	Low-Medium	Very High	Low-Medium
Concurrent Users	1-5	50-1000+	1-10
Multi-GPU	Limited	Yes (tensor parallel)	Limited

Hardware Requirements Guide:

Laptop (16GB RAM, no GPU) - Use Ollama with 7B Q4 models. Good for experimentation and simple tasks
Gaming PC (RTX 4070, 16GB VRAM) - Run 13B models at full speed. Great for coding assistants and local RAG
Workstation (RTX 4090, 24GB VRAM) - Run 33B models or multiple 7B models. Production-quality for most tasks
Server (2x A100, 80GB each) - Run 70B models with vLLM. Serve hundreds of concurrent users
Apple Silicon (M2/M3 Pro/Max) - Unified memory is a game-changer. M3 Max with 96GB can run 70B models on a laptop!

Cost Comparison - Local vs Cloud (Monthly, 100K requests):

Setup	Monthly Cost	Quality
OpenAI GPT-4o	~Rs 80,000	Excellent
Local RTX 4090 (Llama 70B)	~Rs 3,000 (electricity)	Very Good
Cloud GPU (A100 spot)	~Rs 15,000	Very Good

Note: Start with Ollama for learning. Move to vLLM when you need production scale. Use llama.cpp when you need maximum efficiency or CPU-only deployment.

Common Pitfalls & Best Practices

Avoid These Mistakes When Running Local LLMs

Pitfall 1: Wrong Model Size for Your Hardware

The most common mistake. A 70B model on 16GB RAM will either crash or run at 1 token/second. Rule of thumb: model size in GB should be less than your available VRAM/RAM. For CPU inference, you need roughly 1.5x the model file size in RAM.

Pitfall 2: Over-Quantizing

Going below Q4 quantization (like Q2 or Q3) saves memory but destroys model quality. The model will give wrong answers confidently. Q4_K_M is the sweet spot for most use cases - good balance of size and quality.

Best Practices:

Always Benchmark - Test your model on your specific task before deploying. Generic benchmarks do not tell the full story
Monitor GPU Memory - Use nvidia-smi or Activity Monitor to track VRAM usage
Set Context Length Wisely - Longer context = more memory. Do not set 128K context if you only need 4K
Use Streaming - Always enable streaming for better user experience. First token appears faster
Keep Models Updated - New quantizations and model versions drop regularly. Stay current

Security Considerations:

Never expose Ollama to internet - By default it binds to localhost. Keep it that way or add authentication
Model Provenance - Only download models from trusted sources (Ollama library, HuggingFace official repos)
Prompt Injection - Local models are equally vulnerable to prompt injection attacks. Sanitize inputs

Note: The biggest mistake beginners make is choosing a model too large for their hardware. Start small (7B), verify it works well, then scale up if needed.

Interview Questions

Q: When would you choose local LLM deployment over cloud APIs?

Local LLMs are ideal when: (1) Data privacy is critical - sensitive data cannot leave your infrastructure. (2) Cost optimization at scale - 100K+ daily requests make cloud APIs prohibitively expensive. (3) Low latency requirements - no network round-trip overhead. (4) Offline operation needed. (5) Full model control for fine-tuning. Cloud APIs are better when model quality is paramount, volume is low, or you lack GPU infrastructure.

Q: What is PagedAttention and why does it matter for LLM serving?

PagedAttention (introduced by vLLM) manages KV cache memory using paging, similar to OS virtual memory. Traditional serving pre-allocates maximum possible KV cache per request, wasting memory. PagedAttention allocates small blocks on-demand, reducing memory waste by 60-80%. This allows 2-4x more concurrent requests on the same GPU hardware, dramatically improving throughput and cost efficiency.

Q: What is model quantization and what are the trade-offs?

Quantization reduces model weight precision from 16-bit (FP16) to lower bit widths (8-bit, 4-bit, 2-bit). This shrinks model size and memory usage proportionally. Trade-offs: lower precision means some quality degradation. Q4_K_M (4-bit) is the sweet spot - reduces size by ~4x with minimal quality loss. Below 4-bit, quality drops significantly. GGUF format is the standard for quantized local models.

Q: Compare Ollama, vLLM, and llama.cpp for different use cases.

Ollama: Best for developers - one-command setup, model library, great for prototyping. Single-user focused. vLLM: Production serving engine - PagedAttention, continuous batching, tensor parallelism. Handles 50-1000+ concurrent users. Requires NVIDIA GPUs. llama.cpp: Raw C++ efficiency - runs on CPU, phones, edge devices. Powers Ollama under the hood. Best for resource-constrained environments.

Q: How do you estimate hardware requirements for a local LLM deployment?

Rule of thumb: VRAM needed is approximately the model file size (for quantized models). A 7B model at Q4 is ~4GB, needing at least 6GB VRAM. For CPU inference, need 1.5x the model size in RAM. Context length matters too - longer contexts need more KV cache memory. For production: multiply by concurrent users. A 70B model at Q4 (~40GB) needs 2x A100 (80GB each) for comfortable serving with vLLM.

Frequently Asked Questions

What is Running Local LLMs?

Learn how to run powerful language models locally on your machine. From Ollama's simplicity to vLLM's production-grade serving and llama.cpp's raw efficiency - master self-hosted AI.

How does Running Local LLMs work?

Complete Privacy, Zero API Costs, Full Control The Case for Local LLMs Running LLMs locally means the model runs on your own hardware - your laptop, server, or GPU cluster. No data leaves your machine, no per-token charges, and you have complete control over which model, version, and configuration you use.

Browse all AI & Automation topics →

Practice this on DevInterviewMaster

Read the full Running Local LLMs (Ollama, vLLM, llama.cpp) breakdown with interactive demos, quizzes, and Hinglish notes.

Open the interactive topic →

800+ system-design, LLD, coding, and design-pattern topics. Unlock everything with Pro (₹499, one-time) or Ultimate (₹999, one-time) — lifetime access, no subscription.

Running Local LLMs (Ollama, vLLM, llama.cpp)

Why Run LLMs Locally?

Ollama - The Docker of Local LLMs

vLLM - Production-Grade LLM Serving

llama.cpp - Raw C++ Efficiency

Choosing the Right Tool & Hardware Guide

Common Pitfalls & Best Practices

Interview Questions

Frequently Asked Questions

Related topics

Practice this on DevInterviewMaster