DevInterviewMasterStart free →
AI & AutomationFree to read

Running Local LLMs (Ollama, vLLM, llama.cpp)

Your Own AI, Your Own Hardware, Zero API Bills

Learn how to run powerful language models locally on your machine. From Ollama's simplicity to vLLM's production-grade serving and llama.cpp's raw efficiency - master self-hosted AI.

Why Run LLMs Locally?

Complete Privacy, Zero API Costs, Full Control

The Case for Local LLMs

Running LLMs locally means the model runs on your own hardware - your laptop, server, or GPU cluster. No data leaves your machine, no per-token charges, and you have complete control over which model, version, and configuration you use.

Think of it like cooking at home vs ordering from Swiggy. Swiggy (cloud APIs) is convenient but expensive and you have no control over ingredients. Cooking at home (local LLMs) takes more setup but gives you full control over quality, cost, and privacy.

When to Go Local:

  • Data Privacy - Sensitive data (medical records, financial data, legal docs) that cannot leave your infrastructure
  • Cost Optimization - High-volume use cases where API costs become prohibitive (100K+ requests/day)
  • Latency Requirements - Edge deployments where network latency to cloud APIs is unacceptable
  • Offline Access - Systems that must work without internet (defense, remote locations, embedded devices)
  • Customization - Need to fine-tune or modify the model for specific tasks

Trade-offs to Consider:

  • Hardware Cost - You need GPUs (NVIDIA RTX 4090 ~ Rs 1.5L) or powerful CPUs with lots of RAM
  • Model Quality - Local models (7B-70B params) are improving but still behind GPT-4/Claude for complex reasoning
  • Maintenance - You manage updates, scaling, monitoring, and hardware failures yourself
  • Setup Complexity - CUDA drivers, model downloads, quantization - initial setup can be tricky

Note: Local LLMs are not a replacement for cloud APIs in all cases. The best approach is often hybrid - local for sensitive/high-volume tasks, cloud APIs for complex reasoning where quality matters most.

Ollama - The Docker of Local LLMs

One Command to Run Any Model

What is Ollama?

Ollama is the easiest way to run LLMs locally. It packages models with their runtime into a single binary, similar to how Docker packages applications. One command and you have a model running with an OpenAI-compatible API.

It handles model downloading, quantization selection, GPU detection, memory management, and serving - all automatically. Think of it as the "Homebrew for LLMs."

How Ollama Works:

  • Model Library - Curated collection of 100+ models (Llama 3, Mistral, Gemma, Phi, CodeLlama, DeepSeek)
  • Modelfile - Like a Dockerfile but for LLMs. Defines base model, system prompt, parameters, and template format
  • Automatic GPU Detection - Finds your NVIDIA/AMD/Apple Silicon GPU and uses it automatically
  • OpenAI-Compatible API - Serves on localhost:11434 with the same API format as OpenAI, so existing code works
  • Model Layers - Only downloads the diff when switching between related models, saving bandwidth

Popular Models on Ollama:

ModelSizeRAM NeededBest For
Llama 3.1 8B4.7 GB8 GBGeneral chat, coding
Mistral 7B4.1 GB8 GBFast responses, reasoning
DeepSeek Coder V28.9 GB16 GBCode generation
Llama 3.1 70B40 GB48 GBComplex reasoning
Phi-3 Mini2.3 GB4 GBLightweight, edge devices

Ollama Limitations:

  • Single Request - Default: processes one request at a time (no batching)
  • No Distributed Inference - Cannot split a model across multiple GPUs on different machines
  • Production Readiness - Great for dev/prototyping, but lacks production features like auth, rate limiting, load balancing

Note: Ollama is perfect for developers getting started with local LLMs. Install it, run one command, and you have a model serving API requests in minutes.

vLLM - Production-Grade LLM Serving

High-Throughput Inference Engine for Real Workloads

What is vLLM?

vLLM (Virtual LLM) is a high-throughput, memory-efficient LLM serving engine developed at UC Berkeley. It is designed for production deployments where you need to serve thousands of concurrent users with minimum latency.

If Ollama is like a home kitchen, vLLM is like a professional restaurant kitchen - designed for high volume, efficiency, and consistency.

PagedAttention - The Secret Sauce:

vLLM introduced PagedAttention, a revolutionary memory management technique inspired by OS virtual memory:

  • Problem - Traditional inference wastes GPU memory by pre-allocating fixed-size KV cache for each request
  • Solution - PagedAttention allocates memory in small blocks (pages), only as needed, like how OS manages RAM
  • Result - 2-4x higher throughput compared to naive serving. Can serve 2-4x more concurrent requests on same hardware

vLLM Key Features:

  • Continuous Batching - New requests join the batch as old ones finish, maximizing GPU utilization
  • Tensor Parallelism - Split large models across multiple GPUs seamlessly
  • Streaming - Token-by-token streaming for responsive UIs
  • OpenAI-Compatible API - Drop-in replacement for OpenAI API endpoints
  • Speculative Decoding - Use a small draft model to speed up generation from large models

When to Choose vLLM:

  • Serving 50+ concurrent users
  • Need maximum throughput per GPU dollar
  • Running on dedicated GPU servers (cloud or on-prem)
  • Production deployment with SLA requirements

Note: vLLM is the industry standard for production LLM serving. Companies like Anyscale, Databricks, and many startups use vLLM to serve their models.

llama.cpp - Raw C++ Efficiency

Run LLMs on CPU - No GPU Required

What is llama.cpp?

llama.cpp is a pure C/C++ implementation of LLM inference created by Georgi Gerganov. It can run models on CPU-only machines, Apple Silicon, and even phones. No Python, no PyTorch, no CUDA required (though GPU acceleration is supported).

This is the engine that powers Ollama under the hood. llama.cpp gives you direct low-level control, while Ollama wraps it with a user-friendly interface.

GGUF Format - The Universal Model Format:

  • GGUF (GPT-Generated Unified Format) - Created by the llama.cpp project, now the standard for local model distribution
  • Self-Contained - One file has the model weights, tokenizer, and metadata
  • Quantization Built-In - GGUF files come pre-quantized (Q4_K_M, Q5_K_S, Q8_0 etc.)
  • HuggingFace - TheBloke and other quantizers upload thousands of GGUF models to HuggingFace

Quantization Levels Explained:

LevelBitsSize (7B model)Quality
Q2_K2-bit~2.8 GBPoor - significant degradation
Q4_K_M4-bit~4.1 GBGood - recommended sweet spot
Q5_K_S5-bit~4.8 GBVery Good - near original
Q8_08-bit~7.2 GBExcellent - minimal loss
F1616-bit~14.0 GBOriginal - no quantization

llama.cpp Use Cases:

  • Edge Devices - Run on Raspberry Pi, phones, IoT devices
  • CPU-Only Servers - When GPU budget is not available
  • Maximum Efficiency - Best tokens-per-watt ratio
  • Embedding Under the Hood - Powers Ollama, LM Studio, GPT4All, and many other tools

Note: llama.cpp democratized local LLMs. Before it existed, running models locally required expensive GPUs and complex PyTorch setups. Now anyone with a decent laptop can run AI models.

Choosing the Right Tool & Hardware Guide

Practical Decision Framework

Ollama vs vLLM vs llama.cpp:

FactorOllamavLLMllama.cpp
Setup DifficultyVery EasyMediumMedium-Hard
Best ForDev, prototypingProduction servingEdge, CPU inference
GPU RequiredOptionalYes (NVIDIA)No
ThroughputLow-MediumVery HighLow-Medium
Concurrent Users1-550-1000+1-10
Multi-GPULimitedYes (tensor parallel)Limited

Hardware Requirements Guide:

  • Laptop (16GB RAM, no GPU) - Use Ollama with 7B Q4 models. Good for experimentation and simple tasks
  • Gaming PC (RTX 4070, 16GB VRAM) - Run 13B models at full speed. Great for coding assistants and local RAG
  • Workstation (RTX 4090, 24GB VRAM) - Run 33B models or multiple 7B models. Production-quality for most tasks
  • Server (2x A100, 80GB each) - Run 70B models with vLLM. Serve hundreds of concurrent users
  • Apple Silicon (M2/M3 Pro/Max) - Unified memory is a game-changer. M3 Max with 96GB can run 70B models on a laptop!

Cost Comparison - Local vs Cloud (Monthly, 100K requests):

SetupMonthly CostQuality
OpenAI GPT-4o~Rs 80,000Excellent
Local RTX 4090 (Llama 70B)~Rs 3,000 (electricity)Very Good
Cloud GPU (A100 spot)~Rs 15,000Very Good

Note: Start with Ollama for learning. Move to vLLM when you need production scale. Use llama.cpp when you need maximum efficiency or CPU-only deployment.

Common Pitfalls & Best Practices

Avoid These Mistakes When Running Local LLMs

Pitfall 1: Wrong Model Size for Your Hardware

The most common mistake. A 70B model on 16GB RAM will either crash or run at 1 token/second. Rule of thumb: model size in GB should be less than your available VRAM/RAM. For CPU inference, you need roughly 1.5x the model file size in RAM.

Pitfall 2: Over-Quantizing

Going below Q4 quantization (like Q2 or Q3) saves memory but destroys model quality. The model will give wrong answers confidently. Q4_K_M is the sweet spot for most use cases - good balance of size and quality.

Best Practices:

  • Always Benchmark - Test your model on your specific task before deploying. Generic benchmarks do not tell the full story
  • Monitor GPU Memory - Use nvidia-smi or Activity Monitor to track VRAM usage
  • Set Context Length Wisely - Longer context = more memory. Do not set 128K context if you only need 4K
  • Use Streaming - Always enable streaming for better user experience. First token appears faster
  • Keep Models Updated - New quantizations and model versions drop regularly. Stay current

Security Considerations:

  • Never expose Ollama to internet - By default it binds to localhost. Keep it that way or add authentication
  • Model Provenance - Only download models from trusted sources (Ollama library, HuggingFace official repos)
  • Prompt Injection - Local models are equally vulnerable to prompt injection attacks. Sanitize inputs

Note: The biggest mistake beginners make is choosing a model too large for their hardware. Start small (7B), verify it works well, then scale up if needed.

Interview Questions

Q: When would you choose local LLM deployment over cloud APIs?

Local LLMs are ideal when: (1) Data privacy is critical - sensitive data cannot leave your infrastructure. (2) Cost optimization at scale - 100K+ daily requests make cloud APIs prohibitively expensive. (3) Low latency requirements - no network round-trip overhead. (4) Offline operation needed. (5) Full model control for fine-tuning. Cloud APIs are better when model quality is paramount, volume is low, or you lack GPU infrastructure.

Q: What is PagedAttention and why does it matter for LLM serving?

PagedAttention (introduced by vLLM) manages KV cache memory using paging, similar to OS virtual memory. Traditional serving pre-allocates maximum possible KV cache per request, wasting memory. PagedAttention allocates small blocks on-demand, reducing memory waste by 60-80%. This allows 2-4x more concurrent requests on the same GPU hardware, dramatically improving throughput and cost efficiency.

Q: What is model quantization and what are the trade-offs?

Quantization reduces model weight precision from 16-bit (FP16) to lower bit widths (8-bit, 4-bit, 2-bit). This shrinks model size and memory usage proportionally. Trade-offs: lower precision means some quality degradation. Q4_K_M (4-bit) is the sweet spot - reduces size by ~4x with minimal quality loss. Below 4-bit, quality drops significantly. GGUF format is the standard for quantized local models.

Q: Compare Ollama, vLLM, and llama.cpp for different use cases.

Ollama: Best for developers - one-command setup, model library, great for prototyping. Single-user focused. vLLM: Production serving engine - PagedAttention, continuous batching, tensor parallelism. Handles 50-1000+ concurrent users. Requires NVIDIA GPUs. llama.cpp: Raw C++ efficiency - runs on CPU, phones, edge devices. Powers Ollama under the hood. Best for resource-constrained environments.

Q: How do you estimate hardware requirements for a local LLM deployment?

Rule of thumb: VRAM needed is approximately the model file size (for quantized models). A 7B model at Q4 is ~4GB, needing at least 6GB VRAM. For CPU inference, need 1.5x the model size in RAM. Context length matters too - longer contexts need more KV cache memory. For production: multiply by concurrent users. A 70B model at Q4 (~40GB) needs 2x A100 (80GB each) for comfortable serving with vLLM.

Frequently Asked Questions

What is Running Local LLMs?

Learn how to run powerful language models locally on your machine. From Ollama's simplicity to vLLM's production-grade serving and llama.cpp's raw efficiency - master self-hosted AI.

How does Running Local LLMs work?

Complete Privacy, Zero API Costs, Full Control The Case for Local LLMs Running LLMs locally means the model runs on your own hardware - your laptop, server, or GPU cluster. No data leaves your machine, no per-token charges, and you have complete control over which model, version, and configuration you use.

Browse all AI & Automation topics →

Practice this on DevInterviewMaster

Read the full Running Local LLMs (Ollama, vLLM, llama.cpp) breakdown with interactive demos, quizzes, and Hinglish notes.

Open the interactive topic →

800+ system-design, LLD, coding, and design-pattern topics. Unlock everything with Pro (₹499, one-time) or Ultimate (₹999, one-time) — lifetime access, no subscription.