AI & AutomationFree to read

Fine-tuning LLMs (LoRA, QLoRA, RLHF, DPO)

Teach AI Models Your Domain, Your Style, Your Rules

Master the art of customizing large language models for your specific use case. From parameter-efficient LoRA to alignment techniques like RLHF and DPO - make any model truly yours.

Why Fine-tune an LLM?

When Prompting Is Not Enough

The Fine-tuning Decision

Base LLMs like Llama or Mistral are trained on general internet text. They know a lot but are not specialized. Fine-tuning teaches them your specific domain - your company data, your writing style, your task format, your domain terminology.

Think of it like hiring a fresh IIT graduate (base model). They are smart and know fundamentals, but you need to train them on your company processes, coding standards, and domain expertise. That training period is fine-tuning.

When to Fine-tune vs When to Prompt:

Use Prompting/RAG When: You need factual accuracy from specific documents, the task is straightforward, you want flexibility to change behavior quickly
Use Fine-tuning When: You need a specific output format consistently, the model needs domain-specific reasoning patterns, you want to reduce token usage (shorter prompts), the task requires specialized knowledge embedded in weights

Types of Fine-tuning:

Supervised Fine-tuning (SFT) - Train on input-output pairs. "Given this input, produce this output." Most common approach.
Instruction Tuning - SFT specifically on instruction-following data. Makes base models chat-ready.
Alignment (RLHF/DPO) - Teach the model human preferences - which responses are better, safer, more helpful.
Domain Adaptation - Continue pre-training on domain-specific text (medical, legal, finance) to build domain knowledge.

Note: Fine-tuning is not always the answer. 80% of use cases can be solved with good prompting and RAG. Fine-tune only when you have clear evidence that prompting is insufficient.

LoRA - The Revolution in Efficient Fine-tuning

Train 0.1% of Parameters, Get 95% of Full Fine-tuning Quality

What is LoRA?

LoRA (Low-Rank Adaptation) is a technique that freezes the original model weights and adds small trainable "adapter" matrices alongside them. Instead of updating billions of parameters, you train only millions - a 100-1000x reduction in trainable parameters.

Analogy: Instead of rebuilding an entire Flipkart warehouse (full fine-tuning), you just add a small specialized section for a new product category (LoRA). The main warehouse stays untouched, and you can add/remove sections easily.

How LoRA Works (Simplified):

A weight matrix W (say 4096 x 4096 = 16M params) gets a low-rank decomposition: W + BA where B is 4096 x r and A is r x 4096. If rank r = 16, the adapter has only 4096 x 16 x 2 = ~131K params instead of 16M. That is 99.2% fewer parameters!

Original weights - Frozen, not modified during training
Adapter weights (A, B) - Small matrices that are trained
At inference - Adapters are merged into original weights. Zero additional latency!

LoRA Key Parameters:

Rank (r) - Size of the adapter. r=8 for simple tasks, r=64 for complex. Higher = more capacity but more memory.
Alpha - Scaling factor. Usually alpha = 2 x rank. Controls how much the adapter influences the output.
Target Modules - Which layers to apply LoRA to. Usually attention layers (q_proj, v_proj, k_proj, o_proj). Some tasks benefit from also targeting MLP layers.
Dropout - Regularization. 0.05-0.1 typical. Prevents overfitting on small datasets.

LoRA Benefits:

Memory - Fine-tune 7B model on a single RTX 4090 (24GB) instead of needing 4x A100s
Speed - 2-5x faster training than full fine-tuning
Storage - Adapter is 10-100 MB vs 14+ GB for full model. Serve multiple task-specific adapters from one base model
No Catastrophic Forgetting - Base model knowledge is preserved since original weights are frozen

Note: LoRA is the default fine-tuning method in 2025-26. It is supported by HuggingFace PEFT, Axolotl, Unsloth, and every major training framework.

QLoRA - Fine-tune 70B Models on Consumer GPUs

4-bit Quantization + LoRA = Magic

What is QLoRA?

QLoRA (Quantized LoRA) combines 4-bit quantization of the base model with LoRA adapters trained in full precision. The base model is loaded in 4-bit (NF4 format), reducing memory by 4x, while LoRA adapters are trained in BF16 for accuracy.

This means you can fine-tune a 65B parameter model on a single 48GB GPU. Before QLoRA, this required 4-8 A100s (costing lakhs per month in cloud compute).

QLoRA Innovations:

NF4 (Normal Float 4-bit) - A new data type optimized for normally distributed neural network weights. Better quality than standard INT4.
Double Quantization - Quantizes the quantization constants themselves, saving an additional 0.4 bits per parameter.
Paged Optimizers - Uses unified memory to handle memory spikes during training. Automatically moves optimizer states to CPU when GPU runs out.

Memory Comparison:

Method	7B Model	13B Model	70B Model
Full Fine-tuning (FP16)	~120 GB	~220 GB	~1.2 TB
LoRA (FP16 base)	~30 GB	~55 GB	~300 GB
QLoRA (4-bit base)	~10 GB	~18 GB	~48 GB

Practical Guidance:

For most users - Use QLoRA. The quality difference vs full LoRA is negligible for most tasks.
If you have the GPU budget - Full LoRA with BF16 gives slightly better results for complex tasks.
Unsloth library - 2x faster QLoRA training with 50% less memory. Highly recommended.

Note: QLoRA made fine-tuning accessible to everyone. You can fine-tune a Llama 3 8B model on a single RTX 3090/4090 in a few hours. No expensive cloud compute needed.

RLHF and DPO - Aligning Models with Human Preferences

Teaching Models What Humans Actually Want

The Alignment Problem

A model trained on internet text can generate text that is factual but unhelpful, or helpful but unsafe, or technically correct but poorly formatted. Alignment teaches the model human preferences - what a "good" response looks like vs a "bad" one.

Think of it like training a new Zomato delivery partner. They know how to ride a bike (base capability) but you need to teach them: be polite, do not ring the bell at midnight, handle food carefully, notify if late. These are "preferences" that make the difference between a good and bad experience.

RLHF (Reinforcement Learning from Human Feedback):

The original alignment method used by ChatGPT. Three-step process:

Step 1: SFT - Fine-tune base model on high-quality demonstrations
Step 2: Reward Model - Train a separate model on human preference data (response A is better than B for this prompt). This model scores responses.
Step 3: PPO Training - Use the reward model to give feedback while training the main model with PPO (Proximal Policy Optimization) reinforcement learning algorithm.

Problems with RLHF: Complex (3 models: SFT, Reward, Policy), expensive, unstable training, reward hacking (model games the reward model).

DPO (Direct Preference Optimization) - The Simpler Alternative:

DPO skips the reward model entirely. It directly optimizes the model to prefer the "chosen" response over the "rejected" response using a clever mathematical trick.

Input - Pairs of (prompt, chosen_response, rejected_response)
Training - One stage, directly on the preference pairs. No reward model needed.
Result - Equivalent or better quality than RLHF with much simpler training

RLHF vs DPO:

Aspect	RLHF	DPO
Complexity	High (3 models)	Low (1 model)
Training Stability	Unstable (RL)	Stable (supervised)
Data Needed	Demonstrations + Comparisons	Only Comparisons
Quality	Excellent	Excellent
Industry Trend	Declining	Growing (preferred)

Note: DPO has largely replaced RLHF for most open-source model alignment. It is simpler, more stable, and produces comparable results. Llama 3, Mistral, and most modern models use DPO or its variants.

Practical Fine-tuning Guide

From Data to Deployed Model

Step-by-Step Fine-tuning Pipeline:

1. Data Preparation - Collect 1K-50K high-quality examples in the format your model will be used. For chat: system/user/assistant turns. For specific tasks: input/output pairs. Quality over quantity!
2. Choose Base Model - Llama 3 8B for most tasks. 70B if you need complex reasoning. Mistral for speed. DeepSeek for code.
3. Training Config - QLoRA with r=16, alpha=32, learning rate 2e-4, 3 epochs, cosine scheduler. This works for 90% of cases.
4. Train - Use Unsloth (fastest, lowest memory), Axolotl (most configurable), or HuggingFace TRL (most documented).
5. Evaluate - Test on held-out data. Check for overfitting (training loss low but eval loss high). Manual quality check on 50-100 examples.
6. Merge & Deploy - Merge LoRA into base model, quantize to GGUF/AWQ, deploy with Ollama/vLLM.

Data Quality Tips:

Garbage In, Garbage Out - 1000 high-quality examples beat 100K noisy ones
Diversity - Cover edge cases, different phrasings, varying complexity levels
Format Consistency - Use the exact chat template your base model expects (ChatML, Llama format, etc.)
Use AI to Bootstrap - Generate initial data with GPT-4/Claude, then manually review and correct. Much faster than writing from scratch.

Common Mistakes:

Overfitting - Too many epochs on small datasets. Model memorizes instead of learning patterns. Fix: fewer epochs, more dropout, more data.
Wrong Chat Template - Using Llama template for a Mistral model causes gibberish. Always match the template.
Too High Learning Rate - Destroys base model knowledge. Start low (1e-5 to 2e-4) and use warmup.
Catastrophic Forgetting - Full fine-tuning can erase general knowledge. LoRA prevents this naturally.

Note: Always start with the smallest dataset that shows improvement. Fine-tuning on too much noisy data can actually make the model worse than the base model.

Emerging Techniques & When NOT to Fine-tune

Stay Current and Avoid Wasted Effort

Newer Alignment Techniques:

ORPO (Odds Ratio Preference Optimization) - Combines SFT and alignment in one step. Even simpler than DPO.
KTO (Kahneman-Tversky Optimization) - Works with just thumbs-up/thumbs-down data instead of requiring preference pairs. Easier data collection.
SPIN (Self-Play Fine-tuning) - Model improves by playing against previous versions of itself. No human preference data needed.

When NOT to Fine-tune:

Factual Knowledge - Fine-tuning is bad at injecting facts. Use RAG instead. Fine-tuning teaches patterns, not facts.
You Have Less Than 100 Examples - Not enough data for meaningful fine-tuning. Few-shot prompting will work better.
The Task Changes Frequently - Fine-tuning is slow to update. Prompts can be changed instantly.
You Cannot Evaluate Properly - Without good evaluation metrics, you will not know if fine-tuning actually helped.
Cloud API Quality is Sufficient - If GPT-4 with good prompting works, fine-tuning a smaller model may not match the quality.

The Modern Fine-tuning Stack:

Unsloth - 2x faster training, 50% less memory. Best for QLoRA.
Axolotl - Most features, YAML config-driven. Best for advanced users.
HuggingFace TRL - Official library for SFT, DPO, RLHF. Best documentation.
LLaMA-Factory - WebUI for fine-tuning. Best for beginners who prefer GUI.
OpenAI Fine-tuning API - Upload data, click train. Easiest but only for OpenAI models and expensive.

Note: Fine-tuning is a powerful tool but not a magic wand. Always try prompting and RAG first. Fine-tune only when you have data showing these approaches are not enough.

Interview Questions

Q: What is LoRA and why is it preferred over full fine-tuning?

LoRA (Low-Rank Adaptation) freezes original model weights and adds small trainable adapter matrices. It trains only 0.1-1% of parameters but achieves 95%+ of full fine-tuning quality. Benefits: 100x less memory, 2-5x faster training, adapter is only 10-100 MB, prevents catastrophic forgetting since base weights are frozen, and you can serve multiple task-specific adapters from one base model.

Q: What is the difference between RLHF and DPO?

RLHF uses a three-stage pipeline: SFT, reward model training, and PPO-based RL training. DPO simplifies this to a single stage by directly optimizing on preference pairs (chosen vs rejected responses) without needing a separate reward model. DPO is simpler, more stable, requires less compute, and produces comparable results. DPO has largely replaced RLHF in the open-source ecosystem.

Q: When should you fine-tune vs use RAG?

Fine-tune when you need: consistent output formatting, domain-specific reasoning patterns, reduced prompt length, or specialized behavior embedded in weights. Use RAG when you need: factual accuracy from specific documents, frequently updated knowledge, or traceability to sources. Fine-tuning teaches patterns and style; RAG provides facts. Many production systems combine both.

Q: What is QLoRA and how does it differ from LoRA?

QLoRA loads the base model in 4-bit quantization (NF4 format) while training LoRA adapters in full precision (BF16). This reduces memory by 4x compared to standard LoRA, enabling fine-tuning of 70B models on a single 48GB GPU. Key innovations: NF4 data type, double quantization, and paged optimizers. Quality difference vs full LoRA is negligible for most tasks.

Q: What are common fine-tuning mistakes and how to avoid them?

(1) Overfitting on small datasets - use fewer epochs, more dropout. (2) Wrong chat template - always match the base model expected format. (3) Too high learning rate - start at 1e-5 to 2e-4 with warmup. (4) Noisy training data - quality over quantity, manually review samples. (5) No proper evaluation - always have a held-out test set and do manual quality checks.

Frequently Asked Questions

What is Fine-tuning LLMs?

Master the art of customizing large language models for your specific use case. From parameter-efficient LoRA to alignment techniques like RLHF and DPO - make any model truly yours.

How does Fine-tuning LLMs work?

When Prompting Is Not Enough The Fine-tuning Decision Base LLMs like Llama or Mistral are trained on general internet text. They know a lot but are not specialized.

Browse all AI & Automation topics →

Practice this on DevInterviewMaster

Read the full Fine-tuning LLMs (LoRA, QLoRA, RLHF, DPO) breakdown with interactive demos, quizzes, and Hinglish notes.

Open the interactive topic →

800+ system-design, LLD, coding, and design-pattern topics. Unlock everything with Pro (₹499, one-time) or Ultimate (₹999, one-time) — lifetime access, no subscription.

Fine-tuning LLMs (LoRA, QLoRA, RLHF, DPO)

Why Fine-tune an LLM?

LoRA - The Revolution in Efficient Fine-tuning

QLoRA - Fine-tune 70B Models on Consumer GPUs

RLHF and DPO - Aligning Models with Human Preferences

Practical Fine-tuning Guide

Emerging Techniques & When NOT to Fine-tune

Interview Questions

Frequently Asked Questions

Related topics

Practice this on DevInterviewMaster