RAG Evaluation (RAGAS, Faithfulness, Relevancy)
You Cannot Improve What You Cannot Measure
Building a RAG pipeline is easy. Knowing if it actually works well is hard. Learn the metrics, frameworks, and techniques to rigorously evaluate your RAG system - from retrieval quality to generation faithfulness.
Why RAG Evaluation Matters
The Silent Killer of RAG Projects
The Problem:
Most teams build a RAG pipeline, test it with 5-10 questions, say "looks good enough", and ship it. Three weeks later, users complain about wrong answers, hallucinations, and missing information. Without proper evaluation, you are flying blind.
RAG evaluation tells you: Is the retrieval finding the right documents? Is the LLM being faithful to the retrieved context? Is the final answer actually relevant to the user question?
Real-World Analogy - Restaurant Quality:
Imagine evaluating a restaurant:
- Retrieval = Kitchen finding ingredients: Did the kitchen pull the right ingredients from the pantry? (Precision, Recall)
- Context Relevancy = Ingredient quality: Are the fetched ingredients actually needed for this dish? (No extra random stuff)
- Faithfulness = Chef following recipe: Did the chef use ONLY the ingredients available, or did they add imaginary spices? (Hallucination check)
- Answer Relevancy = Final dish matching order: Does the served dish actually answer what the customer ordered? (Not a perfect biryani when they asked for dosa)
The Two Dimensions of RAG Evaluation:
| Dimension | What It Measures | Key Metrics |
|---|---|---|
| Retrieval Quality | Are we finding the right documents? | Precision, Recall, MRR, NDCG |
| Generation Quality | Is the LLM answer correct and faithful? | Faithfulness, Relevancy, Correctness |
Note: Many RAG failures are retrieval failures in disguise. Always evaluate retrieval quality separately before blaming the LLM.
RAGAS Framework - The Gold Standard
Automated RAG Evaluation Without Human Labels
What is RAGAS?
RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that evaluates RAG pipelines using LLM-as-a-judge. The key innovation: it can score faithfulness and answer relevancy without human-labeled ground truth for every question, because an LLM acts as the judge. Some metrics, such as context recall, still need reference answers.
The Four Core RAGAS Metrics:
- Faithfulness (0-1): Can every claim in the answer be traced back to the retrieved context? High faithfulness = no hallucination. Measured by extracting claims from the answer and checking each against the context.
- Answer Relevancy (0-1): Is the answer actually addressing the question? Measured by generating hypothetical questions from the answer and comparing them to the original question via embedding similarity.
- Context Precision (0-1): Are the top-ranked retrieved chunks actually relevant? Measures if relevant chunks appear before irrelevant ones in the retrieval results.
- Context Recall (0-1): Does the retrieved context contain all the information needed to answer the question? Needs ground truth answers to measure. Checks if every sentence in the ground truth can be attributed to the context.
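The answer-relevancy idea (compare questions regenerated from the answer with the original question) can be sketched in a few lines. This is a minimal illustration, not the RAGAS implementation: the vectors are hand-written toy embeddings, and in a real pipeline both the regenerated questions and their embeddings would come from models.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def answer_relevancy(question_vec, generated_question_vecs):
    """Mean similarity between the original question and the questions
    regenerated from the answer (all given as embedding vectors)."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)

# Toy embeddings (assumed): one regenerated question matches the original,
# the other is unrelated (orthogonal vector).
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
score = answer_relevancy(original, generated)
print(round(score, 2))
```

An answer that drifts off-topic produces regenerated questions that sit far from the original one, which pulls the mean similarity down.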
How Faithfulness is Calculated:
- LLM extracts all factual claims from the generated answer
- For each claim, LLM checks: "Can this claim be inferred from the retrieved context?"
- Faithfulness = Number of supported claims / Total claims
Example: Answer has 5 claims. 4 are found in context, 1 is hallucinated. Faithfulness = 4/5 = 0.80
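The arithmetic above is simple enough to express directly. The sketch below assumes the claim extraction and per-claim verification have already been done by an LLM and only covers the final ratio; `faithfulness_score` is a hypothetical helper name, not a RAGAS function.

```python
def faithfulness_score(claim_verdicts):
    """Fraction of claims supported by the retrieved context.

    claim_verdicts: list of booleans, True when the claim can be
    inferred from the context. Returns None when there are no claims,
    because the ratio is undefined in that case.
    """
    if not claim_verdicts:
        return None
    return sum(claim_verdicts) / len(claim_verdicts)

# Five claims: four supported, one hallucinated.
verdicts = [True, True, True, True, False]
score = faithfulness_score(verdicts)
print(score)  # 0.8
```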
Typical Score Ranges (rules of thumb; calibrate against your own data):
- Excellent RAG: Faithfulness > 0.9, Relevancy > 0.85
- Good RAG: Faithfulness 0.75-0.9, Relevancy 0.7-0.85
- Needs Work: Faithfulness < 0.75, Relevancy < 0.7
Note: RAGAS is one of the most widely used RAG evaluation frameworks. It requires no manual labeling for faithfulness and answer relevancy - making it practical for real projects.
Retrieval Metrics Deep Dive
Measuring How Well Your Retriever Finds the Right Documents
Classical IR Metrics:
These metrics come from Information Retrieval research and measure retrieval quality independently of the LLM.
Key Metrics Explained:
- Precision@k: Of the top-k retrieved chunks, how many are actually relevant? If you retrieve 10 chunks and 7 are relevant: Precision@10 = 0.7. High precision = less noise for the LLM.
- Recall@k: Of all relevant chunks in the database, what fraction appears in the top-k results? If there are 10 relevant chunks and 6 of them appear in the top 10: Recall@10 = 0.6. High recall = we are not missing important information.
- MRR (Mean Reciprocal Rank): How high is the first relevant result ranked? If the first relevant result is at position 3, reciprocal rank = 1/3. Average across all queries. MRR = 1.0 means the first result is always relevant.
- NDCG (Normalized Discounted Cumulative Gain): Measures overall ranking quality - relevant docs at higher positions get more credit. More nuanced than precision because it considers the position of each relevant result.
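NDCG is the least intuitive of these metrics, so a small sketch may help. It assumes binary relevance (a chunk is either relevant or not) and the common log2 position discount; graded relevance and other discount choices exist, and the chunk ids here are made up for illustration.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))

def ndcg_at_k(retrieved, relevant, k):
    """NDCG@k with binary relevance.

    retrieved: ranked list of chunk ids; relevant: set of relevant ids.
    """
    gains = [1.0 if chunk_id in relevant else 0.0 for chunk_id in retrieved[:k]]
    # Ideal ranking: all relevant chunks (up to k) placed at the top.
    ideal = dcg([1.0] * min(len(relevant), k))
    return dcg(gains) / ideal if ideal > 0 else 0.0

relevant = {"a", "b"}
perfect = ndcg_at_k(["a", "b", "x"], relevant, k=3)  # both relevant chunks on top
shuffled = ndcg_at_k(["a", "x", "b"], relevant, k=3)  # one relevant chunk pushed down
print(round(perfect, 3), round(shuffled, 3))
```

Both rankings contain the same relevant chunks, so Precision@3 and Recall@3 are identical; only NDCG distinguishes them.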
Practical Example:
Query: "What is the GST rate for smartphones in India?"
Ground truth: Chunks #3, #7, #12 are relevant.
| Metric | Retrieved: [#3, #5, #7, #9, #12] | Retrieved: [#1, #5, #7, #9, #3] |
|---|---|---|
| Precision@5 | 3/5 = 0.60 | 2/5 = 0.40 |
| Recall@5 | 3/3 = 1.00 | 2/3 = 0.67 |
| Reciprocal Rank (averaged over all queries to get MRR) | 1/1 = 1.00 | 1/3 = 0.33 |
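The table values can be reproduced with a few helper functions. This is a minimal sketch using the chunk numbers from the example; the function names are illustrative and not taken from any library.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / position of the first relevant chunk, or 0.0 if none is found."""
    for position, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0

relevant = {3, 7, 12}
run_a = [3, 5, 7, 9, 12]
run_b = [1, 5, 7, 9, 3]

for name, run in (("A", run_a), ("B", run_b)):
    print(
        name,
        round(precision_at_k(run, relevant, 5), 2),
        round(recall_at_k(run, relevant, 5), 2),
        round(reciprocal_rank(run, relevant), 2),
    )
```

Averaging `reciprocal_rank` over every query in the test set gives MRR.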
Which Metric to Prioritize:
- Chatbot (quick answers): MRR - you want the first result to be right
- Research (completeness): Recall - you cannot afford to miss information
- LLM context window: Precision - less noise = better LLM output
Note: Retrieval metrics require ground truth labels (which chunks are relevant for which queries). Build a test set of at least 50-100 query-relevance pairs for meaningful evaluation.
Generation Quality and Hallucination Detection
Is the LLM Making Things Up?
Types of Generation Failures:
- Extrinsic Hallucination: LLM adds facts NOT present in the retrieved context. Most dangerous in RAG.
- Intrinsic Hallucination: LLM contradicts the retrieved context.
- Irrelevant Answer: Answer is factually correct but does not address the question.
- Incomplete Answer: Answer addresses the question but misses key information from context.
Hallucination Detection Methods:
- Claim-Level Verification (RAGAS approach): Break answer into atomic claims, check each against context. Most thorough but expensive.
- NLI-Based: Use Natural Language Inference models to check if context entails the answer. Cheaper, less accurate.
- SelfCheckGPT: Sample multiple answers to the same question and check their consistency. If the model gives different answers each time, it is probably hallucinating.
- LLM-as-Judge: Ask a separate LLM: "Given this context, is the following answer faithful?" Simple but surprisingly effective.
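The consistency idea behind SelfCheckGPT can be illustrated without any model calls. The sketch below is a deliberately crude stand-in: it compares answers by token overlap (Jaccard similarity), whereas the actual method relies on stronger comparisons such as NLI or LLM prompting. The sampled answers are hard-coded examples; in practice they would come from re-running the same question at a non-zero temperature.

```python
def token_set(text):
    """Lowercased whitespace tokens of a text, as a set."""
    return set(text.lower().split())

def jaccard(a, b):
    """Token-overlap similarity between two texts (1.0 = identical token sets)."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def consistency_score(answer, sampled_answers):
    """Mean similarity between the main answer and resampled answers.
    Low values suggest the model is not stable on this question."""
    if not sampled_answers:
        return 0.0
    return sum(jaccard(answer, s) for s in sampled_answers) / len(sampled_answers)

answer = "the warranty lasts two years"
stable = ["the warranty lasts two years", "the warranty lasts two years"]
unstable = ["the warranty lasts five years", "the warranty lasts ten years"]

stable_score = consistency_score(answer, stable)
unstable_score = consistency_score(answer, unstable)
print(stable_score, round(unstable_score, 2))
```

Token overlap misses paraphrases and over-rewards shared boilerplate, which is why a semantic comparison is preferable outside of a toy example.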
Beyond RAGAS - Other Evaluation Frameworks:
| Framework | Strength | Use Case |
|---|---|---|
| RAGAS | No labels needed for faithfulness and answer relevancy, comprehensive | General RAG evaluation |
| DeepEval | More metrics, CI/CD integration | Production monitoring |
| TruLens | Feedback functions, dashboards | Development iteration |
| LangSmith | End-to-end tracing + eval | LangChain ecosystem |
Note: Extrinsic hallucination is the most dangerous RAG failure - the LLM confidently states facts that are not in the context. Always measure faithfulness before deploying.
Building a RAG Evaluation Pipeline
Practical Steps to Evaluate Your RAG System
Step 1: Create an Evaluation Dataset
- Collect 50-100 representative questions your users actually ask
- For each question, identify the relevant source chunks (ground truth for retrieval)
- Optionally, write ideal answers (ground truth for generation)
- Include edge cases: questions with no answer in the corpus, ambiguous questions, multi-part questions
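One possible shape for such a dataset is a list of small records stored as JSON Lines. The field names, tags, and the second question below are assumptions made for illustration; any schema works as long as it captures the question, the relevant chunk ids, and an optional ideal answer.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional

@dataclass
class EvalExample:
    question: str
    # Ground truth for retrieval; empty when the corpus has no answer.
    relevant_chunk_ids: List[str] = field(default_factory=list)
    # Optional ground truth for generation.
    ideal_answer: Optional[str] = None
    # Free-form labels such as "multi_part" or "no_answer_in_corpus".
    tags: List[str] = field(default_factory=list)

def to_jsonl(examples):
    """Serialize examples to JSON Lines text (one record per line)."""
    return "\n".join(json.dumps(asdict(e)) for e in examples)

def from_jsonl(text):
    """Parse JSON Lines text back into EvalExample objects."""
    return [EvalExample(**json.loads(line)) for line in text.splitlines() if line.strip()]

examples = [
    EvalExample(
        question="What is the GST rate for smartphones in India?",
        relevant_chunk_ids=["3", "7", "12"],
    ),
    EvalExample(
        question="A question the corpus cannot answer",  # placeholder edge case
        tags=["no_answer_in_corpus"],
    ),
]

restored = from_jsonl(to_jsonl(examples))
print(len(restored), restored[1].tags)
```

Keeping the edge cases tagged makes it easy to report metrics separately for them later.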
Step 2: Evaluate Retrieval Separately
- Run each question through your retriever only (no LLM generation)
- Compare retrieved chunks against ground truth
- Calculate Precision@k, Recall@k, and MRR
- If retrieval is bad (Recall < 0.7), fix retrieval first - no LLM can compensate
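A retrieval-only evaluation loop might look like the sketch below. The retriever is passed in as a plain function, and `fake_retriever` with its tiny index is a stand-in for a real vector-store query; the 0.7 gate mirrors the threshold mentioned above.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant)

def mean_recall(retriever, test_set, k=5):
    """Average Recall@k over all questions that have ground-truth chunks.

    retriever: function mapping a question to a ranked list of chunk ids.
    test_set: list of (question, set_of_relevant_chunk_ids) pairs.
    """
    recalls = []
    for question, relevant in test_set:
        if not relevant:
            continue  # "no answer in corpus" cases have no recall to measure
        recalls.append(recall_at_k(retriever(question), relevant, k))
    return sum(recalls) / len(recalls) if recalls else 0.0

# Stand-in retriever backed by a hard-coded lookup (assumption for the demo).
fake_index = {"q1": ["a", "x", "b"], "q2": ["y", "z"]}

def fake_retriever(question):
    return fake_index.get(question, [])

test_set = [("q1", {"a", "b"}), ("q2", {"c"}), ("q3", set())]
score = mean_recall(fake_retriever, test_set, k=5)
retrieval_ok = score >= 0.7
print(score, retrieval_ok)
```

Because no LLM is involved, this loop is cheap enough to run on every change to chunking or embedding settings.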
Step 3: Evaluate End-to-End with RAGAS
- Run full RAG pipeline (retrieve + generate) on each question
- Use RAGAS to compute: Faithfulness, Answer Relevancy, Context Precision
- Low faithfulness? LLM is hallucinating - improve your prompt or use Corrective RAG (CRAG)
- Low context precision? Retrieval is noisy - add re-ranking or improve chunking
- Low answer relevancy? LLM is going off-topic - improve system prompt
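The triage rules above can be encoded as a small lookup so that every evaluation run ends with a suggested next step. This sketch takes already-computed scores as a plain dictionary and does not call RAGAS. The faithfulness and answer-relevancy thresholds follow the "Needs Work" ranges given earlier; the context-precision threshold is an assumption.

```python
# Scores below these values get flagged.
THRESHOLDS = {
    "faithfulness": 0.75,
    "answer_relevancy": 0.70,
    "context_precision": 0.70,  # assumed value, not from the text
}

ACTIONS = {
    "faithfulness": "LLM may be hallucinating: improve the prompt or add Corrective RAG",
    "answer_relevancy": "LLM is going off-topic: improve the system prompt",
    "context_precision": "Retrieval is noisy: add re-ranking or improve chunking",
}

def diagnose(scores):
    """Return (metric, suggested action) pairs for every metric below threshold.
    Metrics missing from `scores` are skipped rather than flagged."""
    findings = []
    for metric, threshold in THRESHOLDS.items():
        value = scores.get(metric)
        if value is not None and value < threshold:
            findings.append((metric, ACTIONS[metric]))
    return findings

report = diagnose({
    "faithfulness": 0.62,
    "answer_relevancy": 0.88,
    "context_precision": 0.55,
})
for metric, action in report:
    print(f"{metric}: {action}")
```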
Step 4: Continuous Monitoring
- Log every production query, retrieved context, and generated answer
- Run RAGAS evaluation on a random sample daily or weekly
- Set up alerts for faithfulness drops (could indicate knowledge base staleness)
- Track metrics over time to detect regression
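Sampling and alerting need very little code. The sketch below assumes the logs are a list of records and that faithfulness has already been computed for each past evaluation run; the 0.05 drop tolerance and the function names are illustrative choices, not recommendations from any framework.

```python
import random
from statistics import mean

def sample_logs(logs, sample_size, seed=None):
    """Pick a random subset of logged interactions to evaluate."""
    if sample_size >= len(logs):
        return list(logs)
    return random.Random(seed).sample(logs, sample_size)

def faithfulness_alert(history, latest, max_drop=0.05):
    """True when the latest score falls more than `max_drop` below
    the average of previous runs."""
    if not history:
        return False  # nothing to compare against yet
    return (mean(history) - latest) > max_drop

# Hypothetical log records: (query, retrieved context, answer).
logs = [(f"query {i}", f"context {i}", f"answer {i}") for i in range(100)]
batch = sample_logs(logs, sample_size=10, seed=42)

past_scores = [0.90, 0.92, 0.91]
print(len(batch), faithfulness_alert(past_scores, latest=0.80))
```

Comparing against a rolling average rather than a fixed threshold helps catch gradual regressions as well as sudden drops.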
Note: Evaluation is not a one-time activity. Build it into your CI/CD pipeline and run it on every change to your retrieval, chunking, or prompting logic.
Interview Questions
Q: What are the four core RAGAS metrics and what does each measure?
(1) Faithfulness: Can every claim in the answer be traced to the retrieved context? Measures hallucination. (2) Answer Relevancy: Does the answer actually address the question asked? (3) Context Precision: Are the relevant chunks ranked higher than irrelevant ones in retrieval? (4) Context Recall: Does the retrieved context contain all information needed to answer fully? Together they cover both retrieval and generation quality.
Q: How would you diagnose and fix a RAG system with low faithfulness scores?
Low faithfulness means the LLM is adding information not present in the retrieved context (hallucinating). Diagnostic steps: (1) Check if retrieval is fetching relevant chunks - if not, fix retrieval first. (2) Review the system prompt - add explicit instructions like "Only answer based on the given context. Say you do not know if the context lacks the answer." (3) Reduce temperature to 0-0.1 for more deterministic output. (4) Implement Corrective RAG to filter out irrelevant chunks before generation. (5) Use a more capable model that follows instructions better.
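The prompt change in step (2) could look like the following. This is one possible template, not a prescribed wording; the chunk numbering and the sample chunks are assumptions added for the example.

```python
PROMPT_TEMPLATE = (
    "Answer the question using ONLY the context below.\n"
    "If the context does not contain the answer, say you do not know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_grounded_prompt(question, chunks):
    """Fill the template with numbered context chunks and the user question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_grounded_prompt(
    "What is the return window?",
    ["Placeholder chunk about returns.", "Placeholder chunk about shipping."],
)
print(prompt)
```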
Q: Why should you evaluate retrieval quality separately from generation quality?
Because many RAG failures are retrieval failures in disguise. If retrieval returns irrelevant chunks, even the best LLM cannot produce a good answer. By evaluating retrieval metrics (Precision, Recall, MRR) separately, you can pinpoint whether the problem is in finding the right documents or in generating the answer from them. This avoids wasting time tuning prompts when the real issue is chunking strategy or embedding model choice.