
RAG Evaluation (RAGAS, Faithfulness, Relevancy)

You Cannot Improve What You Cannot Measure

Building a RAG pipeline is easy. Knowing if it actually works well is hard. Learn the metrics, frameworks, and techniques to rigorously evaluate your RAG system - from retrieval quality to generation faithfulness.

Why RAG Evaluation Matters

The Silent Killer of RAG Projects

The Problem:

Most teams build a RAG pipeline, test it with 5-10 questions, say "looks good enough", and ship it. Three weeks later, users complain about wrong answers, hallucinations, and missing information. Without proper evaluation, you are flying blind.

RAG evaluation tells you: Is the retrieval finding the right documents? Is the LLM being faithful to the retrieved context? Is the final answer actually relevant to the user question?

Real-World Analogy - Restaurant Quality:

Imagine evaluating a restaurant:

Retrieval = Kitchen finding ingredients: Did the kitchen pull the right ingredients from the pantry? (Precision, Recall)
Context Relevancy = Ingredient quality: Are the fetched ingredients actually needed for this dish? (No extra random stuff)
Faithfulness = Chef following recipe: Did the chef use ONLY the ingredients available, or did they add imaginary spices? (Hallucination check)
Answer Relevancy = Final dish matching order: Does the served dish actually answer what the customer ordered? (Not a perfect biryani when they asked for dosa)

The Two Dimensions of RAG Evaluation:

Dimension	What It Measures	Key Metrics
Retrieval Quality	Are we finding the right documents?	Precision, Recall, MRR, NDCG
Generation Quality	Is the LLM answer correct and faithful?	Faithfulness, Relevancy, Correctness

Note: Many RAG failures turn out to be retrieval failures in disguise. Always evaluate retrieval quality separately before blaming the LLM.

RAGAS Framework - The Gold Standard

Automated RAG Evaluation Without Human Labels

What is RAGAS?

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that evaluates RAG pipelines using LLM-as-a-judge. The key innovation: metrics such as faithfulness and answer relevancy can be computed without human-labeled ground truth, because a judge LLM assesses the question, the retrieved context, and the generated answer directly.

The Four Core RAGAS Metrics:

Faithfulness (0-1): Can every claim in the answer be traced back to the retrieved context? High faithfulness = no hallucination. Measured by extracting claims from the answer and checking each against the context.
Answer Relevancy (0-1): Is the answer actually addressing the question? Measured by generating hypothetical questions from the answer and comparing them to the original question via embedding similarity.
Context Precision (0-1): Are the top-ranked retrieved chunks actually relevant? Measures if relevant chunks appear before irrelevant ones in the retrieval results.
Context Recall (0-1): Does the retrieved context contain all the information needed to answer the question? Needs ground truth answers to measure. Checks if every sentence in the ground truth can be attributed to the context.
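The Answer Relevancy idea above (compare questions generated from the answer with the original question) can be sketched in a few lines. This is a minimal illustration, not the RAGAS implementation: the embeddings are made-up 3-dimensional vectors, and the function names are hypothetical. In practice the vectors would come from an embedding model and the generated questions from an LLM.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def answer_relevancy(question_vec, generated_question_vecs):
    """Mean similarity between the original question and the
    questions an LLM generated from the answer."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)

# Toy embeddings (assumption: invented for illustration only).
original = [1.0, 0.0, 0.0]
generated = [[1.0, 0.0, 0.0],   # same direction as the original question
             [0.0, 1.0, 0.0]]   # unrelated question
score = answer_relevancy(original, generated)
print(round(score, 2))  # 0.5
```

An answer that drifts off-topic produces generated questions that point away from the original one, which pulls the mean similarity down.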

How Faithfulness is Calculated:

LLM extracts all factual claims from the generated answer
For each claim, LLM checks: "Can this claim be inferred from the retrieved context?"
Faithfulness = Number of supported claims / Total claims

Example: Answer has 5 claims. 4 are found in context, 1 is hallucinated. Faithfulness = 4/5 = 0.80
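The arithmetic above can be written as a small helper. This is a sketch only: the claims and their verdicts are placeholders, and in a real run both the claim extraction and the supported / not supported verdicts would come from the judge LLM.

```python
def faithfulness_score(verdicts):
    """verdicts: list of booleans, True when the claim is supported
    by the retrieved context."""
    if not verdicts:
        return 0.0  # assumption: an answer with no claims scores 0
    return sum(verdicts) / len(verdicts)

# Placeholder claims with judge verdicts (hypothetical data).
claims = [
    ("claim 1", True),
    ("claim 2", True),
    ("claim 3", True),
    ("claim 4", True),
    ("claim 5", False),  # the hallucinated claim
]

score = faithfulness_score([supported for _, supported in claims])
unsupported = [text for text, supported in claims if not supported]

print(score)        # 0.8
print(unsupported)  # ['claim 5']
```

Keeping the list of unsupported claims is useful in practice: it shows exactly which sentence in the answer to inspect.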

Rule-of-Thumb Score Ranges (calibrate against your own data and judge model):

Excellent RAG: Faithfulness > 0.9, Relevancy > 0.85
Good RAG: Faithfulness 0.75-0.9, Relevancy 0.7-0.85
Needs Work: Faithfulness < 0.75, Relevancy < 0.7

Note: RAGAS is one of the most widely used RAG evaluation frameworks. It requires no manual labeling for faithfulness and answer relevancy, which makes it practical for real projects.

Retrieval Metrics Deep Dive

Measuring How Well Your Retriever Finds the Right Documents

Classical IR Metrics:

These metrics come from Information Retrieval research and measure retrieval quality independently of the LLM.

Key Metrics Explained:

Precision@k: Of the top-k retrieved chunks, how many are actually relevant? If you retrieve 10 chunks and 7 are relevant: Precision@10 = 0.7. High precision = less noise for the LLM.
Recall@k: Of all relevant chunks in the database, how many did we retrieve? If there are 10 relevant chunks and we found 6: Recall@10 = 0.6. High recall = we are not missing important information.
MRR (Mean Reciprocal Rank): How high is the first relevant result ranked? If the first relevant result is at position 3, reciprocal rank = 1/3. Average across all queries. MRR = 1.0 means the first result is always relevant.
NDCG (Normalized Discounted Cumulative Gain): Measures overall ranking quality - relevant docs at higher positions get more credit. More nuanced than precision because it considers the position of each relevant result.
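NDCG is the least intuitive of the four, so here is a minimal sketch of the common log2-discount form. The function names and the toy relevance lists are assumptions for illustration. The ideal ranking is built from all labelled relevant chunks for the query, not only the ones that were retrieved.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: the gain at each rank is divided
    by log2(rank + 1), so later positions count for less."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(ranked_relevances, all_relevances, k):
    """ranked_relevances: relevance of each retrieved chunk, in rank order.
    all_relevances: relevance grades of every labelled relevant chunk."""
    actual = dcg(ranked_relevances[:k])
    ideal = dcg(sorted(all_relevances, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

labelled = [1, 1, 1]  # three relevant chunks, binary relevance

# Same three relevant chunks retrieved, at different positions.
early = ndcg_at_k([1, 1, 1, 0, 0], labelled, k=5)
late = ndcg_at_k([0, 0, 1, 1, 1], labelled, k=5)

print(round(early, 3), round(late, 3))
```

Both rankings have the same Precision@5 and Recall@5, but `early` scores 1.0 while `late` scores lower, which is the position sensitivity described above.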

Practical Example:

Query: "What is the GST rate for smartphones in India?"

Ground truth: Chunks #3, #7, #12 are relevant.

Metric	Retrieved: [#3, #5, #7, #9, #12]	Retrieved: [#1, #5, #7, #9, #3]
Precision@5	3/5 = 0.60	2/5 = 0.40
Recall@5	3/3 = 1.00	2/3 = 0.67
Reciprocal Rank (MRR over this single query)	1/1 = 1.00	1/3 = 0.33
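The numbers in this example can be reproduced with three short functions. This is a sketch with hypothetical function names; chunk IDs are plain integers here.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for c in retrieved[:k] if c in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0 if none is found."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1 / rank
    return 0.0

relevant = {3, 7, 12}
run_a = [3, 5, 7, 9, 12]
run_b = [1, 5, 7, 9, 3]

for name, run in (("A", run_a), ("B", run_b)):
    print(name,
          round(precision_at_k(run, relevant, 5), 2),
          round(recall_at_k(run, relevant, 5), 2),
          round(reciprocal_rank(run, relevant), 2))
# A 0.6 1.0 1.0
# B 0.4 0.67 0.33
```

MRR is the mean of `reciprocal_rank` over every query in the test set; with a single query the two are the same number.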

Which Metric to Prioritize:

Chatbot (quick answers): MRR - you want the first result to be right
Research (completeness): Recall - you cannot afford to miss information
LLM context window: Precision - less noise = better LLM output

Note: Retrieval metrics require ground truth labels (which chunks are relevant for which queries). Build a test set of roughly 50-100 labelled queries as a starting point; more is better for meaningful evaluation.

Generation Quality and Hallucination Detection

Is the LLM Making Things Up?

Types of Generation Failures:

Extrinsic Hallucination: LLM adds facts NOT present in the retrieved context. Most dangerous in RAG.
Intrinsic Hallucination: LLM contradicts the retrieved context.
Irrelevant Answer: Answer is factually correct but does not address the question.
Incomplete Answer: Answer addresses the question but misses key information from context.

Hallucination Detection Methods:

Claim-Level Verification (RAGAS approach): Break answer into atomic claims, check each against context. Most thorough but expensive.
NLI-Based: Use Natural Language Inference models to check if context entails the answer. Cheaper, less accurate.
SelfCheckGPT: Generate multiple answers, check consistency. If the model gives different answers each time, it is probably hallucinating.
LLM-as-Judge: Ask a separate LLM: "Given this context, is the following answer faithful?" Simple but surprisingly effective.
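The LLM-as-Judge method is the simplest to wire up. Here is a minimal sketch, assuming the judge model is reachable through some function that takes a prompt and returns text. The prompt wording, `is_faithful`, and the canned judges are all hypothetical; swap in your own model client.

```python
from typing import Callable

JUDGE_PROMPT = (
    "You are grading a RAG answer.\n"
    "Context:\n{context}\n\n"
    "Answer:\n{answer}\n\n"
    "Is every claim in the answer supported by the context? "
    "Reply with exactly one word: YES or NO."
)

def is_faithful(context: str, answer: str, judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether the answer is grounded in the context.

    `judge` is any callable that sends a prompt to an LLM and returns
    its text reply (assumption: you provide this wrapper yourself).
    """
    prompt = JUDGE_PROMPT.format(context=context, answer=answer)
    reply = judge(prompt).strip().upper()
    return reply.startswith("YES")

# Canned judges standing in for a real model call.
def always_yes(prompt: str) -> str:
    return "yes"

def always_no(prompt: str) -> str:
    return "No, the answer adds a detail not in the context."

context = "The return window is 30 days from delivery."
answer = "You can return items within 30 days of delivery."
print(is_faithful(context, answer, always_yes))
print(is_faithful(context, answer, always_no))
```

Forcing a one-word verdict keeps parsing trivial. A production version would usually also ask for a short justification and log it alongside the verdict.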

Beyond RAGAS - Other Evaluation Frameworks:

Framework	Strength	Use Case
RAGAS	No labels needed for faithfulness and answer relevancy, comprehensive	General RAG evaluation
DeepEval	More metrics, CI/CD integration	Production monitoring
TruLens	Feedback functions, dashboards	Development iteration
LangSmith	End-to-end tracing + eval	LangChain ecosystem

Note: Extrinsic hallucination is the most dangerous RAG failure - the LLM confidently states facts that are not in the context. Always measure faithfulness before deploying.

Building a RAG Evaluation Pipeline

Practical Steps to Evaluate Your RAG System

Step 1: Create an Evaluation Dataset

Collect 50-100 representative questions your users actually ask
For each question, identify the relevant source chunks (ground truth for retrieval)
Optionally, write ideal answers (ground truth for generation)
Include edge cases: questions with no answer in the corpus, ambiguous questions, multi-part questions
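One possible shape for such a dataset is a small record per question, stored as JSON Lines so it is easy to diff and review. The field names, the `tags` convention, and the second question are assumptions for illustration, not a required schema.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional

@dataclass
class EvalExample:
    question: str
    relevant_chunk_ids: List[str] = field(default_factory=list)  # retrieval ground truth
    ideal_answer: Optional[str] = None                           # generation ground truth (optional)
    tags: List[str] = field(default_factory=list)                # e.g. edge-case markers

examples = [
    EvalExample(
        question="What is the GST rate for smartphones in India?",
        relevant_chunk_ids=["3", "7", "12"],
    ),
    # Hypothetical edge case: nothing in the corpus answers this.
    EvalExample(
        question="What is the warranty policy for products we do not sell?",
        tags=["no_answer_in_corpus"],
    ),
]

def to_jsonl(items):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(item)) for item in items)

def from_jsonl(text):
    """Parse JSON Lines back into EvalExample records."""
    return [EvalExample(**json.loads(line))
            for line in text.splitlines() if line.strip()]

restored = from_jsonl(to_jsonl(examples))
print(len(restored), restored[1].tags)
```

Tagging edge cases lets you report metrics per category later, for example how often the system correctly declines to answer.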

Step 2: Evaluate Retrieval Separately

Run each question through your retriever only (no LLM generation)
Compare retrieved chunks against ground truth
Calculate Precision@k, Recall@k, and MRR
If retrieval is bad (Recall < 0.7), fix retrieval first - no LLM can compensate
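A retrieval-only evaluation loop might look like the sketch below. The `retrieve` callable, the canned results, and the report keys are assumptions; the 0.7 recall gate is the threshold from the step above.

```python
from statistics import mean

def evaluate_retriever(retrieve, dataset, k=5, min_recall=0.7):
    """retrieve: callable mapping a question to a ranked list of chunk IDs.
    dataset: iterable of (question, set_of_relevant_chunk_ids)."""
    precisions, recalls, reciprocal_ranks = [], [], []
    for question, relevant in dataset:
        if not relevant:
            continue  # no-answer questions have nothing to retrieve
        retrieved = retrieve(question)[:k]
        hits = sum(1 for chunk in retrieved if chunk in relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant))
        first_hit = next((rank for rank, chunk in enumerate(retrieved, start=1)
                          if chunk in relevant), None)
        reciprocal_ranks.append(1 / first_hit if first_hit else 0.0)
    if not recalls:
        raise ValueError("dataset has no questions with labelled relevant chunks")
    recall = mean(recalls)
    return {
        "precision_at_k": mean(precisions),
        "recall_at_k": recall,
        "mrr": mean(reciprocal_ranks),
        "retrieval_ok": recall >= min_recall,  # gate before touching the LLM
    }

# Canned retriever output standing in for a real vector search.
canned = {"q1": ["a", "x", "b"], "q2": ["y", "c", "z"]}
dataset = [("q1", {"a", "b"}), ("q2", {"c", "d"})]

report = evaluate_retriever(lambda q: canned[q], dataset, k=3)
print(report)
```

If `retrieval_ok` is False, stop and work on chunking, embeddings, or re-ranking before running the end-to-end evaluation.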

Step 3: Evaluate End-to-End with RAGAS

Run full RAG pipeline (retrieve + generate) on each question
Use RAGAS to compute: Faithfulness, Answer Relevancy, Context Precision
Low faithfulness? LLM is hallucinating - tighten your prompt or use Corrective RAG (CRAG)
Low context precision? Retrieval is noisy - add re-ranking or improve chunking
Low answer relevancy? LLM is going off-topic - improve system prompt
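The diagnosis rules above are easy to encode so that every evaluation run ends with a to-do list. This is a sketch only: the 0.75 threshold and the example scores are illustrative, not RAGAS defaults, and the score dictionary is assumed to come from your own evaluation run.

```python
def triage(scores, threshold=0.75):
    """Map low scores to the fixes suggested in Step 3.

    scores: dict with keys 'faithfulness', 'context_precision',
    'answer_relevancy' (missing keys are treated as passing).
    """
    advice = []
    if scores.get("faithfulness", 1.0) < threshold:
        advice.append("faithfulness: tighten the prompt or add Corrective RAG (CRAG)")
    if scores.get("context_precision", 1.0) < threshold:
        advice.append("context_precision: add re-ranking or improve chunking")
    if scores.get("answer_relevancy", 1.0) < threshold:
        advice.append("answer_relevancy: improve the system prompt")
    return advice

# Made-up scores for illustration.
scores = {"faithfulness": 0.62, "answer_relevancy": 0.88, "context_precision": 0.70}
for line in triage(scores):
    print(line)
```

With these made-up scores the helper flags faithfulness and context precision and leaves answer relevancy alone.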

Step 4: Continuous Monitoring

Log every production query, retrieved context, and generated answer
Run RAGAS evaluation on a random sample daily or weekly
Set up alerts for faithfulness drops (could indicate knowledge base staleness)
Track metrics over time to detect regression
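A minimal sketch of the sampling and alerting logic follows. It assumes each sampled log row has already been scored for faithfulness (in practice you would run your evaluator on the sample first). The function names, the 0.05 drop tolerance, and the synthetic logs are assumptions.

```python
import random
from statistics import mean

def sample_logs(logs, sample_size, seed=None):
    """Pick a random subset of logged interactions to evaluate."""
    rng = random.Random(seed)
    return rng.sample(logs, min(sample_size, len(logs)))

def faithfulness_dropped(sample_scores, baseline, max_drop=0.05):
    """Return (alert, current_mean). alert is True when the sample mean
    fell more than max_drop below the baseline."""
    current = mean(sample_scores)
    return (baseline - current) > max_drop, current

# Synthetic production logs with precomputed scores (illustration only).
logs = [{"query": f"q{i}", "faithfulness": 0.9 if i % 2 else 0.7}
        for i in range(100)]

sample = sample_logs(logs, sample_size=20, seed=42)
alert, current = faithfulness_dropped(
    [row["faithfulness"] for row in sample], baseline=0.9)
print(alert, round(current, 2))
```

Storing `current` for each run gives you the time series needed to spot slow regressions as well as sudden drops.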

Note: Evaluation is not a one-time activity. Build it into your CI/CD pipeline and run it on every change to your retrieval, chunking, or prompting logic.

Interview Questions

Q: What are the four core RAGAS metrics and what does each measure?

(1) Faithfulness: Can every claim in the answer be traced to the retrieved context? Measures hallucination. (2) Answer Relevancy: Does the answer actually address the question asked? (3) Context Precision: Are the relevant chunks ranked higher than irrelevant ones in retrieval? (4) Context Recall: Does the retrieved context contain all information needed to answer fully? Together they cover both retrieval and generation quality.

Q: How would you diagnose and fix a RAG system with low faithfulness scores?

Low faithfulness means the LLM is adding information not present in the retrieved context (hallucinating). Diagnostic steps: (1) Check if retrieval is fetching relevant chunks - if not, fix retrieval first. (2) Review the system prompt - add explicit instructions like "Only answer based on the given context. Say you do not know if the context lacks the answer." (3) Reduce temperature to 0-0.1 for more deterministic output. (4) Implement Corrective RAG to filter out irrelevant chunks before generation. (5) Use a more capable model that follows instructions better.

Q: Why should you evaluate retrieval quality separately from generation quality?

Because many RAG failures are retrieval failures in disguise. If retrieval returns irrelevant chunks, even the best LLM cannot produce a good answer. By evaluating retrieval metrics (Precision, Recall, MRR) separately, you can pinpoint whether the problem is in finding the right documents or in generating the answer from them. This avoids wasting time tuning prompts when the real issue is chunking strategy or embedding model choice.

