
Advanced RAG

Beyond Basic Retrieval - Making RAG Actually Work in Production

Basic RAG often tops out around 60% answer accuracy. Advanced RAG techniques like hybrid search, re-ranking, and corrective RAG can push that to 90%+. Learn the techniques that separate toy demos from production systems.

What is Advanced RAG?

Why Basic RAG Falls Short in Production

The Core Problem:

Basic RAG is simple - embed documents, store in vector DB, retrieve top-k chunks, pass to LLM. But in real production scenarios, this naive approach fails badly. Why? Because semantic similarity alone is not enough to find the right information.

Think of it like Google Search in 2005 vs 2025. Early Google just matched keywords. Modern Google understands intent, context, freshness, authority, and re-ranks results intelligently. Advanced RAG does the same for your LLM applications.

Real-World Analogy - Library Research:

Imagine you are researching "GST impact on small businesses in India" in a massive library:

Basic RAG: Like searching only by book title similarity. You might get a book about "GST Laws" but miss the chapter on small business impact buried in an economics textbook.
Advanced RAG: Like having a librarian who searches by topic AND keywords, checks multiple catalogs, verifies each source is relevant, and arranges them in the most useful order for you.

The Advanced RAG Pipeline:

Pre-Retrieval: Query transformation, expansion, decomposition
Retrieval: Hybrid search combining dense + sparse methods
Post-Retrieval: Re-ranking, filtering, compression
Generation: Context-aware prompting with verified chunks
Validation: Corrective RAG to check and self-correct

Basic vs Advanced RAG Comparison:

Aspect	Basic RAG	Advanced RAG
Retrieval	Vector search only	Hybrid (vector + keyword)
Ranking	Cosine similarity	Cross-encoder re-ranking
Query	Raw user query	Transformed + expanded
Accuracy	60-70%	85-95%

Note: Advanced RAG is not one technique - it is a collection of improvements at every stage of the retrieval pipeline. Most production RAG systems use at least 3-4 of these techniques together.

Hybrid Search - Best of Both Worlds

Combining Semantic Understanding with Keyword Precision

Why Hybrid Search?

Vector (dense) search understands meaning but misses exact terms. Keyword (sparse) search catches exact terms but misses synonyms. Hybrid search combines both to get the best of both worlds.

Example: If you search "Aadhaar card linking deadline" - vector search might return general identity verification docs. BM25 keyword search catches the exact term "Aadhaar." Together, they nail it.

How Hybrid Search Works:

Dense Retrieval (Vector): Uses embedding models (like text-embedding-3-large) to capture semantic meaning. Great for "how to reduce server costs" matching "cloud infrastructure optimization."
Sparse Retrieval (BM25/SPLADE): Term-based matching. BM25 scores documents with a TF-IDF-style formula, while SPLADE learns sparse term weights with a neural model. Great for exact names, codes, product IDs, and technical terms.
Score Fusion: Results from both are combined using Reciprocal Rank Fusion (RRF) or weighted linear combination. RRF formula: score = sum of 1/(k + rank) across methods.

Reciprocal Rank Fusion (RRF) - The Magic Formula:

For each document, RRF score = Sum over all retrievers of 1/(k + rank_i), where k is typically 60.

If a document is ranked #1 by vector and #3 by BM25: RRF = 1/(60+1) + 1/(60+3) = 0.0164 + 0.0159 = 0.0323

RRF is popular because it does not need score normalization - it only uses rank positions, so you can mix any retrieval methods.
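The RRF formula above can be sketched in a few lines of Python. This is a minimal illustration, not any vector database's implementation: the function name reciprocal_rank_fusion and the document IDs are made up for the example, and ranks are assumed to start at 1.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs using only rank positions."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical rankings from two retrievers over the same corpus.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
bm25_ranking = ["doc_b", "doc_c", "doc_a"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Here doc_a is ranked #1 by vector search and #3 by BM25, so it gets the same 0.0323 worked out above. doc_b comes out slightly ahead because it is ranked #2 and #1.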

Supported Vector Databases:

Weaviate: Native hybrid search with alpha parameter (0=BM25, 1=vector)
Qdrant: Supports sparse+dense via named vectors
Pinecone: Hybrid search with sparse-dense vectors
Elasticsearch: kNN + BM25 in single query
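To make the idea of an alpha weight concrete, here is a generic sketch of weighted linear combination, the alternative to RRF mentioned earlier. It is an assumption-laden illustration, not how Weaviate or any other database listed above implements fusion internally: the function names, the min-max normalization step, and the sample scores are all hypothetical. It follows the same convention as above (alpha=0 means BM25 only, alpha=1 means vector only).

```python
def min_max(scores):
    """Scale raw scores to [0, 1] so the two retrievers become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(vector_scores, bm25_scores, alpha=0.5):
    """alpha=1.0 keeps only the vector signal, alpha=0.0 only BM25."""
    v = min_max(vector_scores)
    b = min_max(bm25_scores)
    fused = {
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical raw scores: cosine similarities and BM25 scores use different scales.
vector_scores = {"a": 0.9, "b": 0.8, "c": 0.1}
bm25_scores = {"b": 12.0, "c": 7.0, "d": 3.0}

balanced = weighted_fusion(vector_scores, bm25_scores, alpha=0.5)
print([doc_id for doc_id, _ in balanced])
```

Unlike RRF, this approach needs the normalization step because raw cosine and BM25 scores are not on the same scale.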

Note: In most benchmarks, hybrid search outperforms pure vector search by 10-20% on retrieval accuracy. It is often one of the biggest single improvements you can make to a basic RAG pipeline.

Re-Ranking - The Quality Filter

Why Retrieved Documents Need a Second Opinion

The Re-Ranking Concept:

Initial retrieval is fast but approximate. Re-ranking takes the top-k results and re-scores them using a more powerful model that looks at the query-document pair together. It is like a two-stage hiring process - initial screening is fast (bi-encoder or keyword retrieval), final interview is thorough (cross-encoder).

Bi-Encoder vs Cross-Encoder:

Feature	Bi-Encoder (Retrieval)	Cross-Encoder (Re-ranking)
Speed	Very fast (pre-computed embeddings)	Slow (processes pair together)
Accuracy	Good	Excellent
How it works	Embeds query and doc separately	Processes query+doc as single input
Scalability	Millions of docs	Only top 20-50 results
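The two-stage idea can be sketched without committing to a specific model. In the snippet below, rerank takes any function that scores a query and a document together. The toy_pair_score function is only a word-overlap stand-in for a real cross-encoder, used so the example runs without downloading a model. All names and documents here are hypothetical.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Re-score each (query, document) pair and keep the best top_n documents."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def toy_pair_score(query, doc):
    """Stand-in for a cross-encoder: fraction of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Pretend these came back from the fast first-stage retriever.
candidates = [
    "Samsung flagship phone review",
    "budget phone under 15000 with good camera",
    "laptop deals",
]
top = rerank("best phone under 15000 with good camera", candidates, toy_pair_score, top_n=2)
print(top)
```

In a real pipeline you would pass a model-backed scoring function instead of toy_pair_score. The surrounding logic stays the same.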

Popular Re-Ranking Models:

Cohere Rerank: API-based, easy to integrate, production-ready
BGE Reranker v2: Open-source, runs locally, great accuracy
ColBERT: Late interaction model - faster than cross-encoders with similar quality
LLM-as-Reranker: Use GPT-4 or Claude to judge relevance (expensive but powerful)

Indian E-commerce Example:

Search: "best phone under 15000 with good camera"

Initial retrieval: 50 product descriptions
Re-ranker sees query + each doc together
Promotes: Realme Narzo with 50MP camera review at Rs 13,999
Demotes: Samsung S24 Ultra review (great camera but Rs 1,29,999)

Note: Re-ranking typically improves retrieval quality by 5-15%. A cross-encoder is slow per document, but the total cost stays manageable because it only scores the top 20-50 results. Use it in production whenever your latency budget allows.

Corrective RAG (CRAG) - Self-Healing Retrieval

When RAG Gets It Wrong - Detect and Correct Automatically

What is CRAG?

Corrective RAG adds a self-evaluation step after retrieval. It checks whether the retrieved documents are actually relevant to the question. If they are not, it triggers a corrective action - like falling back to web search or reformulating the query.

Analogy: Imagine a doctor ordering a blood test. If results look abnormal, they do not just prescribe blindly - they order a second test or a different type of test to confirm. CRAG does the same for retrieved documents.

CRAG Decision Flow:

Retrieve: Get top-k documents from vector store
Grade: Use an LLM or classifier to check relevance of each document (Correct / Incorrect / Ambiguous)
If Correct: Proceed with generation using the retrieved context
If Incorrect: Trigger web search or alternative retrieval to find better sources
If Ambiguous: Combine original retrieval with web search results
Refine: Remove irrelevant sentences from retrieved passages before generation
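The decision flow above can be sketched as a single function. This is a simplified illustration under stated assumptions, not a reference CRAG implementation: the retriever, grader, web search, and refiner are passed in as plain functions, the fake_* helpers are hypothetical stand-ins, and the rule for turning per-document grades into one decision (any Correct document is enough, all Incorrect triggers fallback, everything else is Ambiguous) is one reasonable choice.

```python
from enum import Enum


class Grade(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    AMBIGUOUS = "ambiguous"


def corrective_retrieve(query, retrieve, grade, web_search, refine):
    """Return (action, context) following the retrieve -> grade -> correct flow."""
    docs = retrieve(query)
    grades = [grade(query, doc) for doc in docs]

    if any(g is Grade.CORRECT for g in grades):
        # Assumed rule: one clearly relevant document is enough to proceed.
        action = "use_retrieved"
        context = [d for d, g in zip(docs, grades) if g is Grade.CORRECT]
    elif all(g is Grade.INCORRECT for g in grades):
        # Nothing usable (or nothing retrieved at all): fall back to web search.
        action = "web_search"
        context = web_search(query)
    else:
        # Ambiguous: keep the uncertain documents and add web results.
        action = "combine"
        kept = [d for d, g in zip(docs, grades) if g is Grade.AMBIGUOUS]
        context = kept + web_search(query)

    # Refine step: clean every passage before it reaches the generator.
    return action, [refine(query, passage) for passage in context]


# Hypothetical stand-ins so the sketch runs without any external service.
def fake_retrieve(query):
    return ["GST notes for small businesses", "Cricket scores"]


def fake_grade(query, doc):
    return Grade.CORRECT if "GST" in doc else Grade.INCORRECT


def fake_web_search(query):
    return ["web result for: " + query]


def keep_as_is(query, passage):
    return passage


action, context = corrective_retrieve(
    "GST impact on small businesses", fake_retrieve, fake_grade, fake_web_search, keep_as_is
)
print(action, context)
```

In practice the grade function would be an LLM call or a small classifier, which is where the extra latency comes from.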

Why CRAG Matters:

Reduces Hallucination: Bad retrievals lead to confident but wrong answers. CRAG catches this.
Handles Knowledge Gaps: If your vector store does not have the answer, CRAG falls back to web search instead of hallucinating.
Self-Improving: If you log the grades, the grading step creates a feedback loop that helps identify weak areas in your knowledge base.

CRAG vs Self-RAG vs Standard RAG:

Feature	Standard RAG	CRAG	Self-RAG
Retrieval check	None	After retrieval	Before + after
Fallback	None	Web search	Regenerate
Complexity	Low	Medium	High

Note: CRAG adds latency (extra LLM call for grading) but significantly reduces hallucination. Use it when accuracy matters more than speed - medical, legal, financial applications.

Query Transformation & Pre-Retrieval Techniques

Fix the Question Before Searching for Answers

Why Transform Queries?

Users ask vague, ambiguous, or complex questions. Passing these raw queries to retrieval gives poor results. Query transformation rewrites the question to be more retrieval-friendly.

Like when you search on Google - you naturally simplify your question. "Why is my stomach hurting after eating paneer last night?" becomes "paneer food poisoning symptoms." LLMs can do this automatically.

Key Techniques:

Query Rewriting: LLM rewrites the user query to be more specific. "Tell me about that new tax rule" becomes "India new income tax regime 2025 changes."
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer first, then use that answer to search. The fake answer is closer in embedding space to real answers than the question is.
Step-Back Prompting: For specific questions, ask a broader question first. "What is the melting point of gold at 5 atm?" becomes "What are the physical properties of gold?"
Multi-Query: Generate 3-5 different versions of the same question from different angles. Take the union of the results for better recall.
Query Decomposition: Break a complex question into sub-questions. "Compare TCS and Infosys revenue and employee count" becomes two separate queries.
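As one concrete example from the list above, Multi-Query boils down to running several query variants and merging the results without duplicates. The sketch below assumes the variants have already been generated (in practice by an LLM). FAKE_INDEX and fake_retrieve are hypothetical stand-ins for a real retriever.

```python
def multi_query_retrieve(query_variants, retrieve, top_k=5):
    """Run every variant and return the union of results, first-seen order preserved."""
    seen = set()
    merged = []
    for variant in query_variants:
        for doc_id in retrieve(variant)[:top_k]:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


# Hypothetical retriever: maps each query variant to a fixed ranked list.
FAKE_INDEX = {
    "new income tax regime changes": ["doc_tax_slabs", "doc_rebate"],
    "income tax slab rates": ["doc_tax_slabs", "doc_surcharge"],
    "tax deductions under new regime": ["doc_rebate", "doc_deductions"],
}


def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])


results = multi_query_retrieve(list(FAKE_INDEX), fake_retrieve)
print(results)
```

Each variant alone finds two documents, while the union finds four. That is the recall gain this technique is after. Query Decomposition can reuse the same merge step with sub-questions instead of paraphrases.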

HyDE Deep Dive:

HyDE is counterintuitive but brilliant. Instead of embedding the question, you:

Ask the LLM to generate a hypothetical answer (it may hallucinate, and that is OK)
Embed that hypothetical answer
Use that embedding to search your vector store

Why this works: The hypothetical answer uses vocabulary and structure similar to the real answers in your knowledge base, making vector similarity more effective.
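The three HyDE steps can be sketched end to end with toy components. Everything model-related here is a stand-in and an assumption: embed is a simple bag-of-words counter rather than a real embedding model, and fake_llm returns a canned answer instead of calling an LLM. The point is only to show the order of operations.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(question, generate_hypothetical, documents, top_k=1):
    # Step 1: generate a hypothetical answer (it may be factually wrong).
    hypothetical = generate_hypothetical(question)
    # Step 2: embed the hypothetical answer instead of the question.
    query_vec = embed(hypothetical)
    # Step 3: search the store with that embedding.
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def fake_llm(question):
    """Hypothetical LLM call returning a plausible-sounding answer."""
    return "you can reduce cloud spend by rightsizing instances"


documents = [
    "the office cafeteria menu changes every monday",
    "reduce cloud spend by rightsizing instances and deleting idle volumes",
]
best = hyde_search("how to reduce server costs", fake_llm, documents)
print(best)
```

Note that the question shares only the word "reduce" with the relevant document, while the hypothetical answer shares most of its words. That gap is what HyDE exploits.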

Note: Query transformation is the most underrated Advanced RAG technique. It is cheap (one LLM call), fast, and can improve retrieval quality by 15-25%.

Common Advanced RAG Pitfalls

Mistakes That Break Your RAG Pipeline

Pitfall 1: Over-Engineering

Do not implement all techniques at once. Start with hybrid search + re-ranking (biggest bang for effort), then add others based on where your pipeline fails. Measure before and after each change.

Pitfall 2: Wrong Chunking Strategy

No amount of advanced retrieval fixes bad chunking. If your chunks are too large, re-ranking cannot isolate the answer. Too small, and context is lost. Use recursive character splitting with 512-1024 token chunks and 10-20% overlap for most cases.
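To show what chunk size and overlap mean in practice, here is a minimal sliding-window chunker. It is deliberately simpler than recursive character splitting (it ignores paragraph and sentence boundaries), and it treats whitespace-separated words as tokens, which only approximates a real tokenizer. The function name and defaults are assumptions for illustration. 64 tokens of overlap on 512-token chunks is 12.5%, inside the 10-20% range above.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # The last chunk already reaches the end of the document.
    return chunks


# Tiny demo with small numbers so the overlap is easy to see.
demo = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
print(demo)

# Approximate tokens with whitespace splitting for real text.
words = ("word " * 1200).split()
print(len(chunk_tokens(words)))
```

In the small demo each chunk starts with the last token of the previous one, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.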

Pitfall 3: Ignoring Metadata Filtering

If your knowledge base has documents from different departments, time periods, or categories - always filter by metadata BEFORE vector search. Vector search across 1 million unfiltered docs is worse than searching 10,000 relevant docs.
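A small in-memory sketch of filter-then-search is shown below. Real vector databases expose their own filter syntax, so treat the record layout, the filtered_search function, and the dept/year fields as hypothetical. The important part is the order: narrow the candidate set by metadata first, then rank by vector similarity.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def filtered_search(query_vec, records, filters, top_k=3):
    """Keep only records whose metadata matches every filter, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda r: dot(query_vec, r["vector"]), reverse=True)
    return [r["id"] for r in ranked[:top_k]]


# Hypothetical records with tiny 2-dimensional vectors.
records = [
    {"id": "hr-2023", "vector": [0.9, 0.1], "metadata": {"dept": "hr", "year": 2023}},
    {"id": "hr-2024", "vector": [0.8, 0.2], "metadata": {"dept": "hr", "year": 2024}},
    {"id": "fin-2024", "vector": [1.0, 0.0], "metadata": {"dept": "finance", "year": 2024}},
]

hits = filtered_search([1.0, 0.0], records, {"dept": "hr", "year": 2024})
print(hits)
```

Without the filter, the finance document would rank first on similarity alone even though it belongs to the wrong department.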

Pitfall 4: Not Evaluating Retrieval Separately

Most teams only evaluate the final LLM output. But if retrieval is broken, no LLM can fix it. Always measure retrieval metrics (precision, recall, MRR) separately from generation metrics.
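The three retrieval metrics named above are easy to compute once you have, for each test query, the ranked list of retrieved document IDs and the set of IDs a human marked as relevant. The sketch below uses made-up IDs and function names. It assumes binary relevance (a document is either relevant or not).

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant document across queries."""
    if not runs:
        return 0.0
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)


# Two hypothetical test queries: (retrieved ranking, relevant set).
runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),
    (["d2", "d9"], {"d2"}),
]
print(precision_at_k(*runs[0], k=3))
print(recall_at_k(*runs[0], k=3))
print(mean_reciprocal_rank(runs))
```

Running these on a fixed test set before and after each pipeline change tells you whether a drop in answer quality comes from retrieval or from generation.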

Pitfall 5: Latency Explosion

Each advanced technique adds latency: query rewriting (1-2s), hybrid search (+200ms), re-ranking (+500ms), CRAG grading (+2s). For a chatbot needing sub-3s responses, pick only 1-2 techniques.
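The budget arithmetic can be made explicit with a small helper. The added latencies below are the rough figures from the paragraph above (using the upper end of the 1-2s range for query rewriting). The 1000 ms base latency for plain retrieval plus generation is purely an assumption for the example, so measure your own.

```python
# Rough per-technique overheads in milliseconds, taken from the figures above.
ADDED_LATENCY_MS = {
    "query_rewriting": 2000,  # upper end of the 1-2s range
    "hybrid_search": 200,
    "re_ranking": 500,
    "crag_grading": 2000,
}


def fits_budget(techniques, base_ms, budget_ms=3000):
    """Return (estimated total latency, whether it fits the budget)."""
    total = base_ms + sum(ADDED_LATENCY_MS[name] for name in techniques)
    return total, total <= budget_ms


BASE_MS = 1000  # Assumed latency of basic retrieval + generation.

print(fits_budget(["hybrid_search", "re_ranking"], BASE_MS))
print(fits_budget(list(ADDED_LATENCY_MS), BASE_MS))
```

With these numbers, hybrid search plus re-ranking lands at 1700 ms and fits a sub-3s budget, while enabling all four techniques lands at 5700 ms and does not.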

Note: Advanced RAG is about choosing the right techniques for YOUR use case, not implementing everything. Always benchmark each addition against your accuracy and latency requirements.

Interview Questions

Q: What is hybrid search and why is it better than pure vector search?

Hybrid search combines dense retrieval (vector/semantic search) with sparse retrieval (BM25/keyword search). Vector search captures meaning but misses exact terms. BM25 catches exact keywords but misses synonyms. Together, they cover both gaps. Results are combined using Reciprocal Rank Fusion (RRF) which merges ranked lists without needing score normalization. Benchmarks show 10-20% improvement over pure vector search.

Q: Explain how Corrective RAG (CRAG) works and when you would use it.

CRAG adds a grading step after retrieval. An LLM or classifier evaluates whether each retrieved document is relevant to the query (Correct/Incorrect/Ambiguous). If documents are irrelevant, CRAG triggers a fallback like web search. If ambiguous, it combines both sources. Use CRAG when accuracy is critical (medical, legal, financial) and you can tolerate the extra latency of the grading step. It significantly reduces hallucination from bad retrievals.

Q: What is HyDE and why does generating a fake answer help retrieval?

HyDE (Hypothetical Document Embeddings) generates a hypothetical answer to the query using an LLM, then embeds that answer for vector search. This works because the hypothetical answer uses vocabulary and structure similar to real documents in the knowledge base, making it closer in embedding space than the raw question. The fake answer may hallucinate facts, but its embedding neighborhood matches real answers better than the question embedding does.

