RAG Architecture & Pipeline Design
Build Production-Grade Retrieval Augmented Generation Systems
RAG is how you make LLMs answer questions about YOUR data. Learn to design, build, and optimize RAG pipelines that actually work in production - from document ingestion to answer generation.
What is RAG & Why It Matters
RAG = Give LLMs Access to Your Private Data
The Fundamental Problem:
LLMs like GPT-4 and Claude are trained on public internet data. They know about Python, history, and cooking recipes. But they do NOT know about your company's internal docs, your product manuals, your customer data, or anything private. RAG solves this.
RAG (Retrieval Augmented Generation) = Instead of fine-tuning the model on your data (expensive, slow, stale), you RETRIEVE relevant documents at query time and pass them as context to the LLM. The LLM then generates answers grounded in YOUR data.
The Swiggy Customer Support Analogy:
Imagine you call Swiggy customer support about a missing item. The support agent (LLM) does not memorize every order ever placed. Instead, they SEARCH your order history (retrieval), FIND the relevant order (context), and then ANSWER your question based on that specific order data (generation). That is RAG.
- Without RAG: "I am an AI, I don't have access to your orders" (useless)
- With RAG: "I can see your order #12345 from Domino's is missing a garlic bread. Let me process your refund." (actually helpful)
RAG vs Fine-tuning vs Prompt Engineering:
| Approach | Cost | Data Freshness | Best For |
|---|---|---|---|
| Prompt Engineering | Free | Static | Small context, few examples |
| RAG | Low-Medium | Real-time | Large knowledge bases, changing data |
| Fine-tuning | High | Stale (retrain needed) | Style/behavior changes, specialized domains |
Note: RAG is one of the most widely used patterns in production AI applications, and a large share of enterprise AI deployments use some form of it. Master this and you can build a wide range of AI applications.
RAG Pipeline Architecture - The 5 Stages
End-to-End RAG Pipeline Design
Stage 1: Document Ingestion
This is where raw documents (PDFs, websites, databases, APIs) are loaded into the system. Think of it as the "loading dock" of a warehouse.
- Document Loaders: PyPDF, Unstructured, Docling, web scrapers
- Formats: PDF, DOCX, HTML, Markdown, CSV, JSON, databases
- Challenges: Tables in PDFs, images with text, scanned documents
- Indian Example: Loading all SEBI circulars, RBI notifications, and GST updates for a compliance chatbot
Stage 2: Chunking & Processing
Documents are split into smaller pieces (chunks) because LLMs have context limits and smaller chunks give more precise retrieval.
- Fixed Size: Split every 500 tokens with 50 token overlap
- Recursive: Split by paragraphs, then sentences, then words
- Semantic: Split based on meaning boundaries (topic changes)
- Document-aware: Respect headers, sections, tables as boundaries
Golden Rule: Chunk size = what a human would consider "one complete thought"
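A minimal sketch of the fixed-size strategy (500 tokens, 50 token overlap). The function name `chunk_fixed` is made up for illustration, and the tokens here are just a plain list of strings; a real pipeline would use the tokenizer of its embedding model.

```python
def chunk_fixed(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Assumption: a fake 1200-token document, used only to show the window arithmetic.
tokens = [f"w{i}" for i in range(1200)]
chunks = chunk_fixed(tokens, size=500, overlap=50)
print([len(c) for c in chunks])  # [500, 500, 300]
```

The overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing some tokens twice.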
Stage 3: Embedding & Storage
Each chunk is converted to a vector (array of numbers) that captures its semantic meaning, then stored in a vector database.
- Embedding Models: OpenAI text-embedding-3-small, Cohere embed-v3, sentence-transformers
- Vector DBs: Pinecone (managed), ChromaDB (local), Weaviate (hybrid), Qdrant (fast)
- Metadata: Store source, page number, section, date alongside vectors
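To make the "vector plus metadata" idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: `toy_embed` is a hashed bag-of-words stand-in (it does NOT capture meaning the way a real embedding model does), `InMemoryStore` is a plain Python list rather than a real vector database, and the sample policy text is invented.

```python
import hashlib
import math
from dataclasses import dataclass, field

DIM = 64  # toy dimension; real models use hundreds or thousands


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalized. Stand-in only, not semantic."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        idx = int(hashlib.sha256(word.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class Record:
    chunk_id: str
    text: str
    vector: list[float]
    metadata: dict = field(default_factory=dict)


class InMemoryStore:
    """Hypothetical store: keeps each vector next to its text and metadata."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def add(self, chunk_id: str, text: str, metadata: dict) -> None:
        self.records.append(Record(chunk_id, text, toy_embed(text), dict(metadata)))


store = InMemoryStore()
store.add(
    "leave-001",
    "Casual leave is 12 days per year",  # invented sample text
    {"source": "leave_policy.pdf", "page": 3, "section": "Casual Leave", "date": "2024-01-15"},
)
```

In a real system you would swap `toy_embed` for a call to your chosen embedding model and `InMemoryStore` for your vector DB client; the shape of the record (id, text, vector, metadata) is the part worth keeping.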
Stage 4: Retrieval
When a user asks a question, find the most relevant chunks from the vector store.
- Semantic Search: Find chunks with similar meaning (cosine similarity)
- Keyword Search: BM25 for exact term matching
- Hybrid: Combine both for best results
- Top-k: Usually retrieve 3-10 most relevant chunks
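Semantic search with top-k boils down to "score every chunk against the query, keep the best k". A small sketch with hand-made 2-dimensional vectors (the vectors and the names `cosine` and `top_k` are illustrative assumptions; a vector DB does this with an approximate index instead of a full scan):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 3):
    """Brute-force scan: score every chunk, return the k highest as (id, score)."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors: "a" points almost the same way as the query, "c" is nearly orthogonal.
chunk_vecs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0]}
results = top_k([1.0, 0.1], chunk_vecs, k=2)
print([cid for cid, _ in results])  # ['a', 'b']
```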
Stage 5: Generation
Pass retrieved chunks as context to the LLM along with the user question. The LLM generates an answer grounded in the provided context.
- Context Window: Fit retrieved chunks + system prompt + user query within token limit
- Prompt Template: "Based on the following context, answer the user question. If the answer is not in the context, say so."
- Citations: Ask the LLM to cite which chunks it used for each claim
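The three points above (token budget, prompt template, citations) can be sketched as one small function. Assumptions: `build_prompt` is a hypothetical helper, word count is used as a crude stand-in for real token counting, and chunks arrive already ranked best-first.

```python
def build_prompt(question: str, chunks: list[tuple[str, str]], max_context_tokens: int = 1500) -> str:
    """chunks: (chunk_id, text) pairs, best-first. Stops adding chunks once the budget is used up."""
    parts, used = [], 0
    for chunk_id, text in chunks:
        cost = len(text.split())  # crude token estimate (assumption)
        if used + cost > max_context_tokens:
            break
        parts.append(f"[{chunk_id}] {text}")  # the ID prefix is what the model cites
        used += cost
    context = "\n".join(parts)
    return (
        "Based on the following context, answer the user question. "
        "If the answer is not in the context, say so. "
        "Cite the chunk IDs in square brackets for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


prompt = build_prompt(
    "How many casual leaves do I get?",
    [("c1", "one two three"), ("c2", "four five six")],
    max_context_tokens=4,  # tiny budget so only the first chunk fits
)
```

Dropping the lowest-ranked chunks first is one simple policy; you could also truncate or summarize chunks when the budget is tight.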
Note: The most impactful stages to optimize are Chunking (Stage 2) and Retrieval (Stage 4). Bad chunking = irrelevant retrieval = hallucinated answers, no matter how good your LLM is.
Choosing the Right Components
Technology Stack Decisions for RAG
Embedding Model Selection:
| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/1M tokens | General purpose, English |
| Cohere embed-v3 | 1024 | $0.1/1M tokens | Multilingual, search |
| BGE-M3 (open source) | 1024 | Free (self-host) | Multi-lingual, multi-granularity |
| all-MiniLM-L6-v2 | 384 | Free (local) | Fast, lightweight, prototyping |
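A quick back-of-the-envelope calculation helps when comparing rows in this table. The sketch below uses the price and dimensions listed above; the corpus size (100,000 chunks of 500 tokens) and the 4 bytes per value (float32) are assumptions, and real vector DBs add index overhead on top of the raw vector size.

```python
def embedding_cost_usd(total_tokens: int, price_per_million: float) -> float:
    """One-time cost to embed a corpus at a per-million-token price."""
    return total_tokens / 1_000_000 * price_per_million


def raw_vector_bytes(num_chunks: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone (float32 assumed), excluding index overhead."""
    return num_chunks * dims * bytes_per_value


num_chunks = 100_000       # assumption
tokens_per_chunk = 500     # assumption
total_tokens = num_chunks * tokens_per_chunk          # 50,000,000 tokens

cost_small = embedding_cost_usd(total_tokens, 0.02)   # about $1.00 at $0.02/1M
size_1536 = raw_vector_bytes(num_chunks, 1536)        # 614,400,000 bytes (~614 MB)
size_384 = raw_vector_bytes(num_chunks, 384)          # 153,600,000 bytes (~154 MB)
```

At this scale the embedding bill is small; the dimension count matters more because it drives storage and search latency.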
Vector Database Selection:
| DB | Type | Best For | Pricing |
|---|---|---|---|
| ChromaDB | Embedded | Prototyping, small data | Free (open source) |
| Pinecone | Managed cloud | Production, zero-ops | Free tier + pay per use |
| Weaviate | Self-host/Cloud | Hybrid search, multi-modal | Open source + cloud |
| Qdrant | Self-host/Cloud | High performance, filtering | Open source + cloud |
| pgvector | Postgres extension | Already using Postgres | Free (open-source extension, installed separately) |
Framework Selection:
- LangChain: Most popular, huge ecosystem, good for prototyping. Can be overly complex for simple use cases.
- LlamaIndex: Purpose-built for RAG, excellent data connectors, better for data-heavy apps.
- Haystack: Production-focused, strong pipeline abstraction, good for enterprise.
- Custom: Just use the embedding API + vector DB SDK directly. Simplest, most control, recommended for production.
Note: For most Indian startups: Start with ChromaDB + OpenAI embeddings + custom code. When you scale past 100K documents, move to Qdrant/Pinecone. Do not over-engineer early.
Building a Production RAG System Step by Step
Real-World Example: Company Knowledge Base Chatbot
Scenario: HR Policy Chatbot for a 5000-employee Indian IT Company
The company has 200+ HR policy documents (leave policy, medical insurance, appraisal process, WFH guidelines, etc.). Employees ask 500+ questions daily to HR. Goal: Build an AI chatbot that answers HR questions accurately using these documents.
Step 1: Document Ingestion Pipeline
- Load all HR PDFs using pypdf (the maintained successor to PyPDF2) or Unstructured
- Extract tables separately (leave balance tables, salary bands)
- Clean extracted text (remove headers/footers, fix OCR errors)
- Add metadata: document name, category, last updated date, applicable to (all/managers/freshers)
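One way to handle the "remove headers/footers" step is to drop any line that repeats on nearly every page. This is a rough sketch, not a complete cleaner: the function name, the 0.8 cutoff, the sample pages, and the company name are all made up, and it will not catch footers that change per page (like "Page 3 of 12").

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.8) -> list[str]:
    """Drop lines that appear on at least `min_fraction` of pages (likely headers/footers)."""
    if len(pages) < 3:
        return list(pages)  # too few pages to tell repetition from content
    counts = Counter()
    for page in pages:
        # A set, so a line repeated within one page is counted once for that page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}
    cleaned = []
    for page in pages:
        kept = [l.strip() for l in page.splitlines() if l.strip() and l.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


# Invented sample pages with the same header and footer on each.
pages = [
    "ACME HR Policy Manual\nCasual leave: eligibility rules\nConfidential",
    "ACME HR Policy Manual\nMedical insurance: coverage\nConfidential",
    "ACME HR Policy Manual\nWFH guidelines\nConfidential",
]
cleaned = strip_repeated_lines(pages)

# Metadata fields follow the list above; the values are illustrative.
metadata = {
    "document_name": "hr_policy_manual.pdf",
    "category": "leave",
    "last_updated": "2024-01-15",
    "applicable_to": "all",
}
```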
Step 2: Smart Chunking Strategy
- Use section-based chunking (each policy section = one chunk)
- Keep tables as separate chunks with table context
- Chunk size: 300-500 tokens with 50 token overlap
- Preserve hierarchy: "Leave Policy > Casual Leave > Eligibility"
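Section-based chunking with a preserved hierarchy can be sketched like this. It assumes the extracted text marks headings with leading `#` characters (markdown style); your parser may expose headings differently, and the sample policy text is invented. Token-size limits and table handling are left out to keep the idea visible.

```python
def chunk_by_section(text: str) -> list[dict]:
    """One chunk per section, tagged with its heading path (e.g. 'A > B > C')."""
    chunks: list[dict] = []
    path: list[str] = []   # current heading trail
    body: list[str] = []   # lines collected for the current section

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({"path": " > ".join(path), "text": content})
        body.clear()

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            flush()  # close the previous section before changing the path
            level = len(stripped) - len(stripped.lstrip("#"))
            del path[level - 1:]  # drop headings at this level and deeper
            path.append(stripped.lstrip("#").strip())
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Leave Policy
## Casual Leave
### Eligibility
All confirmed employees are eligible.
### Carry Forward
Unused casual leave lapses at year end.
## Sick Leave
A medical certificate may be required.
"""
sections = chunk_by_section(doc)
print([s["path"] for s in sections])
```

Prepending the path to the chunk text before embedding is a common trick, so a chunk that only says "All confirmed employees are eligible" still carries the fact that it is about casual leave.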
Step 3: Retrieval Strategy
- Use hybrid search: semantic + keyword (for policy numbers like "HR-POL-042")
- Retrieve top 5 chunks
- Filter by metadata: if user asks about "maternity leave", filter to leave policy docs first
- Re-rank with a cross-encoder for final ordering
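Hybrid search needs some way to merge the semantic ranking and the keyword ranking into one list. The text above does not say which method to use; one common choice is Reciprocal Rank Fusion (RRF), sketched below as an assumption. The constant `k=60` is a conventional default, and the chunk IDs are invented. The cross-encoder re-rank would then run on the fused top results.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per chunk."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


semantic = ["c3", "c1", "c7", "c2"]   # e.g. from vector search
keyword = ["c9", "c3", "c1"]          # e.g. from BM25, catches exact IDs like "HR-POL-042"
fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['c3', 'c1', 'c9', 'c7', 'c2']
```

RRF only looks at ranks, not raw scores, so you do not need to normalize cosine similarities against BM25 scores.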
Step 4: Generation with Guardrails
- System prompt: "You are an HR assistant. Only answer from the provided context. If unsure, direct to HR team."
- Include chunk sources in the response for transparency
- Add guardrails: refuse salary info to non-authorized users, escalate sensitive topics
- Log all queries and responses for compliance audit
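A rough sketch of two of these guardrails: filtering chunks by who is allowed to see them, and writing an audit record. The `allowed_roles` field, the role names, and the function names are hypothetical; map them to whatever your identity system provides.

```python
import json
from datetime import datetime, timezone


def visible_chunks(chunks: list[dict], user_role: str) -> list[dict]:
    """Keep only chunks the user may see; run this BEFORE building the prompt."""
    return [c for c in chunks if "all" in c["allowed_roles"] or user_role in c["allowed_roles"]]


def audit_entry(user_id: str, query: str, chunk_ids: list[str], answer: str) -> str:
    """One JSON line per interaction, suitable for an append-only log."""
    return json.dumps({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "query": query,
        "chunk_ids": chunk_ids,
        "answer": answer,
    })


chunks = [
    {"id": "leave-01", "allowed_roles": ["all"]},
    {"id": "salary-bands-01", "allowed_roles": ["hr_admin"]},
]
employee_view = visible_chunks(chunks, "employee")
admin_view = visible_chunks(chunks, "hr_admin")
entry = audit_entry(
    "emp-123",
    "What is the casual leave limit?",
    [c["id"] for c in employee_view],
    "(model answer goes here)",
)
```

Filtering at retrieval time is safer than asking the LLM to withhold information, because restricted text never enters the prompt in the first place.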
Note: Start simple: 1 document type, 1 embedding model, 1 vector DB. Get it working end-to-end first, then iterate. Many RAG projects stall because the team over-engineers before validating the basic pipeline.
Common RAG Pitfalls & How to Avoid Them
RAG Anti-Patterns That Kill Production Systems
Pitfall 1: Garbage In, Garbage Out
If your documents are poorly formatted, have OCR errors, or contain outdated information, your RAG will confidently give wrong answers. Solution: Invest 60% of your time in data quality and preprocessing. Clean, well-structured documents are worth more than any fancy retrieval algorithm.
Pitfall 2: Wrong Chunk Size
Too small chunks = missing context. Too large chunks = irrelevant noise. There is no universal "best" chunk size. Solution: Experiment with 256, 512, and 1024 token chunks. Evaluate retrieval quality for your specific documents and queries.
Pitfall 3: No Evaluation Framework
Building RAG without evaluation is like driving with your eyes closed. Solution: Create a test set of 50-100 question-answer pairs. Measure retrieval accuracy (are the right chunks found?) and answer quality (is the final answer correct?).
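For the retrieval half of the evaluation, two simple metrics are hit rate at k (did the right chunk show up in the top k?) and mean reciprocal rank (how high did it show up?). These metric choices are a suggestion, not the only option, and the sketch assumes each test question has exactly one "gold" chunk ID. The sample data is invented.

```python
def hit_rate_at_k(results: list[list[str]], expected: list[str], k: int = 5) -> float:
    """Fraction of questions whose gold chunk appears in the top k retrieved IDs."""
    if not expected:
        return 0.0
    hits = sum(1 for retrieved, gold in zip(results, expected) if gold in retrieved[:k])
    return hits / len(expected)


def mean_reciprocal_rank(results: list[list[str]], expected: list[str]) -> float:
    """Average of 1/rank of the gold chunk; contributes 0 when it was not retrieved."""
    if not expected:
        return 0.0
    total = 0.0
    for retrieved, gold in zip(results, expected):
        if gold in retrieved:
            total += 1.0 / (retrieved.index(gold) + 1)
    return total / len(expected)


# Retrieved IDs per test question, and the gold chunk for each question.
results = [["c1", "c2", "c3"], ["c5", "c4", "c6"], ["c7", "c8", "c9"]]
expected = ["c1", "c4", "c0"]

print(hit_rate_at_k(results, expected, k=3))    # 2 of 3 questions hit
print(mean_reciprocal_rank(results, expected))  # (1 + 0.5 + 0) / 3 = 0.5
```

Rerun these numbers every time you change chunk size, embedding model, or retrieval settings, so you know whether the change helped.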
Pitfall 4: Ignoring Metadata Filtering
Semantic search alone is not enough. If a user asks about "2024 leave policy", you should filter by year BEFORE doing semantic search. Solution: Always store and use metadata (date, category, source, permissions) for pre-filtering.
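Pre-filtering just means narrowing the candidate set by metadata before any similarity scoring happens. A minimal sketch with invented chunks (`prefilter` is a hypothetical helper; most vector DBs expose an equivalent filter option on the query itself):

```python
def prefilter(chunks: list[dict], **required) -> list[dict]:
    """Keep chunks whose metadata matches every required key/value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in required.items())
    ]


chunks = [
    {"id": "leave-2023", "metadata": {"year": 2023, "category": "leave"}},
    {"id": "leave-2024", "metadata": {"year": 2024, "category": "leave"}},
    {"id": "wfh-2024", "metadata": {"year": 2024, "category": "wfh"}},
]

# "2024 leave policy" -> filter first, then run semantic search on what is left.
candidates = prefilter(chunks, year=2024, category="leave")
print([c["id"] for c in candidates])  # ['leave-2024']
```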
Pitfall 5: Not Handling "I Don't Know"
RAG systems that always give an answer (even when the context does not contain the information) lose user trust fast. Solution: Set a retrieval confidence threshold. For example, if the top chunk's similarity is below 0.7 (a starting point only; the right cutoff depends on your embedding model and similarity metric, so calibrate it on your test set), respond with "I could not find this information in our knowledge base. Please contact the team directly."
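The threshold check is a few lines of code placed between retrieval and generation. In this sketch `answer_or_fallback` is a hypothetical name, `generate` is a stand-in for your real LLM call, and 0.7 is the example cutoff from the text, which you should tune for your own embedding model.

```python
from typing import Callable

FALLBACK = (
    "I could not find this information in our knowledge base. "
    "Please contact the team directly."
)


def answer_or_fallback(
    scored_chunks: list[tuple[str, float]],
    generate: Callable[[list[str]], str],
    threshold: float = 0.7,
) -> str:
    """Skip the LLM entirely when even the best chunk scores below the threshold."""
    if not scored_chunks or max(score for _, score in scored_chunks) < threshold:
        return FALLBACK
    return generate([chunk_id for chunk_id, _ in scored_chunks])


def fake_generate(chunk_ids: list[str]) -> str:
    """Stand-in for the real generation step (assumption)."""
    return "answer based on " + ", ".join(chunk_ids)


confident = answer_or_fallback([("c1", 0.82), ("c2", 0.64)], fake_generate)
unsure = answer_or_fallback([("c3", 0.41), ("c4", 0.39)], fake_generate)
```

Returning the fallback before calling the LLM also saves a generation call on questions the knowledge base cannot answer.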
Note: The #1 reason RAG systems fail in production is poor data quality, not bad algorithms. Spend more time cleaning your documents than tweaking your retrieval parameters.
Interview Questions
RAG Architecture Interview Questions
Q1: Explain the RAG architecture. When would you choose it over fine-tuning?
Answer: RAG retrieves relevant documents at query time and passes them as context to the LLM. Choose RAG when: data changes frequently, you need citations/sources, data is large (cannot fit in context), privacy matters (data stays in your infra). Choose fine-tuning when: you need to change the model behavior/style, domain-specific language patterns, or very specialized tasks where retrieval is not practical.
Q2: How would you handle a RAG system where users report incorrect answers?
Answer: Systematic debugging approach: (1) Check retrieval - are the right chunks being found? Log and inspect retrieved chunks. (2) Check chunking - are chunks too small/large or splitting in wrong places? (3) Check embedding quality - try different embedding models. (4) Check prompt - is the system prompt clear about using only provided context? (5) Add evaluation metrics (RAGAS) to monitor retrieval quality over time. (6) Implement user feedback loop to continuously improve.
Q3: Design a RAG system for a bank with 10,000 regulatory documents. What are the key challenges?
Answer: Key challenges: (1) Document parsing - financial PDFs have complex tables, charts, legal formatting. Use specialized parsers like Unstructured or Docling. (2) Access control - not all employees should see all documents. Implement per-user metadata filtering. (3) Regulatory accuracy - zero tolerance for hallucination. Add citations, confidence scores, and human review for critical queries. (4) Versioning - regulations change. Track document versions and ensure retrieval uses latest. (5) Scale - 10K docs means millions of chunks. Choose a scalable vector DB (Pinecone/Qdrant). (6) Audit trail - log every query, retrieved context, and generated answer for compliance.
Note: RAG architecture is among the most frequently asked topics in AI engineer interviews. Be prepared to design a complete RAG system on a whiteboard, including chunking strategy, retrieval method, and evaluation approach.