RAG: Retrieval-Augmented Generation

Open-book exam vs guessing from memory

Imagine taking an exam. If you answer purely from memory, you might misremember and make things up (with LLMs, this is called 'hallucinating'). But if it's an open-book exam, you look up the real page first, then answer. RAG (Retrieval-Augmented Generation) turns your agent into an open-book student: before answering, it retrieves the relevant pages from your knowledge base and answers from them.

Key points

RAG = retrieve real documents, then generate the answer from them.
It reduces hallucinations and lets agents use private, up-to-date info.
Flow: chunk → embed → store → search → stuff into prompt → answer.

The one-line definition

RAG means: before the LLM answers, your code fetches the most relevant chunks of text from a knowledge base and adds them to the prompt, so the model answers from real, retrieved facts instead of relying on its (possibly outdated or wrong) memory.

Note: Retrieve first, then generate. Open-book, not from-memory.

What's an 'embedding' and a 'vector database'?

To find relevant text by meaning (not just exact words), we turn each chunk of text into a list of numbers called an embedding. Texts with similar meaning get similar numbers. A vector database stores these number-lists and can quickly find the ones closest to your question's numbers — that's similarity search. So 'How do I reset my password?' can match a doc titled 'Account recovery steps' even with no shared words.
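
To make "closest numbers" concrete, here is a minimal sketch of one common way to compare two embeddings: cosine similarity. The three-number vectors are made up by hand for illustration (a real embedding model produces much longer lists), and the 'Cafeteria lunch menu' document is a hypothetical unrelated example.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written 3-number "embeddings" (assumed values, not from a real model).
question = [0.9, 0.1, 0.2]          # 'How do I reset my password?'
account_recovery = [0.8, 0.2, 0.1]  # 'Account recovery steps'
lunch_menu = [0.1, 0.9, 0.3]        # 'Cafeteria lunch menu' (unrelated)

sim_recovery = cosine_similarity(question, account_recovery)
sim_lunch = cosine_similarity(question, lunch_menu)

# The related document scores closer to 1.0 than the unrelated one.
print(sim_recovery > sim_lunch)  # True
```

A score near 1.0 means the two vectors point in almost the same direction, which is how the question and the recovery doc can match without sharing any words.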

Phase 1 — Indexing (done once, ahead of time)

📄 Your documents (PDFs, wiki, notes)
  │
  ▼  1. CHUNK into small pieces
┌────┐  ┌────┐  ┌────┐  ┌────┐
│ c1 │  │ c2 │  │ c3 │  │ c4 │  ...
└─┬──┘  └─┬──┘  └─┬──┘  └─┬──┘
  │       │       │       │      2. EMBED each chunk (text -> numbers)
  ▼       ▼       ▼       ▼
[0.1..] [0.8..] [0.2..] [0.5..]
  │       │       │       │      3. STORE the vectors
  └───────┴───────┴───────┘
              │
              ▼
     ┌─────────────────┐
     │ VECTOR DATABASE │  (searchable by meaning)
     └─────────────────┘
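
The three indexing steps can be sketched in a few lines. This is an illustration under loud assumptions: `chunk_document`, `toy_embed`, and `build_index` are hypothetical names, the "vector database" is just a Python list, and `toy_embed` only counts hashed words, so it matches shared words rather than meaning. A real system would call an embedding model in its place.

```python
import hashlib
from dataclasses import dataclass

DIM = 16  # toy vector length; real embedding models use far more numbers


@dataclass
class Chunk:
    text: str
    vector: list


def chunk_document(document):
    """1. CHUNK: split on blank lines, one chunk per paragraph."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def toy_embed(text):
    """2. EMBED (stand-in): hash each word into one of DIM buckets and count.

    md5 is used instead of hash() so the result is the same on every run.
    """
    vector = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vector[bucket] += 1.0
    return vector


def build_index(documents):
    """3. STORE: keep every chunk next to its vector."""
    index = []
    for document in documents:
        for text in chunk_document(document):
            index.append(Chunk(text=text, vector=toy_embed(text)))
    return index


docs = ["Account recovery steps.\n\nOpen Settings and choose Reset password."]
index = build_index(docs)
print(len(index), len(index[0].vector))  # 2 16
```

Indexing runs ahead of time, so the cost of embedding every chunk is paid once rather than on every question.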

Phase 2 — Querying (every time the user asks)

❓ User question: 'How do I reset my password?'
        │
        ▼  1. EMBED the question (same way as chunks)
    [0.79..]
        │
        ▼  2. SIMILARITY SEARCH in the vector DB
┌─────────────────┐   finds the closest chunks
│ VECTOR DATABASE │ ──────────────┐
└─────────────────┘               ▼
                        top matches: c2, c7, c3
                                  │
        ┌─────────────────────────┘
        ▼  3. STUFF chunks + question into the prompt
┌───────────────────────────────────────┐
│ 'Using ONLY these docs: [c2][c7][c3], │
│ answer: How do I reset my password?'  │
└──────────────────┬────────────────────┘
                   ▼  4. GENERATE
              ┌─────────┐
              │ 🧠 LLM  │ ──► ✅ grounded answer
              └─────────┘     (with real steps)
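
Step 2, the similarity search, might look like the sketch below. `InMemoryVectorDB` is a hypothetical toy class, not a real library: it scans every stored chunk and sorts by cosine similarity, whereas real vector databases use indexes to avoid a full scan. The vectors are hand-written assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    vector: list


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorDB:
    """Toy store (hypothetical): a list scanned in full on every search."""

    def __init__(self):
        self.chunks = []

    def add(self, text, vector):
        self.chunks.append(Chunk(text, vector))

    def search(self, q_vector, top_k=3):
        # Rank every chunk by similarity to the question, best first.
        ranked = sorted(
            self.chunks,
            key=lambda c: cosine_similarity(q_vector, c.vector),
            reverse=True,
        )
        return ranked[:top_k]


db = InMemoryVectorDB()
db.add("Account recovery steps", [0.8, 0.2, 0.1])
db.add("Cafeteria lunch menu", [0.1, 0.9, 0.3])
db.add("Changing your email address", [0.6, 0.3, 0.4])

top = db.search([0.9, 0.1, 0.2], top_k=2)
print([c.text for c in top])
# ['Account recovery steps', 'Changing your email address']
```

The unrelated lunch menu never reaches the prompt, because only the `top_k` closest chunks are returned.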

A tiny code example (read it like English)

The whole RAG query step is just: embed the question, search the store, paste the results into the prompt, then ask the LLM. The key instruction is 'answer ONLY from these documents'.

# embed() and llm() are placeholders for your embedding model and LLM client.
def answer_with_rag(question, vector_db):
    # 1. Turn the question into numbers (same model as the chunks)
    q_vector = embed(question)

    # 2. Find the most similar chunks (open the right 'pages')
    top_chunks = vector_db.search(q_vector, top_k=3)

    # 3. Build a grounded prompt from the retrieved text
    context = "\n\n".join(c.text for c in top_chunks)
    prompt = (
        "Use ONLY the context below to answer. "
        "If the answer isn't there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 4. Generate the answer from real facts
    return llm(prompt)
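
To see the prompt that this function builds, you can run it against stand-ins. In the sketch below, `FakeVectorDB`, `embed`, and `llm` are all hypothetical stubs: the fake database ignores the vector and returns fixed chunks, and the fake LLM simply echoes the prompt so the grounding is visible.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str


class FakeVectorDB:
    """Stub store: ignores the query vector and returns fixed chunks."""

    def __init__(self, chunks):
        self.chunks = chunks

    def search(self, q_vector, top_k=3):
        return self.chunks[:top_k]


def embed(text):
    """Stub embedding; a real system would call an embedding model here."""
    return [float(len(text))]


def llm(prompt):
    """Stub LLM: returns the prompt unchanged so we can inspect it."""
    return prompt


def answer_with_rag(question, vector_db):
    q_vector = embed(question)
    top_chunks = vector_db.search(q_vector, top_k=3)
    context = "\n\n".join(c.text for c in top_chunks)
    prompt = (
        "Use ONLY the context below to answer. "
        "If the answer isn't there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)


db = FakeVectorDB([Chunk("Open Settings, then choose Reset password.")])
result = answer_with_rag("How do I reset my password?", db)
print(result)
```

Printing the result shows the instruction, the retrieved context, and the question in that order, which is exactly what a real LLM would receive.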

When should you reach for RAG?

Scenario	Recommendation	Why
Answering from your company's private docs	✅ Use RAG	The model was never trained on your internal data.
Facts that change often (prices, policies, news)	✅ Use RAG	You update the knowledge base instead of retraining the model.
General knowledge the model already knows well	❌ Not needed	A plain LLM call is simpler and cheaper.
Reducing made-up answers (hallucinations)	✅ Use RAG	Grounding the answer in retrieved text makes made-up details less likely.

RAG mistakes beginners make

Mistake	Consequence	Fix
Chunks too big or too small.	Too big = irrelevant text dilutes the answer; too small = context gets cut off.	Use sensible chunk sizes (e.g. a few paragraphs) and small overlaps between chunks.
Embedding the query with a different model than the documents.	The numbers aren't comparable, so similarity search returns junk.	Always use the SAME embedding model for both indexing and querying.
Not telling the LLM to answer only from the retrieved context.	The model may ignore the docs and hallucinate anyway.	Add a clear instruction: 'answer ONLY from the context; say you don't know otherwise'.
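
The chunk-size fix from the first row can be sketched as a sliding window over words. `chunk_with_overlap` is a hypothetical helper, and the default sizes are arbitrary assumptions to tune for your own documents; many pipelines split by paragraphs or tokens instead of words.

```python
def chunk_with_overlap(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to `chunk_size` words.

    Neighbouring chunks share `overlap` words, so a sentence cut at a
    chunk boundary still appears whole in one of the two chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(12))
chunks = chunk_with_overlap(text, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
# w0 w1 w2 w3 w4
# w3 w4 w5 w6 w7
# w6 w7 w8 w9 w10
# w9 w10 w11
```

Each chunk starts with the last two words of the previous one, which is the "small overlap" that keeps context from being cut off.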

Remember these lines

RAG = retrieve relevant chunks, then generate from them. Open-book, not guessing.
Index: chunk → embed → store. Query: embed → search → stuff → generate.
Same embedding model for docs and queries, and always ground the prompt in retrieved text.

Key takeaways

RAG retrieves real documents before answering, reducing hallucinations and stale answers.
Indexing: chunk documents, embed each chunk into numbers, store in a vector database.
Querying: embed the question, similarity-search for top chunks, add them to the prompt, then generate.
Use the same embedding model for both phases and instruct the LLM to answer only from retrieved text.
