Knowledge Graphs

When Vector Search Is Not Enough - Adding Structure to Knowledge

Knowledge Graphs capture entities and their relationships explicitly. Graph RAG combines this structured knowledge with LLMs to answer complex questions that vector search alone handles poorly, such as multi-hop reasoning and relationship queries.

What are Knowledge Graphs?

Connecting the Dots Between Information

The Core Idea:

A Knowledge Graph (KG) is a network of entities (people, places, concepts) connected by relationships (works_at, located_in, caused_by). Unlike flat text or vector embeddings, KGs explicitly capture HOW things relate to each other.

Every triple in a KG follows the pattern: (Subject) --[Relationship]--> (Object). For example: (Sundar Pichai) --[CEO_of]--> (Google).
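
As a minimal sketch of the triple pattern above, a triple can be held as a small named tuple and indexed by subject. The names Triple and build_index are illustrative choices, not part of any library.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relationship: str
    obj: str


def build_index(triples):
    """Map each subject to its outgoing (relationship, object) edges."""
    index = defaultdict(list)
    for t in triples:
        index[t.subject].append((t.relationship, t.obj))
    return index


triples = [
    Triple("Sundar Pichai", "CEO_of", "Google"),
    Triple("TCS", "headquartered_in", "Mumbai"),
]
index = build_index(triples)
print(index["Sundar Pichai"])  # [('CEO_of', 'Google')]
```

A graph database stores the same information, with indexes that make following edges cheap.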

Real-World Analogy - Indian Railways Network:

Think of Indian Railways as a Knowledge Graph:

  • Entities (Nodes): Stations (Mumbai Central, Delhi Junction, Jaipur), Trains (Rajdhani, Shatabdi), Zones (Western, Northern)
  • Relationships (Edges): connects_to, stops_at, operated_by, in_zone
  • Query: "How do I get from Jaipur to Chennai?" - This needs graph traversal (multi-hop), not just text similarity!

Vector search would find documents about Jaipur and Chennai separately. A knowledge graph can trace the actual route: Jaipur -> Delhi -> Chennai via connected stations.
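
The route tracing described above can be sketched as a breadth-first search over connects_to edges. The station links below are a toy graph made up for illustration, not real timetable data, and find_route is a hypothetical helper name.

```python
from collections import deque

# Toy connects_to edges; illustrative only, not real railway data.
CONNECTS_TO = {
    "Jaipur": ["Delhi"],
    "Delhi": ["Jaipur", "Chennai", "Mumbai Central"],
    "Chennai": ["Delhi"],
    "Mumbai Central": ["Delhi"],
}


def find_route(graph, start, goal):
    """Breadth-first search returning a fewest-hop path, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        station = path[-1]
        if station == goal:
            return path
        for nxt in graph.get(station, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


print(find_route(CONNECTS_TO, "Jaipur", "Chennai"))
# ['Jaipur', 'Delhi', 'Chennai']
```

The returned path is also the explanation for the answer, which is where the explainability advantage of graphs comes from.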

Knowledge Graph vs Vector Database:

Feature           | Vector DB             | Knowledge Graph
Data Model        | Flat embeddings       | Nodes + Edges
Relationships     | Implicit (similarity) | Explicit (typed edges)
Multi-hop queries | Poor                  | Excellent
Fuzzy matching    | Excellent             | Poor
Explainability    | Low (black-box)       | High (traversal path)

Where Knowledge Graphs Shine:

  • Multi-hop reasoning: "Who is the CEO of the company that acquired WhatsApp?"
  • Relationship queries: "Which drugs interact with Metformin?"
  • Compliance: "Which suppliers are in countries with trade sanctions?"
  • Recommendations: "Users who liked X also liked Y because of Z"
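
To make the multi-hop case concrete, here is a small sketch that answers the WhatsApp question with two lookups over hand-written facts. The helper names subjects_of and ceo_of_acquirer are assumptions made for this example.

```python
# (subject, relationship, object) facts; a tiny hand-written graph.
FACTS = [
    ("Meta", "acquired", "WhatsApp"),
    ("Mark Zuckerberg", "CEO_of", "Meta"),
    ("Sundar Pichai", "CEO_of", "Google"),
]


def subjects_of(facts, relationship, obj):
    """Return every subject s such that (s, relationship, obj) is a fact."""
    return [s for s, r, o in facts if r == relationship and o == obj]


def ceo_of_acquirer(facts, target):
    """Hop 1: who acquired target? Hop 2: who is CEO of that acquirer?"""
    answers = []
    for company in subjects_of(facts, "acquired", target):
        for person in subjects_of(facts, "CEO_of", company):
            answers.append((person, company))
    return answers


print(ceo_of_acquirer(FACTS, "WhatsApp"))  # [('Mark Zuckerberg', 'Meta')]
```

Each hop is an exact lookup on a typed edge, so the second hop never depends on the two facts appearing in the same document.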

Note: Knowledge Graphs have powered Google Search, Siri, and Alexa for years. Graph RAG builds on the same idea by pairing them with LLMs for complex question answering.

Building Knowledge Graphs from Documents

From Unstructured Text to Structured Knowledge

The KG Construction Pipeline:

Building a knowledge graph from documents involves extracting entities and their relationships from unstructured text. LLMs have made this dramatically easier compared to traditional NLP approaches.

Step-by-Step Process:

  1. Entity Extraction: Identify key entities from text - people, organizations, products, concepts. LLMs can do this with a simple prompt: "Extract all entities from this text with their types."
  2. Relationship Extraction: Identify how entities relate. "Extract relationships between entities as (subject, predicate, object) triples."
  3. Entity Resolution: Merge duplicates - "PM Modi", "Narendra Modi", "Modi ji" are all the same entity. This is crucial and often the hardest step.
  4. Schema Alignment: Map extracted types and relationships to a predefined ontology for consistency.
  5. Storage: Store in a graph database like Neo4j, Amazon Neptune, or ArangoDB.
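
The extraction steps above might look roughly like this in code. The model call is replaced by fake_llm, a hypothetical stand-in that returns a canned JSON response, and the prompt wording and JSON shape are assumptions. A real pipeline would call an actual LLM API here.

```python
import json

EXTRACTION_PROMPT = (
    "Extract relationships between entities as (subject, predicate, object) "
    "triples. Respond with a JSON list of objects with keys "
    '"subject", "predicate", "object".\n\nText:\n{text}'
)


def fake_llm(prompt):
    """Hypothetical stand-in for a real LLM call; returns a canned response."""
    return json.dumps([
        {"subject": "TCS", "predicate": "headquartered_in", "object": "Mumbai"},
        {"subject": "K Krithivasan", "predicate": "CEO_of", "object": "TCS"},
        {"subject": "TCS", "predicate": "", "object": "Mumbai"},  # malformed
    ])


def extract_triples(text, llm=fake_llm):
    """Prompt the model, parse its JSON, and drop incomplete triples."""
    raw = llm(EXTRACTION_PROMPT.format(text=text))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unparseable output: treat as no triples
    triples = []
    for item in items:
        if not isinstance(item, dict):
            continue
        triple = tuple(
            str(item.get(key, "")).strip()
            for key in ("subject", "predicate", "object")
        )
        if all(triple):  # keep only triples with all three parts present
            triples.append(triple)
    return triples


print(extract_triples("TCS, headquartered in Mumbai, ..."))
```

Validating the model output before it reaches the graph is a cheap first defence against malformed triples. Entity resolution and schema alignment would run on the surviving triples.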

LLM-Powered Extraction Example:

Input text: "TCS, headquartered in Mumbai, reported Q3 revenue of Rs 60,583 crore. CEO K Krithivasan announced plans to hire 40,000 freshers."

Extracted triples:

  • (TCS) --[headquartered_in]--> (Mumbai)
  • (TCS) --[reported_revenue]--> (Rs 60,583 crore, Q3)
  • (K Krithivasan) --[CEO_of]--> (TCS)
  • (TCS) --[plans_to_hire]--> (40,000 freshers)

Tools for KG Construction:

  • LlamaIndex KG Index: Automated KG construction from documents
  • Neo4j + LLM: Extract triples, store in Neo4j, query with Cypher
  • Microsoft GraphRAG: Builds hierarchical community summaries from KGs
  • spaCy + Custom NER: For domain-specific entity extraction

Note: Entity resolution (deduplication) is the hidden challenge in KG construction. TCS, Tata Consultancy Services, and TATA CS are all the same entity - your pipeline must handle this.
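
One simple way to handle this, sketched under the assumption that a curated alias table exists, is to normalize each surface form and look it up before writing to the graph. ALIASES, normalize, and resolve are illustrative names. Production systems often add fuzzy or embedding-based matching on top.

```python
import re

# Hand-maintained alias table (an assumption for this sketch).
ALIASES = {
    "tcs": "Tata Consultancy Services",
    "tata consultancy services": "Tata Consultancy Services",
    "tata cs": "Tata Consultancy Services",
}


def normalize(name):
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def resolve(name, aliases=ALIASES):
    """Map a surface form to its canonical entity; unknown names pass through."""
    return aliases.get(normalize(name), name)


def resolve_triples(triples):
    """Canonicalize subjects and objects, then drop exact duplicates in order."""
    seen, merged = set(), []
    for s, r, o in triples:
        t = (resolve(s), r, resolve(o))
        if t not in seen:
            seen.add(t)
            merged.append(t)
    return merged


print(resolve_triples([
    ("TCS", "headquartered_in", "Mumbai"),
    ("TATA CS", "headquartered_in", "Mumbai"),
]))
# [('Tata Consultancy Services', 'headquartered_in', 'Mumbai')]
```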

Graph RAG - Combining Graphs with LLMs

The Best of Both Worlds - Structured + Unstructured

What is Graph RAG?

Graph RAG combines knowledge graph traversal with LLM generation. Instead of (or in addition to) vector search, you query a knowledge graph to find relevant entities and relationships, then pass this structured context to an LLM for answer generation.

Three Patterns of Graph RAG:

  • Pattern 1 - Graph-Only Retrieval: Query the knowledge graph, pass triples to LLM. Best for relationship questions. Query: "Who founded Infosys?" -> Graph returns (Narayana Murthy) --[founded]--> (Infosys) -> LLM generates natural answer.
  • Pattern 2 - Graph + Vector Hybrid: Use vector search for fuzzy matching AND graph traversal for relationships. Combine both contexts for the LLM. Best for questions needing both semantic understanding and structured facts.
  • Pattern 3 - Microsoft GraphRAG: Pre-compute community summaries at different levels of the graph hierarchy. At query time, retrieve relevant community summaries. Best for broad "sensemaking" queries over large document collections.
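
Pattern 1 can be sketched as two small functions: one that picks triples mentioning an entity from the question, and one that serializes them into the prompt. The substring-based entity matching is a deliberately naive assumption, and the function names and prompt wording are made up for this example.

```python
TRIPLES = [
    ("Narayana Murthy", "founded", "Infosys"),
    ("Infosys", "headquartered_in", "Bengaluru"),
    ("TCS", "headquartered_in", "Mumbai"),
]


def retrieve_triples(triples, question):
    """Naive entity linking: keep triples whose subject or object is named in the question."""
    q = question.lower()
    return [t for t in triples if t[0].lower() in q or t[2].lower() in q]


def build_prompt(question, triples):
    """Serialize retrieved triples as context lines ahead of the question."""
    facts = "\n".join(f"({s}) --[{r}]--> ({o})" for s, r, o in triples)
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {question}"


question = "Who founded Infosys?"
print(build_prompt(question, retrieve_triples(TRIPLES, question)))
```

For Pattern 2, the same prompt would also carry the text chunks returned by vector search.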

Microsoft GraphRAG Deep Dive:

Microsoft GraphRAG (2024) introduced a novel approach:

  1. Index Time: Build KG from docs -> Detect communities using Leiden algorithm -> Generate summaries for each community at multiple hierarchy levels
  2. Query Time (Local): Find relevant entities -> Retrieve their community context -> Generate answer
  3. Query Time (Global): Use top-level community summaries to answer broad questions about the entire corpus
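
The index-time and query-time flow can be illustrated with a heavily simplified sketch. It is not Microsoft's implementation: connected components stand in for the Leiden algorithm, there is only one hierarchy level, and summarize is a hypothetical placeholder for an LLM-written summary.

```python
from collections import defaultdict


def communities(edges):
    """Group nodes into connected components (a crude stand-in for Leiden)."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, groups = set(), []
    for node in sorted(neighbors):
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            cur = stack.pop()
            if cur in group:
                continue
            group.add(cur)
            stack.extend(neighbors[cur] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


def summarize(members):
    """Hypothetical placeholder for an LLM-written community summary."""
    return "Community of " + ", ".join(members)


edges = [("TCS", "Mumbai"), ("TCS", "K Krithivasan"), ("Infosys", "Narayana Murthy")]

# Index time: detect communities, then summarize each one.
groups = communities(edges)
summaries = [summarize(g) for g in groups]

# Local query: only the community containing the entity of interest.
local_context = [s for g, s in zip(groups, summaries) if "TCS" in g]

# Global query: every community summary is passed to the LLM.
global_context = "\n".join(summaries)
print(local_context)
print(global_context)
```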

Global search is the standout feature - it can answer questions like "What are the main themes across all 10,000 research papers?" that standard RAG handles poorly, because top-k retrieval only ever sees a handful of chunks rather than the whole corpus.

When to Use Graph RAG:

  • Multi-hop questions: "Which mutual fund holds stocks of companies whose CEO attended IIT?"
  • Global summarization: "What are the main compliance issues across all audit reports?"
  • Relationship exploration: "How are these two companies connected?"
  • Recommendation systems: "Similar companies in the same sector with better margins"

Note: Graph RAG is not a replacement for vector RAG - it is complementary. Use vector search for semantic similarity and graph traversal for relationship-based reasoning.

Challenges and Practical Considerations

The Hard Parts Nobody Talks About

Challenge 1: KG Quality

Your Graph RAG is only as good as your knowledge graph. LLM-extracted triples often have errors - wrong relationships, hallucinated entities, missing connections. You need validation loops, human review for critical domains, and continuous refinement.

Challenge 2: Index Cost

Microsoft GraphRAG is expensive to build. For a corpus of 10,000 documents, the indexing can cost hundreds of dollars in LLM API calls (entity extraction, community summarization). Budget this carefully.
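
A back-of-the-envelope estimate helps with budgeting. The sketch below multiplies document count, tokens per document, number of LLM passes, and price per million tokens. Every input value is a placeholder assumption, not a measured figure or a real price list, so substitute your own numbers.

```python
def estimate_index_cost(num_docs, tokens_per_doc, llm_passes, usd_per_million_tokens):
    """Rough LLM spend, assuming every document is read llm_passes times."""
    total_tokens = num_docs * tokens_per_doc * llm_passes
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Placeholder inputs: 2,000 tokens per document, 2 passes
# (extraction + summarization), 5 USD per million tokens.
cost = estimate_index_cost(10_000, 2_000, 2, 5.0)
print(f"Estimated indexing cost: ${cost:,.2f}")  # Estimated indexing cost: $200.00
```

The estimate ignores output tokens and retries, so treat it as a lower bound.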

Challenge 3: Schema Design

A good ontology (entity types and relationship types) is crucial. Too broad and your KG is meaningless. Too narrow and you miss important connections. Domain expertise is essential for schema design.

Challenge 4: Query Routing

Not all questions need graph traversal. You need a router that decides: Is this a simple factual question (use vector RAG)? A relationship question (use graph)? A broad summary question (use GraphRAG global search)? Building this router is itself an engineering challenge.
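
A first version of such a router can be a handful of keyword rules, as in this sketch. The hint phrases and route names are assumptions chosen for illustration. Real routers often use an LLM or a trained classifier instead.

```python
# Phrases that suggest a corpus-wide summary question.
GLOBAL_HINTS = ("main themes", "overall", "across all", "summarize", "summary")
# Phrases that suggest a relationship or multi-hop question.
RELATION_HINTS = ("connected", "related to", "interact with", "who manages", "acquired", "ceo of")


def route(question):
    """Return 'graphrag_global', 'graph', or 'vector' for a user question."""
    q = question.lower()
    if any(hint in q for hint in GLOBAL_HINTS):
        return "graphrag_global"
    if any(hint in q for hint in RELATION_HINTS):
        return "graph"
    return "vector"  # default: plain semantic retrieval


for q in (
    "What are the main compliance issues across all audit reports?",
    "How are these two companies connected?",
    "What is a knowledge graph?",
):
    print(route(q), "<-", q)
```

Keyword rules misroute paraphrased questions, so log the router's decisions and review them before trusting it.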

Practical Tip:

Start simple. Build a basic KG with just the most important entity types (3-5) and relationships (5-10). Validate quality. Then expand. Do not try to model everything from day one.

Note: KG construction is an ongoing process, not a one-time task. Plan for continuous extraction, validation, and refinement as your document corpus grows.

Graph Databases and Query Languages

The Storage Layer for Knowledge Graphs

Popular Graph Databases:

Database       | Query Language   | Best For
Neo4j          | Cypher           | General purpose, largest ecosystem
Amazon Neptune | Gremlin / SPARQL | AWS-native, managed service
ArangoDB       | AQL              | Multi-model (graph + document)
FalkorDB       | Cypher           | In-memory, ultra-fast for RAG

Natural Language to Graph Query:

A key capability in Graph RAG is converting user questions to graph queries. LLMs can generate Cypher (Neo4j) or Gremlin queries from natural language:

User: "Which companies in Bangalore have more than 10,000 employees?"

Generated Cypher: MATCH (c:Company)-[:LOCATED_IN]->(city) WHERE city.name = "Bangalore" AND c.employees > 10000 RETURN c.name
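
Because the query text comes from a model, it is worth checking before execution. The sketch below rewrites the query above to use Cypher parameters ($city, $min_employees) and adds a keyword-based read-only check. The check is a conservative assumption of this example, not a Neo4j feature: it can reject harmless queries that mention these words in string literals, and it does not inspect procedure calls.

```python
import re

# Clauses that modify data; generated queries containing them are rejected.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV)\b", re.IGNORECASE
)

# Same query as above, with user-supplied values passed as parameters.
QUERY = (
    "MATCH (c:Company)-[:LOCATED_IN]->(city) "
    "WHERE city.name = $city AND c.employees > $min_employees "
    "RETURN c.name"
)
PARAMS = {"city": "Bangalore", "min_employees": 10000}


def is_read_only(cypher):
    """Return True when no write clause keyword appears in the query text."""
    return WRITE_CLAUSES.search(cypher) is None


print(is_read_only(QUERY))                                 # True
print(is_read_only("MATCH (c:Company) DETACH DELETE c"))   # False
```

Passing values as parameters rather than splicing them into the query string also avoids injection through user input.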

Neo4j + Vector Search:

Neo4j now supports native vector search alongside graph queries. This means you can:

  • Store embeddings as node properties
  • Run vector similarity search within the graph
  • Combine with graph traversal in a single query
  • No need for a separate vector database
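
The idea of searching by similarity and then traversing can be shown without any database. This is a pure-Python illustration, not Neo4j's API: the 2-dimensional embeddings are toy values, and search_then_expand is a hypothetical helper.

```python
import math

# Node properties include a toy 2-d embedding; real ones come from a model.
NODES = {
    "TCS": {"embedding": [0.9, 0.1]},
    "Infosys": {"embedding": [0.8, 0.2]},
    "Mumbai": {"embedding": [0.1, 0.9]},
}
EDGES = [
    ("TCS", "headquartered_in", "Mumbai"),
    ("Infosys", "headquartered_in", "Bengaluru"),
]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search_then_expand(query_vec, k=1):
    """Vector step: top-k nodes by similarity. Graph step: their outgoing edges."""
    ranked = sorted(
        NODES,
        key=lambda name: cosine(query_vec, NODES[name]["embedding"]),
        reverse=True,
    )
    hits = ranked[:k]
    return hits, [edge for edge in EDGES if edge[0] in hits]


print(search_then_expand([1.0, 0.0]))
# (['TCS'], [('TCS', 'headquartered_in', 'Mumbai')])
```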

Note: Neo4j is a popular choice for Graph RAG because of its native vector search support and its LLM ecosystem integrations (LangChain and LlamaIndex both have Neo4j connectors).

Interview Questions

Q: When would you choose Graph RAG over standard vector RAG?

Graph RAG excels at: (1) Multi-hop reasoning - questions requiring traversal across multiple entities, like "Who manages the team that built the product with the most revenue?" (2) Relationship queries - explicitly asking about connections between entities. (3) Global summarization - understanding themes across large document collections using community summaries. Standard vector RAG is better for simple semantic similarity and single-hop factual questions.

Q: How does Microsoft GraphRAG handle global queries over large document collections?

Microsoft GraphRAG builds a hierarchical structure at index time: (1) Extract entities and relationships into a knowledge graph. (2) Detect communities using the Leiden algorithm. (3) Generate summaries for communities at multiple hierarchy levels. For global queries like "What are the main themes?", it uses top-level community summaries, enabling answers that span the entire corpus - something standard RAG cannot do since it only retrieves individual chunks.

Q: What is the biggest challenge in building a knowledge graph from documents?

Entity resolution (deduplication) is the hardest challenge. The same entity can appear in many forms - "TCS", "Tata Consultancy Services", "TATA CS". Without proper resolution, your graph has duplicate nodes with split information, leading to incomplete answers. Other challenges include extraction accuracy (LLMs may hallucinate triples), schema design (choosing the right entity types and relationships), and ongoing maintenance as documents change.

