BM25 & Hybrid Search (Sparse + Dense Retrieval)
Combining Traditional Search with AI Embeddings
Master hybrid search - combining BM25 (keyword-based) with dense vector search (embedding-based) for superior retrieval in RAG pipelines.
The Two Types of Search
Sparse Search (BM25 / TF-IDF):
Keyword-based - matches exact words. Looks at term frequency and document frequency.
- ✅ Great for exact keyword matches, names, codes, IDs
- ✅ No ML model needed - fast and simple
- ✅ Handles rare/technical terms well
- ❌ Cannot understand synonyms or meaning
- ❌ "car" won't match "automobile"
Dense Search (Vector/Embedding Search):
Meaning-based - converts text to vectors and finds semantically similar content.
- ✅ Understands synonyms, paraphrases, meaning
- ✅ "car" matches "automobile" and "vehicle"
- ✅ Works across languages
- ❌ Needs embedding model (compute cost)
- ❌ Can miss exact keyword matches
- ❌ Struggles with rare terms, acronyms, codes
How BM25 Works
BM25 = Best Matching 25 (Okapi BM25)
The gold standard for keyword search since the 1990s. Used by Elasticsearch, Lucene, and most search engines.
BM25(q, D) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
Where:
- IDF(qi) = log((N - n(qi) + 0.5) / (n(qi) + 0.5))
→ Rare words score higher ("Kubernetes" > "the")
- f(qi, D) = term frequency in document
→ More mentions = higher score (with saturation)
- |D|/avgdl = document length normalization
→ Short docs with the term score higher
- k1 (typically 1.2) = term frequency saturation
- b (typically 0.75) = length normalization weight
Hybrid Search - Best of Both Worlds
Why Hybrid?
Neither sparse nor dense search is perfect alone. Research consistently shows that combining both outperforms either one:
- BM25 catches: exact matches, rare terms, codes, proper nouns
- Dense catches: semantic meaning, paraphrases, conceptual matches
- Together: 10-30% better retrieval than either alone
Fusion Methods:
- Reciprocal Rank Fusion (RRF): Most popular. Score = Σ 1/(k + rank_i). Combines rankings without needing score normalization. k=60 is standard.
- Weighted Sum: final_score = α × dense_score + (1-α) × sparse_score. Requires score normalization. α=0.7 is a common default.
- Cross-encoder Re-ranking: Use BM25+dense to get candidates, then re-rank with a cross-encoder for highest quality. Slower but most accurate.
Implementation in Practice
Vector DBs with Built-in Hybrid Search:
- Weaviate: Native hybrid search with BM25 + HNSW, configurable alpha
- Qdrant: Sparse + dense vectors in same collection, RRF fusion
- Pinecone: Sparse-dense vectors via dotproduct
- Elasticsearch: BM25 native + kNN dense search + RRF
DIY Hybrid with rank_bm25 + sentence-transformers:
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np
# Documents
docs = ["Python is great for ML", "JavaScript runs in browsers", ...]
# BM25 index
tokenized = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized)
# Dense index
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)
def hybrid_search(query, k=5, alpha=0.7):
# BM25 scores
bm25_scores = bm25.get_scores(query.lower().split())
bm25_norm = bm25_scores / (bm25_scores.max() + 1e-6)
# Dense scores
q_emb = model.encode([query])
dense_scores = np.dot(embeddings, q_emb.T).flatten()
dense_norm = dense_scores / (dense_scores.max() + 1e-6)
# Fusion
hybrid = alpha * dense_norm + (1-alpha) * bm25_norm
top_k = np.argsort(hybrid)[::-1][:k]
return [(docs[i], hybrid[i]) for i in top_k]
Interview Questions
- Q: Why use hybrid search instead of just embeddings?
A: Dense search misses exact keyword matches and struggles with rare terms/acronyms. BM25 misses semantic meaning. Hybrid combines both for 10-30% better retrieval. - Q: What is Reciprocal Rank Fusion (RRF)?
A: A score fusion method: score = Σ 1/(k + rank). It combines rankings from multiple retrievers without needing score normalization. k=60 is the standard constant. - Q: When would BM25 alone be sufficient?
A: When queries are keyword-heavy (code search, product IDs, technical terms) and the corpus is domain-specific with consistent terminology. No semantic ambiguity. - Q: How does Weaviate implement hybrid search?
A: Weaviate runs BM25 and HNSW dense search in parallel, then combines results using a configurable alpha parameter (0=pure BM25, 1=pure dense).
Frequently Asked Questions
What is BM25 & Hybrid Search?
Master hybrid search - combining BM25 (keyword-based) with dense vector search (embedding-based) for superior retrieval in RAG pipelines.
How does BM25 & Hybrid Search work?
Sparse Search (BM25 / TF-IDF): Keyword-based - matches exact words. Looks at term frequency and document frequency.
Related topics
Practice this on DevInterviewMaster
Read the full BM25 & Hybrid Search (Sparse + Dense Retrieval) breakdown with interactive demos, quizzes, and Hinglish notes.
800+ system-design, LLD, coding, and design-pattern topics. Unlock everything with Pro (₹499, one-time) or Ultimate (₹999, one-time) — lifetime access, no subscription.