AI & AutomationFree to read

BM25 & Hybrid Search (Sparse + Dense Retrieval)

Combining Traditional Search with AI Embeddings

Master hybrid search - combining BM25 (keyword-based) with dense vector search (embedding-based) for superior retrieval in RAG pipelines.

The Two Types of Search

Sparse Search (BM25 / TF-IDF):

Keyword-based - matches exact words. Looks at term frequency and document frequency.

✅ Great for exact keyword matches, names, codes, IDs
✅ No ML model needed - fast and simple
✅ Handles rare/technical terms well
❌ Cannot understand synonyms or meaning
❌ "car" won't match "automobile"

Dense Search (Vector/Embedding Search):

Meaning-based - converts text to vectors and finds semantically similar content.

✅ Understands synonyms, paraphrases, meaning
✅ "car" matches "automobile" and "vehicle"
✅ Works across languages
❌ Needs embedding model (compute cost)
❌ Can miss exact keyword matches
❌ Struggles with rare terms, acronyms, codes

How BM25 Works

BM25 = Best Matching 25 (Okapi BM25)

The gold standard for keyword search since the 1990s. Used by Elasticsearch, Lucene, and most search engines.

BM25(q, D) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))

Where:
- IDF(qi) = log((N - n(qi) + 0.5) / (n(qi) + 0.5))
  → Rare words score higher ("Kubernetes" > "the")
- f(qi, D) = term frequency in document
  → More mentions = higher score (with saturation)
- |D|/avgdl = document length normalization
  → Short docs with the term score higher
- k1 (typically 1.2) = term frequency saturation
- b (typically 0.75) = length normalization weight

Hybrid Search - Best of Both Worlds

Why Hybrid?

Neither sparse nor dense search is perfect alone. Research consistently shows that combining both outperforms either one:

BM25 catches: exact matches, rare terms, codes, proper nouns
Dense catches: semantic meaning, paraphrases, conceptual matches
Together: 10-30% better retrieval than either alone

Fusion Methods:

Reciprocal Rank Fusion (RRF): Most popular. Score = Σ 1/(k + rank_i). Combines rankings without needing score normalization. k=60 is standard.
Weighted Sum: final_score = α × dense_score + (1-α) × sparse_score. Requires score normalization. α=0.7 is a common default.
Cross-encoder Re-ranking: Use BM25+dense to get candidates, then re-rank with a cross-encoder for highest quality. Slower but most accurate.

Implementation in Practice

Vector DBs with Built-in Hybrid Search:

Weaviate: Native hybrid search with BM25 + HNSW, configurable alpha
Qdrant: Sparse + dense vectors in same collection, RRF fusion
Pinecone: Sparse-dense vectors via dotproduct
Elasticsearch: BM25 native + kNN dense search + RRF

DIY Hybrid with rank_bm25 + sentence-transformers:

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
import numpy as np

# Documents
docs = ["Python is great for ML", "JavaScript runs in browsers", ...]

# BM25 index
tokenized = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized)

# Dense index
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs)

def hybrid_search(query, k=5, alpha=0.7):
    # BM25 scores
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_norm = bm25_scores / (bm25_scores.max() + 1e-6)
    
    # Dense scores
    q_emb = model.encode([query])
    dense_scores = np.dot(embeddings, q_emb.T).flatten()
    dense_norm = dense_scores / (dense_scores.max() + 1e-6)
    
    # Fusion
    hybrid = alpha * dense_norm + (1-alpha) * bm25_norm
    top_k = np.argsort(hybrid)[::-1][:k]
    return [(docs[i], hybrid[i]) for i in top_k]

Interview Questions

Q: Why use hybrid search instead of just embeddings?
A: Dense search misses exact keyword matches and struggles with rare terms/acronyms. BM25 misses semantic meaning. Hybrid combines both for 10-30% better retrieval.
Q: What is Reciprocal Rank Fusion (RRF)?
A: A score fusion method: score = Σ 1/(k + rank). It combines rankings from multiple retrievers without needing score normalization. k=60 is the standard constant.
Q: When would BM25 alone be sufficient?
A: When queries are keyword-heavy (code search, product IDs, technical terms) and the corpus is domain-specific with consistent terminology. No semantic ambiguity.
Q: How does Weaviate implement hybrid search?
A: Weaviate runs BM25 and HNSW dense search in parallel, then combines results using a configurable alpha parameter (0=pure BM25, 1=pure dense).

Frequently Asked Questions

What is BM25 & Hybrid Search?

Master hybrid search - combining BM25 (keyword-based) with dense vector search (embedding-based) for superior retrieval in RAG pipelines.

How does BM25 & Hybrid Search work?

Sparse Search (BM25 / TF-IDF): Keyword-based - matches exact words. Looks at term frequency and document frequency.

Browse all AI & Automation topics →

Practice this on DevInterviewMaster

Read the full BM25 & Hybrid Search (Sparse + Dense Retrieval) breakdown with interactive demos, quizzes, and Hinglish notes.

Open the interactive topic →

800+ system-design, LLD, coding, and design-pattern topics. Unlock everything with Pro (₹499, one-time) or Ultimate (₹999, one-time) — lifetime access, no subscription.