
Embedding Selection

Pick the Right Embedding Model - Your RAG is Only as Good as Your Embeddings

Navigate the confusing landscape of 100+ embedding models. Learn the MTEB benchmark, understand the tradeoffs between open-source and API models, and build a framework for choosing the perfect embedding model for your use case.

Why Does Embedding Model Selection Matter?

The Foundation That Determines Everything Above It

The Building Foundation Analogy

Think of your AI system like a building. The embedding model is the foundation. If the foundation is weak, then no matter how fancy the floors above (vector DB, reranker, LLM), the whole building will be unstable. A poor embedding model means your retrieval will miss relevant documents, your RAG will hallucinate because it gets the wrong context, and your users will lose trust. Choosing the right embedding model is one of the most impactful decisions in building a retrieval-based AI system.

The Selection Dilemma

100+ models available - OpenAI, Cohere, Voyage, Jina, Hugging Face open-source, Google, and more
Multiple dimensions to compare - quality, speed, cost, model size, context length, language support
No one-size-fits-all - The best model for English legal text is different from the best for Hindi customer support
Benchmarks can mislead - MTEB leaderboard scores may not reflect YOUR specific data distribution

What You Are Actually Comparing

Retrieval Quality - Does the model find the right documents for a given query?
Embedding Dimensions - 384 vs 768 vs 1024 vs 3072 - more dimensions can capture more nuance but need more storage and make search slower
Max Token Length - 512 vs 2048 vs 8192 - how much text can one embedding represent?
Inference Speed - How fast can you embed new text? Critical for real-time applications
Cost - Self-hosted (GPU cost) vs API (per-token pricing) vs serverless
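
To make the storage side of the dimension tradeoff concrete, here is a rough back-of-the-envelope sketch. It assumes each dimension is stored as a 4-byte float32 and ignores index overhead and metadata, so treat the result as a lower bound. The corpus size of 10 million chunks is a made-up number for illustration.

```python
# Rough index-size estimate: one float32 (4 bytes) per dimension per vector.
# Ignores vector-DB index overhead and metadata, so this is a lower bound.

BYTES_PER_FLOAT32 = 4


def index_size_gb(num_vectors: int, dims: int) -> float:
    """Raw storage for num_vectors embeddings of the given size, in GB."""
    return num_vectors * dims * BYTES_PER_FLOAT32 / 1e9


if __name__ == "__main__":
    num_chunks = 10_000_000  # hypothetical corpus size
    for dims in (384, 768, 1024, 3072):
        print(f"{dims:>5} dims: {index_size_gb(num_chunks, dims):7.2f} GB")
```

Going from 384 to 3072 dimensions multiplies raw storage by 8, which is why dimension count belongs in the comparison alongside quality.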

Note: Your RAG system is only as good as your embeddings. Spending time on model selection pays dividends across every query your system handles.

The MTEB Benchmark - Understanding the Leaderboard

How Models Are Officially Compared

What is MTEB?

MTEB (Massive Text Embedding Benchmark) is the JEE rank list of embedding models. It evaluates models across 8 task types and 58+ datasets. When someone says "this model is state-of-the-art," they usually mean it tops the MTEB leaderboard.

Retrieval: Finding relevant documents for a query (most important for RAG)
Semantic Textual Similarity (STS): How well the model captures meaning similarity
Classification: Using embeddings for text classification tasks
Clustering: Grouping similar texts together
Pair Classification: Determining if two texts are paraphrases, entailments, etc.
Reranking: Re-ordering search results by relevance
Summarization: Evaluating summary quality via embeddings
Bitext Mining: Finding translation pairs across languages

Why MTEB Scores Can Be Misleading

Average scores hide weaknesses: A model might score 70 average but 90 on classification and 50 on retrieval. If you need retrieval, that 70 is meaningless.
Benchmark overfitting: Some models are optimized specifically for MTEB datasets and may not generalize to your data.
Missing your domain: MTEB datasets skew toward English web, Wikipedia, and academic text. If your data is Hinglish customer support tickets, MTEB scores may not predict real performance.
Size not considered: A 2GB model scoring 72 might be better for you than a 14GB model scoring 75, depending on your infrastructure.
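
The "average hides weaknesses" point can be sketched in a few lines. The per-task scores below are invented for illustration and are not real MTEB results; model_a mirrors the 70-average, 90-classification, 50-retrieval example above.

```python
# Per-task scores are made up for illustration; they are NOT real MTEB results.
SCORES = {
    "model_a": {"retrieval": 50.0, "classification": 90.0, "sts": 70.0},
    "model_b": {"retrieval": 64.0, "classification": 70.0, "sts": 67.0},
}


def average(task_scores: dict) -> float:
    """Plain mean over all task types, like a leaderboard average column."""
    return sum(task_scores.values()) / len(task_scores)


def rank_by(scores: dict, key_fn) -> list:
    """Model names sorted best-first by the given scoring function."""
    return sorted(scores, key=lambda name: key_fn(scores[name]), reverse=True)


by_average = rank_by(SCORES, average)
by_retrieval = rank_by(SCORES, lambda s: s["retrieval"])

if __name__ == "__main__":
    print("Ranked by average:  ", by_average)
    print("Ranked by retrieval:", by_retrieval)
```

The two rankings disagree, so for a RAG system it makes sense to sort the leaderboard by the retrieval column rather than the overall average.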

Note: MTEB is a great starting point but not the final word. Always evaluate models on YOUR actual data before making a decision.

Open-Source vs API Models - The Big Decision

Self-Hosted Freedom vs Managed Convenience

API-Based Models (OpenAI, Cohere, Voyage, Google)

OpenAI text-embedding-3-small: 1536 dims, cheap, good quality. Most popular API choice.
OpenAI text-embedding-3-large: 3072 dims, best OpenAI quality. Supports dimension reduction through the API's dimensions parameter.
Cohere embed-v3: Excellent multilingual, built-in search vs classification modes.
Voyage AI voyage-3: Strong benchmark results, specifically optimized for retrieval and RAG.
Google text-embedding-004: 768 dims, competitive quality, good if already in GCP.

Pros: Zero infra, instant setup, auto-scaling. Cons: Data leaves your system, ongoing cost, vendor lock-in.

Open-Source Models (Hugging Face)

BGE-large-en-v1.5: 1024 dims, top open-source retrieval quality.
E5-mistral-7b-instruct: 4096 dims, LLM-based embeddings, very high quality but a 7B-parameter model that needs a large GPU.
GTE-large: 1024 dims, strong all-round performance from Alibaba.
Nomic-embed-text-v1.5: 768 dims, 8192 token context, open weights with Matryoshka support.
all-MiniLM-L6-v2: 384 dims, tiny and fast, great for prototyping.

Pros: Full control, no per-query cost, data stays private. Cons: GPU needed, infra management, scaling yourself.

Decision Framework

Startup/MVP: Use OpenAI embedding API. Fast setup, good enough quality.
Cost-Sensitive at Scale: Self-host open-source model. No per-query charges.
Data Privacy Required: Self-host. Data never leaves your infrastructure.
Multilingual/Hindi: Cohere embed-v3 or multilingual open-source models.
Maximum Quality: Voyage AI or E5-mistral-7b-instruct.
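
For the "Cost-Sensitive at Scale" case, a quick break-even estimate helps decide when self-hosting starts to pay off. This is a minimal sketch under stated assumptions: the API price is per million tokens, the GPU cost is a flat monthly figure, and both example numbers ($0.02 per million tokens, $500 per month) are placeholders. It ignores engineering time and redundancy, which usually push the real break-even point higher.

```python
def monthly_api_cost(tokens_per_month: float, price_per_million: float) -> float:
    """API bill for one month at a flat per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million


def break_even_tokens(gpu_monthly_cost: float, price_per_million: float) -> float:
    """Monthly token volume at which the API bill equals the GPU bill."""
    return gpu_monthly_cost / price_per_million * 1_000_000


if __name__ == "__main__":
    api_price = 0.02   # USD per 1M tokens (placeholder)
    gpu_cost = 500.0   # USD per month (hypothetical)
    tokens = break_even_tokens(gpu_cost, api_price)
    print(f"Break-even at about {tokens / 1e9:.1f} billion tokens per month")
```

Below that volume the API is cheaper on raw cost alone; above it, self-hosting starts to win, provided the team can absorb the operational work.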

Note: For most teams starting out, OpenAI embeddings are the fastest path. Switch to self-hosted open-source when scale, cost, or privacy requirements demand it.

Practical Model Selection Framework

A Step-by-Step Process for Choosing Your Embedding Model

Step 1: Define Your Requirements

What language(s) do you need? English only? Hindi? Hinglish?
How long are your documents? Short queries vs long legal contracts?
What is your latency requirement? Real-time chat vs batch processing?
Can data leave your infrastructure? (Privacy/compliance)
What is your budget? (One-time GPU vs ongoing API costs)

Step 2: Shortlist 3-5 Models

Based on requirements, narrow down from 100+ to 3-5 candidates. Example for an Indian e-commerce chatbot:

OpenAI text-embedding-3-small (API baseline)
Cohere embed-v3 (multilingual strength)
BGE-large-en-v1.5 (open-source quality)
paraphrase-multilingual-MiniLM-L12-v2 (multilingual open-source)
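
Steps 1 and 2 can be sketched as a simple filter over a hand-maintained catalog. The Candidate and Requirements classes are hypothetical helpers, and the metadata flags (especially which models count as multilingual) are assumptions for illustration that should be checked against each model card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    dims: int
    self_hosted: bool
    multilingual: bool  # assumption; verify on the model card


@dataclass(frozen=True)
class Requirements:
    needs_multilingual: bool
    data_must_stay_private: bool
    max_dims: int


# Illustrative catalog; flags are assumptions, not vendor-confirmed facts.
CATALOG = [
    Candidate("text-embedding-3-small", 1536, self_hosted=False, multilingual=True),
    Candidate("embed-v3", 1024, self_hosted=False, multilingual=True),
    Candidate("BGE-large-en-v1.5", 1024, self_hosted=True, multilingual=False),
    Candidate("paraphrase-multilingual-MiniLM-L12-v2", 384, self_hosted=True, multilingual=True),
]


def shortlist(catalog: list, req: Requirements) -> list:
    """Names of candidates that satisfy every hard requirement."""
    keep = []
    for c in catalog:
        if req.needs_multilingual and not c.multilingual:
            continue
        if req.data_must_stay_private and not c.self_hosted:
            continue
        if c.dims > req.max_dims:
            continue
        keep.append(c.name)
    return keep


if __name__ == "__main__":
    req = Requirements(needs_multilingual=True, data_must_stay_private=True, max_dims=1024)
    print(shortlist(CATALOG, req))
```

Hard requirements such as privacy work best as filters like this; softer preferences such as quality and latency are better handled in the evaluation and tradeoff steps.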

Step 3: Evaluate on YOUR Data

Build a test set of 100+ real query-document pairs from your domain. For each model:

Embed all documents and queries
Measure Recall@10 (are the correct documents in the top 10 results?)
Measure MRR (Mean Reciprocal Rank - how high is the correct doc ranked?)
Measure latency (embedding time per query)
Calculate cost (API pricing or GPU cost per million embeddings)
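
The two quality metrics above can be computed with a few lines of plain Python. This is a minimal sketch that assumes you already have embedding vectors for queries and documents (the tiny 2-dim vectors below are toy values, not real model output) and that documents are ranked by cosine similarity.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: list, doc_vecs: dict) -> list:
    """Document ids sorted from most to least similar to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


def recall_at_k(ranked: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list, relevant: set) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


def evaluate(queries: list, doc_vecs: dict, k: int) -> tuple:
    """Mean Recall@k and MRR over (query_vector, relevant_ids) pairs."""
    recalls, rrs = [], []
    for query_vec, relevant in queries:
        ranked = rank_documents(query_vec, doc_vecs)
        recalls.append(recall_at_k(ranked, relevant, k))
        rrs.append(reciprocal_rank(ranked, relevant))
    return sum(recalls) / len(recalls), sum(rrs) / len(rrs)


# Toy data standing in for real embeddings.
DOCS = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
QUERIES = [([0.9, 0.1], {"d1"}), ([0.6, 0.8], {"d2"})]

if __name__ == "__main__":
    recall, mrr = evaluate(QUERIES, DOCS, k=1)
    print(f"Recall@1={recall:.2f}  MRR={mrr:.2f}")
```

Run the same evaluate call once per candidate model, swapping in that model's vectors, and the numbers go straight into the tradeoff table in Step 4.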

Step 4: Make a Tradeoff Table

Model              | Recall@10 | MRR  | Latency | Cost/1M tokens | Privacy
-------------------|-----------|------|---------|----------------|--------
OpenAI small       | 0.82      | 0.71 | 50ms    | $0.02          | No
Cohere v3          | 0.85      | 0.74 | 60ms    | $0.10          | No
BGE-large          | 0.84      | 0.73 | 15ms*   | $0 (GPU)       | Yes
Multilingual-Mini  | 0.78      | 0.65 | 8ms*    | $0 (GPU)       | Yes

* Self-hosted latency (after initial GPU setup). "$0 (GPU)" means no per-token fee; you still pay for the GPU itself.
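
One way to turn a table like this into a decision is a weighted score. The sketch below uses the recall, MRR, latency, and privacy values from the table, min-max normalizes each column, and applies weights that are purely hypothetical; pick weights that reflect your own priorities. Cost is left out because the self-hosted GPU cost is not given.

```python
# Values copied from the tradeoff table; the weights are hypothetical.
TABLE = {
    "OpenAI small": {"recall": 0.82, "mrr": 0.71, "latency_ms": 50, "private": False},
    "Cohere v3": {"recall": 0.85, "mrr": 0.74, "latency_ms": 60, "private": False},
    "BGE-large": {"recall": 0.84, "mrr": 0.73, "latency_ms": 15, "private": True},
    "Multilingual-Mini": {"recall": 0.78, "mrr": 0.65, "latency_ms": 8, "private": True},
}
WEIGHTS = {"recall": 0.4, "mrr": 0.3, "latency_ms": 0.2, "private": 0.1}
LOWER_IS_BETTER = {"latency_ms"}


def normalize(column: dict, lower_is_better: bool = False) -> dict:
    """Min-max scale a column to [0, 1], flipping it when lower is better."""
    low, high = min(column.values()), max(column.values())
    span = high - low
    scaled = {}
    for name, value in column.items():
        fraction = 0.5 if span == 0 else (value - low) / span
        scaled[name] = 1.0 - fraction if lower_is_better else fraction
    return scaled


def weighted_scores(table: dict, weights: dict) -> dict:
    """Weighted sum of normalized metrics for each model."""
    totals = {name: 0.0 for name in table}
    for metric, weight in weights.items():
        column = {name: float(row[metric]) for name, row in table.items()}
        normed = normalize(column, metric in LOWER_IS_BETTER)
        for name in table:
            totals[name] += weight * normed[name]
    return totals


scores = weighted_scores(TABLE, WEIGHTS)
ranking = sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    for name in ranking:
        print(f"{name:<18} {scores[name]:.3f}")
```

With these particular weights BGE-large comes out on top, but shifting weight toward recall or away from privacy changes the order, which is exactly the point of making the tradeoff explicit.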

Note: Never pick an embedding model based on MTEB scores alone. Always build a test set from YOUR actual data and compare models on YOUR specific use case.

Advanced Selection Considerations

Beyond Basic Quality - What Experts Look For

Matryoshka Embeddings

Some newer models (Nomic, OpenAI v3) are trained with Matryoshka representation learning, which packs the most important information into the leading dimensions. Embedding quality then degrades gracefully as you truncate: a 1536-dim embedding cut to its first 256 dims (and re-normalized) still works reasonably well. This lets you trade quality for storage and speed dynamically.
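
Truncation itself is simple: keep the first N dimensions and re-normalize to unit length so cosine or dot-product search still behaves. A minimal sketch, assuming the vector came from a model trained for Matryoshka truncation (on other models the truncated vector may lose much more quality):

```python
import math


def truncate_and_normalize(vec: list, dims: int) -> list:
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]


if __name__ == "__main__":
    full = [3.0, 4.0, 12.0]  # toy 3-dim vector standing in for a real embedding
    short = truncate_and_normalize(full, 2)
    print(short, "length:", math.sqrt(sum(x * x for x in short)))
```

Queries and documents must be truncated to the same number of dimensions, otherwise their similarity scores are not comparable.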

Instruction-Tuned Models

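Some models (E5, BGE) expect a task instruction or prefix. You prepend a short string such as "Represent this document for retrieval:" or "Classify this text:" before the text, and the same model produces embeddings tuned for that task. The exact wording is model-specific (E5, for example, uses "query: " and "passage: "), so copy it from the model card. This is a big advantage for multi-purpose systems.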
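
In practice the prefix is just string concatenation done consistently at both index time and query time. The helper below is a hypothetical sketch: the "e5" entry uses the "query: " and "passage: " prefixes from the E5 model cards, while the with_prefix function and the PREFIXES table are names invented here. Confirm the exact strings for whichever model you deploy.

```python
# Prefix table keyed by model family and role.
# "e5" follows the E5 model cards; add other families from their own cards.
PREFIXES = {
    "e5": {"query": "query: ", "document": "passage: "},
    "none": {"query": "", "document": ""},  # models that take raw text
}


def with_prefix(text: str, family: str, role: str) -> str:
    """Return the text with the family's prefix for the given role."""
    try:
        return PREFIXES[family][role] + text
    except KeyError:
        raise ValueError(f"unknown family/role: {family}/{role}") from None


if __name__ == "__main__":
    print(with_prefix("how to file GST return", "e5", "query"))
    print(with_prefix("GST returns are filed on the portal.", "e5", "document"))
```

Forgetting the prefix on one side (for example, indexing with it but querying without it) is a common cause of unexpectedly poor retrieval.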

Context Length - The Silent Differentiator

512 tokens: Most traditional models. Fine for short text, queries, paragraphs.
2048 tokens: Newer models. Can handle longer sections without chunking.
8192 tokens: Nomic, Jina. Can embed entire pages or short documents in one shot.
32768+ tokens: Starting to appear in the newest models. Could change how we think about chunking entirely.

Longer context does not always mean better. Squeezing a long input into a single vector can dilute the signal from its most important parts. Sometimes chunking + shorter context gives better retrieval than one long embedding.
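
The chunking alternative can be sketched as a sliding window with overlap. This example splits on whitespace as a stand-in for the model's real tokenizer, so the counts will not match actual token limits; the window and overlap sizes are arbitrary illustration values.

```python
def chunk_tokens(tokens: list, max_tokens: int, overlap: int) -> list:
    """Split tokens into windows of up to max_tokens that share `overlap` tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    text = "one two three four five six seven eight nine ten"
    words = text.split()  # whitespace split stands in for a real tokenizer
    for chunk in chunk_tokens(words, max_tokens=4, overlap=1):
        print(" ".join(chunk))
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of embedding a few tokens twice.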

Asymmetric Search Consideration

In real-world search, the query is short ("how to file GST return") but the document is long (an entire tax guide page). Some models handle this asymmetry better than others. Models trained on asymmetric pairs (short query, long passage) tend to outperform symmetric models on search tasks.

Note: Matryoshka embeddings and instruction-tuning are game-changers. They give you flexibility to optimize for different tasks and storage constraints with a single model.

Interview Questions

Q: How would you choose an embedding model for a production RAG system?

Four-step process: (1) Define requirements - languages, document length, latency, privacy, budget. (2) Shortlist 3-5 models based on MTEB scores and requirements. (3) Build a test set of 100+ real query-document pairs from your domain and evaluate Recall@10, MRR, latency, and cost for each model. (4) Make a tradeoff table and choose based on your specific priorities. Never rely on benchmarks alone - always test on your own data.

Q: What is MTEB and why might its scores be misleading?

MTEB (Massive Text Embedding Benchmark) evaluates models across 8 task types and 58+ datasets. Scores can mislead because: (1) Average scores hide task-specific weaknesses - a model with 70 average might have 50 on retrieval. (2) Models can be overfit to benchmark datasets. (3) MTEB skews toward English web, Wikipedia, and academic text, which may not resemble your domain. (4) Size/speed tradeoffs are not reflected in scores - a 14GB model scoring 75 may not beat a 2GB model at 72 for your use case.

Q: When would you choose self-hosted over API embeddings?

Choose self-hosted when: (1) Data privacy is required (healthcare, finance - data cannot leave your infra). (2) Scale makes API cost prohibitive (millions of embeddings per day). (3) You need low latency without network hops. Choose API when: startup/MVP phase (fast setup), small scale (API cost is negligible), or you lack GPU infrastructure. Many teams start with API and migrate to self-hosted as they scale.

Q: What are Matryoshka embeddings and why are they useful?

Matryoshka embeddings (like Russian nesting dolls) are trained so that quality degrades gracefully when you truncate dimensions. A 1536-dim embedding truncated to 256 dims still works reasonably well. This is useful because you can dynamically trade quality for storage space and search speed. Store full dimensions for critical queries but use truncated versions for fast approximate search or when storage is constrained.
