Embeddings & Similarity Search

The Universal Language That Machines Use to Understand Everything

Master the fundamental concept behind all modern AI - embeddings. Learn how text, images, and code are converted into numbers that machines can compare, search, and reason about. Understand cosine similarity, distance metrics, and how similarity search powers every AI application you use.

What are Embeddings?

Turning Everything Into Numbers That Capture Meaning

The GPS Coordinates Analogy

Every place on Earth has GPS coordinates - two numbers (latitude, longitude) that tell you where it is. Mumbai is (19.07, 72.87), Delhi is (28.61, 77.20). Even though these are just numbers, they capture something meaningful: nearby places have similar coordinates. Mumbai and Pune are close in coordinates because they are close in reality. Embeddings work the same way but for meaning. "Happy" and "joyful" have similar embeddings because they mean similar things. But instead of 2 numbers (lat, long), embeddings typically use hundreds to thousands of numbers (384 to 3072 is common) to capture the many dimensions of meaning.

What Exactly is an Embedding?

Formally: A dense vector of floating-point numbers that represents a piece of data (text, image, audio, code) in a continuous vector space
Simply: A list of numbers that captures the "meaning" or "essence" of something
Example: "I love biryani" might become [0.23, -0.45, 0.89, 0.12, ...] (384 numbers)
Key Property: Similar things get similar number lists, different things get different number lists

What Can Be Embedded?

Text: Words, sentences, paragraphs, documents (most common use)
Images: Photos, diagrams, screenshots (CLIP, ViT models)
Audio: Speech, music (wav2vec 2.0, CLAP, Whisper encoder representations)
Code: Functions, classes, repositories (CodeBERT, StarCoder)
Multi-modal: Text + Image together in same space (CLIP, ALIGN)

Note: Embeddings are one of the most fundamental concepts in modern AI. Search, chatbots, recommendations, and RAG all rely on embeddings to understand and compare content.

Similarity Metrics - How to Compare Embeddings

Three Ways to Measure How Close Two Embeddings Are

Cosine Similarity (Most Popular)

Measures the angle between two vectors, ignoring their length. Like comparing the direction two people are walking, not how fast they are walking. Two people walking northeast are similar even if one walks 2 km/hr and the other walks 10 km/hr.

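Range: -1 (vectors point in opposite directions) to +1 (vectors point in the same direction); 0 means the vectors are orthogonal (unrelated)
Rough guide (thresholds vary by model, so calibrate them on your own data):
0.8+: Very similar (basically same topic)
0.5-0.8: Somewhat related
0.3-0.5: Weakly related
Below 0.3: Unrelated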
Best For: Text similarity, when you care about meaning not magnitude
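
As a minimal sketch of the definition above (dot product divided by the product of the two vector lengths), the following uses only the Python standard library. The three "walker" vectors are made-up toy values, not output from any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values for illustration only).
slow_walker = [1.0, 1.0, 0.0]        # heading northeast, slowly
fast_walker = [5.0, 5.0, 0.0]        # same direction, five times the magnitude
opposite_walker = [-1.0, -1.0, 0.0]  # heading southwest

print(cosine_similarity(slow_walker, fast_walker))      # approximately 1.0
print(cosine_similarity(slow_walker, opposite_walker))  # approximately -1.0
```

The first pair scores close to +1 even though the magnitudes differ by 5x, which is the "direction, not speed" behaviour described above.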

Dot Product

Multiplies corresponding dimensions and adds them up. Considers both direction and magnitude. Like measuring not just which direction someone walks, but also how far they go. If the model encodes the strength of a signal in vector length, a long document about cooking can score higher than a short tweet about cooking.

Range: -infinity to +infinity
Best For: When magnitude carries information (for example, models trained so that a longer vector signals a more relevant document)
Note: If vectors are normalized (length = 1), dot product equals cosine similarity
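
A small sketch of the note above, assuming toy 2-dimensional vectors: the raw dot product grows with vector length, but after normalizing both vectors to unit length it matches cosine similarity.

```python
import math

def dot(a, b):
    """Multiply corresponding dimensions and add them up."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length (length = 1)."""
    length = math.sqrt(dot(v, v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / length for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

raw = dot(a, b)                          # 3*6 + 4*8 = 50.0, affected by magnitude
unit = dot(normalize(a), normalize(b))   # magnitude removed
print(raw, unit, cosine_similarity(a, b))
```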

Euclidean Distance (L2)

The straight-line distance between two points. Like measuring the actual distance between Mumbai and Delhi on a map. Closer points = more similar.

Range: 0 (identical) to infinity (completely different)
Best For: When absolute position in space matters, clustering tasks
Note: LOWER is better (unlike cosine where higher is better)
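
A minimal sketch of Euclidean distance and of ranking by it, using made-up 2-dimensional points. Note that results are sorted in ascending order, because lower means more similar.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [0.0, 0.0]
points = {"near": [3.0, 4.0], "far": [6.0, 8.0], "same": [0.0, 0.0]}

# Ascending sort: the smallest distance is the best match.
ranked = sorted(points, key=lambda name: euclidean_distance(query, points[name]))
print(ranked)  # ['same', 'near', 'far']
```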

Which One to Use?

Default choice: Cosine similarity. Works best for most text embedding models.
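Check your model docs: Some models are tuned for dot product and others for cosine (sentence-transformers publishes both kinds). OpenAI embeddings are normalized to unit length, so cosine and dot product rank them identically. Using a metric the model was not trained for can degrade results.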
Pro tip: If you normalize all vectors to unit length first, cosine similarity, dot product, and Euclidean distance all give equivalent rankings.
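
The pro tip follows from a short identity: for unit vectors a and b, the squared Euclidean distance equals 2 - 2 * (a . b), so sorting by descending dot product (which equals cosine here) and sorting by ascending distance produce the same order. The sketch below checks this numerically on toy vectors chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = normalize([1.0, 2.0, 0.5])
docs = {
    "a": normalize([1.0, 2.0, 0.4]),
    "b": normalize([2.0, 1.0, 0.0]),
    "c": normalize([-1.0, 0.0, 3.0]),
}

# On unit vectors: ||q - d||^2 == 2 - 2 * (q . d)
for vec in docs.values():
    assert abs(euclidean(query, vec) ** 2 - (2 - 2 * dot(query, vec))) < 1e-9

by_dot = sorted(docs, key=lambda n: dot(query, docs[n]), reverse=True)
by_l2 = sorted(docs, key=lambda n: euclidean(query, docs[n]))
print(by_dot, by_l2)  # both orders are ['a', 'b', 'c']
```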

Note: When in doubt, use cosine similarity. It is the default for most embedding models and ignores vector magnitude, focusing purely on directional similarity (meaning).

How Similarity Search Works at Scale

From Brute Force to Blazing Fast

The Naive Approach (Brute Force)

Compare your query vector against EVERY vector in the database. For 1 million 768-dim vectors, that is 768 million floating-point multiplications (and about as many additions) per query, which takes roughly 200ms on a single CPU core, depending on hardware. Fine for small datasets (under 10K), but too slow for interactive search over millions of vectors.
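
A minimal brute-force sketch in pure Python, assuming a tiny in-memory list of toy vectors. The helper `multiplications_per_query` is a hypothetical name that just reproduces the arithmetic above (one multiplication per dimension per stored vector).

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def brute_force_search(query, vectors, k=3):
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine(query, vec), idx) for idx, vec in enumerate(vectors))
    return heapq.nlargest(k, scored)  # list of (score, index), best first

def multiplications_per_query(n_vectors, dim):
    """Dot-product multiplications needed to scan the whole collection once."""
    return n_vectors * dim

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
top = brute_force_search([1.0, 0.1], vectors, k=2)
print([idx for _, idx in top])                     # [0, 2]
print(multiplications_per_query(1_000_000, 768))   # 768000000
```

The cost grows linearly with the number of stored vectors, which is why this approach stops being practical as the collection grows.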

Approximate Nearest Neighbor (ANN)

The key insight: you do not need the exact nearest neighbors. Getting 95-99% of the true nearest neighbors in 1ms is far better than getting 100% in 200ms. ANN algorithms trade a tiny bit of accuracy for massive speed gains.

HNSW: Builds a navigation graph. Like using Google Maps instead of checking every road - you follow promising paths to quickly reach the destination. Most popular in production.
IVF: Clusters vectors into buckets. Only search the nearest buckets. Like checking only the relevant aisles in a supermarket instead of every shelf.
LSH: Uses hash functions to bucket similar vectors together. Fast but less accurate than HNSW.
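
To make the bucket idea concrete, here is a toy IVF-style sketch. It assumes hand-picked centroids (a real index would learn them, typically with k-means) and a hypothetical `nprobe` parameter for how many buckets to scan. It is an illustration of the idea, not how any particular library implements it.

```python
def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign every vector id to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: squared_l2(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return buckets

def ivf_search(query, vectors, centroids, buckets, k=1, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: squared_l2(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in buckets[c]]
    candidates.sort(key=lambda vec_id: squared_l2(query, vectors[vec_id]))
    return candidates[:k]

# Toy data: three obvious clusters, centroids chosen by hand (assumption).
vectors = [[0.1, 0.2], [0.0, 0.3], [5.0, 5.1], [5.2, 4.9], [9.8, 0.1], [10.1, 0.0]]
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
buckets = build_ivf(vectors, centroids)

print(buckets)  # {0: [0, 1], 1: [2, 3], 2: [4, 5]}
print(ivf_search([5.0, 5.0], vectors, centroids, buckets, k=2, nprobe=1))  # [2, 3]
```

With `nprobe=1` only two of the six vectors are compared against the query. A true neighbor that sits just across a bucket boundary would be missed, which is where the "approximate" in ANN comes from; raising `nprobe` trades speed for recall.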

The Speed Comparison

Dataset: 10 million vectors, 768 dimensions (illustrative figures; actual numbers depend on hardware and index settings)

Method        | Latency per query | Recall
Brute Force   | ~2000ms           | 100%
IVF-Flat      | ~10ms             | 95%
HNSW          | ~1ms              | 98%
HNSW+PQ       | ~0.5ms            | 93% (compressed)

* Recall = percentage of true nearest neighbors found

Two-Stage Retrieval

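Best production pattern: (1) Fast ANN search to get top 100-200 candidates from millions of vectors. (2) Precise re-ranking on just those 100-200 candidates using exact distances, a cross-encoder, or another more expensive model. This keeps the speed of ANN while recovering most of the accuracy on the final result set. The re-ranker can only reorder what stage 1 returned, so the candidate list must be large enough to contain the right answers.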
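
The two stages can be sketched as follows. Both scoring functions are stand-ins chosen for illustration: `prefix_dot` (a dot product over the first two dimensions) plays the role of the fast, approximate stage, and full cosine similarity plays the role of the expensive re-ranker. A real system would use an ANN index for stage 1 and possibly a cross-encoder for stage 2.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def prefix_dot(a, b, dims=2):
    """Cheap stage-1 score (stand-in): dot product over the first few dimensions."""
    return sum(x * y for x, y in zip(a[:dims], b[:dims]))

def two_stage_search(query, vectors, n_candidates=3, k=2):
    # Stage 1: cheap score over the collection, keep a shortlist.
    shortlist = heapq.nlargest(
        n_candidates, range(len(vectors)),
        key=lambda i: prefix_dot(query, vectors[i]))
    # Stage 2: expensive score on the shortlist only.
    reranked = sorted(shortlist, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return reranked[:k]

query = [1.0, 0.0, 1.0, 0.0]
vectors = [
    [1.0, 0.0, 1.0, 0.0],   # 0: true best match
    [1.0, 0.0, -1.0, 0.0],  # 1: looks good to stage 1, poor on full comparison
    [0.9, 0.0, 0.5, 0.0],   # 2: second-best on full comparison
    [0.0, 1.0, 0.0, 0.0],   # 3: unrelated
]
print(two_stage_search(query, vectors))  # [0, 2]
```

Vector 1 ties for first place in the cheap stage but drops out after re-ranking, which is the point of the second stage.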

Note: ANN search typically gives you 95-99% recall at roughly 1000x the speed of brute force. For production systems with millions of vectors, this tradeoff is almost always worth it.

Real-World Embedding Applications

How Every Major Tech Company Uses Embeddings

Semantic Search (Google, Flipkart)

User types "shoes for running on wet roads." Traditional keyword search looks for these exact words. Semantic search embeds the query and finds products whose embeddings are close - returning "waterproof trail running shoes" and "grip sole athletic footwear" even though they share no keywords. This is why modern search feels like it "understands" you.

RAG (ChatGPT with your data)

The core of Retrieval-Augmented Generation: embed your company documents, embed the user question, find the most similar documents, and feed them as context to the LLM. The LLM generates an answer grounded in YOUR data instead of making things up.
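
A rough sketch of the retrieval half of that loop. Everything here is an assumption for illustration: `toy_embed` is a word-count stand-in for a real embedding model (it is sparse, not a dense learned vector), the three documents are invented, and the prompt wording is arbitrary. The LLM call itself is left out.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents, k=2):
    """Embed the question and return the k most similar documents."""
    q_vec = toy_embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q_vec, toy_embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, context_docs):
    """Put the retrieved documents in front of the question for the LLM."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

documents = [
    "Refunds are processed within 7 days of a return request.",
    "Our office is closed on public holidays.",
    "Return requests must include the order number.",
]
question = "How long do refunds take after a return?"
prompt = build_prompt(question, retrieve(question, documents, k=2))
print(prompt)
```

In a real pipeline the document vectors would be computed once and stored in a vector index rather than re-embedded on every query.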

Recommendation Systems (Netflix, Spotify, Zomato)

Embed movies/songs/restaurants and users into the same vector space. A user who likes "Andaz Apna Apna" and "Hera Pheri" will have a user embedding close to other Bollywood comedies. Recommend items whose embeddings are nearest to the user embedding.
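
One simple way to place a user in the item space, shown here as an assumption rather than how any named company does it, is to average the embeddings of the items the user liked. The 2-dimensional item vectors and the placeholder titles `another_comedy`, `crime_thriller`, and `war_drama` are invented for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

def recommend(liked, item_embeddings, k=2):
    """Rank unseen items by similarity to the average of the liked items."""
    user_vec = mean_vector([item_embeddings[name] for name in liked])
    unseen = [name for name in item_embeddings if name not in liked]
    unseen.sort(key=lambda name: cosine(user_vec, item_embeddings[name]), reverse=True)
    return unseen[:k]

# Toy embeddings (assumed): first dimension ~ "comedy", second ~ "serious".
item_embeddings = {
    "Andaz Apna Apna": [0.9, 0.1],
    "Hera Pheri": [0.8, 0.2],
    "another_comedy": [0.85, 0.2],
    "crime_thriller": [0.1, 0.9],
    "war_drama": [0.3, 0.8],
}
print(recommend(["Andaz Apna Apna", "Hera Pheri"], item_embeddings))
# ['another_comedy', 'war_drama']
```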

Anomaly Detection (Fraud, Security)

Embed normal transactions. When a new transaction comes in, compute its distance from the cluster of normal transactions. If it is far away (low similarity), flag it as potentially fraudulent. Banks use this to catch unusual spending patterns.
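
A minimal sketch of the distance-from-normal idea. The threshold rule (mean distance plus three standard deviations) and the 2-dimensional transaction vectors are assumptions for illustration; a production system would tune the rule on labeled data.

```python
import math
import statistics

def centroid(vectors):
    """Element-wise average: the center of the normal cluster."""
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

def fit_detector(normal_vectors, n_sigmas=3.0):
    """Learn the cluster center and a distance threshold from normal data."""
    center = centroid(normal_vectors)
    distances = [math.dist(center, v) for v in normal_vectors]
    threshold = statistics.mean(distances) + n_sigmas * statistics.pstdev(distances)
    return center, threshold

def is_anomalous(vector, center, threshold):
    """Flag anything farther from the center than the learned threshold."""
    return math.dist(center, vector) > threshold

# Toy "embeddings" of normal transactions (assumed values).
normal = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0]]
center, threshold = fit_detector(normal)

print(is_anomalous([1.05, 0.95], center, threshold))  # False
print(is_anomalous([5.0, 0.2], center, threshold))    # True
```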

Code Search (GitHub Copilot)

Embed code functions and natural language descriptions into the same space. Developer types "sort array in descending order" - system finds code snippets with similar embeddings even if variable names and language differ. Powers intelligent code completion and search.

Note: Embeddings are the invisible backbone of modern AI. Every time an app 'understands' you - search results, recommendations, chatbot responses - embeddings are doing the heavy lifting behind the scenes.

Embedding Gotchas and Best Practices

Critical Mistakes That Ruin Your Embedding Pipeline

Gotcha 1: Asymmetric Query-Document Mismatch

Users ask short questions ("GST rate laptops") but your documents are long paragraphs. Some models handle this asymmetry poorly. A query embedding and a document embedding may not align well if the model was trained on same-length pairs. Fix: Use models explicitly trained for asymmetric retrieval (E5, BGE with query/passage prefixes).
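
For prefix-based models the fix is mostly a matter of formatting the input text before embedding it. The sketch below uses the "query: " and "passage: " prefixes documented for the E5 family; other models (including BGE) use different instructions, so treat the exact strings as an assumption and check the model card. The helper names are hypothetical.

```python
# Prefix strings follow the E5 convention; verify against your model's card.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "

def prepare_query(text):
    """Format a short user question before sending it to the embedding model."""
    return QUERY_PREFIX + text.strip()

def prepare_passage(text):
    """Format a document chunk before sending it to the embedding model."""
    return PASSAGE_PREFIX + text.strip()

print(prepare_query("  GST rate laptops "))  # query: GST rate laptops
print(prepare_passage("Laptops attract GST under the applicable slab."))
```

The important part is consistency: passages must be indexed with the passage format and questions embedded with the query format, every time.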

Gotcha 2: The Curse of Dimensionality

In very high dimensions (1000+), distances between randomly distributed vectors concentrate: every point ends up approximately the same distance from every other point. Learned embeddings have more structure than random vectors, but their similarity scores can still cluster tightly together, making it hard to distinguish truly similar items from moderately similar ones. Fix: Higher dimensions are not always better. Around 768 is often a reasonable balance of quality, storage, and speed. Use Matryoshka embeddings to test smaller dimensions.
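
The concentration effect is easy to observe on random data. This sketch draws random Gaussian points (an assumption: real embeddings are not random, so the effect is weaker in practice) and reports how wide the range of pairwise distances is relative to the average distance. The function name `relative_spread` is made up for this example.

```python
import math
import random

def relative_spread(dim, n_points=40, seed=0):
    """(max - min) / mean over all pairwise distances between random points."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(points)
             for q in points[i + 1:]]
    return (max(dists) - min(dists)) / (sum(dists) / len(dists))

for dim in (2, 32, 512):
    print(dim, round(relative_spread(dim), 3))
```

The printed spread shrinks as the dimension grows: in 2 dimensions some pairs are far closer than others, while in 512 dimensions nearly every pair sits at about the same distance.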

Gotcha 3: Embedding Drift Over Time

If you update your embedding model (new version, fine-tuned), old and new embeddings are incompatible. Half your database uses old vectors, half uses new ones. Similarity comparisons between them are meaningless. Fix: When you change models, re-embed EVERYTHING. Keep track of which model version each embedding was created with.
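
One way to make this mistake hard to commit is to store the model version next to every vector and refuse to compare across versions. The record layout and the version strings below are hypothetical, chosen only to illustrate the bookkeeping.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredEmbedding:
    doc_id: str
    vector: tuple
    model_version: str  # which embedding model produced this vector

def check_comparable(a, b):
    """Raise instead of silently comparing vectors from different models."""
    if a.model_version != b.model_version:
        raise ValueError(
            f"cannot compare {a.doc_id} ({a.model_version}) "
            f"with {b.doc_id} ({b.model_version})")

def stale_ids(records, current_version):
    """Documents that still need re-embedding after a model change."""
    return [r.doc_id for r in records if r.model_version != current_version]

records = [
    StoredEmbedding("doc-1", (0.1, 0.2), "model-v1"),
    StoredEmbedding("doc-2", (0.3, 0.4), "model-v2"),
    StoredEmbedding("doc-3", (0.5, 0.6), "model-v1"),
]
print(stale_ids(records, "model-v2"))  # ['doc-1', 'doc-3']
```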

Best Practice: Always Evaluate Retrieval Quality

Do not assume embeddings are working well just because the app runs. Build a test set of 50+ query-document pairs. Measure Recall@10 - are the correct documents showing up in top 10? If recall drops below 0.8, investigate your embedding model, chunking strategy, or similarity metric.
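
Recall@k is simple to compute once you have a labeled test set. In this sketch `search_fn` is a hypothetical callable standing in for your retrieval pipeline, and the document ids are invented.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each test query needs at least one relevant document")
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)

def mean_recall_at_k(test_set, search_fn, k=10):
    """Average Recall@k over (query, relevant_ids) pairs."""
    scores = [recall_at_k(search_fn(query), relevant, k) for query, relevant in test_set]
    return sum(scores) / len(scores)

# One relevant document (d1) is in the top 3, the other (d2) is missing.
print(recall_at_k(["d3", "d1", "d9", "d4"], {"d1", "d2"}, k=3))  # 0.5
```

Tracking this number before and after any change to the model, chunking, or metric shows whether the change helped.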

Note: The number one mistake: changing your embedding model without re-embedding all existing documents. Old and new embeddings live in different vector spaces and cannot be compared.

Interview Questions

Q: What are embeddings and why are they important in AI?

Embeddings are dense numerical vectors that represent the meaning of data (text, images, code) in a continuous vector space. They are important because they enable machines to capture semantic similarity - that "happy" and "joyful" mean similar things, that a picture of a cat is related to the word "cat." Modern AI applications such as search, RAG, recommendations, and chatbots depend on embeddings to compare and retrieve content by meaning rather than exact keywords.

Q: Explain cosine similarity and when you would use it vs dot product.

Cosine similarity measures the angle between vectors (direction only, ignoring magnitude), ranging from -1 to +1. Dot product considers both direction and magnitude. Use cosine when you want pure semantic similarity regardless of document length. Use dot product when magnitude carries information (e.g., a longer, more detailed document should rank higher). If vectors are normalized to unit length, both give equivalent rankings. Default recommendation: cosine similarity for most text search tasks.

Q: Why is Approximate Nearest Neighbor used instead of exact search?

Exact nearest neighbor (brute force) on 10M vectors takes about 2 seconds per query - unacceptable for real-time applications. ANN algorithms like HNSW trade a small amount of recall (finding about 98% instead of 100% of true neighbors) for a speedup of roughly three orders of magnitude (about 1ms instead of about 2000ms). In production, this tradeoff is usually worth it because the few missed results are rarely noticeably better than the ones returned.

Q: What happens when you change your embedding model in production?

Old and new embeddings exist in different vector spaces and are completely incompatible. Comparing an embedding from Model A with one from Model B gives meaningless results - even if both are 768 dimensions. You MUST re-embed ALL existing documents with the new model. This means: (1) Keep original source documents always. (2) Track which model version created each embedding. (3) Plan for re-indexing time and cost when upgrading models.
