Tokenization

What is Tokenization?

Breaking Text into Digestible Pieces for AI

The Core Problem:

Neural networks work with numbers, not text. Before an LLM can process your prompt, it needs to convert text into numbers. Tokenization is the first step - breaking text into smaller pieces called tokens.

A token is not necessarily a word; it is often a subword unit. Common words might be a single token, while rare words get split into multiple tokens.

How Tokenization Works - Examples (illustrative; exact splits vary by tokenizer):

"Hello world"       -> ["Hello", " world"]         (2 tokens)
"Tokenization"      -> ["Token", "ization"]        (2 tokens)
"I love Flipkart"   -> ["I", " love", " Flip", "kart"] (4 tokens)
"unhappiness"       -> ["un", "happiness"]         (2 tokens)
"GPT-4"             -> ["G", "PT", "-", "4"]      (4 tokens)

Hindi text (romanized here) uses MORE tokens per word:
"namaste"           -> ["nam", "aste"]            (2 tokens)
"kaise ho"          -> ["ka", "ise", " ho"]       (3 tokens)

Rule of thumb: In English, 1 token is roughly 3/4 of a word, or about 4 characters. Non-English languages typically use more tokens for the same content.
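The rule of thumb can be turned into a quick estimator. This is a minimal sketch: `estimate_tokens` is a hypothetical helper, and its output is only a ballpark for English text, not a substitute for running a real tokenizer.

```python
def estimate_tokens(text: str) -> dict:
    """Rough English-only token estimates from the two rules of thumb."""
    n_chars = len(text)
    n_words = len(text.split())
    return {
        "by_chars": round(n_chars / 4),     # about 4 characters per token
        "by_words": round(n_words / 0.75),  # about 3/4 of a word per token
    }


print(estimate_tokens("Tokenization breaks text into smaller pieces called tokens."))
```

The two estimates will usually disagree slightly; treat the gap as a reminder that both are approximations.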

Why Not Just Split by Words or Characters?

Approach	Trade-off
Word-level	Vocabulary too large (millions of words). Words unseen during training cannot be handled
Character-level	Sequences become very long. The model struggles to learn word-level meaning from individual characters
Subword (BPE)	Best of both: manageable vocabulary (roughly 32K-200K), handles new words, efficient sequences
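The first two rows of the table are easy to see in a few lines of code. This sketch assumes a tiny made-up training vocabulary (`train_vocab`) and uses `<unk>` as a hypothetical placeholder for unknown words.

```python
# Hypothetical word-level vocabulary learned during training.
train_vocab = {"i", "love", "shopping"}
sentence = "i love flipkart"

# Word-level: any word outside the vocabulary collapses to <unk>.
word_tokens = [w if w in train_vocab else "<unk>" for w in sentence.split()]

# Character-level: nothing is unknown, but the sequence gets long.
char_tokens = list(sentence)

print(word_tokens)       # ['i', 'love', '<unk>']
print(len(char_tokens))  # 15
```

Subword tokenization sits between these two extremes: "flipkart" would be split into a few known pieces instead of being lost or spelled out letter by letter.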

Note: Tokenization directly impacts cost and performance. More tokens = more expensive API calls and slower generation. Hindi/Devanagari text typically costs 2-3x more tokens than equivalent English text.

BPE - How Modern Tokenizers Work

Byte Pair Encoding - The Algorithm Behind GPT and LLaMA

How BPE Builds Its Vocabulary:

BPE starts with individual characters and iteratively merges the most frequent pairs:

Start: Vocabulary = all individual bytes/characters [a, b, c, ..., z, 0, 1, ...]
Count: Find the most frequent adjacent pair in the training corpus
Merge: Create a new token from that pair and add to vocabulary
Repeat: Keep merging until vocabulary reaches target size (e.g., 50,000)

Start: [l, o, w, e, r, n, w, e, s, t]
Step 1: "e" + "r" most frequent -> merge "er"
Step 2: "l" + "o" most frequent -> merge "lo"
Step 3: "lo" + "w" most frequent -> merge "low"
Step 4: "low" + "er" -> merge "lower"
... and so on, across billions of words of training text
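The merge loop above can be sketched in plain Python. This is a minimal illustration, not the implementation used by any production tokenizer: it works on characters rather than bytes, the toy corpus and its word frequencies are made up, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = {tuple(w): f for w, f in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        words = merge_pair(best, words)
        merges.append(best)
    return merges, words


# Toy corpus (hypothetical frequencies): low x5, lower x2, newest x6, widest x3
corpus = " ".join(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3)
merges, words = train_bpe(corpus, num_merges=4)
print(merges)
print(list(words))
```

Each learned merge becomes a new vocabulary entry. To tokenize new text, a real tokenizer replays the merges in the order they were learned.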

Popular Tokenizers:

Tokenizer	Used By	Vocab Size
cl100k_base	GPT-4, GPT-3.5	100,277
o200k_base	GPT-4o	200,019
SentencePiece (BPE)	LLaMA 1/2, Mistral 7B	32,000
WordPiece	BERT	30,522

Why Tokenization Matters for You:

API Costs: You pay per token. A longer prompt = more tokens = higher cost
Context Window: The limit is in tokens, not words. By the 3/4 rule of thumb, 128K tokens is roughly 96K English words, not 128K words
Non-English penalty: Hindi uses 2-3x more tokens than English for same content. This means higher costs and less context available
Code: Code is relatively token-efficient because common keywords are single tokens
Math: Numbers get split strangely (12345 might be ["123", "45"]), which partly explains why LLMs are bad at arithmetic
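A back-of-the-envelope calculation makes the cost and context points concrete. This is a sketch under stated assumptions: the helper names are hypothetical, the per-million-token prices are invented for illustration, and the 3x inflation for Hindi is simply the upper end of the range quoted above.

```python
def words_that_fit(context_tokens, words_per_token=0.75):
    """Approximate number of English words that fit in a context window."""
    return int(context_tokens * words_per_token)


def request_cost_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost of one API call, given prices per million tokens."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


# Hypothetical prices, for illustration only. Check your provider's pricing page.
PRICE_IN, PRICE_OUT = 2.00, 8.00

english = request_cost_usd(1_000, 500, PRICE_IN, PRICE_OUT)
# Same content in Hindi, assuming a 3x token inflation.
hindi = request_cost_usd(3_000, 1_500, PRICE_IN, PRICE_OUT)

print(words_that_fit(128_000))  # 96000
print(english, hindi)
```

Because cost is linear in token count, a 3x token inflation means a 3x bill for the same content.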

Note: Use OpenAI's tiktoken library or the Tiktokenizer web tool to count tokens and see how your text gets tokenized. This helps optimize prompts and manage costs.

Embeddings - From Tokens to Meaning

Converting Words into Numbers That Capture Meaning

What Are Embeddings?

An embedding is a high-dimensional vector (list of numbers) that represents a token, word, sentence, or document. The key property: similar meanings = similar vectors.

Analogy: Think of embeddings as GPS coordinates for meaning. Just like Delhi and Noida have similar coordinates (close on the map), "king" and "queen" have similar embedding vectors (close in meaning space).

The Magic of Embedding Arithmetic:

Famous example from Word2Vec:

King - Man + Woman = Queen

The vectors capture semantic relationships. You can do math with meaning! Other examples:

Paris - France + India = Delhi
Walked - Walk + Swim = Swam
Bigger - Big + Small = Smaller

This happens because embeddings encode relationships as directions in vector space. The direction from "man" to "woman" is roughly the same as from "king" to "queen", so the result of the arithmetic lands near (not exactly on) the target word's vector.
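Here is a toy version of the arithmetic. The 3-dimensional vectors are hand-made for illustration (think of the axes as royalty, gender, and food); real Word2Vec vectors are learned from text and have hundreds of dimensions. The `analogy` helper is hypothetical and, as is usual for this kind of demo, excludes the input words from the candidates.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy embeddings: (royalty, gender, food). Not from a real model.
vectors = {
    "king":  [0.9,  0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "apple": [0.0,  0.0, 1.0],
}


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```

With real embeddings the match is approximate: the result vector lands near "queen" rather than exactly on it, which is why the nearest-neighbor lookup is needed.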

Types of Embeddings:

Token Embeddings: Each token in the LLM vocabulary has a learned embedding vector. This is the first layer of any Transformer
Word Embeddings (Word2Vec, GloVe): Older approach. One vector per word. Cannot handle polysemy ("bank" = river bank vs money bank)
Contextual Embeddings (BERT, GPT): Same word gets different embeddings based on context. "bank" in "river bank" vs "bank account" gets different vectors
Sentence/Document Embeddings: Entire text compressed into one vector. Used for semantic search, RAG, similarity comparison

Popular Embedding Models:

Model	Dimensions	Use Case
text-embedding-3-large (OpenAI)	3,072	Best quality, paid API
text-embedding-3-small (OpenAI)	1,536	Good balance of cost/quality
all-MiniLM-L6-v2 (HF)	384	Fast, free, lightweight
BGE-large-en-v1.5 (HF)	1,024	Open-source, high quality

Note: Embeddings are the bridge between human language and machine math. They power semantic search, RAG, recommendation systems, and document clustering. Understanding embeddings is essential for building any AI application.

Real-World Applications of Embeddings

Where Embeddings Power Real Products

1. Semantic Search (Flipkart/Amazon)

Traditional search matches keywords. Semantic search with embeddings understands meaning:

User searches: "comfortable shoes for walking all day"
Keyword search: Looks for exact words - misses "cushioned sneakers for daily wear"
Embedding search: Understands the meaning and finds all relevant products regardless of exact wording

How it works: Embed all product descriptions into vectors. When a user searches, embed the query and find the most similar product vectors using cosine similarity.
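The ranking step can be sketched as follows. The product vectors and the query vector are hypothetical 3-dimensional values chosen by hand; in a real system both would come from the same embedding model, and the search would run inside a vector index rather than a Python loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical precomputed product embeddings.
products = {
    "cushioned sneakers for daily wear": [0.9, 0.8, 0.1],
    "formal leather office shoes":       [0.2, 0.9, 0.1],
    "stainless steel water bottle":      [0.0, 0.1, 0.9],
}

# Hypothetical embedding of "comfortable shoes for walking all day".
query_vec = [0.8, 0.7, 0.0]


def search(query, top_k=2):
    """Return the top_k (product, score) pairs, most similar first."""
    scored = [(name, cosine(query, vec)) for name, vec in products.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


for name, score in search(query_vec):
    print(f"{score:.3f}  {name}")
```

Note that the top result shares no keywords with the query; the match comes entirely from vector similarity.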

2. RAG (Retrieval Augmented Generation)

The most important embedding application for LLMs:

Split your documents into chunks
Embed each chunk and store in a vector database (Pinecone, ChromaDB, Weaviate)
When a user asks a question, embed the question
Find the most similar document chunks
Pass those chunks + question to the LLM
The LLM answers using the retrieved context, which reduces (but does not eliminate) hallucination
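The steps above can be sketched end to end without any external services. Everything here is a stand-in: `embed` is a simple word-count vector (keyword-based, NOT semantic), the list `index` plays the role of a vector database, the policy documents are invented, and the final LLM call is left out, so the sketch stops at building the prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: word counts. A real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def chunk(document, size=12):
    """Step 1: split a document into chunks of at most `size` words."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def retrieve(question, index, top_k=2):
    """Steps 3-4: embed the question and return the most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(question, chunks):
    """Step 5: combine retrieved chunks and the question into one prompt."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical documents, for illustration only.
docs = [
    "Orders can be returned within 10 days of delivery.",
    "Refunds are credited to the original payment method in 5 days.",
    "Delivery is free for orders above 500 rupees.",
]

# Step 2: embed each chunk and store it (in-memory stand-in for a vector database).
index = [(c, embed(c)) for d in docs for c in chunk(d)]

question = "How many days do refunds take?"
top = retrieve(question, index, top_k=1)
print(build_prompt(question, top))
```

Swapping `embed` for a real embedding model and `index` for a vector database keeps the same shape; only those two pieces change.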

3. Recommendation Systems (Swiggy/Zomato)

Embed user preferences and restaurant features into the same vector space. Restaurants with vectors closest to the user preference vector get recommended.

If a user frequently orders butter chicken, biryani, and kebabs, their preference vector will be close to North Indian restaurant vectors. The system recommends similar restaurants the user has not tried yet.
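One simple way to build the preference vector is to average the vectors of the dishes a user has ordered. This is a sketch of that idea, not how any particular app works: the 2-dimensional vectors (North Indian, South Indian) and the restaurant names are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_vector(vectors):
    """Element-wise average of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Hypothetical embeddings: (north_indian, south_indian).
dish_vecs = {
    "butter chicken": [0.9, 0.1],
    "biryani":        [0.7, 0.3],
    "kebabs":         [0.8, 0.0],
}
restaurant_vecs = {
    "north_indian_dhaba": [0.9, 0.1],
    "mughlai_kitchen":    [0.8, 0.2],
    "south_indian_cafe":  [0.1, 0.9],
}


def recommend(order_history, already_tried):
    """Return the untried restaurant closest to the user's preference vector."""
    user_vec = mean_vector([dish_vecs[d] for d in order_history])
    candidates = [r for r in restaurant_vecs if r not in already_tried]
    return max(candidates, key=lambda r: cosine(user_vec, restaurant_vecs[r]))


print(recommend(["butter chicken", "biryani", "kebabs"], {"north_indian_dhaba"}))
```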

4. Duplicate Detection

Embed support tickets or user queries. If two tickets have very similar embeddings (cosine similarity > 0.95), they are probably duplicates or related issues. Used by customer support systems to auto-merge tickets.
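A threshold check like this takes only a few lines. The ticket vectors below are hypothetical, and the 0.95 cutoff is the one quoted above; in practice the threshold is usually tuned on labeled examples for the embedding model in use.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical ticket embeddings keyed by ticket ID.
tickets = {
    "T1": [0.90, 0.10, 0.20],  # "app crashes on login"
    "T2": [0.88, 0.12, 0.22],  # "login screen crash"
    "T3": [0.10, 0.90, 0.10],  # "refund not received"
}


def find_duplicates(embeddings, threshold=0.95):
    """Return all ID pairs whose cosine similarity exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(embeddings, 2)
        if cosine(embeddings[a], embeddings[b]) > threshold
    ]


print(find_duplicates(tickets))  # [('T1', 'T2')]
```

Comparing every pair is quadratic in the number of tickets, so large systems typically use a vector index to fetch only the nearest neighbors of each new ticket.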

Note: Embeddings + Vector Databases + LLMs = RAG, one of the most practical and widely deployed patterns for building AI applications. Master this stack!

Tokenization Pitfalls & Gotchas

Things That Go Wrong with Tokenization

Pitfall 1: Non-English Token Inflation

LLM tokenizers are trained primarily on English text. As a result, Hindi, Tamil, and other Indian languages often need 2-4x more tokens for the same content, with the exact ratio depending on the tokenizer. This means higher API costs and less effective context usage.

Example: "How are you?" = 4 tokens. The Hindi equivalent may use 8-12 tokens.
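One likely contributing factor, assuming a byte-level BPE tokenizer: each Devanagari character takes 3 bytes in UTF-8, so the tokenizer starts from a much longer byte sequence than it does for Latin text and has typically learned fewer merges for it. The sketch below only counts bytes; it does not run a tokenizer, and `utf8_stats` is a hypothetical helper.

```python
def utf8_stats(text):
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


romanized = "namaste"
# "namaste" in Devanagari, written with escapes so the code points are explicit.
devanagari = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(utf8_stats(romanized))   # (7, 7)
print(utf8_stats(devanagari))  # (6, 18)
```

The Devanagari spelling has fewer characters but more than twice as many bytes, before any merges are applied.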

Pitfall 2: Numbers and Math

Numbers get tokenized inconsistently. "123456" might become ["123", "456"] or ["12", "34", "56"]. The model sees these as separate subwords, not as a single number. This is one reason LLMs struggle with arithmetic.

Pitfall 3: Spacing Matters

Leading spaces are part of tokens! "hello" and " hello" tokenize differently. This is by design (to preserve whitespace) but can cause confusion when counting tokens or constructing prompts.

Pitfall 4: Embedding Drift

Embeddings from different models live in different vector spaces. You cannot mix embeddings from OpenAI with embeddings from a Hugging Face model. If you change your embedding model, you must re-embed all your data.

Pitfall 5: Embedding Dimensionality and Storage

High-dimensional embeddings (3072-dim) capture more nuance but use more storage and are slower to search. For 1 million documents with 3072-dim float32 embeddings, you need ~12 GB just for vectors. Consider dimensionality reduction or smaller models for large-scale applications.
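The storage figure is simple arithmetic: vectors x dimensions x bytes per value. A small helper (hypothetical name, raw vector data only, ignoring index overhead and metadata) makes it easy to compare models:

```python
def vector_storage_gb(n_vectors, dims, bytes_per_value=4):
    """Raw storage in GB (10^9 bytes) for float32 vectors by default."""
    return n_vectors * dims * bytes_per_value / 1e9


print(vector_storage_gb(1_000_000, 3072))  # about 12.3 GB
print(vector_storage_gb(1_000_000, 384))   # about 1.5 GB
```

Dropping from 3072 to 384 dimensions cuts raw storage by 8x, which is the trade-off to weigh against retrieval quality.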

Note: Always test your tokenizer with your actual content, especially for non-English text. Token count differences can significantly impact costs and performance.

Interview Questions

Q: What is tokenization and why do LLMs use subword tokenization?

Tokenization splits text into units (tokens) for model processing. LLMs use subword tokenization (BPE) because word-level tokenization creates too large a vocabulary and cannot handle unseen words, while character-level tokenization produces sequences that are too long. Subword is the sweet spot - manageable vocabulary size (roughly 32K-200K), handles any word including unseen ones, and efficient sequences.

Q: What are embeddings and how are they used in RAG?

Embeddings are high-dimensional vectors that capture semantic meaning. Similar meanings produce similar vectors. In RAG: (1) Documents are chunked and embedded into vectors stored in a vector DB. (2) User query is embedded. (3) Most similar document chunks are found via cosine similarity. (4) Retrieved chunks are passed as context to the LLM, grounding its response in real data.

Q: What is the difference between static and contextual embeddings?

Static embeddings (Word2Vec, GloVe) assign one fixed vector per word regardless of context. "Bank" gets the same vector in "river bank" and "bank account". Contextual embeddings (BERT, GPT) generate different vectors for the same word based on surrounding context, correctly capturing different meanings.

Q: How does BPE (Byte Pair Encoding) build its vocabulary?

BPE starts with individual characters as the vocabulary. It then iteratively: (1) counts all adjacent token pairs in the corpus, (2) merges the most frequent pair into a new token, (3) adds it to the vocabulary. This repeats until the target vocabulary size is reached. Common words become single tokens, while rare words are composed of multiple subword tokens.

Q: Why is cosine similarity used for comparing embeddings?

Cosine similarity measures the angle between two vectors, ignoring magnitude. This is ideal for embeddings because we care about the direction (meaning), not the vector's length. Two texts about the same topic will point in a similar direction even if their vectors differ in magnitude. Values range from -1 (opposite direction) to 1 (same direction). Dot product and Euclidean distance are alternatives; for unit-normalized vectors, the dot product equals cosine similarity.
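A small comparison shows the "ignores magnitude" property. The vectors are arbitrary numbers picked for the demo; `b` points the same way as `a` but is twice as long, and `c` points the opposite way.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]     # same direction as a, twice the magnitude
c = [-1.0, -2.0, -3.0]  # opposite direction

print(round(cosine(a, b), 6))     # 1.0
print(round(cosine(a, c), 6))     # -1.0
print(dot(a, b))                  # 28.0
print(round(math.dist(a, b), 3))  # 3.742
```

Cosine treats `a` and `b` as a perfect match, while the dot product and Euclidean distance both change with the scaling.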

