Transformer Deep Dive

What is the Transformer?

"Attention Is All You Need" - The Paper That Changed Everything

The Origin Story:

In 2017, a team at Google published a paper titled "Attention Is All You Need". They proposed a new neural network architecture called the Transformer that could process sequences without recurrence (no RNNs) or convolution (no CNNs). It relied entirely on a mechanism called attention.

This single paper is the foundation of GPT, BERT, Claude, Gemini, LLaMA, and virtually every modern large language model. It is arguably one of the most influential ML papers ever written.

Why Transformers Replaced RNNs:

Before Transformers, RNNs (Recurrent Neural Networks) were the go-to for text. But they had critical problems:

Sequential processing: RNNs process tokens one at a time, left to right. The steps cannot be parallelized across the sequence, so training is slow.
Vanishing gradients: Information from early tokens gets "diluted" as the sequence gets longer, so RNNs struggle with long-range dependencies.
Memory bottleneck: The entire context must fit into a fixed-size hidden state.

Transformers address all three: they process all tokens in parallel during training, use attention to directly connect any two tokens regardless of distance, and keep a separate representation for every token instead of compressing the context into one fixed-size state.

The Big Picture - Transformer Components:

Input Embedding: Converts tokens to vectors
Positional Encoding: Adds position information (since attention has no inherent order)
Multi-Head Self-Attention: The core mechanism - lets tokens look at each other
Feed-Forward Network: Processes each position independently after attention
Layer Normalization: Stabilizes training
Residual Connections: Allows gradients to flow through deep networks

Note: The Transformer is the single most important architecture in modern AI. GPT = Transformer decoder, BERT = Transformer encoder, T5 = full encoder-decoder. Understanding attention is the key to understanding all modern LLMs.

Self-Attention - The Core Mechanism

How Tokens Pay Attention to Each Other

The Intuition Behind Attention:

Consider the sentence: "The cat sat on the mat because it was tired."

What does "it" refer to? The cat or the mat? Humans instantly know it is the cat. Self-attention lets the model make this connection by computing how much each token should "attend to" every other token.

Analogy: Imagine a classroom where every student (token) can look at every other student and decide who is most relevant to them. The "it" student looks around and decides the "cat" student is most important for understanding its meaning.

Query, Key, Value - The Q/K/V Framework:

Each token gets three learned representations:

Query (Q): "What am I looking for?" - represents what this token wants to find
Key (K): "What do I contain?" - represents what this token offers
Value (V): "What information do I carry?" - the actual content to pass along
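
As a rough illustration of how the three representations above are produced, the sketch below multiplies each token embedding by three weight matrices. The matrices W_q, W_k, W_v and all numbers are made up for the example; in a real model they are learned during training.

```python
def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def project_qkv(X, W_q, W_k, W_v):
    """Map each token embedding to its query, key, and value vectors."""
    Q = [matvec(x, W_q) for x in X]
    K = [matvec(x, W_k) for x in X]
    V = [matvec(x, W_v) for x in X]
    return Q, K, V

# Hypothetical numbers: 2 tokens, d_model = 3, d_k = d_v = 2.
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
W_v = [[1.0, 1.0], [0.0, 2.0], [1.0, 0.0]]

Q, K, V = project_qkv(X, W_q, W_k, W_v)
print(Q[0], K[0], V[0])
```

The same embedding yields three different vectors because each matrix is a separate projection.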

Analogy - Flipkart Search:

Query: You type "wireless earphones under 2000"
Key: Each product has tags (wireless, earphone, price range)
Value: The actual product details (name, price, reviews)

The search engine matches your Query against all product Keys, ranks by relevance, and returns the Values of the best matches. Self-attention works the same way!

The Attention Formula:

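Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Q K^T: Dot products of every query with every key - measures similarity between tokens
sqrt(d_k): Scaling factor, where d_k is the key dimension. Without it, dot products grow with d_k and push softmax toward near one-hot outputs with very small gradients
softmax: Converts each row of scores into weights between 0 and 1 that sum to 1
V: The weights multiply the value matrix, giving a weighted sum of values for each token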

The result: each token gets a new representation that incorporates information from all other relevant tokens.
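
The formula can be sketched in plain Python as follows. This is a minimal single-head version for small lists of vectors, not an optimized implementation; the toy input reuses the same vectors as queries, keys, and values, which is an assumption made only to keep the example short.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: 3 tokens with 2-dimensional vectors (arbitrary values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, and each output row is a blend of the value vectors according to those weights.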

Note: Self-attention is O(n^2) in sequence length - every token attends to every other token. This is one reason context windows have limits. A 128K context window means up to 128K x 128K (about 17 billion) query-key scores per head per layer!
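
To make the quadratic growth concrete, here is a small arithmetic sketch. It counts entries in one full attention matrix and assumes 128K means 131,072 tokens; the function name is made up for this example.

```python
def attention_score_count(seq_len):
    """Number of query-key scores in one full (unmasked) attention matrix."""
    return seq_len * seq_len

for n in (1_024, 8_192, 131_072):  # 131,072 = 128K tokens
    print(f"{n:>7} tokens -> {attention_score_count(n):,} scores per head per layer")
```

Doubling the sequence length quadruples the number of scores.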

Multi-Head Attention & Positional Encoding

Multiple Perspectives and Position Awareness

Multi-Head Attention - Why Multiple Heads?

Instead of computing attention once, the Transformer does it multiple times in parallel with different learned projections. Each "head" learns to attend to different things.

Analogy: When you read a newspaper article, you simultaneously track multiple things - who did what (subject-verb), where it happened (location), when (time), and why (cause). Each attention head can specialize in one such relationship.

Head 1: Might learn subject-verb relationships
Head 2: Might learn adjective-noun connections
Head 3: Might track pronoun references
Head 4: Might focus on negation words

GPT-3 (175B), for example, uses 96 attention heads per layer. The outputs of all heads are concatenated and linearly projected back to the model dimension.
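
The split-and-concatenate step can be sketched as below. To stay short, this version assumes Q, K, and V are already projected and leaves out the learned input and output projection matrices, so it shows only how the model dimension is divided among heads and joined again.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Slice each row into num_heads chunks of width d_model // num_heads."""
    d_head = len(X[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X] for h in range(num_heads)]

def multi_head_attention(Q, K, V, num_heads):
    """Run attention separately per head, then concatenate per token."""
    per_head = zip(split_heads(Q, num_heads), split_heads(K, num_heads), split_heads(V, num_heads))
    heads = [attend(q, k, v) for q, k, v in per_head]
    return [[x for head in heads for x in head[t]] for t in range(len(Q))]

# Toy input: 3 tokens, d_model = 4, split into 2 heads of width 2.
X = [[1.0, 0.0, 0.5, 2.0], [0.0, 1.0, 1.5, 0.0], [1.0, 1.0, 0.0, 1.0]]
out = multi_head_attention(X, X, X, num_heads=2)
print(len(out), len(out[0]))
```

Each head sees only its own slice of the vectors, so the heads can produce different attention patterns over the same tokens.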

Positional Encoding - Adding Order Information:

Without position information, self-attention treats input as a set, not a sequence. "Dog bites man" and "Man bites dog" contain the same tokens, so the model could not tell them apart! We need to tell the model about word order.

Solution: Inject position information into the model. The original approach adds a unique positional vector to each token embedding before feeding it to attention. Three common approaches:

Sinusoidal (original): Uses sin/cos functions at different frequencies. Each position gets a unique pattern. In principle it can extend to lengths not seen in training
RoPE (modern): Rotary Position Embeddings - rotates the query and key vectors by an angle that depends on position, so attention scores depend on relative distance. Used by LLaMA, Mistral, and many modern LLMs. Tends to handle long sequences better
Learned (BERT): Position embeddings are learned parameters. Simple but cannot generalize beyond training length
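
As an illustration of the sinusoidal variant, the sketch below builds the position table using sine on even indices and cosine on odd indices, with frequencies based on 10000 as in the original paper. The function name and the small sizes are chosen just for the example.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            row.append(math.sin(angle))  # even index
            row.append(math.cos(angle))  # odd index
        table.append(row[:d_model])
    return table

pe = sinusoidal_encoding(num_positions=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(x, 3) for x in row])
```

Each row would be added to the token embedding at that position; early dimensions change quickly with position, later ones slowly.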

Feed-Forward Network (FFN):

After attention aggregates information across tokens, each token passes through an independent feed-forward network (two linear layers with an activation function). This is where the model does "thinking" about the information it gathered.

Recent research suggests the FFN layers act as key-value memories that store factual knowledge. The attention layers route information, and the FFN layers recall facts.
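
A position-wise feed-forward network can be sketched like this. The weights below are arbitrary placeholders rather than trained values, and GELU is used as the activation function as one common choice.

```python
import math

def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, W, b):
    """y = x W + b for one vector x; W is in_dim x out_dim."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def feed_forward(X, W1, b1, W2, b2):
    """Apply the same two-layer network to every position independently."""
    return [linear([gelu(h) for h in linear(x, W1, b1)], W2, b2) for x in X]

# Placeholder weights: d_model = 2, hidden width = 4.
W1 = [[1.0, -1.0, 0.5, 0.0], [0.0, 1.0, -0.5, 2.0]]
b1 = [0.0, 0.0, 0.1, -0.1]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
b2 = [0.0, 0.0]

X = [[1.0, 2.0], [3.0, -1.0]]
Y = feed_forward(X, W1, b1, W2, b2)
print(Y)
```

No token looks at any other token in this step; mixing across positions happens only in attention.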

Note: Multi-head attention lets the model attend to information from different representation subspaces simultaneously. It is like having multiple experts analyze the same text from different angles.

Encoder vs Decoder vs Encoder-Decoder

Three Transformer Variants for Different Tasks

Encoder-Only (BERT family):

Uses bidirectional attention - each token can attend to all tokens (both left and right). Great for understanding tasks.

Training: Masked language modeling (predict hidden words)
Best for: Classification, NER, sentiment analysis, sentence embeddings
Examples: BERT, RoBERTa, ALBERT, DeBERTa
Cannot: Generate new text naturally

Decoder-Only (GPT family) - The LLM Standard:

Uses causal (masked) attention - each token can only attend to previous tokens (left only). Designed for generation.

Training: Next-token prediction (autoregressive)
Best for: Text generation, chatbots, code generation, reasoning
Examples: GPT-4, Claude, LLaMA, Mistral, Gemini
Why dominant: Scales better, single unified architecture for all tasks

Key insight: The causal mask ensures the model cannot "cheat" by looking at future tokens during training. Each position only sees past context, just like during generation.
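
A minimal sketch of the causal mask: scores for future positions are set to negative infinity before softmax, so their weights become exactly zero. The all-zero score matrix is a made-up input that makes the pattern easy to see.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[i][j], masking future positions j > i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Four tokens with equal raw scores everywhere.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The first token can attend only to itself, while the last token spreads its attention over all four positions.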

Encoder-Decoder (T5, BART):

The original Transformer design. Encoder processes input with bidirectional attention. Decoder generates output with causal attention + cross-attention to the encoder.

Best for: Translation, summarization, sequence-to-sequence tasks
Examples: T5, BART, mT5, Flan-T5
Why less popular now: Decoder-only models can handle the same tasks by treating them as text generation, with a simpler architecture

Quick Comparison:

Type	Attention	Primary Use	Example
Encoder-only	Bidirectional	Understanding	BERT
Decoder-only	Causal (left)	Generation	GPT, Claude
Encoder-Decoder	Both	Seq-to-Seq	T5, BART

Note: Decoder-only architecture has won the LLM race. GPT, Claude, LLaMA, Mistral, Gemini are all decoder-only. The simplicity and scaling properties of autoregressive models make them dominant.

Putting It All Together - A Transformer Forward Pass

Following Data Through the Architecture

Step-by-Step Forward Pass (Decoder-Only):

Input: "The Taj Mahal is located in"

Tokenization: ["The", " Taj", " Mah", "al", " is", " located", " in"] - 7 tokens
Embedding: Each token mapped to a 4096-dimensional vector (for a model with d_model=4096)
Positional Encoding: Position information added to each embedding vector
Layer 1-N (repeated N times):
- Multi-Head Self-Attention: Each token attends to all previous tokens (causal mask)
- Residual Connection + Layer Norm
- Feed-Forward Network: Two linear layers with GELU activation
- Residual Connection + Layer Norm
Output Projection: Final hidden state mapped to vocabulary logits (50,000+ scores)
Softmax: Logits converted to probabilities. The highest-probability token might be "Agra" (for example, 92%)
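
The last two steps can be sketched as follows. The logits are invented for illustration and cover only four candidate tokens; a real model produces one logit for every entry in its vocabulary.

```python
import math

def next_token_distribution(logits):
    """Softmax over a {token: logit} mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the position after "... is located in".
logits = {" Agra": 6.0, " India": 3.0, " Delhi": 2.0, " the": 1.0}
probs = next_token_distribution(logits)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Picking the single most likely token is greedy decoding; sampling from the distribution instead gives more varied output.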

Model Scale Example - LLaMA 2 70B:

Component	Value
Hidden dimension (d_model)	8,192
Number of layers	80
Attention heads	64
Head dimension	128
FFN dimension	28,672
Vocabulary size	32,000
Context length	4,096
Total parameters	70 billion
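
As a sanity check, the table values can be turned into an approximate parameter count. The sketch makes several assumptions that are not in the table: 8 key/value heads (grouped query attention), a three-matrix gated FFN, no bias terms, one norm weight vector per sub-layer, and separate input and output embedding matrices.

```python
def decoder_param_count(d_model=8192, n_layers=80, n_heads=64, head_dim=128,
                        d_ff=28672, vocab=32000, n_kv_heads=8):
    """Approximate parameter count for a decoder-only Transformer."""
    attn = (
        d_model * n_heads * head_dim             # query projection
        + 2 * d_model * n_kv_heads * head_dim    # key and value projections (assumed GQA)
        + n_heads * head_dim * d_model           # output projection
    )
    ffn = 3 * d_model * d_ff                     # gate, up, down matrices (assumed gated FFN)
    norms = 2 * d_model                          # two norm weight vectors per layer
    per_layer = attn + ffn + norms
    embeddings = 2 * vocab * d_model             # input embedding + output projection
    return n_layers * per_layer + embeddings + d_model  # + final norm

total = decoder_param_count()
print(f"{total:,} parameters (~{total / 1e9:.1f}B)")
```

Under these assumptions the count lands close to 69 billion, consistent with the rounded 70 billion in the table, and most of it sits in the FFN matrices.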

Modern Optimizations:

GQA (Grouped Query Attention): Shares each key/value head across a group of query heads. Reduces KV cache memory and speeds up inference. Used by LLaMA 2 70B and Mistral 7B
Flash Attention: Memory-efficient attention computation. Processes attention in tiles to minimize GPU memory reads/writes
KV Cache: Stores previously computed key/value vectors so they are not recomputed for each new token. Essential for fast inference
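MoE (Mixture of Experts): Only activates a subset of parameters per token. Mixtral 8x7B, for example, has about 47B total parameters but uses roughly 13B per token. GPT-4 is widely reported to use MoE, but OpenAI has not published its architecture or parameter counts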
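
For the Grouped Query Attention entry in the list above, the head sharing can be sketched as a simple index mapping. The 64 query heads match the LLaMA 2 70B table; the 8 key/value heads and the function name are assumptions made for this example.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the shared key/value head used by a given query head."""
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

num_query_heads, num_kv_heads = 64, 8
mapping = [kv_head_for_query_head(h, num_query_heads, num_kv_heads)
           for h in range(num_query_heads)]
print(mapping[:10])
```

With this grouping, 8 query heads read the same keys and values, so the cache stores 8 key/value heads per layer instead of 64.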

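Note: The KV cache is a major reason LLM inference uses so much memory beyond the model weights. For a 70B-class model with 128K context, the KV cache alone can be 40+ GB in 16-bit precision even with grouped query attention, and several times larger without it. This is a main bottleneck for long context generation.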
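
The 40 GB figure in the note can be checked with a short calculation. The sketch assumes the layer count and head dimension from the LLaMA 2 70B table, 16-bit values (2 bytes each), a batch of one sequence, and either 8 key/value heads (grouped query attention) or 64 (one per query head).

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor 2 covers one key vector and one value vector per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size

GIB = 1024 ** 3
seq_len = 128 * 1024
gqa = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=seq_len)
mha = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=seq_len)
print(f"8 KV heads:  {gqa / GIB:.0f} GiB")
print(f"64 KV heads: {mha / GIB:.0f} GiB")
```

The cache grows linearly with sequence length and batch size, which is why reducing the number of key/value heads matters for long contexts.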

Interview Questions

Q: Explain self-attention in simple terms.

Self-attention lets each token in a sequence compute a weighted combination of all tokens in the sequence, including itself. Each token creates a Query (what am I looking for?), Key (what do I contain?), and Value (what info do I carry?). The dot products of Q and K, scaled and passed through softmax, give attention weights, which are used to create a weighted sum of Values. This allows the model to capture relationships between any two positions regardless of distance.

Q: Why did Transformers replace RNNs for language tasks?

(1) Parallelization: Transformers process all tokens simultaneously; RNNs process sequentially. (2) Long-range dependencies: Attention directly connects any two tokens; RNNs suffer from vanishing gradients. (3) Training speed: Parallel processing makes Transformers much faster to train. These advantages become more significant at scale.

Q: What is the difference between encoder-only, decoder-only, and encoder-decoder Transformers?

Encoder-only (BERT): Bidirectional attention, best for understanding tasks (classification, NER). Decoder-only (GPT, Claude): Causal attention, best for generation. Dominant architecture for LLMs. Encoder-Decoder (T5): Both, designed for sequence-to-sequence tasks like translation. Decoder-only has won due to better scaling.

Q: Why is positional encoding necessary in Transformers?

Self-attention treats input as a set with no inherent order. Without positional encoding, the model could not distinguish "Dog bites man" from "Man bites dog", since both contain the same tokens. Positional encoding injects position information, classically by adding position-dependent vectors to token embeddings, so the model knows word order. Modern approaches use RoPE (Rotary Position Embeddings), which rotates query and key vectors by position-dependent angles.
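
A small sketch of the RoPE idea mentioned in this answer: each pair of dimensions is rotated by an angle proportional to the token position, and the query-key dot product then depends only on the distance between the two positions. Pairing consecutive dimensions is one convention among several used in practice, and the vectors below are arbitrary.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angles proportional to pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 0.5, -1.0]
k = [0.3, -0.7, 1.2, 0.4]
near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
far = dot(rope(q, 105), rope(k, 102))   # positions 105 and 102, same offset
print(near, far)
```

The two scores agree up to floating-point error, which shows that only the relative offset between positions affects the attention score.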

Q: What is KV cache and why is it important for LLM inference?

During autoregressive generation, previously computed key and value vectors do not change when new tokens are added. The KV cache stores these vectors so they are not recomputed at each step, turning O(n^2) per-token computation into O(n). Without KV cache, generating each token would require reprocessing the entire sequence. The trade-off is high memory usage.
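
The idea can be sketched for a single attention head as follows. The class and function names are made up for this example, and the token vectors double as queries, keys, and values to keep it short.

```python
import math

def attend_one(q, keys, values):
    """Attention output for a single query over the given keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]

class KVCache:
    """Append-only store of key and value vectors for one head in one layer."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Only the new token's key and value are added; earlier ones are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend_one(q, self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(x, x, x) for x in tokens]
print(cached_outputs[-1])
```

Each step does work proportional to the number of cached tokens, and the result matches recomputing attention over the full prefix.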

