How LLMs Work
Demystifying the Technology Behind ChatGPT, Claude, and Gemini
Understand how Large Language Models are built, trained, and fine-tuned. From next-token prediction to RLHF, learn the complete LLM pipeline.
What is a Large Language Model (LLM)?
The Core Idea - Predicting the Next Word at Scale
Simple Definition:
An LLM is a neural network (specifically, a Transformer) trained on massive amounts of text data to predict the next word (token) in a sequence. That is it. The entire magic of ChatGPT, Claude, and Gemini boils down to incredibly sophisticated next-word prediction.
But this simple objective, when applied at enormous scale (trillions of words, billions of parameters), produces emergent abilities - reasoning, coding, translation, summarization - that the model was never explicitly taught.
Analogy - The Auto-Complete Genius:
Think of your phone's keyboard auto-complete. It predicts the next word based on what you have typed. Now imagine scaling that prediction engine up by many orders of magnitude and training it on a large share of the publicly available text on the internet: books, websites, code, and conversations. It can now complete not just words but entire essays, code, and reasoning chains.
That is essentially what an LLM is - the world's most powerful auto-complete, trained on the internet.
What Makes It "Large"?
| Model | Parameters | Training Data |
|---|---|---|
| GPT-2 (2019) | 1.5 billion | 40 GB text |
| GPT-3 (2020) | 175 billion | 570 GB text |
| LLaMA 2 (2023) | 70 billion | 2 trillion tokens |
| GPT-4 (2023) | Not disclosed (unofficial estimates: ~1.8 trillion, MoE) | Not disclosed (unofficial estimates: ~13 trillion tokens) |
| LLaMA 3 (2024) | 405 billion | 15 trillion tokens |
Scale is the key: These models need thousands of GPUs training for months, costing millions of dollars. That is why only big labs can train frontier models from scratch.
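To get a feel for these parameter counts, here is a back-of-the-envelope sketch of the memory needed just to store the weights. It assumes 2 bytes per parameter (16-bit floats), and the function name `weight_memory_gb` is illustrative. Training needs several times more memory than this for gradients, optimizer state, and activations.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts taken from the table above.
models = {
    "GPT-2": 1.5e9,
    "GPT-3": 175e9,
    "LLaMA 2": 70e9,
    "LLaMA 3": 405e9,
}

for name, params in models.items():
    print(f"{name}: {weight_memory_gb(params):,.0f} GB of weights at 16-bit precision")
```

Even before any training overhead, the largest models in the table do not fit on a single GPU, which is one reason training is spread across many machines.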
Note: LLMs do one thing - predict the next token. But at sufficient scale, this simple objective produces remarkable capabilities including reasoning, coding, and creative writing.
The Three Phases of LLM Training
Pre-training, Fine-tuning, and Alignment
Phase 1: Pre-training (Most Expensive)
The model reads trillions of tokens from the internet - books, Wikipedia, code, websites, forums. For each chunk of text, it predicts the next token, compares with the actual token, and updates its weights.
- Objective: Next-token prediction (autoregressive language modeling)
- Data: Massive web crawls (Common Crawl, C4, The Pile, etc.)
- Cost: Millions of dollars, thousands of GPUs, weeks to months
- Result: A "base model" that can complete text but is not great at following instructions
Analogy: This is like a child reading millions of books. They learn language, facts, and patterns, but they do not yet know how to have a helpful conversation.
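The "predict, compare, update" loop in pre-training is driven by a cross-entropy loss on the next token. Below is a minimal sketch of that loss for a single position, using a made-up four-word vocabulary and made-up scores (logits). A real model computes this over every position in a batch and uses the gradient of the loss to update its weights.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_id):
    """Cross-entropy for one position: -log p(actual next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_id])

# Toy vocabulary and invented scores for the context "The cat sat on the".
vocab = ["mat", "dog", "moon", "sofa"]
logits = [2.0, 0.1, -1.0, 1.2]
target = vocab.index("mat")  # the token that actually came next

loss = next_token_loss(logits, target)
print(f"loss = {loss:.3f}")
```

The loss is low when the model puts high probability on the token that actually followed, and high when it does not.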
Phase 2: Supervised Fine-Tuning (SFT)
Human annotators create thousands of high-quality instruction-response pairs. The model learns to follow instructions by training on these examples.
- Objective: Learn to follow instructions and be helpful
- Data: 10K-100K curated prompt-response pairs written by experts
- Cost: Much cheaper than pre-training, days on fewer GPUs
- Result: An "instruct model" that follows instructions but may sometimes be harmful or unhelpful
Analogy: Like taking a well-read person and teaching them customer service skills - how to understand questions and give helpful answers.
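One way to picture an SFT training example is as a prompt and a response joined into a single sequence, with the loss applied only to the response tokens. The sketch below assumes a hypothetical `### Instruction` / `### Response` template and uses whitespace splitting in place of a real tokenizer. Masking the prompt is a common choice, not a universal rule.

```python
def build_sft_example(prompt: str, response: str):
    """Join a prompt and response and mark which tokens are trained on."""
    # Whitespace splitting stands in for a real subword tokenizer.
    prompt_tokens = f"### Instruction: {prompt} ### Response:".split()
    response_tokens = response.split() + ["<eos>"]
    tokens = prompt_tokens + response_tokens
    # 0 = ignored by the loss (prompt), 1 = model learns to predict (response).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example("Translate hello to French", "Bonjour")
for tok, m in zip(tokens, loss_mask):
    print(f"{m}  {tok}")
```

The objective is still next-token prediction. Only the data and the positions that count toward the loss have changed.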
Phase 3: RLHF / DPO Alignment
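Human raters compare multiple model responses and rank them. In RLHF, a reward model learns these preferences, and the LLM is then optimized to maximize the reward model's score. DPO instead trains the LLM directly on the preference pairs, with no separate reward model.
- RLHF: Reinforcement Learning from Human Feedback - typically uses the PPO algorithm
- DPO: Direct Preference Optimization - newer, simpler alternative that skips the separate reward model and the reinforcement learning loop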
- Goal: Make the model helpful, harmless, and honest
- Result: The "chat model" you actually interact with (ChatGPT, Claude)
Analogy: Like a manager reviewing the customer service agent's performance, telling them "this response was great, this one was bad" until quality consistently improves.
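The "this response was better than that one" signal is usually turned into a pairwise loss. Here is a minimal sketch of the standard form used to train a reward model, where `reward_chosen` and `reward_rejected` are assumed to be the scalar scores the reward model gives to the preferred and the rejected response. The function name is illustrative.

```python
import math

def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is small when the preferred response already scores higher.
print(pairwise_preference_loss(2.0, 0.0))
# The loss is large when the ranking is the wrong way round.
print(pairwise_preference_loss(0.0, 2.0))
```

DPO uses a loss of the same pairwise shape, but the "rewards" are replaced by scaled log-probability ratios between the model being trained and a frozen reference model.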
Note: Pre-training gives the model knowledge, SFT teaches it to follow instructions, and RLHF aligns it with human preferences. All three phases are essential for a great LLM.
How Next-Token Prediction Actually Works
The Mechanics of Text Generation
Step-by-Step Generation:
When you type a prompt, here is what happens inside the LLM:
- 1. Tokenize: Your text is split into tokens (subwords). For example, "Artificial Intelligence" might become ["Art", "ificial", " Intell", "igence"]; the exact split depends on the tokenizer
- 2. Embed: Each token is converted to a high-dimensional vector (numbers)
- 3. Process: Vectors pass through dozens of Transformer layers (attention + feedforward)
- 4. Predict: The final layer outputs a probability distribution over ALL possible next tokens (50,000+ options)
- 5. Sample: One token is selected based on the probabilities (controlled by temperature)
- 6. Repeat: The selected token is appended to the input, and steps 2-5 repeat for the next token
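The predict, sample, append, repeat loop can be sketched with a toy stand-in for the model. `TOY_MODEL` below is a hypothetical lookup table that only looks at the last token; a real LLM runs the whole sequence through its Transformer layers and outputs probabilities over tens of thousands of tokens.

```python
import random

# Hypothetical bigram "model": maps the last token to next-token probabilities.
TOY_MODEL = {
    "<bos>": {"the": 1.0},
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = TOY_MODEL[tokens[-1]]  # stand-in for the forward pass
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        next_token = rng.choices(candidates, weights=weights, k=1)[0]  # sample
        if next_token == "<eos>":  # stop condition
            break
        tokens.append(next_token)  # append, then repeat with the longer input
    return tokens

print(generate(["<bos>"]))
```

Each pass through the loop produces exactly one token, which is why generation time grows with response length.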
Temperature - Controlling Randomness:
After the model outputs probabilities, temperature controls how tokens are selected:
- Temperature 0: Always picks the highest-probability token (greedy decoding). Effectively deterministic and consistent, but can be repetitive, and it does not by itself make the output more factual
- Temperature 0.7: Good balance of quality and creativity. Most common setting
- Temperature 1.0: Full sampling from the distribution. More creative but may produce errors
- Temperature >1.0: Amplifies randomness. Can produce nonsensical output
Analogy: Temperature 0 is like a strict teacher who always gives the "textbook" answer. Temperature 1 is like a creative writer who takes risks. Temperature 2 is like someone having a fever dream.
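Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. The sketch below uses made-up logits for three candidate tokens. Temperature 0 is handled separately in practice as "pick the argmax", since dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy (argmax) decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # invented scores for three candidate tokens
for t in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures push almost all of the probability onto the top token, while high temperatures flatten the distribution so unlikely tokens are picked more often.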
Top-p and Top-k Sampling:
- Top-k: Only consider the top k most likely tokens. k=50 means choose from the 50 most probable tokens
- Top-p (nucleus): Only consider the smallest set of most-likely tokens whose cumulative probability reaches p. p=0.9 means keep adding tokens, from most to least likely, until together they cover 90% of the probability
- Why both? They prevent the model from selecting extremely unlikely tokens while still allowing creativity
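Both filters can be sketched as small functions over a probability table. The example distribution below is invented, and the function names are illustrative. After filtering, the surviving probabilities are renormalized so they sum to 1 before sampling.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

probs = {"mat": 0.5, "sofa": 0.3, "floor": 0.15, "moon": 0.05}
print(top_k_filter(probs, 2))    # keeps "mat" and "sofa"
print(top_p_filter(probs, 0.9))  # keeps "mat", "sofa", and "floor"
```

Top-k keeps a fixed number of candidates, while top-p adapts: it keeps few tokens when the model is confident and more when the distribution is flat.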
Note: LLMs generate text one token at a time. Each token takes a full forward pass through the entire network. This is why longer responses take longer to generate.
Why LLMs Are So Powerful (And Their Limitations)
Emergent Abilities and Fundamental Constraints
Emergent Abilities - The Magic of Scale:
As models get larger, they can show abilities that smaller models lack, sometimes appearing abruptly on benchmarks as scale increases. These are called emergent abilities:
- Few-shot learning: Given a few examples in the prompt, the model generalizes to new inputs
- Chain-of-thought reasoning: Can solve complex problems by thinking step-by-step
- Code generation: Writes functional code in dozens of languages
- Translation: Translates between languages it was never explicitly taught to translate
- Tool use: Learns to call APIs and use external tools from instructions alone
These abilities were not programmed. They emerged from training on enough data at enough scale.
Fundamental Limitations:
- Hallucination: Models confidently generate false information. They optimize for plausibility, not truth
- Knowledge Cutoff: Training data has a date. The model does not know about events after that date
- No Real Understanding: It is pattern matching, not comprehension. It can write about swimming without knowing what water feels like
- Context Window Limit: Cannot process infinitely long text. Information beyond the window is lost
- Poor Math: Arithmetic is unreliable for large numbers. Tokenization splits numbers into irregular chunks of digits, which makes digit-by-digit arithmetic harder to learn
- Brittleness: Slight prompt changes can dramatically alter output quality
- Bias: Training data contains biases that the model reproduces
Why Hallucinations Happen - A Technical Perspective:
LLMs are trained to produce plausible-sounding text, not factually correct text. During training, the model learns that "The capital of France is Paris" and "The CEO of Tesla is Elon Musk" are patterns. But it also learns patterns like "[Famous person] wrote [Book title]" - and may combine these patterns incorrectly.
The model has no internal fact-checker. It generates the most likely next token based on the input context and its training, even if that leads to factual errors.
Note: Never trust LLM output without verification for factual claims. Use RAG (Retrieval Augmented Generation) or tool calling to ground responses in real data.
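The idea behind RAG can be sketched in a few lines: find relevant text first, then put it in the prompt so the model answers from it. The sketch below ranks documents by simple word overlap purely for illustration; real systems typically use embedding-based search. The function names, the prompt wording, and the two-document corpus are all assumptions.

```python
def retrieve(question, documents, top_n=1):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_n]

def build_grounded_prompt(question, documents):
    """Place the retrieved text ahead of the question."""
    context = "\n".join(retrieve(question, documents))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

docs = [
    "Paris is the capital of France.",
    "The Transformer architecture was introduced in 2017.",
]
prompt = build_grounded_prompt("What is the capital of France?", docs)
print(prompt)
```

The resulting prompt would then be sent to the LLM, which now has the relevant fact in its context instead of relying on memorized patterns.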
The LLM Training Pipeline - Real-World Scale
What It Takes to Train a Frontier Model
Data Collection & Cleaning:
- Sources: Common Crawl (web pages), Wikipedia, Books, GitHub code, ArXiv papers, Reddit, StackOverflow
- Cleaning: Remove duplicates, low-quality text, toxic content, personal information
- Filtering: Quality classifiers to keep only high-quality text
- Scale: Typically 1-15 trillion tokens for a frontier model
Analogy: Like Zomato collecting reviews - you want millions of reviews, but you need to filter out spam, duplicate content, and inappropriate language.
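A tiny version of the cleaning step might look like the sketch below: drop very short documents as a crude quality filter, then drop exact duplicates after normalizing case and whitespace. The `min_words` threshold and the function names are assumptions. Production pipelines add near-duplicate detection, learned quality classifiers, and toxicity and personal-information filters.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def clean_corpus(documents, min_words=5):
    seen, kept = set(), []
    for doc in documents:
        norm = normalize(doc)
        if len(norm.split()) < min_words:  # crude quality filter
            continue
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate after normalization
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

raw = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",
    "click here",
    "Transformers use attention to mix information across tokens.",
]
cleaned = clean_corpus(raw)
print(cleaned)
```

Hashing the normalized text keeps memory use low, since only a fixed-size digest is stored per document rather than the document itself.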
Training Infrastructure:
- Hardware: Thousands of NVIDIA A100 or H100 GPUs (an H100 costs roughly $25,000-$40,000)
- Networking: NVLink between GPUs within a server, and InfiniBand (up to 400 Gbps per link) between servers
- Duration: Weeks to months of continuous training
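- Cost: GPT-4 reportedly cost more than $100 million to train. Meta reported roughly 30 million H100 GPU-hours for Llama 3.1 405B, which implies tens of millions of dollars in compute alone
- Power: Training GPT-4 used an estimated 50 GWh of electricity (an unofficial estimate)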
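A common rule of thumb for dense Transformers is that training takes about 6 floating-point operations per parameter per training token. The sketch below applies it to the LLaMA 3 row from the table earlier (405 billion parameters, 15 trillion tokens). The sustained throughput of 400 TFLOP/s per GPU is an assumption chosen for illustration; real utilization varies by hardware and setup.

```python
def training_flops(num_params, num_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens

def gpu_hours(total_flops, sustained_flops_per_gpu):
    """GPU-hours needed at a given sustained throughput per GPU."""
    return total_flops / sustained_flops_per_gpu / 3600

flops = training_flops(405e9, 15e12)
# Assumed: 400 TFLOP/s sustained per GPU (illustrative, not a measured figure).
hours = gpu_hours(flops, sustained_flops_per_gpu=4e14)
print(f"{flops:.2e} FLOPs, about {hours / 1e6:.0f} million GPU-hours")
```

Multiplying the GPU-hours by an hourly price per GPU gives a rough compute bill, which is how estimates in the tens of millions of dollars arise.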
Post-Training Pipeline:
- SFT Data Creation: Hire hundreds of expert annotators to write instruction-response pairs
- Reward Model Training: Annotators compare model responses, building a preference dataset
- RLHF/DPO: Optimize the model against the reward model (RLHF) or directly on the preference data (DPO)
- Red-teaming: Dedicated team tries to break the model and find harmful outputs
- Safety Evaluation: Benchmark against safety criteria before release
Scaling Laws - The Key Insight:
Research by Kaplan et al. (2020) showed that model performance improves predictably with three factors:
- More parameters: Bigger models learn better representations
- More data: More diverse training data reduces errors
- More compute: Longer training with more GPUs
These "scaling laws" are why companies keep building bigger models. Test loss falls roughly as a power law in each factor, so the gains from scaling up can be forecast before a large model is trained.
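The predictable relationship takes the shape of a power law. Here is a minimal sketch of the parameter-count form, L(N) = (N_c / N)^alpha. The constants are approximate values of the kind reported by Kaplan et al. for non-embedding parameters and should be treated as illustrative, not as a fit for any particular model.

```python
def loss_from_params(num_params, n_c=8.8e13, alpha_n=0.076):
    """Power-law form L(N) = (N_c / N) ** alpha_N for loss versus parameter count."""
    return (n_c / num_params) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}: predicted loss = {loss_from_params(n):.3f}")

# Each doubling of N multiplies the predicted loss by the same factor, 2 ** -alpha_N.
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"loss ratio per doubling = {ratio:.4f}")
```

Because the exponent is small, each doubling of model size buys only a few percent of loss reduction, which is why large gains need order-of-magnitude increases in scale.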
Note: Training a frontier LLM requires enormous resources - but using one through APIs is cheap. This is why the API model (OpenAI, Anthropic) has become the dominant business model.
Interview Questions
Q: How does an LLM generate text?
LLMs generate text one token at a time using autoregressive generation. The input is tokenized and passed through the Transformer network. The final layer outputs probabilities for every possible next token. One token is sampled (controlled by temperature), appended to the input, and the process repeats until a stop condition.
Q: What are the three phases of LLM training?
(1) Pre-training: Self-supervised next-token prediction on trillions of tokens. Gives the model world knowledge. (2) Supervised Fine-Tuning (SFT): Training on curated instruction-response pairs. Teaches instruction following. (3) RLHF/DPO: Optimization using human preference data. Aligns the model to be helpful, harmless, and honest.
Q: Why do LLMs hallucinate?
LLMs are trained to produce plausible text, not factually correct text. They learn statistical patterns and may combine them incorrectly. There is no internal fact-checker. The model generates the most likely next token based on patterns, which can produce confident but false statements. Mitigation: RAG, tool use, chain-of-thought verification.
Q: What is the difference between a base model and a chat model?
A base model (pre-trained only) is a text completion engine - it can continue text but has not been trained to follow instructions or hold a conversation. A chat model has been additionally fine-tuned with SFT and aligned with RLHF or DPO to follow instructions, be helpful, and avoid harmful outputs. Llama 2 (base) vs Llama 2-Chat is an example of this distinction.
Q: What does temperature control in LLM generation?
Temperature controls randomness during token sampling. At temperature 0, the model always picks the most probable token (deterministic). At temperature 1, tokens are sampled according to their probabilities (creative). Higher temperatures amplify randomness, lower temperatures reduce it. Use low temperature for factual tasks, higher for creative tasks.