How LLMs Work

What is a Large Language Model (LLM)?

The Core Idea - Predicting the Next Word at Scale

Simple Definition:

An LLM is a neural network (specifically, a Transformer) trained on massive amounts of text data to predict the next word (token) in a sequence. That is it. The entire magic of ChatGPT, Claude, and Gemini boils down to incredibly sophisticated next-word prediction.

But this simple objective, when applied at enormous scale (trillions of words, billions of parameters), produces emergent abilities - reasoning, coding, translation, summarization - that the model was never explicitly taught.

Analogy - The Auto-Complete Genius:

Think of your phone's keyboard auto-complete. It predicts the next word based on what you have typed. Now imagine making that prediction engine vastly more powerful and training it on a large share of the books, websites, and conversations available online. It can now complete not just words but entire essays, code, and reasoning chains.

That is essentially what an LLM is - the world's most powerful auto-complete, trained on the internet.
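To make the auto-complete analogy concrete, here is a minimal sketch of a word-level predictor that simply counts which word follows which. It is a toy, not how an LLM is built: a real model uses a neural network over subword tokens, and the corpus and the function names (train_bigram, predict_next) are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Tiny hypothetical training corpus
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints the word that most often follows "the"
```

The idea is the same as in an LLM (predict what comes next from what came before), but an LLM conditions on the whole preceding context rather than just one word.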

What Makes It "Large"?

Model	Parameters	Training Data
GPT-2 (2019)	1.5 billion	40 GB text
GPT-3 (2020)	175 billion	570 GB text
LLaMA 2 (2023)	70 billion	2 trillion tokens
GPT-4 (2023)	~1.8 trillion (MoE, unofficial estimate)	~13 trillion tokens (unofficial estimate)
LLaMA 3 (2024)	405 billion	15 trillion tokens

Scale is the key: These models need thousands of GPUs training for months, costing millions of dollars. That is why only big labs can train frontier models from scratch.
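A quick back-of-the-envelope calculation shows why these parameter counts demand so much hardware. The sketch below assumes 2 bytes per parameter (16-bit weights) and counts only the weights themselves; training needs several times more memory for gradients and optimizer state. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Parameter counts taken from the table above
for name, params in [("GPT-2", 1.5e9), ("GPT-3", 175e9), ("LLaMA 3 (405B)", 405e9)]:
    print(f"{name}: {weight_memory_gb(params):,.0f} GB at 16-bit precision")
```

Under this assumption a 175-billion-parameter model needs about 350 GB for its weights alone, which is more than a single 80 GB GPU can hold, so the model has to be split across many devices.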

Note: LLMs do one thing - predict the next token. But at sufficient scale, this simple objective produces remarkable capabilities including reasoning, coding, and creative writing.

The Three Phases of LLM Training

Pre-training, Fine-tuning, and Alignment

Phase 1: Pre-training (Most Expensive)

The model reads trillions of tokens from the internet - books, Wikipedia, code, websites, forums. For each chunk of text, it predicts the next token, compares with the actual token, and updates its weights.

Objective: Next-token prediction (autoregressive language modeling)
Data: Massive web crawls (Common Crawl, C4, The Pile, etc.)
Cost: Millions of dollars, thousands of GPUs, weeks to months
Result: A "base model" that can complete text but is not great at following instructions

Analogy: This is like a child reading millions of books. They learn language, facts, and patterns, but they do not yet know how to have a helpful conversation.
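The "predict, compare, update" loop is driven by a loss value. A minimal sketch of that comparison, assuming the standard cross-entropy loss for next-token prediction; the probabilities below are invented for illustration and the weight update itself (backpropagation) is not shown.

```python
import math

def next_token_loss(predicted_probs, actual_token):
    """Cross-entropy loss for one position: -log p(actual next token)."""
    return -math.log(predicted_probs[actual_token])

# Hypothetical model output after the context "The capital of France is"
predicted = {" Paris": 0.80, " Lyon": 0.15, " pizza": 0.05}

good = next_token_loss(predicted, " Paris")  # the true token had high probability
bad = next_token_loss(predicted, " pizza")   # the true token had low probability
print(f"loss if the text continues with ' Paris': {good:.3f}")
print(f"loss if the text continues with ' pizza': {bad:.3f}")
```

The loss is small when the model assigned high probability to the token that actually came next and large when it did not. Training adjusts the weights to lower this loss, averaged over every position in the training data.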

Phase 2: Supervised Fine-Tuning (SFT)

Human annotators create thousands of high-quality instruction-response pairs. The model learns to follow instructions by training on these examples.

Objective: Learn to follow instructions and be helpful
Data: 10K-100K curated prompt-response pairs written by experts
Cost: Much cheaper than pre-training, days on fewer GPUs
Result: An "instruct model" that follows instructions but may sometimes be harmful or unhelpful

Analogy: Like taking a well-read person and teaching them customer service skills - how to understand questions and give helpful answers.
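One common way to turn an instruction-response pair into a training example is to concatenate the two and compute the loss only on the response tokens, so the model learns to answer rather than to reproduce prompts. The sketch below assumes that convention (it is widespread but not universal); the token IDs are made up, and -100 is used because it is the default ignore_index of PyTorch's cross-entropy loss.

```python
IGNORE = -100  # label value meaning "do not compute loss at this position"

def build_sft_example(prompt_ids, response_ids):
    """Concatenate prompt and response; only response tokens get real labels."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE] * len(prompt_ids) + response_ids
    return input_ids, labels

# Hypothetical token IDs standing in for a tokenized instruction and its answer
prompt = [101, 7592, 2129]
response = [2023, 2003, 102]

input_ids, labels = build_sft_example(prompt, response)
print(input_ids)
print(labels)
# In real training the labels are also shifted by one position so that each
# token predicts the next one; training libraries often do this internally.
```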

Phase 3: RLHF / DPO Alignment

Human raters compare multiple model responses and rank them. In RLHF, a reward model learns these preferences, and the LLM is then optimized to maximize the reward model's score. DPO skips the separate reward model and trains the LLM directly on the preference pairs.

RLHF: Reinforcement Learning from Human Feedback - typically uses the PPO algorithm
DPO: Direct Preference Optimization - newer, simpler alternative to RLHF that needs no separate reward model
Goal: Make the model helpful, harmless, and honest
Result: The "chat model" you actually interact with (ChatGPT, Claude)

Analogy: Like a manager reviewing the customer service agent's performance, telling them "this response was great, this one was bad" until quality consistently improves.
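Both approaches boil down to a loss over (preferred, rejected) response pairs. The sketch below shows the pairwise loss commonly used to train an RLHF reward model and the DPO loss for a single pair. It works on plain numbers rather than a real model: the scores and log-probabilities would come from neural networks in practice, and beta=0.1 is just an illustrative setting.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise loss: push the preferred response's score above the rejected one's."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair, from log-probabilities under the policy and a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)

# The reward model is penalized less when it ranks the pair the way the human did
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))

# DPO loss falls as the policy raises the preferred response relative to the reference
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In both cases the loss shrinks as the model agrees more strongly with the human ranking. The difference is what is being trained: a separate scoring model in RLHF, the LLM itself in DPO.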

Note: Pre-training gives the model knowledge, SFT teaches it to follow instructions, and RLHF aligns it with human preferences. All three phases are essential for a great LLM.

How Next-Token Prediction Actually Works

The Mechanics of Text Generation

Step-by-Step Generation:

When you type a prompt, here is what happens inside the LLM:

1. Tokenize: Your text is split into tokens (subwords). For example, "Artificial Intelligence" might become ["Art", "ificial", " Intell", "igence"]; the exact split depends on the tokenizer
2. Embed: Each token is converted to a high-dimensional vector (numbers)
3. Process: Vectors pass through dozens of Transformer layers (attention + feedforward)
4. Predict: The final layer outputs a probability distribution over ALL possible next tokens (50,000+ options)
5. Sample: One token is selected based on the probabilities (controlled by temperature)
6. Repeat: The selected token is appended to the input, and steps 2-5 repeat for the next token
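The steps above can be sketched as a short loop. This is a toy under stated assumptions: toy_model is a hand-written stand-in for the Transformer (it just prefers the next entry in a six-word vocabulary), the prompt is passed in already tokenized, and all names are hypothetical.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def toy_model(token_ids):
    """Stand-in for the Transformer: returns one probability per vocabulary entry."""
    last = token_ids[-1]
    probs = [0.02] * len(VOCAB)
    probs[(last + 1) % len(VOCAB)] = 0.90  # strongly prefer the "next" vocabulary entry
    return probs

def generate(prompt_ids, max_new_tokens=10, seed=0):
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = toy_model(ids)                                      # steps 2-4: embed, process, predict
        next_id = rng.choices(range(len(VOCAB)), weights=probs)[0]  # step 5: sample one token
        ids.append(next_id)                                         # step 6: append and repeat
        if VOCAB[next_id] == "<eos>":                               # stop condition
            break
    return [VOCAB[i] for i in ids]

print(generate([0]))  # start from the token "the"
```

The loop calls the model once per generated token and stops at an end-of-sequence token or a length limit, which is the same control flow a real LLM uses.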

Temperature - Controlling Randomness:

After the model outputs probabilities, temperature controls how tokens are selected:

Temperature 0: Always picks the highest-probability token (greedy decoding). Consistent and repeatable, but can be repetitive
Temperature 0.7: Good balance of quality and creativity. A common default
Temperature 1.0: Samples from the model's unmodified distribution. More creative but may produce errors
Temperature >1.0: Flattens the distribution, amplifying randomness. Can produce nonsensical output

Analogy: Temperature 0 is like a strict teacher who always gives the "textbook" answer. Temperature 1 is like a creative writer who takes risks. Temperature 2 is like someone having a fever dream.
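Mechanically, temperature divides the model's raw scores (logits) before they are turned into probabilities. A minimal sketch, with invented logits for three candidate tokens; temperature 0 is not passed to this function because it would divide by zero, and is instead handled as "pick the largest logit".

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all probability on the top token, while high temperatures spread it more evenly across the candidates.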

Top-p and Top-k Sampling:

Top-k: Only consider the top k most likely tokens. k=50 means choose from the 50 most probable tokens
Top-p (nucleus): Only consider the smallest set of most likely tokens whose cumulative probability reaches p. p=0.9 means consider just enough tokens to cover 90% of the probability
Why both? They prevent the model from selecting extremely unlikely tokens while still allowing creativity
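Both filters can be sketched in a few lines. The token probabilities below are invented for illustration; after filtering, the surviving probabilities are renormalized so they sum to 1 before sampling.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p, then renormalize."""
    keep, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        keep.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}

# Hypothetical next-token distribution
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
print(top_k_filter(probs, 2))    # keeps "cat" and "dog"
print(top_p_filter(probs, 0.9))  # keeps "cat", "dog", and "fish"
```

Top-k always keeps a fixed number of candidates, while top-p keeps fewer when the model is confident and more when it is unsure.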

Note: LLMs generate text one token at a time. Each token takes a full forward pass through the entire network. This is why longer responses take longer to generate.

Why LLMs Are So Powerful (And Their Limitations)

Emergent Abilities and Fundamental Constraints

Emergent Abilities - The Magic of Scale:

As models get larger, they can gain abilities that smaller models lack, sometimes quite abruptly. These are called emergent abilities:

Few-shot learning: Given a few examples in the prompt, the model generalizes to new inputs
Chain-of-thought reasoning: Can solve complex problems by thinking step-by-step
Code generation: Writes functional code in dozens of languages
Translation: Translates between languages it was never explicitly taught to translate
Tool use: Learns to call APIs and use external tools from instructions alone

These abilities were not programmed. They emerged from training on enough data at enough scale.

Fundamental Limitations:

Hallucination: Models confidently generate false information. They optimize for plausibility, not truth
Knowledge Cutoff: Training data has a date. The model does not know about events after that date
No Real Understanding: It is pattern matching, not comprehension. It can write about swimming without knowing what water feels like
Context Window Limit: Cannot process infinitely long text. Information beyond the window is lost
Poor Math: Arithmetic is unreliable for large numbers. Tokenization makes math hard for LLMs
Brittleness: Slight prompt changes can dramatically alter output quality
Bias: Training data contains biases that the model reproduces

Why Hallucinations Happen - A Technical Perspective:

LLMs are trained to produce plausible-sounding text, not factually correct text. During training, the model learns that "The capital of France is Paris" and "The CEO of Tesla is Elon Musk" are patterns. But it also learns patterns like "[Famous person] wrote [Book title]" - and may combine these patterns incorrectly.

The model has no internal fact-checker. It generates the most likely next token based on the input context and its training, even if that leads to factual errors.

Note: Never trust LLM output without verification for factual claims. Use RAG (Retrieval Augmented Generation) or tool calling to ground responses in real data.
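The core of RAG is simple: look up relevant text first, then put it in the prompt. The sketch below is a deliberately crude illustration. It ranks documents by shared words instead of using embeddings and a vector database, it stops at building the prompt rather than calling a model, and the documents and function names are hypothetical.

```python
def retrieve(question, documents, top_n=1):
    """Rank documents by how many question words they share (a crude stand-in for vector search)."""
    q_words = set(question.lower().split())
    ranked = sorted(documents, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:top_n]

def build_grounded_prompt(question, documents):
    """Put retrieved text in the prompt so the model can answer from it rather than from memory."""
    context = "\n".join(retrieve(question, documents))
    return (
        "Answer using only the context below. If the answer is not there, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical knowledge base
docs = [
    "The return window for online orders is 30 days from delivery.",
    "Support is available by email on weekdays.",
]
print(build_grounded_prompt("What is the return window for online orders?", docs))
```

Because the relevant fact is now in the context, the model can copy it instead of guessing, and the answer can be checked against the retrieved source.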

The LLM Training Pipeline - Real-World Scale

What It Takes to Train a Frontier Model

Data Collection & Cleaning:

Sources: Common Crawl (web pages), Wikipedia, Books, GitHub code, ArXiv papers, Reddit, StackOverflow
Cleaning: Remove duplicates, low-quality text, toxic content, personal information
Filtering: Quality classifiers to keep only high-quality text
Scale: Typically 1-15 trillion tokens for a frontier model

Analogy: Like Zomato collecting reviews - you want millions of reviews, but you need to filter out spam, duplicate content, and inappropriate language.
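A minimal sketch of two of the cleaning steps, exact-duplicate removal and a crude quality filter. Real pipelines go much further (near-duplicate detection, learned quality classifiers, toxicity and personal-data filters); the five-word threshold and the sample documents are assumptions for illustration.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())

def clean_corpus(documents, min_words=5):
    """Drop very short documents and exact duplicates (after normalization)."""
    seen, kept = set(), []
    for doc in documents:
        if len(doc.split()) < min_words:   # crude quality filter; threshold is an assumption
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:                 # same text as an earlier document
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

raw = [
    "Transformers use attention to relate tokens to each other.",
    "transformers  use attention to relate tokens to each other.",
    "click here",
    "Gradient descent updates weights to reduce the loss.",
]
print(clean_corpus(raw))  # keeps the first and last documents
```

Hashing the normalized text keeps memory use small, since only a fixed-size digest per document is stored rather than the document itself.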

Training Infrastructure:

Hardware: Thousands of NVIDIA A100 or H100 GPUs (each costs $25,000-$40,000)
Networking: InfiniBand connections between GPUs at 400 Gbps
Duration: Weeks to months of continuous training
Cost: GPT-4 is estimated at $100 million+; LLaMA 3 405B at roughly $30 million (both unofficial estimates)
Power: Training GPT-4 used an estimated 50 GWh of electricity
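To see where "thousands of GPUs for weeks to months" comes from, a widely used rule of thumb estimates training compute as about 6 floating-point operations per parameter per training token. The sketch below applies it to the 405-billion-parameter, 15-trillion-token row from the table earlier. The sustained per-GPU throughput and the cluster size are assumptions chosen for illustration, not reported figures.

```python
def training_flops(num_params, num_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * num_params * num_tokens

def gpu_days(total_flops, flops_per_gpu_per_sec):
    """GPU-days needed at a given sustained per-GPU throughput."""
    return total_flops / flops_per_gpu_per_sec / 86_400  # 86,400 seconds per day

flops = training_flops(405e9, 15e12)  # 405B parameters, 15T tokens
ASSUMED_THROUGHPUT = 4e14             # assumed sustained FLOP/s per GPU, not a measured value
ASSUMED_CLUSTER_SIZE = 16_000         # hypothetical number of GPUs

days = gpu_days(flops, ASSUMED_THROUGHPUT)
print(f"{flops:.2e} FLOPs, about {days:,.0f} GPU-days")
print(f"on {ASSUMED_CLUSTER_SIZE:,} GPUs: about {days / ASSUMED_CLUSTER_SIZE:.0f} days")
```

Under these assumptions the run takes roughly a million GPU-days, or about two months on the hypothetical cluster, which lines up with the durations quoted above.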

Post-Training Pipeline:

SFT Data Creation: Hire hundreds of expert annotators to write instruction-response pairs
Reward Model Training: Annotators compare model responses, building a preference dataset
RLHF/DPO: Optimize the model against the reward model (RLHF) or directly on the preference data (DPO)
Red-teaming: Dedicated team tries to break the model and find harmful outputs
Safety Evaluation: Benchmark against safety criteria before release

Scaling Laws - The Key Insight:

Research by Kaplan et al. (2020) showed that model performance improves predictably with three factors:

More parameters: Bigger models learn better representations
More data: More diverse training data reduces errors
More compute: Longer training with more GPUs

These "scaling laws" are why companies keep building bigger models. The performance gains are empirically predictable: loss falls as a smooth power law in each factor.
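A power law means each multiplicative increase in scale buys the same multiplicative reduction in loss. The sketch below uses the form L(N) = (N_c / N) ** alpha for model size N. The constants are of the order reported by Kaplan et al. but should be treated as illustrative assumptions here; in practice they are fitted to a lab's own training runs.

```python
def power_law_loss(n, n_c, alpha):
    """Loss predicted by a power law in model size: L(N) = (N_c / N) ** alpha."""
    return (n_c / n) ** alpha

# Illustrative constants, not authoritative values
N_C, ALPHA = 8.8e13, 0.076

for n in (1e9, 1e10, 1e11):
    print(f"{n:.0e} parameters -> predicted loss {power_law_loss(n, N_C, ALPHA):.3f}")

# Every 10x increase in parameters multiplies the loss by the same factor, 10 ** -ALPHA
ratio = power_law_loss(1e10, N_C, ALPHA) / power_law_loss(1e9, N_C, ALPHA)
print(f"loss ratio per 10x parameters: {ratio:.3f}")
```

The small exponent is the catch: each further improvement of the same size requires roughly ten times more parameters, which is why frontier training runs grow so quickly in cost.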

Note: Training a frontier LLM requires enormous resources - but using one through APIs is cheap. This is why the API model (OpenAI, Anthropic) has become the dominant business model.

Interview Questions

Q: How does an LLM generate text?

LLMs generate text one token at a time using autoregressive generation. The input is tokenized and passed through the Transformer network. The final layer outputs probabilities for every possible next token. One token is sampled (controlled by temperature), appended to the input, and the process repeats until a stop condition.

Q: What are the three phases of LLM training?

(1) Pre-training: Self-supervised next-token prediction on trillions of tokens. Gives the model world knowledge. (2) Supervised Fine-Tuning (SFT): Training on curated instruction-response pairs. Teaches instruction following. (3) RLHF/DPO: Optimization using human preference data. Aligns the model to be helpful, harmless, and honest.

Q: Why do LLMs hallucinate?

LLMs are trained to produce plausible text, not factually correct text. They learn statistical patterns and may combine them incorrectly. There is no internal fact-checker. The model generates the most likely next token based on patterns, which can produce confident but false statements. Mitigation: RAG, tool use, chain-of-thought verification.

Q: What is the difference between a base model and a chat model?

A base model (pre-trained only) is a text completion engine - it can continue text but is not trained to follow instructions or hold a conversation. A chat model has been additionally fine-tuned with SFT and aligned with RLHF or DPO to follow instructions, be helpful, and avoid harmful outputs. Llama 2 vs Llama 2-Chat is this distinction.

Q: What does temperature control in LLM generation?

Temperature controls randomness during token sampling. At temperature 0, the model always picks the most probable token (deterministic). At temperature 1, tokens are sampled according to their probabilities (creative). Higher temperatures amplify randomness, lower temperatures reduce it. Use low temperature for factual tasks, higher for creative tasks.

