Model Benchmarks & Evaluation (MMLU, HumanEval)
How to Measure and Compare LLM Performance
Understand the major benchmarks used to evaluate LLMs, how to read leaderboards critically, and how to build your own evaluation for real-world applications.
Why Benchmarks Matter (And Why They Lie)
Measuring Intelligence is Hard
What Are LLM Benchmarks?
Benchmarks are standardized tests for AI models. Just as JEE tests students on Physics, Chemistry, and Math, LLM benchmarks test models on knowledge, reasoning, coding, math, and language understanding.
They provide a common yardstick for comparing models. When Anthropic reports that Claude 3.5 Sonnet scores 88.7% on MMLU, the number is meaningful because it can be set against GPT-4o's reported 88.7% on the same test, provided both were measured with the same prompting setup.
The Problem with Benchmarks:
- Contamination: If benchmark questions appear in training data, the model memorizes answers. This is like a student getting the exam paper in advance
- Overfitting: Companies optimize models specifically for popular benchmarks, which may not improve real-world performance
- Narrow scope: Benchmarks test specific skills. A model can score 90% on MMLU but struggle with creative writing
- Static: Benchmarks stay the same while models and use cases evolve. Old benchmarks become too easy
- Gaming: Some models are trained specifically on benchmark-like data to inflate scores
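One common way to probe for the contamination problem listed above is to look for long word n-grams that a benchmark item shares with a training document. The sketch below is a minimal illustration, not a standard tool: the function names, the whitespace tokenizer, and the default window of 8 words are all assumptions chosen for the example.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in corpus_doc.

    A value near 1.0 suggests the item appears almost verbatim in the document.
    Items shorter than n words have no n-grams and return 0.0.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)
```

A high ratio is only a signal worth investigating; paraphrased or translated leaks would not be caught by exact n-gram matching.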
Analogy - College Rankings:
Benchmarks are like college rankings (NIRF in India). IIT Bombay might be #1 overall, but a specific IIT or NIT could be better for a specific branch. Rankings give direction but do not tell the whole story. Similarly, the #1 model on MMLU may not be the best for your task.
Note: Benchmarks are useful directional indicators but not the final answer. Always complement benchmark analysis with testing on YOUR specific use case.
Major LLM Benchmarks Explained
The Tests That Define LLM Capabilities
MMLU (Massive Multitask Language Understanding)
One of the most widely cited benchmarks. It covers 57 subjects, from elementary math to professional medicine, law, and computer science, with 14,042 multiple-choice questions in its test set.
- Tests: Breadth of knowledge across domains
- Format: 4-option multiple choice
- Human baseline: ~89.8% (estimated expert-level accuracy)
- Top models: GPT-4o and Claude 3.5 Sonnet both ~88-89%
- Limitation: Mostly English, mostly factual recall. Heavily contaminated in training data
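Because MMLU is multiple choice, scoring reduces to comparing the predicted letter with the gold letter. A minimal sketch of that bookkeeping follows; the `(subject, predicted, gold)` record layout and the function name are assumptions for illustration, not the format of any official harness.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Compute per-subject and overall accuracy.

    records: iterable of (subject, predicted_letter, gold_letter) tuples.
    Returns (per_subject_dict, overall_accuracy).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, gold in records:
        total[subject] += 1
        # Compare letters case-insensitively, ignoring stray whitespace.
        correct[subject] += int(predicted.strip().upper() == gold.strip().upper())
    if not total:
        return {}, 0.0
    per_subject = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return per_subject, overall
```

Looking at the per-subject numbers, not just the overall figure, shows where a model is strong or weak, which a single headline score hides.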
HumanEval (Coding)
164 Python programming problems. The model writes code that is tested against unit tests.
- Tests: Code generation ability
- Metric: pass@1 (code passes all tests on first attempt)
- Top models: GPT-4o ~90%, Claude 3.5 Sonnet ~92%
- Extended version: HumanEval+ adds edge-case tests; scores are typically 10-15% lower
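The pass@1 metric above is the k = 1 case of pass@k. When n samples are generated per problem and c of them pass all unit tests, the commonly used unbiased estimate is 1 - C(n-c, k) / C(n, k). A minimal sketch, with the function name chosen for this example:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed all tests, k: attempts allowed.
    Computes 1 - C(n - c, k) / C(n, k): the chance that a random draw of
    k samples from the n contains at least one passing sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 samples of which 3 pass, `pass_at_k(10, 3, 1)` is 0.3. The benchmark score is the mean of this value over all problems.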
Other Important Benchmarks:
| Benchmark | Tests | Format |
|---|---|---|
| MATH | Mathematical reasoning | Free-form math problems |
| GSM8K | Grade school math | Word problems |
| ARC | Science reasoning | Science exam questions |
| HellaSwag | Common sense | Sentence completion |
| MT-Bench | Multi-turn conversation | LLM-as-judge scoring |
| GPQA | Graduate-level science | Expert-written questions |
| Chatbot Arena (Elo) | Human preference | Head-to-head voting |
Note: Chatbot Arena (LMSYS) is widely regarded as one of the most reliable overall signals because it uses live human preference ratings from blind comparisons. Check arena.lmsys.org for the latest rankings.
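To see how head-to-head votes turn into a ranking, here is a sketch of the classic Elo update for a single comparison. It illustrates the general idea only; the K-factor of 32 is an assumption, and the Arena's actual rating computation may differ from this simple online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one vote.

    outcome_a: 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (outcome_a - e_a)
    # B's change mirrors A's, so the total rating is conserved.
    new_b = rating_b + k * ((1.0 - outcome_a) - (1.0 - e_a))
    return new_a, new_b
```

If two models both start at 1000 and A wins one vote, the ratings become 1016 and 984. An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result.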
How to Read Leaderboards Critically
Do Not Be Fooled by Numbers
Red Flags on Leaderboards:
- Suspiciously high scores on MMLU: If a small model scores 90%+ on MMLU, suspect data contamination. The benchmark questions may have leaked into training data
- Only self-reported scores: Companies cherry-pick benchmarks where they do well. Always look for independent evaluation
- Missing benchmarks: If a company reports MMLU but not HumanEval, the model's coding ability might be weak
- Old benchmark versions: Some companies report on easier/older versions of benchmarks
Trusted Sources for Model Evaluation:
- LMSYS Chatbot Arena: Live human preference rankings. Among the most reliable overall rankings. arena.lmsys.org
- Open LLM Leaderboard (HuggingFace): Standardized evaluation of open models. Transparent methodology
- WildBench: Tests on real-world user queries
- Artificial Analysis: Tracks speed, pricing, and quality across providers
- Independent blogs (Simon Willison, etc.): Hands-on testing and comparison
What Matters More Than Benchmarks:
- Instruction following: Does it actually do what you ask?
- Consistency: Does it give good results 95% of the time, not just on cherry-picked examples?
- Edge cases: How does it handle unusual inputs, adversarial prompts, or ambiguous requests?
- Your language: Hindi/Hinglish performance can be very different from English performance
- Latency: Time to first token and tokens per second in production
Note: The best benchmark for your application is YOUR application. Build a custom evaluation set from real user queries and test models against it.
Building Your Own Evaluation
Custom Evaluation for Real-World Applications
Step 1: Create a Test Set
- Collect 50-200 real queries from your actual use case
- Write ideal/expected outputs for each query
- Include edge cases, adversarial inputs, and multilingual examples
- Cover the full range of difficulty (easy FAQs to complex reasoning)
Example for a Flipkart review analyzer: Include reviews in English, Hindi, Hinglish, short reviews, long reviews, sarcastic reviews, reviews with spelling errors
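A test set like the one described in Step 1 can be kept as one JSON object per line, with tags marking language and edge-case type. The sketch below is one possible layout; the `EvalCase` fields and tag names are assumptions for illustration.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class EvalCase:
    case_id: str
    query: str                 # real user input, e.g. a product review
    expected: str              # ideal output written by a human
    tags: list[str] = field(default_factory=list)  # e.g. ["hinglish", "sarcasm"]


def to_jsonl(cases: list[EvalCase]) -> str:
    """Serialize cases as JSON Lines, keeping non-ASCII text readable."""
    return "\n".join(json.dumps(asdict(c), ensure_ascii=False) for c in cases)


def from_jsonl(text: str) -> list[EvalCase]:
    """Parse JSON Lines back into EvalCase objects, skipping blank lines."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


def tag_counts(cases: list[EvalCase]) -> Counter:
    """Count cases per tag to check that edge cases are actually covered."""
    return Counter(tag for c in cases for tag in c.tags)
```

Checking `tag_counts` before running an evaluation makes gaps visible, for example a set with plenty of English reviews but no sarcastic or Hinglish ones.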
Step 2: Define Metrics
- Accuracy: Does the output contain the correct information?
- Completeness: Did it cover all required points?
- Relevance: No irrelevant or hallucinated information?
- Format compliance: Does it follow the requested output format?
- Latency: Response time acceptable for your use case?
- Cost: Token usage and API cost per query
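Some of the metrics above can be computed without a model in the loop. The sketch below shows two of them, format compliance and cost per query, assuming the requested output format is a JSON object and that pricing is quoted per million tokens. The function names and the prices in the usage note are illustrative assumptions, not real rates.

```python
import json


def is_format_compliant(output: str, required_keys: set[str]) -> bool:
    """True if output is a JSON object that contains every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def cost_per_query(
    input_tokens: int,
    output_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Cost of one call, given separate input and output token prices."""
    return (
        input_tokens * price_in_per_million
        + output_tokens * price_out_per_million
    ) / 1_000_000
```

With hypothetical prices of 3.0 and 15.0 per million input and output tokens, a query using 1,000 input and 500 output tokens costs 0.0105 in the same currency unit.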
Step 3: LLM-as-Judge
Use a strong model (GPT-4o or Claude) to automatically score outputs from other models. This scales evaluation without needing human reviewers for every query.
Prompt to Judge LLM:
"Rate the following response on a scale of 1-5 for:
- Accuracy (correct information)
- Completeness (covers all points)
- Relevance (no hallucination)
- Clarity (easy to understand)
Question: {original_question}
Expected Answer: {ideal_answer}
Model Response: {model_output}
Provide scores and brief justification."
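The judge prompt above can be wired into a small harness. In this sketch, `call_judge` is a hypothetical stand-in for whatever client function sends a prompt to your judge model and returns its text reply; no specific provider API is assumed. The extra instruction asking for one "Name: score" line per criterion is also an assumption, added so the reply can be parsed reliably.

```python
import re
from typing import Callable

CRITERIA = ("Accuracy", "Completeness", "Relevance", "Clarity")

JUDGE_TEMPLATE = """Rate the following response on a scale of 1-5 for:
- Accuracy (correct information)
- Completeness (covers all points)
- Relevance (no hallucination)
- Clarity (easy to understand)
Question: {original_question}
Expected Answer: {ideal_answer}
Model Response: {model_output}
Provide scores and brief justification.
Start with one line per criterion in the form "Accuracy: 4"."""


def build_judge_prompt(original_question: str, ideal_answer: str, model_output: str) -> str:
    """Fill the judge template with one test case and one model response."""
    return JUDGE_TEMPLATE.format(
        original_question=original_question,
        ideal_answer=ideal_answer,
        model_output=model_output,
    )


def parse_scores(reply: str) -> dict[str, int]:
    """Extract a 1-5 integer score for each criterion from the judge's reply."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{name}\s*[:=]\s*([1-5])\b", reply, flags=re.IGNORECASE)
        if match is None:
            raise ValueError(f"no score found for {name}")
        scores[name.lower()] = int(match.group(1))
    return scores


def judge(call_judge: Callable[[str], str], question: str, ideal: str, output: str) -> dict[str, int]:
    """Score one model output using the supplied (hypothetical) judge client."""
    return parse_scores(call_judge(build_judge_prompt(question, ideal, output)))
```

Raising an error on an unparseable reply, instead of silently defaulting to a score, keeps malformed judge outputs from skewing the averages.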
Step 4: A/B Testing in Production
After initial evaluation, run A/B tests with real users:
- Route 50% of traffic to Model A, 50% to Model B
- Track user satisfaction metrics (thumbs up/down, follow-up questions)
- Monitor error rates, hallucination frequency, escalation to human
- Measure latency and cost impact
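For the traffic split above, a common approach is to hash a stable user identifier so that each user always sees the same model across sessions. A minimal sketch follows; the experiment name and function name are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "model-ab-test", share_a: float = 0.5) -> str:
    """Deterministically assign a user to variant "A" or "B".

    The experiment name is mixed into the hash so that different experiments
    split the same users independently of each other.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "A" if bucket < share_a else "B"
```

Because the assignment depends only on the identifier and experiment name, no lookup table is needed, and changing `share_a` allows a gradual rollout, for example starting Model B at 10% of traffic.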
Note: LLM-as-Judge combined with human evaluation on a sample gives the best balance of scale and accuracy. Always validate that the judge's ratings match human judgment.
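One simple way to check a judge against human reviewers is to score the same sample both ways and measure how often the ratings agree. The sketch below assumes 1-5 integer ratings in matching order; the function name and the tolerance of one point are illustrative choices.

```python
def agreement(judge_scores: list[int], human_scores: list[int], tolerance: int = 1):
    """Return (exact_agreement, within_tolerance_agreement) as fractions.

    Both lists must rate the same responses in the same order.
    """
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(judge_scores, human_scores))
    exact = sum(j == h for j, h in pairs) / len(pairs)
    close = sum(abs(j - h) <= tolerance for j, h in pairs) / len(pairs)
    return exact, close
```

For judge scores `[5, 4, 3, 2]` and human scores `[5, 3, 3, 4]`, exact agreement is 0.5 and within-one-point agreement is 0.75. Low agreement usually means the judge prompt or the rubric needs revision before its scores can be trusted.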
Evaluation Pitfalls
Common Mistakes in LLM Evaluation
Pitfall 1: Vibes-Based Evaluation
"I tried 5 prompts and Claude felt better" is not evaluation. You need systematic testing with diverse examples and quantitative metrics. Gut feeling is useful for initial exploration but not for production decisions.
Pitfall 2: Testing Only Happy Path
Your model works great on normal questions. But what about: typos, Hindi mixed with English, adversarial prompts, edge cases, very long inputs, empty inputs? Test the unhappy paths too.
Pitfall 3: Not Measuring Consistency
Run each test query 5-10 times. LLM outputs are non-deterministic: even at temperature 0, responses can vary between calls depending on how the provider serves the model. A model that gives great answers 60% of the time and terrible answers 40% of the time is not reliable for production.
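Measuring consistency can be as simple as repeating a query and recording how often the output passes a check. In this sketch, `generate` is a hypothetical stand-in for your model call and `passes` is whatever check fits the task (exact match, format validation, or a judge score threshold).

```python
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    passes: Callable[[str], bool],
    query: str,
    runs: int = 10,
) -> float:
    """Run the same query several times and return the fraction of passing outputs."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return sum(passes(generate(query)) for _ in range(runs)) / runs
```

Reporting the pass rate per query, instead of a single pass or fail, separates queries the model always handles from those it only handles some of the time.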
Pitfall 4: Ignoring Cost-Quality Trade-off
A model that scores 5% higher on your evaluation but costs 10x more may not be worth it. Always plot quality vs cost. The sweet spot is usually not the most expensive model.
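The quality-versus-cost comparison can be made concrete by keeping only the models that no other model beats on both axes (the Pareto frontier). The sketch below uses made-up model names and numbers purely for illustration.

```python
def pareto_frontier(models: dict[str, tuple[float, float]]) -> list[str]:
    """Return model names not dominated on (cost, quality), sorted by cost.

    models maps name -> (cost_per_query, quality_score).
    A model is dominated if another is at least as cheap and at least as good,
    and strictly better on at least one of the two.
    """
    frontier = []
    for name, (cost, quality) in models.items():
        dominated = any(
            c <= cost and q >= quality and (c < cost or q > quality)
            for other, (c, q) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: models[n][0])


# Hypothetical candidates: (cost per 1,000 queries, mean judge score out of 5).
candidates = {
    "small": (1.0, 3.8),
    "mid": (3.0, 4.3),
    "large": (10.0, 4.5),
    "pricey": (12.0, 4.2),
}
```

Here `pareto_frontier(candidates)` returns `["small", "mid", "large"]`: "pricey" drops out because "mid" is both cheaper and better. The final choice among the remaining models depends on how much each extra point of quality is worth to your application.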
Note: Never pick a model based only on benchmarks or vibes. Build a systematic evaluation pipeline with your actual data, test edge cases, measure consistency, and factor in cost.
Interview Questions
Q: What is MMLU and why is it the most cited LLM benchmark?
MMLU (Massive Multitask Language Understanding) tests models across 57 subjects with 14,042 multiple-choice questions. It is widely cited because it measures breadth of knowledge across many domains. However, it is heavily contaminated in training data and tests recall more than reasoning. It should be one of many evaluation criteria, not the sole one.
Q: What is benchmark contamination and why is it a problem?
Benchmark contamination occurs when benchmark questions appear in the model's training data. The model essentially memorizes answers rather than genuinely solving problems. This inflates scores artificially and makes the benchmark useless for measuring true capability. Mitigation: use newer benchmarks, track independent evaluations, and always test on your own data.
Q: How would you evaluate an LLM for a production customer support chatbot?
(1) Create a test set from real customer queries (100-200 examples). (2) Define metrics: accuracy, completeness, relevance, tone, format compliance. (3) Use LLM-as-Judge for automated scoring at scale. (4) Have humans review a 10-20% sample to validate the judge. (5) A/B test top candidates in production. (6) Track escalation rate, user satisfaction, and cost. Test in all user languages (Hindi, English, Hinglish).
Q: What is LLM-as-Judge and when would you use it?
LLM-as-Judge uses a strong model (like GPT-4o) to evaluate outputs from other models or earlier versions. You provide the original question, expected answer, and model output; the judge rates quality on defined criteria. It scales evaluation without needing human reviewers for every query. Use it for continuous evaluation pipelines, regression testing, and comparing multiple models. Always validate that judge ratings match human judgment on a sample.