Evaluation & Observability
If you can't see it, you can't improve it
Imagine a chef who never tastes the food and never reads customer reviews. They'd have no idea if dinner was great or terrible. An agent is the same. Tasting the food is evaluation: checking how good the output actually is. Observability is the kitchen camera that records every step (tracing) plus the dashboards for cost and speed.
Key points
- Observability = see what the agent did (traces, logs, cost, latency).
- Evaluation = measure how GOOD the output is, with a score you trust.
- Tools like LangSmith and Langfuse record traces and run evals for you.
What 'Evaluation & Observability' means
Observability is being able to see inside your running agent: a trace of every step (prompts, tool calls, results), plus metrics like cost and latency. Evaluation (evals) is measuring the quality of the outputs — did it give the right answer? — using a test set and a scoring method. One shows you what happened; the other shows you how good it was.
Note: Observability = what happened (traces + metrics). Evaluation = how good it was (scores on a test set).
A trace: every step recorded so you can debug
RUN #abc123   total: 3 steps · 4.2s · ₹0.18
┌──────────────────────────────────────────────────────────┐
│ Step 1  THINK  model=small   0.4s  ₹0.01                 │
│   prompt: "Find weather in Mumbai"                       │
│   decision: call tool weather(city='Mumbai')             │
├──────────────────────────────────────────────────────────┤
│ Step 2  ACT    tool=weather  1.1s  ₹0.00                 │
│   result: {temp: 31, sky: 'humid'}                       │
├──────────────────────────────────────────────────────────┤
│ Step 3  THINK  model=big     2.7s  ₹0.17                 │
│   answer: "It's 31°C and humid in Mumbai." ✅            │
└──────────────────────────────────────────────────────────┘
  ↑ when something breaks, you open the trace and SEE the step
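The trace above can be modelled as a small data structure. The sketch below is one hypothetical shape: the `Step` and `Run` names and their fields are assumptions for illustration, not the format of any particular tool. It stores the three steps of run #abc123 and derives the totals shown in the header.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One recorded step of a run (hypothetical schema)."""
    kind: str          # "THINK" or "ACT"
    source: str        # model or tool that handled the step
    latency_s: float
    cost_inr: float
    detail: dict = field(default_factory=dict)


@dataclass
class Run:
    run_id: str
    steps: list = field(default_factory=list)

    def total_latency_s(self) -> float:
        return round(sum(s.latency_s for s in self.steps), 2)

    def total_cost_inr(self) -> float:
        return round(sum(s.cost_inr for s in self.steps), 2)

    def slowest_step(self) -> int:
        """1-based index of the step with the highest latency."""
        index = max(range(len(self.steps)), key=lambda i: self.steps[i].latency_s)
        return index + 1


run = Run("abc123", [
    Step("THINK", "model=small", 0.4, 0.01, {"prompt": "Find weather in Mumbai"}),
    Step("ACT", "tool=weather", 1.1, 0.00, {"result": {"temp": 31, "sky": "humid"}}),
    Step("THINK", "model=big", 2.7, 0.17, {"answer": "It's 31 C and humid in Mumbai."}),
])

print(f"RUN #{run.run_id}: {len(run.steps)} steps, "
      f"{run.total_latency_s()}s, INR {run.total_cost_inr():.2f}, "
      f"slowest step: {run.slowest_step()}")
```

Keeping cost and latency on each step, rather than only on the run, is what lets you point at the step responsible when a total looks wrong.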
The 4 things to watch
- Tracing (logs) — A step-by-step record of prompts, tool calls and results for each run. Example: Open run #abc123 and see exactly which tool returned bad data.
- Quality evals — Score outputs against a test set of inputs with known good answers. Example: Run 100 saved questions; check how many answers are correct.
- Cost tracking — How much money each step/run/day spends, so bills don't surprise you. Example: Alert if average cost-per-run jumps above ₹0.50.
- Latency tracking — How long steps and whole runs take, so you catch slowdowns. Example: Alert if p95 latency goes above 5 seconds.
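The cost and latency alerts in the examples above can be sketched in a few lines. This is a minimal illustration, assuming each run is stored as a dict with `cost` and `latency_s` keys. The function names, the nearest-rank percentile method, and the sample data are assumptions; only the thresholds (₹0.50 and 5 seconds) come from the examples.

```python
import math


def p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]


def check_alerts(runs, max_avg_cost=0.50, max_p95_latency_s=5.0):
    """Return a list of alert messages for a batch of runs.

    Each run is a dict with 'cost' (in rupees) and 'latency_s'.
    """
    alerts = []
    avg_cost = sum(r["cost"] for r in runs) / len(runs)
    if avg_cost > max_avg_cost:
        alerts.append(f"average cost per run {avg_cost:.2f} exceeds {max_avg_cost:.2f}")
    latency = p95([r["latency_s"] for r in runs])
    if latency > max_p95_latency_s:
        alerts.append(f"p95 latency {latency:.1f}s exceeds {max_p95_latency_s:.1f}s")
    return alerts


# Made-up sample: 20 runs, two of them slow.
runs = [{"cost": 0.18, "latency_s": 4.2} for _ in range(18)]
runs += [{"cost": 0.20, "latency_s": 9.0}, {"cost": 0.22, "latency_s": 7.5}]

for message in check_alerts(runs):
    print("ALERT:", message)
```

An average would hide the two slow runs here; the p95 check is what surfaces them.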
The evaluation loop (measure → improve → re-measure)
      ┌─────────────────────────────────────────────────┐
      │                                                 │
      ▼                                                 │
┌────────────┐    run    ┌────────────┐   score   ┌────────────┐
│  TEST SET  │─────────► │   AGENT    │─────────► │    EVAL    │
│ 100 inputs │           │            │           │  correct?  │
│ + answers  │           └────────────┘           │   82/100   │
└────────────┘                                    └─────┬──────┘
      ▲                                                 │
      │   change prompt / model / tools, then re-run    │
      └─────────────────────────────────────────────────┘
goal: the score goes UP over time 📈
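The loop in the diagram can be sketched as: measure a baseline, change something, measure again, and keep the change only if the score went up. The two agents below are stand-ins (plain lookup tables, no real model calls), and all names are hypothetical.

```python
TEST_SET = [
    {"q": "2+2?", "expected": "4"},
    {"q": "capital of France?", "expected": "Paris"},
    {"q": "capital of Japan?", "expected": "Tokyo"},
]


def accuracy(agent, test_set):
    """Fraction of test cases whose expected answer appears in the agent's reply."""
    hits = sum(
        1 for case in test_set
        if case["expected"].lower() in agent(case["q"]).lower()
    )
    return hits / len(test_set)


# Stand-in agents: lookup tables instead of real model calls.
def agent_v1(question):
    known = {"2+2?": "The answer is 4."}
    return known.get(question, "I don't know.")


def agent_v2(question):
    known = {
        "2+2?": "The answer is 4.",
        "capital of France?": "Paris is the capital.",
    }
    return known.get(question, "I don't know.")


baseline = accuracy(agent_v1, TEST_SET)      # measure
candidate = accuracy(agent_v2, TEST_SET)     # change something, then re-measure
keep_change = candidate > baseline           # keep only if the score went up

print(f"baseline={baseline:.0%} candidate={candidate:.0%} keep={keep_change}")
```

Because both variants run against the same test set, the two scores are directly comparable.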
A tiny code example (trace each step + score a test set)
First we wrap a step so it records what happened. Then we run a small test set and compute an accuracy score we can track over time.
import time

TRACE = []  # in-memory trace; each entry describes one step

def traced_step(name, fn, *args):
    """Run fn(*args) and record the step's latency and output."""
    start = time.perf_counter()
    result = fn(*args)
    TRACE.append({  # OBSERVABILITY: record the step
        "step": name,
        "latency_s": round(time.perf_counter() - start, 2),
        "output": result,
    })
    return result

TEST_SET = [
    {"q": "2+2?", "expected": "4"},
    {"q": "capital of France?", "expected": "Paris"},
]

def evaluate(agent):
    """EVALUATION: score quality as the fraction of correct answers."""
    correct = 0
    for case in TEST_SET:
        ans = agent(case["q"])
        # Case-insensitive substring check: simple, but "4" also matches "14"
        if case["expected"].lower() in ans.lower():
            correct += 1
    score = correct / len(TEST_SET)
    print(f"accuracy = {score:.0%}")  # track this over time
    return score
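The substring check in `evaluate` is easy to read but can over-credit an answer: an expected value of "4" also matches a reply containing "14". The sketch below shows one possible stricter scorer that compares whole tokens after light normalization. The helper names are hypothetical, and this is one option among many scoring methods.

```python
import re
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def substring_match(expected, answer):
    """The original check: case-insensitive substring test."""
    return expected.lower() in answer.lower()


def token_match(expected, answer):
    """True if the expected answer appears as whole tokens in the reply."""
    expected_tokens = normalize(expected).split()
    answer_tokens = normalize(answer).split()
    n = len(expected_tokens)
    return any(
        answer_tokens[i:i + n] == expected_tokens
        for i in range(len(answer_tokens) - n + 1)
    )


print(substring_match("4", "The answer is 14."))       # True: a false positive
print(token_match("4", "The answer is 14."))           # False
print(token_match("4", "The answer is 4."))            # True
print(token_match("Paris", "It's Paris, of course."))  # True
```

Stripping punctuation has its own cost (for instance, "3.14" becomes "314"), so the scorer should be chosen to fit the kind of answers in the test set.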
When observability & evals pay off
| Scenario | Recommendation | Why |
|---|---|---|
| Anything running in production with real users | ✅ Tracing + cost/latency | You need to debug and budget live behaviour. |
| Tuning prompts/models and unsure if changes help | ✅ Evals on a test set | A score tells you objectively if quality went up or down. |
| Costs or latency creeping up mysteriously | ✅ Per-step metrics + alerts | Traces reveal which step is the culprit. |
| A throwaway one-off script | ❌ Skip heavy tooling | Not worth the setup if it runs once and is gone. |
Tools you'll hear about (high level)
You don't have to build tracing from scratch. Popular platforms record traces, costs and latency, and help you run evals:
- LangSmith: tracing + evaluation dashboard from the LangChain team.
- Langfuse: open-source tracing, cost tracking and evals.
Both follow the same core idea: capture every step, show metrics, and let you score quality. Pick one and instrument early; don't wait until something breaks.
Evaluation & observability mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Shipping with no logging or tracing at all. | When it breaks in production you're blind and can't debug. | Instrument tracing from day one; record every step's inputs/outputs. |
| Judging quality by 'it felt good in a demo'. | You ship regressions you can't see; quality silently drops. | Build a test set with known answers and track an accuracy score. |
| Tracking quality but ignoring cost and latency. | A 'better' agent that's secretly 5x slower and pricier. | Watch cost and latency alongside quality — all three matter. |
| Evaluating only once, then never again. | Models/prompts/data drift; old scores become meaningless. | Re-run evals on every meaningful change (CI), not just once. |
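One way to act on the last fix in the table is a test that fails when accuracy drops below a stored baseline, so the eval runs with every change. The sketch below is an illustration under stated assumptions: `BASELINE_ACCURACY`, `TOLERANCE`, `passes_gate`, and the stub agent are hypothetical, and in a real project the test would call the actual agent and be collected by a test runner such as pytest.

```python
BASELINE_ACCURACY = 0.82   # assumed: score recorded from the last accepted run
TOLERANCE = 0.02           # assumed: allowed drop before the check fails


def passes_gate(score, baseline=BASELINE_ACCURACY, tolerance=TOLERANCE):
    """True if the new score has not regressed beyond the tolerance."""
    return score >= baseline - tolerance


def run_eval(agent, test_set):
    """Fraction of cases where the expected answer appears in the reply."""
    correct = sum(
        1 for case in test_set
        if case["expected"].lower() in agent(case["q"]).lower()
    )
    return correct / len(test_set)


def test_no_quality_regression():
    # Stub agent and tiny test set, for illustration only.
    test_set = [
        {"q": "2+2?", "expected": "4"},
        {"q": "capital of France?", "expected": "Paris"},
    ]
    answers = {"2+2?": "4", "capital of France?": "Paris"}
    score = run_eval(lambda q: answers[q], test_set)
    assert passes_gate(score), f"accuracy {score:.0%} fell below the gate"


test_no_quality_regression()
print(passes_gate(0.85), passes_gate(0.79))
```

The tolerance keeps the gate from failing on small run-to-run noise while still catching real drops.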
Observability rules to live by
- Instrument tracing BEFORE you launch, not after the first incident.
- Quality, cost and latency are three dials — watch all three together.
- An eval test set turns 'feels good' into a number you can improve.
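The "three dials" rule can be made concrete with a small scorecard that compares two variants on quality, cost, and latency at once. Everything in this sketch is an assumption for illustration: the `Scorecard` fields, the ratio limits, and the made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class Scorecard:
    name: str
    accuracy: float        # fraction correct on the test set
    avg_cost: float        # average cost per run, in rupees
    p95_latency_s: float   # 95th-percentile latency, in seconds


def compare(old, new, max_cost_ratio=2.0, max_latency_ratio=2.0):
    """Describe how `new` differs from `old` on all three dials.

    The ratio limits are illustrative defaults, not recommended values.
    """
    notes = []
    if new.accuracy > old.accuracy:
        notes.append("quality up")
    elif new.accuracy < old.accuracy:
        notes.append("quality down")
    if new.avg_cost > old.avg_cost * max_cost_ratio:
        notes.append(f"cost x{new.avg_cost / old.avg_cost:.1f}")
    if new.p95_latency_s > old.p95_latency_s * max_latency_ratio:
        notes.append(f"latency x{new.p95_latency_s / old.p95_latency_s:.1f}")
    return notes


# Made-up numbers for two variants of the same agent.
v1 = Scorecard("small-model", accuracy=0.82, avg_cost=0.10, p95_latency_s=2.0)
v2 = Scorecard("big-model", accuracy=0.88, avg_cost=0.50, p95_latency_s=10.0)

print(compare(v1, v2))
```

Looking only at accuracy, the second variant wins; the scorecard shows what that improvement costs.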
Key takeaways
- Observability shows what the agent did: traces of every step plus cost and latency.
- Evaluation shows how good the output is, scored against a test set of known answers.
- Watch quality, cost and latency together — a 'smarter' agent can secretly be slower/pricier.
- Tools like LangSmith and Langfuse provide tracing and evals; instrument early, re-eval on every change.