Exception Handling & Recovery
Real systems fail — your agent must not crash
Picture a delivery driver whose GPS suddenly dies. A bad driver just stops in the middle of the road. A good driver tries again, then switches to a paper map, and if truly stuck, calls the office for help. Exception Handling & Recovery means teaching your agent to do the same: catch the failure, retry sensibly, fall back to a plan B, or escalate to a human, instead of blowing up.
Key points
- Tools, networks and even the LLM WILL fail sometimes.
- Catch the error; don't let one failed step kill the whole run.
- Recovery options: retry (with backoff) → fallback → ask a human.
What 'Exception Handling & Recovery' means
Exception handling is wrapping risky actions (tool calls, API requests, parsing the LLM's answer) so that when they fail, your code catches the error instead of crashing. Recovery is what you do next: retry the step (often after a short wait), use a fallback plan, or escalate to a person. The goal is graceful failure, not a stack trace in the user's face.
Note: Catch the error → decide: retry, fall back, or ask for help. Never just crash.
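To make "catch, then decide" concrete, here is a minimal sketch of wrapping one risky step so that a failure comes back as data the agent can act on. The names `StepResult`, `run_step`, and `flaky_search` are hypothetical, and the set of exceptions treated as "expected" is an assumption you would adjust for your own tools.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class StepResult:
    """Outcome of one agent step, reported as data rather than an exception."""
    ok: bool
    value: Any = None
    error: Optional[str] = None


def run_step(action: Callable[[], Any]) -> StepResult:
    """Run one risky action; turn expected failures into a StepResult."""
    try:
        return StepResult(ok=True, value=action())
    except (TimeoutError, ConnectionError, ValueError) as exc:
        # Only the failure types we expect; other exceptions still propagate.
        return StepResult(ok=False, error=f"{type(exc).__name__}: {exc}")


def flaky_search() -> str:
    # Hypothetical tool that always times out, for demonstration only.
    raise TimeoutError("search took too long")


result = run_step(flaky_search)
if result.ok:
    print("answer:", result.value)
else:
    # The caller now decides: retry, fall back, or ask for help.
    print("step failed, choosing a recovery path:", result.error)
```

Returning a result object keeps the decision (retry, fall back, escalate) in one place instead of scattering try/except blocks through the agent loop.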
Fragile agent vs resilient agent
FRAGILE (crashes)
call tool → ❌ timeout → 💥 WHOLE AGENT CRASHES, user sees a stack trace 😱

RESILIENT (recovers)
call tool → ❌ timeout (caught!) → retry (wait) → still fails? → fallback plan → also fails? → 🙋 ask a human (escalate)
Your recovery toolbox
- Retry with backoff — Try again, waiting a bit longer each time, for flaky/temporary errors. Example: Wait 1s, then 2s, then 4s (exponential backoff) before giving up.
- Fallback — A simpler plan B when the main path keeps failing. Example: Search API down? Use a cached answer or a cheaper search.
- Escalate to human — When the agent truly can't proceed safely, hand off to a person. Example: "I couldn't book this; here's what I tried — please confirm."
- Validate LLM output — Treat the model's answer as untrusted; check shape before using it. Example: Expecting JSON? Parse it; if it fails, ask the LLM to fix it.
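The last tool in the list, validating LLM output, could be sketched roughly as below. This is an illustration under assumptions, not a fixed recipe: `parse_json_with_repair`, the `ask_llm` callable, and `fake_llm` are hypothetical names, and "valid" here only means "a JSON object containing the required keys".

```python
import json
from typing import Callable


def parse_json_with_repair(ask_llm: Callable[[str], str], prompt: str,
                           required_keys: set, max_attempts: int = 3) -> dict:
    """Ask the model, check the shape, and re-ask with the problem if invalid."""
    message = prompt
    for _ in range(max_attempts):
        raw = ask_llm(message)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            problem = f"not valid JSON ({exc.msg})"
        else:
            if not isinstance(data, dict):
                problem = "top-level value must be a JSON object"
            else:
                missing = required_keys - set(data)
                if not missing:
                    return data  # shape is fine, safe to use downstream
                problem = f"missing keys: {sorted(missing)}"
        # Tell the model what was wrong and ask again.
        message = (f"{prompt}\n\nYour previous reply was rejected: {problem}. "
                   "Reply with JSON only.")
    raise ValueError(f"model output still invalid after {max_attempts} attempts")


# Hypothetical stand-in for a real model: chatty text first, clean JSON second.
replies = iter([
    'Sure! Here is the JSON: {"city": "Pune"}',
    '{"city": "Pune", "date": "2025-01-01"}',
])


def fake_llm(message: str) -> str:
    return next(replies)


booking = parse_json_with_repair(fake_llm, "Extract city and date as JSON.",
                                 {"city", "date"})
print(booking)
```

If the model never produces valid output, the function raises instead of guessing, which gives the caller a clear point to fall back or escalate.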
Retry with exponential backoff (why we wait longer each time)
Attempt 1 ✗ fail ──► wait 1s
Attempt 2 ✗ fail ──► wait 2s
Attempt 3 ✗ fail ──► wait 4s
Attempt 4 ✓ success! (server had recovered)

time ──────────────────────────►
     |‐|   |‐‐|   |‐‐‐‐|
     1s     2s      4s
Gaps grow so we don't hammer a struggling server.
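The arithmetic behind that schedule is simple: the wait before retry n is base × 2^(n-1). A small sketch follows; the function name `backoff_delays` and the `cap` parameter (an upper limit on any single wait) are assumptions added for illustration, since the text above only specifies the 1s, 2s, 4s pattern.

```python
def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   retries: int = 3, cap: float = 30.0) -> list:
    """Return the wait (in seconds) before each retry, doubling each time."""
    # min(...) keeps any single wait from growing past the cap.
    return [min(cap, base * factor ** i) for i in range(retries)]


delays = backoff_delays()
print(delays)       # [1.0, 2.0, 4.0]
print(sum(delays))  # 7.0 seconds of waiting in total before giving up
```

Because the waits double, a few retries already spread the load out considerably, which is why a small retry cap is usually enough.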
A tiny code example (try/except + retry + fallback)
This wraps a flaky tool. It retries with growing waits, and if all retries fail it uses a fallback instead of crashing.
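import time


def call_with_recovery(tool, args, max_retries=3):
    """Call tool(args); retry timeouts with growing waits, then fall back."""
    delay = 1.0
    for attempt in range(1, max_retries + 1):
        try:
            return tool(args)              # try the risky action
        except TimeoutError as e:          # catch only what we expect
            print(f"attempt {attempt} failed: {e}")
            if attempt == max_retries:
                break                      # out of retries: go to plan B
            time.sleep(delay)              # wait...
            delay *= 2                     # ...longer each time (backoff)
    return fallback(args)                  # plan B, never crash


def fallback(args):
    cached = cache_lookup(args)
    if cached is not None:
        return cached
    return escalate_to_human(args)         # last resort


# Placeholder helpers so the example runs; replace them with your own.
def cache_lookup(args):
    return None                            # pretend nothing is cached


def escalate_to_human(args):
    return f"Needs human review: {args!r}"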
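The `fallback` function above hard-codes a single plan B. One possible generalization, sketched here under assumptions, is to try an ordered list of strategies and stop at the first one that works. The names `first_that_works`, `primary_search`, and `cached_search` are hypothetical, as is the choice of which exceptions count as "try the next one".

```python
import logging
from typing import Any, Callable, Sequence

logger = logging.getLogger("agent.recovery")


def first_that_works(strategies: Sequence[Callable[[str], Any]], query: str) -> Any:
    """Try each strategy in order; return the first successful result."""
    for strategy in strategies:
        try:
            return strategy(query)
        except (TimeoutError, ConnectionError, LookupError) as exc:
            # Log and move on to the next, simpler plan.
            logger.warning("%s failed: %s", strategy.__name__, exc)
    # Nothing worked: surface a clear signal so the caller can escalate.
    raise RuntimeError("all strategies failed; escalate to a human")


def primary_search(query: str) -> str:
    # Hypothetical main path, simulated as being down.
    raise ConnectionError("search API is down")


def cached_search(query: str) -> str:
    # Hypothetical cache; raises KeyError (a LookupError) on a miss.
    cache = {"python retry": "cached answer about retries"}
    return cache[query]


answer = first_that_works([primary_search, cached_search], "python retry")
print(answer)  # cached answer about retries
```

Ordering the list from best to cheapest mirrors the "search API down? use a cached answer" example from the toolbox.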
When recovery matters most
| Scenario | Recommendation | Why |
|---|---|---|
| Calling external APIs / network tools | ✅ Retry + fallback | Networks are flaky; transient errors are normal. |
| Parsing the LLM's output into JSON/structured data | ✅ Validate + re-ask | Models sometimes return malformed or extra text. |
| Irreversible actions (payments, deletions) | ✅ Escalate, don't blind-retry | Retrying could double-charge or double-delete. |
| A pure local calculation you control | ❌ Minimal handling | Little can fail; a simple try/except is enough. |
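The third row of the table is the easiest one to get wrong in code, so here is a hedged sketch of a retry wrapper that refuses to retry steps that are not safe to repeat. `run_tool`, `NeedsHumanReview`, the `idempotent` flag, and `charge_card` are all hypothetical names; how you mark a tool as idempotent in a real agent is up to your design.

```python
import time
from typing import Any, Callable


class NeedsHumanReview(Exception):
    """A step failed in a way that a person must resolve."""


def run_tool(tool: Callable[[], Any], idempotent: bool, max_retries: int = 3,
             sleep: Callable[[float], None] = time.sleep) -> Any:
    """Retry with backoff only when repeating the call is safe."""
    attempts = max(1, max_retries) if idempotent else 1
    delay = 1.0
    for attempt in range(1, attempts + 1):
        try:
            return tool()
        except TimeoutError as exc:
            if not idempotent:
                # The action may or may not have happened: do not repeat it.
                raise NeedsHumanReview(f"outcome unknown after timeout: {exc}") from exc
            if attempt == attempts:
                raise  # retries exhausted; let the caller fall back
            sleep(delay)
            delay *= 2


calls = {"charge": 0}


def charge_card() -> str:
    # Hypothetical payment call that times out after being sent.
    calls["charge"] += 1
    raise TimeoutError("payment gateway did not respond")


try:
    run_tool(charge_card, idempotent=False)
except NeedsHumanReview as exc:
    print("escalating:", exc)

print(calls["charge"])  # 1, the payment was attempted exactly once
```

A timeout on a payment does not tell you whether the charge went through, which is why the sketch escalates rather than guessing.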
Recovery mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Bare 'except:' that swallows every error silently. | Real bugs hide; the agent 'succeeds' while doing nothing useful. | Catch specific exceptions and log them; never swallow silently. |
| Retrying instantly, many times, with no wait. | You hammer a struggling server and may get rate-limited or banned. | Use exponential backoff and a sane max-retry cap. |
| Retrying non-idempotent actions like payments. | Customers get charged twice; data gets duplicated. | Only auto-retry safe (idempotent) steps; escalate the risky ones. |
| Trusting LLM output without validating it. | Malformed JSON crashes the next step downstream. | Validate the shape; if invalid, ask the model to correct it. |
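The first mistake in the table is worth seeing side by side. In this sketch (all function names are hypothetical), a silent bare `except:` hides a genuine programming bug, while catching only `TimeoutError` and logging it lets the bug surface.

```python
import logging

logger = logging.getLogger("agent.tools")


def fetch_price_silent(fetch, item):
    """Anti-pattern: swallows every error, including real bugs."""
    try:
        return fetch(item)
    except:  # noqa: E722 (bare except shown on purpose)
        return None


def fetch_price_logged(fetch, item):
    """Catches only the expected failure and records it."""
    try:
        return fetch(item)
    except TimeoutError as exc:
        logger.warning("price lookup timed out for %r: %s", item, exc)
        return None


def buggy_fetch(item):
    # Bug: item is a string here, so indexing by "price" raises TypeError.
    return item["price"] * 1.18


print(fetch_price_silent(buggy_fetch, "laptop"))  # None, the bug is hidden

try:
    fetch_price_logged(buggy_fetch, "laptop")
except TypeError as exc:
    print("bug surfaced:", type(exc).__name__)
```

With the silent version the agent carries on with `None` and appears to succeed; with the specific version the TypeError shows up immediately, where it can be fixed.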
Recovery rules of thumb
- Catch specific errors and log them; never use a silent bare except.
- Recovery order: retry with backoff → fallback → escalate to a human.
- Never blindly retry actions that change money or delete data.
Key takeaways
- Tools, networks and LLMs all fail; the agent must catch errors, not crash.
- Recovery options in order: retry with backoff, then fallback, then escalate to a human.
- Use exponential backoff and a retry cap so you don't hammer struggling services.
- Never auto-retry irreversible actions; validate LLM output before using it.