
Exception Handling & Recovery

Real systems fail — your agent must not crash

Picture a delivery driver whose GPS suddenly dies. A bad driver just stops in the middle of the road. A good driver tries again, then switches to a paper map, and if truly stuck, calls the office for help. Exception Handling & Recovery means teaching your agent to do the same: catch the failure, retry sensibly, fall back to a plan B, or escalate to a human, instead of blowing up.

Key points

What 'Exception Handling & Recovery' means

Exception handling is wrapping risky actions (tool calls, API requests, parsing the LLM's answer) so that when they fail, your code catches the error instead of crashing. Recovery is what you do next: retry the step (often after a short wait), use a fallback plan, or escalate to a person. The goal is graceful failure, not a stack trace in the user's face.

Note: Catch the error → decide: retry, fall back, or ask for help. Never just crash.
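The decision in the note above can be sketched as a small function. This is only an illustration: `Action`, `TRANSIENT_ERRORS` and `decide_recovery` are hypothetical names, and treating `TimeoutError` and `ConnectionError` as the "worth retrying" errors is an assumption you would adjust for your own tools.

```python
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    FALLBACK = "fallback"
    ESCALATE = "escalate"


# Assumption: these built-in exception types stand in for "transient" failures.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def decide_recovery(error, attempt, max_retries, has_fallback):
    """Pick the next step after a caught failure (hypothetical helper)."""
    # Transient errors are worth another try while attempts remain.
    if isinstance(error, TRANSIENT_ERRORS) and attempt < max_retries:
        return Action.RETRY
    # Out of retries, or an error retrying will not fix: use plan B if one exists.
    if has_fallback:
        return Action.FALLBACK
    # Nothing left to try automatically: hand off to a person.
    return Action.ESCALATE
```

Keeping the decision separate from the retry loop makes the policy easy to test and to change without touching the code that actually calls the tool.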

Fragile agent vs resilient agent

FRAGILE (crashes)          RESILIENT (recovers)
─────────────────          ────────────────────

call tool                  call tool
    │                          │
    ▼                          ▼
❌ timeout                 ❌ timeout
    │                          │ caught!
    ▼                          ▼
💥 WHOLE AGENT             ┌──────────────┐
   CRASHES                 │ retry (wait) │──┐
   user sees a             └──────────────┘  │
   stack trace 😱              │ still fails? │
                               ▼              │
                           ┌──────────────┐  │
                           │ fallback plan│◄─┘
                           └──────┬───────┘
                                  │ also fails?
                                  ▼
                           🙋 ask a human (escalate)

Your recovery toolbox

Retry with exponential backoff (why we wait longer each time)

Attempt 1  ✗ fail ──► wait 1s
Attempt 2  ✗ fail ──► wait 2s
Attempt 3  ✗ fail ──► wait 4s
Attempt 4  ✓ success! (server had recovered)

time ─────────────────────────────────────────────►
      |‐|    |‐‐|    |‐‐‐‐|
      1s     2s      4s       gaps grow so we don't hammer a struggling server
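The 1s, 2s, 4s schedule above is just a starting wait multiplied by a constant factor on each attempt. A minimal sketch, assuming a hypothetical helper named `backoff_delays`; the `cap` and `jitter` parameters are additions that the schedule above does not use, included because very long or perfectly synchronized waits are common follow-on problems.

```python
import random


def backoff_delays(base=1.0, factor=2.0, retries=3, cap=30.0, jitter=False):
    """Return the waits (in seconds) between successive attempts."""
    delays = []
    for n in range(retries):
        delay = min(base * factor ** n, cap)  # 1s, 2s, 4s, ... up to the cap
        if jitter:
            # Optional: randomise each wait so many clients don't retry in lockstep.
            delay = random.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays())  # [1.0, 2.0, 4.0]
```

With the defaults this reproduces the waits shown in the timeline. Raising `retries` without a `cap` would make the later waits grow very quickly, which is why the cap is there.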

A tiny code example (try/except + retry + fallback)

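This wraps a flaky tool. It retries timeouts with growing waits, and if all retries fail it uses a fallback instead of crashing. Any other exception still propagates, so real bugs stay visible. cache_lookup and escalate_to_human are placeholders for your own cache and hand-off logic.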

import time

def call_with_recovery(tool, args, max_retries=3):
    delay = 1.0
    for attempt in range(1, max_retries + 1):
        try:
            return tool(args)                       # try the risky action
        except TimeoutError as e:                   # catch only what we expect
            print(f"attempt {attempt} failed: {e}")
            if attempt == max_retries:
                break
            time.sleep(delay)                       # wait...
            delay *= 2                              # ...longer each time (backoff)
    return fallback(args)                          # plan B, never crash

def fallback(args):
    cached = cache_lookup(args)
    if cached is not None:
        return cached
    return escalate_to_human(args)                 # last resort

When recovery matters most

Scenario | Recommendation | Why
Calling external APIs / network tools | ✅ Retry + fallback | Networks are flaky; transient errors are normal.
Parsing the LLM's output into JSON/structured data | ✅ Validate + re-ask | Models sometimes return malformed or extra text.
Irreversible actions (payments, deletions) | ✅ Escalate, don't blind-retry | Retrying could double-charge or double-delete.
A pure local calculation you control | ❌ Minimal handling | Little can fail; a simple try/except is enough.
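The "Validate + re-ask" row can be sketched as follows. This is an illustration under assumptions, not a fixed recipe: `ask_model` is a hypothetical callable that takes a prompt string and returns the model's reply as a string, and checking for required keys is only one simple way to validate the shape.

```python
import json


def parse_with_reask(ask_model, prompt, required_keys, max_attempts=3):
    """Ask for JSON, validate its shape, and re-ask with the error if invalid.

    `ask_model` is a hypothetical callable: prompt string in, reply string out.
    """
    message = prompt
    for _ in range(max_attempts):
        reply = ask_model(message)
        try:
            data = json.loads(reply)
            if not isinstance(data, dict):
                raise ValueError("expected a JSON object")
            missing = [k for k in required_keys if k not in data]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            # Feed the problem back so the model can correct its own output.
            message = (
                f"{prompt}\n\nYour previous reply was invalid ({err}). "
                "Reply with only a valid JSON object."
            )
    raise ValueError("model never returned valid JSON; fall back or escalate")
```

The attempt cap matters here for the same reason it does with network retries: a model that keeps returning malformed output should end in a fallback or an escalation, not an endless loop.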

Recovery mistakes

Mistake | Consequence | Fix
Bare 'except:' that swallows every error silently. | Real bugs hide; the agent 'succeeds' while doing nothing useful. | Catch specific exceptions and log them; never swallow silently.
Retrying instantly, many times, with no wait. | You hammer a struggling server and may get rate-limited or banned. | Use exponential backoff and a sane max-retry cap.
Retrying non-idempotent actions like payments. | Customers get charged twice; data gets duplicated. | Only auto-retry safe (idempotent) steps; escalate the risky ones.
Trusting LLM output without validating it. | Malformed JSON crashes the next step downstream. | Validate the shape; if invalid, ask the model to correct it.
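The fix for non-idempotent actions can be sketched as a wrapper that is told whether a step is safe to repeat. The names `run_step` and `EscalationRequired` are hypothetical, and the caller is assumed to know which steps are idempotent; nothing here detects that automatically.

```python
import time


class EscalationRequired(Exception):
    """Hypothetical signal that a human must decide what happens next."""


def run_step(action, idempotent, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry `action` only when repeating it is safe; otherwise escalate."""
    attempts = max_retries if idempotent else 1  # risky steps get one shot
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except (TimeoutError, ConnectionError) as err:
            if attempt == attempts:
                # Out of safe retries, or the step must not be repeated blindly
                # (a timed-out payment may already have gone through).
                raise EscalationRequired(f"gave up after {attempt} attempt(s)") from err
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```

A read-only lookup would be called with `idempotent=True`, while a payment would use `idempotent=False` so that a single timeout goes straight to a person instead of risking a second charge.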

Key takeaways

Tools, networks and LLMs all fail; the agent must catch errors, not crash.
Recovery options in order: retry with backoff, then fallback, then escalate to a human.
Use exponential backoff and a retry cap so you don't hammer struggling services.
Never auto-retry irreversible actions; validate LLM output before using it.
