
Exception Handling & Recovery

Real systems fail — your agent must not crash

Picture a delivery driver whose GPS suddenly dies. A bad driver just stops in the middle of the road. A good driver tries again, then switches to a paper map, and if truly stuck, calls the office for help. Exception Handling & Recovery means teaching your agent to do the same: catch the failure, retry sensibly, fall back to a plan B, or escalate to a human, instead of blowing up.

Key points

Tools, networks and even the LLM WILL fail sometimes.
Catch the error; don't let one failed step kill the whole run.
Recovery options: retry (with backoff) → fallback → ask a human.

What 'Exception Handling & Recovery' means

Exception handling is wrapping risky actions (tool calls, API requests, parsing the LLM's answer) so that when they fail, your code catches the error instead of crashing. Recovery is what you do next: retry the step (often after a short wait), use a fallback plan, or escalate to a person. The goal is graceful failure, not a stack trace in the user's face.

Note: Catch the error → decide: retry, fall back, or ask for help. Never just crash.
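
The "catch, then decide" step can be sketched as a small function. This is a minimal illustration, not a fixed rule: the names Recovery and choose_recovery are hypothetical, and treating TimeoutError and ConnectionError as the "temporary" errors is an assumption you would adapt to your own tools.

```python
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"
    FALLBACK = "fallback"
    ESCALATE = "escalate"


def choose_recovery(error: Exception, attempts_left: int, has_fallback: bool) -> Recovery:
    """Pick the next recovery step after a failure has been caught."""
    # Assumption: these two error types are treated as temporary.
    transient = isinstance(error, (TimeoutError, ConnectionError))
    if transient and attempts_left > 0:
        return Recovery.RETRY          # flaky error and budget left: try again
    if has_fallback:
        return Recovery.FALLBACK       # plan B exists: use it
    return Recovery.ESCALATE           # nothing safe left: ask a human


print(choose_recovery(TimeoutError("slow API"), attempts_left=2, has_fallback=True))
print(choose_recovery(TimeoutError("slow API"), attempts_left=0, has_fallback=True))
print(choose_recovery(ValueError("bad input"), attempts_left=3, has_fallback=False))
```

Keeping the decision in one place makes the policy easy to read and test, separately from the code that actually calls the tool.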

Fragile agent vs resilient agent

FRAGILE (crashes)           RESILIENT (recovers)
─────────────────           ────────────────────

call tool                   call tool
    │                           │
    ▼                           ▼
❌ timeout                   ❌ timeout
    │                           │ caught!
    ▼                           ▼
💥 WHOLE AGENT              ┌──────────────┐
   CRASHES                  │ retry (wait) │────┐
   user sees a              └──────────────┘    │
   stack trace 😱               │ still fails?  │
                                ▼               │
                            ┌──────────────┐    │
                            │ fallback plan│◄───┘
                            └──────┬───────┘
                                   │ also fails?
                                   ▼
                            🙋 ask a human (escalate)

Your recovery toolbox

Retry with backoff — Try again, waiting a bit longer each time, for flaky/temporary errors. Example: Wait 1s, then 2s, then 4s (exponential backoff) before giving up.
Fallback — A simpler plan B when the main path keeps failing. Example: Search API down? Use a cached answer or a cheaper search.
Escalate to human — When the agent truly can't proceed safely, hand off to a person. Example: "I couldn't book this; here's what I tried — please confirm."
Validate LLM output — Treat the model's answer as untrusted; check shape before using it. Example: Expecting JSON? Parse it; if it fails, ask the LLM to fix it.
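
The "validate LLM output" tool can be sketched like this. It is a minimal example under stated assumptions: parse_with_reask is a hypothetical helper, ask_llm stands for whatever function sends a prompt to your model and returns text, and fake_llm only simulates a model that gets it wrong once.

```python
import json
from typing import Callable


def parse_with_reask(ask_llm: Callable[[str], str], prompt: str,
                     required_keys: set, max_attempts: int = 2) -> dict:
    """Parse the model's reply as a JSON object, re-asking if it is malformed."""
    current_prompt = prompt
    for _ in range(max_attempts):
        reply = ask_llm(current_prompt)
        try:
            data = json.loads(reply)
        except json.JSONDecodeError as exc:
            problem = f"invalid JSON ({exc.msg})"
        else:
            if not isinstance(data, dict):
                problem = "expected a JSON object"
            elif required_keys - set(data):
                problem = f"missing keys: {sorted(required_keys - set(data))}"
            else:
                return data                      # shape is valid: safe to use
        # Tell the model what was wrong and ask again.
        current_prompt = (f"{prompt}\n\nYour previous reply was rejected: {problem}. "
                          "Reply with only a valid JSON object.")
    raise ValueError(f"model output still invalid after {max_attempts} attempts")


# Simulated model: first reply has extra text, second reply is clean JSON.
_replies = iter(['Sure! Here is the JSON: {"city": "Pune"}',
                 '{"city": "Pune", "seats": 2}'])


def fake_llm(prompt: str) -> str:
    return next(_replies)


booking = parse_with_reask(fake_llm, "Extract city and seats as JSON.", {"city", "seats"})
print(booking)
```

The final ValueError is deliberate: once the re-ask budget is spent, the caller should fall back or escalate rather than pass bad data downstream.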

Retry with exponential backoff (why we wait longer each time)

Attempt 1  ✗ fail ──► wait 1s
Attempt 2  ✗ fail ──► wait 2s
Attempt 3  ✗ fail ──► wait 4s
Attempt 4  ✓ success! (server had recovered)

time ─────────────────────────────────────────────►
      |‐|    |‐‐|    |‐‐‐‐|
      1s     2s      4s       gaps grow so we don't hammer a struggling server
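
The waits above follow a simple rule: the wait after failed attempt n is base × 2^(n−1), so 1s, 2s, 4s for a base of 1 second. A small sketch of that arithmetic follows; backoff_delays is a hypothetical helper, and the cap parameter is an added assumption (not part of the diagram) that stops the wait from growing forever.

```python
def backoff_delays(retries: int, base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0) -> list:
    """Return the wait (in seconds) after each failed attempt."""
    # Wait number i (0-based) is base * factor**i, limited to `cap` seconds.
    return [min(cap, base * factor ** i) for i in range(retries)]


print(backoff_delays(3))   # [1.0, 2.0, 4.0]
print(backoff_delays(7))   # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

Many production systems also add a little random "jitter" to each wait so that many clients do not retry at the same instant; that is left out here to keep the numbers matching the diagram.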

A tiny code example (try/except + retry + fallback)

This wraps a flaky tool. It retries with growing waits, and if all retries fail it uses a fallback instead of crashing.

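import time

# Hypothetical stand-ins so the example runs on its own;
# replace them with your real cache and human hand-off.
def cache_lookup(args):
    return None                                    # pretend nothing is cached

def escalate_to_human(args):
    return f"Needs human review: {args!r}"         # last-resort hand-off message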

def call_with_recovery(tool, args, max_retries=3):
    delay = 1.0
    for attempt in range(1, max_retries + 1):
        try:
            return tool(args)                       # try the risky action
        except TimeoutError as e:                   # catch only what we expect
            print(f"attempt {attempt} failed: {e}")
            if attempt == max_retries:
                break
            time.sleep(delay)                       # wait...
            delay *= 2                              # ...longer each time (backoff)
    return fallback(args)                          # plan B, never crash

def fallback(args):
    cached = cache_lookup(args)
    if cached is not None:
        return cached
    return escalate_to_human(args)                 # last resort

When recovery matters most

Scenario	Recommendation	Why
Calling external APIs / network tools	✅ Retry + fallback	Networks are flaky; transient errors are normal.
Parsing the LLM's output into JSON/structured data	✅ Validate + re-ask	Models sometimes return malformed or extra text.
Irreversible actions (payments, deletions)	✅ Escalate, don't blind-retry	Retrying could double-charge or double-delete.
A pure local calculation you control	❌ Minimal handling	Little can fail; a simple try/except is enough.
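
The "escalate, don't blind-retry" row can be sketched as a guard around the retry loop. This is an illustration under assumptions: run_action and NeedsHuman are hypothetical names, the caller is trusted to say whether an action is idempotent, and charge_card only simulates a payment call that times out.

```python
import time
from typing import Any, Callable


class NeedsHuman(Exception):
    """Raised when the agent should hand the step to a person."""


def run_action(action: Callable[[], Any], *, idempotent: bool, max_retries: int = 3,
               sleep: Callable[[float], None] = time.sleep) -> Any:
    """Retry only idempotent actions; anything else gets a single attempt."""
    attempts = max(1, max_retries) if idempotent else 1
    delay = 1.0
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TimeoutError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay)           # back off before the next try
                delay *= 2
    raise NeedsHuman(f"gave up after {attempts} attempt(s): {last_error}") from last_error


calls = {"charge": 0}


def charge_card():
    # Simulated payment: we cannot tell whether the charge went through.
    calls["charge"] += 1
    raise TimeoutError("payment gateway timed out")


try:
    run_action(charge_card, idempotent=False)
except NeedsHuman as exc:
    print(f"escalating: {exc}")

print(calls["charge"])  # 1: the payment was attempted exactly once
```

A timeout on a payment does not prove the payment failed, which is why the non-idempotent path stops after one attempt and asks a person to check.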

Recovery mistakes

Mistake	Consequence	Fix
Bare 'except:' that swallows every error silently.	Real bugs hide; the agent 'succeeds' while doing nothing useful.	Catch specific exceptions and log them; never swallow silently.
Retrying instantly, many times, with no wait.	You hammer a struggling server and may get rate-limited or banned.	Use exponential backoff and a sane max-retry cap.
Retrying non-idempotent actions like payments.	Customers get charged twice; data gets duplicated.	Only auto-retry safe (idempotent) steps; escalate the risky ones.
Trusting LLM output without validating it.	Malformed JSON crashes the next step downstream.	Validate the shape; if invalid, ask the model to correct it.
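
The fix for the first mistake in the table (catch specific exceptions and log them) might look like the sketch below. The function and tool names are hypothetical, and treating TimeoutError and ConnectionError as the expected failures is an assumption.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("agent.tools")


def search_or_none(search_tool, query):
    """Call a search tool; return None on an expected network failure."""
    try:
        return search_tool(query)
    except (TimeoutError, ConnectionError) as exc:
        # Specific and logged: the failure is visible, and the caller can recover.
        logger.warning("search failed for %r: %s", query, exc)
        return None
    # Anything else (KeyError, TypeError, ...) is probably a bug, so it propagates.


def down_tool(query):
    raise ConnectionError("search API unreachable")


def buggy_tool(query):
    return {"results": []}["hits"]     # KeyError: a programming mistake


print(search_or_none(down_tool, "flights to Goa"))   # warning is logged, then None

try:
    search_or_none(buggy_tool, "flights to Goa")
except KeyError as exc:
    print(f"bug surfaced instead of being swallowed: {exc!r}")
```

With a bare except, both tools would have returned None and the bug in buggy_tool would have stayed hidden.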

Recovery rules of thumb

Catch specific errors, log them, never use a silent bare except.
Recovery order: retry with backoff → fallback → escalate to a human.
Never blindly retry actions that change money or delete data.

Key takeaways

Tools, networks and LLMs all fail; the agent must catch errors, not crash.
Recovery options in order: retry with backoff, then fallback, then escalate to a human.
Use exponential backoff and a retry cap so you don't hammer struggling services.
Never auto-retry irreversible actions; validate LLM output before using it.

