
Resource-Aware Optimization (Cost & Latency)

Every LLM call costs time and money — spend wisely

Imagine your agent is a taxi meter: every call to the model is the meter ticking, in rupees (cost) and seconds (latency). You wouldn't take a luxury cab to go next door. Resource-aware optimization means matching the cheapest, fastest option to each step: a small model for easy work, caching so you don't pay twice for the same answer, and limits so the meter can't run forever.

Key points

Cost ≈ tokens × model price; latency ≈ time per call × number of calls.
Use a small model for easy steps, a big model only when it's truly needed.
Cache, limit tokens, cap loop steps, and batch calls to save both.
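To make the two formulas concrete, here is a minimal sketch that plugs numbers into them. The function name `estimate_run` and every figure (price per 1,000 tokens, seconds per call) are assumptions picked for easy arithmetic, not real provider prices. It also assumes the calls run one after another; calls made in parallel would not add up this way.

```python
def estimate_run(calls, tokens_per_call, price_per_1k_tokens, seconds_per_call):
    """Rough cost and latency for `calls` sequential model calls."""
    total_tokens = calls * tokens_per_call
    cost = total_tokens / 1000 * price_per_1k_tokens   # cost ≈ tokens × price
    latency = calls * seconds_per_call                 # latency ≈ time per call × calls
    return cost, latency

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
cost, latency = estimate_run(calls=8, tokens_per_call=1500,
                             price_per_1k_tokens=0.5, seconds_per_call=2.0)
print(f"cost ~ ₹{cost:.2f}, latency ~ {latency:.0f} s")
```

Halving either the number of calls or the tokens per call halves the cost in this model, which is why the levers below target exactly those two quantities.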

What 'Resource-Aware Optimization' means

Resource-aware optimization is designing your agent so it uses the least time and money needed to do the job well. The main levers are: choosing a cheaper/smaller model for easy steps, caching repeated work, limiting tokens, capping loop steps, and batching many small calls into one. You trade a little engineering effort for big savings in cost and speed.

Note: Right-size the model, cache repeats, cap tokens & steps, batch calls. Pay only for what you truly need.

The cost meter: small model vs big model

EASY STEP (classify a message)      HARD STEP (write a legal brief)
──────────────────────────────      ──────────────────────────────

┌────────────────┐                  ┌────────────────┐
│ 🐣 small model │  fast + cheap    │ 🦣 big model   │  slow + pricey
│ ~₹0.01         │                  │ ~₹2.00         │
└────────────────┘                  └────────────────┘
meter: ₹                            meter: ₹₹₹₹₹
time:  ▸                            time:  ▸▸▸▸▸

Rule of thumb: route EASY work to the small model, save the big model for the few steps that really need it.
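The rule of thumb can be checked with a little arithmetic. The sketch below reuses the per-call figures from the diagram (~₹0.01 and ~₹2.00), expressed in paise so the sums stay in whole numbers. The traffic mix of 900 easy and 100 hard calls is a made-up assumption, as is the helper name `total_cost_paise`.

```python
SMALL_PAISE = 1      # ~₹0.01 per easy call (figure from the diagram above)
BIG_PAISE = 200      # ~₹2.00 per hard call (figure from the diagram above)

def total_cost_paise(easy_calls, hard_calls, route=True):
    """Total cost in paise; with route=False every call goes to the big model."""
    easy_price = SMALL_PAISE if route else BIG_PAISE
    return easy_calls * easy_price + hard_calls * BIG_PAISE

# Hypothetical traffic mix: 900 easy calls and 100 hard calls.
all_big = total_cost_paise(900, 100, route=False)
routed = total_cost_paise(900, 100, route=True)
print(f"all big: ₹{all_big / 100:.2f}, routed: ₹{routed / 100:.2f}")
```

Under these assumed numbers, routing leaves the hard-step bill unchanged and almost eliminates the easy-step bill, which is where most of the calls are.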

The 5 main cost/latency levers

Right-size the model — Route easy steps to a small/cheap model, hard steps to the big one. Example: Classify with a small model; only draft long answers with the big one.
Cache results — Store answers to repeated questions so you don't pay/wait twice. Example: Same FAQ asked again → return the saved answer instantly, ₹0.
Limit tokens — Trim the prompt and cap the output length; fewer tokens = less cost & latency. Example: Summarize old history instead of resending all of it every step.
Cap loop steps — A max-steps limit stops a runaway agent from racking up calls. Example: Stop after 8 steps and return the best result so far.
Batch calls — Combine many tiny requests into one call instead of dozens. Example: Classify 50 comments in one prompt, not 50 separate calls.
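Batching, the last lever above, might look like the sketch below: pack the comments into one numbered prompt, then map the numbered reply back onto the inputs. The helper names and the reply format `<number>: <label>` are assumptions, and `fake_llm` is a stand-in so the example runs without a real model. A real model may skip or reorder lines, which is why the parser matches by number rather than by position.

```python
def build_batch_prompt(comments):
    """Pack many comments into one numbered prompt."""
    lines = [f"{i}. {text}" for i, text in enumerate(comments, start=1)]
    return ("Label each comment as spam or ham. "
            "Reply with one line per comment in the form '<number>: <label>'.\n"
            + "\n".join(lines))

def parse_batch_reply(reply, expected):
    """Turn '1: spam' lines back into a list of labels, in input order."""
    labels = {}
    for line in reply.splitlines():
        number, sep, label = line.partition(":")
        if sep and number.strip().isdigit():
            labels[int(number)] = label.strip().lower()
    # None marks any comment the model skipped, so the caller can retry it.
    return [labels.get(i) for i in range(1, expected + 1)]

def fake_llm(prompt):
    """Stand-in for a real model call (hypothetical): flags comments containing 'free'."""
    out = []
    for line in prompt.splitlines()[1:]:          # skip the instruction line
        number, _, text = line.partition(". ")
        out.append(f"{number}: {'spam' if 'free' in text.lower() else 'ham'}")
    return "\n".join(out)

comments = ["Win a FREE phone now", "Great article, thanks", "free crypto here"]
reply = fake_llm(build_batch_prompt(comments))    # one call instead of three
labels = parse_batch_reply(reply, expected=len(comments))
print(labels)
```

The sketch assumes each comment fits on one line; comments containing newlines would need escaping before they go into the prompt.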

Caching: don't pay for the same answer twice

Question arrives
       │
       ▼
┌──────────────┐   HIT    ┌────────────────────────┐
│ check cache  │────────► │ return saved answer    │  ₹0, instant
└──────┬───────┘          └────────────────────────┘
       │ MISS
       ▼
┌──────────────┐   call   ┌────────────┐   save   ┌────────────────┐
│ LLM (pay)    │────────► │ answer     │────────► │ store in cache │
└──────────────┘          └────────────┘          └────────────────┘
                                                   (next time = HIT)
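The HIT and MISS paths in the flow above can be traced with a small wrapper that counts each outcome. This is a minimal sketch, not a production cache: the class name `CachedLLM` and the stand-in `fake_model` are assumptions, the cache lives in memory only, and entries never expire.

```python
class CachedLLM:
    """Wrap a model-calling function with an in-memory cache and hit/miss counters."""

    def __init__(self, call_model):
        self.call_model = call_model   # any function: prompt -> answer
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def ask(self, prompt):
        if prompt in self.cache:       # HIT: return the saved answer, no model call
            self.hits += 1
            return self.cache[prompt]
        self.misses += 1               # MISS: pay for one model call
        answer = self.call_model(prompt)
        self.cache[prompt] = answer    # store so the next identical prompt is a HIT
        return answer

def fake_model(prompt):
    """Hypothetical stand-in for a paid model call."""
    return f"answer to: {prompt}"

bot = CachedLLM(fake_model)
for question in ["refund policy?", "shipping time?", "refund policy?"]:
    bot.ask(question)
print(bot.hits, bot.misses)
```

Tracking the hit rate this way shows whether the cache is earning its keep: a cache that almost never hits adds complexity without saving calls.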

A tiny code example (route by difficulty + cache)

This picks a cheap model for easy work and an expensive one only when needed, and checks a cache first so repeats are free.

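CACHE = {}

def llm(prompt, model, max_tokens):
    # Placeholder: swap in your provider's client call here.
    raise NotImplementedError("connect a real model client")

def ask(prompt, hard=False, max_tokens=300):
    model = "big-expensive" if hard else "small-cheap"  # 1. right-size
    key = (prompt, model, max_tokens)  # key on everything that changes the answer
    if key in CACHE:                   # 2. cache hit = free + instant
        return CACHE[key]
    answer = llm(prompt, model=model, max_tokens=max_tokens)  # 3. cap tokens
    CACHE[key] = answer                # 4. remember for next time
    return answer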

def classify(msg):
    return ask(f"Label as spam/ham: {msg}", hard=False)  # easy → cheap

def write_brief(topic):
    return ask(f"Write a detailed brief on {topic}", hard=True)  # hard → big

When to optimize resources

Scenario	Recommendation	Why
High traffic / many users hitting the agent	✅ Cache + small models	Small per-call savings multiply hugely at scale.
Repetitive questions (FAQs, lookups)	✅ Cache aggressively	Same input → reuse the saved answer for free.
Latency-sensitive UX (user is waiting)	✅ Smaller model + fewer steps	Cuts the time the user stares at a spinner.
Rare, high-stakes task where quality is everything	❌ Don't skimp	Here correctness beats saving a few rupees.

Cost & latency mistakes

Mistake	Consequence	Fix
Using the biggest model for every single step.	Bills explode and responses are needlessly slow.	Route easy steps to a small model; reserve the big one for hard steps.
Re-sending the entire history on every loop.	Token count (and cost/latency) grows every step.	Summarize or trim old context; send only what the step needs.
No cache, so identical questions are re-computed.	You pay full price for answers you already have.	Cache by input; return saved answers on a hit.
No max-steps cap on the loop.	A stuck agent silently runs up a giant bill.	Cap steps and return the best-so-far result on the limit.
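The fix for re-sending the entire history can be sketched as follows: keep the newest messages verbatim and collapse everything older into a single summary entry. The function name `trim_history`, the `keep_last` default, and the placeholder summary text are all assumptions; in a real agent the `summarize` argument might be a call to a small, cheap model.

```python
def trim_history(messages, keep_last=4, summarize=None):
    """Return a shorter message list: one summary entry plus the newest messages."""
    if len(messages) <= keep_last:
        return list(messages)          # short enough already, send as-is
    old, recent = messages[:-keep_last], messages[-keep_last:]
    if summarize is None:
        # Placeholder summary; a real agent might ask a small model to summarize.
        summary = f"[summary of {len(old)} earlier messages]"
    else:
        summary = summarize(old)
    return [summary] + recent

history = [f"message {i}" for i in range(1, 11)]   # 10 messages so far
trimmed = trim_history(history, keep_last=4)
print(len(history), "->", len(trimmed))
print(trimmed[0])
```

With this shape, the prompt stops growing on every step: its size stays near `keep_last` messages plus one summary, however long the conversation runs.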

Cost intuition cheat-sheet

Cost ≈ tokens × price; fewer tokens and cheaper models = cheaper agent.
Cache first, call the model second — repeats should cost ₹0.
Right-size the model per step; always cap loop steps and output tokens.
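Capping loop steps can be as simple as a bounded `for` loop that remembers the best result seen so far. In this sketch the `step` callback and its `(result, score, done)` return shape are hypothetical, and `stuck_step` simulates an agent that never finishes on its own.

```python
def run_agent(step, max_steps=8):
    """Call `step` until it reports done or the step budget runs out.

    `step(i)` is a hypothetical callback returning (result, score, done).
    """
    best_result, best_score = None, float("-inf")
    steps_used = 0
    for i in range(max_steps):         # hard ceiling on model calls
        result, score, done = step(i)
        steps_used = i + 1
        if score > best_score:         # keep the best result so far
            best_result, best_score = result, score
        if done:
            break
    return best_result, steps_used

def stuck_step(i):
    """Simulated agent step that never finishes on its own."""
    return f"draft {i}", i % 5, False

result, steps_used = run_agent(stuck_step, max_steps=8)
print(result, steps_used)
```

Without the cap, `stuck_step` would loop forever; with it, the worst case is a known, fixed number of calls.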

Key takeaways

Cost ≈ tokens × model price; latency ≈ time per call × number of calls.
Route easy steps to a cheap small model; use the big model only when truly needed.
Cache repeated work, limit tokens, cap loop steps and batch calls to save time and money.
Small per-call savings multiply enormously at scale, so optimize before traffic grows.

Frequently Asked Questions

What is Resource-Aware Optimization?

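Resource-aware optimization means matching the cheapest, fastest option to each step: a small model for easy work, caching so you don't pay twice for the same answer, and limits so costs can't run away. Think of your agent as a taxi meter: every call to the model is the meter ticking, in rupees (cost) and seconds (latency). You wouldn't take a luxury cab to go next door.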

How does Resource-Aware Optimization work?

Resource-aware optimization is designing your agent so it uses the least time and money needed to do the job well. The main levers are: choosing a cheaper/smaller model for easy steps, caching repeated work, limiting tokens, capping loop steps, and batching many small calls into one.

What are the key takeaways about Resource-Aware Optimization?

Cost ≈ tokens × model price; latency ≈ time per call × number of calls. Route easy steps to a cheap small model; use the big model only when truly needed. Cache repeated work, limit tokens, cap loop steps and batch calls to save time and money. Small per-call savings multiply enormously at scale, so optimize before traffic grows.
