Resource-Aware Optimization (Cost & Latency)
Every LLM call costs time and money — spend wisely
Imagine your agent is riding in a metered taxi: every call to the model is the meter ticking, in rupees (cost) and seconds (latency). You wouldn't take a luxury cab to go next door. Resource-aware optimization means matching the cheapest, fastest option that still does the job to each step: a small model for easy work, caching so you don't pay twice for the same answer, and limits so the meter can't run forever.
Key points
- Cost ≈ tokens (input + output) × model price; latency ≈ time per call × number of sequential calls.
- Use a small model for easy steps, a big model only when it's truly needed.
- Cache, limit tokens, cap loop steps, and batch calls to save both.
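As a rough illustration of the two formulas above, the sketch below estimates cost and latency for a run of sequential calls. The `ModelSpec` type, the per-token prices, and the per-call times are placeholder assumptions made up for the example, not real provider pricing.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    # All numbers are hypothetical placeholders, not real provider pricing.
    price_per_1k_tokens: float  # rupees per 1,000 tokens
    seconds_per_call: float     # average wall-clock time of one call


def estimate(spec: ModelSpec, tokens_per_call: int, calls: int) -> tuple[float, float]:
    """Return (cost in rupees, latency in seconds) for `calls` sequential calls."""
    cost = calls * tokens_per_call / 1000 * spec.price_per_1k_tokens
    latency = calls * spec.seconds_per_call
    return cost, latency


small = ModelSpec(price_per_1k_tokens=0.05, seconds_per_call=0.5)
big = ModelSpec(price_per_1k_tokens=2.0, seconds_per_call=3.0)

# Same workload (5 calls of 800 tokens each) on each model.
small_cost, small_latency = estimate(small, tokens_per_call=800, calls=5)
big_cost, big_latency = estimate(big, tokens_per_call=800, calls=5)
print(f"small: ₹{small_cost:.2f} in {small_latency:.1f}s")
print(f"big:   ₹{big_cost:.2f} in {big_latency:.1f}s")
```

Both levers show up directly in the arithmetic: fewer tokens or a cheaper model lowers cost, and fewer sequential calls or a faster model lowers latency.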
What 'Resource-Aware Optimization' means
Resource-aware optimization is designing your agent so it uses the least time and money needed to do the job well. The main levers are: choosing a cheaper/smaller model for easy steps, caching repeated work, limiting tokens, capping loop steps, and batching many small calls into one. You trade a little engineering effort for big savings in cost and speed.
Note: Right-size the model, cache repeats, cap tokens & steps, batch calls. Pay only for what you truly need.
The cost meter: small model vs big model
EASY STEP (classify a message)        HARD STEP (write a legal brief)
──────────────────────────────        ───────────────────────────────
┌────────────────┐                    ┌────────────────┐
│ 🐣 small model │  fast + cheap      │ 🦣 big model   │  slow + pricey
│ ~₹0.01         │                    │ ~₹2.00         │
└────────────────┘                    └────────────────┘
meter: ₹                              meter: ₹₹₹₹₹
time:  ▸                              time:  ▸▸▸▸▸
Rule of thumb: route EASY work to the small model, save the big model for the few steps that really need it.
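To see why routing pays off, the sketch below takes the illustrative per-call figures from the diagram above (about ₹0.01 for the small model, about ₹2.00 for the big one) and compares sending every step to the big model against routing only the hard steps to it. The 1,000 steps and the 90% easy share are assumptions chosen for the example.

```python
SMALL_COST = 0.01  # rupees per call, illustrative figure from the diagram
BIG_COST = 2.00    # rupees per call, illustrative figure from the diagram


def total_cost(steps: int, easy_share: float, route: bool) -> float:
    """Cost of `steps` calls; if `route` is True, easy steps use the small model."""
    easy = round(steps * easy_share)
    hard = steps - easy
    if route:
        return easy * SMALL_COST + hard * BIG_COST
    return steps * BIG_COST


all_big = total_cost(1000, easy_share=0.9, route=False)  # every step on the big model
routed = total_cost(1000, easy_share=0.9, route=True)    # easy steps on the small model
print(f"all big: ₹{all_big:.2f}, routed: ₹{routed:.2f}")
```

With these assumed numbers, the bill drops from ₹2,000 to ₹209, and almost all of what remains is the 100 hard steps that genuinely need the big model.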
The 5 main cost/latency levers
- Right-size the model — Route easy steps to a small/cheap model, hard steps to the big one. Example: Classify with a small model; only draft long answers with the big one.
- Cache results — Store answers to repeated questions so you don't pay/wait twice. Example: Same FAQ asked again → return the saved answer instantly, ₹0.
- Limit tokens — Trim the prompt and cap the output length; fewer tokens = less cost & latency. Example: Summarize old history instead of resending all of it every step.
- Cap loop steps — A max-steps limit stops a runaway agent from racking up calls. Example: Stop after 8 steps and return the best result so far.
- Batch calls — Combine many tiny requests into one call instead of dozens. Example: Classify 50 comments in one prompt, not 50 separate calls.
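The batching lever can be sketched as follows. The `llm` argument is a hypothetical stand-in for whatever model client you use, `fake_llm` is a stub so the example runs offline, and the numbered-line reply format is one possible convention rather than a required one.

```python
from typing import Callable


def classify_batch(comments: list[str], llm: Callable[[str], str]) -> list[str]:
    """Classify all comments with ONE model call instead of one call per comment."""
    numbered = "\n".join(f"{i}. {text}" for i, text in enumerate(comments, start=1))
    prompt = (
        "Label each comment as spam or ham. "
        "Reply with one line per comment in the form '<number>: <label>'.\n"
        + numbered
    )
    reply = llm(prompt)

    labels = ["unknown"] * len(comments)  # default if the model skips a line
    for line in reply.splitlines():
        number, sep, label = line.partition(":")
        if sep and number.strip().isdigit():
            index = int(number.strip()) - 1
            if 0 <= index < len(comments):
                labels[index] = label.strip().lower()
    return labels


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call (assumption for this sketch)."""
    fake_llm.calls += 1
    out = []
    for line in prompt.splitlines():
        number, sep, text = line.partition(". ")
        if sep and number.isdigit():
            out.append(f"{number}: {'spam' if 'free' in text.lower() else 'ham'}")
    return "\n".join(out)


fake_llm.calls = 0
labels = classify_batch(["Win a FREE phone", "See you at 5pm"], fake_llm)
print(labels, "model calls:", fake_llm.calls)
```

The parser fills in `"unknown"` for any comment the reply skips, because a batched reply can be incomplete or malformed in ways a single-item call cannot.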
Caching: don't pay for the same answer twice
Question arrives
       │
       ▼
┌──────────────┐   HIT    ┌─────────────────────┐
│ check cache  │────────► │ return saved answer │  ₹0, instant
└──────┬───────┘          └─────────────────────┘
       │ MISS
       ▼
┌──────────────┐   call   ┌────────────┐   save   ┌────────────────┐
│ LLM (pay)    │────────► │ answer     │────────► │ store in cache │
└──────────────┘          └────────────┘          └────────────────┘
                                                   (next time = HIT)
A tiny code example (route by difficulty + cache)
This picks a cheap model for easy work and an expensive one only when needed, and checks a cache first so repeats are free.
CACHE = {}

def llm(prompt, model, max_tokens):
    # Placeholder: plug in your own model client call here.
    raise NotImplementedError("connect a model client")

def ask(prompt, hard=False, max_tokens=300):
    model = "big-expensive" if hard else "small-cheap"  # 1. right-size
    key = (model, prompt, max_tokens)  # same prompt on another model is a separate entry
    if key in CACHE:  # 2. cache hit = free + instant
        return CACHE[key]
    answer = llm(prompt, model=model, max_tokens=max_tokens)  # 3. cap tokens
    CACHE[key] = answer  # 4. remember for next time
    return answer

def classify(msg):
    return ask(f"Label as spam/ham: {msg}", hard=False)  # easy → cheap

def write_brief(topic):
    return ask(f"Write a detailed brief on {topic}", hard=True)  # hard → big
When to optimize resources
| Scenario | Recommendation | Why |
|---|---|---|
| High traffic / many users hitting the agent | ✅ Cache + small models | Small per-call savings multiply hugely at scale. |
| Repetitive questions (FAQs, lookups) | ✅ Cache aggressively | Same input → reuse the saved answer for free. |
| Latency-sensitive UX (user is waiting) | ✅ Smaller model + fewer steps | Cuts the time the user stares at a spinner. |
| Rare, high-stakes task where quality is everything | ❌ Don't skimp | Here correctness beats saving a few rupees. |
Cost & latency mistakes
| Mistake | Consequence | Fix |
|---|---|---|
| Using the biggest model for every single step. | Bills explode and responses are needlessly slow. | Route easy steps to a small model; reserve the big one for hard steps. |
| Re-sending the entire history on every loop. | Token count (and cost/latency) grows every step. | Summarize or trim old context; send only what the step needs. |
| No cache, so identical questions are re-computed. | You pay full price for answers you already have. | Cache by input; return saved answers on a hit. |
| No max-steps cap on the loop. | A stuck agent silently runs up a giant bill. | Cap steps and return the best-so-far result on the limit. |
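Two of the fixes in the table, trimming old context and capping loop steps, can be sketched together. Everything here is illustrative: `step_fn`, `keep_last`, the omitted-turns marker, and the score-based "best so far" rule are assumptions made for the example, and a real agent would call a model inside `step_fn`.

```python
from typing import Callable, Optional

# A step takes the (trimmed) context and returns (result, score, done).
Step = Callable[[list[str]], tuple[str, float, bool]]


def trim_history(history: list[str], keep_last: int = 4) -> list[str]:
    """Send only the most recent turns, with a marker for what was dropped."""
    if len(history) <= keep_last:
        return list(history)
    dropped = len(history) - keep_last
    return [f"[{dropped} earlier turns omitted]"] + history[-keep_last:]


def run_agent(step_fn: Step, max_steps: int = 8, keep_last: int = 4) -> Optional[str]:
    """Run until `step_fn` reports done or `max_steps` is hit; return the best result."""
    history: list[str] = []
    best_result, best_score = None, float("-inf")
    for _ in range(max_steps):
        result, score, done = step_fn(trim_history(history, keep_last))
        history.append(result)
        if score > best_score:
            best_result, best_score = result, score
        if done:
            break
    return best_result


counter = {"calls": 0}


def never_done(context: list[str]) -> tuple[str, float, bool]:
    """Stub step that never finishes, to show the cap stopping the loop."""
    counter["calls"] += 1
    return f"draft {counter['calls']}", float(counter["calls"] % 5), False


best = run_agent(never_done, max_steps=8)
print(best, "after", counter["calls"], "calls")
```

Even though the stub never signals completion, the loop stops at 8 calls and still returns its highest-scoring draft, so a stuck agent costs a bounded amount.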
Cost intuition cheat-sheet
- Cost ≈ tokens × price; fewer tokens and cheaper models = cheaper agent.
- Cache first, call the model second — repeats should cost ₹0.
- Right-size the model per step; always cap loop steps and output tokens.
Key takeaways
- Cost ≈ tokens (input + output) × model price; latency ≈ time per call × number of sequential calls.
- Route easy steps to a cheap small model; use the big model only when truly needed.
- Cache repeated work, limit tokens, cap loop steps and batch calls to save time and money.
- Small per-call savings multiply enormously at scale, so optimize before traffic grows.