AI & AutomationFree to read

LLM Cost Management & Token Optimization

Spend Smart on AI - Save 80% Without Sacrificing Quality

Master the art of managing LLM costs in production. Learn token optimization, semantic caching, model routing, and budgeting strategies that save thousands of rupees monthly.

Understanding LLM Costs

Every Token Costs Money - And Tokens Add Up Fast

LLM costs are fundamentally different from traditional SaaS costs. Instead of paying a flat monthly fee, you pay per token - per word essentially. This means your costs scale directly with usage and can be wildly unpredictable if not managed carefully.

Real-World Analogy - Mobile Data Plans in India

Think of LLM costs like pre-Jio mobile data. Every MB counted, every video buffered was money. Users who did not track usage got bill shocks. LLM tokens are the same - every word in your prompt, every word in the response, costs money. A chatbot handling 10,000 conversations daily can easily cost Rs 50,000-5,00,000 per month depending on how efficiently you manage tokens. Just like Jio disrupted telecom with smart pricing, smart token management can slash your AI costs by 80%.

How LLM Pricing Works

Model	Input Cost / 1M tokens	Output Cost / 1M tokens	Relative Cost
GPT-4o	$2.50	$10.00	Medium
GPT-4o-mini	$0.15	$0.60	Very Low
Claude Sonnet	$3.00	$15.00	Medium-High
Claude Haiku	$0.25	$1.25	Low
Claude Opus	$15.00	$75.00	Very High

The Cost Equation

Total Cost = (Input Tokens x Input Price) + (Output Tokens x Output Price) x Number of Requests
Key Insight 1: Output tokens cost 3-5x more than input tokens
Key Insight 2: Your system prompt is sent with EVERY request - a 2000-token prompt across 10K requests = 20M input tokens
Key Insight 3: Conversation history grows with each turn - a 10-turn chat sends all previous turns every time

Note: Output tokens are 3-5x more expensive than input tokens. Telling the LLM to be concise is one of the cheapest and most effective cost optimizations.

Token Optimization Techniques

Squeeze Every Rupee Out of Every Token

Token optimization is about sending less data to the LLM and getting shorter responses - without sacrificing quality. These techniques alone can save 30-50% of your LLM spend.

Technique 1: Prompt Compression

Remove Redundancy: Audit your system prompts. Most contain repeated instructions, unnecessary examples, and verbose formatting.
Abbreviate Instructions: "You are a helpful assistant that helps users with their banking queries" can become "Banking assistant. Help with bank queries."
Use Few-Shot Wisely: 3 examples instead of 10. Each example costs tokens on every request.
Dynamic Context: Only include relevant parts of the system prompt based on query type.

Technique 2: Output Control

Max Tokens: Set max_tokens to limit response length. A 100-word answer does not need 2000-token capacity.
Conciseness Instructions: "Be concise. Answer in under 100 words." saves 50-70% output tokens.
Structured Output: JSON responses are typically shorter than prose. Use structured output for API calls.
Stop Sequences: Define stop tokens to prevent unnecessary generation after the answer.

Technique 3: Context Window Management

Conversation Summarization: After 5-7 turns, summarize the conversation into a compact context instead of sending full history.
Sliding Window: Only keep the last 3-5 turns of conversation, not the entire history.
RAG Chunking: Retrieve 3-5 relevant chunks, not 10-15. Each chunk costs input tokens.
Selective Context: Only include RAG context when the query actually needs it. Simple greetings do not need 2000 tokens of context.

Note: The biggest token waste is usually the system prompt. A 2000-token system prompt sent with every request costs more than the actual user queries. Audit and compress it.

Semantic Caching - The 40% Cost Reducer

Why Pay Twice for the Same Answer?

Many users ask similar (not identical) questions. Semantic caching stores LLM responses and returns cached answers for semantically similar queries, avoiding redundant API calls. This single technique can reduce costs by 30-50%.

How Semantic Caching Works

Step 1: New query comes in. Compute its embedding (vector representation).
Step 2: Search cache for existing queries with similar embeddings (cosine similarity above 0.95).
Step 3: If match found, return cached response. No LLM call needed.
Step 4: If no match, call LLM, store query embedding + response in cache.

Example: E-Commerce Support Bot

"What is your return policy?" - First query, calls LLM, caches response
"How do I return an item?" - Similar embedding, returns cached response
"Return policy kya hai?" - Similar meaning, returns cached response
"Can I return after 30 days?" - Different enough, calls LLM, caches new response

Result: Out of 1000 return-related queries, only 50-100 unique ones actually call the LLM.

Implementation Options

GPTCache: Open-source semantic cache library by Zilliz. Supports multiple embedding models and vector stores.
Redis + Vector Search: Use Redis with its vector similarity search for fast, in-memory caching.
LangChain CacheBackedEmbeddings: Built into LangChain for easy RAG caching.
Helicone: Proxy-based caching with zero code changes.

Cache Configuration Tips

Similarity threshold: 0.95 for factual queries, 0.90 for creative (lower = more cache hits but less precision)
TTL (Time to Live): 24 hours for dynamic data, 7 days for static knowledge
Cache invalidation: When prompts change or knowledge base updates
Monitor cache hit rate: Below 20% means your threshold is too high or queries are too diverse

Note: A 40% cache hit rate on a 10,000 daily request chatbot saves approximately $200-500 per month depending on the model. Semantic caching often pays for itself in the first week.

Model Routing - Right Model for the Right Task

Do Not Use a Rolls-Royce to Deliver Groceries

The single biggest cost optimization in LLM applications is using the right model for each query. Not every question needs GPT-4 or Claude Opus. Most queries (60-70%) can be handled by smaller, cheaper models without noticeable quality loss.

The Indian Transport Analogy

Auto-rickshaw (GPT-4o-mini, Haiku): Simple queries: FAQs, greetings, status checks, basic classification. Cost: Rs 0.01 per query. Handles 60-70% of all traffic.
Sedan (GPT-4o, Sonnet): Moderate complexity: summarization, content generation, multi-step reasoning. Cost: Rs 0.5 per query. Handles 20-25% of traffic.
Mercedes (Opus, GPT-4): Complex tasks: legal analysis, code review, critical decisions. Cost: Rs 5 per query. Only 5-10% of traffic.

How to Implement Model Routing

Rule-Based: Keywords or intent classification to route. "Hi" goes to mini, "explain quantum computing" goes to full model.
Classifier-Based: Train a small classifier to predict query complexity. Route based on predicted difficulty.
Cascade: Try mini first. If confidence is low or user gives negative feedback, retry with bigger model.
Semantic: Use embedding similarity to match query against categories pre-assigned to model tiers.

Real Cost Impact Example

Approach	Daily Cost (10K requests)	Monthly Cost
All GPT-4o	$50	$1,500
All GPT-4o-mini	$1.50	$45
Smart Routing (70/25/5)	$15	$450

Smart routing gives 70% savings over all-GPT-4o while maintaining quality for complex queries.

Note: Start by analyzing your query distribution. If 60%+ of queries are simple (FAQ, greetings, status), model routing alone can save 50-70% of your LLM spend.

Budgeting and Cost Monitoring

Set Budgets Before You Get the Bill

LLM costs can spiral out of control faster than any other cloud cost. A bug in a loop, a DDoS attack, or a viral feature launch can multiply costs 100x overnight. Proactive budgeting and monitoring are non-negotiable.

Horror Stories

The Infinite Loop: A developer bug caused the AI to call itself recursively. 50,000 API calls in 10 minutes. $2,000 bill.
The Viral Feature: A Twitter post made an AI feature go viral. 100x normal traffic. Monthly budget burned in 3 days.
The Prompt Injection: Attackers found the chatbot and sent massive prompts repeatedly. $500 in junk API calls overnight.

Cost Control Mechanisms

Per-User Rate Limits: Max 50 AI requests per user per hour. Prevents abuse and bot attacks.
Daily Budget Circuit Breaker: If daily spend exceeds Rs 5,000, pause AI features and alert team.
Per-Request Max Tokens: Hard limit on input + output tokens per request.
Monthly Budget Alerts: Alerts at 50%, 80%, and 100% of monthly budget.
Cost per Feature: Track which features consume the most tokens. Optimize or gate expensive features.

Building a Cost Dashboard

Real-Time: Current spend today, this week, this month
Per Model: Cost breakdown by GPT-4o, GPT-4o-mini, Claude, etc.
Per Feature: Chatbot costs vs summarization vs search
Per User: Top 10 most expensive users (often indicates abuse)
Trend: Daily cost trend with 7-day moving average
Projection: At current rate, monthly bill will be Rs X

Batch Processing for Cost Savings

OpenAI and Anthropic offer 50% discount on batch API calls. If you have non-urgent tasks (daily summaries, content generation, data analysis), queue them and process in batch.

Note: Set up cost alerts and rate limits BEFORE launching your AI feature. Not after. The bill shock stories always start with someone saying they planned to add monitoring later.

Interview Questions - LLM Cost Management

Q1: How would you reduce LLM costs by 80% for a high-traffic chatbot without sacrificing quality?

Answer: Layered approach: (1) Model routing - send 70% of simple queries to GPT-4o-mini (saves 60% immediately). (2) Semantic caching - cache similar questions, 30-40% of queries return cached responses (saves another 15-20%). (3) Prompt compression - audit system prompt, remove verbosity, reduce few-shot examples (saves 10-15% on remaining calls). (4) Output control - set max_tokens, add conciseness instructions (saves 5-10% on output tokens). (5) Conversation summarization after 5 turns instead of sending full history. Combined effect: 75-85% cost reduction.

Q2: Explain semantic caching and when it would NOT work well.

Answer: Semantic caching stores LLM responses keyed by query embedding. Similar queries (cosine similarity above threshold) return cached responses. Works well for: FAQs, customer support, documentation queries (repetitive, factual). Does NOT work well for: (1) Highly personalized queries requiring user-specific context. (2) Time-sensitive queries ("latest news"). (3) Creative tasks where variety is desired. (4) Very diverse query distributions with few repeated patterns. Cache hit rate below 15% means semantic caching adds overhead without sufficient savings.

Q3: Your AI chatbot daily cost suddenly jumped from Rs 5,000 to Rs 50,000. How do you investigate and fix?

Answer: Immediate: (1) Enable cost circuit breaker to cap spend. (2) Check for infinite loops or recursive calls in logs. (3) Check for traffic spike - is it organic or an attack? Investigation: (4) Compare per-user token usage - one user consuming 90% indicates abuse or bot. (5) Check if model routing is working - are queries incorrectly going to expensive models? (6) Check prompt length - did a deployment increase system prompt tokens? Fix: Rate limit abusive users, fix routing bugs, revert problematic deployments. Long-term: Add per-user cost caps, request size limits, and anomaly detection alerts.

Q4: How do you implement model routing in production?

Answer: Four approaches from simple to sophisticated: (1) Rule-based - keyword and intent mapping (greetings to mini, complex to full). Quick to implement. (2) Classifier - train a small model on query complexity. More accurate. (3) Cascade - start with cheapest model, escalate on low confidence or negative feedback. Best quality guarantee. (4) Semantic routing - embedding similarity against category centroids. Most flexible. Start with rule-based for quick wins, graduate to classifier as you collect data on query patterns.

Frequently Asked Questions

What is LLM Cost Management & Token Optimization?

Master the art of managing LLM costs in production. Learn token optimization, semantic caching, model routing, and budgeting strategies that save thousands of rupees monthly.

How does LLM Cost Management & Token Optimization work?

Every Token Costs Money - And Tokens Add Up Fast LLM costs are fundamentally different from traditional SaaS costs. Instead of paying a flat monthly fee, you pay per token - per word essentially.

Browse all AI & Automation topics →

Practice this on DevInterviewMaster

Read the full LLM Cost Management & Token Optimization breakdown with interactive demos, quizzes, and Hinglish notes.

Open the interactive topic →

800+ system-design, LLD, coding, and design-pattern topics. Unlock everything with Pro (₹499, one-time) or Ultimate (₹999, one-time) — lifetime access, no subscription.

LLM Cost Management & Token Optimization

Understanding LLM Costs

Token Optimization Techniques

Semantic Caching - The 40% Cost Reducer

Model Routing - Right Model for the Right Task

Budgeting and Cost Monitoring

Interview Questions - LLM Cost Management

Frequently Asked Questions

Related topics

Practice this on DevInterviewMaster