AI agents are great until your monthly API bill arrives. A single agent doing serious work (research, coding, outreach) on frontier models routinely costs $300-800/month in token spend if you don't engineer the cost side. Most teams don't — they run Opus 4.7 for everything and wonder why the bill is ugly.
This guide is the playbook we built for Klaws to run thousands of agents across dozens of tasks per day per user on a flat-credit model. The seven techniques below, stacked, cut realistic agent costs by 70-85% compared to a naive Opus-4.7-for-everything baseline.
Before we start: the only metric that matters is total cost to complete a task, not per-token price. A cheaper model that takes 3 retries and longer context is more expensive than a premium model that one-shots the job. Optimize for task cost, not token cost.
1. Model routing — pay frontier prices only for frontier work
Savings: 40-60% alone.
Most agent tasks are not hard. An agent reading Gmail and classifying emails as "urgent / newsletter / spam / FYI" doesn't need Opus 4.7 at $75/M output. A Flash-class model does it correctly at well under 1/100 of the cost.
The routing logic we use:
| Task type | Route to | Why |
|---|---|---|
| Chat / simple queries | Gemini 3 Flash ($0.10/$0.40 per M) | Fast, cheap, good enough |
| Coding / agent tasks | Kimi K2.6 ($0.60/$2.80) | SWE-Bench Pro lead, ~25x cheaper than Opus |
| Long documents (100K+ tokens) | Gemini 3.1 Pro ($2/$12) | 2M context, no degradation |
| Hard reasoning / writing | Claude Opus 4.7 ($15/$75) | Still best on certain reasoning benches |
| Tool-heavy multi-step | Qwen 3.6 Max (preview) | Terminal-Bench 2.0 lead |
A single router decision maps task → model. The router itself uses Flash ($0.0003 per decision). Net: frontier prices only when the task actually demands it.
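As a sketch, the routing step can be a cheap keyword pre-filter in front of the Flash-class router call. Everything here (model identifiers, task labels, keyword lists) is illustrative, not a production classifier:

```python
# Illustrative router sketch. Model ids, task labels, and keywords are
# assumptions for this example, not a real routing API.
ROUTES = {
    "chat":      "gemini-3-flash",    # simple queries
    "coding":    "kimi-k2.6",         # coding / agent tasks
    "long_doc":  "gemini-3.1-pro",    # 100K+ token documents
    "reasoning": "claude-opus-4.7",   # hard reasoning / writing
    "tools":     "qwen-3.6-max",      # tool-heavy multi-step
}

def classify(task: str) -> str:
    """Cheap heuristic pre-filter. In practice anything this misses goes
    to a Flash-class router model (one ~$0.0003 LLM call, not shown)."""
    text = task.lower()
    if len(task) > 100_000:
        return "long_doc"
    if any(k in text for k in ("bug", "refactor", "implement", "unit test")):
        return "coding"
    if any(k in text for k in ("browse", "fetch", "schedule", "then ")):
        return "tools"
    if any(k in text for k in ("prove", "analyze", "essay", "argue")):
        return "reasoning"
    return "chat"

def route(task: str) -> str:
    return ROUTES[classify(task)]
```

The point is the shape, not the keyword lists: one cheap decision up front, frontier pricing only when the label demands it.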
2. Prompt caching — pay for context once, not every call
Savings: 60-90% on cached portions.
Modern APIs (Anthropic prompt caching, Gemini implicit cache, OpenAI cache discounts) let you mark portions of the prompt as cacheable. The system prompt, tool definitions, and any stable context then cost 10% of their original price on subsequent calls within the cache TTL (typically 5 minutes).
Real numbers from a Klaws agent doing a 6-minute research loop with 113 LLM calls:
- Without caching: ~5M prompt tokens @ $0.10/M = $0.50
- With caching (90% cache hit): ~4.5M cached @ $0.01/M + ~500K fresh @ $0.10/M ≈ $0.10
Caching alone saved 80% on that single task.
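With Anthropic-style caching, marking the stable prefix is a single breakpoint on the last stable block. A minimal payload sketch, with a placeholder model id:

```python
# Sketch of a request payload with an Anthropic-style cache breakpoint.
# The model id is a placeholder; the cache_control field follows the
# shape of Anthropic's prompt-caching API.
def build_request(system_prompt: str, tools: list, user_msg: str) -> dict:
    return {
        "model": "claude-opus-4-7",  # placeholder model id
        "tools": tools,  # also cached: a breakpoint covers everything before it
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Everything up to this block (tool definitions + system
                # prompt) is written to cache; reads within the TTL cost
                # ~10% of the base input price.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_msg}],
    }
```

Only the final `messages` turn changes between calls, so on an agent loop nearly the whole prompt hits the cache.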
3. Context compression — keep the summary, drop the transcript
Savings: 50-70% on long sessions.
Agent loops grow their prompt with every tool call. After 30 iterations a simple research task can have 50-80K tokens of accumulated tool outputs, intermediate reasoning, and message history. Almost none of it is still needed — the agent only needs to know current state + goal.
Fix: run a compression pass when prompt > 70K tokens. Replace old messages with a structured summary:
## Active Task — what we're doing right now
## Completed Actions — what's already done (one line each)
## Resolved Questions — facts established
## Open Questions — what's unknown
Iterative compression — each new compression updates the previous summary instead of rebuilding from scratch — keeps the summary itself small. We run this via Gemini 2.5 Flash (~$0.03 per compression). Pays itself back within 2 subsequent LLM calls.
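A minimal sketch of the trigger logic, assuming a rough 4-characters-per-token estimate and a `summarize()` callback standing in for the cheap Flash-class LLM call:

```python
COMPRESS_ABOVE = 70_000  # tokens; the threshold from the text above

def rough_tokens(messages) -> int:
    # Crude estimate: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compress(messages, summarize, keep_last=6):
    """Replace everything but the last few messages with a structured
    summary. `summarize` stands in for the cheap LLM call that produces
    the Active Task / Completed Actions / ... sections."""
    if rough_tokens(messages) <= COMPRESS_ABOVE:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    return [{"role": "user", "content": summarize(old)}] + list(recent)
```

For the iterative variant, pass the previous summary into `summarize` along with the new messages so each pass updates rather than rebuilds.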
4. Tool output pruning — free summaries, no LLM needed
Savings: 30-50% on tool-heavy loops.
A web_search call returns 5-20K tokens of raw HTML. A read_file returns up to the full file. An agent rarely needs this raw data more than 1-2 turns later.
Fix: after N turns, replace old tool outputs with a 1-line summary: [web_search] searched "kimi k2.6 pricing" (3,421 chars). The agent knows the call happened and can re-run if needed.
This is free — no LLM involved, just regex replacement. In Klaws we keep the 10 most recent tool outputs verbatim; everything older gets pruned.
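A sketch of the pruning pass, assuming tool results are stored as messages with a "tool" role and a name field (the message shape here is an assumption):

```python
def prune_tool_outputs(messages, keep_recent=10):
    """Replace all but the `keep_recent` newest tool outputs with a
    one-line stub. No LLM call involved, so this pass is free."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    stale = tool_idx[:-keep_recent] if keep_recent else tool_idx
    for i in stale:
        m = messages[i]
        # Simplified stub; a real one would echo the call arguments too,
        # e.g. [web_search] searched "kimi k2.6 pricing" (3,421 chars).
        m["content"] = f"[{m['name']}] output pruned ({len(m['content'])} chars)"
    return messages
```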
5. Open-weight models for open work
Savings: 70-90% on tasks that don't need proprietary frontier.
Open-weight models closed the performance gap dramatically in 2026. Kimi K2.6 beats Claude Opus 4.6 on SWE-Bench Pro at $0.60/$2.80 per M — roughly 25x cheaper than Opus 4.7 at $15/$75. For coding agents, there is no reason to pay Opus pricing unless you specifically need Anthropic's safety tuning.
Same applies to:
- Content generation (smaller open models match GPT-3.5-class quality at 1/10 the cost)
- Embedding generation (open models match OpenAI ada-002 at 1/20)
- Summarization (any 7B+ model does this well enough)
Route per-task. Don't pay frontier for commodity work.
6. Batch scheduling — cheap slots, not real-time
Savings: 30-50% on scheduled work.
If a task is scheduled (daily digest, weekly competitor monitor, monthly report) it doesn't need to run at peak. Several providers offer batch APIs at 50% discount with 24-hour turnaround:
- Anthropic Message Batches API: 50% off
- OpenAI Batch API: 50% off
- Gemini batch predictions: similar discount
A daily 8am briefing can be submitted to the batch API the evening before and come back overnight at half price. For anything that doesn't need sub-minute latency, batch.
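The submission itself is just a list of request objects, shaped like Anthropic's Message Batches API (one entry per job, each tagged with a custom_id for matching results later); the model id and job fields here are placeholders:

```python
def build_batch(jobs):
    """jobs: list of {"id": ..., "prompt": ...} dicts for tonight's
    scheduled work. Returns the request list for one batch submission."""
    return [
        {
            "custom_id": job["id"],  # used to match results on retrieval
            "params": {
                "model": "flash-class-model",  # placeholder model id
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": job["prompt"]}],
            },
        }
        for job in jobs
    ]
```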
7. Cap iterations — and fail fast
Savings: 15-30% on tail cases.
Agents sometimes get stuck in loops — tool call fails, retry, different tool call fails, retry. Without a cap you can spend $5 on a $0.50 task.
Hard cap: max iterations per turn. We run 30 for most tasks, 60 for research-heavy. Above that, the agent returns what it has and asks the user. Combined with good compression this barely triggers on normal work, but it keeps tail-case spend bounded.
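The cap itself is a few lines, sketched here with `step()` standing in for one LLM-plus-tools iteration:

```python
MAX_ITERATIONS = 30  # 60 for research-heavy tasks

def run_agent(step, is_done, max_iters=MAX_ITERATIONS):
    """`step` runs one LLM call + tool round and returns updated state;
    `is_done` checks whether the task is finished. Past the cap, the
    agent returns what it has instead of burning tokens in a loop."""
    state = {"capped": False}
    for _ in range(max_iters):
        state = step(state)
        if is_done(state):
            return state
    state["capped"] = True  # surface partial results, ask the user
    return state
```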
Stack the techniques
Applied together on a realistic workload (daily agent doing research + scheduled posts + email triage):
| Technique (applied cumulatively) | Before | After | Saved |
|---|---|---|---|
| Model routing | $450/mo | $180/mo | 60% |
| + Prompt caching | $180/mo | $110/mo | 39% |
| + Context compression | $110/mo | $75/mo | 32% |
| + Tool output pruning | $75/mo | $60/mo | 20% |
| + Open-weight for coding | $60/mo | $40/mo | 33% |
| + Batch scheduling | $40/mo | $30/mo | 25% |
| + Iteration caps | $30/mo | $28/mo | 7% |
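The per-technique percentages in the table compound multiplicatively, since each one applies to the cost left by the previous step. A quick check that they reproduce the end state:

```python
# Each step's saving applies to the cost left by the previous step.
baseline = 450.0
step_savings = [0.60, 0.39, 0.32, 0.20, 0.33, 0.25, 0.07]  # "Saved" column

cost = baseline
for s in step_savings:
    cost *= 1 - s

# cost is now ~$27.9/mo, i.e. a ~94% total reduction vs. the baseline.
```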
End state: $28/mo for a workload that was $450/mo naively — a 94% reduction. Real Klaws users on the $19 Starter plan routinely run this level of workload without hitting credit limits.
What this means for you
If you're building your own agent: implement all 7. Expect to spend 1-2 weeks on the cost-side engineering before scaling. The alternative is burning runway on Opus tokens.
If you're evaluating managed agent platforms: ask which of these they do. Most don't. Klaws runs all 7 by default — that's why $19/mo flat works.
Try Klaws for 3 days free →. You don't need to know any of this — we run the cost engine for you.
For deeper reads: Kimi K2.6 review, Qwen 3.6 Max review, 2026 platform roundup.