The default framing for AI pricing assumes chat: a human types, the model responds, ~500 output tokens, happens maybe 50 times a day. DeepSeek V4 breaks that assumption.
Agents are different. An autonomous agent running a "do my morning brief" task generates 20-50x the tokens of a chat turn, and background cron jobs multiply that again over time. A single user running 5 daily autonomous workflows can consume more tokens in a week than a heavy chat user does in a month.
At Claude Opus prices, those economics didn't work: running autonomous agents at scale was a $300+/user/month proposition. V4 just rewrote that math.
The token multiplier nobody talks about
A typical agentic task has three phases:
- Planning — agent reads a prompt, writes a plan, calls tools. ~3-5k input, 1-2k output.
- Execution — agent iterates: search web, read docs, write code, call APIs, read results. 10-50 tool calls. 20-100k input (tool results keep compounding), 3-8k output.
- Synthesis — agent produces the final deliverable. 5-20k input, 2-5k output.
Total per task: ~30-150k input tokens, 6-15k output tokens. Versus a chat turn's ~1-3k / 500.
At Claude Opus rates ($15 in / $75 out), a single agent task costs $0.90-$3.50. Five tasks a day = $4.50-$17.50/day per user = $135-$525/month just in raw model cost. No platform margin.
At V4-Flash rates ($0.14 in / $0.28 out), the same task costs $0.006-$0.025. Five tasks a day = $0.03-$0.125/day per user = $1-$3.75/month.
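The per-task arithmetic above can be checked with a short script. Rates and token ranges are the ones quoted in this post; the helper function is just for illustration.

```python
def task_cost(in_tokens: int, out_tokens: int, rate_in: float, rate_out: float) -> float:
    """Raw model cost for one agent task; rates are USD per 1M tokens."""
    return (in_tokens * rate_in + out_tokens * rate_out) / 1_000_000

# Claude Opus ($15 in / $75 out): low and high ends of the task range
print(task_cost(30_000, 6_000, 15.0, 75.0))               # 0.9
print(task_cost(150_000, 15_000, 15.0, 75.0))             # 3.375 (rounded to $3.50 above)

# V4-Flash ($0.14 in / $0.28 out): same task, two orders of magnitude cheaper
print(round(task_cost(30_000, 6_000, 0.14, 0.28), 5))     # 0.00588
print(round(task_cost(150_000, 15_000, 0.14, 0.28), 4))   # 0.0252
```

Multiply the per-task numbers by 5 tasks/day and 30 days to recover the monthly figures.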
That's not a 10x improvement. That's a product category change.
Why Flash specifically matters for agents
V4-Flash has 13B active params. That's enough for most agent subtasks — search, routing, extraction, summarization — but not enough for the hardest reasoning. Historically this meant agent platforms had to pick: use a big model for everything (expensive) or a small model for everything (breaks on hard steps).
With V4, the right answer is: V4-Flash for 90% of agent work, V4-Pro for the 10% that actually needs reasoning depth. Intelligent routing at the agent-loop level.
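A minimal sketch of what routing at the agent-loop level could look like. The step-type allowlist here is a placeholder heuristic of my own; production routers typically use classifiers, confidence scores, or retry-on-failure escalation, and the model names are illustrative.

```python
# Default every step to Flash; escalate to Pro only when the step
# falls outside the "cheap" categories named above.
FLASH_STEPS = {"search", "route", "extract", "summarize", "format"}

def pick_model(step_type: str) -> str:
    return "v4-flash" if step_type in FLASH_STEPS else "v4-pro"

plan = ["search", "extract", "refactor_codebase", "summarize"]
print([pick_model(s) for s in plan])
# → ['v4-flash', 'v4-flash', 'v4-pro', 'v4-flash']
```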
At a realistic 90/10 split, the effective cost per task is:
- 0.9 × $0.015 (Flash) + 0.1 × $0.10 (Pro) = $0.024 per task
Compare to a Claude-Opus-everywhere setup: $1.20+ per task. That's 50x cheaper for equivalent user-perceived quality.
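The blended-cost math, verifiable in two lines (per-task prices are the post's assumed figures: $0.015 Flash, $0.10 Pro, $1.20 Opus):

```python
def blended(flash_share: float, flash_cost: float, pro_cost: float) -> float:
    """Expected per-task cost when routing flash_share of steps to Flash."""
    return flash_share * flash_cost + (1 - flash_share) * pro_cost

per_task = blended(0.9, 0.015, 0.10)
print(round(per_task, 4))       # 0.0235, rounded to $0.024 above
print(round(1.20 / per_task))   # 51, i.e. the ~50x quoted above
```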
The 1M context unlock
Agent memory has been a hard problem because of context. Keep everything → context blows up → quality degrades. Summarize aggressively → lose detail → make mistakes.
1M context changes this. You can keep:
- Full conversation history from the past week
- Every artifact the agent produced
- Every tool call trace
- Every relevant doc from the user's workspace
And still have 500k tokens left for the current turn. The "memory compression" tax that every agent platform pays with 200k-context models drops significantly.
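As a token-budget sketch: the component sizes below are hypothetical (real numbers depend on the user's workspace), but they show how the memory categories listed above can coexist inside a 1M window with room to spare.

```python
CONTEXT_WINDOW = 1_000_000

# Illustrative sizes for the agent-memory components listed above
memory = {
    "week_of_conversation": 200_000,
    "produced_artifacts":   120_000,
    "tool_call_traces":     100_000,
    "workspace_docs":        80_000,
}

used = sum(memory.values())
print(f"memory: {used:,} tokens, headroom: {CONTEXT_WINDOW - used:,}")
# → memory: 500,000 tokens, headroom: 500,000
```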
Combine that with the HCA architecture's KV cache, which runs at roughly 10% of the usual footprint, and the cost of actually using that 1M context becomes far more reasonable than it was with previous 1M-context models like Gemini's.
MIT license = on-prem agents become real
Compliance-heavy industries (healthcare, legal, financial services) have been locked out of autonomous agents not because they don't want them, but because they can't send sensitive data to a closed API without a BAA, and BAAs from OpenAI and Anthropic are gated behind enterprise tiers.
V4-Pro under MIT means those companies can run the agent stack fully on-prem. Fine-tune on their own data without worrying about vendor access. Audit the weights. Ship an agent product into verticals that previously couldn't touch one.
For platforms like Klaws, that's a new TAM: "autonomous agents for regulated industries" wasn't a product before because the pricing wasn't there. Now it is.
What this means for Klaws specifically
We've been running Gemini 3 Flash as primary because the economics worked. V4-Flash undercuts Gemini 3 Flash by 3.5x on input while matching most quality. We're evaluating V4-Flash in the agent routing this week.
For heavy reasoning tasks — the ones where our agents previously had to compromise between "use Claude Opus and eat the cost" or "use a smaller model and accept failures" — V4-Pro is a third option that didn't exist Monday.
Per our agent modes design, we'll likely introduce a new "Deep V4" mode alongside the existing Fast (Gemini) and Deep (Qwen 3.6 Plus) tiers. Users opt in per task when they want frontier quality without Opus prices.
The uncomfortable question for agent startups
If you've been charging $50-$200/month for your agent product, your cost-of-goods-sold just dropped by 10-20x. That means:
- You can cut price and capture market share
- You can keep price and improve margins
- You can run more autonomous work per user at the same price
The first mover here will set the new price expectation. Agent products that don't factor V4 into their economics within the next 60 days will look expensive by comparison.
For the launch details, see DeepSeek V4 is out. For the head-to-head with Claude, see the honest comparison.