Speed is the benchmark nobody talks about on Twitter but everyone notices in production. The difference between 50 tokens/second and 150 tokens/second is the difference between "users wait" and "users engage." And the difference between 150 and 2,000+ is the difference between a chat app and a voice app. Here's how the fast tier breaks down in 2026.
## The speed leaderboard
| Model | Tokens/sec | Intelligence | Price per M tokens |
|---|---|---|---|
| Cerebras Llama 4.1 405B | 2,100+ | 46 | $0.60 |
| Groq Llama 4.1 70B | 800 | 44 | $0.50 |
| SambaNova DeepSeek V4 | 400 | 49 | $0.80 |
| Gemini 3 Flash | 250 | 45 | $0.30 |
| Grok 4.20 | 168 | 49 | $3.00 |
| GPT-5.4 mini xhigh | 151 | 49 | $1.69 |
| GPT-5.4 xhigh | 73 | 57 | $5.63 |
| Claude Opus 4.7 | 50 | 57 | $10.00 |
The gap between the fast tier and the frontier tier is still 10-40x. You're not just picking a model; you're picking a tier.
## The three real speed tiers
### Tier 1 — specialized inference hardware (500+ t/s)
Cerebras, Groq, and SambaNova run open-weight models on custom silicon (wafer-scale chips, LPUs, etc.) to achieve throughput no general-purpose GPU can match. These are genuinely transformative for real-time applications.
Cerebras serves Llama 4.1 405B at over 2,000 tokens/second. That's fast enough to stream a 500-word response in under a second. For voice assistants (where TTS needs the text fast) or live agents, this is the only game in town.
Groq serves Llama 4.1 70B at 800 t/s: a smaller model with slightly lower quality, but cheaper and still ridiculously fast.
SambaNova serves DeepSeek V4 at 400 t/s: slower than Cerebras or Groq, but with stronger reasoning (Intelligence 49) than either Llama model.
Use these when:
- Voice applications where latency compounds (ASR → LLM → TTS)
- Live chat UIs where users abandon if they wait
- Real-time decision-making (algo trading signals, fraud detection)
- You're building a demo that needs the "wow" factor
### Tier 2 — frontier-tier speed (130-300 t/s)
Gemini 3 Flash at 250 t/s with Intelligence 45 is Google's answer to the speed tier. It's fast enough for snappy chat, cheap enough for high volume ($0.30/M), and good enough for most practical tasks.
Grok 4.20 at 168 t/s is xAI's flagship-speed offering. Intelligence 49 at $3/M — a better quality-speed balance than Flash but 10x the price.
GPT-5.4 mini xhigh at 151 t/s, Intelligence 49 at $1.69/M. OpenAI's "cheap and fast" play.
Use these when:
- You're building a production chat app
- You want frontier-adjacent quality with good speed
- You're running many concurrent users
### Tier 3 — frontier quality (50-80 t/s)
GPT-5.4 xhigh and Claude Opus 4.7 are slow in 2026 terms but carry the highest intelligence. For deep reasoning tasks where you're not streaming to a user (background agents, scheduled research, batch processing), their speed is fine.
## The real-world UX math
Your user is reading a streamed response. At what speed does it feel "instant"?
- <60 t/s: Visibly slow. Users wait, attention drifts.
- 60-120 t/s: Acceptable. Like reading along with a fast typist.
- 120-200 t/s: Snappy. Feels like the UI is keeping up with thought.
- 200-500 t/s: Faster than reading speed. Feels instant.
- 500+ t/s: Paragraph appears in one "blink." Only useful for voice/real-time.
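Those thresholds can be captured in a tiny helper. A sketch, using the rule-of-thumb boundaries above (they are this article's framing, not an industry standard):

```python
def ux_feel(tokens_per_sec: float) -> str:
    """Map a streaming speed to the perceived-UX tiers described above."""
    if tokens_per_sec < 60:
        return "visibly slow"
    if tokens_per_sec < 120:
        return "acceptable"
    if tokens_per_sec < 200:
        return "snappy"
    if tokens_per_sec < 500:
        return "feels instant"
    return "voice/real-time territory"

print(ux_feel(50))    # visibly slow  (Claude Opus 4.7)
print(ux_feel(250))   # feels instant (Gemini 3 Flash)
print(ux_feel(2100))  # voice/real-time territory (Cerebras)
```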
For voice apps specifically: if you're targeting <500ms end-to-end latency from user speech end to TTS start, you need the first useful chunk of LLM output in ~200ms. At 100 t/s that's 20 tokens — barely a sentence. At 2,000 t/s that's 400 tokens — a full answer.
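The budget math above is just speed times time, minus any time-to-first-token overhead. A minimal sketch (the `ttft_ms` parameter is illustrative; real first-token latency varies by provider):

```python
def tokens_in_budget(tokens_per_sec: float, budget_ms: float,
                     ttft_ms: float = 0.0) -> int:
    """How many tokens a model can emit within a latency budget.

    tokens_per_sec -- steady-state generation speed
    budget_ms      -- the slice of the pipeline allotted to LLM output
    ttft_ms        -- time-to-first-token overhead (assumed, varies widely)
    """
    usable_ms = max(0.0, budget_ms - ttft_ms)
    return int(tokens_per_sec * usable_ms / 1000)

# The article's figures: a 200 ms slice of a 500 ms voice pipeline.
print(tokens_in_budget(100, 200))    # 20 tokens  -- barely a sentence
print(tokens_in_budget(2_000, 200))  # 400 tokens -- a full answer
```

Note that a slow time-to-first-token can eat the entire budget regardless of throughput, which is why streaming speed alone doesn't tell the whole voice-latency story.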
## Cost vs speed tradeoff
The fast-tier providers compete on price too. At 1M tokens/month:
- Cerebras Llama 4.1: ~$0.60 (plus premium speed)
- Groq Llama 4.1: ~$0.50
- Gemini 3 Flash: ~$0.30
- GPT-5.4 mini: ~$1.69
- Grok 4.20: ~$3.00
- Claude Opus 4.7: ~$10.00
For pure throughput at low cost, Cerebras and Groq win. For quality + reasonable speed, Gemini 3 Flash. For brand-name flagship-ish speed, GPT-5.4 mini or Grok.
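Monthly cost is simply the per-million price times your monthly volume in millions. A quick sketch using the table's figures (illustrative numbers from this comparison, not provider quotes):

```python
# (tokens/sec, $ per 1M tokens) from the comparison table above
models = {
    "cerebras-llama-4.1-405b": (2100, 0.60),
    "groq-llama-4.1-70b":      (800,  0.50),
    "gemini-3-flash":          (250,  0.30),
    "gpt-5.4-mini-xhigh":      (151,  1.69),
    "grok-4.20":               (168,  3.00),
    "claude-opus-4.7":         (50,  10.00),
}

def monthly_cost(model: str, tokens_per_month: float) -> float:
    """Dollar cost at a given monthly token volume."""
    _, price_per_m = models[model]
    return price_per_m * tokens_per_month / 1_000_000

# The gap compounds at volume: 50M tokens/month.
print(monthly_cost("gemini-3-flash", 50_000_000))   # 15.0
print(monthly_cost("claude-opus-4.7", 50_000_000))  # 500.0
```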
## What Klaws uses for fast responses
For conversational responses where latency matters, Klaws routes to Gemini 3 Flash by default (balance of speed, quality, cost). For voice features where sub-second response matters, we route to Groq Llama 4.1 70B. For slow-but-smart tasks (deep research, complex reasoning), we route to Claude Opus 4.7.
You don't configure any of this. The system picks based on task complexity and latency sensitivity. Try it →
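Conceptually, the routing looks something like this. A hypothetical sketch only (model identifiers and task labels are illustrative; this is not Klaws' actual implementation):

```python
def route(task_type: str) -> str:
    """Pick a model by latency sensitivity vs reasoning depth (illustrative)."""
    routes = {
        "voice":             "groq/llama-4.1-70b",  # sub-second first chunk
        "deep_research":     "claude-opus-4.7",     # slow but smart
        "complex_reasoning": "claude-opus-4.7",
    }
    # Default: conversational traffic goes to the speed/quality/cost balance.
    return routes.get(task_type, "gemini-3-flash")

print(route("voice"))  # groq/llama-4.1-70b
print(route("chat"))   # gemini-3-flash
```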
## When NOT to optimize for speed
Speed is only the right axis when latency directly affects UX. If you're running:
- Background batch jobs
- Scheduled agents (run while user sleeps)
- Deep research with tool use
- Long-form content generation
Then choose on quality and price, not speed. Opus taking 20 seconds versus Gemini Flash taking 4 doesn't matter if no user is watching.
## The honest verdict
Pick Cerebras or Groq if: voice apps, real-time systems, you need the speed.
Pick Gemini 3 Flash if: production chat UI, high volume, quality good enough.
Pick GPT-5.4 mini or Grok 4.20 if: you want brand-name, near-frontier quality with OK speed.
Pick Opus 4.7 or Gemini 3.1 Pro if: speed doesn't matter for your use case.
See also: best AI models 2026, best cheap AI models, Gemini 3 vs Claude Opus.