Update — April 23, 2026: OpenAI shipped GPT-5.5 today — not a benchmark-shattering release, but the agent-task gains are real. The leaderboard commentary below is still accurate; GPT-5.5 slots in above 5.4 on multi-step work and structured output.
Every few months a new model tops the Artificial Analysis leaderboard and the internet declares a winner. But if you actually use these models every day, you know the truth: no single model is best at everything. The smartest model isn't the fastest. The fastest isn't the cheapest. And the cheapest occasionally beats the flagships at specific tasks. Here's how the 2026 field really stacks up — and what the leaderboard rankings actually mean for the work you do.
How we got here: a 2-year recap
In early 2024, the "use the best model" strategy was simple: OpenAI had GPT-4, and if you couldn't afford it, you used GPT-3.5. By mid-2024, Claude 3 Opus and Gemini 1.5 Pro closed the gap. By 2025, open-weight models from Meta, Mistral, and Alibaba were genuinely competitive on non-reasoning tasks. And now in 2026, the frontier is so crowded that three different labs are tied at the top — while five more labs ship models within 8 points of them at a fraction of the cost.
Two things drove this: (1) post-training techniques improved dramatically — especially reinforcement learning from verifier rewards for reasoning tasks, and (2) inference cost collapsed as labs invested in specialized silicon and distillation. What was a $60/million-tokens Opus-tier model in 2024 now runs at $10 or less, and the $0.50 tier is genuinely useful instead of toy-quality.
The top tier (Intelligence Index 53–57)
Three models are tied at the top with Intelligence Index 57:
- Claude Opus 4.7 (Anthropic) — best at long-form reasoning, coding, and agentic tasks. $10/M output tokens, 50 tokens/sec. Slow but precise. Strong 200k context window; excellent at following complex multi-step instructions without drifting.
- Gemini 3.1 Pro Preview (Google) — nearly the same intelligence at half the price ($4.50/M) and 2.6x the speed (130 t/s). The best raw value at the top. 2M-token context genuinely works — you can feed it books and it still reasons over the whole thing.
- GPT-5.4 xhigh (OpenAI) — the all-rounder. $5.63/M, 73 t/s. Wide tool use, strong at structured output, best vision capability of the three.
Below them sit GPT-5.3 Codex (specialized for code), Claude Opus 4.6, and Claude Sonnet 4.6 — the last being Anthropic's "fast smart" option at $6/M and 50 t/s. Sonnet 4.6 is Anthropic's recommended default for production agents because you get roughly 90% of Opus quality for 40% less.
The mid-tier surprise: Chinese and open-weight models
The biggest story of 2026 isn't the frontier — it's the middle. Models scoring 49–51 on the leaderboard now cost dramatically less than the US flagships:
| Model | Intelligence | Price/M | Speed |
|---|---|---|---|
| GLM-5.1 | 51 | $2.15 | 41 t/s |
| Qwen3.6 Plus | 50 | $1.13 | 53 t/s |
| GLM-5 | 50 | $1.55 | 67 t/s |
| MiniMax-M2.7 | 50 | $0.53 | 49 t/s |
| MiMo-V2-Pro (Xiaomi) | 49 | $1.50 | 70 t/s |
MiniMax-M2.7 at $0.53/M is 19x cheaper than Claude Opus 4.7 while losing only 7 intelligence points. For a huge class of tasks — summarization, classification, routine research, first-pass drafting — the 7-point gap is invisible in output quality. You notice it on edge cases and the hardest reasoning tasks.
A practical test: ask both models to summarize a 5,000-word article. A human rater can't reliably tell them apart. Ask both to debug a subtle race condition in concurrent code, and Opus wins decisively. The question isn't "which is better" — it's "does your task sit in the 80% where they're equivalent, or the 20% where the flagship matters?"
The speed champion
If latency is what you care about, Grok 4.20 (xAI) leads at 168 tokens/second with Intelligence 49 at $3/M. For conversational interfaces or anything user-facing, "fast enough and smart enough" beats "genius but slow."
GPT-5.4 mini xhigh is close behind at 151 t/s, $1.69/M, Intelligence 49 — a very strong pick for agent workloads that do lots of small calls. Gemini 3 Flash is another strong speed pick at 250+ t/s depending on region, though its Intelligence score sits at 45 (still plenty for routine agent work).
Speed matters more than people admit. If a chat UI takes 15 seconds to stream a response, users abandon. If it takes 3 seconds, they engage. The difference is almost entirely about model throughput, not "intelligence" in any meaningful sense.
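To put rough numbers on that, time-to-last-token is just response length divided by decode throughput. A quick sketch, assuming a 500-token reply and ignoring time-to-first-token and network overhead:

```python
# Rough time-to-last-token for a streamed reply: output length divided by
# decode throughput. Ignores time-to-first-token, which adds a bit more.
def stream_seconds(response_tokens: int, tokens_per_sec: float) -> float:
    return response_tokens / tokens_per_sec

print(stream_seconds(500, 50))   # ~10 s on a 50 t/s flagship like Opus 4.7
print(stream_seconds(500, 168))  # ~3 s at Grok 4.20's 168 t/s
```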
What the leaderboard doesn't show
The Intelligence Index is a blended score across MMLU-style knowledge, reasoning benchmarks, and coding tests. It's directionally useful, but it doesn't tell you:
- Which model is best at your task. Claude is still the preferred model for nuanced writing. GPT-5 handles structured tool-calling better. Gemini 3 wins at long-document analysis thanks to its 2M-token context. Qwen3.6 is quietly excellent at multilingual work, especially Chinese/Japanese/Korean.
- Which model refuses less. The flagships have gotten more guardrailed in 2026. For security research, medical questions, or legal analysis, the gap between "technically capable" and "will actually answer" is large. Grok is the most permissive; Gemini is the most cautious.
- Which model hallucinates less on your domain. Benchmarks use generic questions. Your domain — whether it's pharma research, Ethereum protocol details, or a specific programming language — might be a blind spot for some models and a strength for others.
- Which model's tool calls work in your stack. OpenAI's function calling spec is the de facto standard; Claude and Gemini follow it with small differences. If your codebase assumes one provider, switching is more work than you'd expect (see the sketch after this list).
- Which model remembers. Long-conversation coherence varies wildly. Claude stays coherent across 100+ turns; GPT-5 sometimes loses thread context; Gemini uses its huge context to brute-force remember but can sometimes over-weight early context.
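On the tool-calling point: the same tool typically needs per-provider reshaping even when the underlying JSON Schema is identical. A rough sketch of how an OpenAI-style and an Anthropic-style definition differ; the field names reflect the providers' public specs at a high level, so check the docs rather than treating this as a reference:

```python
# One tool, two provider formats. The parameter schema is identical JSON Schema;
# only the wrapper layout changes. Field names are a high-level sketch of the
# providers' current public specs, not an exhaustive reference.
parameters = {
    "type": "object",
    "properties": {"city": {"type": "string", "description": "City name"}},
    "required": ["city"],
}

openai_tool = {  # OpenAI-style: wrapped in a "function" object
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": parameters,
    },
}

anthropic_tool = {  # Anthropic-style: flat, schema lives under "input_schema"
    "name": "get_weather",
    "description": "Look up current weather for a city",
    "input_schema": parameters,
}
```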
The model landscape by task
Based on our internal evals and user feedback at Klaws:
- Creative writing: Claude Opus 4.7 > Claude Sonnet 4.6 > GPT-5.4 > Gemini 3.1 Pro. Anthropic's post-training clearly optimizes for natural voice.
- Complex reasoning (math, logic, multi-hop): GPT-5.4 xhigh and Claude Opus 4.7 are tied. Gemini 3.1 Pro is close behind. Cheap models drop off noticeably here.
- Code generation: Claude Opus 4.7 ≈ GPT-5.3 Codex > GPT-5.4 > Claude Sonnet 4.6 > Gemini 3.1 Pro.
- Structured data extraction: GPT-5.4 > GPT-5.4 mini > Gemini 3.1 Pro > Claude. OpenAI's JSON-mode reliability is still unmatched.
- Long documents (100k+ tokens): Gemini 3.1 Pro wins outright. Nothing else competes at 500k+ tokens without quality degradation.
- Vision: GPT-5.4 > Gemini 3.1 Pro > Claude Opus 4.7. GPT-5's ability to reason over images (diagrams, charts, UI screenshots) is strongest.
- Multilingual: Qwen3.6 Plus and Gemini 3.1 Pro lead, especially for non-Latin scripts.
- Agentic tool use (many steps, many tools): Claude Opus 4.7 and GPT-5.4 xhigh. These are also the models that coding assistants like Claude Code and GitHub Copilot default to.
How Klaws thinks about this
Klaws doesn't force you to pick one model. Under the hood we do smart routing: simple chat and summarization go to fast cheap models (Gemini 3 Flash, Qwen3.6), complex reasoning and code go to Claude Opus 4.7 or GPT-5.4, and everything in between lands on Claude Sonnet 4.6. You pay via flat credits — not per-token across five vendors.
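In pseudocode terms, the routing idea looks roughly like the sketch below; the task labels, difficulty score, and thresholds are illustrative assumptions, not the actual router:

```python
# Minimal routing sketch: send easy, high-volume work to cheap models and
# reserve flagships for tasks that actually need them. All names and
# thresholds here are illustrative, not the real routing logic.
CHEAP = ("gemini-3-flash", "qwen3.6-plus")
DEFAULT = "claude-sonnet-4.6"
FRONTIER = {"code": "claude-opus-4.7", "reasoning": "claude-opus-4.7", "vision": "gpt-5.4-xhigh"}

def pick_model(task_type: str, difficulty: float) -> str:
    """Route a task to the cheapest model expected to handle it well."""
    if task_type in ("chat", "summarization") and difficulty < 0.3:
        return CHEAP[0]               # fast and cheap is good enough here
    if task_type in FRONTIER and difficulty > 0.7:
        return FRONTIER[task_type]    # pay flagship prices only when it matters
    return DEFAULT                    # everything in between lands on Sonnet
```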
That means:
- A morning news briefing costs a fraction of a credit (runs on Gemini Flash)
- A deep research report with web search costs ~15 credits (runs on Opus)
- Generating a working SaaS landing page costs ~10 credits (routed to Codex + Sonnet)
- A simple conversational reply costs 0.1 credits (runs on Gemini Flash or Qwen)
The result is that you get flagship-quality output when it matters without paying flagship prices on every trivial task. Users on our $19/mo Starter plan (700 credits) typically run 500–1,200 tasks per month without hitting the limit — because the router keeps the average cost well below 1 credit per task.
Which model should you use directly?
If you're using a model API directly rather than through an agent, our take for 2026:
- Default pick: Gemini 3.1 Pro or Claude Sonnet 4.6. Best intelligence/price balance.
- Hardest tasks only: Claude Opus 4.7 or GPT-5.4 xhigh.
- High-volume cheap tasks: MiniMax-M2.7 or Qwen3.6 Plus.
- Latency-critical UX: Grok 4.20 or GPT-5.4 mini xhigh.
- Long documents (>500k tokens): Gemini 3.1 Pro. It's the only frontier model that handles 2M context without falling apart.
- Code: GPT-5.3 Codex xhigh or Claude Opus 4.7.
- Vision/multimodal: GPT-5.4 xhigh.
- Non-English work: Qwen3.6 Plus or Gemini 3.1 Pro.
The cost math nobody talks about
If you're burning 50M tokens/month (realistic for a production app with a few thousand users), the monthly bill differs by model:
- Claude Opus 4.7: ~$500/month
- GPT-5.4 xhigh: ~$280/month
- Gemini 3.1 Pro: ~$225/month
- Claude Sonnet 4.6: ~$300/month
- Qwen3.6 Plus: ~$57/month
- MiniMax-M2.7: ~$27/month
That's an 18x gap between the most expensive and the cheapest usable model. If you can route 80% of tokens to the cheap tier and 20% to flagships, your bill drops from $500 to about $100 — same apparent quality to end users.
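As a sanity check on that claim, here is the blended-cost arithmetic using the monthly figures above; the 80/20 split and the choice of flagship are assumptions for illustration:

```python
# Back-of-the-envelope blended cost at 50M tokens/month, using the
# per-model monthly figures listed above.
MONTHLY_COST_50M = {
    "claude-opus-4.7": 500,
    "gpt-5.4-xhigh": 280,
    "gemini-3.1-pro": 225,
    "minimax-m2.7": 27,
}

def blended_cost(cheap_share: float, cheap_model: str, flagship: str) -> float:
    """Monthly cost if `cheap_share` of tokens go to the cheap model, the rest to the flagship."""
    return (cheap_share * MONTHLY_COST_50M[cheap_model]
            + (1 - cheap_share) * MONTHLY_COST_50M[flagship])

for flagship in ("claude-opus-4.7", "gpt-5.4-xhigh", "gemini-3.1-pro"):
    print(flagship, round(blended_cost(0.8, "minimax-m2.7", flagship)))
# roughly $122, $78, and $67 per month respectively, depending on which
# flagship handles the hard 20%, versus $500 on Opus alone
```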
This is why "pick one model and stick with it" is the wrong play in 2026. It's also why doing the routing yourself is annoying — you need evals, per-task routing logic, vendor accounts, rate-limit handling, and fallbacks.
The honest summary
The frontier is getting crowded. Intelligence scores are converging, but price and speed are diverging. In 2024, "use the best model available" was a reasonable strategy. In 2026, it's wasteful — you're paying 10–20x more than you need to for most tasks.
The smart move is either to route across models yourself (complex, annoying), or use a tool that does the routing for you. If you'd rather spend your time building things than benchmarking vendors, try Klaws free for 3 days →.
For head-to-heads of specific models: Claude Opus 4.7 vs GPT-5.4, Gemini 3.1 Pro vs Claude Opus 4.7, and the best cheap AI models of 2026.