OpenAI shipped GPT-5.5 today. On paper it's a minor version bump — the kind of release that would usually be a footnote. In practice, it's the most consequential OpenAI update since GPT-5.0 itself, because of where the improvements land.
The short version: GPT-5.5 isn't noticeably smarter on single-shot questions. It's dramatically better on anything that takes more than one step. That's the thing most real products actually need.
What actually changed
OpenAI's release notes are unusually restrained, but the behavior shift is visible within an hour of using it:
- Longer-horizon reasoning. Tasks that reliably fell apart around step 6 or 7 on GPT-5.4 — multi-file refactors, multi-hop research, multi-step agent plans — now hold together well past step 15. The model replans mid-task instead of charging forward on a flawed premise.
- Cleaner structured output. Function calling and strict JSON schemas are near-perfect now. The occasional malformed arg or schema violation that used to force a retry loop is rare enough to design around.
- Better tool-use discretion. GPT-5.4 would sometimes call a tool it didn't need ("I'll search for this" on a question it could answer from memory). 5.5 pauses and picks the cheaper path more often.
- Price drop, not a hike. Output pricing is down roughly 15% from GPT-5.4 at the same tier. Given the capability bump, that's the unusual "better and cheaper" combination.
- Same context window, smarter use of it. 256K stays, but retrieval inside that window is tighter — less "I lost the thread around token 180K."
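To make the "design around it" point concrete, here's a minimal sketch of what strict-mode tool definitions look like in the OpenAI function-calling style (`"strict": true`, `additionalProperties: false`, every field required). The `lookup_order` tool and the `parse_tool_args` helper are invented for illustration, not part of any SDK:

```python
import json

# Hypothetical tool definition in OpenAI's strict function-calling style.
# The schema shape is what strict mode expects; the tool itself is made up.
LOOKUP_ORDER_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch an order's status by ID.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "include_items": {"type": "boolean"},
            },
            "required": ["order_id", "include_items"],
            "additionalProperties": False,
        },
    },
}

def parse_tool_args(raw: str) -> dict:
    """Parse model-emitted arguments. With strict schemas this should
    essentially never raise, so the old retry loop becomes dead code."""
    args = json.loads(raw)
    required = LOOKUP_ORDER_TOOL["function"]["parameters"]["required"]
    missing = [k for k in required if k not in args]
    if missing:
        raise ValueError(f"missing required args: {missing}")
    return args

args = parse_tool_args('{"order_id": "A-1001", "include_items": false}')
```

The design shift is that validation moves from "retry until it parses" to a single assertion you expect to pass.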
What didn't change: it's still GPT-5's voice. The writing style is the same polished-intern register, and Opus still beats it on long-form writing and nuanced tone.
Why the agent story matters most
Point releases usually move benchmarks a few percent. GPT-5.5's real change is that the failure modes got narrower. Three specific things that used to break agent workflows now mostly don't:
- Loops on failed tool calls. 5.4 would sometimes retry the same broken call three times before giving up. 5.5 reads the error message and adapts.
- Drift on long plans. After 10+ steps, 5.4 started forgetting earlier constraints ("wait, I was supposed to keep this under 500 words"). 5.5 rechecks the original brief.
- Schema soup. Nested tool outputs with optional fields, enums, and arrays of objects were exactly where GPT-5.4 quietly broke. 5.5 handles them cleanly.
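The first two failure modes also have a caller-side half: the loop around the model should feed errors back and re-pin the brief, rather than hoping the model copes. A minimal sketch, where `call_model` and `call_tool` are stand-ins for your own LLM and tool layers (not any vendor API), and the step counts are arbitrary:

```python
MAX_STEPS = 15
BRIEF_REFRESH_EVERY = 5  # re-inject the original brief every N steps

def run_agent(call_model, call_tool, brief: str) -> list[dict]:
    """call_model returns {"type": "tool", ...} or {"type": "done", ...};
    call_tool may raise. Both are placeholders for your own stack."""
    messages = [{"role": "system", "content": brief}]
    for step in range(MAX_STEPS):
        # Counter long-plan drift: periodically restate the original brief.
        if step and step % BRIEF_REFRESH_EVERY == 0:
            messages.append(
                {"role": "system", "content": f"Reminder of the brief: {brief}"}
            )
        action = call_model(messages)
        if action["type"] == "done":
            messages.append({"role": "assistant", "content": action["content"]})
            break
        try:
            result = call_tool(action["name"], action["args"])
            messages.append({"role": "tool", "content": result})
        except Exception as err:
            # Feed the error text back instead of retrying the same call blind.
            messages.append(
                {"role": "tool", "content": f"ERROR: {err}. Adjust the call."}
            )
    return messages
```

The point of the sketch is that "reads the error message and adapts" only works if the error message actually reaches the transcript.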
If you've been using a mix of Claude Opus 4.7 for hard agent work and GPT-5 for structured outputs, that split just got less clear. On agent tasks specifically, 5.5 closes most — not all — of the gap.
Where GPT-5.5 still lags
Worth naming honestly, because the launch hype will skip this:
- Long-form writing. Opus 4.7 is still noticeably better at anything over 3,000 words where voice and structure matter.
- Hard agentic coding. Opus's lead on cross-file refactors is smaller than before, but still there.
- Creative tasks. GPT-5.5 is still technically correct in a way that reads slightly sterile compared to Opus or Gemini 3.
If you do long-form writing, agentic coding on big repos, or anything where voice is the product, the Opus premium still makes sense.
Availability and access
- ChatGPT + Codex: rolling out to Plus, Pro, Business, and Enterprise as of April 23, 2026. Free-tier users get a capped preview.
- API: `gpt-5.5` and `gpt-5.5-mini` are live. Pricing sits below GPT-5.4 at the equivalent quality tier.
- Azure: expected on Microsoft Foundry within the week.
What this means for your stack
If you've built an agent on GPT-5 or GPT-5.4, bumping the model string is worth trying before anything else — the behavior change is large enough that prompts you've been tuning around may be doing harm now. If you're on Opus for hard work and GPT-5 for cheap bulk, re-test the boundary.
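One cheap way to make that re-test a one-line change is to keep the Opus/GPT boundary in a routing table rather than scattered through prompts. Everything here is an assumption for illustration: the task categories, the helper, and the `claude-opus-4-7` model string are invented; only `gpt-5.5` and `gpt-5.5-mini` come from the release:

```python
# Hypothetical routing table. Moving a task type between models is a
# one-line edit here instead of a prompt rewrite.
MODEL_ROUTES = {
    "agent": "gpt-5.5",             # candidate to re-test after this release
    "structured": "gpt-5.5-mini",   # cheap bulk structured output
    "longform": "claude-opus-4-7",  # invented string; Opus still wins here
}

def pick_model(task_type: str) -> str:
    """Route a task type to a model string, defaulting to the new release."""
    return MODEL_ROUTES.get(task_type, "gpt-5.5")
```

When the boundary shifts again, the diff is confined to `MODEL_ROUTES`.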
For the wider 2026 frontier picture, see our honest leaderboard breakdown and the Claude Opus 4.7 vs GPT-5.4 comparison; both are being updated as the dust settles.
GPT-5.5 isn't a new category the way gpt-image-2 was. It's the version where OpenAI finally nailed the boring-but-critical stuff: tool use, long plans, schema compliance. For anyone building real products on these APIs, that's the version that actually matters.