Anthropic shipped Claude Opus 4.8 on May 28, 2026 — exactly 41 days after Opus 4.7 — and the most interesting thing about it isn't the benchmark numbers. It's what the model does when it doesn't know the answer. Where previous models confidently declared victory and shipped broken code, Opus 4.8 flags the uncertainty. It catches its own bugs. It tells you when it's stuck. That behavioral shift, described by Anthropic as a fourfold reduction in unreported code flaws, is the real signal of what changed under the hood.
The benchmark numbers are real too. Opus 4.8 leads on agentic coding (69.2% on SWE-Bench Pro), autonomous computer use (83.4% on OSWorld-Verified), knowledge work (1890 GDPval-AA Elo), financial analysis (53.9% on Finance Agent v2), and multidisciplinary reasoning, both with and without tools. The one category where it loses is agentic terminal coding, where GPT-5.5 retains a narrow edge on Terminal-Bench 2.1 at 78.2% vs Opus 4.8's 74.6%.
Alongside the model itself, Anthropic launched two companion features: Dynamic Workflows in Claude Code (research preview) — which lets Claude orchestrate hundreds of parallel subagents for massive tasks — and effort control across claude.ai and Cowork, letting you dial how hard the model thinks from Low through Max. The Fast mode pricing also dropped to $10/$50 per million tokens, down from $30/$150 on Opus 4.7 — three times cheaper for the same model at 2.5× the speed.
Benchmark Breakdown
The chart below shows how Opus 4.8 stacks up against Opus 4.7, Opus 4.6, GPT-5.5, Gemini 3.1 Pro, and leading Chinese models across the benchmarks Anthropic published at launch.
[Chart: AI coding model benchmark comparison, Claude Opus 4.8 vs frontier and open-weight competitors. Metrics: SWE-Bench Pro (%), agentic coding on real multi-language repositories; Terminal-Bench 2.1 (%), agentic terminal coding, where GPT-5.5 keeps a narrow lead; OSWorld-Verified (%), autonomous desktop and web navigation; HLE with Tools (%), multidisciplinary reasoning with tool access enabled.]
Estimated rows are marked with "(est.)" and should be read as directional third-party or trajectory estimates, not official Anthropic numbers. Sources: Anthropic launch notes and AI Coding Daily.
| Benchmark | Opus 4.8 | Opus 4.7 | GPT-5.5 | Gemini 3.1 Pro |
|---|---|---|---|---|
| Agentic coding — SWE-Bench Pro | 69.2% | 64.3% | 58.6% | 54.2% |
| Agentic terminal coding — Terminal-Bench 2.1 | 74.6% | 66.1% | 78.2% ★ | 70.3% |
| Multidisciplinary reasoning — HLE (no tools) | 49.8% | 46.9% | 41.4% | 44.4% |
| Multidisciplinary reasoning — HLE (with tools) | 57.9% | 54.7% | 52.2% | 51.4% |
| Agentic computer use — OSWorld-Verified | 83.4% | 82.8% | 78.7% | 76.2% |
| Knowledge work — GDPval-AA (Elo) | 1890 | 1753 | 1769 | 1314 |
| Agentic financial analysis — Finance Agent v2 | 53.9% | 51.5% | 51.8% | 43.0% |
| SWE-Bench Verified | 88.6% | 87.6% | — | — |
★ GPT-5.5 leads on Terminal-Bench 2.1. All other categories: Opus 4.8 leads. Source: Anthropic, May 2026.
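To make the footnote easy to check, here is a minimal sketch that recomputes the leader and the margin over the runner-up for each row of the table. The scores are copied from the table as published; the function and variable names are illustrative only. SWE-Bench Verified is left out because the table lists no competitor scores for it.

```python
# Scores copied from the table above (percent, except GDPval-AA, which is Elo).
SCORES = {
    "SWE-Bench Pro": {"Opus 4.8": 69.2, "Opus 4.7": 64.3, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "Terminal-Bench 2.1": {"Opus 4.8": 74.6, "Opus 4.7": 66.1, "GPT-5.5": 78.2, "Gemini 3.1 Pro": 70.3},
    "HLE (no tools)": {"Opus 4.8": 49.8, "Opus 4.7": 46.9, "GPT-5.5": 41.4, "Gemini 3.1 Pro": 44.4},
    "HLE (with tools)": {"Opus 4.8": 57.9, "Opus 4.7": 54.7, "GPT-5.5": 52.2, "Gemini 3.1 Pro": 51.4},
    "OSWorld-Verified": {"Opus 4.8": 83.4, "Opus 4.7": 82.8, "GPT-5.5": 78.7, "Gemini 3.1 Pro": 76.2},
    "GDPval-AA (Elo)": {"Opus 4.8": 1890, "Opus 4.7": 1753, "GPT-5.5": 1769, "Gemini 3.1 Pro": 1314},
    "Finance Agent v2": {"Opus 4.8": 53.9, "Opus 4.7": 51.5, "GPT-5.5": 51.8, "Gemini 3.1 Pro": 43.0},
}


def leader_and_margin(row):
    """Return (leading model, margin over the runner-up) for one benchmark row."""
    ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_score), (_, second_score) = ranked[0], ranked[1]
    return top, round(top_score - second_score, 1)


def summarize(scores):
    """Map each benchmark name to its (leader, margin) pair."""
    return {name: leader_and_margin(row) for name, row in scores.items()}


summary = summarize(SCORES)
opus_48_wins = sum(1 for leader, _ in summary.values() if leader == "Opus 4.8")
```

On these figures Opus 4.8 leads six of the seven compared benchmarks, and the runner-up is often Opus 4.7 rather than a competitor, so the margin over the previous Opus is the tighter number to watch.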
The Honesty Story Is the Real Headline
Benchmarks are one thing. Behavior is another. The reason Anthropic emphasized honesty in the Opus 4.8 launch isn't marketing — it maps to a real problem that anyone who has used agentic coding tools for more than a few weeks has run into.
"Instead of inventing progress like other models, it tells you when it's not sure."
— Developer reaction on X, day of launch
Previous models (across labs, not just Anthropic) had a tendency to declare tasks complete when they weren't. They'd run into an error, attempt a workaround that didn't quite work, and then summarize optimistically. You'd come back to check progress and find the feature half-built but reported as done.
Opus 4.8 is reportedly four times less likely than Opus 4.7 to let a code flaw pass without flagging it. That's an alignment improvement that shows up in Anthropic's internal assessments, and it's the kind of change that matters more in production than a 2-3 point benchmark swing.
Dynamic Workflows: The Biggest Feature
Dynamic Workflows in Claude Code (currently in research preview) is the companion launch that's arguably more impactful than the model upgrade itself for teams with large codebases.
Here's how it works: for sufficiently complex tasks, Claude doesn't just run linearly. It makes a plan, fans out hundreds of parallel subagents across independent workstreams, and then verifies their outputs before reporting back. Think a database schema migration touching 150 files, a full test suite generation across a monorepo, or a security audit that needs to check every API endpoint.
- Plan mode first: Claude writes the architecture of the task (which subagents will handle which parts, what they need to produce, and how success will be verified) before any code is written.
- Hundreds of parallel subagents: each one works independently on its assigned portion. Your main session stays free while the background agents execute. You can check in at any time or walk away entirely.
- Verification before reporting: agents don't just finish and summarize. They run the tests, check for regressions, and flag anything uncertain before surfacing results. The honesty improvement makes this actually trustworthy.
- Available via /fast in Claude Code: Toggle it on with a single command. No configuration overhead.
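The plan, fan-out, verify pattern described above can be sketched in a few lines of plain Python. This is a toy illustration of the general shape, not how Claude Code implements Dynamic Workflows: `plan`, `run_subagent`, `run_workflow`, and the "only .py files pass" verification rule are all hypothetical stand-ins, and a thread pool takes the place of real subagents.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Result:
    unit: str       # comma-joined file names the worker handled
    output: str     # what the worker reports
    verified: bool  # whether the worker's own check passed


def plan(files, chunk_size):
    """Plan step: split the file list into independent work units."""
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]


def run_subagent(unit):
    """Stand-in for one subagent; a real one would edit files and run tests."""
    output = f"processed {len(unit)} files"
    verified = all(name.endswith(".py") for name in unit)  # toy verification rule
    return Result(unit=",".join(unit), output=output, verified=verified)


def run_workflow(files, chunk_size=2, max_workers=8):
    """Fan the units out in parallel, then collect anything that failed verification."""
    units = plan(files, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_subagent, units))  # map preserves unit order
    flagged = [r for r in results if not r.verified]
    return results, flagged


results, flagged = run_workflow(["a.py", "b.py", "c.py", "d.txt", "e.py"])
```

The point of returning `flagged` separately is the same one the article makes: uncertain units are surfaced instead of being folded into an optimistic summary.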
The practical implication: migrations and refactors that previously took a team a sprint — coordinating individually, reviewing each other's changes, managing merge conflicts — can now be issued as a single instruction to Opus 4.8 with Dynamic Workflows. Whether that replaces human coordination entirely is a different question, but it significantly lowers the activation energy for large-scope changes.
Effort Control: Finally, a Cost Knob That Makes Sense
Also launching with Opus 4.8: an effort control UI across claude.ai, Cowork, and Claude Code. You can now set the model's thinking intensity from Low → Medium → High → xHigh → Max.
Opus 4.8 defaults to High for the best balance of quality and cost. Running Low on simple tasks (formatting, summarizing, short rewrites) and Max on hard reasoning (architecture design, complex debugging, financial modelling) is the discipline that cuts your monthly bill without touching output quality where it matters.
This has always been technically available via the API's extended thinking tokens, but surfacing it as a first-class UI toggle makes it accessible to developers who aren't deep in the docs — and it's the right abstraction. Most users shouldn't be tuning token budgets manually; they should be picking effort levels.
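The "Low for simple work, Max for hard reasoning" discipline can be written down as a small lookup, which is handy if you want a team-wide default. The level names mirror the UI labels in this article; the lowercase identifiers and the task categories are assumptions made for illustration and are not API parameter values.

```python
# UI effort levels named in the article, lowest to highest (identifiers are hypothetical).
EFFORT_LEVELS = ["low", "medium", "high", "xhigh", "max"]
DEFAULT_EFFORT = "high"  # the article says Opus 4.8 defaults to High

# Task kinds taken from the article's examples; the keys themselves are illustrative.
TASK_EFFORT = {
    "formatting": "low",
    "summarizing": "low",
    "short_rewrite": "low",
    "architecture_design": "max",
    "complex_debugging": "max",
    "financial_modelling": "max",
}


def pick_effort(task_kind):
    """Return the effort level for a task kind, falling back to the default."""
    return TASK_EFFORT.get(task_kind, DEFAULT_EFFORT)
```

Anything not listed falls back to High, so the table only has to name the cases where you deliberately want to spend less or more.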
Pricing: Same Standard Rate, Much Cheaper Fast Mode
| Tier | Input / 1M tokens | Output / 1M tokens | Notes |
|---|---|---|---|
| Opus 4.8 Standard | $5.00 | $25.00 | Unchanged from Opus 4.7 |
| Opus 4.8 Fast mode | $10.00 | $50.00 | 3× cheaper than Opus 4.7 Fast ($30/$150) |
| Opus 4.7 Fast (prev) | $30.00 | $150.00 | Now superseded by 4.8 Fast |
| GPT-5.5 (approx.) | ~$10.00 | ~$30.00 | Different capabilities profile |
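For a feel of what these rates mean on a real bill, here is a small sketch that prices one workload under each tier. The per-million-token prices come from the table above; the tier keys, the function name, and the example workload (2M input tokens, 500K output tokens) are made up for illustration.

```python
# (input, output) prices in USD per 1M tokens, copied from the pricing table.
PRICES = {
    "standard": (5.00, 25.00),
    "fast": (10.00, 50.00),
    "opus_4_7_fast": (30.00, 150.00),
}


def cost_usd(tier, input_tokens, output_tokens):
    """Return the cost in USD for a workload at the given tier."""
    input_price, output_price = PRICES[tier]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
standard = cost_usd("standard", 2_000_000, 500_000)       # 22.5
fast = cost_usd("fast", 2_000_000, 500_000)               # 45.0
old_fast = cost_usd("opus_4_7_fast", 2_000_000, 500_000)  # 135.0
```

Fast mode still costs twice the standard rate, so the 3× saving applies only if you were already paying for Fast on Opus 4.7.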
The context window holds at 1M tokens on the Claude API, Amazon Bedrock, and Google Cloud Vertex AI, with 128K maximum output tokens. Microsoft Foundry gets 200K context at launch. The model ID is claude-opus-4-8 with a claude-opus-4-8[1m] variant where the full context window is needed explicitly.
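If you deploy across more than one platform, a pre-flight check against these limits can save a failed request. The sketch below rests on several assumptions: that 1M, 200K, and 128K mean 1,000,000, 200,000, and 128,000 tokens; that input and output share the context window; and that the 128K output cap also applies on Microsoft Foundry, which the launch details here do not state. The platform keys and the function name are illustrative.

```python
# Context windows per platform as reported above (exact token counts are assumed).
CONTEXT_WINDOW = {
    "claude_api": 1_000_000,
    "bedrock": 1_000_000,
    "vertex_ai": 1_000_000,
    "foundry": 200_000,
}
MAX_OUTPUT_TOKENS = 128_000  # assumed to apply on every platform


def fits(platform, input_tokens, output_tokens):
    """Return True if the request fits the output cap and the platform's window."""
    if output_tokens > MAX_OUTPUT_TOKENS:
        return False
    # Assumption: input and output tokens count against the same window.
    return input_tokens + output_tokens <= CONTEXT_WINDOW[platform]
```

For example, a 600K-token prompt with a 100K-token response passes this check on the Claude API but not on Foundry.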
Where Opus 4.8 Wins (and Where It Doesn't)
Where Opus 4.8 leads
Agentic coding at scale. The SWE-bench Pro gap is significant — 69.2% vs 58.6% for GPT-5.5. That's a 10.6 percentage point advantage on a benchmark specifically designed to test multi-language, real-repository problem solving. GPT-5.5 can't close this by tuning effort levels.
Computer use. 83.4% on OSWorld-Verified is the highest of any model in this benchmark cycle. For teams exploring autonomous agents that control actual desktops and web interfaces — not just coding tasks — Opus 4.8 is the clear choice.
Knowledge work at the frontier. The GDPval-AA Elo gap (1890 vs 1769 for GPT-5.5, 1314 for Gemini 3.1 Pro) suggests Opus 4.8 has a meaningful advantage on the kind of synthesis, research, and reasoning tasks that don't reduce to a pass/fail test.
Where it loses
Terminal-bench coding. GPT-5.5 scores 78.2% to Opus 4.8's 74.6% on Terminal-Bench 2.1 — the agentic terminal coding benchmark. That's a genuine gap, not noise. For workflows that are heavily terminal-focused, GPT-5.5 in its current form is the stronger pick on this specific metric.
Chinese models: DeepSeek V4-Pro, Qwen 3.7 Max, Kimi 2.6
The strongest open-weight alternative in May 2026 is DeepSeek V4-Pro — the first open-weight model to credibly compete on agentic benchmarks with Western frontier models. It trails Opus 4.8 on SWE-bench Pro and computer use, but its cost structure is significantly different: for routine coding tasks at scale, DeepSeek V4-Pro remains the most cost-efficient option available with self-hosting.
Qwen 3.7 Max from Alibaba performs well on math reasoning tasks (Qwen3's hybrid thinking mode is a genuine strength there) but trails on the practical coding and agentic benchmarks where Opus 4.8 has been specifically optimized. Kimi 2.6 showed competitive results in the AI Coding Daily leaderboard we covered in our Composer 2.5 review but sits well below Opus 4.8 on the more demanding benchmarks.
What's Next: Claude Mythos
Anthropic confirmed that a more powerful model — codenamed Claude Mythos — is in the pipeline. Details are scarce, but Anthropic has noted it's already being used by select organizations in cybersecurity contexts, suggesting it's in limited production rather than just internal testing.
The framing from Anthropic: Mythos-class intelligence will be meaningfully superior to Opus 4.8. Given that Opus 4.8 already leads six of seven published benchmarks at the frontier, that's a significant claim. The release cadence this year — Opus 4.6 in February, 4.7 in April, 4.8 in May — suggests Anthropic is comfortable shipping frequently, which means Mythos could land sooner than a traditional major-version timeline would imply.
Editorial Verdict
Opus 4.8 is the best general-purpose agentic coding model available as of May 28, 2026. The benchmark lead on SWE-bench Pro is real and meaningful. The honesty improvement is arguably more valuable in production than any benchmark number — a model that accurately reports uncertainty is fundamentally more trustworthy for long-running automated tasks. Dynamic Workflows changes the scope of what a single engineer can orchestrate without oversight. And getting all of that at the same price as 4.7, with a 3× cheaper Fast mode, makes the upgrade decision essentially automatic for current Opus users.
What we like
The honesty improvement is the feature we didn't know we needed. The SWE-bench Pro lead is substantial and practically relevant. Dynamic Workflows is the right abstraction for large-scale agentic tasks. Effort control finally surfaces extended thinking as a UI-level concept. Fast mode pricing is now actually competitive.
What to watch
The Terminal-Bench gap with GPT-5.5 is real. The slight regression in agentic prompt-injection robustness (a 9.6% attack success rate, up from 6.0%) is worth noting for security-sensitive deployments. And Claude Mythos is close enough that for non-urgent platform decisions, it's worth knowing it's coming. One more thing: we don't yet have full independent third-party confirmation of all claimed benchmark numbers. Anthropic's figures are from their own evaluations, and third-party validation on SWE-Bench Pro specifically can differ from internally run numbers.
Start Building with Claude Opus 4.8
Available now on claude.ai, Claude Code, the API (claude-opus-4-8), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

