69.2% SWE-Bench Pro

83.4% Computer use

4x Fewer hidden bugs

3x Cheaper Fast mode

1890 GDPval-AA Elo

Anthropic shipped Claude Opus 4.8 on May 28, 2026 — exactly 41 days after Opus 4.7 — and the most interesting thing about it isn't the benchmark numbers. It's what the model does when it doesn't know the answer. Where previous models confidently declared victory and shipped broken code, Opus 4.8 flags the uncertainty. It catches its own bugs. It tells you when it's stuck. That behavioral shift, described by Anthropic as a fourfold reduction in unreported code flaws, is the real signal of what changed under the hood.

The benchmark numbers are real too. Opus 4.8 leads on agentic coding (69.2% SWE-bench Pro), autonomous computer use (83.4% OSWorld-Verified), knowledge work (1890 GDPval-AA Elo), financial analysis (53.9% Finance Agent v2), and multidisciplinary reasoning, both with and without tools. The one category where it loses is agentic terminal coding: on Terminal-Bench 2.1, GPT-5.5 retains a narrow edge at 78.2% vs Opus 4.8's 74.6%.

Alongside the model itself, Anthropic launched two companion features: Dynamic Workflows in Claude Code (research preview), which lets Claude orchestrate hundreds of parallel subagents for massive tasks, and effort control across claude.ai and Cowork, letting you dial how hard the model thinks from Low through Max. Fast mode pricing also dropped to $10/$50 per million tokens (input/output), down from $30/$150 on Opus 4.7: a threefold price cut for a mode that runs the same model at 2.5× the speed.

Benchmark Breakdown

The chart below shows how Opus 4.8 stacks up against Opus 4.7, Opus 4.6, GPT-5.5, Gemini 3.1 Pro, and leading Chinese models across the benchmarks Anthropic published at launch.

AI coding model benchmark comparison

Claude Opus 4.8 vs frontier and open-weight competitors


SWE-Bench Pro (%) - Agentic coding benchmark for real multi-language repositories.

Opus 4.8 69.2%

Opus 4.7 64.3%

DeepSeek V4-Pro (est.) 62%

GPT-5.5 58.6%

Qwen 3.7 Max (est.) 57%

Opus 4.6 (est.) 55%

Gemini 3.1 Pro 54.2%

Gemini 3.5 Flash (est.) 49.8%

Kimi 2.6 (est.) 47.6%

Terminal-Bench 2.1 (%) - Agentic terminal coding benchmark where GPT-5.5 keeps a narrow lead.

GPT-5.5 78.2%

Opus 4.8 74.6%

Gemini 3.1 Pro 70.3%

Opus 4.7 66.1%

DeepSeek V4-Pro (est.) 65%

Qwen 3.7 Max (est.) 61%

Opus 4.6 (est.) 58%

OSWorld-Verified (%) - Autonomous desktop and web navigation benchmark.

Opus 4.8 83.4%

Opus 4.7 82.8%

GPT-5.5 78.7%

Gemini 3.1 Pro 76.2%

DeepSeek V4-Pro (est.) 70%

Opus 4.6 (est.) 68%

HLE with Tools (%) - Multidisciplinary reasoning with tool access enabled.

Opus 4.8 57.9%

Opus 4.7 54.7%

GPT-5.5 52.2%

Gemini 3.1 Pro 51.4%

Qwen 3.7 Max (est.) 50%

DeepSeek V4-Pro (est.) 48%

Opus 4.6 (est.) 46%

Kimi 2.6 (est.) 40%

Estimated rows are marked with "(est.)" and should be read as directional third-party or trajectory estimates, not official Anthropic numbers. Sources: Anthropic launch notes and AI Coding Daily.

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
Agentic coding — SWE-Bench Pro	69.2%	64.3%	58.6%	54.2%
Agentic terminal coding — Terminal-Bench 2.1	74.6%	66.1%	78.2% ★	70.3%
Multidisciplinary reasoning — HLE (no tools)	49.8%	46.9%	41.4%	44.4%
Multidisciplinary reasoning — HLE (with tools)	57.9%	54.7%	52.2%	51.4%
Agentic computer use — OSWorld-Verified	83.4%	82.8%	78.7%	76.2%
Knowledge work — GDPval-AA (Elo)	1890	1753	1769	1314
Agentic financial analysis — Finance Agent v2	53.9%	51.5%	51.8%	43.0%
SWE-Bench Verified	88.6%	87.6%	—	—

★ GPT-5.5 leads on Terminal-Bench 2.1. Opus 4.8 leads every other row that has competitor scores. Source: Anthropic, May 2026.
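To make the table easier to check, here is a small sketch that recomputes Opus 4.8's margin over the best non-Anthropic model in each row. The scores are copied from the table. The helper name `margin_over_best_rival` and the decision to skip SWE-Bench Verified (no competitor scores listed) are choices made for this illustration only.

```python
# Scores copied from the comparison table above. GDPval-AA is in Elo points,
# every other row is a percentage. SWE-Bench Verified is omitted because the
# table lists no competitor scores for it.
SCORES = {
    "SWE-Bench Pro": {"Opus 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "Terminal-Bench 2.1": {"Opus 4.8": 74.6, "GPT-5.5": 78.2, "Gemini 3.1 Pro": 70.3},
    "HLE (no tools)": {"Opus 4.8": 49.8, "GPT-5.5": 41.4, "Gemini 3.1 Pro": 44.4},
    "HLE (with tools)": {"Opus 4.8": 57.9, "GPT-5.5": 52.2, "Gemini 3.1 Pro": 51.4},
    "OSWorld-Verified": {"Opus 4.8": 83.4, "GPT-5.5": 78.7, "Gemini 3.1 Pro": 76.2},
    "GDPval-AA (Elo)": {"Opus 4.8": 1890.0, "GPT-5.5": 1769.0, "Gemini 3.1 Pro": 1314.0},
    "Finance Agent v2": {"Opus 4.8": 53.9, "GPT-5.5": 51.8, "Gemini 3.1 Pro": 43.0},
}


def margin_over_best_rival(row, model="Opus 4.8"):
    """Return (best rival name, model score minus best rival score)."""
    rivals = {name: score for name, score in row.items() if name != model}
    best = max(rivals, key=rivals.get)
    # Round to one decimal to hide floating-point noise in the subtraction.
    return best, round(row[model] - rivals[best], 1)


MARGINS = {bench: margin_over_best_rival(row) for bench, row in SCORES.items()}
LEADS = [bench for bench, (_, margin) in MARGINS.items() if margin > 0]

for bench, (rival, margin) in MARGINS.items():
    print(f"{bench:20s} {margin:+7.1f} vs {rival}")
print(f"Opus 4.8 leads {len(LEADS)} of {len(MARGINS)} rows")
```

A positive margin means Opus 4.8 is ahead of its nearest listed rival, and the only negative margin is the Terminal-Bench 2.1 row.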

The GDPval-AA Elo gap is the starkest number in this table. Opus 4.8 leads GPT-5.5 by 121 Elo points on knowledge work — that's a substantial margin at this tier. Gemini 3.1 Pro trails by 576 Elo points on the same benchmark.
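For a rough sense of what those gaps mean, the sketch below converts an Elo difference into an expected head-to-head score. It assumes GDPval-AA ratings follow the classic Elo logistic curve (base 10, 400-point scale). That is an assumption made here for illustration, not something the launch material states.

```python
def elo_expected_score(gap: float, scale: float = 400.0) -> float:
    """Expected score for the higher-rated side under the classic Elo curve.

    ASSUMPTION: base-10 logistic with a 400-point scale, as in chess Elo.
    GDPval-AA may use a different scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-gap / scale))


gap_gpt = 1890 - 1769      # Opus 4.8 vs GPT-5.5
gap_gemini = 1890 - 1314   # Opus 4.8 vs Gemini 3.1 Pro

print(f"{gap_gpt} Elo -> expected score {elo_expected_score(gap_gpt):.3f}")
print(f"{gap_gemini} Elo -> expected score {elo_expected_score(gap_gemini):.3f}")
```

Under that assumption, a 121-point gap works out to roughly a two-in-three expected score in pairwise comparisons, which is why it reads as substantial rather than marginal.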

The Honesty Story Is the Real Headline

Benchmarks are one thing. Behavior is another. The reason Anthropic emphasized honesty in the Opus 4.8 launch isn't marketing — it maps to a real problem that anyone who has used agentic coding tools for more than a few weeks has run into.

"Instead of inventing progress like other models, it tells you when it's not sure."

— Developer reaction on X, day of launch

Previous models — across labs, not just Anthropic — had a tendency to declare tasks complete when they weren't. They'd run into an error, attempt a workaround that didn't quite work, and then summarize optimistically. You'd come back to check progress and find the feature half-built but reported as done.

Opus 4.8 is reportedly four times less likely than Opus 4.7 to let a code flaw pass without flagging it. That's an alignment improvement that shows up in Anthropic's internal assessments, and it's the kind of change that matters more in production than a 2-3 point benchmark swing.

One notable caveat from the Opus 4.8 system card: agentic prompt-injection robustness is slightly weaker than in Opus 4.7. Gray Swan red-teaming shows an attack success rate of ~9.6% for Opus 4.8 vs 6.0% for 4.7. If you're running Opus 4.8 in agentic pipelines with untrusted input, review your sandboxing approach.
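To put those two rates in perspective, here is a back-of-the-envelope sketch. It treats the quoted figures as per-attempt success rates and treats attempts as independent. Both are simplifying assumptions made for this illustration, and the system card may define the metric differently.

```python
ASR_OPUS_48 = 0.096  # ~9.6% attack success rate quoted above
ASR_OPUS_47 = 0.060  # 6.0% attack success rate quoted above


def relative_increase(new: float, old: float) -> float:
    """Fractional change from old to new (0.5 means 50% higher)."""
    return (new - old) / old


def p_any_success(per_attempt: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries succeeds.

    ASSUMPTION: attempts are independent with the same success rate.
    """
    return 1.0 - (1.0 - per_attempt) ** attempts


print(f"absolute change: {(ASR_OPUS_48 - ASR_OPUS_47) * 100:.1f} points")
print(f"relative change: {relative_increase(ASR_OPUS_48, ASR_OPUS_47):.0%}")
for n in (1, 10, 50):
    print(n, round(p_any_success(ASR_OPUS_48, n), 3), round(p_any_success(ASR_OPUS_47, n), 3))
```

The point of the toy model is that a small per-attempt difference compounds when an attacker can retry, which is why sandboxing matters more than the headline rate.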

Dynamic Workflows: The Biggest Feature

Dynamic Workflows in Claude Code (currently in research preview) is the companion launch that's arguably more impactful than the model upgrade itself for teams with large codebases.

Here's how it works: for sufficiently complex tasks, Claude doesn't just run linearly. It makes a plan, fans out hundreds of parallel subagents across independent workstreams, and then verifies their outputs before reporting back. Think a database schema migration touching 150 files, a full test suite generation across a monorepo, or a security audit that needs to check every API endpoint.

- Plan mode first: Claude writes the architecture of the task (which subagents will handle which parts, what they need to produce, and how success will be verified) before any code is written.
- Hundreds of parallel subagents: each one works independently on its assigned portion. Your main session stays free while the background agents execute. You can check in at any time or walk away entirely.
- Verification before reporting: agents don't just finish and summarize. They run the tests, check for regressions, and flag anything uncertain before surfacing results. The honesty improvement makes this actually trustworthy.
- Available via /fast in Claude Code: Toggle it on with a single command. No configuration overhead.
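The plan, fan-out, verify shape described above can be sketched in a few lines of Python. This is a generic illustration of the pattern, not how Claude Code implements Dynamic Workflows. The names `plan`, `run_subagent`, `verify`, and `orchestrate`, and the toy "legacy_" failure rule, are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Result:
    unit: str
    output: str
    verified: bool


def plan(task_files):
    """Step 1: split the task into independent work units (one per file here)."""
    return list(task_files)


def run_subagent(unit):
    """Step 2: hypothetical stand-in for a subagent working on one unit."""
    return f"migrated {unit}"


def verify(unit, output):
    """Step 3: hypothetical check; a real one would run tests for the unit.

    The toy rule fails anything named legacy_* so the flagging path is visible.
    """
    return output.endswith(unit) and not unit.startswith("legacy_")


def orchestrate(task_files, max_workers=8):
    units = plan(task_files)
    # Fan out: pool.map runs units concurrently and preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(run_subagent, units))
    results = [Result(u, o, verify(u, o)) for u, o in zip(units, outputs)]
    # Report anything that failed verification instead of summarizing it away.
    flagged = [r.unit for r in results if not r.verified]
    return results, flagged


results, flagged = orchestrate(["users.sql", "orders.sql", "legacy_audit.sql"])
print(f"{len(results) - len(flagged)} verified, flagged for review: {flagged}")
```

The design choice worth noting is that verification is a separate pass over every result, so an unverified unit is surfaced explicitly rather than folded into an optimistic summary.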

The practical implication: migrations and refactors that previously took a team a sprint — coordinating individually, reviewing each other's changes, managing merge conflicts — can now be issued as a single instruction to Opus 4.8 with Dynamic Workflows. Whether that replaces human coordination entirely is a different question, but it significantly lowers the activation energy for large-scope changes.

Dynamic Workflows pairs particularly well with the Super Agent benchmark result, where Opus 4.8 is reportedly the only model to complete every case end-to-end — even GPT-5.5 couldn't match it. That's not just a benchmark win; it suggests the parallel orchestration is actually more reliable than sequential approaches for complex tasks.

Effort Control: Finally, a Cost Knob That Makes Sense

Also launching with Opus 4.8: an effort control UI across claude.ai, Cowork, and Claude Code. You can now set the model's thinking intensity from Low → Medium → High → xHigh → Max.

Opus 4.8 defaults to High for the best balance of quality and cost. Running Low on simple tasks (formatting, summarizing, short rewrites) and Max on hard reasoning (architecture design, complex debugging, financial modelling) is the discipline that cuts your monthly bill without touching output quality where it matters.
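One way to turn that discipline into a habit is a small routing table that maps task types to effort levels. The sketch below is illustrative only. The lowercase level strings and task names are hypothetical labels chosen here, not documented API parameter values.

```python
# Effort levels as listed above, lowest to highest. The lowercase spellings
# are labels for this sketch, NOT confirmed API values.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")
DEFAULT_EFFORT = "high"  # the article says Opus 4.8 defaults to High

# Hypothetical task categories, taken from the examples in the paragraph above.
TASK_EFFORT = {
    "formatting": "low",
    "summarizing": "low",
    "short_rewrite": "low",
    "architecture_design": "max",
    "complex_debugging": "max",
    "financial_modelling": "max",
}


def pick_effort(task_kind: str) -> str:
    """Return the effort level for a task, falling back to the default."""
    effort = TASK_EFFORT.get(task_kind, DEFAULT_EFFORT)
    assert effort in EFFORT_LEVELS
    return effort


for kind in ("formatting", "complex_debugging", "code_review"):
    print(kind, "->", pick_effort(kind))
```

Anything not in the table falls through to the default, so the mapping only needs entries for tasks that are clearly cheap or clearly hard.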

This has always been technically available via the API's extended thinking tokens, but surfacing it as a first-class UI toggle makes it accessible to developers who aren't deep in the docs — and it's the right abstraction. Most users shouldn't be tuning token budgets manually; they should be picking effort levels.

Pricing: Same Standard Rate, Much Cheaper Fast Mode

Tier	Input / 1M tokens	Output / 1M tokens	Notes
Opus 4.8 Standard	$5.00	$25.00	Unchanged from Opus 4.7
Opus 4.8 Fast mode	$10.00	$50.00	3× cheaper than Opus 4.7 Fast ($30/$150)
Opus 4.7 Fast (prev)	$30.00	$150.00	Now superseded by 4.8 Fast
GPT-5.5 (approx.)	~$10.00	~$30.00	Different capabilities profile
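To see what the table means per request, here is a minimal cost calculator using the Anthropic rates listed above. The workload of 200,000 input tokens and 20,000 output tokens is a made-up example, and the helper name `request_cost` is hypothetical.

```python
# USD per 1M tokens as (input, output), copied from the pricing table above.
PRICES = {
    "Opus 4.8 Standard": (5.00, 25.00),
    "Opus 4.8 Fast": (10.00, 50.00),
    "Opus 4.7 Fast": (30.00, 150.00),
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the given tier."""
    in_rate, out_rate = PRICES[tier]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Hypothetical workload: a large-context request with a moderate response.
INPUT_TOKENS, OUTPUT_TOKENS = 200_000, 20_000

for tier in PRICES:
    print(f"{tier:18s} ${request_cost(tier, INPUT_TOKENS, OUTPUT_TOKENS):.2f}")
```

Because both the input and output rates fall by the same factor, the 3× saving on Fast mode holds regardless of the input/output mix.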

The context window holds at 1M tokens on the Claude API, Amazon Bedrock, and Google Cloud Vertex AI, with 128K maximum output tokens. Microsoft Foundry gets 200K context at launch. The model ID is claude-opus-4-8, with a claude-opus-4-8[1m] variant for cases where the full context window has to be requested explicitly.

Fast mode access is currently in research preview on the Claude API. Enable it with /fast in Claude Code, or contact your account manager / join the waitlist at claude.com/fast-mode for API access.

Where Opus 4.8 Wins (and Where It Doesn't)

Where Opus 4.8 leads

Agentic coding at scale. The SWE-bench Pro gap is significant — 69.2% vs 58.6% for GPT-5.5. That's a 10.6 percentage point advantage on a benchmark specifically designed to test multi-language, real-repository problem solving. GPT-5.5 can't close this by tuning effort levels.

Computer use. 83.4% on OSWorld-Verified is the highest of any model in this benchmark cycle. For teams exploring autonomous agents that control actual desktops and web interfaces — not just coding tasks — Opus 4.8 is the clear choice.

Knowledge work at the frontier. The GDPval-AA Elo gap (1890 vs 1769 for GPT-5.5, 1314 for Gemini 3.1 Pro) suggests Opus 4.8 has a meaningful advantage on the kind of synthesis, research, and reasoning tasks that don't reduce to a pass/fail test.

Where it loses

Terminal-bench coding. GPT-5.5 scores 78.2% to Opus 4.8's 74.6% on Terminal-Bench 2.1 — the agentic terminal coding benchmark. That's a genuine gap, not noise. For workflows that are heavily terminal-focused, GPT-5.5 in its current form is the stronger pick on this specific metric.

Chinese models: DeepSeek V4-Pro, Qwen 3.7 Max, Kimi 2.6

The strongest open-weight alternative in May 2026 is DeepSeek V4-Pro — the first open-weight model to credibly compete on agentic benchmarks with Western frontier models. It trails Opus 4.8 on SWE-bench Pro and computer use, but its cost structure is significantly different: for routine coding tasks at scale, DeepSeek V4-Pro remains the most cost-efficient option available with self-hosting.

Qwen 3.7 Max from Alibaba performs well on math reasoning tasks (Qwen3's hybrid thinking mode is a genuine strength there) but trails on the practical coding and agentic benchmarks where Opus 4.8 has been specifically optimized. Kimi 2.6 showed competitive results in the AI Coding Daily leaderboard we covered in our Composer 2.5 review but sits well below Opus 4.8 on the more demanding benchmarks.

The benchmark chart above includes approximate SWE-bench Pro scores for Opus 4.6 (estimated ~55%, based on trajectory), DeepSeek V4-Pro (~62%, reported), Qwen 3.7 Max (~57%, reported), Gemini 3.5 Flash (~49.8%), and Kimi 2.6 (~47.6%, from AI Coding Daily). These are sourced from available third-party evaluations and are marked "(est.)" in the chart.

What's Next: Claude Mythos

Anthropic confirmed that a more powerful model — codenamed Claude Mythos — is in the pipeline. Details are scarce, but Anthropic has noted it's already being used by select organizations in cybersecurity contexts, suggesting it's in limited production rather than just internal testing.

The framing from Anthropic: Mythos-class intelligence will be meaningfully superior to Opus 4.8. Given that Opus 4.8 already leads six of seven published benchmarks at the frontier, that's a significant claim. The release cadence this year — Opus 4.6 in February, 4.7 in April, 4.8 in May — suggests Anthropic is comfortable shipping frequently, which means Mythos could land sooner than a traditional major-version timeline would imply.

If your use case is non-time-sensitive and you're evaluating whether to build on Opus 4.8 vs wait, the Mythos announcement is worth factoring in. For active production needs today, Opus 4.8 is the obvious choice. For long-term platform decisions, a 60–90 day pause may be worth it.

Editorial Verdict

4.9/5


Opus 4.8 is the best general-purpose agentic coding model available as of May 28, 2026. The benchmark lead on SWE-bench Pro is real and meaningful. The honesty improvement is arguably more valuable in production than any benchmark number — a model that accurately reports uncertainty is fundamentally more trustworthy for long-running automated tasks. Dynamic Workflows changes the scope of what a single engineer can orchestrate without oversight. And getting all of that at the same price as 4.7, with a 3× cheaper Fast mode, makes the upgrade decision essentially automatic for current Opus users.

What we like

The honesty improvement is the feature we didn't know we needed. The SWE-bench Pro lead is substantial and practically relevant. Dynamic Workflows is the right abstraction for large-scale agentic tasks. Effort control finally surfaces extended thinking as a UI-level concept. Fast mode pricing is now actually competitive.

What to watch

The terminal-bench gap with GPT-5.5 is real. The slight regression in agentic prompt-injection robustness (9.6% vs 6.0% attack success rate) is worth noting for security-sensitive deployments. And Claude Mythos is close enough that for non-urgent platform decisions, it's worth knowing it's coming. One more thing: we don't yet have full independent third-party confirmation of all claimed benchmark numbers — Anthropic's figures are from their own evaluations, and third-party validation on SWE-bench Pro specifically can differ from internally-run numbers.

Start Building with Claude Opus 4.8

Available now on claude.ai, Claude Code, the API (claude-opus-4-8), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.


Claude Opus 4.8 Review: The Most Honest Coding Model Anthropic Has Built
