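Anthropic ships Claude Opus 4.8 as a "modest but tangible improvement" that tops GPT-5.5 in most benchmarks

2026-05-29 05:20 · Matthias Bastian

AI Summary

Anthropic has released its latest model, Claude Opus 4.8. The model outperforms GPT-5.5 and Gemini 3.1 Pro on most benchmarks. According to Anthropic, it lets code bugs slip through without comment about four times less often than its predecessor. Alongside the model, Anthropic is launching dynamic workflows, which can spin up hundreds of parallel subagents for tasks such as codebase-wide migrations.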

Key Points

Anthropic has released Claude Opus 4.8, a new AI language model that the company claims outperforms competitors like OpenAI's GPT-5.5 across most benchmarks, while also communicating its own uncertainties better.

Anthropic is also introducing dynamic workflows, which let the model schedule tasks and launch hundreds of parallel subagents, along with a new control that lets users decide how much effort the model puts into generating a response.

API pricing remains unchanged from its predecessor, Opus 4.7, at $5 per million input tokens and $25 per million output tokens.

Anthropic's latest flagship model, Claude Opus 4.8, leads most benchmarks and is designed to be more upfront about its own mistakes.

Anthropic says Opus 4.8 beats its predecessor as well as OpenAI's GPT-5.5 and Google's Gemini 3.1 Pro across most tested categories. On agentic coding (SWE-Bench Pro), the model hits 69.2 percent, up from 64.3 percent for Opus 4.7 and ahead of GPT-5.5 at 58.6 percent. For multidisciplinary reasoning (Humanity's Last Exam), Opus 4.8 scores 49.8 percent without tools and 57.9 percent with tools, the highest marks in the field.

Less fake progress, more honesty

Anthropic calls the model's improved honesty one of its most noticeable upgrades. AI models have a habit of jumping to conclusions and claiming progress that falls apart on closer inspection, and the problem is widespread.

"Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims," Anthropic says. The company backs that up with its own coding evaluations, where the model lets bugs slip through without comment about four times less often than Opus 4.7.

The model also sets new highs on prosocial traits like supporting user autonomy. Deception attempts and other unaligned behavior are said to be at Claude Mythos levels. Details are in the Claude Opus 4.8 System Card. The first Mythos-class models are expected to roll out to all customers in the coming weeks, once all safety measures are in place, the company says.

Dynamic workflows and effort controls steal the show

The new features Anthropic shipped alongside the model may matter more than the model update itself, which the company calls "modest but tangible."

The biggest is "dynamic workflows." The model can plan a task and then spin up hundreds of parallel subagents in a single session. Anthropic says Claude Code with Opus 4.8 can now handle codebase-wide migrations across hundreds of thousands of lines, from planning all the way to merge. The feature is available on Enterprise, Team, and Max plans.
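
To make the plan-then-fan-out pattern concrete, here is a minimal sketch in Python. It is not Anthropic's implementation, and Anthropic has not published how dynamic workflows work internally. The names `run_subagent` and `run_workflow`, the shard names, and the concurrency cap of 100 are all hypothetical, and the subagent is a placeholder that does no real work.

```python
import asyncio


async def run_subagent(shard: str) -> str:
    # Hypothetical stand-in for one subagent; a real one would call a model
    # and edit files. Here it only yields control to the event loop.
    await asyncio.sleep(0)
    return f"migrated {shard}"


async def run_workflow(shards: list[str], max_parallel: int = 100) -> list[str]:
    # Cap how many subagents run at once (the cap value is an assumption).
    limit = asyncio.Semaphore(max_parallel)

    async def bounded(shard: str) -> str:
        async with limit:
            return await run_subagent(shard)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(s) for s in shards))


# "Plan": split the job into 300 independent shards (hypothetical names).
shards = [f"module_{i}" for i in range(300)]
results = asyncio.run(run_workflow(shards))
print(len(results), results[0])  # prints "300 migrated module_0"
```

The semaphore is the main design choice in this sketch: it lets the plan contain hundreds of units of work while bounding how many run at the same time.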

On claude.ai and in Cowork, there's now an effort control next to the model picker. It lets you decide how hard Claude works on a given response. Crank it up for deeper thinking and better results. Turn it down for faster answers that use less of your rate limit.

Opus 4.8 defaults to "high." For tough tasks, Anthropic recommends "extra" (called "xhigh" in Claude Code) or "max." These modes burn more tokens, but Anthropic says higher rate limits for Claude Code users help offset that. Anthropic's advice is to just pick whatever level feels right for the task.

API prices stay the same, fast mode gets cheaper

Fast Mode, which runs Opus 4.8 at 2.5x speed, now costs a third of what it did for earlier models. Pricing sits at $10 per million input tokens and $50 per million output tokens.

Standard prices are unchanged from Opus 4.7: $5 per million input tokens and $25 per million output tokens. But 4.7 was already about 30 to 40 percent pricier in practice than its predecessor, 4.6, because it chewed through more tokens without delivering noticeable gains on many everyday tasks.
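
As a rough illustration of what the quoted rates mean for a bill, the following sketch prices one workload at both the standard and the Fast Mode rates. The per-million-token prices come from the figures above. The workload of 2 million input tokens and 500,000 output tokens is a made-up example, and the `Pricing` class is a hypothetical helper, not part of any Anthropic SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        total = (input_tokens * self.input_per_million
                 + output_tokens * self.output_per_million)
        return total / 1_000_000


STANDARD = Pricing(input_per_million=5.0, output_per_million=25.0)
FAST = Pricing(input_per_million=10.0, output_per_million=50.0)

# Hypothetical workload, chosen only for illustration.
input_tokens, output_tokens = 2_000_000, 500_000

standard_cost = STANDARD.cost(input_tokens, output_tokens)
fast_cost = FAST.cost(input_tokens, output_tokens)
print(f"standard: ${standard_cost:.2f}, fast: ${fast_cost:.2f}")
# prints "standard: $22.50, fast: $45.00"
```

At these rates Fast Mode costs exactly twice the standard price for any mix of input and output tokens.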

Opus 4.8 might actually cost less to run

According to Artificial Analysis, Opus 4.8 could ease that 4.7 price bump. On the GDPval-AA benchmark, which tests real-world knowledge work tasks, the model needs 15 percent fewer passes per task and 35 percent fewer output tokens than Opus 4.7.

In practice, that could mean noticeably lower costs. But Opus 4.8 still uses roughly 30 percent more passes than OpenAI's GPT-5.5, the second-place model.
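
A back-of-the-envelope sketch shows how the reported 35 percent drop in output tokens would affect output spending at the unchanged $25 rate. The baseline of 100,000 output tokens per task is a hypothetical figure, not a number from Artificial Analysis. Input tokens are left out because the article gives no comparison for them.

```python
OUTPUT_PRICE_PER_MILLION = 25.0  # USD, standard output rate for Opus 4.7 and 4.8
OUTPUT_TOKEN_REDUCTION_PCT = 35  # reduction reported on GDPval-AA


def output_cost(output_tokens: int) -> float:
    return output_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000


baseline_tokens = 100_000  # hypothetical Opus 4.7 output tokens for one task
reduced_tokens = baseline_tokens * (100 - OUTPUT_TOKEN_REDUCTION_PCT) // 100

baseline_cost = output_cost(baseline_tokens)
reduced_cost = output_cost(reduced_tokens)
print(f"Opus 4.7: ${baseline_cost:.3f}  Opus 4.8: ${reduced_cost:.3f}")
# prints "Opus 4.7: $2.500  Opus 4.8: $1.625"
```

Because the per-token price is the same for both models, the output-side saving tracks the token reduction one to one. Real savings would depend on the task mix and on input token usage.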

At the "max" effort level, Opus 4.8 scored 1,890 points on GDPval-AA, 137 points above Opus 4.7 and 121 points ahead of GPT-5.5. That gap corresponds to a win rate of about 67 percent head-to-head against GPT-5.5.
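
The link between a 121-point gap and a roughly 67 percent win rate can be checked with the standard Elo expected-score formula. This sketch assumes the GDPval-AA points are Elo-style ratings on the usual 400-point logistic scale, which the article does not state explicitly. The function name `expected_win_rate` is illustrative.

```python
def expected_win_rate(rating_gap: float) -> float:
    """Expected score of the higher-rated side under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


vs_gpt_55 = expected_win_rate(121)   # gap to GPT-5.5 quoted above
vs_opus_47 = expected_win_rate(137)  # gap to Opus 4.7 quoted above
print(f"vs GPT-5.5: {vs_gpt_55:.2f}, vs Opus 4.7: {vs_opus_47:.2f}")
# prints "vs GPT-5.5: 0.67, vs Opus 4.7: 0.69"
```

Under that assumption, the same formula puts Opus 4.8's expected win rate against Opus 4.7 at about 69 percent.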


The Decoder: AI News (RSS)