Anthropic 发布 Claude Opus 4.8:被称作"小幅但实在的改进",在多数基准测试中超越 GPT-5.5
Anthropic 发布其最新模型 Claude Opus 4.8。该模型在大多数基准测试中超越了 GPT-5.5 和 Gemini 3.1 Pro。其代码错误自动捕获能力是前代产品的四倍。同步推出动态工作流功能,可启动数百个并行子智能体来处理跨代码库迁移等任务。
Anthropic ships Claude Opus 4.8 as a "modest but tangible improvement" that tops GPT-5.5 in most benchmarks
Key Points
Anthropic has released Claude Opus 4.8, a new AI language model that the company claims outperforms competitors like OpenAI's GPT-5.5 across most benchmarks, while also communicating its own uncertainties better.
Anthropic also introduces dynamic workflows that allow it to schedule tasks and launch hundreds of parallel subagents, along with a new control that lets users determine how much effort the AI should put into generating a response.
API pricing remains unchanged from its predecessor, Opus 4.7, at $5 per million input tokens and $25 per million output tokens.
Anthropic's latest flagship model, Claude Opus 4.8, leads most benchmarks and is designed to be more upfront about its own mistakes.
Anthropic says Opus 4.8 beats both its predecessor and OpenAI's GPT-5.5 and Google's Gemini 3.1 Pro across most tested categories. On agentic coding (SWE-Bench Pro), the model hits 69.2 percent, up from 64.3 percent for Opus 4.7 and 58.6 percent for GPT-5.5. For multidisciplinary reasoning (Humanity's Last Exam), Opus 4.8 scores 49.8 percent without tools and 57.9 percent with tools, the highest marks in the field.
Less fake progress, more honesty
Anthropic calls the model's improved honesty one of its most noticeable upgrades. AI models have a habit of jumping to conclusions and claiming progress that falls apart on closer look. It's a widespread problem.
"Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims," Anthropic says. The company backs that up with its own coding evaluations, where the model lets bugs slip through without comment about four times less often than Opus 4.7.
The model also sets new highs on prosocial traits like supporting user autonomy. Deception attempts and other unaligned behavior are said to be at Claude Mythos levels. Details are in the Claude Opus 4.8 System Card. The first Mythos-class models are expected to roll out to all customers in the coming weeks, once all safety measures are in place, the company says.