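Anthropic has released Claude Opus 4.8, which scores 61.4 on the Artificial Analysis Intelligence Index, 1.2 points ahead of GPT-5.5 (xhigh), retaking the top spot. The model improves on both real-world agentic tasks and frontier academic reasoning; on GDPval-AA, the primary agentic evaluation, it scores 1,890 Elo, an implied win rate of approximately 67%. In scientific reasoning, Claude leads OpenAI and Google on the Humanity's Last Exam benchmark for the first time. Its hallucination rate holds at 35.9%, substantially lower than competitors. The context window remains 1 million tokens, and pricing is $5 input and $25 output per million tokens.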
Claude Opus 4.8 takes the lead on the Artificial Analysis Intelligence Index at 61.4, with Anthropic retaking the #1 spot on GDPval-AA and advancing in terminal use and scientific reasoning
To reach the leading position on the Intelligence Index, @Anthropic made large improvements in both real-world agentic work and frontier academic reasoning tasks.
Key takeaways:
➤ Claude Opus 4.8 is the new leader on the Artificial Analysis Intelligence Index. Opus 4.8 scores 61.4, up +4.1 points from Opus 4.7 and +1.2 points ahead of GPT-5.5 (xhigh), the previous Index leader
➤ The new release is slightly more efficient than its predecessor on agentic tasks, but token efficiency varied by task type. Opus 4.8 used fewer turns and output tokens on GDPval-AA, but used approximately the same number of output tokens across the overall Intelligence Index while achieving significantly higher performance.
➤ Anthropic retakes the lead on GDPval-AA, our primary evaluation for agentic performance on knowledge work tasks. Opus 4.8 scored 1,890 Elo, reflecting an implied win rate of approximately 67% against GPT-5.5
➤ Claude is now among the top models for scientific reasoning. Previous releases have trailed peers on complex academic reasoning tasks, but with Opus 4.8, Claude sits slightly ahead of OpenAI and Google as the leader on Humanity's Last Exam. It also scores higher than Gemini 3.1 Pro on CritPt, a frontier physics benchmark, but remains behind GPT-5.4 and GPT-5.5
➤ Claude Opus 4.8 reaches #2 on AA-Omniscience, slightly ahead of Opus 4.7. Opus 4.8 scores 27.4 on the AA-Omniscience Index, behind only Gemini 3.1 Pro (32.9). Accuracy ticked up slightly to 46.6% and hallucination rate held roughly flat at 35.9%. Anthropic continues to demonstrate substantially lower hallucination rates than peer models from Google and OpenAI
➤ Compared with Opus 4.7, Opus 4.8 also makes material gains on Terminal-Bench Hard (+6.8 points), τ2-Bench Telecom (+5.9 points), and IFBench (+3.6 points), with relatively flat scores across AA-LCR, GPQA, and SciCode.
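
The "implied win rate" quoted for GDPval-AA above can be related to an Elo gap with a short sketch. This assumes the standard logistic Elo model on a 400-point scale; the post does not say which Elo variant GDPval-AA uses, so the function names and the scale here are illustrative assumptions, not the evaluation's actual method.

```python
import math


def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model.

    Assumption: 400-point scale, base-10 logistic (classic Elo).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_gap_for_win_rate(win_rate: float) -> float:
    """Rating gap (A minus B) that yields the given expected win rate for A."""
    return 400.0 * math.log10(win_rate / (1.0 - win_rate))


# Gap implied by the ~67% win rate quoted in the post (under the assumption above).
gap = elo_gap_for_win_rate(0.67)
print(f"Implied Elo gap: {gap:.0f} points")

# Round trip: a model at 1,890 against an opponent `gap` points lower.
print(f"Win probability: {elo_win_probability(1890, 1890 - gap):.2f}")
```

Under this assumption, a 67% implied win rate corresponds to a gap of roughly 123 Elo points. The opponent's actual rating is not given in the post, so the sketch only derives the gap.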