Arena 跳出了刷榜逻辑,用真实用户的多轮交互来评估 Agent,这比任何 toy benchmark 都更有说服力,选模型做 Agent 应用的可以把它当新指南。
Arena 推出基于真实用户任务的智能体排行榜,评估模型在代码编写、应用构建、文档分析等工作中的表现,而非孤立基准。排行榜基于30万+任务、200万+工具调用和4000万行代码,综合任务成功、纠正遵从性、错误恢复、用户表扬与抱怨、工具幻觉等信号。前三名:GPT-5.5 High(+10.7%)、Claude Opus 4.7 Thinking(+9.5%)、GPT-5.4 High(+8.9%)。
Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not isolated benchmark questions.
The system tracks agents using web search, files, and terminal tools while people ask them to write code, build apps, research topics, create documents, and analyze files.
The problem with almost all traditional AI benchmarks is that they test clean tasks, while agents now handle messy work like coding, research, documents, web browsing, files, and terminal commands.
Agent Arena tries to measure agents inside real work sessions, where users correct them, approve results, complain, download files, and expose tool failures as the task unfolds.
Its core idea is to treat each model choice like a test condition, then estimate how much that model improves task outcomes compared with a baseline.
The leaderboard combines 5 signals: confirmed task success, praise versus complaint, ability to follow corrections, recovery from terminal errors, and whether the agent invents tools that do not exist.
The data is large enough to show real behavior patterns, with 300K+ tasks, 2M+ tool calls, and 40M lines of code produced by agents.
The score combines task success, steerability, bash recovery, praise vs. complaint, and tool hallucination, which means the model is judged by whether it finishes, recovers, accepts correction, and avoids fake tool calls.
GPT-5.5 High leads with +10.7% net improvement, followed by Claude Opus 4.7 Thinking at +9.5% and GPT-5.4 High at +8.9%.
The most useful detail is that agents fail like workers under pressure: they can leave one part incomplete, claim the job is done, or sound confident while backing down after correction.