GPT-5.5 ranks first on the Agents' Last Exam benchmark; every agent scores a 0% success rate on the hardest tasks

Noam Brown (@polynoamial)

2026-06-12 01:35 · 3 days ago

AI summary

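OpenAI researcher Noam Brown says GPT-5.5 ranks first on the Agents' Last Exam (ALE) benchmark and also performs best when measured by model tokens, cost, or wall-clock time. ALE, created by @dawnsongtweets's team, is a rolling benchmark of more than 1,500 expert tasks spanning 55 occupations; it tests whether AI agents can carry out real, economically valuable work. The systems evaluated include frontier models such as GPT-5.5, Fable 5, and Composer 2.5. The results show that current agents can solve some professional tasks, but at the hardest tier, which requires sustained reasoning and deep domain expertise, every frontier agent tested (including Fable 5) has a 0% success rate.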

I'm happy GPT-5.5 tops this eval

I'm even happier it's still doing the best when measured vs tokens, cost, or wall-clock time!

Dawn Song: Everyone says the latest AI agents will be "job-ready" soon, especially after the release of Fable 5 this week. But is that really the case? Over the past many ...

Tags: OpenAI · industry leaders' views · evaluations/benchmarks
