Google Gemini 3.1 Pro Leads the METR Time Horizon Benchmark

Chubby♨️@kimmonismus


2026-04-16 19:59 · 60 days ago

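AI summary

Google's Gemini 3.1 Pro tops METR's time horizon benchmark at the 80% success rate, handling software tasks that take an average of 1.5 hours (95% confidence interval: 52 minutes to 2 hours 39 minutes), ahead of GPT-5.2 and other models. This marks a jump in autonomous task-handling capability from GPT-4's near-zero score in 2023 to hour-scale tasks, and it breaks the long-running dominance of OpenAI and Anthropic on this benchmark. If the doubling trend in task length continues, multi-hour agentic work will move from demo scenarios to real workflows.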

New METR time horizon leader: Gemini 3.1 Pro.

On METR's time horizon benchmark (80% success rate (!)), Google's Gemini 3.1 Pro now handles software tasks that take humans 1 hour 30 minutes on average. The 95% CI ranges from 52 minutes to 2 hours 39 minutes. Average score: 77%.

That puts it ahead of GPT-5.2 (high), GPT-5.1-Codex-Max, GPT-5, and the rest of the field.

Two things worth noting:

The curve keeps bending. GPT-4 sat near zero on this benchmark in 2023. Three years later we are talking about tasks measured in hours, not seconds.

Google has quietly taken the top spot on a benchmark that has been dominated by OpenAI and Anthropic for most of its existence.

The doubling time on autonomous task length is the number to watch. If it holds, multi-hour agentic work stops being a demo and starts being a workflow.
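To make the doubling-time point concrete, here is a rough sketch of the arithmetic. It starts from the 1 hour 30 minute horizon quoted above and extrapolates under a constant doubling time. The post does not give a doubling time, so `ASSUMED_DOUBLING_MONTHS` is a purely hypothetical placeholder, and the function names are made up for illustration.

```python
import math

# From the post: tasks that take humans 1 hour 30 minutes on average.
CURRENT_HORIZON_MINUTES = 90.0

# ASSUMPTION: hypothetical placeholder, not a figure from the post or from METR.
ASSUMED_DOUBLING_MONTHS = 6.0


def projected_horizon_minutes(current_minutes: float,
                              months_ahead: float,
                              doubling_months: float) -> float:
    """Extrapolate a time horizon that doubles every `doubling_months` months."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_until(current_minutes: float,
                 target_minutes: float,
                 doubling_months: float) -> float:
    """Months needed to reach `target_minutes` under the same constant doubling."""
    if current_minutes <= 0 or target_minutes <= 0:
        raise ValueError("horizons must be positive")
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    one_year = projected_horizon_minutes(
        CURRENT_HORIZON_MINUTES, 12, ASSUMED_DOUBLING_MONTHS)
    to_workday = months_until(
        CURRENT_HORIZON_MINUTES, 8 * 60, ASSUMED_DOUBLING_MONTHS)
    print(f"Horizon in 12 months: {one_year / 60:.1f} hours")
    print(f"Months until an 8-hour horizon: {to_workday:.1f}")
```

Swapping in a different doubling time changes the outcome sharply, which is why that single number matters so much. The extrapolation also ignores the wide confidence interval on the starting point.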

Makes me even more excited for Google I/O in May.

Agents · Google · OpenAI · Coding
