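Google's Gemini 3.1 Pro tops METR's time horizon benchmark at the 80% success rate, handling software tasks that take 1.5 hours on average (confidence interval: 52 minutes to 2 hours 39 minutes) and surpassing GPT-5.2 and other models. This marks a jump in autonomous AI task capability from GPT-4's near-zero score in 2023 to the hour scale, and ends the long dominance of OpenAI and Anthropic on this benchmark. If the task-length doubling trend continues, multi-hour agentic work will move from demo scenarios to real workflows.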
New METR time horizon leader: Gemini 3.1 Pro.
On METR's time horizon benchmark, at the 80% success rate (!), Google's Gemini 3.1 Pro now handles software tasks that take humans 1 hour 30 minutes on average. The 95% CI runs from 52 minutes to 2 hours 39 minutes. Average score: 77%.
That puts it ahead of GPT-5.2 (high), GPT-5.1-Codex-Max, GPT-5, and the rest of the field.
Two things worth noting:
The curve keeps bending. GPT-4 sat near zero on this benchmark in 2023. Three years later we are talking about tasks measured in hours, not seconds.
Google has quietly taken the top spot on a benchmark that has been dominated by OpenAI and Anthropic for most of its existence.
The doubling time on autonomous task length is the number to watch. If it holds, multi-hour agentic work stops being a demo and starts being a workflow.
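To make the doubling point concrete, here is a rough back-of-the-envelope sketch. It starts from the 1 hour 30 minute horizon quoted above and projects it forward under a constant doubling period. The doubling period itself (`ASSUMED_DOUBLING_MONTHS`) is a placeholder I picked for illustration, not a METR figure, and the function names are made up for this example.

```python
import math

# 1 h 30 min, the time horizon quoted in the post
CURRENT_HORIZON_MINUTES = 90.0

# ASSUMPTION: placeholder doubling period for illustration only, not a METR number
ASSUMED_DOUBLING_MONTHS = 6.0


def projected_horizon_minutes(current_minutes: float,
                              months_ahead: float,
                              doubling_months: float) -> float:
    """Project a time horizon forward, assuming it doubles every `doubling_months`."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


def months_until(target_minutes: float,
                 current_minutes: float,
                 doubling_months: float) -> float:
    """Months needed to reach `target_minutes` under the same constant-doubling assumption."""
    return doubling_months * math.log2(target_minutes / current_minutes)


if __name__ == "__main__":
    for months in (6, 12, 24):
        minutes = projected_horizon_minutes(
            CURRENT_HORIZON_MINUTES, months, ASSUMED_DOUBLING_MONTHS
        )
        print(f"+{months} months: about {minutes / 60:.1f} hours")

    # How long until a full 8-hour workday of autonomous work, under the assumption?
    wait = months_until(8 * 60, CURRENT_HORIZON_MINUTES, ASSUMED_DOUBLING_MONTHS)
    print(f"8-hour horizon after about {wait:.1f} months")
```

Swap in whatever doubling period you believe; the point is only that under exponential growth the gap between "1.5 hours" and "a full workday" is a small number of doublings.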
Makes me even more excited for Google I/O in May.