Cognition launches FrontierCode, a coding evaluation benchmark focused on code maintainability

swyx@swyx

2026-06-09 04:27

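AI Summary

Cognition has released FrontierCode, a coding eval in which each task took 40+ hours of work by leading open-source maintainers. METR found that more than half of SWEBench results are unmergeable slop. FrontierCode includes 3000+ rubrics and is the first to measure whether code is mergeable. On the hardest tier, FC Diamond, Opus 4.8 scores only 13.8%. On the easiest tasks in FC Extended, Opus rose from 41% to 74% within 4 months in late 2025, marking AI coding's entry into the "maintainable code" era.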

It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.

Cog had IOI Gold medalists and top code maintainers Look At The Data: FrontierCode includes 3000+ rubrics covering code quality and anti-cheat checks for the reward hacking plaguing other benchmarks.

FC Diamond is so hard that Opus 4.8 scores 13.8%.

Three eras of AI coding: three eras of benchmarks

2021 • Autocomplete: HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode

To me the most beautiful chart: when I requested a special historical run across all extant old models, the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025. Opus almost doubled from a 41% pass rate to 74% in 4 months.

This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, e.g. @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents", without fearing too much that things go off the rails.
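A rough way to sanity-check the reroll arithmetic, assuming each attempt is an independent trial that succeeds with the quoted pass rate (an assumption of this sketch, not something the post states; real agent retries are likely correlated). The function names are hypothetical, chosen only for illustration.

```python
def success_within(pass_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - pass_rate) ** attempts


def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest number of independent tries whose combined success reaches `target`."""
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    attempts = 1
    while success_within(pass_rate, attempts) < target:
        attempts += 1
    return attempts


if __name__ == "__main__":
    # Pass rates quoted in the post for the easiest third of FC tasks.
    for rate in (0.41, 0.74):
        n = attempts_needed(rate)
        print(f"pass rate {rate:.0%}: {n} attempts for >= 95% "
              f"(2 attempts give {success_within(rate, 2):.1%})")
```

Under that independence assumption, a 41% pass rate needs 6 attempts to clear 95%, while a 74% pass rate reaches about 93% in 2 attempts and clears 95% on the third, which is roughly the "2 rerolls vs 6" gap described above.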

My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027...

The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith， the next model forest grows. Circle of life.

Cognition: Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models ...

Agentic coding evals / benchmarks


swyx@swyx · X