Cognition launches FrontierCode coding benchmark: evaluating whether AI code is mergeable

Rohan Paul@rohanpaul_ai

2026-06-09 20:03 · 6 days ago

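AI summary

Cognition has released FrontierCode, a coding benchmark that evaluates whether AI-generated code reaches a quality a maintainer would merge, rather than merely passing tests. The benchmark contains 150 tasks (Main is the hardest 100, Diamond the hardest 50), designed by more than 20 open-source maintainers, with each task taking over 40 hours. Grading uses blockers (such as broken behavior or missing logic) and weighted items (readability, type safety, and so on). It also includes reverse-classical testing, scope checks, and adaptive grading. On the Diamond subset, Claude Opus 4.8 scores 13.4%, GPT-5.5 6.3%, Gemini 3.1 Pro 4.7%, and the best open-source model, Kimi K2.6, 3.8%, showing that top models still perform poorly on mergeable code.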

Incredible! This is just the benchmark we needed.

Claude Opus 4.8 achieves a score of only 13.4%. Other models score even lower: GPT-5.5 receives 6.3%, Gemini 3.1 Pro 4.7%, and others even less.

Cognition is introducing FrontierCode, a coding benchmark built to test whether AI code is good enough for a real maintainer to merge, not just whether it passes tests.

FrontierCode asks a harder question: did the model produce a clean, limited, well-tested, readable patch that fits the project's existing style and would survive serious code review?

FrontierCode comes as 3 nested subsets of increasing difficulty: the full benchmark contains 150 tasks, Main is the hardest 100, and Diamond is the hardest 50.

More than 20 open-source maintainers helped design the tasks, and each task took over 40 hours to build, review, attack, and calibrate.

The biggest finding is that top models still struggle badly when the target is mergeable code instead of merely working code.

On Diamond, the best model, Claude Opus 4.8, scores only 13.4%, while GPT-5.5 scores 6.3%, Gemini 3.1 Pro scores 4.7%, and the best open-source model listed, Kimi K2.6, scores 3.8%.

This shows that today's strongest coding agents can often patch behavior, but they still fail many human-review standards around design, restraint, test quality, and project conventions.

The mechanism is a grading system built around blockers and non-blockers.

A blocker is something that would stop a maintainer from merging the PR, such as broken behavior, missing required behavior, unsafe scope changes, bad performance, or tests that do not prove the fix.

A solution that fails any blocker gets 0, even if parts of the code look good.

A passing solution then gets a weighted score based on softer quality items such as readability, type safety, style, and fit with the existing codebase.
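To make the blocker / non-blocker split concrete, here is a minimal sketch of how such a score could be computed. The post does not give the actual formula, so the names (`RubricItem`, `score_solution`), the weights, and the normalisation to a 0 to 1 range are all assumptions for illustration, not Cognition's implementation.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One graded item. Hypothetical structure, not from the benchmark."""
    name: str
    passed: bool
    weight: float = 1.0  # only meaningful for non-blocker items


def score_solution(blockers: list[RubricItem], non_blockers: list[RubricItem]) -> float:
    """Return 0.0 if any blocker fails, otherwise the weighted pass rate."""
    # Gate: a single failed blocker zeroes the whole solution.
    if any(not item.passed for item in blockers):
        return 0.0
    total = sum(item.weight for item in non_blockers)
    if total == 0:
        return 1.0  # assumption: no soft items means full credit
    earned = sum(item.weight for item in non_blockers if item.passed)
    return earned / total


# Illustrative rubric with made-up weights.
soft_items = [
    RubricItem("readability", passed=True, weight=2.0),
    RubricItem("type safety", passed=False, weight=1.0),
    RubricItem("style", passed=True, weight=1.0),
]
clean = [RubricItem("behavior correct", passed=True)]
broken = [RubricItem("behavior correct", passed=False)]

print(score_solution(clean, soft_items))   # 0.75
print(score_solution(broken, soft_items))  # 0.0
```

The gate comes first on purpose: under this scheme, polish on the soft items cannot buy back a patch that a maintainer would refuse to merge.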

FrontierCode also adds checks beyond normal unit tests.

Reverse-classical testing runs the model's own tests against the original broken code, and those tests must fail, which proves the model wrote tests that actually catch the bug.
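The idea can be sketched in a few lines. This is only an illustration under stated assumptions: tests are modelled as plain Python callables that raise on failure, the helper names (`passes`, `reverse_classical_check`) are hypothetical, and the rule "at least one test must fail on the original" is one possible reading of the post (a stricter harness could require every new test to fail).

```python
from typing import Callable

Impl = Callable[..., object]
Test = Callable[[Impl], None]  # a test raises (e.g. AssertionError) on failure


def passes(test: Test, impl: Impl) -> bool:
    """Run one test against one implementation; any exception counts as a failure."""
    try:
        test(impl)
    except Exception:
        return False
    return True


def reverse_classical_check(tests: list[Test], original: Impl, patched: Impl) -> bool:
    """Tests must pass on the patched code and catch the bug in the original code."""
    if not tests:
        return False  # no tests means nothing proves the fix
    all_pass_on_patch = all(passes(t, patched) for t in tests)
    some_fail_on_original = any(not passes(t, original) for t in tests)
    return all_pass_on_patch and some_fail_on_original


# Toy example: the "original" absolute-value function mishandles negatives.
def original_abs(x):
    return x  # bug: negative inputs are returned unchanged


def check_negative(impl):
    assert impl(-3) == 3


def check_positive(impl):
    assert impl(2) == 2


strong = reverse_classical_check([check_negative, check_positive], original_abs, abs)
weak = reverse_classical_check([check_positive], original_abs, abs)
print(strong, weak)  # True False
```

The `weak` case is the one the check exists for: its only test passes on both the broken and the fixed code, so it demonstrates nothing about the fix.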

Scope checks punish patches that touch unrelated files, add oversized diffs, or refactor things the task did not ask for.
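A rough sketch of the first two scope checks (unrelated files and oversized diffs) might look like the following. Everything here is hypothetical: the `FileChange` structure, the allowed-path list, and the line limit are illustrative assumptions, and the post does not say how FrontierCode actually measures scope. Detecting unrequested refactors would need more than line counts and is not attempted here.

```python
from dataclasses import dataclass


@dataclass
class FileChange:
    """Summary of one file in a patch. Hypothetical structure."""
    path: str
    added: int
    deleted: int


def scope_violations(
    changes: list[FileChange],
    allowed_prefixes: list[str],
    max_changed_lines: int,
) -> list[str]:
    """Return human-readable scope problems; an empty list means the patch is in scope."""
    violations = []
    prefixes = tuple(allowed_prefixes)
    for change in changes:
        # Flag any file outside the paths the task is expected to touch.
        if not change.path.startswith(prefixes):
            violations.append(f"unrelated file touched: {change.path}")
    total = sum(c.added + c.deleted for c in changes)
    if total > max_changed_lines:
        violations.append(
            f"diff too large: {total} changed lines (limit {max_changed_lines})"
        )
    return violations


# Illustrative patch with made-up paths and limits.
patch = [
    FileChange("src/parser/tokens.py", added=12, deleted=3),
    FileChange("tests/test_tokens.py", added=20, deleted=0),
    FileChange("docs/changelog.md", added=400, deleted=0),
]
problems = scope_violations(patch, ["src/parser/", "tests/"], max_changed_lines=200)
for problem in problems:
    print(problem)
```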

Adaptive grading uses an LLM to adjust test scaffolding around valid implementation differences, so a good solution is not rejected just because it used a different function name or error wording.

Cognition: Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40+ hrs of work by leading open-source maintainers. Models ...

Coding evals / benchmarks

