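Mythos officially launches the FrontierCode benchmark, which aims to measure how well AI can produce maintainable code. The benchmark contains over 1,000 hours of maintainer-verified tasks and introduces 3,000+ grading criteria to guard against reward hacking. On the hardest tier, FC Diamond, Opus 4.8 scores only 13.8%, and neither Opus 4.8 nor GPT 5.5 improves as effort is scaled up. Mythos/Fable post-training applies test time compute to long-running tasks on the scale of hours. The benchmark is live on Devin at only 1.4x the ACU cost. The easiest third of the FC Extended tasks was cracked quickly in late 2025, with Opus rising from 41% to 74%, marking a new era of AI coding focused on "maintaining readable code."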
Mythos is live! So excited to have our FrontierCode recognized as the next frontier coding bench.
On FC Diamond, NEITHER Opus 4.8 nor GPT 5.5 meaningfully scales with effort, which many of you caught yesterday.
Mythos/Fable post-training has really applied that test time compute toward solving very, very long-running problems: dozens of human-hour equivalents and hundreds of dollars per task, measured for the first time ever.
Available now in @Cognition @Devin for only 1.4x ACUs too! (I never thought I'd see this launch lol)