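A new study proposes the Meta-Agent Challenge (MAC) benchmark, which tests whether AI agents can autonomously build better agents without human design help. Inside a safe workspace, the agent must invent its own strategy, write code, test it, and learn from failures. The experiments cover 5 domains: math, science question answering, competitive programming, code repair, and long terminal tasks. The results show that current agents mostly fail to beat strong human-designed agent systems, and only a few closed frontier models such as Claude perform well. The study argues that current agents are more like powerful executors than engineers with reliable self-improvement ability.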
This paper tests whether today's AI agents can build better AI agents without human design help, i.e. whether an AI can act more like an AI engineer.
That means it must invent a strategy, write the agent code, test it, learn from failures, and improve the system without a human guiding every choice.
The results show that today's agents are still weak at reliably building the systems that do tasks.
Their benchmark, called Meta-Agent Challenge, gives an AI coding agent a safe workspace, a scoring API, limited time, and limited model calls, then asks it to create another agent that performs well on hidden test tasks.
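To make that setup concrete, here is a minimal sketch of a budget-limited development loop. It is not the benchmark's actual harness or API: `Budget`, `develop_agent`, the candidate design names, and the stub scores are all hypothetical, chosen only to show how a call limit and a time limit constrain how many designs a meta-agent can try.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Budget:
    """Hypothetical budget tracker: a cap on model calls and on wall-clock time."""
    max_calls: int
    max_seconds: float
    calls_used: int = 0
    started_at: float = 0.0

    def start(self) -> None:
        self.started_at = time.monotonic()

    def can_afford(self, calls: int) -> bool:
        # True only if both the call cap and the time cap still allow the spend.
        within_calls = self.calls_used + calls <= self.max_calls
        within_time = time.monotonic() - self.started_at < self.max_seconds
        return within_calls and within_time

    def charge(self, calls: int) -> None:
        if not self.can_afford(calls):
            raise RuntimeError("budget exceeded")
        self.calls_used += calls


def develop_agent(
    candidates: List[str],
    score: Callable[[str], float],
    budget: Budget,
    cost_per_eval: int,
) -> Tuple[Optional[str], float]:
    """Score candidate designs in order until the budget runs out; keep the best."""
    budget.start()
    best_name: Optional[str] = None
    best_score = float("-inf")
    for name in candidates:
        if not budget.can_afford(cost_per_eval):
            break  # stop early rather than overspend
        budget.charge(cost_per_eval)
        result = score(name)  # stand-in for a call to a scoring API
        if result > best_score:
            best_name, best_score = name, result
    return best_name, best_score


# Stub scores standing in for a real scoring service (made-up numbers).
STUB_SCORES = {
    "single_prompt": 0.40,
    "self_consistency": 0.55,
    "plan_and_verify": 0.62,
    "tool_loop": 0.70,
}

demo_budget = Budget(max_calls=30, max_seconds=60.0)
demo_best = develop_agent(list(STUB_SCORES), STUB_SCORES.__getitem__, demo_budget, 10)
print(demo_best, demo_budget.calls_used)
```

In this toy run the budget covers only three evaluations, so the fourth design is never scored even though it would have been the best. That is the kind of tradeoff a limited budget forces: every test of one design is a test not spent on another.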
They tested this across 5 areas: math, science questions, competitive programming, software bug fixing, and long terminal tasks.
The main result is that current agents usually do not beat strong human-made agent setups, and the few good results mostly come from closed frontier models like Claude.
Complete autonomy is not just tool use.
It is budget awareness, failure recovery, restraint under pressure, and the discipline to change designs instead of polishing a bad one.
Overall, Meta-Agent Challenge (MAC) suggests that today's agents are not yet self-improving engineers.
They are powerful executors with flashes of design judgment, still missing the boring reliability that makes engineering real.
----
Link - arxiv.org/abs/2606.04455
Title: "The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?"