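The paper "Harness Updating Is Not Harness Benefit" challenges a common intuition: that the strongest model belongs in the evolver seat because it will write better updates. Its experiments indicate that an inexpensive model, Qwen3.5-9B, can write prompt, memory, and skill updates that work about as well as those from Claude Opus 4.6. The expensive model is better spent as the agent that solves the task: weak models fail to load or follow the updates correctly, while strong models are already near their capability ceiling and gain little. The sweet spot is the mid-tier model, which can invoke the new procedures and still has room to learn.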
Better self-improving agents need better solvers, not bigger update-writing models.
The usual intuition is to put the strongest model in the evolver seat, on the assumption that a better model will write better prompts, memories, tools, and skills.
This paper cuts that intuition in half.
It separates two jobs that are usually blurred together: writing useful harness updates, and benefiting from those updates during task execution.
According to the paper, a cheaper model can often write good-enough prompt, memory, or skill updates: a small Qwen3.5-9B evolver produces updates that help about as much as those written by Claude Opus 4.6.
The expensive model is more useful as the agent that actually solves the task with those updates.
That is because using the updates is highly model-dependent: weak models often fail to load the right skill, or load it and then stop following it partway through a long task.
Strong models can use the harness, but they may already be close enough to their ceiling that the update has less room to help.
The sweet spot is the mid-tier model: capable enough to invoke and follow the new procedure, but not so capable that the harness has nothing left to teach.
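One way to picture the separation of the two jobs is a grid of solver models against evolver models, where each cell holds the gain over a no-update baseline. The sketch below is only an illustration of that idea, not the paper's evaluation code: `benefit_grid`, `axis_spreads`, `toy_score`, the tier names, and every number in `toy_score` are hypothetical placeholders chosen to mirror the qualitative claim, not reported results.

```python
from statistics import mean
from typing import Callable, Dict, Optional, Sequence, Tuple

# score_fn(solver, evolver) -> task success rate in [0, 1].
# evolver=None means the solver runs without any harness update.
ScoreFn = Callable[[str, Optional[str]], float]
Grid = Dict[Tuple[str, str], float]


def benefit_grid(solvers: Sequence[str], evolvers: Sequence[str],
                 score_fn: ScoreFn) -> Grid:
    """Gain over the no-update baseline for every (solver, evolver) pair."""
    grid: Grid = {}
    for solver in solvers:
        baseline = score_fn(solver, None)
        for evolver in evolvers:
            grid[(solver, evolver)] = score_fn(solver, evolver) - baseline
    return grid


def axis_spreads(grid: Grid) -> Tuple[float, float]:
    """Return (evolver_spread, solver_spread).

    evolver_spread: mean over solvers of (max - min) gain across evolvers,
                    i.e. how much the choice of update writer matters.
    solver_spread:  mean over evolvers of (max - min) gain across solvers,
                    i.e. how much the choice of update user matters.
    """
    solvers = sorted({s for s, _ in grid})
    evolvers = sorted({e for _, e in grid})
    evolver_spread = mean([
        max(grid[(s, e)] for e in evolvers) - min(grid[(s, e)] for e in evolvers)
        for s in solvers
    ])
    solver_spread = mean([
        max(grid[(s, e)] for s in solvers) - min(grid[(s, e)] for s in solvers)
        for e in evolvers
    ])
    return evolver_spread, solver_spread


def toy_score(solver: str, evolver: Optional[str]) -> float:
    """Placeholder scores (NOT from the paper), for illustration only."""
    base = {"weak": 0.20, "mid": 0.50, "strong": 0.85}
    gain = {"weak": 0.01, "mid": 0.15, "strong": 0.03}   # depends on the solver
    bonus = {"small-evolver": 0.00, "large-evolver": 0.01}  # barely depends on the evolver
    if evolver is None:
        return base[solver]
    return base[solver] + gain[solver] + bonus[evolver]


grid = benefit_grid(["weak", "mid", "strong"],
                    ["small-evolver", "large-evolver"], toy_score)
evolver_spread, solver_spread = axis_spreads(grid)
```

With these placeholder numbers the spread along the solver axis is much larger than the spread along the evolver axis, which is the shape of result the post describes; swapping in a real `score_fn` that runs a benchmark would be needed to say anything about actual models.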
----
Link - arxiv.org/abs/2605.30621
Title: "Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents"