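A new paper builds the CL-BENCH benchmark to evaluate the continual learning ability of AI agents across 6 domains: coding, databases, forecasting, radio signals, poker, and disease research. Each task hides a pattern that can be learned over time, testing whether the agent can go beyond its pretraining knowledge. Frontier LLM systems were tested with full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups, and the results show that current memory-heavy AI agents are not reliably better than simply keeping the full conversation context. Claude Sonnet 4.6 with plain context achieved the best overall score. The paper notes that agents still need better ways to remember useful experience, forget stale information, and adapt to changes in the environment.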
This paper proposes a new test to see whether AI agents truly get better as they gain experience and finds they mostly still confuse memory with learning.
It also shows that simple full-context learning beats the more specialized memory systems, with Claude Sonnet 4.6 using plain context getting the best overall score.
That distinction matters because the next wave of AI is not supposed to answer isolated prompts.
It is supposed to live inside codebases, databases, markets, sensors, clinics, and workflows where yesterday's mistake should make tomorrow's action sharper.
The authors build CL-BENCH, a benchmark where an agent works through connected tasks in 6 domains: coding, databases, forecasting, radio signals, poker, and disease studies.
Each task hides a pattern the agent can learn over time, like a database layout, a codebase structure, or an opponent's strategy, so better performance should come from experience rather than pretraining.
They test frontier LLM systems with simple full-context memory, scratchpad notes, retrieval memory, playbook-style memory, and coding-agent setups.
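To make the contrast between two of these setups concrete, here is a minimal Python sketch of full-context memory next to retrieval memory. It is an illustration only, not the paper's harness: the class names, the `remember`/`recall` methods, and the word-overlap scoring are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class FullContextMemory:
    """Hypothetical: keeps every past note and replays all of them."""
    history: list[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.history.append(note)

    def recall(self, query: str) -> list[str]:
        # The query is ignored: the agent sees everything it has stored.
        return list(self.history)


@dataclass
class RetrievalMemory:
    """Hypothetical: stores every note but returns only the top-k matches."""
    k: int = 2
    history: list[str] = field(default_factory=list)

    def remember(self, note: str) -> None:
        self.history.append(note)

    def recall(self, query: str) -> list[str]:
        # Toy relevance score: number of lowercase words shared with the query.
        query_words = set(query.lower().split())
        scored = [
            (len(query_words & set(note.lower().split())), index, note)
            for index, note in enumerate(self.history)
        ]
        # Drop notes with no overlap, then rank by score (ties keep insertion order).
        hits = [item for item in scored if item[0] > 0]
        hits.sort(key=lambda item: (-item[0], item[1]))
        return [note for _, _, note in hits[: self.k]]
```

In this toy version, the retrieval variant can silently drop a lesson whose wording does not overlap with the current query, while the full-context variant never loses anything but grows without bound. That is one plausible way to read the tradeoff between the setups, though the benchmark's actual implementations may differ.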
The key finding is that current memory-heavy AI agents are not reliably better learners than just keeping the full conversation in context.
That means long-running AI agents still need better ways to remember useful lessons, forget stale ones, and adapt when the environment changes.
----
Link - arxiv.org/abs/2606.05661
Title: "Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments"