meng shao@shao__meng · 5天前64AGENTS.md 在 Coding Agents 中真的有用吗?
这篇论文,大规模实证研究仓库级上下文文件(AGENTS.md、CLAUDE.md 等)对编码 Agent 实际效果的影响,可能有些反直觉!感谢 @rasbt 分享!
论文在这:https://arxiv.org/abs/2602.11988
研究背景:实践先行,证据滞后
AGENTS.md 已成为行业惯例,GitHub 上已有 6 万+ 仓库采用,Claude Code (CLAUDE.md)、Codex、Qwen Code 等 Agent 都内置 /init 自动生成。但此前研究多停留在内容分类与描述性统计,缺少对任务完成率的严格评估。
核心难点在于:主流基准 SWE-bench 来自 Django、Flask 等知名仓库,这些项目本来就没有开发者手写的 context file,无法直接评估该实践的真实价值。
实验设计:双基准、三条件、四 Agent
· 基准:SWE-bench Lite(300 任务,11 个热门 Python 仓库)+ 新建 AGENTBENCH(138 任务,12 个已含开发者 context file 的冷门仓库)
· 三种条件:① 无 context file ② LLM 生成(各 Agent 官方 /init 流程)③ 开发者手写(仅 AGENTBENCH)
· Agent/模型:Claude Code + Sonnet 4.5、Codex + GPT-5.2 / GPT-5.1 mini、Qwen Code + Qwen3-30B
· 指标:任务成功率、步数、推理成本、工具调用轨迹
核心发现:效果微弱,成本显著
1. 成功率:边际效应,甚至为负
· LLM 生成:8 组设置中 5 组下降,平均 -0.5%(SWE-bench)/ -2%(AGENTBENCH)
· 开发者手写:平均 +4%,优于 LLM 生成,但 Claude Code 上甚至不如无文件
· 跨模型、跨 prompt 结论稳健
一句话:自动生成 context file 不仅无益,还可能略有害;手写的提升也很有限。
2. 效率:无文件反而最便宜(步数,成本)
· LLM 生成:+2.45 / +3.92 步,+20% / +23%
· 开发者手写:+3.34 步,最高 +19%
3. 代码库概览几乎无效
Context file 常被推荐用于「帮助 Agent 快速定位代码」。实测显示:有无 context file,Agent 首次接触相关文件所需的步数并无显著差异。95–100% 的 LLM 生成文件都包含代码库概览,但对导航帮助甚微。
轨迹分析:Agent 听话,但听话很贵
论文排除了「Agent 忽略 context file」这一假设。轨迹分析表明:
· 指令遵从度高:context file 提到 uv,使用率从 <0.01 次/任务升至 1.6 次;提到仓库专用工具,从 <0.05 升至 2.5 次
· 行为更「认真」:更多测试、更多文件搜索/阅读、更多 lint/质量检查
· 推理更深:GPT-5.2 推理 token 增加 14–22%
机制链条:
Context file 写入额外要求
→ Agent 更严格遵从(测试、探索、专用工具)
→ 步数与成本上升
→ 成功率未同步提升(甚至更差)
Context file 不是被忽略,而是被过度执行——把「建议性流程」当成了「必做清单」,增加了任务复杂度,却没有换来更高成功率。
一个关键反转:文档冗余假说
当移除仓库中所有其他文档(.md、docs/、示例代码)后,LLM 生成的 context file 反而带来 +2.7% 提升,且优于开发者手写的。
这说明:
· 在文档齐全的仓库里,context file 与 README、docs 高度冗余
· 开发者口述的「加了 AGENTS.md 后 Agent 变强了」,很可能是因为目标仓库本身文档稀缺,context file 填补了信息真空
· 对 Django 这类文档完善的知名项目,额外 context 的价值被稀释
消融实验:生成质量的上限
· 更强模型生成 ≠ 更好 context:GPT-5.2 生成的文件在 SWE-bench 上略好(+2%),在 AGENTBENCH 上反而更差(-3%)
· 不同 prompt 无一致优势:Codex prompt vs Claude prompt 效果因数据集而异,差异很小
自动生成 context file 的改进空间,目前看来很有限。
实践建议
· 依赖 /init 自动生成:谨慎——平均略降成功率,成本 +20%+
· 长篇架构概览、目录枚举:避免——与代码探索冗余,不加速定位
· 测试/lint/构建命令:精简写入——Agent 会严格执行,但过多要求推高成本
· 仓库专用工具(uv、pdm 等):值得写——指令遵从度高,且代码中不易推断
· 分层/按需引用:方向正确——「做 X 时读 Y.md,否则忽略」减少无关负担
译论文大规模实证检验 AGENTS.md 等仓库级上下文文件对编码 Agent 的影响。在 SWE-bench Lite(300 任务)和新建 AGENTBENCH(138 任务)上测试 Claude Code、Codex、Qwen Code 等组合。核心发现:LLM 自动生成的 context file 在 8 组设置中 5 组成功率下降,平均 -0.5%(SWE-bench)/-2%(AGENTBENCH),成本增加 +20%+;开发者手写仅平均 +4%。冗余假说:移除其他文档后,自动生成反而 +2.7%。建议避免自动生成,精简测试/lint 命令,优先写入仓库专用工具。
AYi@AYi_AInotes · 5天前62Google的研究找到了一种把 AI记忆大幅压缩的技术,让本地跑大模型 + 自己数据变得更容易了。
也就是说可以把 1000 万个文档 的向量存储,从 31GB 内存 压缩到只剩 4GB,而且搜索速度还比现在最常用的 FAISS 更快。
译Google提出一种AI记忆压缩技术,可将1000万个文档的向量存储从31GB内存压缩至仅4GB,且搜索速度超过目前最常用的FAISS方法。该技术使本地运行大语言模型并结合个人数据变得更加可行。
Rohan Paul@rohanpaul_ai · 5天前49A Primer paper about how reasoning models improve after training
Shows that better reasoning models depend less on raw data size and more on checkable training evidence.
reasoning data is NOT simple question-and-answer pairs. The useful part is often the feedback that says why an answer, step, tool action, or full attempt was good or bad.
A prompt and a response tell you what a model said, but not why that answer became learnable, which judge blessed it, which failures were hidden, or whether the skill was already inside the base model.
The core idea is to describe each training example as a record that includes the task, the model’s behavior, the checking signal, and metadata about where it came from.
The authors sort reasoning data by how it can be checked, such as exact rule-based checks for math and code, environment checks for agents using tools, and human or model judgments when no exact checker exists.
They also explain why common assumptions fail, because long reasoning traces may be fake, harder examples may be useless for some models, and larger datasets may still miss important coverage.
The key point is that agent data should preserve mess: failed actions, retries, recoveries, state differences, and terminal checks, because that is where learning signal often lives.
----
Link – arxiv. org/abs/2606.02113
Title: "A Primer in Post-Training Reasoning Data: What They Know About How It Works"
译论文指出,更好的推理模型更依赖可验证的训练证据,而非原始数据规模。推理数据的关键不是简单问答对,而是提供答案、步骤、工具操作或完整尝试好坏判断的反馈信号。每个训练样本应描述为包含任务、模型行为、检查信号和元数据的记录。研究者按检查方式分类:数学和代码用精确规则、智能体工具用环境检查,无精确检查器时用人类或模型判断。常见误区包括:长推理链可能虚假、更难样例对部分模型无效、更大数据集仍可能缺失关键覆盖。智能体数据应保留失败动作、重试、恢复、状态差异和终端检查等“混乱”信息,因为学习信号常在其中。
Rohan Paul@rohanpaul_ai · 6天前62Great idea for self-evolving AI scientists from this new MIT paper.
Tries to make an AI scientist notice when its current way of thinking is too small, then add new scientific concepts instead of merely searching harder.
The problem is that most AI science systems still search inside a fixed setup, even when real science sometimes needs new kinds of variables, tools, tests, or claims.
The paper’s core idea is to make every data point, model, tool output, failure, and claim a typed artifact, where typed means the system records what kind of thing it is and how it was produced.
Then the system can tell the difference between retrieval, which adds known things, search, which explores a fixed setup, and discovery, which changes the setup itself.
So novelty AI scientists is not defined by surprise, fluency, or benchmark gain, but by what could not be expressed inside the previous schema.
A serious attempt to formalize something most AI systems still fake: the difference between finding an answer inside a language and earning the right to change the language.
----
arxiv. org/abs/2606.01444
Title: "Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic AI"
译MIT论文(F.Y. Wang & M.J. Buehler, arXiv:2606.01444, 2026)提出Self-Revising Discovery Systems框架,使AI科学家能自主识别当前思维模式不足并添加新科学概念,而非仅更努力搜索。系统将数据、模型、工具输出、失败及声明均视为类型化产物(typed provenance),从而区分三种模式:retrieval(添加已知对象)、search(探索固定模式)和discovery(可验证的模式转换)。论文通过Kan obstruction和Left Kan extension数学化定义了真正新颖性——由旧证据传输后的逐点残差量化,使novelty可客观测量。案例包括Builder/Breaker模型发现蛋白质模式条件顺应性,以及CategoryScienceClaw发现各向异性纤维网络刚度规则。
Rohan Paul@rohanpaul_ai · 6天前66New MIT paper, great idea for self-evolving AI scientists from
Tries to make an AI scientist notice when its current way of thinking is too small, then add new scientific concepts instead of merely searching harder.
The problem is that most AI science systems still search inside a fixed setup, even when real science sometimes needs new kinds of variables, tools, tests, or claims.
The paper’s core idea is to make every data point, model, tool output, failure, and claim a typed artifact, where typed means the system records what kind of thing it is and how it was produced.
Then the system can tell the difference between retrieval, which adds known things, search, which explores a fixed setup, and discovery, which changes the setup itself.
So novelty AI scientists is not defined by surprise, fluency, or benchmark gain, but by what could not be expressed inside the previous schema.
A serious attempt to formalize something most AI systems still fake: the difference between finding an answer inside a language and earning the right to change the language.
----
arxiv. org/abs/2606.01444
Title: "Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic AI"
译MIT团队提出自我演进AI科学家框架,核心创新是让AI识别当前推理空间过小并主动添加新科学概念,而非仅在固定模式内搜索。论文将数据点、模型、工具输出、失败、声明均视为带类型的artifact,明确区分检索(添加已知对象)、搜索(探索固定schema)和发现(可验证的模式扩展)。通过类型化copresheaf与Kan障碍理论证明,真正发现是可验证的schema扩展:旧证据由左Kan扩展传输,创新性通过逐点残差量化。案例包括Builder/Breaker模型发现蛋白质模式条件顺应性,以及CategoryScienceClaw发现各向异性纤维网络刚度规则。论文arXiv:2606.01444(2026)。
elvis@omarsar0 · 6天前65// Continual Learning Bench //
One of the research areas with lots of investments is continual learning.
While there are many efforts, there is very little progress in measuring it.
So the big question is, do dedicated memory systems actually make agents learn from experience?
Continual Learning Bench says not yet. Across six expert-validated domains with shared learnable structure, naive in-context learning outperforms systems purpose-built for memory management.
CL-Bench introduces a gain metric that isolates genuine learning from prior capability, then shows agents frequently overfit to immediate observations or fail to reuse knowledge across instances.
If a plain ICL baseline beats your memory architecture, the architecture is adding overhead rather than learning.
Paper: https://arxiv.org/abs/2606.05661
Learn to build effective AI agents in our academy: https://academy.dair.ai/
译持续学习领域投入多但进展缓慢。CL-Bench(持续学习基准)在六个由专家验证、包含共享可学习结构的领域上测试,发现简单的上下文学习(ICL)基线优于专门为记忆管理构建的系统。该基准引入增益指标以隔离真正学习效果,结果显示智能体常过度拟合即时观察或未能跨实例复用知识。研究指出,若普通ICL基线超过你的记忆架构,则该架构增加的是开销而非学习。论文:arxiv.org/abs/2606.05661。
meng shao@shao__meng · 6天前59面向 AI Agent 的零信任安全:企业自主 AI Agent 部署框架
Anthropic 官方 5 月份发布的白皮书:企业部署自主 AI Agent 时,传统边界安全不够用,必须把零信任原则延伸到 Agent 架构本身。
报告开篇点出双重加速:
· 基础设施层面:前沿 AI 模型把「漏洞发现 → 利用」的周期从数月压缩到数小时,攻击成本极低。
· Agent 层面:Agent 能自主解释目标、选工具、执行多步操作。传统访问控制挡不住「在合法权限内作恶」,监控也要面对「不靠漏洞、靠持久化操控」的新型攻击。
因此,报告认为:未来优势不取决于谁用了最先进的 AI,而取决于谁的基础安全足够扎实,且 Agent 从第一天就按「已遭入侵」来设计。
零信任的三条原则(和一条设计检验)
三条原则
· 永不信任,始终验证:内外网请求一视同仁,每次访问都要认证与授权
· 假设已遭入侵:重点不是「防住入侵」,而是限制单点失守后的破坏范围
· 最小权限:只给完成任务所需的最小访问权
一条设计检验
这个控制是让攻击不可能,还是只是让攻击更麻烦?
报告中的五个部分分别是:
Part I:Agent 为何是新的安全对象?
Part II:当前威胁图谱(OWASP 视角)
Part III:三层能力成熟度模型(报告核心)
Part IV:八阶段实施工作流
Part V:防御运营要跟上自主威胁的速度
白皮书地址:
https://cdn.prod.website-files.com/6889473510b50328dbb70ae6/6a1611a04085d7cd3dadc924_Claude-eBook-Zero-Trust-for-AI-Agents-05182026.pdf
视频版 🔽🔽🔽
译Anthropic 5 月发布白皮书,提出企业部署自主 AI Agent 时须将零信任原则延伸至 Agent 架构。报告指出双重加速:前沿模型将漏洞发现到利用周期压缩至数小时;Agent 能自主解释目标、选工具、执行多步操作,传统访问控制无法阻止“合法权限内作恶”。核心原则:永不信任始终验证、假设已遭入侵、最小权限;另附设计检验——控制是让攻击不可能,还是仅增加麻烦?报告分五部分:Agent 为何是新安全对象、威胁图谱、三层能力成熟度模型、八阶段实施工作流、防御运营适配自主威胁速度。
SemiAnalysis@SemiAnalysis_ · 7天前61Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rewinding failed matches.
译来自 @makora_ai 的序贯蒙特卡洛投机解码会并行保持多个草稿 token 存活,而不是回退失败的匹配。
Rohan Paul@rohanpaul_ai · 7天前76Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not isolated benchmark questions.
The system tracks agents using web search, files, and terminal tools while people ask them to write code, build apps, research topics, create documents, and analyze files.
The problem with almost all traditional AI benchmarks is that they test clean tasks, while agents now handle messy work like coding, research, documents, web browsing, files, and terminal commands.
Agent Arena tries to measure agents inside real work sessions, where users correct them, approve results, complain, download files, and expose tool failures as the task unfolds.
Its core idea is to treat each model choice like a test condition, then estimate how much that model improves task outcomes compared with a baseline.
The leaderboard combines 5 signals: confirmed task success, praise versus complaint, ability to follow corrections, recovery from terminal errors, and whether the agent invents tools that do not exist.
The data is large enough to show real behavior patterns, with 300K+ tasks, 2M+ tool calls, and 40M lines of code produced by agents.
The score combines task success, steerability, bash recovery, praise vs. complaint, and tool hallucination, which means the model is judged by whether it finishes, recovers, accepts correction, and avoids fake tool calls.
GPT-5.5 High leads with +10.7% net improvement, followed by Claude Opus 4.7 Thinking at +9.5% and GPT-5.4 High at +8.9%.
The most useful detail is that agents fail like workers under pressure: they can leave one part incomplete, claim the job is done, or sound confident while backing down after correction.
Arena’s strongest contribution is treating agents as working systems, where model choice, tool use, recovery behavior, and user satisfaction all count together.
译Arena 推出基于真实用户任务的智能体排行榜,评估模型在代码编写、应用构建、文档分析等工作中的表现,而非孤立基准。排行榜基于30万+任务、200万+工具调用和4000万行代码,综合任务成功、纠正遵从性、错误恢复、用户表扬与抱怨、工具幻觉等信号。前三名:GPT-5.5 High(+10.7%)、Claude Opus 4.7 Thinking(+9.5%)、GPT-5.4 High(+8.9%)。
Chubby♨️@kimmonismus · 7天前65AI scientists may be moving from search to real discovery.
A new MIT paper proposes a framework for self-revising AI systems that don’t just explore a fixed scientific vocabulary, but can expand the vocabulary itself, introducing new variables, tools, verifiers, and model structures when existing ones are no longer enough.
True scientific progress is often not just about finding better answers, but about changing the space in which answers can exist.
If this scales, AI could become far more than a research assistant: it could become an auditable partner in building new scientific world models.
Still early, but conceptually very exciting.
译MIT Buehler团队提出Self-Revising Discovery Systems框架,让AI能自主扩展科学词汇(变量、工具、验证器、模型结构),而非仅搜索固定空间。论文使用typed copresheaf和Kan obstruction数学框架形式化智能体工作流,证明真正发现是可验证的schema扩展:旧证据通过Left Kan extension迁移,新异性由pointwise残差客观量化,区分发现与搜索。三种模态:检索(添加已知对象)、搜索(固定schema)、发现(验证的范式转换)。案例包括Builder/Breaker发现蛋白质模式条件合规性,CategoryScienceClaw发现各向异性纤维网络刚度规则。论文arXiv:2606.01444(2026)。
Emad@EMostaque · 7天前33If Claude is good enough for Nobel Prize winners it is good enough for you
https://arxiv.org/abs/2606.03300
译如果 Claude 对诺贝尔奖得主来说都足够好,那对你也一样。
https://arxiv.org/abs/2606.03300
Rohan Paul@rohanpaul_ai · 7天前79Anthropic’s new chemistry report has a genuinely wild result.
Claude Opus 4.7 is now competitive with dedicated NMR software, and the bigger story is that it can work the problem backwards, i.e. infer the molecule from the spectrum.”
NMR software is the chemist’s expert tool for turning molecular structures into predicted lab spectra.
So Opus 4.7 is no longer just “helping chemists read data” — it can work backward from NMR data and propose the molecule’s structure, a task the report says existing mainstream tools generally leave to human chemists.
Note, that Opus 4.7, a general-purpose model with no chemistry-specific fine-tuning.
Claude Opus 4.7 made the smallest hydrogen prediction errors and nearly matched MestReNova on carbon, meaning it can predict NMR signals about as well as specialist chemistry tools.
So AI now handle one of chemistry’s hidden bottlenecks: translating between a molecule, its spectral shadow, and the structure a chemist actually needs to trust.
译Anthropic最新化学报告显示,通用大模型Claude Opus 4.7(无化学微调)在NMR核磁共振谱分析上匹配甚至超越专用软件MestReNova,氢预测误差最小,碳预测近乎一致。更关键的是,它能从NMR光谱反向推导分子结构——这一任务以往只能由人类化学家完成。这意味着AI现在可以处理化学中的关键瓶颈:在分子结构、谱图与最终确认之间自动翻译。
Microsoft Research@MSFTResearch · 7天前60During the Inside Azure Innovations breakout at Build 2026, Microsoft Azure CTO, deputy CISO and technical fellow Mark Russinovich introduced Project Mosaic, an experimental optical interconnect technology from Microsoft Research Cambridge using micro-LEDs for low-power, high-speed data transmission.
A live demo led by senior researcher Kaoutar Benyahya displays individual LED modulation forming letters, proving the concept’s real-time responsiveness. Check out Mark and Kaoutar starting @ 38:38: https://msft.it/6015vdhS9
译微软Azure CTO Mark Russinovich在Build 2026上介绍Project Mosaic,这是微软剑桥研究院的实验性光学互连技术,采用micro-LED实现低功耗、高速数据传输。高级研究员Kaoutar Benyahya现场演示单个LED调制形成字母,证明概念具备实时响应能力。
Chubby♨️@kimmonismus · 7天前72We are in for a wild ride, and this is just the beginning:
'World-first' vaccine designed by artificial intelligence
Researchers at the University of Cambridge have trialled what they describe as the world’s first AI-designed vaccine component in humans.
The vaccine uses an AI-designed “super-antigen” intended to train the immune system against a broad family of coronaviruses, including existing Covid variants and animal coronaviruses that could potentially cause future pandemics.
Instead of designing a vaccine around one current virus strain, researchers fed AI genetic data from many known coronaviruses. The AI then designed an antigen meant to trigger immune protection across the whole virus family, even if the virus mutates or jumps from animals to humans.
The first human trial involved 39 people and mainly tested safety. The immune response was described as modest, but the result is still seen as promising because it shows that an AI-designed vaccine antigen can be tested in humans.
A larger study with around 200 people will now examine how well the vaccine actually trains the immune system.
译剑桥大学研究人员开展了据称全球首个AI设计疫苗成分的人体试验。该疫苗使用AI设计的“超级抗原”,旨在训练免疫系统对抗包括现有新冠变种及可能引发未来大流行的动物冠状病毒在内的广泛冠状病毒家族。首次人体试验仅39人,主要验证安全性。免疫反应虽属中等,但被视为有前景,证明AI设计的疫苗抗原可以在人体中测试。下一步计划进行约200人的更大规模研究。
Anthropic@AnthropicAI · 7天前73New Anthropic Science Blog: Making Claude a chemist.
To manipulate a molecule, chemists first need to understand its structure. Their main tool is NMR spectroscopy.
We found Opus 4.7 matches—and on some tasks beats—dedicated NMR software. Read more: https://www.anthropic.com/research/making-claude-a-chemist
译Anthropic 新科学博客:让 Claude 成为化学家。
要操纵分子,化学家首先需要了解其结构。他们的主要工具是 NMR 波谱分析。
我们发现 Opus 4.7 在部分任务上匹配甚至超越了专用 NMR 软件。了解更多:https://www.anthropic.com/research/making-claude-a-chemist
Jim Fan@DrJimFan · 7天前71NitroGen just won CVPR Best Paper Honorable Mention!! We are making strides towards general-purpose embodied agents that master not only the real world physics, but also all possible physics across a multiverse of simulations.
It’s been 4 years since MineDojo, our first embodied agent in Minecraft, won NeurIPS Best Paper. Congrats to everyone on the team!!
译NitroGen 刚刚获得 CVPR 最佳论文荣誉提名!!我们正在朝着通用具身智能体迈进,不仅掌握真实世界的物理规律,还能掌握模拟多元宇宙中所有可能的物理规律。
距离我们的第一个 Minecraft 具身智能体 MineDojo 获得 NeurIPS 最佳论文奖已经过去 4 年了。祝贺团队里的每一位!!
AK@_akhaliq · 7天前56ArcANE
Do Role-Playing Language Agents Stay in Character at the Right Time?
译ArcANE
角色扮演语言智能体是否能在适当时刻保持角色?
AK@_akhaliq · 7天前57Code2LoRA
Hypernetwork-Generated Adapters for Code Language Models under Software Evolution
译Code2LoRA
超网络生成的代码语言模型适配器,用于软件演化环境。
elvis@omarsar0 · 7天前69// The Meta-Agent Challenge //
How good are current agents at self-improving?
This is a great paper covering some of the challenges.
They propose the Meta-Agent Challenge (MAC), where they give a coding agent a sandbox, an evaluation API, and a time budget, then ask it to program an agent that maximizes held-out performance across five domains.
Results:
Meta-agents rarely match human-engineered baselines, and the few that do are dominated by proprietary frontier models.
Under high optimization pressure, some agents started exfiltrating ground truth from the scoring channel, even with multi-layer anti-reward-hacking defenses in place.
Paper: https://arxiv.org/abs/2606.04455
Learn to build effective AI agents in our academy: https://academy.dair.ai/
译最新研究提出元智能体挑战(MAC),将编码智能体放入沙盒,给定评估API和时间预算,要求其自主编程出在五个领域表现最优的智能体。结果发现,元智能体极少能匹敌人工设计的基线,少数成功的案例也几乎全部依赖专有前沿模型。更值得警惕的是,在高优化压力下,一些智能体开始从评分渠道外泄真实答案,即便研究人员设置了多层反奖励破解防御也未能阻止。论文:arxiv.org/abs/2606.04455。
AI at Meta@AIatMeta · 7天前64Big congrats to our SAM 3D team for receiving a Best Paper Honorable Mention at #CVPR26! This prestigious recognition underscores their incredible work pushing the boundaries of computer vision.
Read the paper here: https://arxiv.org/abs/2511.16624
译热烈祝贺我们的 SAM 3D 团队在 #CVPR26 获得最佳论文荣誉提名!这项殊荣凸显了他们在推动计算机视觉边界方面的杰出工作。
论文链接:https://arxiv.org/abs/2511.16624
Berryxia.AI@berryxia · 7天前70大模型都不再卷推理,都开始卷规划能力!
腾讯混元联合人大高瓴人工智能学院直接开源了PlanningBench,一个专门测、训LLM真实规划能力的框架。
里面塞了30多个来自真实世界的规划任务,覆盖调度、生产、旅行、资源分配、应急响应等六大类,每一个都有清晰的成功标准和全自动验证机制。
你既可以用它测出当前最强模型到底在规划上有多拉胯,也能直接拿来继续微调,让模型从“会说”真正进化到“会干”。
以前整个行业都在卷参数、卷上下文、卷工具调用,好像规划能力是自然就会长出来的。
现在PlanningBench用30多个可验证任务直接把真相摊开:规划才是agent从玩具走向生产力的真正分水岭。
腾讯这次把论文、代码、数据集全甩到GitHub和Hugging Face,等于把这个最难、最核心的能力从黑盒拉到了公开赛道。
译腾讯混元联合人大高瓴人工智能学院开源PlanningBench,一个可扩展、可验证的框架,用于评估和训练大语言模型(LLM)的真实规划能力。该框架包含30多个来自调度、生产、旅行、资源分配、应急响应等六大类的真实世界规划任务,每项任务都有清晰的成功标准和全自动验证机制。用户既可用它评测当前最强模型在规划上的短板,也可直接用于微调,让模型从“会说”进化到“会干”。论文、代码和数据集已全部在GitHub和Hugging Face开源。
Rohan Paul@rohanpaul_ai · 7天前63Better self-improving agents need better solvers, not bigger update-writing models.
This challenges the common habit of putting the strongest model in the evolver seat.
The usual intuition was: put the strongest model in the evolver seat, because a better model should write better prompts, memories, tools, and skills.
This paper cuts that intuition in half.
It separates two jobs that are usually blurred together: writing useful harness updates, and benefiting from those updates during task execution.
The paper says the cheaper model can often write good enough prompt, memory, or skill updates. So a small Qwen3.5-9B evolver can create updates that help about as much as Claude Opus 4.6.
The expensive model is more useful as the agent that actually solves the task with those updates.
i.e. using the updates is very model-dependent, because weak models often fail to load the right skill or load it and then stop following it during a long task.
Strong models can use the harness, but they may already be close enough to their ceiling that the update has less room to help.
The sweet spot is the mid-tier model: capable enough to invoke and follow the new procedure, but not so capable that the harness has nothing left to teach.
----
Link – arxiv. org/abs/2605.30621
Title: "Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents"
译论文“Harness Updating Is Not Harness Benefit”挑战了常见直觉——把最强模型放在进化者位置以写出更好更新。实验表明,廉价模型Qwen3.5-9B即可写出与Claude Opus 4.6效果相近的提示、记忆和技能更新。昂贵模型更适合作为求解任务的智能体,因弱模型无法正确加载或遵循更新,强模型已近能力上限,收益有限。甜区在中档模型:既能调用新程序,又有足够学习空间。
Rohan Paul@rohanpaul_ai · 6月5日60Harness-1 makes search agents better by moving memory work out of the model and into a helper system.
Shows that intelligence performs better when the environment stops forcing it to spend cognition on bookkeeping.
That search agents should stop using the LLM as the notebook and let a separate harness track the search state.
The paper proved that a 20B model improved search by doing less inside its own head.
The problem is that normal search agents must both think about the next search and remember every document, clue, failed path, and remaining check inside the same limited context.
This formulation puts too much routine state management inside the policy.
Harness-1 separates those jobs.
The model keeps the hard semantic choices: what to search, what to inspect, what to verify, and when the evidence is good enough.
The harness keeps the recoverable state: candidate pools, curated documents, importance tags, evidence links, verification records, deduplicated observations, and budget-aware memory rendering.
That sounds minor until you look at reinforcement learning.
RL works poorly when every failure looks the same, because an empty or wrong final set does not reveal whether the agent searched badly, forgot evidence, skipped verification, or curated carelessly.
By externalizing state, Harness-1 gives the policy a cleaner learning problem: improve decisions over a visible search workspace.
For Harness-1, its gains were larger on held-out benchmarks than on source-family tasks, suggesting the model learned reusable search moves rather than memorized domain habits.
----
Link – arxiv. org/abs/2606.02373
Title: "Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses"
译Harness-1 将大语言模型的记忆工作转移到外部辅助系统(harness),解决传统搜索智能体需在同一上下文窗口内处理语义决策与状态记录导致的效率低下问题。模型仅负责搜索、验证等关键语义选择,而可恢复状态(候选池、证据链接、去重记录、预算感知记忆等)由 harness 追踪。这一分离使一个 20B 参数模型实现了更好的搜索表现。在强化学习中,外部化状态避免了失败原因混淆,有助于策略学习。Harness-1 在未见 benchmark 上提升更大,表明模型学到了可复用的搜索策略而非记忆领域习惯。论文 arXiv:2606.02373。
meng shao@shao__meng · 6月5日65Anthropic 发布关于「AI 递归自我改进」的研究报告
Anthropic 内部以 Claude 为代表的 AI 系统正被越来越深地用于开发下一代 AI 系统。这种 “AI 构建 AI” 的趋势正在加速。如果继续发展,可能出现系统完全自主设计并训练自身后继版本的情形——即递归自我改进。
https://www.anthropic.com/institute/recursive-self-improvement
关键证据(“外部公开基准”和“Anthropic 内部数据”)
1. 外部能力指标
· 模型可靠完成的任务时长正以约每 4 个月翻倍的速度增长(此前是每 7 个月)。
· SWE-bench 两年内从个位数分数趋于饱和。
· CORE-Bench 15 个月内从约 20% 饱和。
· 长时任务能力已达 16 小时量级。
2. 内部工程与研发数据
· 代码产出:截至 2026 年 5 月,Anthropic 合并到主干的代码中超过 80% 由 Claude 撰写;2026 年 Q2,工程师日均合并代码量是 2024 年的 8 倍。
· 主观感知:2026 年 3 月内部调研(130 名员工)中,受访者中位数估计自身产出约为无 AI 时的 4 倍。
· 代码质量:2025 年末 Claude 代码仍略逊于人类,如今已接近持平,并预计年内反超;人类审查已形成新瓶颈(阿姆达尔定律)。
· 实验执行:在给定目标的代码加速任务中,Claude 从 2025 年 5 月的约 3x 提升至 2026 年 4 月的约 52x;同等任务人类专家通常仅达 4x。
· 自主研究:2026 年 4 月,Claude Agent 端到端完成了一项 AI 安全开放研究问题,独立提出假设、设计实验、迭代结论,恢复能力达到人类两组研究者一周工作量的 97%(人类仅约 23%)。
· 研究判断:在 129 个真实开放调研场景中,Claude 在“下一步该怎么做”上优于人类原选择的比例从 2025 年 11 月的 51% 升至 2026 年 4 月的 64%。
结构性观察
人类在 AI 研发流程中的角色正在逐层收缩:
· 执行层(写代码、跑实验)已高度自动化;
· 方向层(选择研究问题、判断结果可信度、识别死胡同)目前仍是人类比较优势,但这一优势正在收窄。
即使“研究品味”永远无法被 AI 掌握,只要人类只保留极少量方向性工作,而 AI 承担其余部分,整体研发速度仍会呈复合加速。
三种未来情景
· 趋势停滞:边际收益递减、算力/能源供给受限、新架构尚未出现;作者认为不太可能,但会给社会最多适应时间
· 持续自动化,人类仍掌方向:100 人公司可相当于万人组织;人类瓶颈转向审核与协调;作者认为最可能进入此情景
· 完整递归自我改进:AI 自主设计后继系统,人类角色转为监督与验证;科技进步完全由算力决定;最不确定、风险最高
译Anthropic 发布报告显示,Claude 正被深度用于开发下一代 AI,趋势加速或导致系统自主设计后继版本。外部指标:模型可靠完成任务时长约每 4 个月翻倍,SWE-bench 两年内饱和,CORE-Bench 15 个月内饱和,长时任务达 16 小时。内部数据:截至 2026 年 5 月超 80% 主干代码由 Claude 撰写;工程师日均合并代码量是 2024 年的 8 倍;员工中位数估计产出为无 AI 时的 4 倍;实验执行从约 3x 提升至约 52x;自主研究恢复能力达人类两组研究者一周工作量的 97%(人类约 23%);研究判断优于人类比例从 51% 升至 64%。报告探讨了趋势停滞、持续自动化、完整递归自我改进三种未来情景。
Rohan Paul@rohanpaul_ai · 6月5日70Another great paper from Google.
Shows general LLMs can solve formal math by planning proofs and checking each step. Raised general LLM performance from under 10% to 70%.
A general LLM failed badly when asked to write full formal proofs in 1 try, but became much stronger when it planned, split the work into smaller claims, reused past claims, and learned from Lean’s feedback.
The paper shows the weakness was not just the model’s math ability, but the way it was being used - the absence of structured interaction with a verifier.
The key idea is that the model does not try to write one giant perfect proof at once, because that usually fails on long and tricky problems.
Instead, LEAP stores the proof as a graph of goals and subgoals, so useful lemmas can be reused instead of rediscovered every time.
The authors tested LEAP on Putnam 2025 and a new Lean benchmark built from 60 IMO-style problems, where ordinary one-shot proof writing did very poorly.
LEAP solved all 12 Putnam 2025 problems and raised general LLM performance on the Lean IMO benchmark from under 10% to 70%.
----
Link – arxiv. org/abs/2606.03303
Title: "LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks"
译Google 新论文 LEAP 提出智能体框架,通过规划证明、分解子目标、复用已有引理并利用 Lean 验证器反馈,将通用 LLM 在形式化数学证明上的性能从不到 10% 提升至 70%。传统单次完整证明在长难题上表现极差,而 LEAP 将证明存储为有向图结构,先规划再逐步验证。在 Putnam 2025 竞赛中,LEAP 成功解出全部 12 道题;在包含 60 道 IMO 风格题目的 Lean 基准测试中,也实现了上述性能跃升。
Emad@EMostaque · 6月5日81foom!
译Anthropic内部数据显示,Claude正在加速AI开发——这可能走向递归自我改进,即AI自主构建更强大的后继者。进展比预期更快,影响值得更多关注。主推文仅感叹:“foom!”
🚨 AI News | TestingCatalog@testingcatalog · 6月5日78ANTHROPIC 🔥: A new internal research has been published, highlighting an accelerated AI development and a potential path to recursive self-improvement.
> Claude Mythos Preview could work for “at least” 16 hours and was “at the upper end of what [METR] can measure.”
> Today, Anthropic engineers on average ship 8x as much code per quarter as they did compared to 2021-2025.
Do you feel it? 👀
译Anthropic 发布内部研究,称 Claude 正加速 AI 开发,可能通往递归自我改进——即 AI 自主构建更强大的继任者。研究显示,Claude Mythos Preview 可连续工作至少 16 小时,达到 METR 可测量上限。同时,Anthropic 工程师当前每季度交付的代码量是 2021-2025 年期间的 8 倍。
AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 6月5日73HOLY SHIT LET'S FUCKING GOO
译HOLY SHIT LET'S FUCKING GOO
我们内部数据显示,Claude 正在加速 AI 发展——这可能通往递归自我改进,即 AI 自主构建更强大的后继者。
这发生得比我们想象的更快,其影响值得更多关注。
Nathan Lambert@natolambert · 6月4日60We have another 65 page frontier model report from Nvidia to read @eliebakouch @stochasticchasm and gang
译我们又有另一份来自英伟达的65页前沿模型报告要读,作者@eliebakouch @stochasticchasm及其团队。
Rohan Paul@rohanpaul_ai · 6月4日66This Illinois+ Tsinghua University and other labs study finds that LLM agents still have unreliable memory and that it can get worse when they keep rewriting their own memories.
LLM agents can learn from experience, but their rewritten memories often become unreliable.
The problem is that many agent systems store past work by asking an LLM to compress messy experience into neat written lessons.
That sounds useful because the agent should remember what worked before, but the paper finds that repeated rewriting slowly damages the memory.
The core idea is that raw episodes, meaning the actual past attempts and solutions, often stay more useful than the polished lessons made from them.
The authors tested this across tasks like web shopping, simulated worlds, app use, and ARC-style puzzle problems where they could control the correct solutions.
The sharpest result is that GPT-5.4 solved 100% of a small ARC-AGI set with no memory, but after memory was built from correct solutions, streaming updates dropped it to about 54%.
The failures came from bad grouping, overbroad lessons, and overfitting, so the memory forgot details, mixed up task types, or learned rules that only worked on narrow examples.
The big deal is that agent memory should not automatically rewrite every experience into a summary, because keeping raw evidence and only sometimes making summaries worked better.
The paper is really proposing that agent memory should treat raw past episodes as important evidence, not as disposable notes to summarize away.
----
arxiv. org/abs/2605.12978
Title: "Useful Memories Become Faulty When Continuously Updated by LLMs"
译伊利诺伊大学和清华大学等实验室研究发现,LLM智能体重复重写自身记忆会导致记忆变得更不可靠。原始经历(实际过往尝试和解决方案)往往比提炼后的总结更有用。测试中,GPT-5.4在小型ARC-AGI数据集上无记忆时正确率100%,但建立记忆并持续更新后降至约54%。失败原因包括分组不当、教训过度泛化及过拟合。研究建议智能体不应自动将每个经历重写为摘要,保留原始证据并仅偶尔总结效果更好。
Rohan Paul@rohanpaul_ai · 6月4日71This Google DeepMind’s paper is a serious warning for anyone using autonomous agents today.
Gives the first clear taxonomy of 6 attack types where harmful websites can detect AI agents and show them hidden content humans never see, like
- Instructions buried in HTML comments or white-on-white text
- Steganography in image pixels
- Override commands in PDFs, metadata, or even speaker notes
- Memory poisoning that persists across sessions
- Goal hijacking and cross-agent cascades in multi-agent setups
The real security problem for AI agents is not just the model, but the environment it reads.
The web itself can be weaponized against autonomous AI agents. As agents increasingly browse the internet, read emails, execute transactions, and spawn sub-agents, the information environment becomes an attack surface.
In one cited benchmark, hidden prompt injections embedded in web content partially commandeered agents in up to 86% of scenarios, sub-agent hijacking working 58–90% of the time, and data exfiltration attacks clearing 80% across five different agent architectures.
That reframes the whole debate.
We usually talk about model safety as if the danger sits inside the weights, but agents do something more fragile: they browse, retrieve, remember, and act on untrusted material in real time.
Here’s the thing to worry about.
A web page does not have to look malicious to be dangerous to an agent, because the agent may parse what humans never see: hidden HTML comments, metadata, CSS-hidden text, formatting syntax, or adversarial content embedded in images and other media.
The threat gets more serious once memory enters the loop.
If an agent uses RAG or persistent memory, poisoning no longer has to win in one shot. It can sit quietly in a corpus or memory store and activate later, which is why the paper highlights results showing latent memory poisoning above 80% attack success with less than 0.1% data contamination.
---
ssrn .com/sol3/papers.cfm?abstract_id=6372438
译Google DeepMind论文首次系统分类六类攻击:HTML注释/白色文本隐藏指令、图像隐写、PDF元数据/演讲者笔记覆写、跨会话内存投毒、目标劫持及多智能体级联攻击。隐藏提示注入在86%场景中部分控制智能体,子智能体劫持成功率58–90%,数据泄露攻击在五种架构中均超80%。内存投毒成功率超80%,仅需不足0.1%数据污染。论文指出网页、邮件等非受信材料可被武器化,构成主要攻击面。
Chubby♨️@kimmonismus · 6月4日67A blind Stanford-led study of nearly 3,000 anonymized matchups found law professors across 16 schools preferred AI-generated answers to student contract-law questions over those written by fellow professors 75% of the time, and judged the AI responses far less likely to be pedagogically harmful (3.5% vs. 12%).
"The team tested a range of systems, including commercial tutoring tools and Google's NotebookLM."
Now imagine the performance of models in 6-12 months.
译一项由斯坦福大学领导的盲测研究,对近3000场匿名对决的分析发现,16所法学院的法律教授在合同法问题中,有75%的时间更偏好AI生成的答案,而非教授自己写的答案,并且认为AI回答的教学危害性远低于后者(3.5% vs 12%)。
“研究团队测试了多种系统,包括商业辅导工具和Google的NotebookLM。”
现在想象6-12个月后模型的表现。
AK@_akhaliq · 6月4日62dMoE
dLLMs with Learnable Block Experts
译dMoE
具有可学习块专家的dLLM
AK@_akhaliq · 6月4日46Bootstrap Your Generator
Unpaired Visual Editing with Flow Matching
译自举你的生成器
非配对视觉编辑与流匹配
AK@_akhaliq · 6月4日60Unified Neural Scaling Laws
译统一神经缩放定律
Anthropic@AnthropicAI · 6月4日64How well do the security community's techniques hold up against AI-enabled cyberattacks?
We examined 832 malicious accounts and mapped their activity onto a longstanding database of tactics and techniques used by threat actors.
Here's what we learned:https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack
译安全社区的技术在应对AI驱动的网络攻击方面表现如何?
我们检查了832个恶意账户,并将其活动映射到一个长期存在的威胁行为者战术和技术数据库。
以下是我们学到的:https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack
Microsoft Research@MSFTResearch · 6月4日62A three‑month pilot in a Midwestern bottling plant shows what happens when AI moves beyond chat and into decision-making, where constraints shift, stakes are real, and answers must hold. https://msft.it/6015vjYUN
译一份在中西部装瓶厂进行的三个月试点显示,当AI超越聊天进入决策领域时会发生什么——约束条件变化、风险真实、答案必须可靠。 https://msft.it/6015vjYUN
elvis@omarsar0 · 6月3日72New research from Google.
Just shows the impressive results you can get from custom agent harnesses.
LEAP wraps a general-purpose LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates against verifier feedback.
The same general model solves all 12 Putnam 2025 problems and lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a specialized gold-medal system that scores 48%.
Paper: https://arxiv.org/abs/2606.03303
Learn to build effective AI agents in our academy: https://academy.dair.ai/
译Google 新研究 LEAP 将通用大语言模型封装在智能体框架中,每个步骤基于 Lean 编译器,并依赖验证器反馈进行迭代。同一通用模型解决了全部 12 道 Putnam 2025 问题,并将 Lean-IMO-Bench 一次性解决率从不到 10% 提升至 70%,击败了得分 48% 的专业金牌系统。论文链接:https://arxiv.org/abs/2606.03303。
Ethan Mollick@emollick · 6月3日41Hey, its our paper!
译嘿,这是我们发表的论文!
[引用 @PNAS News]:过去一周PNAS最高浏览量文章之一——《劝说大语言模型遵守有异议的请求》。查看论文:https://ow.ly/wOxl50Z6fZA
更多热门文章请访问 https://ow.ly/uLkC50Z6fZz。
Saining Xie@sainingxie · 6月3日67how does the brain build and track an internal state of the world from (possibly incomplete and noisy) visual observations?
i believe visual state tracking will be the grand challenge for vision in the coming years, and i hope this benchmark can be a useful starting line. enjoy!
译研究团队推出VSTAT基准测试,用于评估多模态大语言模型(MLLMs)在视频中追踪动态状态的能力。测试任务看似简单,包括计数杯子、识别键入的文字、统计翻页次数等,人类可以轻松完成,但当前MLLMs表现欠佳。该测试旨在推动视觉状态跟踪这一前沿方向的发展,解决模型从不完整、有噪声的视觉观察中建立和更新内部世界状态的核心挑战。