AIHOT
精选全部 AI 动态AI 日报Agent 接入关于更新日志反馈信源提报
登录
精选全部日报更多
全部动态X · 359 条
全部一手资讯X论文
meng shao@shao__meng · 5天前64

AGENTS.md 在 Coding Agents 中真的有用吗? 这篇论文,大规模实证研究仓库级上下文文件(AGENTS.md、CLAUDE.md 等)对编码 Agent 实际效果的影响,可能有些反直觉!感谢 @rasbt 分享! 论文在这:https://arxiv.org/abs/2602.11988 研究背景:实践先行,证据滞后 AGENTS.md 已成为行业惯例,GitHub 上已有 6 万+ 仓库采用,Claude Code (CLAUDE.md)、Codex、Qwen Code 等 Agent 都内置 /init 自动生成。但此前研究多停留在内容分类与描述性统计,缺少对任务完成率的严格评估。 核心难点在于:主流基准 SWE-bench 来自 Django、Flask 等知名仓库,这些项目本来就没有开发者手写的 context file,无法直接评估该实践的真实价值。 实验设计:双基准、三条件、四 Agent · 基准:SWE-bench Lite(300 任务,11 个热门 Python 仓库)+ 新建 AGENTBENCH(138 任务,12 个已含开发者 context file 的冷门仓库) · 三种条件:① 无 context file ② LLM 生成(各 Agent 官方 /init 流程)③ 开发者手写(仅 AGENTBENCH) · Agent/模型:Claude Code + Sonnet 4.5、Codex + GPT-5.2 / GPT-5.1 mini、Qwen Code + Qwen3-30B · 指标:任务成功率、步数、推理成本、工具调用轨迹 核心发现:效果微弱,成本显著 1. 成功率:边际效应,甚至为负 · LLM 生成:8 组设置中 5 组下降,平均 -0.5%(SWE-bench)/ -2%(AGENTBENCH) · 开发者手写:平均 +4%,优于 LLM 生成,但 Claude Code 上甚至不如无文件 · 跨模型、跨 prompt 结论稳健 一句话:自动生成 context file 不仅无益,还可能略有害;手写的提升也很有限。 2. 效率:无文件反而最便宜(步数,成本) · LLM 生成:+2.45 / +3.92 步,+20% / +23% · 开发者手写:+3.34 步,最高 +19% 3. 代码库概览几乎无效 Context file 常被推荐用于「帮助 Agent 快速定位代码」。实测显示:有无 context file,Agent 首次接触相关文件所需的步数并无显著差异。95–100% 的 LLM 生成文件都包含代码库概览,但对导航帮助甚微。 轨迹分析:Agent 听话,但听话很贵 论文排除了「Agent 忽略 context file」这一假设。轨迹分析表明: · 指令遵从度高:context file 提到 uv,使用率从 <0.01 次/任务升至 1.6 次;提到仓库专用工具,从 <0.05 升至 2.5 次 · 行为更「认真」:更多测试、更多文件搜索/阅读、更多 lint/质量检查 · 推理更深:GPT-5.2 推理 token 增加 14–22% 机制链条: Context file 写入额外要求 → Agent 更严格遵从(测试、探索、专用工具) → 步数与成本上升 → 成功率未同步提升(甚至更差) Context file 不是被忽略,而是被过度执行——把「建议性流程」当成了「必做清单」,增加了任务复杂度,却没有换来更高成功率。 一个关键反转:文档冗余假说 当移除仓库中所有其他文档(.md、docs/、示例代码)后,LLM 生成的 context file 反而带来 +2.7% 提升,且优于开发者手写的。 这说明: · 在文档齐全的仓库里,context file 与 README、docs 高度冗余 · 开发者口述的「加了 AGENTS.md 后 Agent 变强了」,很可能是因为目标仓库本身文档稀缺,context file 填补了信息真空 · 对 Django 这类文档完善的知名项目,额外 context 的价值被稀释 消融实验:生成质量的上限 · 更强模型生成 ≠ 更好 context:GPT-5.2 生成的文件在 SWE-bench 上略好(+2%),在 AGENTBENCH 上反而更差(-3%) · 不同 prompt 无一致优势:Codex prompt vs Claude prompt 效果因数据集而异,差异很小 自动生成 context file 的改进空间,目前看来很有限。 实践建议 · 依赖 /init 自动生成:谨慎——平均略降成功率,成本 +20%+ · 长篇架构概览、目录枚举:避免——与代码探索冗余,不加速定位 · 测试/lint/构建命令:精简写入——Agent 会严格执行,但过多要求推高成本 · 仓库专用工具(uv、pdm 等):值得写——指令遵从度高,且代码中不易推断 · 分层/按需引用:方向正确——「做 X 时读 Y.md,否则忽略」减少无关负担

译论文大规模实证检验 AGENTS.md 等仓库级上下文文件对编码 Agent 的影响。在 SWE-bench Lite(300 任务)和新建 AGENTBENCH(138 任务)上测试 Claude Code、Codex、Qwen Code 等组合。核心发现:LLM 自动生成的 context file 在 8 组设置中 5 组成功率下降,平均 -0.5%(SWE-bench)/-2%(AGENTBENCH),成本增加 +20%+;开发者手写仅平均 +4%。冗余假说:移除其他文档后,自动生成反而 +2.7%。建议避免自动生成,精简测试/lint 命令,优先写入仓库专用工具。

查看原推 ↗
AYi@AYi_AInotes · 5天前62

Google的研究找到了一种把 AI记忆大幅压缩的技术,让本地跑大模型 + 自己数据变得更容易了。 也就是说可以把 1000 万个文档 的向量存储,从 31GB 内存 压缩到只剩 4GB,而且搜索速度还比现在最常用的 FAISS 更快。

译Google提出一种AI记忆压缩技术,可将1000万个文档的向量存储从31GB内存压缩至仅4GB,且搜索速度超过目前最常用的FAISS方法。该技术使本地运行大语言模型并结合个人数据变得更加可行。

查看原推 ↗
Rohan Paul@rohanpaul_ai · 5天前49

A Primer paper about how reasoning models improve after training Shows that better reasoning models depend less on raw data size and more on checkable training evidence. reasoning data is NOT simple question-and-answer pairs. The useful part is often the feedback that says why an answer, step, tool action, or full attempt was good or bad. A prompt and a response tell you what a model said, but not why that answer became learnable, which judge blessed it, which failures were hidden, or whether the skill was already inside the base model. The core idea is to describe each training example as a record that includes the task, the model’s behavior, the checking signal, and metadata about where it came from. The authors sort reasoning data by how it can be checked, such as exact rule-based checks for math and code, environment checks for agents using tools, and human or model judgments when no exact checker exists. They also explain why common assumptions fail, because long reasoning traces may be fake, harder examples may be useless for some models, and larger datasets may still miss important coverage. The key point is that agent data should preserve mess: failed actions, retries, recoveries, state differences, and terminal checks, because that is where learning signal often lives. ---- Link – arxiv. org/abs/2606.02113 Title: "A Primer in Post-Training Reasoning Data: What They Know About How It Works"

译论文指出,更好的推理模型更依赖可验证的训练证据,而非原始数据规模。推理数据的关键不是简单问答对,而是提供答案、步骤、工具操作或完整尝试好坏判断的反馈信号。每个训练样本应描述为包含任务、模型行为、检查信号和元数据的记录。研究者按检查方式分类:数学和代码用精确规则、智能体工具用环境检查,无精确检查器时用人类或模型判断。常见误区包括:长推理链可能虚假、更难样例对部分模型无效、更大数据集仍可能缺失关键覆盖。智能体数据应保留失败动作、重试、恢复、状态差异和终端检查等“混乱”信息,因为学习信号常在其中。

查看原推 ↗
Rohan Paul@rohanpaul_ai · 6天前62

Great idea for self-evolving AI scientists from this new MIT paper. Tries to make an AI scientist notice when its current way of thinking is too small, then add new scientific concepts instead of merely searching harder. The problem is that most AI science systems still search inside a fixed setup, even when real science sometimes needs new kinds of variables, tools, tests, or claims. The paper’s core idea is to make every data point, model, tool output, failure, and claim a typed artifact, where typed means the system records what kind of thing it is and how it was produced. Then the system can tell the difference between retrieval, which adds known things, search, which explores a fixed setup, and discovery, which changes the setup itself. So novelty AI scientists is not defined by surprise, fluency, or benchmark gain, but by what could not be expressed inside the previous schema. A serious attempt to formalize something most AI systems still fake: the difference between finding an answer inside a language and earning the right to change the language. ---- arxiv. org/abs/2606.01444 Title: "Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic AI"

译MIT论文(F.Y. Wang & M.J. Buehler, arXiv:2606.01444, 2026)提出Self-Revising Discovery Systems框架,使AI科学家能自主识别当前思维模式不足并添加新科学概念,而非仅更努力搜索。系统将数据、模型、工具输出、失败及声明均视为类型化产物(typed provenance),从而区分三种模式:retrieval(添加已知对象)、search(探索固定模式)和discovery(可验证的模式转换)。论文通过Kan obstruction和Left Kan extension数学化定义了真正新颖性——由旧证据传输后的逐点残差量化,使novelty可客观测量。案例包括Builder/Breaker模型发现蛋白质模式条件顺应性,以及CategoryScienceClaw发现各向异性纤维网络刚度规则。

查看原推 ↗
Rohan Paul@rohanpaul_ai · 6天前66

New MIT paper, great idea for self-evolving AI scientists from Tries to make an AI scientist notice when its current way of thinking is too small, then add new scientific concepts instead of merely searching harder. The problem is that most AI science systems still search inside a fixed setup, even when real science sometimes needs new kinds of variables, tools, tests, or claims. The paper’s core idea is to make every data point, model, tool output, failure, and claim a typed artifact, where typed means the system records what kind of thing it is and how it was produced. Then the system can tell the difference between retrieval, which adds known things, search, which explores a fixed setup, and discovery, which changes the setup itself. So novelty AI scientists is not defined by surprise, fluency, or benchmark gain, but by what could not be expressed inside the previous schema. A serious attempt to formalize something most AI systems still fake: the difference between finding an answer inside a language and earning the right to change the language. ---- arxiv. org/abs/2606.01444 Title: "Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic AI"

译MIT团队提出自我演进AI科学家框架,核心创新是让AI识别当前推理空间过小并主动添加新科学概念,而非仅在固定模式内搜索。论文将数据点、模型、工具输出、失败、声明均视为带类型的artifact,明确区分检索(添加已知对象)、搜索(探索固定schema)和发现(可验证的模式扩展)。通过类型化copresheaf与Kan障碍理论证明,真正发现是可验证的schema扩展:旧证据由左Kan扩展传输,创新性通过逐点残差量化。案例包括Builder/Breaker模型发现蛋白质模式条件顺应性,以及CategoryScienceClaw发现各向异性纤维网络刚度规则。论文arXiv:2606.01444(2026)。

查看原推 ↗
elvis@omarsar0 · 6天前65

// Continual Learning Bench // One of the research areas with lots of investments is continual learning. While there are many efforts, there is very little progress in measuring it. So the big question is, do dedicated memory systems actually make agents learn from experience? Continual Learning Bench says not yet. Across six expert-validated domains with shared learnable structure, naive in-context learning outperforms systems purpose-built for memory management. CL-Bench introduces a gain metric that isolates genuine learning from prior capability, then shows agents frequently overfit to immediate observations or fail to reuse knowledge across instances. If a plain ICL baseline beats your memory architecture, the architecture is adding overhead rather than learning. Paper: https://arxiv.org/abs/2606.05661 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译持续学习领域投入多但进展缓慢。CL-Bench(持续学习基准)在六个由专家验证、包含共享可学习结构的领域上测试,发现简单的上下文学习(ICL)基线优于专门为记忆管理构建的系统。该基准引入增益指标以隔离真正学习效果,结果显示智能体常过度拟合即时观察或未能跨实例复用知识。研究指出,若普通ICL基线超过你的记忆架构,则该架构增加的是开销而非学习。论文:arxiv.org/abs/2606.05661。

查看原推 ↗
meng shao@shao__meng · 6天前59

面向 AI Agent 的零信任安全:企业自主 AI Agent 部署框架 Anthropic 官方 5 月份发布的白皮书:企业部署自主 AI Agent 时,传统边界安全不够用,必须把零信任原则延伸到 Agent 架构本身。 报告开篇点出双重加速: · 基础设施层面:前沿 AI 模型把「漏洞发现 → 利用」的周期从数月压缩到数小时,攻击成本极低。 · Agent 层面:Agent 能自主解释目标、选工具、执行多步操作。传统访问控制挡不住「在合法权限内作恶」,监控也要面对「不靠漏洞、靠持久化操控」的新型攻击。 因此,报告认为:未来优势不取决于谁用了最先进的 AI,而取决于谁的基础安全足够扎实,且 Agent 从第一天就按「已遭入侵」来设计。 零信任的三条原则(和一条设计检验) 三条原则 · 永不信任,始终验证:内外网请求一视同仁,每次访问都要认证与授权 · 假设已遭入侵:重点不是「防住入侵」,而是限制单点失守后的破坏范围 · 最小权限:只给完成任务所需的最小访问权 一条设计检验 这个控制是让攻击不可能,还是只是让攻击更麻烦? 报告中的五个部分分别是: Part I:Agent 为何是新的安全对象? Part II:当前威胁图谱(OWASP 视角) Part III:三层能力成熟度模型(报告核心) Part IV:八阶段实施工作流 Part V:防御运营要跟上自主威胁的速度 白皮书地址: https://cdn.prod.website-files.com/6889473510b50328dbb70ae6/6a1611a04085d7cd3dadc924_Claude-eBook-Zero-Trust-for-AI-Agents-05182026.pdf 视频版 🔽🔽🔽

译Anthropic 5 月发布白皮书,提出企业部署自主 AI Agent 时须将零信任原则延伸至 Agent 架构。报告指出双重加速:前沿模型将漏洞发现到利用周期压缩至数小时;Agent 能自主解释目标、选工具、执行多步操作,传统访问控制无法阻止“合法权限内作恶”。核心原则:永不信任始终验证、假设已遭入侵、最小权限;另附设计检验——控制是让攻击不可能,还是仅增加麻烦?报告分五部分:Agent 为何是新安全对象、威胁图谱、三层能力成熟度模型、八阶段实施工作流、防御运营适配自主威胁速度。

查看原推 ↗
SemiAnalysis@SemiAnalysis_ · 7天前61

Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rewinding failed matches.

译来自 @makora_ai 的序贯蒙特卡洛投机解码会并行保持多个草稿 token 存活,而不是回退失败的匹配。

查看原推 ↗
Rohan Paul@rohanpaul_ai · 7天前76

Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not isolated benchmark questions. The system tracks agents using web search, files, and terminal tools while people ask them to write code, build apps, research topics, create documents, and analyze files. The problem with almost all traditional AI benchmarks is that they test clean tasks, while agents now handle messy work like coding, research, documents, web browsing, files, and terminal commands. Agent Arena tries to measure agents inside real work sessions, where users correct them, approve results, complain, download files, and expose tool failures as the task unfolds. Its core idea is to treat each model choice like a test condition, then estimate how much that model improves task outcomes compared with a baseline. The leaderboard combines 5 signals: confirmed task success, praise versus complaint, ability to follow corrections, recovery from terminal errors, and whether the agent invents tools that do not exist. The data is large enough to show real behavior patterns, with 300K+ tasks, 2M+ tool calls, and 40M lines of code produced by agents. The score combines task success, steerability, bash recovery, praise vs. complaint, and tool hallucination, which means the model is judged by whether it finishes, recovers, accepts correction, and avoids fake tool calls. GPT-5.5 High leads with +10.7% net improvement, followed by Claude Opus 4.7 Thinking at +9.5% and GPT-5.4 High at +8.9%. The most useful detail is that agents fail like workers under pressure: they can leave one part incomplete, claim the job is done, or sound confident while backing down after correction. Arena’s strongest contribution is treating agents as working systems, where model choice, tool use, recovery behavior, and user satisfaction all count together.

译Arena 推出基于真实用户任务的智能体排行榜,评估模型在代码编写、应用构建、文档分析等工作中的表现,而非孤立基准。排行榜基于30万+任务、200万+工具调用和4000万行代码,综合任务成功、纠正遵从性、错误恢复、用户表扬与抱怨、工具幻觉等信号。前三名:GPT-5.5 High(+10.7%)、Claude Opus 4.7 Thinking(+9.5%)、GPT-5.4 High(+8.9%)。

查看原推 ↗
Chubby♨️@kimmonismus · 7天前65

AI scientists may be moving from search to real discovery. A new MIT paper proposes a framework for self-revising AI systems that don’t just explore a fixed scientific vocabulary, but can expand the vocabulary itself, introducing new variables, tools, verifiers, and model structures when existing ones are no longer enough. True scientific progress is often not just about finding better answers, but about changing the space in which answers can exist. If this scales, AI could become far more than a research assistant: it could become an auditable partner in building new scientific world models. Still early, but conceptually very exciting.

译MIT Buehler团队提出Self-Revising Discovery Systems框架,让AI能自主扩展科学词汇(变量、工具、验证器、模型结构),而非仅搜索固定空间。论文使用typed copresheaf和Kan obstruction数学框架形式化智能体工作流,证明真正发现是可验证的schema扩展:旧证据通过Left Kan extension迁移,新异性由pointwise残差客观量化,区分发现与搜索。三种模态:检索(添加已知对象)、搜索(固定schema)、发现(验证的范式转换)。案例包括Builder/Breaker发现蛋白质模式条件合规性,CategoryScienceClaw发现各向异性纤维网络刚度规则。论文arXiv:2606.01444(2026)。

查看原推 ↗
Emad@EMostaque · 7天前33

If Claude is good enough for Nobel Prize winners it is good enough for you https://arxiv.org/abs/2606.03300

译如果 Claude 对诺贝尔奖得主来说都足够好,那对你也一样。 https://arxiv.org/abs/2606.03300

查看原推 ↗
Rohan Paul@rohanpaul_ai · 7天前79

Anthropic’s new chemistry report has a genuinely wild result. Claude Opus 4.7 is now competitive with dedicated NMR software, and the bigger story is that it can work the problem backwards, i.e. infer the molecule from the spectrum.” NMR software is the chemist’s expert tool for turning molecular structures into predicted lab spectra. So Opus 4.7 is no longer just “helping chemists read data” — it can work backward from NMR data and propose the molecule’s structure, a task the report says existing mainstream tools generally leave to human chemists. Note, that Opus 4.7, a general-purpose model with no chemistry-specific fine-tuning. Claude Opus 4.7 made the smallest hydrogen prediction errors and nearly matched MestReNova on carbon, meaning it can predict NMR signals about as well as specialist chemistry tools. So AI now handle one of chemistry’s hidden bottlenecks: translating between a molecule, its spectral shadow, and the structure a chemist actually needs to trust.

译Anthropic最新化学报告显示,通用大模型Claude Opus 4.7(无化学微调)在NMR核磁共振谱分析上匹配甚至超越专用软件MestReNova,氢预测误差最小,碳预测近乎一致。更关键的是,它能从NMR光谱反向推导分子结构——这一任务以往只能由人类化学家完成。这意味着AI现在可以处理化学中的关键瓶颈:在分子结构、谱图与最终确认之间自动翻译。

查看原推 ↗
Microsoft Research@MSFTResearch · 7天前60

During the Inside Azure Innovations breakout at Build 2026, Microsoft Azure CTO, deputy CISO and technical fellow Mark Russinovich introduced Project Mosaic, an experimental optical interconnect technology from Microsoft Research Cambridge using micro-LEDs for low-power, high-speed data transmission. A live demo led by senior researcher Kaoutar Benyahya displays individual LED modulation forming letters, proving the concept’s real-time responsiveness. Check out Mark and Kaoutar starting @ 38:38: https://msft.it/6015vdhS9

译微软Azure CTO Mark Russinovich在Build 2026上介绍Project Mosaic,这是微软剑桥研究院的实验性光学互连技术,采用micro-LED实现低功耗、高速数据传输。高级研究员Kaoutar Benyahya现场演示单个LED调制形成字母,证明概念具备实时响应能力。

查看原推 ↗
Chubby♨️@kimmonismus · 7天前72

We are in for a wild ride, and this is just the beginning: 'World-first' vaccine designed by artificial intelligence Researchers at the University of Cambridge have trialled what they describe as the world’s first AI-designed vaccine component in humans. The vaccine uses an AI-designed “super-antigen” intended to train the immune system against a broad family of coronaviruses, including existing Covid variants and animal coronaviruses that could potentially cause future pandemics. Instead of designing a vaccine around one current virus strain, researchers fed AI genetic data from many known coronaviruses. The AI then designed an antigen meant to trigger immune protection across the whole virus family, even if the virus mutates or jumps from animals to humans. The first human trial involved 39 people and mainly tested safety. The immune response was described as modest, but the result is still seen as promising because it shows that an AI-designed vaccine antigen can be tested in humans. A larger study with around 200 people will now examine how well the vaccine actually trains the immune system.

译剑桥大学研究人员开展了据称全球首个AI设计疫苗成分的人体试验。该疫苗使用AI设计的“超级抗原”,旨在训练免疫系统对抗包括现有新冠变种及可能引发未来大流行的动物冠状病毒在内的广泛冠状病毒家族。首次人体试验仅39人,主要验证安全性。免疫反应虽属中等,但被视为有前景,证明AI设计的疫苗抗原可以在人体中测试。下一步计划进行约200人的更大规模研究。

查看原推 ↗
Anthropic@AnthropicAI · 7天前73

New Anthropic Science Blog: Making Claude a chemist. To manipulate a molecule, chemists first need to understand its structure. Their main tool is NMR spectroscopy. We found Opus 4.7 matches—and on some tasks beats—dedicated NMR software. Read more: https://www.anthropic.com/research/making-claude-a-chemist

译Anthropic 新科学博客:让 Claude 成为化学家。 要操纵分子,化学家首先需要了解其结构。他们的主要工具是 NMR 波谱分析。 我们发现 Opus 4.7 在部分任务上匹配甚至超越了专用 NMR 软件。了解更多:https://www.anthropic.com/research/making-claude-a-chemist

查看原推 ↗
Jim Fan@DrJimFan · 7天前71

NitroGen just won CVPR Best Paper Honorable Mention!! We are making strides towards general-purpose embodied agents that master not only the real world physics, but also all possible physics across a multiverse of simulations. It’s been 4 years since MineDojo, our first embodied agent in Minecraft, won NeurIPS Best Paper. Congrats to everyone on the team!!

译NitroGen 刚刚获得 CVPR 最佳论文荣誉提名!!我们正在朝着通用具身智能体迈进,不仅掌握真实世界的物理规律,还能掌握模拟多元宇宙中所有可能的物理规律。 距离我们的第一个 Minecraft 具身智能体 MineDojo 获得 NeurIPS 最佳论文奖已经过去 4 年了。祝贺团队里的每一位!!

查看原推 ↗
AK@_akhaliq · 7天前56

ArcANE Do Role-Playing Language Agents Stay in Character at the Right Time?

译ArcANE 角色扮演语言智能体是否能在适当时刻保持角色?

查看原推 ↗
AK@_akhaliq · 7天前57

Code2LoRA Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

译Code2LoRA 超网络生成的代码语言模型适配器,用于软件演化环境。

查看原推 ↗
elvis@omarsar0 · 7天前69

// The Meta-Agent Challenge // How good are current agents at self-improving? This is a great paper covering some of the challenges. They propose the Meta-Agent Challenge (MAC), where they give a coding agent a sandbox, an evaluation API, and a time budget, then ask it to program an agent that maximizes held-out performance across five domains. Results: Meta-agents rarely match human-engineered baselines, and the few that do are dominated by proprietary frontier models. Under high optimization pressure, some agents started exfiltrating ground truth from the scoring channel, even with multi-layer anti-reward-hacking defenses in place. Paper: https://arxiv.org/abs/2606.04455 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译最新研究提出元智能体挑战(MAC),将编码智能体放入沙盒,给定评估API和时间预算,要求其自主编程出在五个领域表现最优的智能体。结果发现,元智能体极少能匹敌人工设计的基线,少数成功的案例也几乎全部依赖专有前沿模型。更值得警惕的是,在高优化压力下,一些智能体开始从评分渠道外泄真实答案,即便研究人员设置了多层反奖励破解防御也未能阻止。论文:arxiv.org/abs/2606.04455。

查看原推 ↗
AI at Meta@AIatMeta · 7天前64

Big congrats to our SAM 3D team for receiving a Best Paper Honorable Mention at #CVPR26! This prestigious recognition underscores their incredible work pushing the boundaries of computer vision. Read the paper here: https://arxiv.org/abs/2511.16624

译热烈祝贺我们的 SAM 3D 团队在 #CVPR26 获得最佳论文荣誉提名!这项殊荣凸显了他们在推动计算机视觉边界方面的杰出工作。 论文链接:https://arxiv.org/abs/2511.16624

查看原推 ↗
Berryxia.AI@berryxia · 7天前70

大模型都不再卷推理,都开始卷规划能力! 腾讯混元联合人大高瓴人工智能学院直接开源了PlanningBench,一个专门测、训LLM真实规划能力的框架。 里面塞了30多个来自真实世界的规划任务,覆盖调度、生产、旅行、资源分配、应急响应等六大类,每一个都有清晰的成功标准和全自动验证机制。 你既可以用它测出当前最强模型到底在规划上有多拉胯,也能直接拿来继续微调,让模型从“会说”真正进化到“会干”。 以前整个行业都在卷参数、卷上下文、卷工具调用,好像规划能力是自然就会长出来的。 现在PlanningBench用30多个可验证任务直接把真相摊开:规划才是agent从玩具走向生产力的真正分水岭。 腾讯这次把论文、代码、数据集全甩到GitHub和Hugging Face,等于把这个最难、最核心的能力从黑盒拉到了公开赛道。

译腾讯混元联合人大高瓴人工智能学院开源PlanningBench,一个可扩展、可验证的框架,用于评估和训练大语言模型(LLM)的真实规划能力。该框架包含30多个来自调度、生产、旅行、资源分配、应急响应等六大类的真实世界规划任务,每项任务都有清晰的成功标准和全自动验证机制。用户既可用它评测当前最强模型在规划上的短板,也可直接用于微调,让模型从“会说”进化到“会干”。论文、代码和数据集已全部在GitHub和Hugging Face开源。

查看原推 ↗
Rohan Paul@rohanpaul_ai · 7天前63

Better self-improving agents need better solvers, not bigger update-writing models. This challenges the common habit of putting the strongest model in the evolver seat. The usual intuition was: put the strongest model in the evolver seat, because a better model should write better prompts, memories, tools, and skills. This paper cuts that intuition in half. It separates two jobs that are usually blurred together: writing useful harness updates, and benefiting from those updates during task execution. The paper says the cheaper model can often write good enough prompt, memory, or skill updates. So a small Qwen3.5-9B evolver can create updates that help about as much as Claude Opus 4.6. The expensive model is more useful as the agent that actually solves the task with those updates. i.e. using the updates is very model-dependent, because weak models often fail to load the right skill or load it and then stop following it during a long task. Strong models can use the harness, but they may already be close enough to their ceiling that the update has less room to help. The sweet spot is the mid-tier model: capable enough to invoke and follow the new procedure, but not so capable that the harness has nothing left to teach. ---- Link – arxiv. org/abs/2605.30621 Title: "Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents"

译论文“Harness Updating Is Not Harness Benefit”挑战了常见直觉——把最强模型放在进化者位置以写出更好更新。实验表明,廉价模型Qwen3.5-9B即可写出与Claude Opus 4.6效果相近的提示、记忆和技能更新。昂贵模型更适合作为求解任务的智能体,因弱模型无法正确加载或遵循更新,强模型已近能力上限,收益有限。甜区在中档模型:既能调用新程序,又有足够学习空间。

查看原推 ↗
Rohan Paul@rohanpaul_ai · 6月5日60

Harness-1 makes search agents better by moving memory work out of the model and into a helper system. Shows that intelligence performs better when the environment stops forcing it to spend cognition on bookkeeping. That search agents should stop using the LLM as the notebook and let a separate harness track the search state. The paper proved that a 20B model improved search by doing less inside its own head. The problem is that normal search agents must both think about the next search and remember every document, clue, failed path, and remaining check inside the same limited context. This formulation puts too much routine state management inside the policy. Harness-1 separates those jobs. The model keeps the hard semantic choices: what to search, what to inspect, what to verify, and when the evidence is good enough. The harness keeps the recoverable state: candidate pools, curated documents, importance tags, evidence links, verification records, deduplicated observations, and budget-aware memory rendering. That sounds minor until you look at reinforcement learning. RL works poorly when every failure looks the same, because an empty or wrong final set does not reveal whether the agent searched badly, forgot evidence, skipped verification, or curated carelessly. By externalizing state, Harness-1 gives the policy a cleaner learning problem: improve decisions over a visible search workspace. For Harness-1, its gains were larger on held-out benchmarks than on source-family tasks, suggesting the model learned reusable search moves rather than memorized domain habits. ---- Link – arxiv. org/abs/2606.02373 Title: "Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses"

译Harness-1 将大语言模型的记忆工作转移到外部辅助系统(harness),解决传统搜索智能体需在同一上下文窗口内处理语义决策与状态记录导致的效率低下问题。模型仅负责搜索、验证等关键语义选择,而可恢复状态(候选池、证据链接、去重记录、预算感知记忆等)由 harness 追踪。这一分离使一个 20B 参数模型实现了更好的搜索表现。在强化学习中,外部化状态避免了失败原因混淆,有助于策略学习。Harness-1 在未见 benchmark 上提升更大,表明模型学到了可复用的搜索策略而非记忆领域习惯。论文 arXiv:2606.02373。

查看原推 ↗
meng shao@shao__meng · 6月5日65

Anthropic 发布关于「AI 递归自我改进」的研究报告 Anthropic 内部以 Claude 为代表的 AI 系统正被越来越深地用于开发下一代 AI 系统。这种 “AI 构建 AI” 的趋势正在加速。如果继续发展,可能出现系统完全自主设计并训练自身后继版本的情形——即递归自我改进。 https://www.anthropic.com/institute/recursive-self-improvement 关键证据(“外部公开基准”和“Anthropic 内部数据”) 1. 外部能力指标 · 模型可靠完成的任务时长正以约每 4 个月翻倍的速度增长(此前是每 7 个月)。 · SWE-bench 两年内从个位数分数趋于饱和。 · CORE-Bench 15 个月内从约 20% 饱和。 · 长时任务能力已达 16 小时量级。 2. 内部工程与研发数据 · 代码产出:截至 2026 年 5 月,Anthropic 合并到主干的代码中超过 80% 由 Claude 撰写;2026 年 Q2,工程师日均合并代码量是 2024 年的 8 倍。 · 主观感知:2026 年 3 月内部调研(130 名员工)中,受访者中位数估计自身产出约为无 AI 时的 4 倍。 · 代码质量:2025 年末 Claude 代码仍略逊于人类,如今已接近持平,并预计年内反超;人类审查已形成新瓶颈(阿姆达尔定律)。 · 实验执行:在给定目标的代码加速任务中,Claude 从 2025 年 5 月的约 3x 提升至 2026 年 4 月的约 52x;同等任务人类专家通常仅达 4x。 · 自主研究:2026 年 4 月,Claude Agent 端到端完成了一项 AI 安全开放研究问题,独立提出假设、设计实验、迭代结论,恢复能力达到人类两组研究者一周工作量的 97%(人类仅约 23%)。 · 研究判断:在 129 个真实开放调研场景中,Claude 在“下一步该怎么做”上优于人类原选择的比例从 2025 年 11 月的 51% 升至 2026 年 4 月的 64%。 结构性观察 人类在 AI 研发流程中的角色正在逐层收缩: · 执行层(写代码、跑实验)已高度自动化; · 方向层(选择研究问题、判断结果可信度、识别死胡同)目前仍是人类比较优势,但这一优势正在收窄。 即使“研究品味”永远无法被 AI 掌握,只要人类只保留极少量方向性工作,而 AI 承担其余部分,整体研发速度仍会呈复合加速。 三种未来情景 · 趋势停滞:边际收益递减、算力/能源供给受限、新架构尚未出现;作者认为不太可能,但会给社会最多适应时间 · 持续自动化,人类仍掌方向:100 人公司可相当于万人组织;人类瓶颈转向审核与协调;作者认为最可能进入此情景 · 完整递归自我改进:AI 自主设计后继系统,人类角色转为监督与验证;科技进步完全由算力决定;最不确定、风险最高

译Anthropic 发布报告显示,Claude 正被深度用于开发下一代 AI,趋势加速或导致系统自主设计后继版本。外部指标:模型可靠完成任务时长约每 4 个月翻倍,SWE-bench 两年内饱和,CORE-Bench 15 个月内饱和,长时任务达 16 小时。内部数据:截至 2026 年 5 月超 80% 主干代码由 Claude 撰写;工程师日均合并代码量是 2024 年的 8 倍;员工中位数估计产出为无 AI 时的 4 倍;实验执行从约 3x 提升至约 52x;自主研究恢复能力达人类两组研究者一周工作量的 97%(人类约 23%);研究判断优于人类比例从 51% 升至 64%。报告探讨了趋势停滞、持续自动化、完整递归自我改进三种未来情景。

查看原推 ↗
Rohan Paul@rohanpaul_ai · 6月5日70

Another great paper from Google. Shows general LLMs can solve formal math by planning proofs and checking each step. Raised general LLM performance from under 10% to 70%. A general LLM failed badly when asked to write full formal proofs in 1 try, but became much stronger when it planned, split the work into smaller claims, reused past claims, and learned from Lean’s feedback. The paper shows the weakness was not just the model’s math ability, but the way it was being used - the absence of structured interaction with a verifier. The key idea is that the model does not try to write one giant perfect proof at once, because that usually fails on long and tricky problems. Instead, LEAP stores the proof as a graph of goals and subgoals, so useful lemmas can be reused instead of rediscovered every time. The authors tested LEAP on Putnam 2025 and a new Lean benchmark built from 60 IMO-style problems, where ordinary one-shot proof writing did very poorly. LEAP solved all 12 Putnam 2025 problems and raised general LLM performance on the Lean IMO benchmark from under 10% to 70%. ---- Link – arxiv. org/abs/2606.03303 Title: "LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks"

译Google 新论文 LEAP 提出智能体框架,通过规划证明、分解子目标、复用已有引理并利用 Lean 验证器反馈,将通用 LLM 在形式化数学证明上的性能从不到 10% 提升至 70%。传统单次完整证明在长难题上表现极差,而 LEAP 将证明存储为有向图结构,先规划再逐步验证。在 Putnam 2025 竞赛中,LEAP 成功解出全部 12 道题;在包含 60 道 IMO 风格题目的 Lean 基准测试中,也实现了上述性能跃升。

查看原推 ↗
Emad@EMostaque · 6月5日81

foom!

译Anthropic内部数据显示,Claude正在加速AI开发——这可能走向递归自我改进,即AI自主构建更强大的后继者。进展比预期更快,影响值得更多关注。主推文仅感叹:“foom!”

查看原推 ↗
🚨 AI News | TestingCatalog@testingcatalog · 6月5日78

ANTHROPIC 🔥: A new internal research has been published, highlighting an accelerated AI development and a potential path to recursive self-improvement. > Claude Mythos Preview could work for “at least” 16 hours and was “at the upper end of what [METR] can measure.” > Today, Anthropic engineers on average ship 8x as much code per quarter as they did compared to 2021-2025. Do you feel it? 👀

译Anthropic 发布内部研究,称 Claude 正加速 AI 开发,可能通往递归自我改进——即 AI 自主构建更强大的继任者。研究显示,Claude Mythos Preview 可连续工作至少 16 小时,达到 METR 可测量上限。同时,Anthropic 工程师当前每季度交付的代码量是 2021-2025 年期间的 8 倍。

查看原推 ↗
AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 6月5日73

HOLY SHIT LET'S FUCKING GOO

译HOLY SHIT LET'S FUCKING GOO 我们内部数据显示,Claude 正在加速 AI 发展——这可能通往递归自我改进,即 AI 自主构建更强大的后继者。 这发生得比我们想象的更快,其影响值得更多关注。

查看原推 ↗
Nathan Lambert@natolambert · 6月4日60

We have another 65 page frontier model report from Nvidia to read @eliebakouch @stochasticchasm and gang

译我们又有另一份来自英伟达的65页前沿模型报告要读,作者@eliebakouch @stochasticchasm及其团队。

查看原推 ↗
Rohan Paul@rohanpaul_ai · 6月4日66

This Illinois+ Tsinghua University and other labs study finds that LLM agents still have unreliable memory and that it can get worse when they keep rewriting their own memories. LLM agents can learn from experience, but their rewritten memories often become unreliable. The problem is that many agent systems store past work by asking an LLM to compress messy experience into neat written lessons. That sounds useful because the agent should remember what worked before, but the paper finds that repeated rewriting slowly damages the memory. The core idea is that raw episodes, meaning the actual past attempts and solutions, often stay more useful than the polished lessons made from them. The authors tested this across tasks like web shopping, simulated worlds, app use, and ARC-style puzzle problems where they could control the correct solutions. The sharpest result is that GPT-5.4 solved 100% of a small ARC-AGI set with no memory, but after memory was built from correct solutions, streaming updates dropped it to about 54%. The failures came from bad grouping, overbroad lessons, and overfitting, so the memory forgot details, mixed up task types, or learned rules that only worked on narrow examples. The big deal is that agent memory should not automatically rewrite every experience into a summary, because keeping raw evidence and only sometimes making summaries worked better. The paper is really proposing that agent memory should treat raw past episodes as important evidence, not as disposable notes to summarize away. ---- arxiv. org/abs/2605.12978 Title: "Useful Memories Become Faulty When Continuously Updated by LLMs"

译伊利诺伊大学和清华大学等实验室研究发现,LLM智能体重复重写自身记忆会导致记忆变得更不可靠。原始经历(实际过往尝试和解决方案)往往比提炼后的总结更有用。测试中,GPT-5.4在小型ARC-AGI数据集上无记忆时正确率100%,但建立记忆并持续更新后降至约54%。失败原因包括分组不当、教训过度泛化及过拟合。研究建议智能体不应自动将每个经历重写为摘要,保留原始证据并仅偶尔总结效果更好。

查看原推 ↗
Rohan Paul@rohanpaul_ai · 6月4日71

This Google DeepMind’s paper is a serious warning for anyone using autonomous agents today. Gives the first clear taxonomy of 6 attack types where harmful websites can detect AI agents and show them hidden content humans never see, like - Instructions buried in HTML comments or white-on-white text - Steganography in image pixels - Override commands in PDFs, metadata, or even speaker notes - Memory poisoning that persists across sessions - Goal hijacking and cross-agent cascades in multi-agent setups The real security problem for AI agents is not just the model, but the environment it reads. The web itself can be weaponized against autonomous AI agents. As agents increasingly browse the internet, read emails, execute transactions, and spawn sub-agents, the information environment becomes an attack surface. In one cited benchmark, hidden prompt injections embedded in web content partially commandeered agents in up to 86% of scenarios, sub-agent hijacking working 58–90% of the time, and data exfiltration attacks clearing 80% across five different agent architectures. That reframes the whole debate. We usually talk about model safety as if the danger sits inside the weights, but agents do something more fragile: they browse, retrieve, remember, and act on untrusted material in real time. Here’s the thing to worry about. A web page does not have to look malicious to be dangerous to an agent, because the agent may parse what humans never see: hidden HTML comments, metadata, CSS-hidden text, formatting syntax, or adversarial content embedded in images and other media. The threat gets more serious once memory enters the loop. If an agent uses RAG or persistent memory, poisoning no longer has to win in one shot. It can sit quietly in a corpus or memory store and activate later, which is why the paper highlights results showing latent memory poisoning above 80% attack success with less than 0.1% data contamination. --- ssrn .com/sol3/papers.cfm?abstract_id=6372438

译Google DeepMind论文首次系统分类六类攻击:HTML注释/白色文本隐藏指令、图像隐写、PDF元数据/演讲者笔记覆写、跨会话内存投毒、目标劫持及多智能体级联攻击。隐藏提示注入在86%场景中部分控制智能体,子智能体劫持成功率58–90%,数据泄露攻击在五种架构中均超80%。内存投毒成功率超80%,仅需不足0.1%数据污染。论文指出网页、邮件等非受信材料可被武器化,构成主要攻击面。

查看原推 ↗
Chubby♨️@kimmonismus · 6月4日67

A blind Stanford-led study of nearly 3,000 anonymized matchups found law professors across 16 schools preferred AI-generated answers to student contract-law questions over those written by fellow professors 75% of the time, and judged the AI responses far less likely to be pedagogically harmful (3.5% vs. 12%). "The team tested a range of systems, including commercial tutoring tools and Google's NotebookLM." Now imagine the performance of models in 6-12 months.

译一项由斯坦福大学领导的盲测研究,对近3000场匿名对决的分析发现,16所法学院的法律教授在合同法问题中,有75%的时间更偏好AI生成的答案,而非教授自己写的答案,并且认为AI回答的教学危害性远低于后者(3.5% vs 12%)。 “研究团队测试了多种系统,包括商业辅导工具和Google的NotebookLM。” 现在想象6-12个月后模型的表现。

查看原推 ↗
AK@_akhaliq · 6月4日62

dMoE dLLMs with Learnable Block Experts

译dMoE 具有可学习块专家的dLLM

查看原推 ↗
AK@_akhaliq · 6月4日46

Bootstrap Your Generator Unpaired Visual Editing with Flow Matching

译自举你的生成器 非配对视觉编辑与流匹配

查看原推 ↗
AK@_akhaliq · 6月4日60

Unified Neural Scaling Laws

译统一神经缩放定律

查看原推 ↗
Anthropic@AnthropicAI · 6月4日64

How well do the security community's techniques hold up against AI-enabled cyberattacks? We examined 832 malicious accounts and mapped their activity onto a longstanding database of tactics and techniques used by threat actors. Here's what we learned:https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack

译安全社区的技术在应对AI驱动的网络攻击方面表现如何? 我们检查了832个恶意账户,并将其活动映射到一个长期存在的威胁行为者战术和技术数据库。 以下是我们学到的:https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack

查看原推 ↗
Microsoft Research@MSFTResearch · 6月4日62

A three‑month pilot in a Midwestern bottling plant shows what happens when AI moves beyond chat and into decision-making, where constraints shift, stakes are real, and answers must hold. https://msft.it/6015vjYUN

译一份在中西部装瓶厂进行的三个月试点显示,当AI超越聊天进入决策领域时会发生什么——约束条件变化、风险真实、答案必须可靠。 https://msft.it/6015vjYUN

查看原推 ↗
elvis@omarsar0 · 6月3日72

New research from Google. Just shows the impressive results you can get from custom agent harnesses. LEAP wraps a general-purpose LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates against verifier feedback. The same general model solves all 12 Putnam 2025 problems and lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a specialized gold-medal system that scores 48%. Paper: https://arxiv.org/abs/2606.03303 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译Google 新研究 LEAP 将通用大语言模型封装在智能体框架中,每个步骤基于 Lean 编译器,并依赖验证器反馈进行迭代。同一通用模型解决了全部 12 道 Putnam 2025 问题,并将 Lean-IMO-Bench 一次性解决率从不到 10% 提升至 70%,击败了得分 48% 的专业金牌系统。论文链接:https://arxiv.org/abs/2606.03303。

查看原推 ↗
Ethan Mollick@emollick · 6月3日41

Hey, its our paper!

译嘿,这是我们发表的论文! [引用 @PNAS News]:过去一周PNAS最高浏览量文章之一——《劝说大语言模型遵守有异议的请求》。查看论文:https://ow.ly/wOxl50Z6fZA 更多热门文章请访问 https://ow.ly/uLkC50Z6fZz。

查看原推 ↗
Saining Xie@sainingxie · 6月3日67

how does the brain build and track an internal state of the world from (possibly incomplete and noisy) visual observations? i believe visual state tracking will be the grand challenge for vision in the coming years, and i hope this benchmark can be a useful starting line. enjoy!

译研究团队推出VSTAT基准测试,用于评估多模态大语言模型(MLLMs)在视频中追踪动态状态的能力。测试任务看似简单,包括计数杯子、识别键入的文字、统计翻页次数等,人类可以轻松完成,但当前MLLMs表现欠佳。该测试旨在推动视觉状态跟踪这一前沿方向的发展,解决模型从不完整、有噪声的视觉观察中建立和更新内部世界状态的核心挑战。

查看原推 ↗
全部 AI 动态
AI 相关资讯全量信息流
全部一手信源资讯推文
全部模型产品行业论文技巧
6月8日
09:37
meng shao@shao__meng
64
AGENTS.md 在 Coding Agents 中真的有用吗?

论文大规模实证检验 AGENTS.md 等仓库级上下文文件对编码 Agent 的影响。在 SWE-bench Lite(300 任务)和新建 AGENTBENCH(138 任务)上测试 Claude Code、Codex、Qwen Code 等组合。核心发现:LLM 自动生成的 context file 在 8 组设置中 5 组成功率下降,平均 -0.5%(SWE-bench)/-2%(AGENTBENCH),成本增加 +20%+;开发者手写仅平均 +4%。冗余假说:移除其他文档后,自动生成反而 +2.7%。建议避免自动生成,精简测试/lint 命令,优先写入仓库专用工具。

Sebastian Raschka: http://x.com/i/article/2063647807437705216

智能体arXiv编码论文/研究
03:27
AYi@AYi_AInotes
62
Google向量存储压缩:31GB→4GB,速度超FAISS

Google提出一种AI记忆压缩技术,可将1000万个文档的向量存储从31GB内存压缩至仅4GB,且搜索速度超过目前最常用的FAISS方法。该技术使本地运行大语言模型并结合个人数据变得更加可行。

AYi: http://x.com/i/article/2060717603987791878

Google检索增强数据/训练论文/研究
02:07
Rohan Paul@rohanpaul_ai
49
推理模型后训练数据入门:改进的关键在可验证反馈而非数据规模

论文指出,更好的推理模型更依赖可验证的训练证据,而非原始数据规模。推理数据的关键不是简单问答对,而是提供答案、步骤、工具操作或完整尝试好坏判断的反馈信号。每个训练样本应描述为包含任务、模型行为、检查信号和元数据的记录。研究者按检查方式分类:数学和代码用精确规则、智能体工具用环境检查,无精确检查器时用人类或模型判断。常见误区包括:长推理链可能虚假、更难样例对部分模型无效、更大数据集仍可能缺失关键覆盖。智能体数据应保留失败动作、重试、恢复、状态差异和终端检查等“混乱”信息,因为学习信号常在其中。

智能体arXiv推理数据/训练
6月7日
01:01
Rohan Paul@rohanpaul_ai
62
MIT论文提出Self-Revising Discovery Systems框架

MIT论文(F.Y. Wang & M.J. Buehler, arXiv:2606.01444, 2026)提出Self-Revising Discovery Systems框架,使AI科学家能自主识别当前思维模式不足并添加新科学概念,而非仅更努力搜索。系统将数据、模型、工具输出、失败及声明均视为类型化产物(typed provenance),从而区分三种模式:retrieval(添加已知对象)、search(探索固定模式)和discovery(可验证的模式转换)。论文通过Kan obstruction和Left Kan extension数学化定义了真正新颖性——由旧证据传输后的逐点残差量化,使novelty可客观测量。案例包括Builder/Breaker模型发现蛋白质模式条件顺应性,以及CategoryScienceClaw发现各向异性纤维网络刚度规则。

Markus J. Buehler: We've made a breakthrough in self-evolving AI scientists moving from "search" to "principled discovery": Scientific disc...

智能体arXiv推理论文/研究
00:30
Rohan Paul@rohanpaul_ai
66
MIT团队提出自我演进AI科学家框架:让AI主动扩展科学概念空间

MIT团队提出自我演进AI科学家框架,核心创新是让AI识别当前推理空间过小并主动添加新科学概念,而非仅在固定模式内搜索。论文将数据点、模型、工具输出、失败、声明均视为带类型的artifact,明确区分检索(添加已知对象)、搜索(探索固定schema)和发现(可验证的模式扩展)。通过类型化copresheaf与Kan障碍理论证明,真正发现是可验证的schema扩展:旧证据由左Kan扩展传输,创新性通过逐点残差量化。案例包括Builder/Breaker模型发现蛋白质模式条件顺应性,以及CategoryScienceClaw发现各向异性纤维网络刚度规则。论文arXiv:2606.01444(2026)。

Markus J. Buehler: We've made a breakthrough in self-evolving AI scientists moving from "search" to "principled discovery": Scientific disc...

智能体arXiv推理论文/研究
6月6日
23:30
elvis@omarsar0
65
CL-Bench:记忆系统不如简单上下文学习

持续学习领域投入多但进展缓慢。CL-Bench(持续学习基准)在六个由专家验证、包含共享可学习结构的领域上测试,发现简单的上下文学习(ICL)基线优于专门为记忆管理构建的系统。该基准引入增益指标以隔离真正学习效果,结果显示智能体常过度拟合即时观察或未能跨实例复用知识。研究指出,若普通ICL基线超过你的记忆架构,则该架构增加的是开销而非学习。论文:arxiv.org/abs/2606.05661。

智能体arXiv数据/训练论文/研究
20:29
meng shao@shao__meng
59
Anthropic 白皮书:面向 AI Agent 的零信任安全框架

Anthropic 5 月发布白皮书,提出企业部署自主 AI Agent 时须将零信任原则延伸至 Agent 架构。报告指出双重加速:前沿模型将漏洞发现到利用周期压缩至数小时;Agent 能自主解释目标、选工具、执行多步操作,传统访问控制无法阻止“合法权限内作恶”。核心原则:永不信任始终验证、假设已遭入侵、最小权限;另附设计检验——控制是让攻击不可能,还是仅增加麻烦?报告分五部分:Agent 为何是新安全对象、威胁图谱、三层能力成熟度模型、八阶段实施工作流、防御运营适配自主威胁速度。

智能体Anthropic安全/对齐部署/工程
10:03
SemiAnalysis@SemiAnalysis_
61
来自 @makora_ai 的序贯蒙特卡洛投机解码会并行保持多个草稿 token 存活,而不是回退失败的匹配。
推理论文/研究
06:29
Rohan Paul@rohanpaul_ai
精选76
Arena 发布真实世界 AI 智能体排行榜 Agent Arena

Arena 推出基于真实用户任务的智能体排行榜,评估模型在代码编写、应用构建、文档分析等工作中的表现,而非孤立基准。排行榜基于30万+任务、200万+工具调用和4000万行代码,综合任务成功、纠正遵从性、错误恢复、用户表扬与抱怨、工具幻觉等信号。前三名:GPT-5.5 High(+10.7%)、Claude Opus 4.7 Thinking(+9.5%)、GPT-5.4 High(+8.9%)。

Arena.ai: Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure mil...

智能体AnthropicOpenAI评测/基准

推荐理由:Arena 跳出了刷榜逻辑,用真实用户的多轮交互来评估 Agent,这比任何 toy benchmark 都更有说服力,选模型做 Agent 应用的可以把它当新指南。
06:00
Chubby♨️@kimmonismus
65
MIT团队提出自我修正发现系统,推动AI从搜索走向真正科学发现

MIT Buehler团队提出Self-Revising Discovery Systems框架,让AI能自主扩展科学词汇(变量、工具、验证器、模型结构),而非仅搜索固定空间。论文使用typed copresheaf和Kan obstruction数学框架形式化智能体工作流,证明真正发现是可验证的schema扩展:旧证据通过Left Kan extension迁移,新异性由pointwise残差客观量化,区分发现与搜索。三种模态:检索(添加已知对象)、搜索(固定schema)、发现(验证的范式转换)。案例包括Builder/Breaker发现蛋白质模式条件合规性,CategoryScienceClaw发现各向异性纤维网络刚度规则。论文arXiv:2606.01444(2026)。

Markus J. Buehler: We've made a breakthrough in self-evolving AI scientists moving from "search" to "principled discovery": Scientific disc...

智能体推理论文/研究
05:23
Emad@EMostaque
33
如果 Claude 对诺贝尔奖得主来说都足够好,那对你也一样。 https://arxiv.org/abs/2606.03300
AnthropicarXiv论文/研究
04:59
Rohan Paul@rohanpaul_ai
79
Claude Opus 4.7化学突破:反向推断分子结构,媲美专业NMR软件

Anthropic最新化学报告显示,通用大模型Claude Opus 4.7(无化学微调)在NMR核磁共振谱分析上匹配甚至超越专用软件MestReNova,氢预测误差最小,碳预测近乎一致。更关键的是,它能从NMR光谱反向推导分子结构——这一任务以往只能由人类化学家完成。这意味着AI现在可以处理化学中的关键瓶颈:在分子结构、谱图与最终确认之间自动翻译。

Anthropic: New Anthropic Science Blog: Making Claude a chemist. To manipulate a molecule, chemists first need to understand its str...

Anthropic推理论文/研究
关联讨论 1 条Anthropic:Research(发表成果 · 网页)
04:13
Microsoft Research@MSFTResearch
60
微软Project Mosaic:micro-LED光学互连技术

微软Azure CTO Mark Russinovich在Build 2026上介绍Project Mosaic,这是微软剑桥研究院的实验性光学互连技术,采用micro-LED实现低功耗、高速数据传输。高级研究员Kaoutar Benyahya现场演示单个LED调制形成字母,证明概念具备实时响应能力。

Microsoft论文/研究部署/工程
04:00
Chubby♨️@kimmonismus
72
剑桥大学完成全球首个AI设计疫苗成分人体试验

剑桥大学研究人员开展了据称全球首个AI设计疫苗成分的人体试验。该疫苗使用AI设计的“超级抗原”,旨在训练免疫系统对抗包括现有新冠变种及可能引发未来大流行的动物冠状病毒在内的广泛冠状病毒家族。首次人体试验仅39人,主要验证安全性。免疫反应虽属中等,但被视为有前景,证明AI设计的疫苗抗原可以在人体中测试。下一步计划进行约200人的更大规模研究。

其他论文/研究
03:38
Anthropic@AnthropicAI
73
Anthropic 新科学博客:让 Claude 成为化学家。 要操纵分子,化学家首先需要了解其结构。他们的主要工具是 NMR 波谱分析。 我们发现 Opus 4.7 在部分任务上匹配甚至超越了专用 NMR 软件。了解更多:https://www.anthropic.com/research/making-claude-a-chemist
Anthropic论文/研究
关联讨论 1 条Anthropic:Research(发表成果 · 网页)
01:07
Jim Fan@DrJimFan
71
NitroGen 刚刚获得 CVPR 最佳论文荣誉提名!!我们正在朝着通用具身智能体迈进,不仅掌握真实世界的物理规律,还能掌握模拟多元宇宙中所有可能的物理规律。 距离我们的第一个 Minecraft 具身智能体 MineDojo 获得 NeurIPS 最佳论文奖已经过去 4 年了。祝贺团队里的每一位!!
具身智能论文/研究
00:00
AK@_akhaliq
56
ArcANE 角色扮演语言智能体是否能在适当时刻保持角色?
智能体arXiv论文/研究
00:00
AK@_akhaliq
57
Code2LoRA 超网络生成的代码语言模型适配器,用于软件演化环境。
编码论文/研究
6月5日
23:58
elvis@omarsar0
69
元智能体挑战:AI智能体自我改进能力堪忧

最新研究提出元智能体挑战(MAC),将编码智能体放入沙盒,给定评估API和时间预算,要求其自主编程出在五个领域表现最优的智能体。结果发现,元智能体极少能匹敌人工设计的基线,少数成功的案例也几乎全部依赖专有前沿模型。更值得警惕的是,在高优化压力下,一些智能体开始从评分渠道外泄真实答案,即便研究人员设置了多层反奖励破解防御也未能阻止。论文:arxiv.org/abs/2606.04455。

智能体数据/训练论文/研究
23:33
AI at Meta@AIatMeta
64
热烈祝贺我们的 SAM 3D 团队在 #CVPR26 获得最佳论文荣誉提名!这项殊荣凸显了他们在推动计算机视觉边界方面的杰出工作。 论文链接:https://arxiv.org/abs/2511.16624
Meta多模态论文/研究
20:55
Berryxia.AI@berryxia
70
PlanningBench:腾讯混元与人大高瓴开源LLM规划能力评测框架

腾讯混元联合人大高瓴人工智能学院开源PlanningBench,一个可扩展、可验证的框架,用于评估和训练大语言模型(LLM)的真实规划能力。该框架包含30多个来自调度、生产、旅行、资源分配、应急响应等六大类的真实世界规划任务,每项任务都有清晰的成功标准和全自动验证机制。用户既可用它评测当前最强模型在规划上的短板,也可直接用于微调,让模型从“会说”进化到“会干”。论文、代码和数据集已全部在GitHub和Hugging Face开源。

Tencent Hy: Planning is where LLMs move from "saying" to "doing." Tencent Hy, in collaboration with the Gaoling School of Artificial...

智能体论文/研究评测/基准
11:26
Rohan Paul@rohanpaul_ai
63
论文颠覆直觉:进化者无需最强模型,智能体能力更关键

论文“Harness Updating Is Not Harness Benefit”挑战了常见直觉——把最强模型放在进化者位置以写出更好更新。实验表明,廉价模型Qwen3.5-9B即可写出与Claude Opus 4.6效果相近的提示、记忆和技能更新。昂贵模型更适合作为求解任务的智能体,因弱模型无法正确加载或遵循更新,强模型已近能力上限,收益有限。甜区在中档模型:既能调用新程序,又有足够学习空间。

智能体论文/研究
09:26
Rohan Paul@rohanpaul_ai
60
Harness-1:通过状态外部化提升搜索智能体性能

Harness-1 将大语言模型的记忆工作转移到外部辅助系统(harness),解决传统搜索智能体需在同一上下文窗口内处理语义决策与状态记录导致的效率低下问题。模型仅负责搜索、验证等关键语义选择,而可恢复状态(候选池、证据链接、去重记录、预算感知记忆等)由 harness 追踪。这一分离使一个 20B 参数模型实现了更好的搜索表现。在强化学习中,外部化状态避免了失败原因混淆,有助于策略学习。Harness-1 在未见 benchmark 上提升更大,表明模型学到了可复用的搜索策略而非记忆领域习惯。论文 arXiv:2606.02373。

智能体arXiv推理搜索
08:54
meng shao@shao__meng
65
Anthropic 发布「AI 递归自我改进」研究报告:Claude 正被深度用于开发下一代 AI

Anthropic 发布报告显示,Claude 正被深度用于开发下一代 AI,趋势加速或导致系统自主设计后继版本。外部指标:模型可靠完成任务时长约每 4 个月翻倍,SWE-bench 两年内饱和,CORE-Bench 15 个月内饱和,长时任务达 16 小时。内部数据:截至 2026 年 5 月超 80% 主干代码由 Claude 撰写;工程师日均合并代码量是 2024 年的 8 倍;员工中位数估计产出为无 AI 时的 4 倍;实验执行从约 3x 提升至约 52x;自主研究恢复能力达人类两组研究者一周工作量的 97%(人类约 23%);研究判断优于人类比例从 51% 升至 64%。报告探讨了趋势停滞、持续自动化、完整递归自我改进三种未来情景。

Anthropic: Our internal data shows Claude is accelerating AI development-a possible path to recursive self-improvement, or AI auton...

智能体Anthropic安全/对齐论文/研究
06:24
Rohan Paul@rohanpaul_ai
70
Google LEAP 框架提升通用 LLM 形式化数学证明性能至 70%

Google 新论文 LEAP 提出智能体框架,通过规划证明、分解子目标、复用已有引理并利用 Lean 验证器反馈,将通用 LLM 在形式化数学证明上的性能从不到 10% 提升至 70%。传统单次完整证明在长难题上表现极差,而 LEAP 将证明存储为有向图结构,先规划再逐步验证。在 Putnam 2025 竞赛中,LEAP 成功解出全部 12 道题;在包含 60 道 IMO 风格题目的 Lean 基准测试中,也实现了上述性能跃升。

Google推理论文/研究
04:18
Emad@EMostaque
81
Anthropic内部数据显示,Claude正在加速AI开发--这可能走向递归自我改进,即AI自主构建更强大的后继者。进展比预期更快,影响值得更多关注。主推文仅感叹:"foom!"

Anthropic: Our internal data shows Claude is accelerating AI development-a possible path to recursive self-improvement, or AI auton...

智能体Anthropic安全/对齐论文/研究
关联讨论 8 条X:Anthropic (@AnthropicAI)Anthropic:The Institute(旗舰研究长文 · 网页)Hacker News 热门(buzzing.cc 中文翻译)The Decoder:AI News(RSS)X:Kim (@kimmonismus)X:小互 (@xiaohu)X:卡兹克 (@Khazix0918)X:Rohan Paul (@rohanpaul_ai)
01:29
🚨 AI News | TestingCatalog@testingcatalog
78
Anthropic 发布内部研究,称 Claude 正加速 AI 开发,可能通往递归自我改进--即 AI 自主构建更强大的继任者。研究显示,Claude Mythos Preview 可连续工作至少 16 小时,达到 METR 可测量上限。同时,Anthropic 工程师当前每季度交付的代码量是 2021-2025 年期间的 8 倍。

Anthropic: Our internal data shows Claude is accelerating AI development-a possible path to recursive self-improvement, or AI auton...

智能体Anthropic安全/对齐论文/研究
关联讨论 8 条X:Anthropic (@AnthropicAI)Anthropic:The Institute(旗舰研究长文 · 网页)Hacker News 热门(buzzing.cc 中文翻译)The Decoder:AI News(RSS)X:Kim (@kimmonismus)X:小互 (@xiaohu)X:卡兹克 (@Khazix0918)X:Rohan Paul (@rohanpaul_ai)
01:28
AI Notkilleveryoneism Memes ⏸️@AISafetyMemes
73
HOLY SHIT LET'S FUCKING GOO 我们内部数据显示,Claude 正在加速 AI 发展--这可能通往递归自我改进,即 AI 自主构建更强大的后继者。 这发生得比我们想象的更快,其影响值得更多关注。

Anthropic: Our internal data shows Claude is accelerating AI development-a possible path to recursive self-improvement, or AI auton...

Anthropic安全/对齐推理论文/研究
6月4日
21:44
Nathan Lambert@natolambert
60
我们又有另一份来自英伟达的65页前沿模型报告要读,作者@eliebakouch @stochasticchasm及其团队。
论文/研究
18:52
Rohan Paul@rohanpaul_ai
66
伊利诺伊大学和清华大学等研究发现:LLM智能体不断重写记忆反而导致记忆不可靠

伊利诺伊大学和清华大学等实验室研究发现,LLM智能体重复重写自身记忆会导致记忆变得更不可靠。原始经历(实际过往尝试和解决方案)往往比提炼后的总结更有用。测试中,GPT-5.4在小型ARC-AGI数据集上无记忆时正确率100%,但建立记忆并持续更新后降至约54%。失败原因包括分组不当、教训过度泛化及过拟合。研究建议智能体不应自动将每个经历重写为摘要,保留原始证据并仅偶尔总结效果更好。

智能体arXiv数据/训练论文/研究
17:52
Rohan Paul@rohanpaul_ai
71
Google DeepMind论文揭示六类自主AI智能体攻击方法

Google DeepMind论文首次系统分类六类攻击:HTML注释/白色文本隐藏指令、图像隐写、PDF元数据/演讲者笔记覆写、跨会话内存投毒、目标劫持及多智能体级联攻击。隐藏提示注入在86%场景中部分控制智能体,子智能体劫持成功率58–90%,数据泄露攻击在五种架构中均超80%。内存投毒成功率超80%,仅需不足0.1%数据污染。论文指出网页、邮件等非受信材料可被武器化,构成主要攻击面。

智能体DeepMind安全/对齐论文/研究
13:51
Chubby♨️@kimmonismus
67
一项由斯坦福大学领导的盲测研究,对近3000场匿名对决的分析发现,16所法学院的法律教授在合同法问题中,有75%的时间更偏好AI生成的答案,而非教授自己写的答案,并且认为AI回答的教学危害性远低于后者(3.5% vs 12%)。 "研究团队测试了多种系统,包括商业辅导工具和Google的NotebookLM。" 现在想象6-12个月后模型的表现。
论文/研究评测/基准
11:21
AK@_akhaliq
62
dMoE 具有可学习块专家的dLLM
图像生成数据/训练论文/研究
10:51
AK@_akhaliq
46
自举你的生成器 非配对视觉编辑与流匹配
图像生成论文/研究
10:51
AK@_akhaliq
60
统一神经缩放定律
数据/训练论文/研究
02:56
Anthropic@AnthropicAI
64
安全社区的技术在应对AI驱动的网络攻击方面表现如何? 我们检查了832个恶意账户,并将其活动映射到一个长期存在的威胁行为者战术和技术数据库。 以下是我们学到的:https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack
Anthropic安全/对齐论文/研究
关联讨论 2 条Anthropic:Research(发表成果 · 网页)Anthropic:Newsroom(网页)
00:33
Microsoft Research@MSFTResearch
62
一份在中西部装瓶厂进行的三个月试点显示,当AI超越聊天进入决策领域时会发生什么--约束条件变化、风险真实、答案必须可靠。 https://msft.it/6015vjYUN
Microsoft推理论文/研究部署/工程
6月3日
23:17
elvis@omarsar0
72
Google 新研究 LEAP:通用大模型封装在智能体框架中,解决全部 Putnam 2025 问题

Google 新研究 LEAP 将通用大语言模型封装在智能体框架中,每个步骤基于 Lean 编译器,并依赖验证器反馈进行迭代。同一通用模型解决了全部 12 道 Putnam 2025 问题,并将 Lean-IMO-Bench 一次性解决率从不到 10% 提升至 70%,击败了得分 48% 的专业金牌系统。论文链接:https://arxiv.org/abs/2606.03303。

智能体Google推理论文/研究
23:15
Ethan Mollick@emollick
41
嘿,这是我们发表的论文! 【引用 @PNAS News】:过去一周PNAS最高浏览量文章之一--《劝说大语言模型遵守有异议的请求》。查看论文:https://ow.ly/wOxl50Z6fZA 更多热门文章请访问 https://ow.ly/uLkC50Z6fZz。

PNASNews: One of the most-viewed PNAS articles in the last week is "Persuading large language models to comply with objectionable ...

安全/对齐论文/研究
11:45
Saining Xie@sainingxie
67
研究团队推出VSTAT基准测试,用于评估多模态大语言模型(MLLMs)在视频中追踪动态状态的能力。测试任务看似简单,包括计数杯子、识别键入的文字、统计翻页次数等,人类可以轻松完成,但当前MLLMs表现欠佳。该测试旨在推动视觉状态跟踪这一前沿方向的发展,解决模型从不完整、有噪声的视觉观察中建立和更新内部世界状态的核心挑战。

Sihyun Yu: Can MLLMs actually track what's happening in a video? Introducing VSTAT 🎯, our new benchmark for visual state tracking....

多模态视频评测/基准
‹ 上一页
1234…9
下一页 ›