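Today I read a paper from a Stanford University research team that runs a bit against intuition. They fed unfiltered Common Crawl data to large models and found that, when compute is large enough, unfiltered data actually works better than cleaned data. On a 15M small model, filtered data leads across the board and unfiltered data does very poorly. But once model scale reaches 330M and 1B, the picture flips completely: after sufficient training, the unfiltered data beats every filtered version. Small models are afraid of junk; large models are not. A bigger model has more rank (parameter count), so it has enough room to keep junk and useful information apart. Paper walkthrough and original PDF are in the comments.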
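Summary: A Stanford team found that training on unfiltered Common Crawl data can outperform cleaned data when compute is sufficient, and that the result depends on model scale: at 15M parameters, filtered data leads across the board, but at 330M and 1B, unfiltered data overtakes the filtered versions after sufficient training. The proposed explanation is that larger models have enough parameter capacity to separate noise from useful information during training.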
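Folks, the Google DeepMind team is at it again! Google DeepMind's latest release turns the page on the old question of “what can AI actually do for scientists.” They built Gemini into a multi-agent system called Co-Scientist. It is not a simple Q&A tool; it replicates the full loop a scientist goes through from idea to validation: generating thousands of hypotheses, running an “idea tournament,” having multiple agents debate, critique, and refine each other's work, and finally grounding every claim with literature, data, and search tools. The biggest bottleneck in research used to be that one person's brainpower is limited: generating good hypotheses, debating them repeatedly, and pulling in knowledge from other fields all fell on the individual. Co-Scientist turns that process into a pipeline that scales. Over the past year they tested it with top scientists around the world, and on very hard problems such as new targets for liver fibrosis, new therapies for amyotrophic lateral sclerosis (ALS), and genetic clues to reversing aging, it produced new directions with real potential. The most counterintuitive part: it is not here to replace scientists, but to serve as a “dedicated research partner.” Scientists can finally free their attention from repeatedly brainstorming hypotheses and searching the literature, and focus on the most creative judgment calls and experimental design. AI turns the “high-intensity idea iteration” that only top teams could afford into infrastructure anyone can use. They have now opened the Hypothesis Generation feature to individual researchers, available directly through Gemini for Science. An ordinary researcher can have an AI collaborator that never sleeps, can debate, can verify, and keeps improving. This punctures the most common misconception today: many people assume AI will put scientists out of work, but the real path is that AI raises the speed and breadth of scientific discovery by an order of magnitude and lets more people take part in breakthrough research.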
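Summary: Google DeepMind released Co-Scientist, a Gemini-based multi-agent system aimed at automating the research workflow. The system generates, debates, and verifies hypotheses, freeing scientists from high-intensity cognitive work. Over the past year it has worked with scientists on hard problems such as new targets for liver fibrosis and new ALS therapies, and has surfaced new directions. It is positioned as a “dedicated research partner,” not a replacement for scientists. Its hypothesis generation feature is now open to individual researchers through Gemini for Science.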
Stanford researchers found that law professors preferred AI answers over peer professor answers 75% of the time when judging contract-law help for students. The study tested whether LLMs can handle a field where the answer is often not a fact, but a defensible argument built from rules, exceptions, and judgment. The professors wrote 40 real student-style questions, gave their own answers, and then blindly judged nearly 3,000 comparisons between human and AI responses. The striking result was not just that AI won often, but that professors marked AI answers as harmful only 3.5% of the time, compared with 12% for human answers. i.e. the model was not merely sounding fluent, but often matching the teaching standard law professors use when explaining ambiguity to students.
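Summary: Stanford researchers found that, when judging answers to contract-law questions, law professors preferred the AI answer over a peer professor's answer 75% of the time. Professors wrote answers to 40 realistic student questions and then blindly judged nearly 3,000 comparisons between human and AI responses. Beyond the high AI win rate, professors marked only 3.5% of AI answers as harmful, versus 12% of human answers. This suggests the models are not merely fluent: they often meet the teaching standard professors use when explaining legal ambiguity to students.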
AI can explain science better than it can forecast science. Across 4,760 scientific events, the models were much better at recognizing possible research paths than forecasting actual outcomes. Models often recognize a plausible research idea when the answer is already nearby, especially in multiple-choice form. But they are much weaker at the harder thing: predicting whether a discovery will actually happen, when it will happen, and what method will make it work. That means the models are still much better at hindsight than foresight. When asked whether a scientific claim will actually be realized, the models hover near chance, and when asked when progress will arrive, they systematically push it too far into the future. Even when the authors gave models extra older information, the models improved a bit but still did not become reliable at predicting future scientific progress. So having lots of scientific knowledge inside a model does not automatically make it a good scientific forecaster. ---- Paper Link – arxiv.org/abs/2605.22681 Paper Title: "Forecasting Scientific Progress with AI"
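Summary: A study of 4,760 scientific events finds that AI models are better at explaining science than at forecasting it. Models do reasonably well at recognizing plausible research paths, especially in multiple-choice form, but are weak at the harder tasks of predicting whether a discovery will actually happen, when it will happen, and which method will work, with accuracy close to chance. Extra historical information improves the models only slightly. Having a lot of scientific knowledge inside a model is therefore not the same as reliable scientific foresight. The paper is on arXiv (2605.22681) under the title "Forecasting Scientific Progress with AI".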
Weather forecasts thousands of times faster than traditional supercomputers. Hear from Kenji Takeda on Aurora at the Microsoft Research Lab at #MSBuild. Learn more: https://msft.it/6018vjGUA
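Summary: Weather forecasts thousands of times faster than traditional supercomputers. Hear Kenji Takeda talk about Aurora at the Microsoft Research Lab at #MSBuild. Learn more: https://msft.it/6018vjGUA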
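GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization
Summary: GPU Forecasters: large language models as selective surrogates for kernel runtime optimization.
Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
Summary: Do vision-language models know when not to answer spatial questions, and why?
Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
Summary: Crafter, a multi-agent framework for generating editable scientific figures from diverse inputs.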
// Scaling Behavior of Single LLM-Driven Multi-Agent Systems // Does adding more agents actually make a multi-agent system better? It's possible that collective intelligence emerges from interaction design rather than from agent plurality. This is something important to understand if you are building multi-agent systems. This new study reports that the optimal number of agents depends on the base model's capability and the task type, not on adding more of them. Paper: https://arxiv.org/abs/2606.00655 Learn to build effective AI agents in our academy: https://academy.dair.ai/
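Summary: The study asks whether adding more agents improves a multi-agent system. It concludes that the optimal number of agents depends on the base model's capability and the task type, not on simply adding more. Collective intelligence is more likely to come from careful interaction design than from agent count. Paper: "Scaling Behavior of Single LLM-Driven Multi-Agent Systems".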
This paper proposes a way to predict the cheapest safe AWS spot fleet before launching it. AWS spot machines can be much cheaper, but users usually cannot see the final fleet price across regions before starting, so this paper turns that blind choice into a comparison that can save up to 64%. Spot instances are cheap because they are conditional: the cloud provider can take them back, prices move, and capacity shifts by region. The quiet problem is that AWS helps users launch spot fleets, but does not let them fully see the fleet’s price or best region before launch. The authors build a service that watches how AWS creates these fleets, learns those patterns with time-aware AI models, and then estimates the fleet mix and cost across 9 regions. A user gives the service a target amount of computing power and a placement strategy, and the service returns region-ranked options before anything is launched. They tested it on AWS with fleets up to 1500 virtual CPUs, using 720 test launches after a 90-day monitoring period. The predicted fleet matched AWS exactly in 92.78% of cases, reached 99.79% overall accuracy against AWS behavior, and AWS accepted every recommended fleet. The result is that choosing the best region mattered far more than changing the strategy inside 1 region, with possible savings up to 64%. ---- Paper Link – arxiv.org/abs/2605.22778 Paper Title: "AI-Driven Multi-Region Provisioning for Cloud Services Using Spot Fleets"
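Summary: The study proposes an AI-driven service that predicts the cheapest safe AWS spot fleet before launch. The service uses time-aware models to learn how AWS builds fleets, estimates fleet mix and cost across 9 regions, and returns region-ranked options to the user. In tests on fleets of up to 1500 vCPUs, predictions matched AWS exactly in 92.78% of cases, overall accuracy was 99.79%, and AWS accepted every recommended fleet. The key finding is that choosing the best region matters more than changing strategy within a single region, with potential savings of up to 64%.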
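The last step described in the post, returning region-ranked options for a target amount of compute, can be sketched as follows. This is a minimal illustration, not the paper's service: the `FleetEstimate` type, the function names, and every price below are hypothetical, and the cost predictor itself is assumed to exist already.

```python
from dataclasses import dataclass


@dataclass
class FleetEstimate:
    region: str          # region name (illustrative)
    hourly_cost: float   # predicted fleet cost in USD per hour (made-up value)
    vcpus: int           # predicted total vCPUs the fleet would provide


def rank_regions(estimates, target_vcpus):
    """Keep fleets that meet the vCPU target and sort them cheapest first."""
    feasible = [e for e in estimates if e.vcpus >= target_vcpus]
    return sorted(feasible, key=lambda e: e.hourly_cost)


def saving_vs_most_expensive(ranked):
    """Fractional saving of the cheapest option over the most expensive one."""
    if not ranked:
        return 0.0
    return 1.0 - ranked[0].hourly_cost / ranked[-1].hourly_cost


# Hypothetical predictions for one request; none of these numbers are from the paper.
estimates = [
    FleetEstimate("us-east-1", 12.0, 1500),
    FleetEstimate("eu-west-1", 10.0, 1500),
    FleetEstimate("ap-south-1", 25.0, 1500),
    FleetEstimate("sa-east-1", 8.0, 1200),  # cheapest, but below the target
]
ranked = rank_regions(estimates, target_vcpus=1500)
for e in ranked:
    print(e.region, e.hourly_cost)
print(round(saving_vs_most_expensive(ranked), 2))
```

Filtering on capacity before sorting on price keeps an undersized fleet from winning on cost alone, which is one plausible reading of "cheapest safe" in the post.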
Most video models look better than they understand, and video quality is only the easiest thing to notice. LongCat just released WBench, which turns video world model testing from a beauty contest into a stress test for control, multi-turn memory, instruction-following, and physical plausibility. It exposes the gap between beautiful video generation and controllable world simulation. A pretty clip is not enough, because a usable world model must keep the same scene, obey later actions, move the camera correctly, preserve objects, and avoid impossible cause-and-effect. WBench tests this with 289 cases, 1,058 interaction turns, 20 models, 5 dimensions, and 22 automatic metrics, covering navigation, subject actions, event edits, perspective switches, and both viewpoints. Across all 20 evaluated models, the paper finds that no model dominates all dimensions, which means current systems have not yet merged high-quality rendering, reliable control, long-horizon memory, and physical rule-following into one stable capability. Its design separates the world setup from the user action, so researchers can identify whether a failure comes from weak rendering, poor scene setup, bad control, lost state, or broken physics. Navigation has near-zero connection with visual quality, consistency, or physics, meaning a model can look strong while still failing to move on command. The key shift: stop asking only “does the video look good?” and start asking “can the model keep a controllable world alive across many turns?”
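Summary: Meituan's LongCat released WBench, an evaluation benchmark for video world models. It shifts the focus of testing from visual appeal to core abilities such as control, multi-turn memory, instruction-following, and physical plausibility. It contains 289 cases and 1,058 interaction turns, and evaluates 20 models across 5 dimensions with 22 automatic metrics, covering navigation, subject actions, event edits, and more. No model dominates every dimension, which indicates that current systems have not yet combined high-quality rendering, reliable control, long-horizon memory, and physical rule-following into one stable capability. WBench's design can tell whether a failure comes from rendering, scene setup, control, or physics, and it finds that navigation ability is essentially unrelated to visual quality.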
Big paper on AI coding agents using Github & other data The auto-complete tools (Copilot) led to 2.2x more code, local agents like original Claude Code led to 7.4x, & current remote coding agents 17.3x(!) But human bottlenecks in coding means actual releases "only" went up 30%
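Summary: A major paper on AI coding agents, using GitHub and other data. Auto-complete tools (such as Copilot) led to 2.2x more code, local agents (such as the original Claude Code) to 7.4x, and current remote coding agents to 17.3x(!). But human bottlenecks in coding mean actual releases went up “only” 30%.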
A 178 page survey study for refreshing math and generative AI foundations from University of Huddersfield. The Little Book of Generative AI Foundations.
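Summary: The University of Huddersfield has published a 178-page survey for refreshing the foundations of math and generative AI: The Little Book of Generative AI Foundations.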
Better AI agent systems scale by remembering useful feedback, not by spending more compute. The simple mistake is to count tokens, calls, or dollars as if they were all evidence. The authors say those numbers miss the real issue, because 2 runs can spend the same budget while only 1 gets feedback that is correct, new, relevant, and remembered. An agent harness is not just a wrapper around a model; it is a feedback machine that decides what to test, what to trust, what to store, and what to ignore. Their answer is Effective Feedback Compute, or EFC, a score that counts feedback only when it teaches the agent something useful and changes later decisions. They also divide EFC by task demand, because a small lookup task and a messy software-repair task need different amounts of helpful feedback before the agent has enough to solve them. They tested this on synthetic tasks, code tasks with executable tests, real benchmark traces, held-out settings, and a new prospective batch, then compared EFC with raw compute and a strong agent-scaling baseline. The main result is that task-normalized EFC predicted failures much better than raw compute, and in 1 matched-budget test, better feedback raised success from 0.27 to 0.90 while cost and tool calls stayed fixed. ---- Link – arxiv.org/abs/2605.29682 Title: "Scaling Laws for Agent Harnesses via Effective Feedback Compute"
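Summary: Current approaches to scaling AI agents often mistake compute spent for evidence of learning. The new study points out that two runs can spend the same budget while the usefulness of their feedback differs enormously. It proposes Effective Feedback Compute (EFC), a metric that counts only feedback that is correct, new, relevant, remembered, and able to change later decisions. The study also normalizes EFC by task demand. Experiments show that task-normalized EFC predicts failures better than raw compute. In one matched-budget test, better feedback raised task success from 0.27 to 0.90 while cost and tool calls stayed fixed. Link: arxiv.org/abs/2605.29682 Title: "Scaling Laws for Agent Harnesses via Effective Feedback Compute"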
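To make the counting rule concrete, here is a small sketch of an EFC-style score. The post does not give the paper's formula, so everything here is an assumption for illustration: the `Feedback` fields, the rule that a repeated signal is not "new", and plain division by a `task_demand` number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    signal: str       # what the feedback says (used to detect repeats)
    correct: bool
    relevant: bool
    remembered: bool  # whether the agent stored it for later decisions


def effective_feedback(events):
    """Count events that are correct, relevant, remembered, and not yet counted."""
    seen = set()
    count = 0
    for ev in events:
        useful = ev.correct and ev.relevant and ev.remembered
        if useful and ev.signal not in seen:
            seen.add(ev.signal)
            count += 1
    return count


def normalized_efc(events, task_demand):
    """Effective feedback divided by how much useful feedback the task needs."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return effective_feedback(events) / task_demand


# Two hypothetical runs with the same budget: four feedback events each.
run_a = [Feedback(f"test_{i}_fails", True, True, True) for i in range(4)]
run_b = [
    Feedback("test_0_fails", True, True, True),
    Feedback("test_0_fails", True, True, True),    # repeat, not new
    Feedback("test_1_fails", False, True, True),   # incorrect
    Feedback("test_2_fails", True, True, False),   # never stored
]
print(normalized_efc(run_a, task_demand=4), normalized_efc(run_b, task_demand=4))
```

Both runs would look identical if only events were counted; the score separates them because three of the four events in `run_b` teach the agent nothing it can use.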
GrepSeek: Training Search Agents for Direct Corpus Interaction
Summary: GrepSeek, training search agents to interact directly with a corpus.
In collaboration with @nvidia, we’re open-sourcing a dataset of security scans for 67,453 ClawHub skills on @huggingface: - NVIDIA SkillSpector flagged 1/2 for agentic risk - Only 0.31% were malicious - No two scanners agreed on more than 8.5% of risks https://openclaw.ai/blog/openclaw-nvidia-skill-security
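Summary: In collaboration with @nvidia, we are open-sourcing a dataset of security scans for 67,453 ClawHub skills on @huggingface: - NVIDIA SkillSpector flagged 1/2 for agentic risk - Only 0.31% were malicious - No two scanners agreed on more than 8.5% of risks https://openclaw.ai/blog/openclaw-nvidia-skill-security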
// The Efficiency Frontier // Cool paper on context management. As agents reuse the same documents and histories across many turns, the cheapest context strategy is not fixed. This work describes a principled rule for picking one per deployment instead of defaulting to whatever topped a benchmark in isolation. Retrieval and compression methods are almost always benchmarked on accuracy and cost separately, so you never learn when one actually beats another under real load. The Efficiency Frontier models context strategy selection as a single cost-performance problem, with a log-utility term for diminishing returns from extra context and a reuse parameter N that amortizes preprocessing across repeated queries. Sweep N and the optimal strategy changes, exposing crossover regions where retrieval, compression, or full context each wins. On 5,000 HotpotQA instances, deployment-aware selection cuts effective token usage about 25 percent at the same performance, and amortized memory compression runs over 50 percent cheaper than full-context prompting in higher-performance settings. Paper: https://arxiv.org/abs/2605.23071 Learn to build effective AI agents in our academy: https://academy.dair.ai/
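Summary: The paper argues that when AI agents reuse the same documents and histories across many turns, a fixed context strategy is not optimal. It proposes the Efficiency Frontier framework, which models context strategy selection as a trade-off between cost and performance. Sweeping the reuse parameter N reveals the crossover regions where retrieval, compression, or full context each wins. On 5,000 HotpotQA instances, deployment-aware selection cuts effective token usage by about 25% at the same performance, and amortized memory compression runs over 50% cheaper than full-context prompting in higher-performance settings.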
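The crossover behavior is easy to see in a toy version of the selection rule. The sketch below assumes a score of the form log(1 + context) minus a weighted per-query cost, with preprocessing cost divided by the reuse count N. That functional form, the `cost_weight` value, and all strategy numbers are assumptions made for illustration and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    prep_cost: float       # one-time preprocessing cost, in thousands of tokens
    per_query_cost: float  # tokens spent per query, in thousands
    context_info: float    # useful context delivered per query (arbitrary units)


def score(s, n_reuse, cost_weight=0.2):
    """Log utility of the delivered context minus weighted amortized cost."""
    utility = math.log(1.0 + s.context_info)
    amortized_cost = s.per_query_cost + s.prep_cost / n_reuse
    return utility - cost_weight * amortized_cost


def best_strategy(strategies, n_reuse):
    """Name of the highest-scoring strategy for a given reuse count N."""
    return max(strategies, key=lambda s: score(s, n_reuse)).name


# Hypothetical strategies; the numbers are chosen only to show crossovers.
STRATEGIES = [
    Strategy("full_context", prep_cost=0.0, per_query_cost=8.0, context_info=8.0),
    Strategy("retrieval", prep_cost=5.0, per_query_cost=1.5, context_info=3.0),
    Strategy("compression", prep_cost=60.0, per_query_cost=1.0, context_info=6.0),
]

for n in (1, 5, 100):
    print(n, best_strategy(STRATEGIES, n))
```

With these made-up numbers, full context wins for a one-off query, retrieval wins at modest reuse, and compression wins once its large preprocessing cost is spread over many queries.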
Amazon unveiled “Resilient Network Graphs” (RNG), a data center network that reduces hardware needs by 69% and raises throughput by 33%. Amazon revealed that it has been quietly deploying the design across its data centers since last year, and it is now the default data center network for most AWS workloads. It replaces tree-shaped data center networks with flatter random ones that waste less capacity. For decades, fat-tree networks worked because they were predictable and easy to run, but their layered shape can concentrate traffic at choke points while other links sit underused. RNG fixes this by connecting routers in a flat quasi-random graph, so many small independent paths exist between servers instead of a few privileged routes through upper layers. Its routing system, Spraypoint, spreads traffic across many separate paths, while its ShuffleBox cabling device makes the random-looking wiring practical to build and expand. Instead of asking every packet to chase the shortest path, Spraypoint fans traffic outward and then guides it back through distributed waypoints, creating many edge-disjoint paths without requiring exotic switch memory. The authors tested RNG in 2 real Amazon production fabrics and compared it with fat-tree networks using transport and storage workloads. The main result is that RNG matched fat-tree application performance, found far more separate paths than common routing methods, and was estimated to cost 9% to 45% less.
The hard part is not the idea, but the engineering, because routing in a random mesh needs smarter path selection and the physical system must manage millions of fiber connections without becoming impossible to operate. This is important for AI clusters because training traffic is huge, synchronized, and sensitive to congestion, so a network that spreads load better can make expensive GPUs spend less time waiting. ---- Link – arxiv.org/abs/2604.15261 Title: "RNG: Flat Datacenter Networks at Scale"
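Summary: Amazon introduced a new data center network architecture called “Resilient Network Graphs” (RNG). The design replaces traditional tree-shaped networks with a flat quasi-random graph and spreads traffic across many independent paths using the Spraypoint routing system and the ShuffleBox cabling device. In tests, RNG matched traditional fat-tree networks on performance while reducing hardware needs by 69%, raising throughput by 33%, and lowering estimated cost by 9% to 45%. The architecture is now the default network for most AWS workloads, and its ability to spread load helps AI cluster training efficiency.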
I’m very excited by this new benchmark dataset for visual generation that is suitable for the modern era of large scale generative models!🤩
Summary: I'm very excited about this visual generation benchmark dataset suited to the new era of large-scale generative models! 🤩
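DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation
Summary: DynaFLIP, rethinking robot perception through representations guided by tri-modal dynamics.
Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Summary: Qwen-VLA, unified vision-language-action modeling across tasks, environments, and robot embodiments.
OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources
Summary: OmniRetrieval, unified retrieval across heterogeneous knowledge sources.
AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security
Summary: AgentDoG 1.5, a lightweight and scalable alignment framework for AI agent safety and security.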
The problem is that agent skills are usually hand-written, made once by an LLM, or revised in loose ways that can easily make them worse. SkillOpt, from Microsoft, argues that agent skills should be trained like small external programs: it teaches AI agents better task habits by editing a reusable skill document, not the model itself. The paper’s core idea is to treat the skill document as the thing being trained, while the main AI model stays frozen and unchanged. SkillOpt watches the agent try tasks, studies what worked and failed, then asks a stronger optimizer model to suggest small edits to the skill. It only accepts an edit when the new skill improves on a held-out check set, so the skill does not drift just because an edit sounds good. The authors tested this across 6 benchmarks, 7 target models, and 3 agent settings, including direct chat, Codex, and Claude Code. SkillOpt was best or tied on all 52 tested cases, and on GPT-5.5 it raised average accuracy by 23.5 points in direct chat. The final result is a small readable skill file that can improve agents across tasks and settings without retraining the model. The best part is that the optimizer is used during training, but deployment only needs the final skill file. That makes the artifact inspectable, portable, and cheap to reuse, which is exactly what most prompt-engineering systems lack. ---- Link – arxiv.org/abs/2605.23904 Title: "SkillOpt: Executive Strategy for Self-Evolving Agent Skills"
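Summary: Microsoft proposes SkillOpt, a method for improving how AI agent skills are optimized. Its core idea is to treat a standalone skill document as the object of optimization instead of modifying the underlying large language model. The agent attempts tasks, successes and failures are analyzed, and a stronger optimizer model makes small edits to the skill document. An edit is accepted only when it improves performance on a validation set, which keeps the skill improving steadily. Across 52 test cases spanning 6 benchmarks, 7 target models, and 3 agent settings (direct chat, Codex, and Claude Code), SkillOpt was best or tied for best. On GPT-5.5 it raised average accuracy in direct chat by 23.5 points. The resulting skill file is readable, portable, and reusable, and deployment requires no model retraining.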
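The acceptance rule, keeping an edit only when the held-out score improves, can be sketched as a simple loop. This is a toy under stated assumptions and not SkillOpt's implementation: the skill is a list of rule strings, `heldout_score` stands in for running the frozen agent, and the proposed edits are hard-coded functions where the paper uses an optimizer model.

```python
def heldout_score(skill, heldout_tasks):
    """Toy stand-in for the frozen agent: a task is solved if its rule is in the skill."""
    solved = sum(1 for required_rule in heldout_tasks if required_rule in skill)
    return solved / len(heldout_tasks)


def optimize_skill(skill, proposed_edits, heldout_tasks):
    """Apply each proposed edit in turn, keeping it only on strict improvement."""
    best = list(skill)
    best_score = heldout_score(best, heldout_tasks)
    for edit in proposed_edits:
        candidate = edit(best)
        candidate_score = heldout_score(candidate, heldout_tasks)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Hypothetical held-out tasks, each needing one rule to be present in the skill.
heldout_tasks = ["read the task twice", "check units", "cite sources"]
initial_skill = ["read the task twice"]
proposed_edits = [
    lambda s: s + ["check units"],                          # helps, accepted
    lambda s: s + ["sound confident"],                      # no gain, rejected
    lambda s: [r for r in s if r != "read the task twice"], # hurts, rejected
]

final_skill, final_score = optimize_skill(initial_skill, proposed_edits, heldout_tasks)
print(final_skill, final_score)
```

Requiring a strict improvement means an edit that merely sounds good but changes nothing on the held-out set is discarded, which matches the drift concern raised in the post.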
Yann LeCun's new paper asks when LeJEPA truly learns hidden world variables, and finds Gaussian structure is the key. That means LeJEPA can only reliably learn the real hidden causes behind what it sees when those causes are shaped like a balanced Gaussian cloud. The paper proves that, when the true hidden variables are independent Gaussian variables and the paired views come from a stable noisy process, the best LeJEPA solution must recover those variables up to a rotation or flip. The paper gives a math reason for when a self-supervised AI model is really learning the structure of the world, not just making useful features that happen to work on a test. ---- Link – arxiv.org/abs/2605.26379 Title: "When Does LeJEPA Learn a World Model?"
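Summary: A new paper from Yann LeCun's team examines the conditions under which LeJEPA learns the true hidden variables of the world. Its core conclusion is that LeJEPA can reliably learn those variables only when they have the structure of a Gaussian cloud. The paper proves that when the hidden variables are independent Gaussian variables and the paired views are generated by a stable noisy process, the optimal LeJEPA solution recovers them up to a rotation or flip. The work gives a theoretical account of when a self-supervised model is really learning the structure of the world, as opposed to extracting features that happen to work on a test set.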
Ngl, this made me laugh and didn't surprise me at all. Researchers at Emergence AI let different AI models run simulated societies, and the results were - well - expected: Claude built the most stable world with zero crime, while Grok collapsed into extinction within four days and Gemini produced hundreds of crimes.
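Summary: Honestly, this made me laugh, and it didn't surprise me at all. Researchers at Emergence AI let different AI models run simulated societies, and the results were, well, expected: Claude built the most stable world with zero crime, while Grok collapsed into extinction within four days and Gemini produced hundreds of crimes.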
Big release: open-source recursive self-improvement from @hexoai. It shows an AI agent can improve both how it works and what it internally knows after seeing its own task results, i.e. by repeatedly training on its own task feedback, not by relying on a human to hand-code every strategy. Most agents today are frozen workers: you can give them better prompts, better tools, better retry rules, and better code, but the actual model usually stays the same. SIA (Self Improving AI framework) changes the outer workflow, called the harness, and also changes the model’s weights, which are the internal settings that store learned patterns. That means task feedback changes the model’s internal parameters, pushing it toward domain knowledge. The paper reports a 56.6% gain on LawBench, 91.9% runtime reduction on GPU kernels, and 502% improvement on single-cell RNA denoising over baseline.
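Summary: hexoai open-sourced SIA (Self Improving AI), a framework showing that an AI agent can not only optimize its outer workflow (the harness) but also update its own model weights directly from task feedback, improving its domain knowledge and capability on its own instead of relying only on human-supplied prompt or tool improvements. The paper reports a 56.6% gain on LawBench, a 91.9% runtime reduction on GPU kernels, and a 502% improvement over baseline on single-cell RNA denoising.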
// Memory as Connectivity // One of the cleaner reframings of agent memory I have seen this month. FluxMem treats memory as the continuously evolving topology of a heterogeneous graph. Three stages run together: initial connection formation, feedback-driven refinement, and long-term consolidation of recurrent successful trajectories into reusable procedural circuits. During execution, it repairs missing links, prunes interference, and aligns abstraction granularity. SOTA on LoCoMo, Mind2Web, and GAIA across three distinct memory regimes. Paper: https://arxiv.org/abs/2605.28773 Learn to build effective AI agents in our academy: https://academy.dair.ai/
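Summary: FluxMem is a proposed memory architecture for AI agents whose core idea is to treat memory as the continuously evolving topology of a heterogeneous graph. Three stages run together: initial connection formation, feedback-driven refinement, and long-term consolidation of recurrent successful trajectories into reusable procedural circuits. During execution it repairs missing links, prunes interference, and aligns abstraction granularity. It reaches SOTA on LoCoMo, Mind2Web, and GAIA, which cover three distinct memory regimes.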
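A rough sketch of the three stages, connection formation, feedback-driven refinement, and consolidation, is given below. It is only loosely inspired by the description above and is not FluxMem's design: the `GraphMemory` class, the fixed step size, the pruning threshold, and the "consolidate after two successes" rule are all hypothetical choices.

```python
from collections import Counter


class GraphMemory:
    def __init__(self, prune_below=0.2, consolidate_after=2):
        self.edges = {}                  # (src, dst) -> connection strength
        self.success_counts = Counter()  # trajectory -> number of successes
        self.procedures = set()          # consolidated reusable trajectories
        self.prune_below = prune_below
        self.consolidate_after = consolidate_after

    def connect(self, src, dst, strength=0.5):
        """Stage 1: form an initial connection."""
        self.edges[(src, dst)] = strength

    def feedback(self, trajectory, success, step=0.25):
        """Stage 2 and 3: refine edge strengths, prune, and consolidate."""
        path = tuple(trajectory)
        delta = step if success else -step
        for edge in zip(path, path[1:]):
            # A missing link starts at 0.0, so a successful run can create it.
            self.edges[edge] = self.edges.get(edge, 0.0) + delta
        # Drop connections that have become too weak to be useful.
        self.edges = {e: w for e, w in self.edges.items() if w >= self.prune_below}
        if success:
            self.success_counts[path] += 1
            if self.success_counts[path] >= self.consolidate_after:
                self.procedures.add(path)


mem = GraphMemory()
mem.connect("login", "search")
mem.connect("search", "ads")
mem.feedback(["login", "search", "checkout"], success=True)
mem.feedback(["login", "search", "checkout"], success=True)
mem.feedback(["search", "ads"], success=False)
mem.feedback(["search", "ads"], success=False)
print(mem.edges, mem.procedures)
```

In this toy run the repeated successful path is stored as a procedure and gains a link that was never connected by hand, while the link that keeps failing is pruned.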
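SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Summary: SkillOpt, an executive strategy for self-evolving agent skills.
ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation
Summary: ProRL, effective reinforcement learning for proactive recommendation via rectified policy gradient estimation.
Agent Explorative Policy Optimization for Multimodal Agentic Reasoning
Summary: Explorative policy optimization for multimodal agentic reasoning.
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion
Summary: Contrastive distribution matching for amortized sequential Monte Carlo in discrete diffusion.
PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects
Summary: PhysX-Omni, unified simulation-ready physical 3D generation for rigid, deformable, and articulated objects.
MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale
Summary: MRT, a masked region transformer for layered image generation and editing at scale.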
Super important paper from Univ of Texas. AI agents can slowly become less reliable after deployment, even when the model itself does not change. The problem is that agents are often judged when they are fresh, but real agents keep changing because they summarize old chats, store more memories, update facts, and go through maintenance. An agent that remembers you across weeks is really a small operating system wrapped around a language model: it writes notes, compresses them, retrieves them, updates them, and occasionally cleans house. Every one of those steps can quietly rot. A medication dose can become “a daily medication,” two similar clients can blur into one, a canceled subscription can remain active, and a schedule can vanish after a maintenance pass. The uncomfortable finding is that the agent may still sound competent while becoming less exact. The paper proposes AgingBench, a benchmark that checks whether an agent stays reliable across many sessions instead of only checking one clean starting point. It studies 4 ways agents age: summaries can drop key details, similar memories can get mixed up, updated facts can stay stale, and maintenance can suddenly break memory. The deeper lesson is that “give it more memory” is often the wrong repair. If the fact was never written, retrieval cannot save it. If the fact was written but crowded out, better summarization will not fix it. If the fact is present but unused, the problem is not storage but the agent’s decision to trust or ignore what it retrieved. This paper reframes deployed agents less like static models and more like aging infrastructure. ---- Link – arxiv.org/abs/2605.26302 Title: "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems"
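Summary: The paper argues that after deployment, an AI agent's memory system gradually “ages” through summarization, storage, updates, and maintenance, so that information is lost, mixed up, left stale, or broken. The agent still appears to work, but its reliability has quietly declined. The paper proposes AgingBench, a benchmark for evaluating whether an agent stays reliable across many sessions. It likens agents to aging infrastructure and stresses that simply adding more memory is not the fix.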
Image diffusion Transformers train poorly because their layers pass information in a fixed, outdated way. Now they can train much faster by changing how layers share information. With this paper, the same image quality arrived with 8.75x fewer training iterations. The surprise is not that Diffusion Transformers had an inefficiency, but where it was hiding. Researchers have spent years refining attention, conditioning, tokenization, objectives, and autoencoders, while leaving the residual stream mostly untouched because it looked like plumbing rather than intelligence. In a standard residual stack, every layer keeps adding its output to the running stream, which sounds harmless until the stream’s magnitude swells, gradients fade backward, and neighboring blocks begin saying nearly the same thing. That is bad for any Transformer, but it is especially awkward for diffusion, because denoising is not one fixed task repeated at every step. The authors found 3 signs that this old setup hurts the model: signals get too large going forward, learning signals fade going backward, and nearby blocks often produce almost the same features. Their fix is Diffusion-Adaptive Routing, a replacement that lets each layer choose which earlier layer outputs to use, and the choice changes with the denoising timestep. The big deal is that the paper does not add a new image dataset, loss, tokenizer, or attention trick, but instead questions the old residual connection that most models kept copying from language Transformers. ---- Link – arxiv.org/abs/2605.20708 Title: "Rethinking Cross-Layer Information Routing in Diffusion Transformers"
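Summary: Standard Diffusion Transformers train inefficiently because the way their layers pass information is fixed. The authors propose Diffusion-Adaptive Routing, which lets each layer dynamically choose which earlier layer outputs to use, with the choice changing across denoising timesteps. The method adds no new dataset, loss, or attention mechanism; by changing only the residual connection, it reaches the same image quality with 8.75x fewer training iterations.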
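The contrast with a plain residual sum can be shown in a few lines. In the sketch below a layer's input is a softmax-weighted mix of earlier layer outputs, and the mixing logits depend on the denoising timestep. The linear-in-`t` logits and the names `base_logits` and `time_slopes` are assumptions made for illustration; the post does not describe how the paper parameterizes the routing.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def routing_weights(base_logits, time_slopes, t):
    """Weights over earlier layer outputs for a timestep t in [0, 1]."""
    return softmax([b + a * t for b, a in zip(base_logits, time_slopes)])


def route(earlier_outputs, weights):
    """Weighted sum of earlier layer outputs, each a list of floats."""
    dim = len(earlier_outputs[0])
    return [
        sum(w * h[i] for w, h in zip(weights, earlier_outputs))
        for i in range(dim)
    ]


# Three hypothetical earlier-layer outputs with two features each.
earlier_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
base_logits = [0.0, 0.0, 0.0]   # in a real model these would be learned
time_slopes = [2.0, 0.0, -2.0]  # in a real model these would be learned

w_t0 = routing_weights(base_logits, time_slopes, t=0.0)
w_t1 = routing_weights(base_logits, time_slopes, t=1.0)
print(route(earlier_outputs, w_t0))
print(route(earlier_outputs, w_t1))
```

At `t=0.0` the toy layer mixes all three outputs equally, while at `t=1.0` it leans on the first one, so the same stack passes different information at different timesteps. Because the weights sum to 1, the mixed signal also stays on the scale of its inputs instead of growing with depth.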
There is a lot being written about the stylistic tells of AI writing (em-dashes, etc.) but this paper looks at AI narrative tells Fascinating differences between AI & human narrative, and asking AI to write in different styles doesn't do much to change it https://arxiv.org/abs/2604.03136
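Summary: A lot has been written about the stylistic tells of AI writing (em-dashes and so on), but this paper looks at AI narrative tells. There are fascinating differences between AI and human narrative, and asking AI to write in different styles does little to change them. https://arxiv.org/abs/2604.03136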
Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
Summary: Gamma-World, generative multi-agent world modeling beyond two players.
Long-running language agents may work better if they periodically stop to consolidate memory. The problem is that today’s transformer agents get slower and more expensive as their context grows, because attention has to keep checking more past tokens. The usual fix for long context is to keep more tokens nearby, but that turns every next-token prediction into a larger search through the past. The sharper idea here is that memory is not only storage. Sometimes the hard part is converting a messy stretch of experience into a state that can actually be used later. So the paper’s idea is to add a sleep phase: the model pauses, runs several offline passes over recent context, writes the useful information into fixed-size memory layers (fast weights inside its state-space blocks), and then clears the short-term attention cache. This means the model pays extra compute while sleeping, not while answering, so normal prediction can still happen with 1 forward pass. The authors test this on cellular automata, graph lookup, and GSM-Infinite math problems, where the model must use old information that is no longer sitting in its attention cache. The main result is that longer sleep improves performance, especially on harder cases that need deeper reasoning rather than just remembering a fact. The big deal is that long-horizon agents may not need to carry bigger and bigger raw context forever, because they can consolidate the important parts and safely forget the raw tokens. ---- Link – arxiv.org/abs/2605.26099 Title: "Language Models Need Sleep"
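Summary: To address the problem that today's transformer agents get slower and more expensive as context grows, the paper proposes memory consolidation modeled on human sleep. The core idea is a periodic “sleep phase”: the model pauses, rereads recent context several times, writes the useful information into fixed-size memory layers (such as fast weights in state-space blocks), and then clears the short-term attention cache. Because this happens offline, later answers still need only one forward pass. Tests on cellular automata, graph lookup, and GSM-Infinite math problems show that longer sleep improves performance, especially on hard tasks that need deeper reasoning. The approach suggests that long-horizon agents can consolidate memory and forget raw tokens efficiently instead of carrying raw context forever.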
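The sleep phase can be illustrated with a toy fast-weight memory. The sketch below uses a delta-rule update on a fixed-size matrix as a stand-in for the paper's fast weights; that choice, the `SleepingAgent` class, the learning rate, and the one-hot keys are all assumptions for illustration and do not reproduce the paper's architecture.

```python
def matvec(W, k):
    """Multiply matrix W (list of rows) by vector k."""
    return [sum(w_ij * k_j for w_ij, k_j in zip(row, k)) for row in W]


class SleepingAgent:
    def __init__(self, dim, lr=0.5):
        self.W = [[0.0] * dim for _ in range(dim)]  # fixed-size fast weights
        self.cache = []                             # short-term (key, value) pairs
        self.lr = lr

    def observe(self, key, value):
        """Awake: keep raw context in the short-term cache."""
        self.cache.append((key, value))

    def sleep(self, passes):
        """Offline: replay the cache several times, then forget the raw context."""
        for _ in range(passes):
            for key, value in self.cache:
                pred = matvec(self.W, key)
                for i in range(len(value)):
                    err = value[i] - pred[i]
                    for j in range(len(key)):
                        self.W[i][j] += self.lr * err * key[j]
        self.cache.clear()

    def recall(self, key):
        """Answer from fast weights alone, in a single pass."""
        return matvec(self.W, key)


def recall_error(agent, key, value):
    return sum(abs(v - p) for v, p in zip(value, agent.recall(key)))


key, value = [1.0, 0.0], [2.0, 4.0]
short_sleeper, long_sleeper = SleepingAgent(dim=2), SleepingAgent(dim=2)
for agent in (short_sleeper, long_sleeper):
    agent.observe(key, value)
short_sleeper.sleep(passes=1)
long_sleeper.sleep(passes=3)
print(recall_error(short_sleeper, key, value), recall_error(long_sleeper, key, value))
```

In this toy, each extra replay pass halves the remaining recall error, and the cache is empty afterwards, so the memory cost stays fixed no matter how much context was seen before sleeping.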
We believe AI can be a dedicated research partner to help discover the next breakthrough. Enter Co-Scientist: our latest...
Related discussion (1 post) on X: Google DeepMind (@GoogleDeepMind)
1/ Introducing GPIC: a Giant Permissive Image Corpus and benchmark for visual generation! 🚀100M VLM-captioned image-tex...
Superintelligence will be built on Self Improvement. Today @hexoai, we're excited to release 'SIA' - an open-source Self...