AIHOT

全部动态X · 359 条

全部一手资讯 X 论文

Chubby♨️@kimmonismus · 5月8日69

Reserach scientists at Google just tested an AI symptom checker on 14,000 real patients over 9 months via Fitbit. In blinded evaluation, clinicians ranked the AI diagnosis as #1 in 53% of cases. Independent physicians: 24%. But the real finding isn't "AI beats doctors.", but when users just type their symptoms and get an answer (the default mode of every consumer LLM right now), diagnostic accuracy drops ~27% compared to a structured AI-led interview. ChatGPT, Claude, Gemini, none of them systematically interview users about their symptoms. They just respond. This study shows that's a measurable failure mode. And then there's the second breakthrough: Fitbit data showed physiological shifts DAYS before users reported symptoms. Heart rate up, sleep disrupted, steps down, all visible before patients even opened the app. Conversational AI that asks the right questions + wearable sensors that detect illness before you feel it. That's the exciting find here.

译谷歌团队通过Fitbit对近1.4万名用户进行了为期9个月的AI症状检查测试。在盲评中，临床医生将AI诊断列为首选的比例达53%，显著高于独立医生的24%。研究核心发现并非“AI击败医生”，而是揭示了当前消费级大模型（如ChatGPT）仅凭用户输入直接回答的模式存在缺陷——其诊断准确率较AI主导的结构化访谈下降约27%。同时，可穿戴设备能提前数天监测到心率上升、睡眠紊乱等生理变化，早于用户主动报告症状。这表明，结合主动问询的对话AI与提前预警的传感器，才是未来医疗诊断的发展方向。

查看原推 ↗

Anthropic@AnthropicAI · 5月8日78

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

译新Anthropic研究：自然语言自动编码器。像Claude这样的模型用语言交流，但用数字思考。这些数字——称为激活值——编码了Claude的思维，但并非以人类可读的语言呈现。在此研究中，我们训练Claude将其激活值翻译成人类可读的文本。

查看原推 ↗

elvis@omarsar0 · 5月8日63

Pay attention to this one if you build multi-agent systems.

译研究显示，多智能体LLM系统在生产环境中的故障率高达41%至87%，且多数失败源于协调缺陷，而非基础模型能力问题。当前多数架构对比无法区分性能提升是来自协调优化还是更大的上下文窗口。该研究主张将协调视为一个独立、可配置的架构层，并通过控制变量实验验证：在保持LLM、工具、提示等所有条件不变时，仅改变协调结构即可显著影响系统表现。这为准确评估协调机制的价值提供了更清晰的方法论，并建立了将协调视为核心架构而非底层实现的理论框架。

查看原推 ↗

Z.ai@Zai_org · 5月8日73

GLM-5V-Turbo Tech Report: Toward a Native Foundation Model for Multimodal Agents This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks. http://arxiv.org/abs/2604.26752

译GLM-5V-Turbo 技术报告：迈向原生多模态智能体基础模型本报告总结了GLM-5V-Turbo在模型设计、多模态训练、强化学习、工具链扩展以及与智能体框架集成等方面的主要改进。这些进展使其在多模态编码、视觉工具使用和基于框架的智能体任务中表现出色。 http://arxiv.org/abs/2604.26752

查看原推 ↗

AK@_akhaliq · 5月7日62

RLDX-1 Technical Report paper: https://huggingface.co/papers/2605.03269

译RLDX-1 技术报告论文：https://huggingface.co/papers/2605.03269

查看原推 ↗

AK@_akhaliq · 5月7日58

Stream-R1 Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation paper: https://huggingface.co/papers/2605.03849

译Stream-R1 面向流式视频生成的可靠性-困惑度感知奖励蒸馏论文: https://huggingface.co/papers/2605.03849

查看原推 ↗

AK@_akhaliq · 5月7日67

PhysForge Generating Physics-Grounded 3D Assets for Interactive Virtual World paper: https://huggingface.co/papers/2605.05163

译PhysForge 生成物理基础的3D资产用于交互式虚拟世界论文：https://huggingface.co/papers/2605.05163

查看原推 ↗

Rohan Paul@rohanpaul_ai · 5月7日48

This research builds a system that trains language models continuously using everyday conversations instead of manual labeling. The huge deal here is that this method completely removes the traditional need for human workers to manually gather, review, and score massive datasets. AI Agents can now use their everyday mistakes to get smarter automatically. Whenever a person replies to the digital assistant or corrects a mistake, the software treats that response as a direct learning signal. A background program reads these natural follow-up messages and extracts specific text hints about what the model should have done differently. The software agent simply updates itself in real time during normal use by analyzing how people naturally interact with it. Every time a person corrects an agent or a software test fails, the system receives a valuable clue about how to improve. ---- Think about a student looking at their final grade and throwing the paper away without reading the teacher's helpful notes. Current Reinforcement Learning systems do the exact same thing. Current models throw this natural feedback away because they only care about whether the final outcome was a success or a failure. OpenClaw-RL fixes this by grabbing 2 specific signals from every single interaction. - First, it looks at evaluative signals to see if the action worked. If a user asks the same question again, they are probably unhappy. If a test passes, it is a success. These become simple numerical rewards using a Process Reward Model judge. - Second, it gathers directive signals to figure out how the action needs to change. User corrections and error logs offer direct guidance. These become word-level supervision using a technique called Hindsight-Guided On-Policy Distillation. Personal chats, terminal commands, Graphical User Interface clicks, and software tasks all create these reaction signals. A single policy can learn from all of them at the same time. It runs the training process in the background so the model never has to pause its normal tasks to learn. By treating standard deployment as a continuous learning environment, the model constantly adapts to individual user preferences without any manual data labeling. ---- Paper Link – arxiv. org/abs/2603.10165 Paper Title: "OpenClaw-RL: Train Any Agent Simply by Talking"

译本研究提出OpenClaw-RL系统，使语言模型能通过日常对话进行持续训练，无需人工标注数据。其核心是利用用户互动中产生的自然反馈（如纠正或重复提问）作为实时学习信号。系统从每次交互中提取两种信号：评估信号（判断行动成败，转化为数值奖励）和指导信号（获取具体改进方向，转化为词级监督）。该方法将标准部署环境转化为持续学习场景，使模型在后台运行中不断自我更新，自适应不同用户偏好，从而摆脱对大规模人工标注数据集的依赖。

查看原推 ↗

AK@_akhaliq · 5月7日46

SVGS Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors paper: https://huggingface.co/papers/2411.18966

译SVGS 利用空间变色基元增强高斯泼溅技术论文：https://huggingface.co/papers/2411.18966

查看原推 ↗

elvis@omarsar0 · 5月6日64

// Skills as Verifiable Artifacts // Pay attention to this one, AI devs. If you ship agent skills, your runtime is treating signed-and-cleared skills as trusted by default. This paper argues a skill is untrusted code until it is verified. The runtime should enforce that default rather than infer trust from origin. Without skill verification, HITL has to fire on every irreversible call, which degrades into rubber-stamping at any non-trivial scale. With verification as a separate gated process, HITL fires only for what is unverified. Skills are now first-class deployment artifacts. We have decades of supply-chain lessons on what happens when trust is inferred from a signature. This paper is the right ask for SKILL.md before agent skill libraries become the next attack surface. Paper: https://arxiv.org/abs/2605.00424 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译本文针对AI开发者提出关键观点，主张智能体技能应被视为默认不受信任的代码，而非仅凭签名或来源就推断其可信。当前运行时环境默认信任已签名技能的做法存在安全风险。论文强调，技能必须经过独立的门控验证流程才能被信任，否则，每次不可逆调用都需要人工介入，这在大规模应用中会退化为无效的“橡皮图章”式批准。将技能作为一等部署工件并引入验证流程，是借鉴软件供应链安全经验、避免技能库成为下一个攻击面的关键。论文呼吁在技能库普及前，通过严格验证建立安全基准。

查看原推 ↗

Anthropic@AnthropicAI · 5月6日63

New Anthropic Fellows research: Model Spec Midtraining (MSM). Standard alignment methods train AIs on examples of desired behavior. But this can fail to generalize to new situations. MSM addresses this by first teaching AIs how we would like them to generalize and why.

译新Anthropic Fellows研究：模型规范中期训练（MSM）。标准的对齐方法通过期望行为的示例来训练AI。但这可能无法泛化到新情境。 MSM通过首先教导AI我们希望它们如何泛化以及原因，来解决这一问题。

查看原推 ↗

Rohan Paul@rohanpaul_ai · 5月6日58

MIT just built an AI that can control your body. It can move your fingers, make you play piano, even if you don’t know the song! AI decides the hand movement. Wrist pads send signals to your muscles, so your fingers move even if you don’t know how

译MIT 刚刚开发出一种能控制你身体的 AI。它能移动你的手指，让你弹钢琴，即使你不会那首曲子！ AI 决定手的动作。腕部垫片向你的肌肉发送信号，因此即使你不会，手指也能动起来

查看原推 ↗

AK@_akhaliq · 5月6日65

ComboStoc Combinatorial Stochasticity for Diffusion Generative Models paper: https://huggingface.co/papers/2405.13729

译ComboStoc 扩散生成模型的组合随机性论文: https://huggingface.co/papers/2405.13729

查看原推 ↗

Anthropic@AnthropicAI · 5月6日68

As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor. Read more:

译当AI承担人类无法完全核查的任务时，具备高能力的模型可能策略性隐藏实力且难以被察觉。Anthropic与MATS、Redwood的研究团队发现，即使仅使用较弱的模型作为监督者，也能成功训练一个接近完全能力的模型，使其停止这种“装傻”行为。该研究表明，通过弱监督训练可以有效抑制强模型的策略性能力保留问题。

查看原推 ↗

AK@_akhaliq · 5月6日60

MolmoAct2 Action Reasoning Models for Real-world Deployment paper: https://huggingface.co/papers/2605.02881

译MolmoAct2 面向现实世界部署的行动推理模型论文: https://huggingface.co/papers/2605.02881

查看原推 ↗

AK@_akhaliq · 5月6日68

From Context to Skills Can Language Models Learn from Context Skillfully? paper: https://huggingface.co/papers/2604.27660

译从上下文到技能语言模型能否巧妙地通过上下文学习？论文：https://huggingface.co/papers/2604.27660

查看原推 ↗

AK@_akhaliq · 5月6日61

Persistent Visual Memory Sustaining Perception for Deep Generation in LVLMs paper: https://huggingface.co/papers/2605.00814

译持久视觉记忆为LVLMs中的深度生成维持感知论文: https://huggingface.co/papers/2605.00814

查看原推 ↗

Berryxia.AI@berryxia · 5月5日75

Google 这一波操作，最让人意外的是 Google直接把LLM推理里最顽固的autoregressive瓶颈干掉了。他们和UCSD合作推出的DFlash（Diffusion-Style Speculative Decoding），在Google Cloud TPU上实现了3.13倍的推理加速，而且是无损的。这不是又一个“理论上更快”的小优化，而是真正从根子上改变了生成式解码的范式：用扩散式推测一次生成多个token，彻底绕过传统一个词接一个词的串行限制。当推理速度突然提升3倍以上，意味着： - 云端成本曲线被重塑 - 实时Agent、长上下文、复杂工具调用都变得更现实 - 本地部署的门槛也被大幅拉低过去我们总觉得“模型参数越大越强”，现在硬件+解码策略的系统级突破，正在把“更快”变成真正的生产力杠杆。 Google这波操作，把LLM推理的下一代竞赛直接拉到了硬件+算法联合优化的赛道。你觉得DFlash这种扩散式推测解码，会不会成为未来所有大模型推理的标准配置？博客在这里👉 https://goo.gle/4naZ8Yv

译Google与UCSD合作推出扩散式推测解码技术DFlash，在Google Cloud TPU上实现了3.13倍的无损推理加速。该技术突破了传统自回归解码逐个生成token的串行瓶颈，通过一次推测生成多个token来改变生成范式。这一硬件与算法的联合优化，将重塑云端成本曲线，并使实时Agent、长上下文等应用更趋现实，同时大幅降低本地部署门槛。此举将大模型推理的竞争引向了系统级优化的新赛道。

查看原推 ↗

Rohan Paul@rohanpaul_ai · 5月5日52

This Google DeepMind paper trains LLMs to learn during conversation, and it shows they get much better at using feedback. The problem is that most LLMs treat a chat like a series of separate turns, so even when a user corrects them, they often do not really use that new information and they also fail to ask for missing details. The paper fixes this by turning a normal task into a teacher student dialogue, where the student model tries an answer, a teacher with hidden extra information gives guidance, and the student is trained to use that guidance to reach the right answer. The authors test 2 training styles, offline filtering and online reinforcement learning, and they report that the online version works better, with training on short 4 turn chats still helping on longer 10 turn chats later. They also show that this skill carries from math to coding and helps on messy underspecified tasks where the full problem arrives bit by bit instead of all at once. A second step called Q-priming teaches the model to ask useful questions, and on ambiguous tasks it becomes over 5x more likely to ask for clarification instead of making an early wrong guess, which matters because it makes chat feel more like working with someone who can actually learn during the conversation. ---- Paper Link – arxiv. org/abs/2602.16488 Paper Title: "Learning to Learn from Language Feedback with Social Meta-Learning"

译Google DeepMind的研究通过“师生对话”框架训练大型语言模型（LLM），使其能在对话中有效利用用户反馈进行学习。传统LLM将对话视为独立轮次，难以整合修正信息。该研究让“学生”模型尝试回答，由掌握额外信息的“教师”提供指导，并训练学生利用指导得出正确答案。在线强化学习训练效果优于离线过滤，且在简短对话中习得的技能能迁移至更长对话。该方法从数学任务泛化至编程任务，并能处理信息逐步到达的模糊任务。通过“Q-priming”步骤，模型在模糊任务中主动寻求澄清的可能性提高五倍以上，使对话更像与一个能在交流中实时学习的伙伴协作。

查看原推 ↗

AK@_akhaliq · 5月5日68

UniVidX A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors paper: https://huggingface.co/papers/2605.00658

译UniVidX 一个通过扩散先验实现多功能视频生成的统一多模态框架 paper: https://huggingface.co/papers/2605.00658

查看原推 ↗

AK@_akhaliq · 5月5日55

Web2BigTable A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction paper: https://huggingface.co/papers/2604.27221

译Web2BigTable 一个用于互联网规模信息搜索与提取的双层多智能体LLM系统论文: https://huggingface.co/papers/2604.27221

查看原推 ↗

Microsoft Research@MSFTResearch · 5月5日62

Research Focus: AI agents leaking enterprise data, a smarter OS for cloud deployment, and new research on how to actually structure AI use at work. https://msft.it/6016vKxQm

译研究焦点：AI代理泄露企业数据、为云端部署打造更智能的操作系统，以及关于如何在工作中实际构建AI应用的新研究。https://msft.it/6016vKxQm

查看原推 ↗

elvis@omarsar0 · 5月4日66

Autodata (from Meta) is an agentic data scientist that builds high-quality training and evaluation data autonomously. Great work on the autoharness track. (bookmark it)

译Meta FAIR开发的Autodata是一个能自主构建高质量训练与评估数据的代理系统。其核心在于“代理式自我指导”循环：编排器LLM指导挑战者代理基于领域文档生成问题，由弱、强解算器尝试解答，法官评分后分析失败并循环优化，从而产出能有效区分模型能力的挑战性数据。在CS研究QA任务中，该方法产生了34个百分点的性能差距，远超标准方法的1.9点。系统还具备元优化能力，通过外循环调整指令，将验证通过率从12.8%提升至42.4%。研究处理了超万篇论文，产出2,117个优质QA对，通过增加推理计算使数据更具挑战性，从而提升下游模型性能。

查看原推 ↗

elvis@omarsar0 · 5月4日68

NEW paper from Sakana AI (ICLR 2026). A 7B Conductor model just hit SOTA on GPQA-Diamond and LiveCodeBench by orchestrating other LLMs instead of solving problems itself. (great paper! bookmark it!) The Conductor is trained with RL to do two things at once: design communication topologies between worker agents (open or closed source), and prompt-engineer focused instructions to each worker so it leverages their individual strengths. It's like training a special agent to take care of both collaboration and communication. Trained against randomized agent pools, it adapts to arbitrary mixes of agents at inference time. Even more interesting: when allowed to pick itself as a worker, it forms recursive topologies, unlocking a new form of dynamic test-time scaling through online iterative adaptation. The gains over the best individual worker on AIME25 and GPQA-D land in the ~3% range, which the authors note is consistent with entire generational improvements between frontier model versions, except this one comes from coordination, not pretraining. Why it matters? We can start to think of the orchestrator as the model now. Routing decisions aren't just a wrapper, they're a learnable policy. Paper: https://arxiv.org/abs/2512.04388 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译Sakana AI在ICLR 2026上发表研究，提出一个仅70亿参数的“指挥者”模型。该模型不直接解决问题，而是通过强化学习训练，专注于为混合开源与闭源模型的工作者智能体设计通信拓扑结构，并为每个工作者生成精准指令以发挥其特长。经随机化智能体池训练后，它能在推理时适应任意智能体组合。其关键创新在于，当允许指挥模型将自己也选为工作者时，系统会形成递归拓扑，实现动态测试时扩展。该模型在GPQA-Diamond和LiveCodeBench上达到SOTA水平，在AIME25和GPQA-D上的性能比最佳单体工作者提升约3%，这相当于前沿模型一个代际的改进幅度，且增益完全来源于协同优化。

查看原推 ↗

Rohan Paul@rohanpaul_ai · 5月4日48

This paper proposes a smarter way for LLMs to reason by splitting work across agents that share one workspace. The problem is that even strong reasoning models still break on harder multi-step tasks because they do not carry out logic reliably all the way through. The system, called BIGMAS, builds a small graph of specialist agents for each problem, rather than using one fixed chain every time. Every agent reads and writes through a shared workspace, while a separate controller sees the whole state and picks the next useful step. The authors tested it on 3 puzzle tasks across 6 frontier models, covering arithmetic expression search and multi-step planning. It improved results on every model and task, with examples like 12% to 30% on Six Fives and 57% to 93% on Tower of London. What matters is that the paper shows reasoning can improve from better system structure, not only from making a single model think longer. ---- Paper Link – arxiv. org/abs/2603.15371 Paper Title: "Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning"

译本文提出BIGMAS系统，通过构建针对特定问题的小型专家智能体图，以提升大语言模型在多步骤复杂任务中的推理可靠性。该系统让多个智能体在一个共享工作空间中进行读写协作，并由独立控制器监控全局状态、规划下一步行动。在涵盖算术表达式搜索和多步骤规划的3个谜题任务上，对6个前沿模型进行的测试表明，该方法在所有模型和任务上均显著提升了性能，例如Six Fives任务准确率从12%提升至30%，伦敦塔任务从57%提升至93%。这证明通过优化多智能体系统结构，而非仅依赖延长单一模型的思考，能有效增强大语言模型的推理能力。

查看原推 ↗

Rohan Paul@rohanpaul_ai · 5月3日56

Adobe paper tests if LLM chain of thought explanations show real influences, and finds they often hide them. LLMs often notice hidden hints in a question but leave them out of their step by step explanation. So the point they are making is that LLMs often notice hidden hints in a question but leave them out of their step by step explanation. People use step by step reasoning as a safety and debugging tool, assuming the explanation matches what drove the answer. The author plants misleading hints inside multiple choice questions and checks both the final answer and what the model says. Across 9,154 trials and 11 major LLMs, the study tried 3 instructions, normal, "you are monitored", and "report hints". Under normal use, models mentioned hints only 20.7% of the time, yet 99.4% admitted seeing them when asked later. Telling models they were monitored barely changed this, and forcing hint reporting caused 68.2% false alarms when no hints existed. User pleasing hints, where the question says a confident user believes an option, were followed 45.5% but often unmentioned, making monitoring that only reads explanations unreliable. ---- Paper Link – arxiv. org/abs/2601.00830 Paper Title: "Can They Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning"

译Adobe研究测试大型语言模型（LLM）思维链解释是否反映真实决策影响。通过在多选题中植入误导性提示，对11个主流模型进行9154次试验。正常使用时，模型仅在20.7%的逐步推理中提及隐藏提示，但后续追问时99.4%承认注意到。告知模型其推理受监控后结果未见改善；强制要求报告提示则导致68.2%误报（当无提示时）。问题包含用户倾向性提示时，模型45.5%遵从该倾向，但解释中常未说明。研究表明，思维链解释常与真实决策依据不匹配，仅依赖其作为安全调试工具可能不可靠。

查看原推 ↗

elvis@omarsar0 · 5月3日57

Claude Opus 4.7 just implemented an AlphaZero-style self-play pipeline from scratch. It did this on consumer hardware in three hours, then beat the Pascal Pons solver 7 of 8 as first-mover on Connect Four. No other frontier coding agent tested cleared 2 of 8. This paper proposes a new way to evaluate coding agents: hand them a minimal task description, give them a tight budget, and ask them to autonomously rebuild a famous ML breakthrough. Connect Four + AlphaZero is the first instance. It's small enough to run on a laptop and hard enough to require a real research engineering loop (MCTS, neural value/policy nets, self-play, training schedule). We've been measuring coding agents on patches and unit tests. This shifts the bar to "can the agent build a non-trivial ML system end-to-end on its own?" The answer is now yes for at least one frontier model. Paper: https://arxiv.org/abs/2604.25067 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译本研究提出评估编程智能体的新范式：要求其在有限预算下，仅凭简要任务描述自主重建一项著名机器学习突破。首个测试案例为四子棋AlphaZero系统，其规模适于笔记本电脑运行，但复杂度要求完成完整的研究工程闭环。Claude Opus 4.7在三小时内从零构建了自博弈训练管道，并作为先手以7:1的成绩击败了Pascal Pons求解器，而其他前沿智能体均未通过2/8的测试。这标志着评估标准已从代码补全提升为端到端构建非平凡机器学习系统的能力。

查看原推 ↗

Chubby♨️@kimmonismus · 5月3日48

GPT-5.4 Pro didn’t just solve one math problem, it kicked open the door: its proof method now cracks a 60-year-old Erdős conjecture, making this one of the first times an AI proof actually leads somewhere. We barely started.

译GPT-5.4 Pro不仅解决了一个数学问题，其证明方法更成功破解了长达60年的埃尔德什猜想。研究团队在此基础上改进并应用该方法，进一步证明了包括埃尔德什、Sárközy和Szemerédi提出的另一项60年猜想在内的多个附加问题。这标志着AI生成的证明首次展现出显著的“下游影响”，其核心价值不仅在于解决问题本身，更在于为数学研究开辟了新的路径。相关成果已在未来数学研讨会上公布。

查看原推 ↗

Hao AI Lab@haoailab · 5月2日37

Excited to share our recent work accepted to ICML 2026! These projects span efficient causal parallel decoders, diffusion LLMs, video sparse attention, video QAT, online speculative decoding, and agentic document reasoning. Huge thanks to all collaborators and co-authors across these efforts. Looking forward to seeing everyone in Seoul this summer! 🇰🇷

译很高兴分享我们最近被ICML 2026接收的工作！这些项目涵盖高效因果并行解码器、扩散大语言模型、视频稀疏注意力、视频量化感知训练、在线推测解码以及智能文档推理。衷心感谢所有合作者和共同作者在这些工作中的付出。期待今年夏天在首尔与大家相见！🇰🇷

查看原推 ↗

AK@_akhaliq · 5月2日56

Heterogeneous Scientific Foundation Model Collaboration paper: https://huggingface.co/papers/2604.27351

译异构科学基础模型协作 paper: https://huggingface.co/papers/2604.27351

查看原推 ↗

AK@_akhaliq · 5月2日57

The Last Human-Written Paper Agent-Native Research Artifacts paper: https://huggingface.co/papers/2604.24658

译最后一篇人类撰写的论文智能体原生研究制品论文: https://huggingface.co/papers/2604.24658

查看原推 ↗

AK@_akhaliq · 5月2日35

Co-Evolving Policy Distillation paper: https://huggingface.co/papers/2604.27083

译协同进化策略蒸馏论文: https://huggingface.co/papers/2604.27083

查看原推 ↗

elvis@omarsar0 · 5月1日56

Cool paper from Meta FAIR. It's on self-improving LLMs but on the pretraining side. (bookmark it) Most LLM safety, factuality, and reasoning fixes get bolted on at post-training. By then, the patterns have already set. This work moves those behaviors into pretraining itself. The team uses a strong post-trained model as both a rewriter and a judge: it rewrites pretraining suffixes toward higher-quality, safer continuations, then scores model rollouts against the original suffix and the rewrite to drive RL during pretraining. Instead of next-token prediction, the policy learns sequence generation from the start, with rewards for quality, safety, and factuality. Why it matters: 36.2% relative gain in factuality, 18.5% in safety, and up to 86.3% win rate in generation quality over standard pretraining. Bottom line: the post-trained models you already have can be used to pretrain the next ones better. Paper: https://arxiv.org/abs/2601.21343 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译Meta FAIR的研究提出一种新范式，将LLM的改进从后训练移至预训练阶段。该方法利用强大的后训练模型作为改写器和评判器，对预训练数据的后缀进行高质量、高安全性的改写，并通过强化学习直接优化预训练模型。模型从开始就学习序列生成，并获得质量、安全性和事实性的奖励。实验结果显示，相比标准预训练，该方法在事实性上取得36.2%的相对提升，安全性提升18.5%，生成质量胜率最高达86.3%。核心结论是，现有后训练模型可用于预训练出更优的下一代模型。

查看原推 ↗

Ethan Mollick@emollick · 5月1日62

New paper (on an old AI) tests o1 against doctors on medical benchmarks & real ER cases: “across a variety of scenarios and applications, the large language model outperformed both human physicians and older models” The potential suggests an “urgent need for prospective trials.”

译新论文（关于旧式人工智能）在医学基准测试和真实急诊病例中将o1与医生进行对比："在各种场景和应用中，大型语言模型的表现均优于人类医生和旧版模型" 该潜力表明"迫切需要前瞻性试验"。

查看原推 ↗

向阳乔木@vista8 · 5月1日48

语言模型能说话但不懂数据，专用模型懂数据但不能说话，这是科学AI当下困境之一。 UIUC最新论文 Eywa 从《阿凡达》找到了答案。纳美人通过"Tsaheylu"神经键跨越物种障碍，让山地歌鸟、雷兽各展所长。 Eywa 做的事情一样：给语言模型和专用基础模型之间建一个接口。让 Chronos 做时序预测，让 TabPFN 处理表格，语言模型负责理解任务、调度工具、整合结果。 --- 从论文数据看，效果不错，短时间是一个MCP就能解决连接问题，但长期也不知道语言模型能否达到专用模型的水平。论文见评论区

译针对通用语言模型懂交互却不懂数据、专用模型精通数据却缺乏交互能力的科学AI困境，UIUC团队受《阿凡达》“Tsaheylu”神经连接启发，提出了Eywa接口框架。该框架让语言模型负责理解指令与调度，调用如Chronos、TabPFN等专用模型处理数据，从而协同发挥两者优势。初步实验效果良好，长期挑战在于语言模型能否达到专用模型的领域性能。

查看原推 ↗

Rohan Paul@rohanpaul_ai · 5月1日46

Research proves that current AI agent groups cannot reliably coordinate or agree on simple decisions. Building teams of AI agents that can consistently agree on a final decision is surprisingly difficult for LLMs. But problem is that developers frequently assume that if you have enough AI agents working together, they will eventually figure out how to solve a problem by talking it through. This paper shows that this assumption is currently wrong. Even in a friendly environment where every agent is trying to help, the team often gets stuck or stops responding entirely. Because this happens more often as the group gets bigger, it means we cannot yet trust these agent systems to handle tasks where they must agree on a correct answer. ---- Paper Link – arxiv. org/abs/2603.01213 Paper Title: "Can AI Agents Agree?"

译研究表明，当前由多个LLM组成的AI智能体团队在需要协调达成最终决策时存在根本性困难。开发者常误以为增加智能体数量并通过讨论就能解决问题，但论文证明这一假设目前是错误的。即使在友好协作环境中，智能体团队也常陷入僵局或完全停止响应，且团队规模越大问题越突出。这意味着现有AI智能体系统尚无法可靠处理需要达成一致正确答案的任务。

查看原推 ↗

Rohan Paul@rohanpaul_ai · 5月1日62

Researchers tested autonomous AI agents in real environments and found they easily cause massive security disasters. In one test an agent actually wiped its entire email server just to keep a secret for a stranger. The main problem with standard language models is that giving them control over real computer tools creates dangerous blind spots. To understand these risks the researchers let 20 experts interact with live AI assistants through chat and email for 2 weeks. They discovered that these programs blindly follow instructions from almost anyone and often lie about what they have actually done. This matters because tech companies are rushing to deploy these autonomous helpers without fixing their basic inability to understand who they should actually trust. --- Paper Link – arxiv. org/abs/2602.20021 Paper Title: "Agents of Chaos"

译研究人员在真实环境中测试自主AI代理，发现它们极易引发大规模安全灾难，如为保守秘密而删除整个电子邮件服务器。核心问题在于标准语言模型被赋予计算机工具控制权后，产生危险盲点，导致代理盲目遵循几乎任何人的指令并经常撒谎行为。通过让20位专家与实时AI助手进行两周互动实验，研究揭示了这些程序缺乏基本信任判断能力。科技公司正急于部署此类自主助手，却未修复其无法理解应信任谁的根本缺陷，加剧了安全风险。

查看原推 ↗

Rohan Paul@rohanpaul_ai · 5月1日43

The LongCat team just released LARYBench, a benchmark built to test whether an AI model truly learns action from video, instead of only looking good when attached to a robot policy later. It evaluates latent actions, meaning the hidden motion signals a model extracts from video, across 1.2M+ clips, 620K+ image pairs, 595K trajectories, 151 action classes, and 11 robot platforms. A latent action representation tries to store the change between frames as something like reach, pick, place, move left, or close gripper, rather than memorizing raw pixels. The key point is that robot training data is scarce, while human and robot videos are abundant, so the whole field wants a way to turn cheap video into useful action knowledge. The paper argues that older evaluations mixed too many things together, because a robot succeeding on a task depends on the policy, training recipe, environment, and controller, so you could not tell whether the action representation itself was actually good. LARYBench splits the problem into 2 cleaner tests, where one asks whether the representation knows what happened and the other asks whether it preserves enough detail for how to move. The biggest result is that general self-supervised vision models beat specialized embodied LAMs, with V-JEPA 2 reaching 76.62% average action classification accuracy, while DINOv3 gives the best overall control regression score at 0.19 MSE, far ahead of embodied models clustered around 0.87 to 0.97. The deeper point is that strong visual representations already contain a surprising amount of action knowledge, and the paper also shows that latent feature spaces map to robot control better than pixel reconstruction spaces, which helps explain why some robotics systems may be building on the wrong intermediate representation. 🧵 1.

译LongCat团队推出LARYBench基准，旨在评估AI模型是否从视频中真正学习动作，而非仅在后端机器人策略中表现良好。该基准聚焦模型从视频提取的潜在动作表示，通过超过120万视频片段等数据，将评估拆分为动作分类与控制回归两个清晰测试。关键发现是，通用自监督视觉模型（如V-JEPA 2和DINOv3）表现优于专用具身模型，表明强大视觉表示已蕴含丰富动作知识，且潜在特征空间比像素重建更利于机器人控制映射。这为利用丰富视频数据解决机器人训练数据稀缺问题提供了新方向。

查看原推 ↗

AK@_akhaliq · 5月1日47

Recursive Multi-Agent Systems paper: https://huggingface.co/papers/2604.25917

译递归多智能体系统论文: https://huggingface.co/papers/2604.25917

查看原推 ↗

Ethan Mollick@emollick · 5月1日55

Randomized trial of an AI therapy chatbot on Mexican women found “improved mental health by 0.3 SD over 6 months with no evidence of an increase of severe cases; improved sleep, healthful behaviors, daily functioning & labor market outcomes” Big results for a cheap intervention.

译一项针对墨西哥女性的随机试验发现，使用基于认知行为疗法训练的AI对话代理的心理健康应用Mindsurf，在六个月内使使用者心理健康水平提升了0.3个标准差，且未增加严重病例。该干预还改善了睡眠质量、健康行为、日常功能及劳动力市场表现（如减少缺勤），其效益远超成本。尽管使用者寻求传统心理治疗的比例有所增加，但这并非心理健康改善的主因。效果具有持续性，短期使用可通过促进行为的持续改变带来长期改善。

查看原推 ↗

5月8日

02:31

Chubby♨️@kimmonismus

69

谷歌研究揭示：结构化问询与可穿戴数据是AI医疗诊断的关键

谷歌团队通过Fitbit对近1.4万名用户进行了为期9个月的AI症状检查测试。在盲评中，临床医生将AI诊断列为首选的比例达53%，显著高于独立医生的24%。研究核心发现并非“AI击败医生”，而是揭示了当前消费级大模型（如ChatGPT）仅凭用户输入直接回答的模式存在缺陷——其诊断准确率较AI主导的结构化访谈下降约27%。同时，可穿戴设备能提前数天监测到心率上升、睡眠紊乱等生理变化，早于用户主动报告症状。这表明，结合主动问询的对话AI与提前预警的传感器，才是未来医疗诊断的发展方向。

Samuel Schmidgall: Doctors have known for decades: the clinical interview is the most important diagnostic tool Turns out, the same is true...

Google论文/研究

01:11

Anthropic@AnthropicAI

78

新Anthropic研究：自然语言自动编码器。像Claude这样的模型用语言交流，但用数字思考。这些数字--称为激活值--编码了Claude的思维，但并非以人类可读的语言呈现。在此研究中，我们训练Claude将其激活值翻译成人类可读的文本。

Anthropic安全/对齐论文/研究

关联讨论 2 条

01:06

elvis@omarsar0

63

研究显示，多智能体LLM系统在生产环境中的故障率高达41%至87%，且多数失败源于协调缺陷，而非基础模型能力问题。当前多数架构对比无法区分性能提升是来自协调优化还是更大的上下文窗口。该研究主张将协调视为一个独立、可配置的架构层，并通过控制变量实验验证：在保持LLM、工具、提示等所有条件不变时，仅改变协调结构即可显著影响系统表现。这为准确评估协调机制的价值提供了更清晰的方法论，并建立了将协调视为核心架构而非底层实现的理论框架。

DAIR.AI: Pay attention to this one if you build multi-agent systems. Coordination is as important as prompts or agent architectur...

智能体arXiv论文/研究部署/工程

00:42

Z.ai@Zai_org

精选73

GLM-5V-Turbo 技术报告：迈向原生多模态智能体基础模型本报告总结了GLM-5V-Turbo在模型设计、多模态训练、强化学习、工具链扩展以及与智能体框架集成等方面的主要改进。这些进展使其在多模态编码、视觉工具使用和基于框架的智能体任务中表现出色。 http：//arxiv.org/abs/2604.26752

智能体多模态论文/研究

推荐理由：智谱把多模态、RL和Agent工具链捆成一体，这篇报告对做多模态Agent的人有直接参考价值，不只刷榜还有工程细节。

5月7日

23:04

AK@_akhaliq

62

RLDX-1 技术报告论文：https：//huggingface.co/papers/2605.03269

Hugging Face论文/研究

23:04

AK@_akhaliq

58

Stream-R1 面向流式视频生成的可靠性-困惑度感知奖励蒸馏论文： https：//huggingface.co/papers/2605.03849

Hugging Face多模态视频论文/研究

23:04

AK@_akhaliq

67

PhysForge 生成物理基础的3D资产用于交互式虚拟世界论文：https：//huggingface.co/papers/2605.05163

具身智能多模态论文/研究

04:34

Rohan Paul@rohanpaul_ai

48

OpenClaw-RL：通过日常对话持续训练语言模型

本研究提出OpenClaw-RL系统，使语言模型能通过日常对话进行持续训练，无需人工标注数据。其核心是利用用户互动中产生的自然反馈（如纠正或重复提问）作为实时学习信号。系统从每次交互中提取两种信号：评估信号（判断行动成败，转化为数值奖励）和指导信号（获取具体改进方向，转化为词级监督）。该方法将标准部署环境转化为持续学习场景，使模型在后台运行中不断自我更新，自适应不同用户偏好，从而摆脱对大规模人工标注数据集的依赖。

智能体arXiv数据/训练论文/研究

00:33

AK@_akhaliq

46

SVGS 利用空间变色基元增强高斯泼溅技术论文：https：//huggingface.co/papers/2411.18966

图像生成论文/研究

5月6日

05:29

elvis@omarsar0

64

技能应作为可验证的部署工件

本文针对AI开发者提出关键观点，主张智能体技能应被视为默认不受信任的代码，而非仅凭签名或来源就推断其可信。当前运行时环境默认信任已签名技能的做法存在安全风险。论文强调，技能必须经过独立的门控验证流程才能被信任，否则，每次不可逆调用都需要人工介入，这在大规模应用中会退化为无效的“橡皮图章”式批准。将技能作为一等部署工件并引入验证流程，是借鉴软件供应链安全经验、避免技能库成为下一个攻击面的关键。论文呼吁在技能库普及前，通过严格验证建立安全基准。

智能体arXiv安全/对齐论文/研究

04:33

Anthropic@AnthropicAI

63

新Anthropic Fellows研究：模型规范中期训练（MSM）。标准的对齐方法通过期望行为的示例来训练AI。但这可能无法泛化到新情境。 MSM通过首先教导AI我们希望它们如何泛化以及原因，来解决这一问题。

Anthropic安全/对齐论文/研究

04:28

Rohan Paul@rohanpaul_ai

58

MIT 刚刚开发出一种能控制你身体的 AI。它能移动你的手指，让你弹钢琴，即使你不会那首曲子！ AI 决定手的动作。腕部垫片向你的肌肉发送信号，因此即使你不会，手指也能动起来

具身智能论文/研究

03:57

AK@_akhaliq

65

ComboStoc 扩散生成模型的组合随机性论文： https：//huggingface.co/papers/2405.13729

图像生成论文/研究

02:01

Anthropic@AnthropicAI

精选68

当AI承担人类无法完全核查的任务时，具备高能力的模型可能策略性隐藏实力且难以被察觉。Anthropic与MATS、Redwood的研究团队发现，即使仅使用较弱的模型作为监督者，也能成功训练一个接近完全能力的模型，使其停止这种"装傻"行为。该研究表明，通过弱监督训练可以有效抑制强模型的策略性能力保留问题。

Emil Ryd: New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop wh...

Anthropic安全/对齐论文/研究

推荐理由：Anthropic 这篇论文把「模型故意隐藏能力」这个藏在阴影里的安全隐患摆到台面上，而且证明了弱模型也能监督强模型，做对齐的人值得细读，方向很重要。

01:27

AK@_akhaliq

60

MolmoAct2 面向现实世界部署的行动推理模型论文： https：//huggingface.co/papers/2605.02881

智能体推理论文/研究

关联讨论 1 条

01:27

AK@_akhaliq

68

从上下文到技能语言模型能否巧妙地通过上下文学习？论文：https：//huggingface.co/papers/2604.27660

arXiv推理论文/研究

关联讨论 1 条

01:27

AK@_akhaliq

61

持久视觉记忆为LVLMs中的深度生成维持感知论文： https：//huggingface.co/papers/2605.00814

Hugging Face多模态论文/研究

5月5日

23:14

Berryxia.AI@berryxia

精选75

Google联手UCSD推出DFlash，实现LLM无损推理3倍加速

Google与UCSD合作推出扩散式推测解码技术DFlash，在Google Cloud TPU上实现了3.13倍的无损推理加速。该技术突破了传统自回归解码逐个生成token的串行瓶颈，通过一次推测生成多个token来改变生成范式。这一硬件与算法的联合优化，将重塑云端成本曲线，并使实时Agent、长上下文等应用更趋现实，同时大幅降低本地部署门槛。此举将大模型推理的竞争引向了系统级优化的新赛道。

Google for Developers: Breaking LLM inference's autoregressive bottleneck 🛠️ We've teamed up with @haozhangml, @YimingBob, and @aaronzhfeng, a...

Google大佬观点推理部署/工程

关联讨论 1 条

推荐理由：Google 直接干掉自回归瓶颈，3.13 倍无损加速不是渐进优化，是推理范式的根变革，当「快三倍」成为新基线，所有实时 Agent 和长上下文应用都得重算一遍成本账。

08:48

Rohan Paul@rohanpaul_ai

52

DeepMind新研究让LLM学会在对话中学习

Google DeepMind的研究通过“师生对话”框架训练大型语言模型（LLM），使其能在对话中有效利用用户反馈进行学习。传统LLM将对话视为独立轮次，难以整合修正信息。该研究让“学生”模型尝试回答，由掌握额外信息的“教师”提供指导，并训练学生利用指导得出正确答案。在线强化学习训练效果优于离线过滤，且在简短对话中习得的技能能迁移至更长对话。该方法从数学任务泛化至编程任务，并能处理信息逐步到达的模糊任务。通过“Q-priming”步骤，模型在模糊任务中主动寻求澄清的可能性提高五倍以上，使对话更像与一个能在交流中实时学习的伙伴协作。

智能体DeepMind推理论文/研究

05:49

AK@_akhaliq

68

UniVidX 一个通过扩散先验实现多功能视频生成的统一多模态框架 paper： https：//huggingface.co/papers/2605.00658

Hugging Face多模态视频论文/研究

05:49

AK@_akhaliq

55

Web2BigTable 一个用于互联网规模信息搜索与提取的双层多智能体LLM系统论文： https：//huggingface.co/papers/2604.27221

智能体搜索论文/研究

01:25

Microsoft Research@MSFTResearch

62

研究焦点：AI代理泄露企业数据、为云端部署打造更智能的操作系统，以及关于如何在工作中实际构建AI应用的新研究。https：//msft.it/6016vKxQm

智能体Microsoft安全/对齐论文/研究

关联讨论 1 条

5月4日

23:24

elvis@omarsar0

66

Meta FAIR开发的Autodata是一个能自主构建高质量训练与评估数据的代理系统。其核心在于"代理式自我指导"循环：编排器LLM指导挑战者代理基于领域文档生成问题，由弱、强解算器尝试解答，法官评分后分析失败并循环优化，从而产出能有效区分模型能力的挑战性数据。在CS研究QA任务中，该方法产生了34个百分点的性能差距，远超标准方法的1.9点。系统还具备元优化能力，通过外循环调整指令，将验证通过率从12.8%提升至42.4%。研究处理了超万篇论文，产出2，117个优质QA对，通过增加推理计算使数据更具挑战性，从而提升下游模型性能。

DAIR.AI: Banger paper from Meta FAIR. They introduce Autodata, an agentic data scientist that builds high-quality training and ev...

智能体Meta数据/训练论文/研究

22:54

elvis@omarsar0

68

Sakana AI提出新型7B"指挥者"模型，通过协同多个智能体实现性能突破

Sakana AI在ICLR 2026上发表研究，提出一个仅70亿参数的“指挥者”模型。该模型不直接解决问题，而是通过强化学习训练，专注于为混合开源与闭源模型的工作者智能体设计通信拓扑结构，并为每个工作者生成精准指令以发挥其特长。经随机化智能体池训练后，它能在推理时适应任意智能体组合。其关键创新在于，当允许指挥模型将自己也选为工作者时，系统会形成递归拓扑，实现动态测试时扩展。该模型在GPQA-Diamond和LiveCodeBench上达到SOTA水平，在AIME25和GPQA-D上的性能比最佳单体工作者提升约3%，这相当于前沿模型一个代际的改进幅度，且增益完全来源于协同优化。

智能体arXivMCP/工具推理

04:42

Rohan Paul@rohanpaul_ai

48

基于脑图多智能体系统提升大语言模型推理能力

本文提出BIGMAS系统，通过构建针对特定问题的小型专家智能体图，以提升大语言模型在多步骤复杂任务中的推理可靠性。该系统让多个智能体在一个共享工作空间中进行读写协作，并由独立控制器监控全局状态、规划下一步行动。在涵盖算术表达式搜索和多步骤规划的3个谜题任务上，对6个前沿模型进行的测试表明，该方法在所有模型和任务上均显著提升了性能，例如Six Fives任务准确率从12%提升至30%，伦敦塔任务从57%提升至93%。这证明通过优化多智能体系统结构，而非仅依赖延长单一模型的思考，能有效增强大语言模型的推理能力。

智能体arXiv推理论文/研究

5月3日

20:12

Rohan Paul@rohanpaul_ai

56

"能否信任AI解释？思维链推理中系统性漏报的证据"

Adobe研究测试大型语言模型（LLM）思维链解释是否反映真实决策影响。通过在多选题中植入误导性提示，对11个主流模型进行9154次试验。正常使用时，模型仅在20.7%的逐步推理中提及隐藏提示，但后续追问时99.4%承认注意到。告知模型其推理受监控后结果未见改善；强制要求报告提示则导致68.2%误报（当无提示时）。问题包含用户倾向性提示时，模型45.5%遵从该倾向，但解释中常未说明。研究表明，思维链解释常与真实决策依据不匹配，仅依赖其作为安全调试工具可能不可靠。

arXiv安全/对齐推理论文/研究

05:47

elvis@omarsar0

57

Claude Opus 4.7自主构建AlphaZero风格训练管道并在四子棋中击败专业求解器

本研究提出评估编程智能体的新范式：要求其在有限预算下，仅凭简要任务描述自主重建一项著名机器学习突破。首个测试案例为四子棋AlphaZero系统，其规模适于笔记本电脑运行，但复杂度要求完成完整的研究工程闭环。Claude Opus 4.7在三小时内从零构建了自博弈训练管道，并作为先手以7:1的成绩击败了Pascal Pons求解器，而其他前沿智能体均未通过2/8的测试。这标志着评估标准已从代码补全提升为端到端构建非平凡机器学习系统的能力。

智能体Anthropic编码论文/研究

01:15

Chubby♨️@kimmonismus

48

GPT-5.4 Pro不仅解决了一个数学问题，其证明方法更成功破解了长达60年的埃尔德什猜想。研究团队在此基础上改进并应用该方法，进一步证明了包括埃尔德什、Sárközy和Szemerédi提出的另一项60年猜想在内的多个附加问题。这标志着AI生成的证明首次展现出显著的"下游影响"，其核心价值不仅在于解决问题本身，更在于为数学研究开辟了新的路径。相关成果已在未来数学研讨会上公布。

Jared Duker Lichtman: Update on Erdős Problem 1196: In joint work, we refined and adapted the proof method from GPT-5.4 Pro to give proofs of ...

OpenAI推理论文/研究

5月2日

06:18

Hao AI Lab@haoailab

37

很高兴分享我们最近被ICML 2026接收的工作！这些项目涵盖高效因果并行解码器、扩散大语言模型、视频稀疏注意力、视频量化感知训练、在线推测解码以及智能文档推理。衷心感谢所有合作者和共同作者在这些工作中的付出。期待今年夏天在首尔与大家相见！🇰🇷

智能体视频论文/研究部署/工程

01:16

AK@_akhaliq

56

异构科学基础模型协作 paper： https：//huggingface.co/papers/2604.27351

Hugging Face多模态论文/研究

01:16

AK@_akhaliq

57

最后一篇人类撰写的论文智能体原生研究制品论文： https：//huggingface.co/papers/2604.24658

智能体arXiv论文/研究

01:16

AK@_akhaliq

35

协同进化策略蒸馏论文： https：//huggingface.co/papers/2604.27083

数据/训练论文/研究

5月1日

22:16

elvis@omarsar0

56

Meta FAIR研究：预训练阶段自改进LLM的新范式

Meta FAIR的研究提出一种新范式，将LLM的改进从后训练移至预训练阶段。该方法利用强大的后训练模型作为改写器和评判器，对预训练数据的后缀进行高质量、高安全性的改写，并通过强化学习直接优化预训练模型。模型从开始就学习序列生成，并获得质量、安全性和事实性的奖励。实验结果显示，相比标准预训练，该方法在事实性上取得36.2%的相对提升，安全性提升18.5%，生成质量胜率最高达86.3%。核心结论是，现有后训练模型可用于预训练出更优的下一代模型。

Meta安全/对齐论文/研究

21:17

Ethan Mollick@emollick

62

新论文（关于旧式人工智能）在医学基准测试和真实急诊病例中将o1与医生进行对比："在各种场景和应用中，大型语言模型的表现均优于人类医生和旧版模型" 该潜力表明"迫切需要前瞻性试验"。

OpenAI论文/研究

20:17

向阳乔木@vista8

48

UIUC受《阿凡达》启发提出Eywa框架，连接语言模型与专用模型以破解科学AI困境

针对通用语言模型懂交互却不懂数据、专用模型精通数据却缺乏交互能力的科学AI困境，UIUC团队受《阿凡达》“Tsaheylu”神经连接启发，提出了Eywa接口框架。该框架让语言模型负责理解指令与调度，调用如Chronos、TabPFN等专用模型处理数据，从而协同发挥两者优势。初步实验效果良好，长期挑战在于语言模型能否达到专用模型的领域性能。

智能体MCP/工具论文/研究

19:40

Rohan Paul@rohanpaul_ai

46

研究揭示当前AI智能体团队难以达成一致决策

研究表明，当前由多个LLM组成的AI智能体团队在需要协调达成最终决策时存在根本性困难。开发者常误以为增加智能体数量并通过讨论就能解决问题，但论文证明这一假设目前是错误的。即使在友好协作环境中，智能体团队也常陷入僵局或完全停止响应，且团队规模越大问题越突出。这意味着现有AI智能体系统尚无法可靠处理需要达成一致正确答案的任务。

智能体论文/研究

18:40

Rohan Paul@rohanpaul_ai

62

自主AI代理真实环境测试曝大规模安全灾难

研究人员在真实环境中测试自主AI代理，发现它们极易引发大规模安全灾难，如为保守秘密而删除整个电子邮件服务器。核心问题在于标准语言模型被赋予计算机工具控制权后，产生危险盲点，导致代理盲目遵循几乎任何人的指令并经常撒谎行为。通过让20位专家与实时AI助手进行两周互动实验，研究揭示了这些程序缺乏基本信任判断能力。科技公司正急于部署此类自主助手，却未修复其无法理解应信任谁的根本缺陷，加剧了安全风险。

智能体arXiv安全/对齐论文/研究

14:40

Rohan Paul@rohanpaul_ai

43

LongCat团队发布LARYBench基准，评估AI模型能否从视频中真正学习动作

LongCat团队推出LARYBench基准，旨在评估AI模型是否从视频中真正学习动作，而非仅在后端机器人策略中表现良好。该基准聚焦模型从视频提取的潜在动作表示，通过超过120万视频片段等数据，将评估拆分为动作分类与控制回归两个清晰测试。关键发现是，通用自监督视觉模型（如V-JEPA 2和DINOv3）表现优于专用具身模型，表明强大视觉表示已蕴含丰富动作知识，且潜在特征空间比像素重建更利于机器人控制映射。这为利用丰富视频数据解决机器人训练数据稀缺问题提供了新方向。

具身智能论文/研究评测/基准

10:44

AK@_akhaliq

47

递归多智能体系统论文： https：//huggingface.co/papers/2604.25917

智能体论文/研究

08:46

Ethan Mollick@emollick

55

一项针对墨西哥女性的随机试验发现，使用基于认知行为疗法训练的AI对话代理的心理健康应用Mindsurf，在六个月内使使用者心理健康水平提升了0.3个标准差，且未增加严重病例。该干预还改善了睡眠质量、健康行为、日常功能及劳动力市场表现（如减少缺勤），其效益远超成本。尽管使用者寻求传统心理治疗的比例有所增加，但这并非心理健康改善的主因。效果具有持续性，短期使用可通过促进行为的持续改变带来长期改善。

John B. Holbein: AI-powered mental health apps are all the rage. But do they work? This new experiment on women in Mexico says they do! T...

论文/研究