Artificial Analysis @ArtificialAnlys · May 28 · 71
Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%
ITBench-AA’s SRE tasks benchmark model performance on Kubernetes incident response, where models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by @IBM's Software Innovation Lab, leveraging IBM’s deep expertise in enterprise IT operations
Artificial Analysis has worked closely with IBM over the last 6 months to develop an implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time
ITBench-AA SRE overview:
➤ 59 SRE tasks in total: 40 public tasks and 19 brand new, held-out tasks
➤ Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident
➤ Faults span typical SRE failure modes including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions
Methodology details:
➤ Agentic harness: each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. 100-turn cap per task, 3 repeats per task
➤ Models submit a list of root-cause entities (Kubernetes Deployments, Services, Pods, etc.) they believe caused the incident. Each submission is compared against a ground-truth set of root causes provided by IBM Research
➤ Scoring uses average precision at full recall: if a model misses any of the ground-truth root causes, it scores 0.0 for that repeat. If it identifies all of them, it is awarded a score equal to its precision - the share of its submitted entities that are actual root causes, i.e. true positives / (true positives + false positives). The headline score is the average across 59 tasks × 3 repeats.
➤ The harness (Stirrup) is held constant across all evaluated models, allowing an apples-to-apples comparison between models.
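The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration of precision at full recall as described in the post, not the benchmark's actual scoring code; the function names and the entity strings in the usage note are hypothetical.

```python
def precision_at_full_recall(submitted, ground_truth):
    """Score one repeat: 0.0 unless every ground-truth root cause is submitted."""
    submitted, ground_truth = set(submitted), set(ground_truth)
    if not submitted or not ground_truth <= submitted:
        return 0.0  # any missed root cause (or an empty submission) scores zero
    # With full recall, true positives == len(ground_truth),
    # so precision is TP / (TP + FP) = len(ground_truth) / len(submitted).
    return len(ground_truth) / len(submitted)


def headline_score(per_repeat_scores):
    """Average over all task repeats (59 tasks x 3 repeats in the post)."""
    return sum(per_repeat_scores) / len(per_repeat_scores)
```

For example, submitting `["deployment/checkout", "service/cart"]` when the only true root cause is `"deployment/checkout"` scores 0.5, while submitting only `"service/cart"` scores 0.0. This is why over-investigating models lose points: every extra entity they report lowers precision.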
Key findings:
➤ Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%
➤ All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in our suite. For context, frontier models score considerably higher on Terminal-Bench
➤ Turn counts vary nearly 3x and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives
➤ GLM-5.1 (Reasoning) leads open weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%, ahead of Gemini 3.1 Pro Preview at 30%
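Summary: Artificial Analysis and IBM Research have jointly launched ITBench-AA, the first benchmark in a series evaluating AI agents on enterprise IT tasks, starting with Site Reliability Engineering (SRE). It comprises 59 Kubernetes incident-response tasks, and every frontier model scores below 50%. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. Among open-weights models, GLM-5.1 (Reasoning) scores 40%, effectively tied with Gemini 3.5 Flash, and DeepSeek V4 Pro scores 38%. The analysis also finds that turn counts vary nearly 3x across models, but longer trajectories do not guarantee higher accuracy.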
Qwen @Alibaba_Qwen · May 28 · 69
Fast, faster, Qwen. 🚀
Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners.
Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. 🤝✨
Dive into the full @PyTorch blog post below! 👇
https://pytorch.org/blog/up-to-580tps-new-speed-record-of-qwen3-5-397b-a17b-on-gpu-for-agentic-workloads-with-tokenspeed/
#Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance
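Summary: Qwen3.5 reached a record 580 tokens per second (tps) on agentic workloads with the TokenSpeed inference engine. The result came from the Qwen inference team, the lightseekorg Foundation TokenSpeed team, NVIDIA, and the Mooncake team, and it uses tri_dao's FlashAttention-4 (FA4) optimization. Details are in the PyTorch community blog post.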
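Berryxia.AI @berryxia · May 27 · 61
Tencent has a nice new benchmark called Chronicles-OCR.
Tencent's HY lab built it with four other institutions to test how well AI recognizes 3,000 years of ancient Chinese script.
It has 2,800 expert-annotated images covering seven categories: oracle bone script, bronze inscriptions, seal script, clerical script, regular script, running script, and cursive script.
The result: all 28 frontier multimodal models were wiped out.
The strongest VLM reached only 14% accuracy on oracle bone script.
The best end-to-end detection H-mean was just 16.5%.
GPT-5 and Gemini 2.5 Pro scored close to 0.
Even more counterintuitive, turning on reasoning mode made performance worse.
When perception fails, chain-of-thought amplifies hallucination instead.
The models are not really reading the characters; they are recognizing the medium.
Script-type classification accuracy reaches 96.7%, but that comes from seeing carriers such as turtle shells and bronze vessels, not from understanding the characters on them.
When it comes to the value held in intangible cultural heritage, AI has barely scratched the surface.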
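Summary: Tencent's HY lab and four other institutions released Chronicles-OCR, a benchmark that tests AI recognition of ancient Chinese script. It contains 2,800 expert-annotated images across seven categories, including oracle bone script and bronze inscriptions. All 28 frontier multimodal models performed poorly: the strongest VLM reached only 14% accuracy on oracle bone script, and GPT-5 and Gemini 2.5 Pro scored near zero. Notably, turning on reasoning mode hurt performance, because the models are recognizing carriers such as turtle shells and bronze vessels (96.7% accuracy) rather than the characters themselves.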
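Berryxia.AI @berryxia · May 27 · 55
MiniMax has been quiet for a while~
From what I saw yesterday, M3 looks ready to launch.
I just noticed the latest from MiniMax AI.
Six months ago, on December 23, they open-sourced the M2 model.
Over the past half year, the community has directly adopted several of their core systems: CISPO (Clipped Importance-Sampling-Weight Policy Optimization), the Forge RL System, and Self-Evolution.
Almost every model release has hit the top of Hugging Face.
Now they have systematically written up all the work behind M2 as a paper and posted it on arXiv.
It is not just a weights drop; it lays out the design thinking, training details, and system architecture in full.
This step actually matters a lot.
What the open-source community lacks most is often not a new model but a complete path that explains why it works.
MiniMax Head of DevRel Ryan Lee said in the post that it is time to turn to a new chapter.
M3 is on the way, and the MSA paper is coming soon.
They did not stop at chasing leaderboards; they distilled the pitfalls they hit and the solutions they validated over the past six months so that others can avoid detours.
This is what actually moves the open-source ecosystem forward.
Folks,
do you think the next stage of open-source large models is to keep competing on parameters and leaderboards, or to fully open up systems and methodology the way MiniMax has?
If M3 pushes this accumulated work one step further, where do you most want to see a breakthrough?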
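Summary: Six months after open-sourcing the M2 model, MiniMax has systematically published a paper covering all the work behind it, detailing the design thinking, training details, and system architecture. Its open-source systems CISPO, Forge RL System, and Self-Evolution had already been widely adopted by the community, and several model releases topped the Hugging Face rankings. MiniMax also announced that its next-generation model, M3, is ready, and that the MSA paper is coming soon.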
Saining Xie @sainingxie · May 27 · 69
📸 latest in our cambrian series: cambrian-p, p for pose.
i think pose is probably the minimal sufficient 3d signal (and it’s easy to get!) that we need for robust video multimodal models -- jointly modeling frames and pose turns image sequences into a globally grounded structure.
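Summary: The post introduces Cambrian-P, a multimodal large language model that natively integrates camera pose. Its core claim is that camera pose is an easy-to-obtain, minimal 3D signal that is sufficient for robust video understanding. By jointly modeling video frames and pose, the model turns an image sequence into a globally grounded structure. The quoted post notes that current multimodal LLMs are good at recognizing activities in video but still fall short on spatial structure and ego/object dynamics, and that camera pose is the key missing piece for closing that gap.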
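karminski-牙医 @karminski3 · May 27 · 69
What?! Skills can be "trained" now too?
Until now, people had AI write skills based on experience, and debugging meant running them a few times and calling it done if no bugs showed up.
But is a skill good just because it runs? So Microsoft, together with Shanghai Jiao Tong University, Fudan, Tongji, and other institutions, released a new framework called SkillOpt that has AI evaluate how well a skill is written and then keep optimizing it!
In the end, the skills written by this framework raised GPT-5.5's direct-chat accuracy by 23.5 points!
How the framework works is simple: it closes the loop on skill iteration with a harness! As soon as the model finishes writing a skill, it goes straight into scoring, and only skill changes that score higher are kept. It closely mirrors reinforcement learning for LLMs.
The framework's design is also worth borrowing for anyone building agent frameworks. For example:
It uses a separate optimizer model whose job is to write the skill. Based on the scores from the agent's trial-and-error runs on tasks, it edits the skill (adding, deleting, or replacing text).
Then comes the harness step: every text edit must improve the score on a separate validation set before it is allowed to merge.
Finally, and this is the best part, the framework borrows mechanisms from deep learning training and defines a text-level learning-rate budget. The core idea is to limit the model to changing only a small part of the skill each time and iterating slowly, rather than rewriting everything.
The most valuable data in the paper is here: the experiments found that a budget of 4 to 8 edit operations per step works best. The final best skill often contains only 1 to 4 accepted core modifications.
They even designed a rejected-edit buffer to store negative examples from training, plus periodic slow/meta updates: after each cycle the framework takes stock, which works like giving it memory so that later iterations hold up better.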
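The loop described above (propose a few edits, score on a validation set, keep only improvements, remember rejections) can be sketched as follows. This is a minimal illustration under stated assumptions, not SkillOpt's implementation: `propose_edits`, `apply_edits`, and `score_on_validation` are hypothetical callables supplied by the caller, and the default `edit_budget` of 4 is taken from the 4 to 8 range mentioned in the post.

```python
def optimize_skill(skill, propose_edits, apply_edits, score_on_validation,
                   rounds=10, edit_budget=4):
    """Validation-gated skill editing with a per-round edit budget."""
    best_skill = skill
    best_score = score_on_validation(best_skill)
    rejected = []  # buffer of edit batches that failed the validation gate

    for _ in range(rounds):
        # The optimizer sees the current skill and past rejections,
        # and may change only a small part of the skill per round.
        edits = propose_edits(best_skill, rejected)[:edit_budget]
        if not edits:
            break  # nothing left to try
        candidate = apply_edits(best_skill, edits)
        score = score_on_validation(candidate)
        if score > best_score:
            best_skill, best_score = candidate, score  # merge the edit
        else:
            rejected.append(edits)  # keep as a negative example

    return best_skill, best_score, rejected
```

The agent being evaluated never changes in this loop; only the skill text does, which is what lets the score difference be attributed to the edit.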
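The paper's conclusion is profound: a skill (prompt) fully deserves, and needs, a system-level training process.
The paper puts it directly: we argue that a skill should be "trained" as the external state of a frozen agent, and the training process should follow what "makes weight-space optimization reproducible"!
Does this mean the line between prompt engineering (Prompting) and model training (Training) will gradually blur, with prompt engineering fully entering the domain of machine learning? Maybe soon we will no longer need humans to hand-tweak and debug prompts!
Paper: http://arxiv.org/pdf/2605.23904
#skillopt #Microsoft #PromptEngineering #harness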
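Summary: Microsoft, together with Shanghai Jiao Tong University and other institutions, released the SkillOpt framework, which aims to optimize AI agent skills systematically through a machine-learning-style process. The framework introduces a separate optimizer model that edits the skill in a closed harness loop, and each edit is accepted only if it improves the score on a validation set. It sets a learning-rate budget of 4 to 8 edit operations per step, which keeps the number of core modifications to between 1 and 4. Experiments show the optimized skills raise GPT-5.5's chat accuracy by 23.5 points.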
Ant Ling @AntLingAGI · May 26 · 69
From IcePop to KPop: our team keeps pushing on RL training stability for large MoE models. 👇
KPop replaces the fixed-ratio mask with an adaptive binary-KL region that matches each token's inherent noise. More robust updates, stable long-horizon agentic RL.
Ring-2.6-1T → 76+ on SWE-bench Verified, pure RL.
Congrats to @Jia__Guo & team!
Blog: https://ringtech.notion.site/kpop
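Summary: The team released KPop, a technique for stabilizing RL training of large MoE models. It replaces the fixed-ratio mask of the earlier IcePop method with an adaptive binary-KL region that matches each token's inherent noise, giving more robust updates and supporting long-horizon agentic RL. In practice, the trillion-parameter Ring-2.6-1T model scored above 76 on SWE-bench Verified with pure RL training (no infrastructure changes or routing replay). KPop achieves this with a single key parameter.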
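Summary: The team introduced KPop to stabilize agentic RL training of large MoE models. It replaces the fixed-ratio mask of the earlier IcePop method with an adaptive masking mechanism based on binary KL divergence, which adjusts dynamically to the degree of training-inference mismatch during training. With this change, Ring-2.6-1T scored above 76 on SWE-bench Verified through pure RL training, without infrastructure changes or routing replay.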
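To make the contrast between a fixed-ratio mask and a binary-KL region concrete, here is a small sketch. It is an illustration only and should not be read as KPop's actual rule: the post does not give the formula, so the masking criterion, the thresholds `low`, `high`, and `delta`, and all function names are assumptions. The only standard piece is the KL divergence between two Bernoulli distributions.

```python
import math


def ratio_mask(p_train, p_infer, low=0.5, high=2.0):
    """Fixed-ratio mask: keep a token only if p_train / p_infer lies in [low, high]."""
    return low <= p_train / p_infer <= high


def binary_kl(q, p, eps=1e-12):
    """KL(Bernoulli(q) || Bernoulli(p)), with inputs clamped away from 0 and 1."""
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))


def binary_kl_mask(p_train, p_infer, delta=0.01):
    """Hypothetical adaptive mask: keep a token if the binary KL between the
    inference-engine and training-engine probabilities is at most delta."""
    return binary_kl(p_infer, p_train) <= delta
```

Under these assumed thresholds the two masks disagree in both directions. A rare token with probabilities 0.001 versus 0.003 fails the ratio test (ratio of about 0.33) but has a tiny binary KL, so the KL mask keeps it. A token with 0.9 versus 0.5 passes the ratio test (1.8) but has a large binary KL, so the KL mask drops it. The tolerated ratio therefore depends on how probable the token is, which is one way a region can adapt per token.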
Rohan Paul @rohanpaul_ai · May 26 · 61
Brilliant new paper from Meta, CMU and other labs.
Shows that coding agents improve faster by manufacturing their own software experience.
Coding agents can train themselves by making and fixing bugs inside real projects.
Most coding agents still learn from human leftovers: issues, pull requests, tests, comments, and benchmarks that describe what went wrong.
That is useful, but it makes the agent dependent on the rate at which humans produce clean, verifiable lessons.
Self-play SWE-RL changes the unit of learning from a labeled task to an executable situation.
One version of the model explores a real codebase, weakens tests, injects a meaningful bug, and leaves behind test artifacts that define the failure without needing an English issue description.
Another version of the same model has to repair the system, not by matching words to patches, but by restoring behavior under tests.
Here’s the key point: the test is not just a grader here, it is the language of the problem.
That matters because software understanding lives in constraints, dependencies, edge cases, and invariants that prose often compresses or misses.
The reported gains, +10.4 points on SWE-bench Verified and +7.8 on SWE-Bench Pro, are early but hard to ignore because evaluation still used natural-language issues the self-play system did not train on.
That suggests SSR (Self-play SWE-RL) is learning something deeper than issue phrasing, though not yet anything like open-ended mastery.
The restraint matters: generated bugs can be artificial, rewards can be noisy, and sandboxed repositories are still a narrow slice of software reality.
Still, the direction is sharp.
The next bottleneck for coding agents may not be more human-written tasks, but more ways for agents to encounter, create, survive, and learn from failure.
----
Paper Link – arxiv.org/abs/2512.18552
Paper Title: "Toward Training Superintelligent Software Agents through Self-Play SWE-RL"
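Summary: In this paper, Meta, CMU, and other institutions propose Self-play SWE-RL. The method lets coding agents generate their own training data through self-play instead of relying only on human-labeled issues. Concretely, one model explores a codebase, injects a bug, and leaves behind tests that describe the problem; another model learns to repair the system according to those tests. The tests become the core language for describing the problem. The method gains +10.4 points on SWE-bench Verified and +7.8 on SWE-Bench Pro. Notably, evaluation used natural-language issues the system was not trained on, which suggests it may have learned a deeper understanding of software.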
Ant Ling @AntLingAGI · May 26 · 62
SwiGLU is everywhere in modern LLMs, but for large inputs it behaves like x². That quadratic blow-up inflates activations, amplifies outliers, and makes deep-network or low-precision (FP8/FP4) training prone to loss spikes.
We propose PowLU, a drop-in activation built for stable large-scale pre-training. 🧵
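The quadratic growth is easy to check numerically. The sketch below uses a scalar SwiGLU, silu(w_gate · x) · (w_up · x), with both weights set to 1 as a simplifying assumption; real layers use weight matrices, and the function names here are illustrative. PowLU itself is not shown, since the post does not define it.

```python
import math


def silu(z):
    """SiLU / swish, written to avoid overflow for large negative z."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)


def swiglu_scalar(x, w_gate=1.0, w_up=1.0):
    """Scalar SwiGLU: the gate branch silu(w_gate * x) multiplies the linear branch w_up * x."""
    return silu(w_gate * x) * (w_up * x)


if __name__ == "__main__":
    for x in (1.0, 10.0, 100.0):
        y = swiglu_scalar(x)
        print(f"x={x:6.1f}  swiglu={y:12.4f}  swiglu/x^2={y / x**2:.6f}")
```

For large positive x, silu(x) approaches x, so the product approaches x² and the printed ratio approaches 1. Doubling a large input roughly quadruples the activation, which is the outlier amplification the post describes.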
X.PIN @thexpin · May 26 · 67
Huawei plans to scale AI chips without smaller nodes.
A new paper by Huawei's He Tingbo, "A Time Scaling Theory for Multi-Layer Electronic Systems," outlines how they'll advance Ascend AI chips as transistor shrinking slows down.
Instead of next-gen lithography, Huawei will scale its Ascend SuperPoD line through ~2030 by packing mature tech across the 2025 910C, 2026 950, and 990:
🔹 Chiplets
🔹 2.5D fan-out packaging
🔹 3D stacking (via micro-bumps & hybrid bonding)
Around 2030, Ascend 990 will debut LogicFolding in AI accelerators, aiming for a 100x integration leap by 2035.
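Summary: Huawei will scale its Ascend AI chips through packaging and architecture innovation rather than smaller process nodes. According to He Tingbo's paper, Huawei plans to advance its Ascend SuperPoD line from 2025 to 2030 using chiplets, 2.5D fan-out packaging, and 3D stacking, with products including the 2025 910C, the 2026 950, and the later 990. Around 2030, Ascend 990 will introduce LogicFolding, with a goal of a 100x integration leap by 2035.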
Rohan Paul @rohanpaul_ai · May 26 · 59
One engineering challenge in dexterous robot hands is balancing strength and speed.
Here is a SharpaWave performing rapid hand cycles at over 4x/sec. The Dynamic Tactile Array uses visuo-tactile sensing: each fingertip integrates a camera and 1,000+ tactile pixels.
Rohan Paul @rohanpaul_ai · May 26 · 65
This Meta + Stanford + Illinois survey paper argues that AI agents work better when code becomes their main working layer.
The problem is that an LLM by itself is mostly a text predictor, so long tasks can lose state, hide mistakes, and turn plans into actions in fragile ways.
The real advance is not “AI writes code,” but “AI uses code as the environment it thinks inside.”
The authors call the surrounding system an agent harness, meaning the tools, memory, sandboxes, checks, and feedback loops that turn a model into an agent.
Their core idea is that code should sit at the center of that harness, because code can be run, inspected, checked, saved, edited, and shared.
Tests become sensors.
Repositories become memory.
Logs become history.
Sandboxes become boundaries.
A generated script is no longer merely an answer; it is a handle the system can run, check, revise, share, and roll back.
The main finding is a pattern across many fields: code helps agents reason through executable steps, act through tool calls or control programs, and model environments through tests, traces, logs, repositories, and simulators.
----
Paper Link – arxiv.org/abs/2605.18747
Paper Title: "Code as Agent Harness"
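Summary: A paper from Meta, Stanford, and Illinois argues that AI agents perform better when code is their main working layer. It holds that a large language model (LLM), as a text predictor, suffers from lost state and hidden mistakes on long tasks. The real advance is not "AI writes code" but "AI thinks inside a code environment." The paper's core proposal is a code-centered "agent harness," meaning the tools, memory, sandboxes, and related systems. In this harness, tests become sensors, repositories become memory, logs become history, and sandboxes become boundaries. A generated script becomes a handle that can be run, checked, revised, and shared. The overall finding is that code helps agents reason through executable steps, act through tool calls, and model environments through tests, logs, and similar artifacts.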
elvis @omarsar0 · May 25 · 66
New research from Microsoft Research
I see a lot of AI engineers handwriting agent skill docs and hoping they generalize.
Probably not optimal. This work shows why.
It treats the skill doc as a trainable external state of a frozen agent instead.
It introduces SkillOpt, where an optimizer model makes validation-gated edits to the skill file. It adds, deletes, or replaces instructions, with a textual learning rate that controls how aggressively each round rewrites the doc. The agent itself never changes.
SkillOpt is best or tied on all 52 (model, benchmark, harness) cells.
On GPT-5.5 it adds 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code over no skill. It beats human-written skills, TextGrad, GEPA, and EvoSkill, carries zero extra inference-time cost, and the learned skills transfer across models and harnesses.
Paper: https://arxiv.org/abs/2605.23904
Learn to build effective AI agents in our academy: https://academy.dair.ai/
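Summary: Microsoft Research proposes SkillOpt, which treats an AI agent's skill doc as trainable external state rather than something engineers write by hand. An optimizer model makes validation-gated edits to the skill file, adding, deleting, or replacing instructions, and a textual learning rate controls how aggressively each round rewrites the doc, while the agent itself stays unchanged. SkillOpt is best or tied for best on all 52 cells (covering different models, benchmarks, and harnesses). On GPT-5.5, compared with no skill doc, it gains 23.5, 24.8, and 19.1 points in direct chat, with Codex, and with Claude Code respectively. It beats human-written skills and other automated methods, adds no inference-time cost, and the learned skills transfer across models and harnesses.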
Rohan Paul @rohanpaul_ai · May 25 · 75
🇨🇳 Huawei just released a breakthrough chip design approach, "LogicFolding", that will close its gap with TSMC.
The technical paper behind it.
The core idea is that chips should stop measuring progress mainly by how small transistors are and start measuring progress by how much time delay can be removed from the whole machine.
A chip wastes time when signals move through long wires, memory paths, chip-to-chip links, and software communication layers, so Huawei calls this delay τ, or tau.
Huawei’s paper introducing "LogicFolding" says the next chip breakthrough may come from cutting wasted time inside the machine.
That is what "τ scaling" means.
τ is the delay that accumulates before useful computing happens: a transistor switches, a signal crosses a wire, data reaches memory, a chip talks to another chip, or a server waits for a response.
Moore’s Law reduced this delay indirectly because shrinking transistors also shortened many of the paths around them.
But modern chips are no longer slowed only by transistor size.
They are slowed by wire resistance, parasitic capacitance, clock skew, memory distance, protocol conversion, chip-to-chip communication, and the cost of moving data.
So τ scaling changes the question from “how small is the transistor?” to “where is time being lost?”
LogicFolding is Huawei’s physical answer to that question inside a chip.
In a normal chip, related logic gates are spread across a flat surface, so signals often travel sideways through long metal routes before reaching the next important gate.
Those wires behave like sticky pipes: resistance slows current, capacitance must be charged and discharged, and every extra distance creates delay and wastes energy.
LogicFolding tries to stack active circuit layers vertically and connect them with very fine hybrid bonds, so circuits that need to talk are placed above and below each other instead of far apart on one plane.
The signal now takes a shorter route, the critical path becomes faster, clock timing becomes cleaner, and the same manufacturing node can deliver more performance.
Huawei is trying to win not by making every switch smaller, but by making every important signal travel less, wait less, and arrive sooner.
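Summary: Huawei proposes two new approaches, "τ scaling" and "LogicFolding", aimed at narrowing the performance gap with TSMC without relying on the most advanced lithography tools. The core idea is to shift the measure of chip progress from transistor size to signal delay (τ). LogicFolding, the concrete implementation, vertically stacks logic circuit layers and connects them with hybrid bonding so that circuits that need to communicate sit right next to each other, which shortens critical paths, reduces resistance and parasitic capacitance, and speeds up signals. Huawei says its next-generation Kirin phone chip will be the first full test of τ scaling.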
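The reason shorter wires matter so much can be shown with the standard first-order model of an unrepeated on-chip wire: for a distributed RC line, the Elmore delay is 0.5 · r · c · L², where r and c are resistance and capacitance per unit length. The sketch below is a textbook illustration, not a calculation from Huawei's paper, and the numeric values in the usage note are made up.

```python
def wire_delay_seconds(length_mm, r_ohm_per_mm, c_farad_per_mm):
    """Elmore delay of a distributed RC wire: 0.5 * r * c * L^2.

    Both total resistance and total capacitance grow with length,
    so delay grows with the square of the wire length.
    """
    return 0.5 * r_ohm_per_mm * c_farad_per_mm * length_mm ** 2


def speedup_from_shortening(original_mm, shortened_mm):
    """Delay ratio between the original wire and a shortened one."""
    return (original_mm / shortened_mm) ** 2
```

With hypothetical values of 1,000 Ω/mm and 0.2 pF/mm, a 2 mm route has a delay of about 0.4 ns, while a 1 mm route has about 0.1 ns. Halving the distance cuts this delay term to a quarter, which is the intuition behind placing communicating circuits above and below each other.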
Rohan Paul @rohanpaul_ai · May 25 · 65
New Alibaba + Nanjing Univ paper claims million-token prefill can be sped up 9.36X (compared against FlashAttention-2) with only lightweight adaptation
Shows standard LLMs can handle very long context faster by making attention selectively sparse.
The problem is that full attention gets very expensive when the input grows to hundreds of thousands or 1M tokens, because the model keeps comparing too many tokens with too many other tokens.
The paper’s claim is that a trained full-attention model already has a hidden sparse structure, so the model does not need to be rebuilt or trained from scratch.
RTPurbo uses that structure by finding the few attention heads that really need faraway tokens, while letting the other heads focus mostly on nearby text.
For those retrieval heads, it uses a small 16-dimensional token finder to guess which old tokens matter, then runs the real attention only on that selected set.
The authors tested this on long-context benchmarks and reasoning tasks, and RTPurbo kept accuracy close to full attention while reaching up to 9.36x faster prefill at 1M tokens and about 2x faster decoding.
RTPurbo's engineering rule: keep expensive long-context access only where it matters, and route the rest through a smaller search space.
The clever part is the 16-dimensional indexer.
It does not replace the model’s real attention computation; it acts like a cheap scout, finding likely useful tokens before the full representation is used on the selected set.
RTPurbo is not proof that every model can be safely sparsified this way.
But it is strong evidence that the waste in long-context inference is more structured than it looks.
----
Paper Link – arxiv.org/abs/2605.16928v1
Paper Title: "Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps"
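Summary: Alibaba and Nanjing University propose RTPurbo, a lightweight adaptation method. It finds that a trained full-attention model already contains a hidden sparse structure. It uses a lightweight 16-dimensional token finder as a "scout" to locate important tokens for the few key attention heads that need long-range information, while the other heads focus mostly on local text. On this basis, RTPurbo reaches up to 9.36x faster prefill at 1M tokens compared with FlashAttention-2 and about 2x faster decoding, while keeping accuracy close to full attention on long-context and reasoning benchmarks. The study shows that the compute waste in long-context inference has exploitable structure.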
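The "cheap scout, then real attention on the selected set" pattern can be sketched in plain Python. This is a generic top-k sparse attention illustration under stated assumptions, not RTPurbo's implementation: the function names, the use of dot-product scores for the low-dimensional indexer, and the `top_k` parameter are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(index_query, index_keys, top_k):
    """Scout step: score every past token with small index vectors
    (16-dimensional in the post) and keep the top_k positions."""
    scores = [dot(index_query, key) for key in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(query, keys, values, selected):
    """Full-dimension softmax attention, restricted to the selected positions."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in selected])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected))
            for d in range(dim)]
```

The scout touches every token but only with small vectors, while the expensive full-dimension attention runs over `top_k` tokens instead of the whole context. In the approach described above, only the retrieval heads would take this path; the remaining heads attend mostly to nearby text.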
Chubby♨️ @kimmonismus · May 25 · 60
Nine more Erdős problems have been solved.
This time, however, by Google DeepMind.
This shouldn't be underestimated, because on the one hand it increases competitive pressure, and on the other hand it proves that the other Frontier Labs can easily keep up.
Rohan Paul @rohanpaul_ai · May 25 · 73
A large MoE model may be wasting half its expert compute on tokens that barely need expert help.
In this paper, 50% of expert computation is removed with almost no loss in accuracy.
This makes already-trained MoE models like Qwen3 and GLM stop calling half their experts when a token is too easy to need them.
The method is Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones.
Shows that many MoE tokens do not need real experts, only permission to skip them.
That sounds like a small routing trick, but it changes the economics of deployed language models.
Standard MoE models already avoid using every parameter, yet they still spend the same expert budget on every token.
ZEDA adds a strange new option to the router: experts that output exactly nothing.
When the model routes a token to one of these zero experts, it is not making the model dumber; it is admitting that this token does not need another expensive transformation.
The clever part is not the dummy expert, but the adaptation method.
Instead of retraining the model from scratch, the original MoE becomes a frozen teacher, while the new dynamic version learns when it can safely skip work.
Across Qwen3-30B-A3B and GLM-4.7-Flash, the result is roughly half the expert computation removed, with only marginal average accuracy loss and about 20% real inference speedup.
The deeper finding is: compute use did not simply track task difficulty.
The model spent more expert budget where uncertainty or teacher-student disagreement rose, while structured code and math fragments often needed less.
That makes ZEDA feel less like pruning and more like attention to computational doubt.
----
Paper Link – arxiv.org/abs/2605.18643
Paper Title: "Post-Trained MoE Can Skip Half Experts via Self-Distillation"
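Summary: The paper proposes ZEDA, a framework that turns post-trained static MoE models (such as Qwen3 and GLM) into dynamic ones by letting the router skip expert calls when a token is too easy. On Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA removes roughly 50% of expert computation with only a slight accuracy loss and delivers about a 20% real inference speedup. The study finds that compute allocation follows the model's uncertainty rather than simply tracking task difficulty.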
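The zero-expert idea can be sketched as a router that ranks real experts and "do nothing" slots together. This is a minimal illustration under stated assumptions, not ZEDA's implementation: the function name, the softmax over the chosen slots, and the scalar toy experts are hypothetical, and the self-distillation training step is not shown.

```python
import math


def moe_forward(x, experts, router_logits, top_k):
    """Route one token over real experts plus zero experts.

    router_logits has one entry per real expert followed by one entry per
    zero expert. A zero expert outputs exactly nothing, so choosing it
    skips that slot's computation.
    """
    num_real = len(experts)
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]

    # Gate weights: softmax over the chosen slots only.
    m = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    total = sum(exps.values())

    output, expert_calls = 0.0, 0
    for i in chosen:
        if i >= num_real:
            continue  # zero expert: contributes 0 and costs no expert compute
        output += (exps[i] / total) * experts[i](x)
        expert_calls += 1
    return output, expert_calls
```

With two real experts, two zero experts, and `top_k=2`, an "easy" token whose logits favor the zero slots triggers no expert calls, while a "hard" token whose logits favor the real experts triggers two. The expert budget becomes per-token rather than fixed.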
Chubby♨️ @kimmonismus · May 24 · 68
Don't like this at all.
Researchers at KIT (Germany) just demonstrated that ordinary WiFi routers can identify individuals with near-perfect accuracy.
No phone required, no special hardware, no line of sight. The system reads unencrypted beamforming feedback that every connected device already broadcasts. 197 test subjects, nearly 100% identification rate.
The surveillance infrastructure isn't being built. It's already installed in every café, airport, and office you walk through. The only question is who starts reading the signals first. Source: ScienceDaily
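Summary: Researchers at KIT in Germany showed that ordinary WiFi routers can identify individuals with near-perfect accuracy, with no phone, special hardware, or line of sight required. The system uses the unencrypted beamforming feedback that every connected device already broadcasts. In a test with 197 subjects, identification accuracy was close to 100%. The post points out that this kind of surveillance infrastructure (routers in cafés, airports, and offices) is already everywhere, and that the core question is who will start reading and using these signals.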
elvis @omarsar0 · May 23 · 64
// Adapt the Interface, Not the Model //
I am fascinated by the results across my cheap-model-plus-good-harness builds.
This new paper also shows good signs of the code-as-agent-harness thesis.
The idea is really simple. Do not touch the model. Instead, modify the runtime interface that wraps the frozen LLM. Then convert recurring interaction failures into reusable interventions on the harness side.
The paper reports an average relative improvement of 88.5% across 7 deterministic environments, 126 model-environment settings, and 18 backbones.
A harness learned from one model trajectory generalizes to 17 other backbones. That tells you the harness is capturing environment structure, not model-specific patterns.
If you ship agents in production, your harness work is more portable than you might assume.
Paper: https://arxiv.org/abs/2605.22166
Learn to build effective AI agents in our academy: https://academy.dair.ai/
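Summary: A new study proposes improving AI agent performance by modifying the runtime interface that wraps a frozen LLM instead of changing the model itself. The method turns recurring interaction failures into reusable interventions on the runtime layer and achieves an average relative improvement of 88.5% across 7 deterministic environments and 126 settings. The key finding is that a harness learned from a single model's trajectory transfers to 17 other backbones, which shows it captures environment structure rather than model-specific patterns. This makes harness work more portable for agents deployed in production.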
Rohan Paul @rohanpaul_ai · May 23 · 61
This paper shows that agent performance depends less on prompts alone and more on the harness around them.
“Agent intelligence” is becoming partly a systems problem. The problem is that many AI agents look like 1 model, but their real behavior comes from surrounding code that controls planning, tools, memory, retries, checking, and stopping.
A model may reason well in one step, but long tasks fail in messier places: state disappears, verification drifts, tools return partial evidence, and the agent forgets which intermediate artifact actually matters.
Natural-Language Agent Harnesses try to make that control layer visible.
Instead of burying the logic in controller code, they express the stages, roles, contracts, state rules, failure modes, and stopping conditions in structured natural language that a shared runtime can execute.
The claim is not that natural language should replace code, but that the important design choices around an agent should become inspectable, portable, and testable instead of hiding inside one framework’s habits.
On SWE-bench, heavier harnessing changed behavior dramatically, with more calls, tools, delegation, and runtime, but it did not produce a simple win curve; sometimes added structure helped, and sometimes it pushed the agent away from the shortest benchmark-aligned repair.
A harness is not magic scaffolding around a model; it is a set of bets about where reliability comes from.
----
Paper Link – arxiv.org/abs/2603.25723
Paper Title: "Natural-Language Agent Harnesses"
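Summary: The study argues that an AI agent's real performance depends more on the external control system around the model (the agent harness) than on prompts alone. Many agents look like a single model, but their behavior is driven by surrounding code for planning, tool calls, memory management, and more, so long tasks tend to fail through lost state, verification drift, and similar issues. The paper therefore proposes "Natural-Language Agent Harnesses," which express the control flow explicitly in structured natural language so that it is inspectable, portable, and testable. The study finds that heavier harnesses change agent behavior significantly but do not bring consistent performance gains, which suggests harness design is a key choice about where reliability comes from rather than an instant cure-all.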
Rohan Paul @rohanpaul_ai · May 23 · 55
AI detectors fail because student writing is too varied to judge from 1 document.
The problem is not only that AI writing is getting better, but that many real students write in ways that can look statistically close to AI output.
The paper frames this as a testing problem where the detector does not know each student’s normal writing style, so “human writing” is not 1 fixed target.
Because of that, any detector that catches many AI-written submissions must also wrongly accuse some real students, especially students whose writing is more structured, formulaic, or shaped by learning English.
The authors use basic statistics to show that this false-accusation problem is not just a bug in current tools, because it appears whenever student writing overlaps with AI writing.
A university is not comparing “AI text” with “human text”; it is comparing one submission with the unknown writing habits of one particular student.
Better detectors may reduce some errors, but they cannot erase the structural problem created by one-shot judgment.
----
Paper Link – arxiv.org/abs/2603.20254
Paper Title: "AI Detectors Fail Diverse Student Populations: A Mathematical Framing of Structural Detection Limits"
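Summary: The study argues that AI detectors fail so often because student writing styles are diverse, which makes it very hard to judge from a single document whether text is AI-generated. The problem is not only that AI writing is improving, but that many real students write in ways that are statistically close to AI output. A detector cannot know each student's own writing habits in advance, so "human writing" has no single fixed standard. This means any detector that catches many AI-written texts will inevitably misjudge some real students, especially those whose writing is more structured, formulaic, or shaped by learning English. Existing techniques may lower the error rate, but they cannot remove the structural misjudgment that comes from one-shot judgment.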
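The structural trade-off can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not the paper's framework: it supposes a detector that outputs a single score, models AI text and each student's writing as normal distributions over that score, and uses made-up means. It uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist


def false_accusation_rate(student_mean, ai_mean=2.0, sigma=1.0, target_recall=0.9):
    """Chance that a genuine submission is flagged by a threshold detector.

    The threshold is set so that `target_recall` of AI-written text scores
    above it. A student whose own score distribution sits closer to the AI
    distribution is flagged more often, whatever the detector's quality.
    """
    ai_scores = NormalDist(ai_mean, sigma)
    student_scores = NormalDist(student_mean, sigma)
    threshold = ai_scores.inv_cdf(1.0 - target_recall)
    return 1.0 - student_scores.cdf(threshold)


if __name__ == "__main__":
    for mean in (-1.0, 0.0, 1.0):
        print(f"student mean {mean:+.1f}: flagged {false_accusation_rate(mean):.1%} of the time")
```

In this toy setup, the same threshold that catches 90% of AI text rarely flags a student whose scores sit far from the AI distribution and frequently flags one whose scores sit close to it. Lowering the threshold's recall target reduces false accusations for everyone but also lets more AI text through, which is the one-shot trade-off the post describes.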
Rohan Paul @rohanpaul_ai · May 23 · 64
New Google paper shows that wearable data becomes far more useful when AI learns the person behind the signals.
It is not another heart-rate algorithm, but a general model trained on more than one trillion minutes of sensor data from five million people.
The authors propose SensorFM, a foundation model trained on more than 1 trillion minutes of unlabeled wearable data from 5 million people, so it can learn general patterns of human physiology before seeing specific health tasks.
That scale changes the problem from measuring isolated events to learning patterns of lived physiology: sleep, movement, temperature, oxygen, heart rhythms, and their ordinary daily messiness.
Wearables are not weak because they lack data; they are weak because most systems compress that data into crude summaries before the meaningful structure has a chance to appear.
SensorFM tries to learn that structure first, then reuse it across tasks, which is why the same representation can help with cardiovascular, metabolic, mental health, sleep, lifestyle, and demographic predictions.
The evidence is strongest as a scaling story: larger models trained on more data performed better, and the learned embeddings beat engineered-feature baselines on 34 of 35 prediction tasks.
----
Paper Link – arxiv.org/abs/2511.15352v3
Paper Title: "People readily follow personal advice from AI but it does not improve their well-being"
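Summary: Google Research proposes SensorFM, a foundation model that learns general patterns of human physiology from more than 1 trillion minutes of wearable sensor data produced by over 5 million people. The model goes beyond traditional methods that compress data into simple metrics: it extracts meaningful structure from the data and reuses it across many health prediction tasks. Experiments show that performance improves with model size and data volume, and that the learned representations beat engineered-feature baselines on 34 of 35 prediction tasks.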
Rohan Paul @rohanpaul_ai · May 23 · 79
Google DeepMind's new paper.
Shows that AI can now search formal mathematics proofs, but only inside carefully constrained worlds.
The striking result is not that the system “thinks like a mathematician,” but that it keeps forcing its thoughts through Lean, where every step must compile.
The problem is that LLMs can sound convincing in math while still making tiny mistakes, so the authors use Lean, a proof system that checks every logical step.
Their system, AlphaProof Nexus, lets an LLM keep editing a formal proof, read compiler errors, try again, and sometimes ask a stronger proof tool for help on smaller subproblems.
The stronger version also keeps a shared pool of partial proof attempts, rates which ones look promising, and uses those attempts to guide later searches.
That changes the role of the model from a persuasive storyteller into a generator of candidates that can be killed quickly when they are wrong.
The verifier is not a cosmetic add-on, it is the mechanism that makes exploration tolerable.
Without it, a beautiful proof sketch can hide a false lemma; with it, the model has to turn insight into executable logic, or fail visibly.
The authors tested the system on real unsolved math problems, including 353 formalized Erdős problems and 492 open conjectures from the Online Encyclopedia of Integer Sequences.
The main result is that the best agent solved 9 Erdős problems and proved 44 sequence conjectures, while also helping with problems in optimization, graph theory, algebraic geometry, and quantum optics.
The failures are as revealing as the wins, because the agents sometimes buried the hard part inside a helper lemma or hallucinated a known result, exactly the kind of error formal checking is built to expose.
The real shift is not full mathematical autonomy, but a new division of labor: humans choose the formal question, libraries define the terrain, models propose routes, and the proof assistant refuses to be impressed.
----
"Advancing Mathematics Research with AI-Driven Formal Proof Search"
Paper Link – arxiv.org/abs/2605.22763
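Summary: Google DeepMind presents AlphaProof Nexus, a system that combines a large language model with the Lean formal verification tool. It lets the LLM repeatedly read Lean's compiler errors and revise its proof while generating it, and it can call a stronger tool to help with subproblems. This forces the model to turn every logical step into compilable, verifiable code, which changes its role from "persuasive storyteller" to "candidate generator." In tests on 353 Erdős problems and 492 open conjectures, the system solved 9 Erdős problems and proved 44 sequence conjectures. The work shows the key role of formal verification in exposing AI logic errors and in establishing a new division of labor: humans pose the question, models explore, and the verifier checks.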
Rohan Paul @rohanpaul_ai · May 22 · 46
This RAI Institute robot manages 3-ball juggling through dynamic hand adjustments.
It processes visual and contact information to maintain the pattern without external aids.
Chubby♨️@kimmonismus · 5月22日54University of Tokyo built a chip component that processes data 1000x faster than conventional methods - without generating extra heat.
The real number worth paying attention to: power consumption drops to 1/100th of current levels. A Google-scale data center that today draws as much power as 80,000 homes could theoretically run on the energy of 800.
But the prototype chip isn't scheduled until 2030, and commercial availability is years beyond that. We're watching the AI industry sprint toward an energy wall at full speed while the most promising efficiency breakthroughs are still a decade from production. via techradar
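Summary (translated): The University of Tokyo has developed a new chip component that processes data 1000x faster than conventional methods without generating extra heat. The key advance is that power consumption is only one hundredth of current technology, which in theory could cut the energy use of a Google-scale data center to one hundredth of today's level and greatly ease the AI industry's energy pressure. However, the prototype chip is not expected until 2030 and commercialization will take longer, highlighting the gap between AI's rapid growth and the time it takes breakthrough efficiency technology to reach production.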
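Berryxia.AI@berryxia · 5月22日66Folks, Apple's Persona team has pushed digital-human realism to a new high again.
Just ahead of WWDC26 they released a new paper on the latest advances in facial capture and animation.
Judging from the demo video, capture accuracy and animation naturalness have taken another clear step forward, especially in eye micro-expressions, subtle head movements, and skin texture; the realism is maxed out.
This is no longer a simple "digital avatar"; it is getting closer and closer to a believable digital double.
For AR/VR, gaming, and remote collaboration, this kind of breakthrough directly decides whether "immersion" can hold up. After all, once you put on a headset, the first thing that tends to break the illusion is the feeling that "this person looks fake".
Apple is clearly still betting heavily on this track.
The paper and demo are here (strongly recommend watching the video):
https://apple.github.io/ml-headsup/
Worth trying it out when you have time to see how it actually performs??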
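Summary (translated): Ahead of WWDC26, Apple's Persona team published a new paper presenting the latest advances in facial capture and animation. The demo shows clear gains in details such as eye micro-expressions, subtle head movements, and skin texture, making the digital likeness more realistic, beyond a simple "digital avatar" and closer to a believable "digital double". Such advances matter for immersive experiences in AR/VR, gaming, and remote collaboration because they reduce the sense of unreality in virtual interaction. Apple continues to invest heavily in this area, and the paper and demo video are public.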
Saining Xie@sainingxie · 5月22日60check out RAEv2 led by Jas. through extensive exps, we found some really intriguing behaviors showing why strong representation encoders are key for pixel decoders.
spoiler: it’s not about hillclimbing fid; new metrics like ep@fid-k/fdr^k show there’s a lot more left to explore!
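Summary (translated): RAEv2 greatly simplifies the architecture and improves generality, achieving more than 10x faster convergence on tasks such as text-to-image (T2I) and world models while also improving reconstruction and generation quality. Through extensive experiments, the team found that strong representation encoders are crucial for pixel decoders. Traditional metrics such as FID are no longer enough to fully measure model performance, and new metrics (such as ep@fid-k/fdr^k) show that much remains to be explored in generative modeling.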
Ethan Mollick@emollick · 5月22日61Seems GPT-5.2 reaches expert level in peer review: 45 scientists took 469 hours evaluating human & AI reviews on 82 papers.
"Surprisingly, current AI reviewers are competitive even with the top-rated reviewers in Nature’s official peer review..." though not without weaknesses.
AK@_akhaliq · 5月22日68Mix-Quant
Quantized Prefilling, Precise Decoding for Agentic LLMs
AK@_akhaliq · 5月22日56LongMINT
Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
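Orange AI@oran_ge · 5月21日81A milestone moment in AI development.
An unreleased internal reasoning model from OpenAI autonomously solved the planar unit distance problem that Erdős posed in 1946.
The chain of thought runs to 125 pages. The core move is to bring a set of tools from algebraic number theory to bear on a discrete geometry problem, a cross-field connection humans had not thought of in 80 years.
The most interesting part is that the model was not trained specifically for math; it is a general-purpose reasoning model.
This suggests that once reasoning ability is strong enough to cross some threshold, creativity emerges naturally.
Congratulations, humanity.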
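Summary (translated): An unreleased internal general-purpose reasoning model from OpenAI autonomously solved the planar unit distance problem posed by the mathematician Erdős in 1946, overturning the expectation about the structure of the solution that had prevailed for nearly 80 years. In a 125-page chain of thought, the model applied tools from algebraic number theory to a discrete geometry problem, a cross-field methodological breakthrough. Notably, the model was not trained specifically for mathematics; the result suggests that general reasoning ability may give rise to creativity once it passes a certain threshold, and it marks a key step for AI in basic science.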
Greg Brockman@gdb · 5月21日78our math result is a milestone in new knowledge generation by AI. very exciting to imagine similar results in other scientific fields. "It's very hard to sleep, man" is a pretty good reaction.
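Summary (translated): AI has reached a milestone in generating new knowledge in mathematics. An OpenAI model resolved the planar unit distance problem (Erdős 1946), a famous long-open problem in combinatorial geometry, showing that the number of unit-distance pairs can reach n^{1+δ}, beyond what the previously known grid-type constructions achieve. This marks an important step for AI from solving known problems toward discovering new mathematics. The result drew "hard to sleep" reactions from researchers and is seen by some as a sign that the AGI era is approaching.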
Rohan Paul@rohanpaul_ai · 5月21日78AI in math is creating history again, as OpenAI's general-purpose reasoning model has disproved a major Erdős conjecture from 1946.
The important part is not that AI solved a hard math problem, but how little special machinery it needed.
For decades, the planar unit distance problem looked almost embarrassingly simple: place points on a plane, then ask how many pairs can be exactly one unit apart.
For decades, the best examples looked like stretched versions of a square grid, so mathematicians believed grids were almost the best possible design.
OpenAI’s internal model broke that picture by finding an infinite family of constructions that gives a polynomial improvement, with the proof checked by external mathematicians.
The point to note is that the model was not a bespoke theorem-proving engine trained only for this problem, and the official post says its success improved with more test-time compute, meaning more reasoning at inference rather than only more training.
That matters because research progress often comes from holding a fragile chain of ideas together long enough to cross from one field into another.
In this case, the bridge ran from a plain geometric question into deep algebraic number theory, including machinery like infinite class field towers and Golod–Shafarevich theory.
And now we see a general-purpose reasoning system that appears able to search a conceptual space where human taste, field boundaries, and inherited guesses may have quietly narrowed the path.
So the future is not machines replacing judgment, but machines widening the map before judgment begins.
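Summary (translated): OpenAI's general-purpose reasoning model autonomously solved the planar unit distance problem, a famous question open since 1946. Rather than using a theorem-proving engine built specifically for mathematics, the model relied on more test-time compute and found new constructions that beat the traditional grid structures. This marks the first time AI has autonomously solved a central open problem in a field of mathematics. More importantly, the model connected a geometric problem to deep theory such as algebraic number theory, showing the potential of general-purpose AI for cross-field research and for widening the boundaries of human understanding.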
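To make the question in the thread above concrete, here is a minimal Python sketch that counts unit-distance pairs in a point set. It uses the plain integer square grid purely as the simplest example; it is not the scaled grid construction behind the classical bound, and it has nothing to do with the new constructions in the OpenAI result. The helper names `unit_distance_pairs` and `square_grid` are hypothetical.

```python
from itertools import combinations


def unit_distance_pairs(points):
    """Count unordered pairs of integer points at distance exactly 1."""
    count = 0
    for (x1, y1), (x2, y2) in combinations(points, 2):
        # Compare squared distances so the check stays in exact integers.
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == 1:
            count += 1
    return count


def square_grid(m):
    """All points (x, y) with 0 <= x, y < m."""
    return [(x, y) for x in range(m) for y in range(m)]


# An m-by-m integer grid has 2*m*(m-1) unit-distance pairs:
# m*(m-1) horizontal neighbours plus m*(m-1) vertical neighbours.
for m in (2, 3, 4):
    print(m * m, "points ->", unit_distance_pairs(square_grid(m)), "unit pairs")
```

The brute-force count is quadratic in the number of points, which is fine for small illustrations like this one.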
Rohan Paul@rohanpaul_ai · 5月21日67A 10 million parameter model just outperformed deterministic rivals 3 times its size by doing something regular recursive AI models don't do: exploring multiple reasoning paths at the same time.
Most AI reasoning models are trapped on a single train of thought, and GRAM ("Generative Recursive Reasoning") is the first to break that by letting the model think in parallel universes simultaneously.
The problem is that all existing recursive models are fully deterministic, meaning given the same input they always follow the exact same reasoning path and can never escape a wrong trajectory or discover more than 1 valid answer.
GRAM fixes this by injecting learned randomness at each refinement step, so the model samples a slightly different direction each time rather than snapping to 1 fixed next state, which produces a spread of diverse reasoning trajectories.
At test time the model runs many of these paths in parallel and selects the best one using a small reward predictor trained alongside the main model, adding a "width" scaling axis on top of the usual "depth" axis of running more recursion steps.
On hard Sudoku puzzles, GRAM with 10M parameters hits 97% accuracy versus 87.4% for the best prior recursive model, and with only 20 parallel samples it outperforms every deterministic baseline even at 320 recursion steps.
On tasks with many valid answers like N-Queens, deterministic recursive models collapse as the number of solutions grows, while GRAM maintains near-perfect accuracy throughout.
The same stochastic framework also acts as a generator: given a blank board, GRAM produces valid Sudoku puzzles 99% of the time using 16 steps, versus 1,000 steps and 55M parameters for the best diffusion baseline at just 91%.
---
Paper Link – arxiv.org/abs/2605.19376v1
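Summary (translated): GRAM, a model with only 10 million parameters, introduces learned randomness so that it explores multiple reasoning paths in parallel at inference time, breaking the single-trajectory limit of traditional recursive models. At test time it runs these parallel trajectories and uses a reward predictor to pick the best result, adding a "width" dimension on top of depth. In experiments, GRAM reaches 97% accuracy on hard Sudoku, well above the previous best deterministic model; it also maintains high performance on the multi-solution N-Queens problem and can efficiently generate valid Sudoku puzzles. The framework offers a new approach to improving the reasoning ability of small models.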
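The "width" idea described in the thread above (noise injected at every refinement step, then best-of-N selection by a scorer) can be sketched on a toy problem. This is not GRAM: the refinement rule here is plain gradient descent on a hand-picked one-dimensional objective, the noise is fixed Gaussian rather than learned, and `reward` is a hand-written stand-in for the trained reward predictor. All names and constants are assumptions made for illustration.

```python
import random


def objective(x):
    """Toy non-convex objective with a local and a global minimum."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def reward(x):
    """Stand-in for a learned reward predictor: higher is better."""
    return -objective(x)


def refine(x, noise, rng, steps=50, lr=0.05):
    """Iteratively refine x; each step adds Gaussian noise of scale `noise`."""
    for _ in range(steps):
        grad = 4.0 * x * (x * x - 1.0) + 0.3
        x = x - lr * grad + rng.gauss(0.0, noise)
        x = max(-2.0, min(2.0, x))  # clamp so the toy iteration stays bounded
    return x


def best_of_n(x0, n_samples, noise, seed=0):
    """Run n_samples noisy trajectories and keep the highest-reward one."""
    rng = random.Random(seed)
    candidates = [refine(x0, noise, rng) for _ in range(n_samples)]
    return max(candidates, key=reward), candidates


# Deterministic baseline: zero noise always follows the same path.
deterministic = refine(1.5, 0.0, random.Random(0))
best, candidates = best_of_n(1.5, n_samples=20, noise=0.3, seed=0)
print("deterministic:", round(deterministic, 3), "best of 20:", round(best, 3))
```

With zero noise the trajectory settles into the nearest basin; the noisy runs spread out, and the selector simply keeps whichever end state scores best.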
Chubby♨️@kimmonismus · 5月21日84OpenAI made history today.
An internal reasoning model autonomously disproved a famous conjecture in mathematics that stood for nearly 80 years.
The problem: In 1946, Paul Erdős asked how many pairs of points can be exactly 1 unit apart if you place n points on a flat surface. The best known answer came from square grid constructions, and Erdős himself conjectured you can't do meaningfully better. Mathematicians believed this for decades.
The AI proved him wrong. It found entirely new point configurations that beat the square grid by a fixed polynomial factor, not a marginal improvement, a real mathematical gap.
The proof uses methods from algebraic number theory, a completely different branch of math: class field towers and Golod-Shafarevich theory, tools nobody expected to be relevant to a geometry problem about distances in the plane (reminds me of move 37, AlphaGo tbh).
Fields Medalist Tim Gowers calls it "a milestone in AI mathematics." The proof was verified by leading external mathematicians.
According to OpenAI, this is the first time AI has independently solved a prominent open research problem in mathematics!
Caveat: Obviously OpenAI chose which problems to test the model on. So "autonomous" means the model generated the idea and wrote the proof, not that it wandered into the problem on its own.
But if reasoning models can reliably make cross-domain connections like this, finding paths that experts didn't prioritize, this changes research far beyond math. Biology, physics, materials science, medicine.
This isn't AI reproducing human knowledge anymore. This is AI producing new knowledge. That's a qualitative shift.
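Summary (translated): An internal OpenAI reasoning model autonomously solved the planar unit distance problem, a famous open problem in mathematics that had stood for nearly 80 years. The model overturned Paul Erdős's conjecture by finding entirely new point configurations that beat the traditional square grid by a fixed polynomial factor. The proof uses cross-disciplinary methods such as algebraic number theory, was verified by external mathematicians, and was called "a milestone in AI mathematics" by Fields Medalist Tim Gowers. This is the first time AI has independently solved a central open problem in mathematics, marking a shift from reproducing knowledge to creating it, and its cross-domain reasoning could have far-reaching effects on research in many disciplines.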
Z.ai@Zai_org · 5月21日75http://x.com/i/article/2057206923208884224
# Next-generation LLM Inference Network: How ZCube Alleviates Network Bottlenecks?
LLM inference is reshaping AI infrastructure. The network used to be the least interesting part of an inference cluster. That isn't true anymore. With long-context inference and Prefill-Decode disaggregation now standard, the network sits on the critical path of throughput, tail latency, and per-token serving cost.
To address the increasingly severe topology-induced congestion in Prefill-Decode disaggregated deployments, Z.ai, Harnets.AI, and Tsinghua University jointly developed and deployed the ZCube network architecture in an online production environment. The deployment shows that system-level innovation at the network architecture layer can unlock hardware potential in a highly cost-effective way.
In production benchmarking for the GLM-5.1 coding workload, ZCube delivered significant gains through architectural optimization alone:
- Cost optimization: GPUs, the software stack, and applications remained unchanged, while switch and optical module CapEx was reduced by 33%.
- Throughput improvement: Average GPU inference throughput increased by 15%.
- Latency improvement: TTFT P99 was reduced by 40.6%.
The root cause of the congestion lies in the shift of inference traffic patterns. As PD disaggregation becomes mainstream, cross-node KV Cache transfers make inference traffic highly asymmetric, with dynamically changing sources, destinations, and traffic volumes. In traditional ROFT (Rail-Optimized Fat-Tree) architectures, static topology and port mappings can easily concentrate traffic on a limited set of switches and links, causing local hotspots, queue buildup, and PFC backpressure. This leads to a structural issue where aggregate bandwidth appears sufficient, yet localized congestion occurs frequently.
ZCube addresses this issue by using a fully flattened network topology together with a hybrid single-rail / multi-rail access design. At the network architecture layer, it decouples and distributes PD traffic across a broader path space, reducing the probability of topology-induced congestion at its source. This provides a more efficient networking foundation for next-generation hyperscale inference clusters.
# Network Becoming a Bottleneck for Effective Inference
When thousands of GPUs serve online inference requests concurrently, every KV Cache transfer and every data synchronization operation traverses the inter-GPU network. As long-context inference and Prefill-Decode disaggregated inference gradually become mainstream, data exchange between Prefill and Decode nodes continues to grow. Network bandwidth, and more importantly the ability to use it effectively, has begun to affect cluster-level throughput and latency directly.
To quantify the impact of networking on inference performance, we first conducted an ablation study on a 512-GPU cluster. We kept GPU compute, the software stack, the model, and application logic unchanged, and only adjusted the available NIC bandwidth cap. We then measured changes in overall cluster throughput and Time to First Token (TTFT).
For example, when network bandwidth was increased from 100Gbps to 200Gbps, overall inference throughput improved by approximately 19%, while Time to First Token, or TTFT, decreased by approximately 22%. This indicates that, in LLM inference, network bandwidth has become one of the key factors constraining service performance.
# 1. Network Congestion in Inference
Today, AI clusters commonly use Clos, or Fat-Tree, architectures. The basic idea is to scale the network by stacking multiple layers of switches. However, the performance of Clos networks depends heavily on ideal load balancing across switches, which is difficult to achieve in practice due to routing policies and real traffic patterns.
For example, in many two-tier Fat-Tree deployments, which consist of Spine and Leaf layers, traffic across Spine switches can become severely imbalanced. As a result, upper-layer applications often fail to obtain the expected network performance.
To reduce the overhead of cross-layer forwarding, the industry often adopts ROFT (Rail-Optimized Fat-Tree) architectures [1]. As shown in Figure 3, ROFT groups GPUs by index ("rail"), and connects GPUs with the same index to the same Leaf switch, reducing the communication cost across Spine switches.
ROFT works well for certain training traffic patterns. However, in Prefill-Decode disaggregated inference, we observed a more prominent issue: KV Cache transfers exhibit strong source-destination asymmetry. Different GPUs and different NICs carry highly uneven communication loads, as shown in Figure 4. As a result, ROFT’s rail mapping no longer naturally translates into load balancing. Instead, traffic can become concentrated on a small number of Leaf switches and links, leading to link congestion and degraded transfer performance.
This manifests in several ways:
- Some Leaf switches become persistent load hotspots, increasing the probability that multiple KV Cache transfer flows compete on the same links. As a result, actual transfer throughput can fall far below the NIC bandwidth capacity.
- Certain egress queues on some Leaf switches remain at high depth for extended periods and frequently trigger PFC backpressure, as shown in Figure 5.
- Link congestion further amplifies tail latency, affecting both TTFT and overall throughput.
It is important to distinguish between the two types of network congestion, as illustrated in Figure 6:
- Unavoidable congestion: For example, when multiple GPUs send data to the same destination at the same time, contention on the final-hop link is inevitable.
- Avoidable congestion: This is caused by topology design, traffic mapping, or imbalanced multipath utilization. Fundamentally, it is an architecture-level design problem.
For the first type of congestion, we typically rely on congestion control, traffic shaping, and related mechanisms to mitigate its impact. For the second type, new network transport mechanisms such as adaptive routing [2], packet spraying [3,4], and MRC [5] can help. However, a more effective approach is to prevent network conflicts that should not occur in the first place through innovation at the network architecture layer.
Prefill-Decode disaggregated inference is a typical example. If the network topology cannot match the traffic pattern, the system will repeatedly generate load hotspots and link conflicts. Solving this problem requires rethinking the inference network architecture itself.
# 2. ZCube Network Architecture
To address the above issues, we deployed a new ZCube network architecture [6]. ZCube breaks away from the traditional Clos design philosophy of hierarchical switch stacking and instead introduces a fully flattened GPU server interconnect.
The ZCube routing strategy, designed specifically for the ZCube architecture, fully leverages the structural properties of the flattened topology. It can achieve near-ideal load balancing across all switches in the network, thereby significantly improving overall cluster network bandwidth.
Compared with Clos, ZCube has a natural advantage in load balancing. This advantage benefits both training clusters and inference clusters. Importantly, ZCube achieves these performance gains while reducing switch and optical module costs by approximately one third compared with Clos. Based on current mainstream switch and NIC configurations, ZCube can support flattened networking for tens of thousands, or even hundreds of thousands, of GPUs.
## 2.1 ZCube Core Architecture
As shown in Figure 7, the core ideas of ZCube are:
1. Remove the Spine switch layer.
2. Divide Leaf switches into two groups of equal size, typically odd-numbered switches and even-numbered switches.
3. Establish a complete bipartite interconnect between the two switch groups.
4. Connect the two ports of each GPU NIC to the corresponding switches in the two groups using single-rail and multi-rail access patterns.
Suppose each GPU has a corresponding NIC with two ports, i.e., p = 2. There are n GPUs in total, and GPUs and NICs share the same indices: 1, 2, …, n. Let k denote the number of GPUs connected to each switch. The total number of switches is 2n/k, numbered 1, 2, …, 2n/k. For GPU i, where 1 ≤ i ≤ n:
- The first port connects to the odd-numbered switch:
((i − 1) mod (n/k)) × 2 + 1
- The second port connects to the even-numbered switch:
⌈i/k⌉ × 2
The two switch groups are connected as a complete bipartite graph: every odd-numbered switch connects to every even-numbered switch.
A ZCube topology under a dual-port NIC configuration, with p = 2, n = 32, and k = 8, is shown in Figure 7.
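The port-mapping rule above can be sketched in a few lines of Python. This is only an illustration of the two formulas as stated in the text, applied to the p = 2, n = 32, k = 8 example; the function names `zcube_ports` and `build_topology` are hypothetical and not part of any ZCube tooling, and the sketch assumes n is divisible by k.

```python
from collections import defaultdict


def zcube_ports(i: int, n: int, k: int) -> tuple[int, int]:
    """Return (odd_switch, even_switch) for GPU i, using 1-based indices."""
    if n % k != 0:
        raise ValueError("this sketch assumes n is a multiple of k")
    if not 1 <= i <= n:
        raise ValueError("GPU index out of range")
    groups = n // k                          # switches per group (n/k)
    odd_switch = ((i - 1) % groups) * 2 + 1  # first port: multi-rail pattern
    even_switch = -(-i // k) * 2             # second port: ceil(i/k) * 2
    return odd_switch, even_switch


def build_topology(n: int, k: int) -> dict[int, list[int]]:
    """Map each switch ID to the list of GPU IDs attached to it."""
    members = defaultdict(list)
    for gpu in range(1, n + 1):
        odd_switch, even_switch = zcube_ports(gpu, n, k)
        members[odd_switch].append(gpu)
        members[even_switch].append(gpu)
    return dict(members)


topology = build_topology(n=32, k=8)
for switch in sorted(topology):
    print(f"switch {switch}: GPUs {topology[switch]}")
```

In this example the even-numbered switches each take a contiguous block of eight GPU IDs, while the odd-numbered switches each take every fourth GPU, which matches the single-rail and multi-rail access patterns the article describes.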
## 2.2 Key Properties of ZCube
Network Diameter
ZCube has a network diameter of two switch hops, meaning any pair of GPUs can reach each other through two switches. This sits between a one-layer switch network, which has one switch hop but limited scale, and a conventional two-layer switch network, which supports a larger scale but typically requires three switch hops and incurs higher latency.
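The two-hop claim can be checked by brute force on a small instance. The sketch below builds the graph implied by the port formulas in Section 2.1 plus the complete bipartite links between odd and even switches, then runs a breadth-first search from every GPU. Function names are hypothetical, and the check covers only the sizes passed in, not the general case.

```python
from collections import deque


def build_graph(n: int, k: int) -> dict:
    """Adjacency sets for GPUs and switches in an n-GPU, k-per-switch ZCube."""
    groups = n // k
    adj: dict = {}

    def link(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    for i in range(1, n + 1):
        link(("gpu", i), ("sw", ((i - 1) % groups) * 2 + 1))  # odd switch
        link(("gpu", i), ("sw", -(-i // k) * 2))              # even switch
    for odd in range(1, 2 * groups, 2):
        for even in range(2, 2 * groups + 1, 2):
            link(("sw", odd), ("sw", even))                   # bipartite links
    return adj


def max_switch_hops(n: int, k: int) -> int:
    """Largest number of switches on a shortest path between any two GPUs."""
    adj = build_graph(n, k)
    worst = 0
    for src in range(1, n + 1):
        dist = {("gpu", src): 0}
        queue = deque([("gpu", src)])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for dst in range(1, n + 1):
            if dst != src:
                # A shortest GPU-to-GPU path with e edges crosses e - 1 switches.
                worst = max(worst, dist[("gpu", dst)] - 1)
    return worst


print(max_switch_hops(32, 8))
```

GPU pairs that share a switch need one hop; all other pairs go from the source's switch in one group directly to the destination's switch in the other group.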
Load Balancing
First, the ZCube routing strategy ensures that each GPU pair has a unique optimal path, avoiding traffic conflicts caused by multipath route selection.
Second, ZCube uses two complementary GPU-to-switch connection patterns. One switch group connects to GPUs in a single-rail pattern, where each switch connects to a contiguous range of GPU IDs. The other switch group connects to GPUs in a multi-rail pattern, where each switch connects to GPUs with the same relative index across groups.
This design enables ZCube to achieve highly effective load balancing across the entire switch fabric under both typical AI training traffic patterns, such as AllReduce and All-to-All, and typical AI inference traffic patterns, where source-destination relationships are uncertain, and NIC loads can be highly imbalanced.
As a result, ZCube can avoid the second type of network congestion described earlier at the architecture layer. As shown in Figure 8, traffic flows that would conflict under ROFT can obtain dedicated network paths under ZCube, thereby avoiding congestion.
Scalability
ZCube provides strong scalability while preserving its favorable performance characteristics. For example, using one layer of 51.2T switches, each with 128 × 400Gbps ports, ZCube can construct a network connecting 16,384 400Gbps NICs. If higher-capacity switches are used, or if the ZCube network is divided into more planes, the architecture can scale further to support interconnection among tens of thousands or even hundreds of thousands of GPUs.
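One way to sanity-check the 16,384 figure is a simple port-budget calculation. The sketch below rests on assumptions that are not stated in the text: each switch spends k ports on GPUs and one port on every switch in the other group, so k + n/k must fit within the switch's port count, and each 400Gbps switch port is split into two 200Gbps ports to match the two ports of a 400Gbps NIC, giving 256 usable ports per switch. `max_gpus` is a hypothetical helper.

```python
def max_gpus(ports_per_switch: int) -> int:
    """Largest n such that k + n/k == ports_per_switch for some integer k."""
    best = 0
    for k in range(1, ports_per_switch):
        groups = ports_per_switch - k  # n/k: switches in the opposite group
        best = max(best, k * groups)   # n = k * (n/k)
    return best


# Assumption: 128 x 400Gbps ports broken out into 256 x 200Gbps ports.
print(max_gpus(256))
# Without the breakout, the same budget rule gives a smaller fabric.
print(max_gpus(128))
```

Under these assumptions the product k × (n/k) is maximized when the port budget is split evenly between GPU-facing and switch-facing ports.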
Cost
At the same cluster scale, ZCube can reduce switch and optical module costs by approximately one third compared with traditional Clos / ROFT architectures. For example, in a 10,000-GPU AI cluster, ZCube can save roughly 210 million RMB to 640 million RMB in network hardware investment. These characteristics show that ZCube can achieve better load balancing and performance while requiring lower network hardware cost.
## 2.3 Real-World Cluster Testing: Boosting Inference Performance While Cutting Network Costs
We upgraded the network architecture of a thousand-GPU cluster running GLM-5.1 coding inference services from the original ROFT to the ZCube architecture. Since the ZCube architecture eliminates the Spine-layer switches found in traditional Clos architectures, the legacy cabling patterns, IP addressing schemes, routing policies, and switch configuration methods established under the Clos framework could not be reused directly, necessitating a complete redesign tailored to ZCube.
To tackle these challenges, the Harnets.AI Network Team designed a comprehensive network solution centered on the ZCube architecture. They developed a suite of automation tools, including the ZCube Controller, a data center layout design tool, and a cabling correctness verification program. This enabled capabilities such as data center deployment planning, cabling validation, automated configuration generation, and batch deployment, effectively resolving numerous hurdles in ZCube deployment. This suite of tools was the critical factor enabling the successful transformation of a large-scale production cluster within an exceptionally tight timeframe.
Following the seamless network architecture migration, we conducted real-world testing on the ZCube architecture by running the GLM-5.1 coding inference services on this cluster. By comparing the cluster's inference performance before and after the upgrade, we found that ZCube boosted the average GPU inference throughput by over 15% compared to the ROFT architecture (as shown in Figure 9), while dropping the P99 tail latency of TTFT by 40.6%.
In summary, for GPU and server hardware of the same scale and configuration, and without modifying any applications, upgrading the networking architecture to ZCube allowed us to not only save 1/3 of the optical modules and switch hardware, but also enable the cluster to serve 15% more inference requests per second. Against the current backdrop of exploding inference workloads and severe shortage of compute resources, this approach proves to be highly pragmatic and valuable. Currently, this ZCube cluster has been running stably for over two weeks, playing a vital role in powering the GLM-5.1 coding inference services.
# 3. Conclusion
LLM inference is moving from point-wise optimization toward system-level co-design. The coupling between the network and the inference engine is becoming increasingly tight, making networking a critical component of the inference system. The production deployment of ZCube shows that network architecture innovation can directly unlock the effective capacity of inference systems. By better aligning the network architecture with KV Cache transfers and PD traffic patterns, ZCube reduces the probability of topology-induced congestion at the source, improving throughput and latency while enhancing cluster cost efficiency.
Looking ahead to next-generation LLM infrastructure, network design will evolve from general-purpose interconnects toward model-traffic-driven system co-design. Long-context inference, PD disaggregation, MoE, and integrated training-inference workloads are reshaping intra-cluster communication patterns, requiring network topology, communication libraries, and scheduling policies to be jointly optimized around real model traffic. We will continue pioneering novel AI network architectures for larger-scale inference and training clusters, upgrading the network from a foundational GPU connection layer into a core driver of token generation efficiency, system resilience, and cost-effectiveness.
# Acknowledgements
ZCube was published at ACM SIGCOMM 2025, and was recognized as “significantly change the way we think about and understand networking.” This is the first large-scale deployment of the technology in a production inference cluster. We thank the Harnets.AI team for their professional support and close collaboration throughout this network architecture upgrade and optimization effort.
## References
[1] NVIDIA. 2023. SuperPOD: Next Generation Scalable Infrastructure for AI Leadership. https://docs.nvidia.com/dgx-superpod-reference-architecture-dgx-h100.pdf
[2] NVIDIA. 2025. https://developer.nvidia.com/blog/accelerating-ai-storage-by-up-to-48-with-nvidia-spectrum-x-networking-platform-and-partners/
[3] Ultra Ethernet Consortium. Ultra Ethernet specification v1.0.1, 2025.
[4] Tommaso Bonato, Abdul Kabbani, Ahmad Ghalayini, Michael Papamichael, Mohammad Dohadwala, Lukas Gianinazzi, Mikhail Khalilov, Elias Achermann, Daniele De Sensi, and Torsten Hoefler. REPS: Recycled entropy packet spraying for adaptive load balancing and failure mitigation, 2026.
[5] Araujo, J., Chow, A., Handley, M., Lewis, R., Paasch, C., Padhye, J., … & Sur, S. (2026). Resilient AI Supercomputer Networking using MRC and SRv6. arXiv preprint arXiv:2605.04333.
[6] Yan, Z., Li, D., Chen, L., Xiong, D., Gao, K., Zhang, Y., … & Lin, H. (2025, September). From ATOP to ZCube: Automated topology optimization pipeline and a highly cost-effective network topology for large model training. In Proceedings of the ACM SIGCOMM 2025 Conference (pp. 861-881).
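Summary (translated): As long-context inference and Prefill-Decode disaggregated deployment become mainstream, the GPU cluster network has gone from a secondary component to a key bottleneck for inference throughput, tail latency, and cost. Traditional static topologies clash with dynamic, asymmetric KV Cache traffic, causing localized congestion. In response, Z.ai, Harnets.AI, and Tsinghua University jointly developed the ZCube network architecture, which uses a fully flattened topology and a hybrid access design to decouple and spread traffic at the source and reduce congestion. In GLM-5.1 production testing, with GPUs and the software stack unchanged, ZCube cut switch and optical module cost by 33%, raised average inference throughput by 15%, and reduced TTFT P99 by 40.6%, showing that network architecture innovation can unlock hardware potential.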
Emad@EMostaque · 5月21日91Once AI starts solving open problems in novel ways, it won’t stop.
We are entering the final stage of human solutions to open problems like this.
Feels weird, doesn’t it?
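Summary (translated): An OpenAI model has for the first time autonomously solved the planar unit distance problem posed by Paul Erdős in 1946, overturning the conjecture that had been mainstream in mathematics for nearly 80 years. The AI not only did better than the known constructions but also found an entirely new family of constructions. The event is seen as a milestone in AI capability, suggesting that AI is beginning to make sustained progress on open scientific problems in novel ways, and possibly marking the arrival of the "final stage" of human-led solutions to such problems.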
Greg Brockman@gdb · 5月21日92An OpenAI model has achieved a major breakthrough in mathematics, by disproving a central conjecture in discrete geometry that was first posed by Paul Erdős in 1946.
This is the first time AI has autonomously solved a prominent open problem central to a field of mathematics.
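Summary (translated): An OpenAI model has made a major breakthrough in discrete geometry by autonomously resolving the planar unit distance conjecture first posed by the mathematician Paul Erdős in 1946. It is the first time AI has independently solved a famous open problem central to a field. For nearly 80 years, mathematicians broadly believed the best constructions looked roughly like a square grid; the OpenAI model found entirely new constructions that do better, overturning that long-held belief.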
AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 5月21日87"This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics."
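Summary (translated): An OpenAI model autonomously cracked the planar unit distance problem, a famous open problem in mathematics that had stood for nearly 80 years. Posed by Paul Erdős in 1946, the problem was long thought to have optimal configurations close to a square grid. The model's discovery overturns that assumption and produces entirely new constructions that do better, marking the first time AI has independently solved a major open problem central to a field of mathematics.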