Kim受邀首次参加微软Build,参观GitHub HQ、参与多场会议并见到Satya Nadella,认为远超预期。微软发布7个新AI模型(定位中端、约Sonnet级别、价格亲民),新Surface Laptop Ultra配新芯片对标MacBook Pro,展示Project Solaris和智能体手持设备等实验项目,推出改版Copilot应用,企业版新增智能体功能及新量子芯片。作者认为微软正认真听取反馈,在各个方向推动变革。
Microsoft Build. My personal review. For me, this was the first time I had the chance to attend Microsoft Build, at Microsoft's invitation. To be honest, I didn't really know what to expect, but I was especially looking forward to the keynote. And it wasn't just the keynote: I also visited GitHub HQ, saw the event hall, sat in on numerous sessions, and even met Satya Nadella in person. Holy moly. It truly exceeded all my expectations. 2026 is turning out to be a crazy year for me. It started with NVIDIA GTC in San Jose in March, followed shortly after by a trip to China - Guangzhou and Beijing - then Google I/O in California, and now Microsoft Build, also in California. What a wild ride! I met incredible people and had fascinating conversations late into the evening about LLMs, chips, energy, geopolitical challenges, financial markets, and so much more. What impressed me most was the pioneering spirit, the optimistic atmosphere, the enthusiasm for being at the forefront of this tech-revolution. Optimism mixed with passion and a love of building, that's what I take away from all these trips. Microsoft was no exception. I got a behind-the-scenes look, heard exclusive GitHub sessions, experienced a personal demo of the flagship Surface Laptop Ultra, met researchers, and much more. My honest take on Microsoft Build: Microsoft is taking feedback seriously and is trying to set things in motion and drive change on every front. Seven new AI models - clearly not aiming for the absolute top end, but positioned in the mid-range, roughly at Sonnet level, and affordable; a new laptop with a new chip meant to rival the MacBook Pros, which, frankly, at first glance even seems capable of pulling it off; bold experiments like Project Solaris and the agentic handheld (yes, I've read all the Rabbit comparisons :D); a revamped Copilot app; the rollout of agentic features into enterprise editions with a new quantum chip; and plenty more. It certainly wasn't boring. Time will tell what succeeds, but I'd argue Microsoft is on the right track.
译Kim受邀首次参加微软Build,参观GitHub HQ、参与多场会议并见到Satya Nadella,认为远超预期。微软发布7个新AI模型(定位中端、约Sonnet级别、价格亲民),新Surface Laptop Ultra配新芯片对标MacBook Pro,展示Project Solaris和智能体手持设备等实验项目,推出改版Copilot应用,企业版新增智能体功能及新量子芯片。作者认为微软正认真听取反馈,在各个方向推动变革。
Lawyers, too, are cooked "When law professors were handed a stack of anonymized answers to student contract questions and asked to pick the better one, they picked AI 75% of the time"
译律师们,也完了 "当法学教授收到一堆匿名的学生合同法问题答案并让选出更好的那个,他们75%的时候选了AI"
PAPER: We used state-of-the-art LLMs to prove AI still can't do X THE STATE-OF-THE-ART LLMS:
译论文:我们使用最先进的大语言模型来证明AI仍无法做到X 最先进的大语言模型:
How do we automate business analytics with Claude? New blog post covering our best practices for skills, data foundations, and evaluations when building agents to perform data analysis: https://claude.com/blog/how-anthropic-enables-self-service-data-analytics-with-claude
译我们如何用 Claude 自动化商业分析? 新博客文章,涵盖构建数据智能体时在技能、数据基础和评估方面的最佳实践: https://claude.com/blog/how-anthropic-enables-self-service-data-analytics-with-claude
Drones enforcing traffic rules in Shenzen
译深圳无人机正在执行交通规则。
The five stages of Claude, @JeremieEO is currently at Stage 1... ACCEPTANCE.
译Claude的五阶段,@JeremieEO目前处于第一阶段... 接受。
Im confused. And excited at the same time. I got the feeling OpenAI is preparing for some big releases. Superapp? 5.6? Let it come!
译我很困惑,同时也感到兴奋。我感觉到OpenAI正在准备一些重大发布。 超级应用?5.6?让它来吧!
This story was so implausible that the only way it even (kind of) made sense if it is some sort of internal accounting placeholder at a cloud provider using their own compute. And even then it seems unbelievable for a wide number of reasons.
译@binarybits 称,不相信有公司一个月意外花费5亿美元在Claude上,这个数字大得不合理。主推文表示这故事难以置信,唯一可能解释是云提供商内部会计占位符,即便如此也仍有诸多疑点。
http://x.com/i/article/2062244283940544512 # A Functional Taxonomy of World Models > “The world is everything that is the case.” — Ludwig Wittgenstein, Tractatus Logico-Philosophicus, 1921 ## The world is not made of words. In an earlier essay, we argued that spatial intelligence is AI’s next frontier and that world models are the path to it. Here, the World Labs team and I want to go one level deeper: of the many things now being built and called ‘world models,’ which functional pieces actually compose that capacity — and what is each one for? Language models have given machines an extraordinary command of concepts, vocabulary, and reasoning, but the physical world, virtual or real, runs on a different substrate. Where language models learn the statistical structure of text, world models learn the statistical structure of space and time: how light falls on a surface, how a garden looks from an angle no camera has captured, how objects respond to force and follow the laws of physics. That makes “world model” one of the most important and most overloaded terms in AI today. Computer vision, robotics, reinforcement learning, and generative AI each claim to be building world models, and each means something quite different. A video model that produces gorgeous but physically impossible flames, a language model improvising a playable game, and a physics engine that faithfully simulates combustion all go by the same name. The ancient Greeks could never agree on what the world was made of, whether fire, water, or indivisible atoms, because “world” was never a single thing. It was always a stand-in for whatever totality a given thinker needed to reason about. AI has inherited the same problem, at exactly the moment when the field needs precision. ## The loop beneath the taxonomy Cutting through that confusion starts with a diagram older than any of the technology in question. Reinforcement learning textbooks, including the canonical Sutton and Barto, have used a version of the same picture for decades to describe how an agent interacts with a world. The formal name for this picture is the partially observable Markov decision process, or POMDP, and the original definition of the term “world model” belongs to that tradition. An agent, which can be a person, a robot, or a software system, takes actions. Those actions affect the state of the world. The agent never sees the state directly. What reaches the agent are observations: the photons that fall on a retina, the readings from a sensor, and the pixels in a video frame. New observations inform new actions, and the loop continues. The word “state” needs unpacking, because the meaning shifts from field to field. This is not the chemist’s state, the difference between solid, liquid, and gas. This is the physicist’s and roboticist’s state: a complete description of what is happening in the world at a given moment, including every object, every position, every velocity, every property. State is the underlying reality of the world; complete in principle, but never directly visible to any agent inside it. Observations are an agent’s partial view of that reality. Actions are what the agent does in response. This loop — agent to action to state to observation and back — is the structure that gave the modern term “world model” its technical meaning. The phrase itself is older, traced to Kenneth Craik’s 1943 proposal that minds reason by running “small-scale models” of reality, and carried into neural networks by the late 1980s and early 1990s. And the loop also explains what people mean by the term today. The different things now being called world models are in fact different projections of this same loop. Each one outputs a different piece of it. ## Three functions of a world model The first kind of world model is a renderer. A renderer outputs observations in the form of pixels meant for human eyes, and the quality that matters most is visual fidelity. A video model that turns a text prompt into a cinematic drone shot is a renderer. So is an interactive system like Google’s Genie 3, or World Labs’ own RTFM, where the model generates frames in real time conditioned on user input. The model carries no explicit understanding of three-dimensional structure. It produces what a viewer would see, not what is. The buildings in the drone shot may look flawless from above, but try to drive through the city below and they fall apart. The second kind is a simulator. A simulator outputs state: a geometrically, physically or dynamically faithful representation of the world that humans and computer programs can both compute on and interact with. Where the renderer’s contract is purely visual, the simulator’s contract is structural, demanding geometry that holds up under inspection, physics that respects Newton’s laws, and dynamics that behave the way the world needs to behave given the laws of physics. A simulator serves two consumers at once. Human professionals such as architects, designers, filmmakers, and game developers need accuracy beyond visual plausibility. Computer programs such as reinforcement learning agents, robot controllers, and autonomous vehicles use simulators as training grounds where they can interact with the world at scale, testing scenarios that would be dangerous, expensive, or impossible to run in reality. The third kind is a planner. A planner outputs actions. Given an observation and a goal, a planner answers the question of what the agent should do next. This is, in many ways, the inverse of the renderer. Where a renderer takes actions as input and produces observations, a planner takes observations as input and produces actions, closing the perception-action loop. Vision-Language-Action models, model-based systems, and the new wave of World Action Models are all attempts at planners: systems that can decide what a robot should do in an unstructured world. These three categories describe most of what is actually shipping today, and the distinction between them is useful in practice. The categories are not, however, fundamentally separate. The same underlying knowledge of how the world works—geometry, physics, dynamics—sits beneath all of them. A model that can render a cup from any angle ought, in principle, to be able to simulate what happens when the cup is pushed and plan a hand to pick the cup up. Increasingly, the most interesting research deliberately blurs the boundaries between the three. ## Why simulation is the linchpin Of the three categories, the simulator gets the least public attention, and is the most consequential of the three. This essay addresses this asymmetry. The renderer is by far the most commercially mature. A number of image- or text-to-video products are expanding in the consumer or enterprise markets rapidly. Google’s Nano Banana model has put renderer-quality image generation in the hands of potentially hundreds of millions of users. The technology is real, and the markets are real. Yet renderers optimize for visual plausibility rather than physical accuracy, and that ceiling matters. Their outputs are beautiful, but they cannot be trusted to design a building or train a robot. The planner is the most intriguing and the most nascent, closely connected to the rapidly evolving field of robotic learning. The field has produced robotic demos in the last two years that look impressive in videos, but candor is required about what those demos actually show. Almost all have been confined to heavily constrained laboratory setups, with narrow object sets and short task horizons. None have been validated at the complexity, variability, or duration that real-world deployment demands. The gap between a compelling demo reel and a robot that reliably works in a kitchen, a warehouse, or an operating room remains vast. The commercial bets are nonetheless substantial. A wave of well-funded entrants is racing to ship general-purpose planning systems, while the largest infrastructure players are positioning planning atop broader simulation stacks. A robot that can plan is a robot that can work, and the entire industry is racing to be the one that gets there first. Simulation is the bridge between the two. If language is an abstraction of the world and pixels are a projection of it, then geometry, physics, and dynamics are the world itself. A simulator must work at that level: the structural backbone from which both visual appearance (for renderers) and action consequences (for planners) can be derived. A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents. A model that masters only rendering, or only planning, cannot do either. The commercial surface area is enormous. NVIDIA’s Omniverse alone targets what the company estimates as more than a trillion dollars of addressable market in factories, warehouses, supply chains, and digital twins. Robotics training, autonomous vehicle testing, architectural visualization, engineering, and drug discovery all depend on something simulation-shaped. The hardest open problems in the field live there too. Three-dimensional data with explicit geometry, material properties, and physical annotations is orders of magnitude scarcer than the internet video that renderers train on. The sim-to-real gap, which is the difference between how things behave in simulation and how they behave in reality, persists. Generative simulators introduce a new risk on top of that: AI-generated geometry can look correct while containing self-intersections or wrong scale that produce nonsensical physics. Multi-physics simulation at scale, where rigid bodies, deformable objects, fluids, and cloth all interact, remains orders of magnitude more expensive than single-domain simulation. At World Labs, Marble is our first move into this territory. It takes multimodal prompts (text, image, video, or spatial sketch) and generates explorable 3D environments, outputting Gaussian splats for visual exploration alongside collision meshes a physics engine can operate on. But Marble is only the first chapter of a much longer arc being written across the field as the lines between rendering, simulation, and planning begin to collapse. ## Where the boundaries are collapsing and what comes next But more is to come. The most important pattern in the field right now is that the three categories are starting to blend into one another. The shared insight is that the knowledge required to render a world, simulate it, and act in it is largely the same. Continuing the earlier example, a model that truly understands how a cup sits on a table (its geometry, material properties, response to force, etc.) should be able to render that cup from any angle, simulate what happens when the cup is pushed, and plan for a hand to pick the cup up. The three categories are three projections of a single underlying understanding. For example: a small but growing number of recent work from various robotics labs have demonstrated that—at least conceptually—a pretrained video renderer can be used as the backbone for joint world-and-action prediction, suggesting a bridge between the renderer and the planner by letting one model imagine what will happen and what to do. World Labs’ Marble already outputs Gaussian splats and collision meshes from a single model, dissolving the boundary between the renderer and the simulator. Every level is moving from passive output to interactive system, with renderers becoming action-conditioned, simulators generating worlds that are more controllable and editable, and planners deliberating rather than just reacting. The logical endpoint is a unified world model: one foundation model that can render photorealistic views, produce physically accurate structure, and plan action sequences, switching between output modalities depending on what the downstream consumer needs. We will still face a number of daunting challenges. The data picture is uneven, with renderers awash in internet video while simulators and planners face acute shortages of 3D assets and robot demonstrations. Optimizing for visual beauty can sacrifice the precision a robot or a high-fidelity simulation needs. Reconciling these tensions inside a single architecture is the defining open problem in world model research today, and this is what World Labs sets out to do as we continue to evolve Marble. The direction, however, is clear. The same bet the field has been making since the late 1980s — that a sufficiently rich model of the world is all that any agent needs to see worlds, build them, and act in them — is the bet now driving an entire generation of research. What gives that “big bet” weight is the convergence already underway: three threads, each already driving and shaping multi-billion-dollar industries on its own, that began as separate research programs are starting to behave like one. Taken together, as the boundaries between them collapse, they will reshape something larger: the relationship between machine intelligence and the physical world it inhabits - the long arc of spatial intelligence. Language gave machines a way to talk about that world. World models are how machines will finally come to understand, imagine, reason and interact with it.
译World Labs团队与李飞飞发文,梳理“世界模型”这一被滥用的术语。对比语言模型学习文本统计,世界模型学习空间与时间统计(如光照、物理规律)。基于部分可观马尔可夫决策过程(POMDP)框架,智能体通过动作影响世界状态,观测是部分视图。当前被称为“世界模型”的不同系统本质上是同一循环的不同投影:第一类为渲染器,输出给人眼看的像素,以视觉保真度为核心。文章着重于概念分层,未给出具体模型名、参数或基准分数。
150M 的活,35M 干了, Google 新出的 Gemma 4 12B,把多模态里那个最重的零件,视觉编码器, 从 150M-550M 直接压到 35M了, 过去做多模态,套路是固定的, 图片先扔给一个专门的视觉编码器翻译成模型能懂的语言, 再交给大模型理解,就像配了个翻译官。 这个翻译官,传统 ViT 编码器要 150M 到 550M 参数。 Gemma 4 12B 直接把翻译官辞了, 只留一个 35M 的轻量嵌入器,把图片切成 48×48 的小块,当成 token 直接扔进去, 让 Transformer 自己学着看世界, 音频也一样,16kHz 原始波形切成 40ms 一帧,直接喂进同一个模型。 也就是说,图片、声音、文字,第一次被当成同一种东西。 为什么敢这么干, 因为它赌的是一件事, 当基座模型大到某个临界点,那些专门的子模块,就不再是必需品了。 这个剧本你可能见过, 当年 ViT 取代 CNN,也是同一个套路, 规模够大的时候,与其手工设计一堆专用结构,不如把活儿直接交给一个统一的大模型自己学。 现在这套逻辑,正从视觉单模态,蔓延到整个多模态架构。 而且 12B 这个尺寸不是随便选的, 刚好大到能扔掉编码器,又刚好小到能塞进 16GB 的笔记本里, 据 aaryan_kakad 在 M4 Max 上的实测,4-bit 量化下识图延迟 1.2 到 1.5 秒, 官方说 16GB 够用,社区的说法更实在,能跑,但高分辨率多图会压线。 但这条新闻真正值得琢磨的,不是它能跑在你的笔记本上, 是它意味着什么, 过去做一个多模态应用,你得拼装 Whisper 转录、LLaVa 看图、再接一个 LLM, 像攒一台机器,每个零件都得你自己调好接口、对齐、调试。 如果 encoder-free 这条路走通, 未来一个微调好的统一模型,可能就把这一整条流水线吃掉了。 那一刻贬值的,不是某个工具, 是你过去攒那台机器、拼那条 pipeline 攒下的全部手艺。 模型不是在帮你省一个零件, 是在悄悄重写哪种手艺还值钱。
译Google 推出 Gemma 4 12B(Apache 2.0),采用无独立视觉编码器的统一多模态架构。仅用 35M 参数的轻量嵌入器,将图像切为 48×48 块、音频(16kHz 原始波形)切为 40ms 帧,直接作为 token 输入 Transformer。M4 Max 上 4-bit 量化识图延迟 1.2-1.5 秒,官方称 16GB 内存可用,但社区指出高分辨率多图会压线。该设计暗示:当基座模型足够大,专用子模块不再是必需,未来一个微调好的统一模型可能取代传统拼装 Whisper、LLaVa 等多模态 pipeline。
In early May, the best superforecasters predicted that, by the end of the year, the longest METR 80% task horizons would reach 3-4 hours. In late May, Claude Mythos achieved that number.
译5月初,顶级超级预测者预计2026年底前最长METR 80%任务时间范围可达3-4小时。然而5月底,Anthropic的Claude Mythos模型在METR基准预览中即以80%成功率达到3小时6分钟,直接落在专家和超级预测者对2026年底的中位数预测范围内(3-4小时)。此前基线为1.5小时。此次突破表明AI能力进展速度远超预期。
More and more engineers are now burning more money on AI tokens than their base salaries. Tech companies are facing a brutal dilemma: > let everyone tokenmaxx and move at AI speed > add token budgets and kill the vibe > lay off 50% of people and give the rest unlimited tokens
译越来越多的工程师现在在AI token上花费的钱比他们的基本工资还要多。 科技公司面临一个残酷的两难选择: > 让每个人尽情使用token并以AI速度前进 > 增加token预算并扼杀氛围 > 裁掉50%的人,给剩下的人无限token
Deploy Step 3.7 Flash on @modal with SGLang 🚀 Modal is a serverless AI platform for deploying and scaling compute-intensive workloads without managing infrastructure. Their new guide shows how to serve our open-weight Step 3.7 Flash with SGLang on Modal, using 8×H100 GPUs, Modal Volumes, and an OpenAI-compatible chat completions endpoint. Excited to collaborate with Modal to make StepFun models more accessible to builders. https://modal.com/docs/examples/stepfun_inference
译在 @modal 上用 SGLang 部署 Step 3.7 Flash 🚀 Modal 是一个无服务器 AI 平台,用于部署和扩展计算密集型工作负载,无需管理基础设施。 他们的新指南展示了如何在 Modal 上使用 SGLang 服务我们的开源权重 Step 3.7 Flash,采用 8×H100 GPU、Modal Volumes 以及兼容 OpenAI 的聊天补全端点。 很高兴与 Modal 合作,让 StepFun 模型更易于构建者使用。 https://modal.com/docs/examples/stepfun_inference
This is probably GPT-5.6. Either tomorrow or coming week i suppose. Get ready friends. We are in for a wild ride!
译这大概是 GPT-5.6。要么明天,要么下周,我想。 朋友们,准备好了。我们即将迎来一场狂野之旅!
If this prompt feels well written to you, it's because Suzanne is a writer in her little spare time! You can read her short story, Mall of America here: https://suzannewang.com/mall-of-america It's one of my favorite short stories about the human condition that happens to involve AI.
译如果这个提示词让你觉得写得很好,那是因为Suzanne在业余时间是一名作家! 你可以在这里阅读她的短篇小说《Mall of America》:https://suzannewang.com/mall-of-america 这是我最喜欢的关于人类境况且恰好涉及AI的短篇小说之一。
Most people, including really accomplished people, don't have an accurate mental model of how LLMs operate (and why would they?) You see this in wide beliefs that AI is just copying from known sources, or that it only produces average answers, or that it can't generate new ideas
译大多数人,包括非常有成就的人,对LLM的运作方式没有准确的认知(他们凭什么有呢?) 你可以从广泛的观念中看到这一点:认为AI只是从已知来源复制,或者它只能产生平均水平的答案,或者它不能产生新想法。
Great demo by @atomic_chat_hq. Step 3.7 Flash was designed for real-world agentic coding tasks — not just generating code fast, but keeping logic, visuals, and execution coherent across complex outputs. Love seeing builders test it in creative ways!
译阶跃星辰(StepFun)称其 Step 3.7 Flash 在与 DeepSeek V4-Flash 的物理编程测试中全面胜出。测试要求在不使用库的情况下,生成一个包含高尔顿板、旋转六边形弹球和同步节拍器三个场景的自包含 HTML5 canvas 动画,并实现真实物理。Step 3.7 Flash 输出 59.6k tokens(耗时 9分57秒),DeepSeek V4-Flash 输出 52.5k tokens(耗时 6分21秒)。尽管 DeepSeek 更快,但 StepFun 模型在物理模拟、视觉效果和逻辑渲染上均占优。主推文指出 Step 3.7 Flash 专为真实世界 agentic 编码任务设计,能保持复杂输出中逻辑、视觉和执行的一致性。
This SkillOpt paper from Microsoft is a must-read! (bookmark it) I was a bit skeptical of the results reported in the paper when I shared it a few days ago. However, I managed to integrate it into my agent orchestrator and ran a few experiments. The results are mindblowing. Essentially, all my agent skills now have a proper testing framework and a way to self-evolve. I have started to improve all my agent skills with this. One exciting result was when I applied it to my paper-figure-extraction skill, which requires an agent to do multimodal analysis. In particular, it improved quality by +20 points (0.73 → 0.93). I went to see the extracted tables and figures, and I was absolutely stunned by how much better my skill got at the task. Self-improving AI is in the early days, but I think this work is a clear example of the current ability of agents to self-improve. In this case, it was skills, but it's not hard to imagine how this scales to optimizing agent patterns, tool use, context engineering efforts, agentic search, workflows, evals, and even the harness itself. I already started with a few of these ideas inspired by SkillOpt. Stay tuned!
译DAIR.AI的Elvis Saravia将微软SkillOpt论文集成到智能体编排器中后,所有智能体技能获得测试框架与自我演化机制。应用于多模态论文图表提取技能时,质量评分从0.73提升至0.93(+20点),提取结果显著改善。Saravia认为这是自我改进AI的早期范例,该思路可扩展至智能体模式优化、工具使用、上下文工程、智能体搜索及工作流评估等环节。他已基于SkillOpt启动多项后续实验。
Dr. Feifei Li, CTO of Alibaba Cloud & Tommy Eastman @yeahfortommy , Head of Strategy, @NousResearch As we orchestrate intelligence at scale, reshaping knowledge work, giving agents autonomy requires reproducible actions—the core secret behind Hermes agent's success.
译阿里巴巴集团首席技术官李飞飞博士与NousResearch战略主管Tommy Eastman 我们大规模编排智能,重塑知识工作,赋予智能体自主性需要可复现的行动——这就是Hermes智能体成功的核心秘诀。
GPT 5.5 Pro 调研生成了一份关于 Codex 的Goal指令如何用的文档。 仔细阅读学到了两个技巧: 1. 觉得写不好goal时,先用plan模式,让AI反问自己一些问题,让AI帮收敛写Goal指令。 提示词模板: /plan Help me turn this vague task into a strong Codex goal. Interview me for missing success criteria, verification commands, constraints, boundaries, iteration policy, and blocked stop conditions. Then draft a final `/goal ...` command. 2. 写好Goal的六要素:结果、验证、约束、边界、迭代和阻塞条件 官方标准模板如下: /goal [Outcome]. Verification: [commands/artifacts/evidence]. Constraints: [what must not change]. Boundaries: [allowed writes / forbidden paths]. Iteration policy: [one focused change, rerun checks, log progress]. Stop when: [evidence proves completion]. Pause if: [blocked conditions / human decisions / budget cap]. 详细调研报告见评论区,有不少模板可直接用。
译GPT 5.5 Pro 调研生成了一份 Codex 的 Goal 指令使用文档,分享两个技巧:1. 写不好 Goal 时先用 /plan 模式,让 AI 反问用户来完善命令,提示词模板为 `/plan Help me turn this vague task into a strong Codex goal...`;2. 写好 Goal 的六要素:结果、验证、约束、边界、迭代策略、阻塞条件。官方标准模板为 `/goal [Outcome]. Verification: [...] Constraints: [...] Boundaries: [...] Iteration policy: [...] Stop when: [...] Pause if: [...]`。详细报告含多个可直接使用的模板。
"Engineering, product, and design are all merging into a 'builder' role" Yeah... I'm not so sure. This feels like an oversimplification and podcast talking point. Reality is a lot more complex. Even with 1000 "Member of Technical Staff" titles, someone still has to wake up and care 100x more about Product or Design than anyone else. It is their Main Thing™ That's not to say MTS titles are universally bad, but I think they're an example of this 'builder' talking point that's become bastardized. AI and coding agents have made generating code easy and yet... you're in for a world of pain if non-engineers ship a bunch of slop and don't have great engineers to tame the complexity. The SF hivemind has a tendency to overfit what works at startups for every company. And to be fair, sometimes this is true! Startups can be a leading indicator for how the industry is changing and often cause disruption. However, it is going to be incredibly hard to disrupt the extremely human parts of corporate jobs. You really think there's going to be a PM who also does some engineering and design on the side at JPMorgan Chase? This is true for the simple parts of most jobs, like people wanting to have ownership over something and do good work, move up a career ladder, support their family, get paid well, make an honest living... And also the hard parts: internal politics, some critical business system that has a bus factor of 1 which has been running for 15 years and isn't documented anywhere because it's that guy's job security. The real world has a lot of this stuff. It's easy to pontificate about all roles collapsing but it's actually really nice to have a specific person or team who is an expert in one thing that you can work with. I don't expect that to change. Further, I think AI disruption to knowledge work will take decades to play out because it is more fundamental to the human condition (e.g. sociological/organizational) than pure intelligence.
译Lee Robinson 认为该说法是过度简化的播客话术。现实更复杂:即便大量“技术专家”存在,仍需要有人百分百专注产品或设计;AI 虽让生成代码变易,但缺乏优秀工程师会导致灾难。硅谷常把创业公司经验套用于大公司,却难以颠覆内部政治、遗留系统等极度人性化的部分。他判断 AI 颠覆知识工作需要数十年,因为本质是社会/组织问题,而非纯智力问题。
That feeling of being trapped in latent space
译用户指令要求修改屏幕,使其显示为正在打FaceTime电话。主推文感叹:被困在潜在空间中的那种感觉。
A key lesson of the last year of building open models, once it became so obvious the US is behind, is that talk is cheap. Many people say they're helping / want to help but actually don't do anything. Finding the few people who genuinely push open forward is crucial.
译过去一年构建开放模型的一个关键教训,当美国明显落后这一点已变得如此清晰时,就是空谈是廉价的。 许多人说他们在帮助/想要帮助,但实际上什么都没做。找到那些真正推动开放进步的人是至关重要的。
Grok Imagine is pretty cool for product marketing. Turn a quick phone photo into a professional ad in seconds.
译Grok Imagine 在产品营销方面相当不错。 将手机随手拍的照片在几秒钟内变成专业广告。
Codex 小技巧:一台电脑远程指挥另一台写代码 如果你多台电脑都安装了 Codex,且登录ChatGPT账号。 可以在设置 -> 连接 -> 控制其他设备,添加其他电脑。 这样设置后,本机创建项目时,能选添加远程项目。 比如远程控制家里电脑中的Codex工作。
译若多台电脑均安装 Codex 并登录同一 ChatGPT 账号,可在设置 -> 连接 -> 控制其他设备中添加其他电脑。之后本机创建项目时即可选择添加远程项目,例如远程控制家中电脑上的 Codex 进行代码编写。该功能无需额外配置,利用账号同步实现跨设备协作。
> Change the screen so it shows that she's on a facetime call
译更改屏幕,使其显示她在FaceTime通话中。
Capafy has released 5 pre-made e-commerce Skills, each built by an operator who has spent years on the store-side front line, with their hands-on playbook packaged into an agent that anyone can now run. The set covers 👀 > Commerce Video Ad Maker > Amazon Listing Image Generator > Paid Ads Diagnose > Amazon Listing Fix Kit > Amazon/TikTok/eBay SOP Generator
译Capafy 推出 5 个预制电商技能,每个均由一线运营者将实践手册打包成 AI 智能体。包括:Commerce Video Ad Maker(上传 1-3 张产品图生成适配 TikTok Shop、Amazon、Meta 等平台的广告视频);Amazon Listing Image Generator(按买家疑虑顺序生成主图到画廊,用 GPT Image 2 检查合规性);Paid Ads Diagnose(分析 ROAS 下降或 CPA 飙升原因,给出扩量或暂停建议);Amazon Listing Fix Kit(逐行检查详情,提供重写文案、7 图计划、A+ 内容和索赔安全标识);Amazon/TikTok Shop SOP Generator(生成逐条执行的 SOP 并标注违规风险)。Capafy 定位为技能智能体市场,支持上传技能并按次收费。
Google Cloud revenue showed a +63% y/y growth this past quarter. Microsoft Intelligence Cloud revenue showed a +30% y/y growth this past quarter. AWS revenue showed a +28% y/y growth. Despite this, AWS' margins increased 213bps q/q while the other CSPs lagged behind. How you sell tokens is become equally important to how much of it you sell. Bedrock's TaaS (token-as-a-service) business model with Anthropic has 3 parts: 🟠 fixed IaaS fee, 🟠 revenue share of the tokens, 🟠 and performance hurdles that trigger outperformance payments above certain token/spend thresholds. The risk with this business model is that there's no guaranteed take-or-pay floor so revenue can miss if adoption stalls but their bet paid off, primarily driven by Anthropic's addition of $21B net new ARR in a single quarter.
译Google Cloud营收同比增长63%,Microsoft Intelligence Cloud增长30%,AWS增长28%。但AWS利润率环比提升213bps,领先其他云服务商。AWS Bedrock与Anthropic采用Token-as-a-Service(TaaS)商业模式,包含三部分:固定IaaS费用、token收入分成、以及超额绩效支付(达到特定token/消费阈值触发额外付款)。该模式风险是无保底收入,但赌注成功,Anthropic单季度新增210亿美元净新ARR。
发现老黄简直就是个行走的拉盘神器, COMPUTEX 2026 台北国际电脑展, Nvidia 市值5万多亿的黄仁勋,逛展会逛累了,直接跑到技嘉展台,席地一坐,跟技嘉老总李宜泰喝起来了。 旁边围了一圈人,他完全不在意,地上坐了近 10 分钟。 技嘉股价当场就被拉了一下, 估计很多人都纳闷,:老黄和技嘉到底铁到什么程度?这么捧场? 上上届 COMPUTEX 他公开喊过 "GIGABYTE NO.1",这次直接坐人地盘上喝啤酒——是真把合作伙伴当兄弟。 而且有个规律很硬,COMPUTEX 期间老黄一出现,相关供应链股票经常大涨,技嘉最近参会已经五连涨超 20%,这个视频一出,盘中又被带了一波。 所以怎么看这个信号? 第一层是股价信号,他在哪里坐下,市场的钱就跟到哪里, 第二层更深,他没去敲钟的展台,而是去长期合作伙伴的地盘坐下来聊天 ,这说明 Nvidia 的供应链逻辑里,技嘉的位置在加深,而不只是贴个牌。 对看供应链的人来说,老黄的行程表比研报值钱。
译黄仁勋在COMPUTEX 2026上逛至技嘉展台,席地而坐与技嘉老总喝啤酒近10分钟,引来围观。技嘉股价当场被拉,期间已五连涨超20%。深层信号显示Nvidia供应链逻辑中技嘉地位加深。引用推文回顾:2009年Nvidia市值仅40亿美元(Intel 1000亿),黄仁勋押注CUDA和异构计算,17年后Nvidia市值5万亿,Intel约五千亿,25倍劣势变为近10倍反超,体现其远见与护城河。
Yo @xai team, this would be an amazing demo of @grok capability. Push button, have it read all your bookmarks, organise them, make a report on the most interesting one and your interests over time etc
译嘿 @xai 团队,这将是一个展示 @grok 能力的绝佳演示。 一键操作,让它读取你所有的书签,整理它们,就最有趣的书签以及你长期以来的兴趣生成报告等。
Fantastic in depth guide about Microsoft MAI by @eliebakouch tl;dr about the model: Respect where respect is due. -zero synthetic data or distillation from previous models. -1T model with 35B active, trained on 33.5T tokens
译Microsoft MAI 技术报告公开模型细节:1T 总参数,35B 活跃参数,在 33.5T tokens 上训练。最突出的特点是零合成数据、零知识蒸馏,推理、智能体行为、工具使用全部在后训练中从头学习。报告透明度极高,首次在此规模公开各迭代的 MFU 和完整缩放方案,目标成为前沿实验室。
Agent performance is no longer about cost per token, but the cost to finish the whole task. We must treat inference as a whole operating system to turn tokens into real business value.
译智能体性能不再取决于每个token的成本,而是完成整个任务的成本。我们必须将推理视为一个完整的操作系统,将token转化为实际的商业价值。
OpenAI's usage pattern from CFO Sarah Friar's new interview. "Our free users do about seven turns, or seven questions, a day. Our first paid tier does double that, about 15. Our real paid tier, Plus, which is $20, is about 3x, and Pro is about 11x over a free user." Our mission at OpenAI is AGI for the benefit of humanity, not for the benefit of humanity who can pay, or for the benefit of humanity who live in an enterprise" ---- From @theallinpod YouTube channel, (link in comment)
译OpenAI 的使用模式来自 CFO Sarah Friar 的最新采访。 “我们的免费用户每天大约进行七轮对话,也就是七个问题。我们的第一个付费层级是这个数字的两倍,大约 15。我们的真正付费层级 Plus,价格为 20 美元,大约是免费用户的 3 倍,而 Pro 大约是免费用户的 11 倍。” OpenAI 的使命是为了全人类的利益实现 AGI,而不是为了那些能付费的人,也不是为了那些在企业工作的人的权益。 —— 来自 @theallinpod YouTube 频道,(链接在评论中)
当 AI 成为默认工作方式,工程团队如何改变? Claude Code / Claude Cowork 工程负责人 Fiona Fung 在 Code w/ Claude SF 2026 给咱们分享了「如何管理一个 AI-native 工程团队」。她的主要判断是:在 Claude Code 团队里,写代码、写测试、重构已经很少成为主要限制,新的限制变成了验证、代码评审、安全和专业判断。 https://claude.com/blog/running-an-ai-native-engineering-org # 四个研发流程变化 1. 规划:从半年路线图转向及时规划 Fiona 说,Claude Code 团队曾经写过一份不错的六个月路线图,但因为变化太快,到第三个月就过时了。于是他们把规划从重文档、重长期计划,转向原型、内部用户反馈和更短周期的判断。 这不是说不规划,而是规划的颗粒度变了。越是 AI 加速明显的团队,越不适合把大量时间花在远期细节上。合理做法是保留方向判断,把执行细节放到更接近真实验证的时间点。 2. 上下文获取:从找人,变成先问系统 传统工程团队遇到问题,常常先找“谁写了这段代码”。但如果大量 PR 都由 Claude 辅助完成,只知道开发作者已经不够。文章建议更深入地问:你到底想知道什么?是找回归原因、找某个决策背景,还是找能回答客户问题的人? 这里的变化很关键:知识不再只绑定在人身上,而要尽量沉淀到代码、PR、日志、反馈和自动摘要里。团队管理的重点也从“问谁”变成“如何让上下文可被检索、可被解释、可被复用”。 3. 代码评审:AI 处理常规问题,人处理专业判断 文章提到 Claude 会大量参与样式、lint、PR 反馈、bug 发现、修复和测试补充;但法律风险、安全边界、产品判断、设计品味这些仍然需要人。 这说明代码评审的价值正在重新分层。低层次的一致性检查、常见 bug、测试补齐,应该更多自动化;高层次的架构判断、安全责任、业务取舍,仍然要由有经验的人负责。 这也是很多团队容易误解的地方:AI 不是让人退出评审,而是让人从琐碎检查中移出来,把注意力放在更难、更有责任的问题上。 4. 团队结构:角色边界变模糊,但深度专业仍然重要 文章提到 PM 开始写代码,工程师也会承担内容和设计相关工作。团队更看重两类人:有产品感觉的创造型建设者,以及有深厚系统能力的工程师。相对而言,单纯“写得多、写得快”的价值下降,因为模型已经能承担大量产出。 这点很现实。AI 会扩大非传统工程角色的能力范围,但并不会消除专业深度。恰恰相反,当更多人都能生成代码,真正稀缺的是:判断要做什么、如何保证可靠、如何处理复杂系统约束。 # 组织管理上的真正变化 第一,流程不能永久存在。很多流程当初是为了解决某个问题,但问题消失后,流程往往还在消耗团队时间。AI 加速后,团队要更频繁地审视哪些会议、文档、审批、评审已经不再有必要。 第二,组织要把“默认使用 AI”变成共同原则,而不是个人偏好。Claude Code 团队要求成员持续使用自己的产品,包括跨职能伙伴也使用 Claude Code 和 Claude Cowork。这会让团队更快发现真实问题,也能形成一致的工作方式。 第三,管理层需要贴近一线。文章提到希望 manager 先作为 IC 参与交付,理解团队真实工作方式。在 AI 改变开发流程时,只靠传统管理汇报,很容易低估变化速度,也容易保留过时流程。 # 可以跟踪的三个指标(建议工程负责人关注) 1. 新成员多久能有效工作。Claude Code 团队认为,现在新人可以在第一周就交付真实代码。 2. PR 周期是否变短。如果代码生成速度上来了,但 CI、构建、评审跟不上,瓶颈会转移到工程平台。 3. AI 辅助提交比例是否上升。但作者也提醒,不要把产出量本身误认为成功,真正要衡量的是团队原本想解决的问题。
译Claude Code 工程负责人 Fiona Fung 在 Code w/ Claude SF 2026 分享管理 AI-native 团队经验:写代码不再是瓶颈,验证、评审、安全与专业判断成为新限制。四个流程变化:规划从半年路线图转向短周期原型与反馈;上下文获取从“问谁写的”转为沉淀到代码/PR/日志;AI 处理常规代码评审,人负责法律/安全/业务判断;团队角色模糊但深度专业仍稀缺。组织上建议定期清理过时流程、默认使用 AI、管理者贴近一线。可跟踪新人首周交付真实代码、PR 周期变短、AI 辅助提交比例,但产出量不是成功本身。
I need to see a video of two of these playing each other in real life.
译一位开发者使用强化学习在模拟中训练AI智能体,随后部署到真实的机器人空气曲棍球台上。该机器人能以毫米级精度跟踪曲棍球,反应时间约20毫秒,足以挑战熟练的人类玩家。这标志着从预设编程规则到模拟学习后在物理世界执行的转变。主推文作者期待看到两个这样的机器人进行真实对战。
被 AI 不听话折磨了大半年,终于找到解法了 发现一个开源项目 OpenSquilla,国内团队做的 他们用 Python 把"小龙虾"重写了一遍 解决了它太费token、不按照规则执行以及安全的问题 100 次对话就能省下 100万 Token 先说省钱: 它集成了一个本地的小模型,你发的每一个请求,在真正发给大模型之前,会被这个小模型极速向量化,分析这个请求到底是简单任务还是复杂任务。简单的发给便宜模型,复杂的才派顶级模型上场。 就跟医院分诊台一个道理,感冒发烧不用挂专家号。 关键是这个分类在本地跑,不花 token,速度极快,基本感知不到。 官方跑了个测试,25 个任务,纯用 Claude Opus 4.7 总成本 6.2 美金,用 OpenSquilla 路由 Opus4.7、GLM5.1、DS4 Flash 混着跑,分数几乎一样,成本只要 0.68 美金。同样的效果,成本砍到九分之一! 这下我终于敢把 Opus 和 GPT 接进去了!每轮对话还会显示本轮省了多少 token。 而且省 token 不只省在模型调用上。 我装了九十多个 Skill,每轮对话都把所有 Skill 的 description 全塞进上下文里,算了一下每轮要消耗 9000 左右 Tokens。 OpenSquilla 会根据当前对话语义只注入匹配度最高的几个 Skill,按我的规模大概 100 次对话就能省 100万 Token
译国内团队开源项目OpenSquilla用Python重写“小龙虾”,解决费token、不按规则执行及安全问题。它集成小模型对请求实时分类:简单任务走便宜模型,复杂任务走顶级模型。测试25个任务,纯Claude Opus 4.7成本6.2美金,OpenSquilla混跑Opus 4.7、GLM5.1、DS4 Flash成本仅0.68美金,分数几乎一样。同时,它根据对话语义只注入匹配度最高的Skill(原90+个),每轮省约9000 Token,100次对话累计省100万Token。
分享一个让Agent额度翻倍的小技巧。 之前发Codex教程的时候,评论区有一条留言被顶到了最高赞,是一个关于5小时额度窗口的小技巧。 然后发现很多朋友都说第一次知道,我觉得可以单独拿出来再给大家说一下。 先说原理。 不管是Codex还是Claude Code,它们的额度限制都不是每天重置或者每小时重置,而是一个5小时的滚动窗口。 也就是你发第一条消息的那一刻,5小时倒计时就开始了,这5个小时内你有一定的Token额度可以用,用完了,就得等这个窗口走完才能重置。 但这里有一个很多人不知道的细节。 5小时窗口结束之后,系统并不会自动帮你开启下一个窗口,它会一直等,等到你发出下一条消息的那一刻,才重新开始计算新的5小时。 比如你每天下午2点到6点是集中用Agent工作的时间。 如果你2点才开始用Codex,窗口就从2点开始算,到晚上7点才重置。中间如果用的比较猛,3点半额度就见底了,你得干等到7点,这基本就要当3个半小时的原始人了。 但如果你在上午11点的时候,提前给Codex发一条消息,哪怕就随便说一句话,窗口就从11点开始计算了,等于下午4点就重置了。 你2点开始干活,干到4点额度刷新了一波,4点以后,你又有一整个新窗口可以用。也就是说在2点到6点的核心工作时间里,你能享受的5小时额度窗口,直接从一个窗口变成了两个。 变相让你的额度变成了两倍。 原理就这么简单,提前触发窗口,让重置时间刚好落在你干活的中间。 很多人用了大半年agent,每次撞限了就硬等,因为可能确实不知道这个重置时间是可以自己控制的。 所以你只要理解了窗口的重置是可以人为控制的这一点,玩法就打开了,只要搭配上自动化,你就可以享受两倍额度窗口了。 说下怎么设置。 Codex比较简单,在左边菜单找到自动化,点进去以后新建一个,触发条件选「每天」,时间填你主要干活前的3小时,动作就是随便发一条短消息,内容无所谓,写个“叫我一声爹”都行。 设好之后就不用管了,每天到点它会自动跑一下,帮你把窗口提前激活。 Claude如果你有客户端,也是一样的,设置一个Routines自动化就行。 如果是CLI版,Mac就直接跟你的Agent说: “帮我设一个crontab定时任务,每天上午11点自动给Claude Code发一条消息“叫我一声爹”触发5小时窗口” Windows就用任务计划程序,也可以直接让Agent帮你配。 不过这里要提醒一下,5小时窗口是一层限制,但上面还有一个周额度的上限,所以不用贪心,让重置时间跟你的工作节奏对上就够了。 以上,希望对大家有用。
译Codex和Claude Code的额度限制采用5小时滚动窗口,从用户发送第一条消息开始计时,用完需等待窗口结束才能重置。但窗口结束后系统不会自动开启新窗口,需等到下一条消息才重新计时。利用此机制,可在主要工作时段前3小时(如上午11点)提前发送一条消息激活窗口,使重置时间落在工作时段中间(如下午4点)。这样在2-6点的核心工作中,能享受两个5小时窗口,变相将额度翻倍。设置方法:Codex可在自动化中创建每日定时任务发送短消息;Claude CLI可通过crontab(Mac)或任务计划程序(Windows)实现。注意仍有周额度上限,适度使用即可。
http://x.com/i/article/2062080260586283008 # xAI 视频多模态负责人访谈:视频模型的天花板,其实是语言模型 一个在英伟达造出 Cosmos 世界模型、又在 xAI 三个月从零搭出 Grok Imagine 的人,离职时说的理由是:视频模型最大的瓶颈,其实是语言模型。 Laten Space最近访谈了Ethan He,内容很不错,让AI转写一篇文章。 > https://www.latent.space/p/video-agents ## 他是谁,做过什么 Ethan He 是一位多模态 AI 研究员,职业轨迹横跨图像识别、自监督学习、大规模模型训练和视频生成。 在英伟达期间,他是 Cosmos 视频基础模型的核心作者之一。 Cosmos 是一个大规模视频生成模型,目标是模拟物理世界,作为机器人研究的基础底座,于 2024 年底发布。 2025 年中,他加入 xAI,主导 Grok Imagine 的视频和多模态方向,包括: - 音频视频联合生成(Grok Imagine 0.9) - 视频扩展(Video Extension,支持完整历史上下文的长视频生成) - 参考视频生成(Reference-to-Video,支持上传最多 7 张图片作为角色或场景条件) - 内部世界模型团队(专注实时长时程视频生成) 访谈时他刚刚离开 xAI,准备转向语言模型方向的研究。 ## 三个月从零到视频模型,靠的不是算法 加入 xAI 时,团队没有数据、没有基础设施、没有现成模型,只有几个工程师。 三个月后 Grok Imagine 0.9 发布。 他总结了两个关键因素。 第一是人的密度,而非人的数量。 团队里每个人都很强,目标高度一致,沟通成本极低。 每天只有一个例会,其余时间全部用来构建。 他的观察是:小团队减少沟通带宽,反而比大团队更容易快速迭代。 第二是迭代速度,而非单次训练质量。 他的核心判断是:训练模型最重要的指标,不是某次实验的结果有多好,而是每天能跑多少轮实验。 迭代越快,发现 bug 的机会越多。 而且他特别强调:模型质量最大的提升,往往不来自新算法,而来自数据管道和训练流程里那些不起眼的小 bug。 这听起来有点怪,但这是他在英伟达和 xAI 两次从零搭建视频模型的共同经验。 他还提到一个时间节点:2025 年中加入时,代码模型还不够好,写出来的代码经常是几千行的"意大利面条",连模型自己都搞不清楚怎么维护。 到 2025 年 12 月,代码模型已经强到可以快速实现任何想法。 这带来了一个新的瓶颈反转:以前是写代码慢,现在是算力跟不上想法的速度。 代码几小时就能写完,但训练一个新模型可能要等好几周。 ## 视频模型是怎么炼出来的:完整路径 第一步:先训图像模型,再训视频模型 原因很实际,图像比视频便宜得多,而且语言和图像之间的对应关系更密集。 举个具体数字:训练 10 亿张图文对,和训练 10 亿个视频文本对,成本完全不在一个量级。 但前者能给模型打下更扎实的语言理解基础。 视频模型对语言的理解,完全来自这种文本到视觉内容的映射关系。 如果映射数量不够,模型就不能充分理解人类意图。 所以标准做法是:先训图像扩散模型,再用它作为基础,迁移到视频模型。 第二步:解决数据对齐问题 互联网上的视频天然缺少精准的文字描述。 YouTube 上的标题和评论,和视频内容本身几乎没有关联。 一段山川自然风光,标题可能是"今天真开心",二者毫无关系。 所以必须用 VLM(视觉语言模型,能同时理解图像和文字的 AI 模型)给视频打字幕,生成合成的文本视频对。 但 VLM 本身在早期也需要人工标注来冷启动。 Cosmos 的标注要求非常具体:描述要详细到让一个盲人听完文字,就能在脑海中重建出这段视频。 所有物体、角色、互动、对话,都要覆盖。 这个标准直接决定了后来视频模型能不能真正理解人类意图。 第三步:训练 VAE(变分自编码器,一种把图像压缩成低维表示再还原的压缩器) 原始视频帧的像素量太大,1000×1000 的图像就有 100 万个像素,Transformer(一种主流的 AI 模型架构)根本无法直接处理。 VAE 把图像映射到一个低维的连续潜空间(latent space,可以理解为图像的"压缩编码"),再从潜空间还原回图像。 具体做法是把图像切成小块(patch),每个小块映射成一个向量,这样一张图就变成了几十个向量,而不是 100 万个像素。 时间维度的压缩比例是个关键决策。 Wan 2.1 采用 8×8×4 的压缩率,时间维度压缩 4 倍,上下文长度大幅缩短,训练效率更高。 但代价是实时性:如果要做实时交互,时间维度的压缩会引入固定的延迟,无法做到即时响应。 如果不压缩时间维度,只做帧内压缩(8×8×1),上下文长度会是 4 倍压缩方案的 4 倍,计算量大得多,但可以支持帧级别的实时输出。 第四步:训练扩散 Transformer(Diffusion Transformer) 流程和语言模型非常相似,区别只是输入输出换成了视觉 token(图像的压缩表示),以及加入了去噪过程:向视觉 token 加入随机噪声,训练模型把噪声去掉,推理时从纯噪声开始迭代生成干净的图像或视频。 推理侧的主要优化手段是步骤蒸馏(Step Distillation):用完整模型跑 100 步生成高质量结果,再训练一个只需要 10 步的小模型去模仿它。 这背后的逻辑是:完整模型要学的是整个互联网的图像分布,极其复杂,蒸馏模型只需要学老师模型的分布,简单得多。 Cosmos 的生产版本已经可以做到 4 步甚至 1 步生成(针对图生图等简单任务)。 ## 训练一个视频模型到底要花多少钱 Ethan He 做了一个粗略的估算,数字很有参考价值。 模型规模: 视频模型和中等规模语言模型相当。 LTX 是 19B(190 亿)参数的稠密模型,也有人在探索 MoE(混合专家模型,一种让模型只激活部分参数的架构),激活参数约 20B,总参数可能达到数百 B。 Cosmos 公开披露的视觉 token 数量也在数十万亿量级,和语言模型的训练规模接近。 存储成本: 假设有 10 亿个视频,每个视频 5MB,光存储就需要 5PB(5000TB)。 加上 VAE 提取的特征文件,总存储量翻倍,达到约 10PB。 在 AWS S3 标准存储上,5PB 的月存储费用约 23 万人民币,加上数据出口费用,每月总成本可能达到数百万人民币,还没算 GPU 训练成本。 他特别提到:数据出口费用(把数据从云端传输出去的费用)比存储本身更贵。 每次训练都需要把数据拉取一遍,如果多次训练,费用成倍叠加。 这也是为什么大规模训练团队通常会自建存储基础设施,而不是完全依赖公有云。 I/O 瓶颈: 视频训练天然是 I/O 密集型任务,数据加载速度很容易成为 GPU 利用率的瓶颈。 Ethan He 在英伟达做 Cosmos 时专门做了大量 I/O 优化。 ## 世界模型的定义:三个缺一不可的条件 Ethan He 给世界模型下了一个工程意义上的定义,三个维度。 交互性: 模型可以响应键盘、鼠标、语音等多种输入,并给出合理的反馈。 实时性: 响应延迟要足够低。 CS 职业选手需要亚 3 毫秒的响应(300FPS 对应约 3 毫秒每帧),60FPS 游戏需要 16 毫秒,实时语音交互的容忍上限大约是 200 毫秒。现有视频模型大多达不到这个要求。 长时程: 不是生成几秒钟的片段,而是能持续生成几分钟甚至几小时的内容,同时保持角色、场景、声音的一致性。 三个条件同时满足,才算世界模型。 目前的视频模型在任何一个维度上都还有很大差距。 长时程的工程难题 Cosmos 里 5 秒视频就有约 55K 到 60K 个 token,50 秒就是 500K token,再长就很难处理。 现有视频模型的上下文窗口大约在几百万 token 量级,但实际使用中很快就会爆炸。 Ethan He 在 xAI 主导的视频扩展(Video Extension)功能,是迈向长时程的第一步:让模型在生成下一段视频时,能访问之前所有视频的完整历史上下文,而不只是最后一帧或最后一秒。 这解决了多次扩展后视频质量退化、人物声音漂移的问题。 参考视频(Reference-to-Video)是另一个折中方案:允许用户上传最多 7 张图片作为条件,让模型在生成时参考特定角色或场景。 Ethan He 自己也承认这是个"作弊"方案,真正的解法是让模型自己学会从历史中选择性地提取相关上下文。 FramePack(一篇论文提出的方法)提供了一个启发式思路:最近的历史保留完整分辨率,越早的历史压缩得越小,总 token 数保持固定上限。 这和人类记忆的工作方式有些相似,但 Ethan He 认为更理想的状态是让模型自己决定哪些历史值得保留,而不是靠人工设计的规则。 ## 视频模型最大的进步,来自语言模型 这是整个访谈里最反直觉的判断,也是 Ethan He 离职的核心原因。 扩散模型本身其实很"笨" 扩散模型(Diffusion Model,一种通过去噪生成图像或视频的模型)在训练时被要求按照极其详细的文字描述生成视频,所以推理时也会字面理解用户的输入。 你说"一只猫",它就生成一只猫,白色背景,静止不动,因为你没说背景,没说动作。 它取的是训练数据里那种极度详细的描述风格,用户的简短输入和这个分布完全不匹配。 提示词重写器才是真正的智能来源 真正让模型变聪明的,是提示词重写器(Prompt Rewriter),一个更大的语言模型,负责把用户的简单描述扩展成极其详细的视频描述。 Cosmos 用的是 Llama 或 Mixtral,而且提示词重写器比视频扩散模型本身(7B 参数)还要大。 他举了一个具体例子:同样是生成一只快乐的羊,不经过重写,结果看起来像 CGI;经过重写之后,画面质量有质的飞跃,而且这个提升不需要任何联合训练。 GPT Image 生成一张图需要 3 分钟,其中大部分时间不是在生成像素,而是在"思考",也就是提示词重写和规划阶段。 语言模型的角色还在扩展 提示词重写只是第一步。 现在语言模型在视频生成中的角色已经扩展到: - 工具调用: 生成图片前先联网查今天的新闻,处理后再生成 - 智能体协调: 调用视频生成、视频编辑、图像处理、FFmpeg 等多种工具,迭代生成高质量内容 - 布局规划: 决定视频的结构、时间线和内容组织 Grok Imagine 已经有了一个智能体模式的早期版本,可以通过调用不同工具来生成更长的视频。 ## 音频:被低估的难题 Grok Imagine 0.9 是 Ethan He 认为业内首个大规模部署的音频视频联合生成模型。 音频的难点在于它有两个截然不同的成分: - 语音部分: 接近离散 token(可以理解为有限词汇表里的单词),可以用类似语言模型的方式处理 - 音乐部分: 完全连续,无法离散化,现有语言模型对音乐的理解非常有限 让语言模型描述音乐细节,就像让盲人描述颜色一样困难。 大多数语言模型可以识别"这是哪首歌",但无法描述音乐的节拍、音调和细节,更无法生成高质量的音乐。 更大的挑战是时间对齐。 文本和图像之间的对应是松散的,你可以用一段话描述整张图。 但音频和视频必须在时间轴上精确对齐:哪一秒有什么声音,必须和画面严格同步。 这种精确的时间感知,是现有多模态模型普遍缺失的能力。 ## 生成式 UI:扩散模型作为前端 访谈中展示了两个产品案例,代表了 Ethan He 对未来交互方式的判断。 Flipbook: 一个用图像生成模型实时渲染的浏览器界面。 页面里的所有内容都是模型生成的,公司不存在,场景是虚构的。 用户点击链接,模型就生成新的子页面。 比如点击"金字塔建造技术",模型会生成一个详细介绍杠杆技术的新页面,配有对应的生成图像。 Neural OS(神经操作系统): 用视频模型模拟一个完整的操作系统,可以运行 Doom、Firefox 等应用,所有画面都是模型实时生成的。 Ethan He 的预测是:随着推理成本下降,扩散模型会成为人机交互的前端层,语言模型和确定性代码在后端运行,用户看到的所有界面都由生成模型实时渲染。 每个用户可以有完全不同的界面,邮件可以像 TikTok 一样滑动,Instagram 可以去掉你总是误触的点赞按钮。 他估算了一下成本:如果每 100 个请求 1 美元,每天用 8 小时,每月大约 240 美元。 现在确实贵,但推理成本每年大约下降 2 倍,他认为这个未来会到来。 他还提出了一个关于人机带宽的判断:人类的最大输入带宽是视觉(看),最大输出带宽是语音(说)。 所以未来最自然的人机交互方式,是用户说话,AI 用生成式画面回应,这是神经链接(Neuralink)出现之前的最高带宽交互形式。 ## 为什么离开 xAI Ethan He 的回答很直接:有些研究在公司里做不了,而且公司的优先级会快速变化。 他想做的,是语言模型方向的研究,特别是模型如何自主管理自己的上下文。 他的具体预测是:语言模型很快会出现真正的上下文感知能力,模型知道自己的上下文窗口用了多少,能主动决定压缩、丢弃或保留哪些内容,而不是依赖外部 harness(智能体框架,一种包裹模型的工程系统)的启发式规则。 他举了一个例子:现在 Claude(Anthropic 的 AI 模型)在上下文接近上限时会自动触发压缩,但模型本身对这个过程毫不知情,还在按照原来的方式工作。 理想状态是模型自己感知到"我快到上限了",并主动调整策略。 他认为视频模型在这方面的探索某种程度上比语言模型更超前,因为视频的长时程问题更紧迫,研究者被迫更早面对这个问题。 他还提到一个更激进的想法:如果把智能体框架的代码直接放进模型的上下文,让模型能够修改自己的运行规则,比如决定"读长文档时我要分块处理还是只读前 200 行",这种自我修改的智能体框架可能是一个值得探索的方向。 ## 职业轨迹:每一次转型都是主动押注 Ethan He 的职业路径本身也值得单独说一下。 十年前他在做 ResNet(残差网络,一种经典的图像识别模型架构)时代的图像识别和目标检测研究,同时做模型压缩。 他当时想当教授,已经有几篇顶会一作论文,自信地申请了顶校博士,结果全部被拒。 被迫进入工业界,反而让他在 Facebook FAIR(Meta 的 AI 研究院,由 Yann LeCun 领导)做了自监督学习,之后到英伟达做 Cosmos 和 MoE(混合专家模型)扩展,再到 xAI 做视频多模态。 他在英伟达的另一个重要工作是 Megatron MoE,这是第一个开源的、能够高效训练超大规模 MoE 模型的框架,支持从 1000 亿参数到万亿参数的训练,MFU(模型浮点利用率,衡量 GPU 利用效率的指标)达到约 40%。 他的结论是:在机器学习内部切换方向,比大多数人想象的容易。训练大模型的核心原则是通用的,换个方向并不需要从零开始。 很多人觉得"我做计算机视觉,就只能做计算机视觉",但他的经验证明这个边界没有那么硬。 ## 关键判断汇总 ## 局限性和没说清楚的地方 这篇访谈有几个地方值得注意: 信息不对称: Ethan He 在涉及 Grok Imagine 具体架构时多次说"不方便评论",比如它是否是端到端扩散模型还是语言模型加扩散头的组合。这意味着一些关键技术细节仍然不透明。 成本估算是粗略的: 他的存储和训练成本计算是信封背面的估算,实际情况会因数据规模、训练次数、基础设施选择而有很大差异。 "语言模型是瓶颈"这个判断有边界: 他承认扩散模型本身的改进仍然重要,只是说在当前阶段,语言模型的改进带来的增益更大。这不等于视频模型架构研究没有价值。 世界模型的定义是他个人的: 他在访谈开头就声明,世界模型有很多定义,他只是分享自己的视角,不打算辩论谁对谁错。 本文根据 Latent Space 播客对 Ethan He 的访谈整理重写。 Ethan He 曾任英伟达 Cosmos 视频基础模型核心作者,xAI Grok Imagine 视频多模态负责人。
译xAI前视频多模态负责人Ethan He在离职转向语言模型研究时表示,视频模型最大的瓶颈是语言模型。他曾在NVIDIA参与Cosmos模型开发,并在加入xAI后三个月内从零搭建出Grok Imagine 0.9。他指出训练视频模型成本高昂,例如存储10亿个视频需5PB,仅AWS月费就达数百万人民币。视频模型需先预训练图像模型,再通过VLM生成合成字幕以解决数据对齐问题。当前模型在生成长视频时上下文容易爆炸,而他认为扩散模型对文本的理解过于字面化,对语言意图的深层理解才是突破关键。
Kim受邀首次参加微软Build,参观GitHub HQ、参与多场会议并见到Satya Nadella,认为远超预期。微软发布7个新AI模型(定位中端、约Sonnet级别、价格亲民),新Surface Laptop Ultra配新芯片对标MacBook Pro,展示Project Solaris和智能体手持设备等实验项目,推出改版Copilot应用,企业版新增智能体功能及新量子芯片。作者认为微软正认真听取反馈,在各个方向推动变革。
The VFX industry is cooked
Watch this video. Now imagine this swarm, controlled by AI agents, with an explosive on each drone. It's Biblical.
I don't believe any company accidentally spent $500 million on Claude in a month. The number is an order of magnitude to...
World Labs团队与李飞飞发文,梳理“世界模型”这一被滥用的术语。对比语言模型学习文本统计,世界模型学习空间与时间统计(如光照、物理规律)。基于部分可观马尔可夫决策过程(POMDP)框架,智能体通过动作影响世界状态,观测是部分视图。当前被称为“世界模型”的不同系统本质上是同一循环的不同投影:第一类为渲染器,输出给人眼看的像素,以视觉保真度为核心。文章着重于概念分层,未给出具体模型名、参数或基准分数。
Google 推出 Gemma 4 12B(Apache 2.0),采用无独立视觉编码器的统一多模态架构。仅用 35M 参数的轻量嵌入器,将图像切为 48×48 块、音频(16kHz 原始波形)切为 40ms 帧,直接作为 token 输入 Transformer。M4 Max 上 4-bit 量化识图延迟 1.2-1.5 秒,官方称 16GB 内存可用,但社区指出高分辨率多图会压线。该设计暗示:当基座模型足够大,专用子模块不再是必需,未来一个微调好的统一模型可能取代传统拼装 Whisper、LLaVa 等多模态 pipeline。
Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to y...
We also asked forecasters to predict the longest 80% success time horizon achieved by the end of 2026. All three groups ...
been asking others at Anthropic how they stay in the loop with Claude and fully understand the work being done this is o...
StepFun Step 3.7 Flash smashed DeepSeek V4-Flash in a physics contest We gave two open-weight models the same task: writ...
DAIR.AI的Elvis Saravia将微软SkillOpt论文集成到智能体编排器中后,所有智能体技能获得测试框架与自我演化机制。应用于多模态论文图表提取技能时,质量评分从0.73提升至0.93(+20点),提取结果显著改善。Saravia认为这是自我改进AI的早期范例,该思路可扩展至智能体模式优化、工具使用、上下文工程、智能体搜索及工作流评估等环节。他已基于SkillOpt启动多项后续实验。
GPT 5.5 Pro 调研生成了一份 Codex 的 Goal 指令使用文档,分享两个技巧:1. 写不好 Goal 时先用 /plan 模式,让 AI 反问用户来完善命令,提示词模板为 `/plan Help me turn this vague task into a strong Codex goal...`;2. 写好 Goal 的六要素:结果、验证、约束、边界、迭代策略、阻塞条件。官方标准模板为 `/goal [Outcome]. Verification: [...] Constraints: [...] Boundaries: [...] Iteration policy: [...] Stop when: [...] Pause if: [...]`。详细报告含多个可直接使用的模板。
Lee Robinson 认为该说法是过度简化的播客话术。现实更复杂:即便大量“技术专家”存在,仍需要有人百分百专注产品或设计;AI 虽让生成代码变易,但缺乏优秀工程师会导致灾难。硅谷常把创业公司经验套用于大公司,却难以颠覆内部政治、遗留系统等极度人性化的部分。他判断 AI 颠覆知识工作需要数十年,因为本质是社会/组织问题,而非纯智力问题。
> Change the screen so it shows that she's on a facetime call
若多台电脑均安装 Codex 并登录同一 ChatGPT 账号,可在设置 -> 连接 -> 控制其他设备中添加其他电脑。之后本机创建项目时即可选择添加远程项目,例如远程控制家中电脑上的 Codex 进行代码编写。该功能无需额外配置,利用账号同步实现跨设备协作。
Introducing 5 Capafy e-commerce Skills. Behind each of these 5 Skills is an operator who has spent years on the e-commer...
Google Cloud营收同比增长63%,Microsoft Intelligence Cloud增长30%,AWS增长28%。但AWS利润率环比提升213bps,领先其他云服务商。AWS Bedrock与Anthropic采用Token-as-a-Service(TaaS)商业模式,包含三部分:固定IaaS费用、token收入分成、以及超额绩效支付(达到特定token/消费阈值触发额外付款)。该模式风险是无保底收入,但赌注成功,Anthropic单季度新增210亿美元净新ARR。
黄仁勋在COMPUTEX 2026上逛至技嘉展台,席地而坐与技嘉老总喝啤酒近10分钟,引来围观。技嘉股价当场被拉,期间已五连涨超20%。深层信号显示Nvidia供应链逻辑中技嘉地位加深。引用推文回顾:2009年Nvidia市值仅40亿美元(Intel 1000亿),黄仁勋押注CUDA和异构计算,17年后Nvidia市值5万亿,Intel约五千亿,25倍劣势变为近10倍反超,体现其远见与护城河。
同样站在 2009 年那个路口,有人只看见一块显卡, 有人看见了往后二十年整个计算的样子。 那年 Nvidia 市值 40 亿,是 Intel 的零头, 所有人都笑黄仁勋不过是个卖游戏配件的。 那时候 Nvidia 市值 40 亿,Inte...
Bookmarking tweets and not going back to them has become an epidemic
microsoft MAI tech report is a gold mine, one of the most transparent for a model at this scale. this model uses zero sy...
Claude Code 工程负责人 Fiona Fung 在 Code w/ Claude SF 2026 分享管理 AI-native 团队经验:写代码不再是瓶颈,验证、评审、安全与专业判断成为新限制。四个流程变化:规划从半年路线图转向短周期原型与反馈;上下文获取从“问谁写的”转为沉淀到代码/PR/日志;AI 处理常规代码评审,人负责法律/安全/业务判断;团队角色模糊但深度专业仍稀缺。组织上建议定期清理过时流程、默认使用 AI、管理者贴近一线。可跟踪新人首周交付真实代码、PR 周期变短、AI 辅助提交比例,但产出量不是成功本身。
关联讨论 3 条X:Ethan Mollick (@emollick)X:邵猛 (@shao__meng)Claude:Blog(网页)Wow. This is crazy. A developer trained an AI agent in simulation and deployed it onto a real robotic air hockey table u...
国内团队开源项目OpenSquilla用Python重写“小龙虾”,解决费token、不按规则执行及安全问题。它集成小模型对请求实时分类:简单任务走便宜模型,复杂任务走顶级模型。测试25个任务,纯Claude Opus 4.7成本6.2美金,OpenSquilla混跑Opus 4.7、GLM5.1、DS4 Flash成本仅0.68美金,分数几乎一样。同时,它根据对话语义只注入匹配度最高的Skill(原90+个),每轮省约9000 Token,100次对话累计省100万Token。
Codex和Claude Code的额度限制采用5小时滚动窗口,从用户发送第一条消息开始计时,用完需等待窗口结束才能重置。但窗口结束后系统不会自动开启新窗口,需等到下一条消息才重新计时。利用此机制,可在主要工作时段前3小时(如上午11点)提前发送一条消息激活窗口,使重置时间落在工作时段中间(如下午4点)。这样在2-6点的核心工作中,能享受两个5小时窗口,变相将额度翻倍。设置方法:Codex可在自动化中创建每日定时任务发送短消息;Claude CLI可通过crontab(Mac)或任务计划程序(Windows)实现。注意仍有周额度上限,适度使用即可。
xAI前视频多模态负责人Ethan He在离职转向语言模型研究时表示,视频模型最大的瓶颈是语言模型。他曾在NVIDIA参与Cosmos模型开发,并在加入xAI后三个月内从零搭建出Grok Imagine 0.9。他指出训练视频模型成本高昂,例如存储10亿个视频需5PB,仅AWS月费就达数百万人民币。视频模型需先预训练图像模型,再通过VLM生成合成字幕以解决数据对齐问题。当前模型在生成长视频时上下文容易爆炸,而他认为扩散模型对文本的理解过于字面化,对语言意图的深层理解才是突破关键。