AIHOT

全部动态X · 9321 条

全部一手资讯 X 论文

Chubby♨️@kimmonismus · 6月4日65

I took a "behind-the-scenes" tour at Microsoft today, where I was able to inspect the Surface Laptop Ultra firsthand and therefore was able to record those clips. The most obvious takeaway: Microsoft is now aiming to enter into direct competition with Apple and challenge the MacBook Pro. Needless to say, I wasn't able to conduct any real-world testing. However, the build quality, thermal management, the display, and- above all- the NVIDIA chip are certainly impressive. Whether it will truly manage to challenge Apple's MacBooks remains to be seen. But one thing is certain: Microsoft means business.

译微软推出全新Surface Laptop Ultra，定位创作者和AI笔记本，搭载NVIDIA新芯片（RTX GPU），最高提供1 petaflop AI算力、128GB统一内存。配备15英寸mini-LED PixelSense Ultra触摸屏（3:2比例，262 PPI，峰值2000尼特HDR亮度），厚度不足18mm。作者在幕后参观中亲手检测，认为做工、散热、显示屏和芯片令人印象深刻，微软明确将目标对准MacBook Pro，意在直接挑战苹果。

查看原推 ↗

Chubby♨️@kimmonismus · 6月4日71

Gemma 4 12B shipped today under the label "encoder-free." A local 12b model that shows really good results. I'm a big fan of Gemma Gemma 4 12B is out: a dense, fully open model (Apache 2.0) that runs on a 16GB laptop and does agentic reasoning, vision and audio at a quality Google puts near its 26B model. The reason a 12B can pull this off: Google removed the separate vision and audio encoders and feeds both straight into the model, which keeps the memory footprint small enough for consumer GPUs. For on-device assistants and private coding agents, that lowers the bar a lot. always look forward to the updates. 12b is a good sweet spot in terms of size. a few facts: Vision: the 550M encoder (27 transformer layers) is now a 35M embedder, one matmul on 48x48 pixel patches. Roughly 15x smaller. Audio: the 300M encoder (12 conformer layers) is gone. Raw 16kHz audio cut into 40ms frames, projected straight into the LLM. So encoding didn't vanish, it collapsed into the backbone. The payoff is real: one shared set of weights, so you LoRA-tune vision, audio and text in a single pass.

译Google 开源 Gemma 4 12B（密集参数，Apache 2.0 许可），采用全新无编码器架构：移除独立的视觉（550M 参数、27 层 Transformer）和音频（300M 参数、12 层 Conformer）编码器。视觉改为 35M 嵌入层（约缩小 15 倍），音频以 40ms 帧直接投影到大语言模型。模型在 16GB VRAM 笔记本上即可运行智能体推理、视觉和音频任务，性能接近 26B 参数模型。共享权重支持一次 LoRA 调优覆盖视觉、音频和文本。

查看原推 ↗

Fei-Fei Li@drfeifei · 6月4日78

http://x.com/i/article/2062244283940544512 # A Functional Taxonomy of World Models > “The world is everything that is the case.” — Ludwig Wittgenstein, Tractatus Logico-Philosophicus, 1921 ## The world is not made of words. In an earlier essay, we argued that spatial intelligence is AI’s next frontier and that world models are the path to it. Here, the World Labs team and I want to go one level deeper: of the many things now being built and called ‘world models,’ which functional pieces actually compose that capacity — and what is each one for? Language models have given machines an extraordinary command of concepts, vocabulary, and reasoning, but the physical world, virtual or real, runs on a different substrate. Where language models learn the statistical structure of text, world models learn the statistical structure of space and time: how light falls on a surface, how a garden looks from an angle no camera has captured, how objects respond to force and follow the laws of physics. That makes “world model” one of the most important and most overloaded terms in AI today. Computer vision, robotics, reinforcement learning, and generative AI each claim to be building world models, and each means something quite different. A video model that produces gorgeous but physically impossible flames, a language model improvising a playable game, and a physics engine that faithfully simulates combustion all go by the same name. The ancient Greeks could never agree on what the world was made of, whether fire, water, or indivisible atoms, because “world” was never a single thing. It was always a stand-in for whatever totality a given thinker needed to reason about. AI has inherited the same problem, at exactly the moment when the field needs precision. ## The loop beneath the taxonomy Cutting through that confusion starts with a diagram older than any of the technology in question. Reinforcement learning textbooks, including the canonical Sutton and Barto, have used a version of the same picture for decades to describe how an agent interacts with a world. The formal name for this picture is the partially observable Markov decision process, or POMDP, and the original definition of the term “world model” belongs to that tradition. An agent, which can be a person, a robot, or a software system, takes actions. Those actions affect the state of the world. The agent never sees the state directly. What reaches the agent are observations: the photons that fall on a retina, the readings from a sensor, and the pixels in a video frame. New observations inform new actions, and the loop continues. The word “state” needs unpacking, because the meaning shifts from field to field. This is not the chemist’s state, the difference between solid, liquid, and gas. This is the physicist’s and roboticist’s state: a complete description of what is happening in the world at a given moment, including every object, every position, every velocity, every property. State is the underlying reality of the world; complete in principle, but never directly visible to any agent inside it. Observations are an agent’s partial view of that reality. Actions are what the agent does in response. This loop — agent to action to state to observation and back — is the structure that gave the modern term “world model” its technical meaning. The phrase itself is older, traced to Kenneth Craik’s 1943 proposal that minds reason by running “small-scale models” of reality, and carried into neural networks by the late 1980s and early 1990s. And the loop also explains what people mean by the term today. The different things now being called world models are in fact different projections of this same loop. Each one outputs a different piece of it. ## Three functions of a world model The first kind of world model is a renderer. A renderer outputs observations in the form of pixels meant for human eyes, and the quality that matters most is visual fidelity. A video model that turns a text prompt into a cinematic drone shot is a renderer. So is an interactive system like Google’s Genie 3, or World Labs’ own RTFM, where the model generates frames in real time conditioned on user input. The model carries no explicit understanding of three-dimensional structure. It produces what a viewer would see, not what is. The buildings in the drone shot may look flawless from above, but try to drive through the city below and they fall apart. The second kind is a simulator. A simulator outputs state: a geometrically, physically or dynamically faithful representation of the world that humans and computer programs can both compute on and interact with. Where the renderer’s contract is purely visual, the simulator’s contract is structural, demanding geometry that holds up under inspection, physics that respects Newton’s laws, and dynamics that behave the way the world needs to behave given the laws of physics. A simulator serves two consumers at once. Human professionals such as architects, designers, filmmakers, and game developers need accuracy beyond visual plausibility. Computer programs such as reinforcement learning agents, robot controllers, and autonomous vehicles use simulators as training grounds where they can interact with the world at scale, testing scenarios that would be dangerous, expensive, or impossible to run in reality. The third kind is a planner. A planner outputs actions. Given an observation and a goal, a planner answers the question of what the agent should do next. This is, in many ways, the inverse of the renderer. Where a renderer takes actions as input and produces observations, a planner takes observations as input and produces actions, closing the perception-action loop. Vision-Language-Action models, model-based systems, and the new wave of World Action Models are all attempts at planners: systems that can decide what a robot should do in an unstructured world. These three categories describe most of what is actually shipping today, and the distinction between them is useful in practice. The categories are not, however, fundamentally separate. The same underlying knowledge of how the world works—geometry, physics, dynamics—sits beneath all of them. A model that can render a cup from any angle ought, in principle, to be able to simulate what happens when the cup is pushed and plan a hand to pick the cup up. Increasingly, the most interesting research deliberately blurs the boundaries between the three. ## Why simulation is the linchpin Of the three categories, the simulator gets the least public attention, and is the most consequential of the three. This essay addresses this asymmetry. The renderer is by far the most commercially mature. A number of image- or text-to-video products are expanding in the consumer or enterprise markets rapidly. Google’s Nano Banana model has put renderer-quality image generation in the hands of potentially hundreds of millions of users. The technology is real, and the markets are real. Yet renderers optimize for visual plausibility rather than physical accuracy, and that ceiling matters. Their outputs are beautiful, but they cannot be trusted to design a building or train a robot. The planner is the most intriguing and the most nascent, closely connected to the rapidly evolving field of robotic learning. The field has produced robotic demos in the last two years that look impressive in videos, but candor is required about what those demos actually show. Almost all have been confined to heavily constrained laboratory setups, with narrow object sets and short task horizons. None have been validated at the complexity, variability, or duration that real-world deployment demands. The gap between a compelling demo reel and a robot that reliably works in a kitchen, a warehouse, or an operating room remains vast. The commercial bets are nonetheless substantial. A wave of well-funded entrants is racing to ship general-purpose planning systems, while the largest infrastructure players are positioning planning atop broader simulation stacks. A robot that can plan is a robot that can work, and the entire industry is racing to be the one that gets there first. Simulation is the bridge between the two. If language is an abstraction of the world and pixels are a projection of it, then geometry, physics, and dynamics are the world itself. A simulator must work at that level: the structural backbone from which both visual appearance (for renderers) and action consequences (for planners) can be derived. A model that masters simulation can project its understanding into pixels for human consumption, and into action predictions for embodied agents. A model that masters only rendering, or only planning, cannot do either. The commercial surface area is enormous. NVIDIA’s Omniverse alone targets what the company estimates as more than a trillion dollars of addressable market in factories, warehouses, supply chains, and digital twins. Robotics training, autonomous vehicle testing, architectural visualization, engineering, and drug discovery all depend on something simulation-shaped. The hardest open problems in the field live there too. Three-dimensional data with explicit geometry, material properties, and physical annotations is orders of magnitude scarcer than the internet video that renderers train on. The sim-to-real gap, which is the difference between how things behave in simulation and how they behave in reality, persists. Generative simulators introduce a new risk on top of that: AI-generated geometry can look correct while containing self-intersections or wrong scale that produce nonsensical physics. Multi-physics simulation at scale, where rigid bodies, deformable objects, fluids, and cloth all interact, remains orders of magnitude more expensive than single-domain simulation. At World Labs, Marble is our first move into this territory. It takes multimodal prompts (text, image, video, or spatial sketch) and generates explorable 3D environments, outputting Gaussian splats for visual exploration alongside collision meshes a physics engine can operate on. But Marble is only the first chapter of a much longer arc being written across the field as the lines between rendering, simulation, and planning begin to collapse. ## Where the boundaries are collapsing and what comes next But more is to come. The most important pattern in the field right now is that the three categories are starting to blend into one another. The shared insight is that the knowledge required to render a world, simulate it, and act in it is largely the same. Continuing the earlier example, a model that truly understands how a cup sits on a table (its geometry, material properties, response to force, etc.) should be able to render that cup from any angle, simulate what happens when the cup is pushed, and plan for a hand to pick the cup up. The three categories are three projections of a single underlying understanding. For example: a small but growing number of recent work from various robotics labs have demonstrated that—at least conceptually—a pretrained video renderer can be used as the backbone for joint world-and-action prediction, suggesting a bridge between the renderer and the planner by letting one model imagine what will happen and what to do. World Labs’ Marble already outputs Gaussian splats and collision meshes from a single model, dissolving the boundary between the renderer and the simulator. Every level is moving from passive output to interactive system, with renderers becoming action-conditioned, simulators generating worlds that are more controllable and editable, and planners deliberating rather than just reacting. The logical endpoint is a unified world model: one foundation model that can render photorealistic views, produce physically accurate structure, and plan action sequences, switching between output modalities depending on what the downstream consumer needs. We will still face a number of daunting challenges. The data picture is uneven, with renderers awash in internet video while simulators and planners face acute shortages of 3D assets and robot demonstrations. Optimizing for visual beauty can sacrifice the precision a robot or a high-fidelity simulation needs. Reconciling these tensions inside a single architecture is the defining open problem in world model research today, and this is what World Labs sets out to do as we continue to evolve Marble. The direction, however, is clear. The same bet the field has been making since the late 1980s — that a sufficiently rich model of the world is all that any agent needs to see worlds, build them, and act in them — is the bet now driving an entire generation of research. What gives that “big bet” weight is the convergence already underway: three threads, each already driving and shaping multi-billion-dollar industries on its own, that began as separate research programs are starting to behave like one. Taken together, as the boundaries between them collapse, they will reshape something larger: the relationship between machine intelligence and the physical world it inhabits - the long arc of spatial intelligence. Language gave machines a way to talk about that world. World models are how machines will finally come to understand, imagine, reason and interact with it.

译World Labs团队与李飞飞发文，梳理“世界模型”这一被滥用的术语。对比语言模型学习文本统计，世界模型学习空间与时间统计（如光照、物理规律）。基于部分可观马尔可夫决策过程（POMDP）框架，智能体通过动作影响世界状态，观测是部分视图。当前被称为“世界模型”的不同系统本质上是同一循环的不同投影：第一类为渲染器，输出给人眼看的像素，以视觉保真度为核心。文章着重于概念分层，未给出具体模型名、参数或基准分数。

查看原推 ↗

OpenAI@OpenAI · 6月4日28

It's time to fly.

译是时候起飞了。

查看原推 ↗

DogeDesigner@cb_doge · 6月4日78

SpaceXAI is cooking.

译Grok Imagine 1.5 预览版已发布，即日起可在 API 中体验。SpaceXAI 正在发力。

查看原推 ↗

Anthropic@AnthropicAI · 6月4日64

How well do the security community's techniques hold up against AI-enabled cyberattacks? We examined 832 malicious accounts and mapped their activity onto a longstanding database of tactics and techniques used by threat actors. Here's what we learned:https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack

译安全社区的技术在应对AI驱动的网络攻击方面表现如何？我们检查了832个恶意账户，并将其活动映射到一个长期存在的威胁行为者战术和技术数据库。以下是我们学到的：https://www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack

查看原推 ↗

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 6月4日38

How the hell are journalists not all over this? The OpenAI/a16z Super Pac just got caught running a false flag operation TLDR: they're trying to discredit AI safety advocates, so they operate sockpuppet accounts that CALL FOR VIOLENCE

译AI安全倡导者账号指控，OpenAI与a16z支持的超级政治行动委员会（Super Pac）被曝开展虚假旗号行动：运营“傀儡账号”直接呼吁暴力，试图污名化AI安全阵营。引用推文显示，在将针对Sam Altman的暴力归咎于悲观言论后不到两周，@NathanLeamerDC的Build American AI似乎曾资助同一账号@jonathandoomer，该账号针对AI警告发布了暴力帖子。

查看原推 ↗

Demis Hassabis@demishassabis · 6月4日74

Celebrating the milestone of a massive 150+ million downloads of Gemma 4 with the release of the new Gemma 4 12B model! It's incredibly powerful for such a small model and it’s tiny enough to run locally on a laptop with just 16GB VRAM. Apache 2.0 license - happy building!

译Demis Hassabis 宣布 Gemma 4 系列下载量突破 1.5 亿，并正式发布新版 Gemma 4 12B 模型。该模型是一个统一的、无编码器的多模态模型，兼具边缘端效率与高级推理能力。尽管参数规模仅为 12B，但性能强劲，且足够小巧，可在仅需 16GB VRAM 的笔记本上本地运行。采用 Apache 2.0 开源许可证，方便开发者自由构建。

查看原推 ↗

AYi@AYi_AInotes · 6月4日65

150M 的活，35M 干了， Google 新出的 Gemma 4 12B，把多模态里那个最重的零件，视觉编码器，从 150M-550M 直接压到 35M了，过去做多模态，套路是固定的，图片先扔给一个专门的视觉编码器翻译成模型能懂的语言，再交给大模型理解，就像配了个翻译官。这个翻译官，传统 ViT 编码器要 150M 到 550M 参数。 Gemma 4 12B 直接把翻译官辞了，只留一个 35M 的轻量嵌入器，把图片切成 48×48 的小块，当成 token 直接扔进去，让 Transformer 自己学着看世界，音频也一样，16kHz 原始波形切成 40ms 一帧，直接喂进同一个模型。也就是说，图片、声音、文字，第一次被当成同一种东西。为什么敢这么干，因为它赌的是一件事，当基座模型大到某个临界点，那些专门的子模块，就不再是必需品了。这个剧本你可能见过，当年 ViT 取代 CNN，也是同一个套路，规模够大的时候，与其手工设计一堆专用结构，不如把活儿直接交给一个统一的大模型自己学。现在这套逻辑，正从视觉单模态，蔓延到整个多模态架构。而且 12B 这个尺寸不是随便选的，刚好大到能扔掉编码器，又刚好小到能塞进 16GB 的笔记本里，据 aaryan_kakad 在 M4 Max 上的实测，4-bit 量化下识图延迟 1.2 到 1.5 秒，官方说 16GB 够用，社区的说法更实在，能跑，但高分辨率多图会压线。但这条新闻真正值得琢磨的，不是它能跑在你的笔记本上，是它意味着什么，过去做一个多模态应用，你得拼装 Whisper 转录、LLaVa 看图、再接一个 LLM，像攒一台机器，每个零件都得你自己调好接口、对齐、调试。如果 encoder-free 这条路走通，未来一个微调好的统一模型，可能就把这一整条流水线吃掉了。那一刻贬值的，不是某个工具，是你过去攒那台机器、拼那条 pipeline 攒下的全部手艺。模型不是在帮你省一个零件，是在悄悄重写哪种手艺还值钱。

译Google 推出 Gemma 4 12B（Apache 2.0），采用无独立视觉编码器的统一多模态架构。仅用 35M 参数的轻量嵌入器，将图像切为 48×48 块、音频（16kHz 原始波形）切为 40ms 帧，直接作为 token 输入 Transformer。M4 Max 上 4-bit 量化识图延迟 1.2-1.5 秒，官方称 16GB 内存可用，但社区指出高分辨率多图会压线。该设计暗示：当基座模型足够大，专用子模块不再是必需，未来一个微调好的统一模型可能取代传统拼装 Whisper、LLaVa 等多模态 pipeline。

查看原推 ↗

AYi@AYi_AInotes · 6月4日70

世界最好的开源图像模型，仅次于GPT－image-2和Nanobanana2

译世界最好的开源图像模型，仅次于GPT-image-2和Nanobanana2

查看原推 ↗

Ethan Mollick@emollick · 6月4日68

In early May, the best superforecasters predicted that, by the end of the year, the longest METR 80% task horizons would reach 3-4 hours. In late May, Claude Mythos achieved that number.

译5月初，顶级超级预测者预计2026年底前最长METR 80%任务时间范围可达3-4小时。然而5月底，Anthropic的Claude Mythos模型在METR基准预览中即以80%成功率达到3小时6分钟，直接落在专家和超级预测者对2026年底的中位数预测范围内（3-4小时）。此前基线为1.5小时。此次突破表明AI能力进展速度远超预期。

查看原推 ↗

OpenCode@opencode · 6月4日59

Qwen3.7 Plus now available in Go text · image · 1M context cheaper than 3.6

译Qwen3.7 Plus 现已在 Go 中可用，支持文本和图像，1M 上下文，比 3.6 更便宜。

查看原推 ↗

Artificial Analysis@ArtificialAnlys · 6月4日71

Jensen Huang’s keynote at Computex used Artificial Analysis benchmarks to communicate the performance of Nemotron 3 Ultra Jensen used our Artificial Analysis Intelligence Index vs. Output Speed chart to communicate the performance of NVIDIA’s new Nemotron 3 Ultra model. The presentation also highlighted GDPval-AA, Artificial Analysis' benchmark that uses OpenAI's GDPval dataset to evaluate models on economically valuable tasks NVIDIA additionally highlighted Artificial Analysis Text to Image and Image to Video Arena Elos to promote the NVIDIA Cosmos 3 model family. Congratulations @NVIDIAAI on the launches!

译Jensen Huang 在 Computex 主题演讲中引用 Artificial Analysis 的 Intelligence Index vs. Output Speed 图表，介绍 NVIDIA 新模型 Nemotron 3 Ultra 的性能。演讲还提及 GDPval-AA——Artificial Analysis 基于 OpenAI 的 GDPval 数据集评估模型在经济价值任务上的基准。NVIDIA 同时用 Artificial Analysis 的文生图和图生视频 Arena Elo 评分推广 Cosmos 3 模型族。

查看原推 ↗

Krea@krea_ai · 6月4日74

introducing Ideogram v4.0. 2k native resolution, excellent text rendering, and support for JSON prompts. try it now in Krea.

译介绍 Ideogram v4.0。原生 2K 分辨率，出色的文字渲染，支持 JSON 提示词。立即在 Krea 中体验。

查看原推 ↗

elvis@omarsar0 · 6月4日76

Another banger open-source release. Miso One is an 8B text-to-speech model with real emotional range, so voiceovers carry warmth, hesitation, and excitement instead of sounding flat. It's purpose-built for voiceover work like shorts, podcasts, and educational content, and it runs at 110ms latency, which is faster than human reaction time. The best part is that the weights are fully open source, so you can clone the repo, self-host, fine-tune, and keep your data private. Worth checking out if you're building voice into your tools and products: http://github.com/MisoLabsAI/MisoTTS

译Miso Labs 开源 8B 参数文本转语音模型 Miso One，专注于生成富有情感的表达，如温暖、犹豫或兴奋，告别机械音。模型专为短视频、播客和教育内容等旁白场景设计，推理延迟仅 110 毫秒，快于人类反应时间。模型权重完全开源，支持自托管、微调和数据私有化，API 即将开放。

查看原推 ↗

Yuchen Jin@Yuchenj_UW · 6月4日63

More and more engineers are now burning more money on AI tokens than their base salaries. Tech companies are facing a brutal dilemma: > let everyone tokenmaxx and move at AI speed > add token budgets and kill the vibe > lay off 50% of people and give the rest unlimited tokens

译越来越多的工程师现在在AI token上花费的钱比他们的基本工资还要多。科技公司面临一个残酷的两难选择： > 让每个人尽情使用token并以AI速度前进 > 增加token预算并扼杀氛围 > 裁掉50%的人，给剩下的人无限token

查看原推 ↗

StepFun@StepFun_ai · 6月4日56

Deploy Step 3.7 Flash on @modal with SGLang 🚀 Modal is a serverless AI platform for deploying and scaling compute-intensive workloads without managing infrastructure. Their new guide shows how to serve our open-weight Step 3.7 Flash with SGLang on Modal, using 8×H100 GPUs, Modal Volumes, and an OpenAI-compatible chat completions endpoint. Excited to collaborate with Modal to make StepFun models more accessible to builders. https://modal.com/docs/examples/stepfun_inference

译在 @modal 上用 SGLang 部署 Step 3.7 Flash 🚀 Modal 是一个无服务器 AI 平台，用于部署和扩展计算密集型工作负载，无需管理基础设施。他们的新指南展示了如何在 Modal 上使用 SGLang 服务我们的开源权重 Step 3.7 Flash，采用 8×H100 GPU、Modal Volumes 以及兼容 OpenAI 的聊天补全端点。很高兴与 Modal 合作，让 StepFun 模型更易于构建者使用。 https://modal.com/docs/examples/stepfun_inference

查看原推 ↗

Perplexity@perplexity_ai · 6月4日56

Perplexity Computer is for growing businesses. Computer connects to 400+ tools for every type of company, including Intuit QuickBooks, Vercel, Shopify, Canva, and more. Learn more about how people are using Computer for their business: https://www.perplexity.ai/enterprise/use-cases/growing-businesses

译Perplexity Computer 适用于成长型企业。它可连接超过400种工具，涵盖各类公司需求，包括Intuit QuickBooks、Vercel、Shopify、Canva等。了解更多关于企业如何使用Computer进行业务操作： https://www.perplexity.ai/enterprise/use-cases/growing-businesses

查看原推 ↗

Chubby♨️@kimmonismus · 6月4日17

This is probably GPT-5.6. Either tomorrow or coming week i suppose. Get ready friends. We are in for a wild ride!

译这大概是 GPT-5.6。要么明天，要么下周，我想。朋友们，准备好了。我们即将迎来一场狂野之旅！

查看原推 ↗

Rohan Paul@rohanpaul_ai · 6月4日59

This feels like the natural next step for AI agents. One prompt for the whole email workflow with MCP-backed Claude controlling it. Nitrosend just launched an AI-native email platform that lets Claude build, design, segment, and send complete email campaigns from a single prompt. It connects through MCP, so Claude can act on the email system directly instead of only writing copy that a human must paste into Mailchimp, Klaviyo, or another builder. The key point is agency: Claude is not producing a draft, it is controlling the workflow across design, logic, contact targeting, and delivery. Some example - a user can ask for a newsletter, onboarding flow, or transactional email set, and Nitrosend generates responsive, dark-mode-ready, editable email markup with the sending stack already attached.

译Nitrosend 推出 AI 原生邮件平台，通过 MCP 协议与 Claude 连接。用户只需一条提示词，Claude 即可完成构建、设计、受众分组和发送完整邮件活动，而非仅生成草稿。该平台无传统仪表盘，Claude 直接控制系统工作流，包括设计、逻辑、目标定位和投递。引用推文显示，已有用户通过一条提示词成功向 10,000 人发送发布公告。

查看原推 ↗

xAI@xai · 6月4日70

Try the most natural TTS and cost-effective STT APIs in @Vapi_AI

译试试 @Vapi_AI 上最自然的TTS和性价比最高的STT API。来自 @xai 的Grok STT和Grok TTS现已在企业语音AI平台Vapi上线。基于Vapi构建自定义语音智能体，可让它们用客户的语言交流、在受监管的工作流中捕捉重要细节，并在每次通话中明显更具人性化。

查看原推 ↗

Thariq@trq212 · 6月4日25

If this prompt feels well written to you, it's because Suzanne is a writer in her little spare time! You can read her short story, Mall of America here: https://suzannewang.com/mall-of-america It's one of my favorite short stories about the human condition that happens to involve AI.

译如果这个提示词让你觉得写得很好，那是因为Suzanne在业余时间是一名作家！你可以在这里阅读她的短篇小说《Mall of America》：https://suzannewang.com/mall-of-america 这是我最喜欢的关于人类境况且恰好涉及AI的短篇小说之一。

查看原推 ↗

Josh Woodward@joshwoodward · 6月4日44

A short backstory on this one: A small Google Labs team had an idea to make an app designed to connect you with what matters, without the endless scroll. "Hope scrolling, not doom scrolling" was the hallway pitch. "Go for it." And today, that little experiment is rolling out. Meet Dreambeans, a daily dose of inspiration, brewed fresh for you. We're excited to see what you think!

译Google Labs 发布实验性移动应用 Dreambeans。该应用利用 Personal Intelligence 连接用户 Google 应用，每天推送个性化故事集合，帮助用户发现可能错过的内容，并聚焦真正重要的事。团队将其理念描述为“希望滚动，而非末日滚动”。当前仅限美国符合条件的 Google AI Ultra 用户（18 岁以上）使用，同时开放公开等待名单。

查看原推 ↗

郭明錤｜Ming-Chi Kuo@mingchikuo · 6月4日65

1. 我大約一年前做的這張 Apple 的 XR 頭戴裝置與智慧眼鏡之規劃路線（roadmap）沒什麼參考價值了，目前只剩兩個智慧眼鏡裝置有能見度。 2. 規劃路線大改是由 Apple 的下一任 CEO John Ternus 拍板定案（其實已經改變一段時間，只是我沒即時更新），我認為移除 Vision Pro 系列、並將資源轉向具有更廣大消費潛力的智慧眼鏡類產品是正確決定。 3. 最新的供應鏈調查指出，Apple 具有顯示功能的 AR / XR 智慧眼鏡（採用光波導）將延後到 2029 年。沒有顯示功能的 AI 眼鏡（類似 Ray-Ban Meta）預計還是在 2027 年推出。

译苹果分析师郭明錤更新预测：此前规划的XR头戴装置路线图已作废，目前仅两款智能眼镜设备有能见度。路线图大改由下一任CEO John Ternus拍板，Vision Pro系列被移除，资源转向智能眼镜。最新供应链调查显示，具有显示功能的AR/XR智能眼镜（光波导）推迟至2029年，无显示功能的AI眼镜（类似Ray-Ban Meta）仍预计2027年推出。郭明錤认为智能眼镜将带动下一波消费电子趋势。

查看原推 ↗

郭明錤｜Ming-Chi Kuo@mingchikuo · 6月4日63

1. The Apple XR headset and smart glasses roadmap I put together about a year ago is no longer a useful reference. For now, only two smart glasses products remain visible in the roadmap. 2. The major overhaul was signed off by Apple's next CEO, John Ternus. This shift actually happened a while back. I'm just late updating the chart. I think removing the Vision Pro line was the right call, as Apple shifts resources toward smart glasses with greater mass-market potential. 3. My latest supply chain checks suggest Apple’s display-equipped AR/XR smart glasses device, powered by optical waveguides, has slipped to 2029. The display-less AI glasses, similar to Ray-Ban Meta, are still expected to ship in 2027.

译郭明錤更新苹果XR头显与智能眼镜路线图，原先版本已失效。目前仅剩两款智能眼镜产品在规划中，主要调整由苹果下任CEO John Ternus批准，取消了Vision Pro产品线，将资源转向更具大众市场潜力的智能眼镜。最新供应链调查显示，配备光学波导显示屏的AR/XR智能眼镜设备推迟至2029年；不带显示屏的AI眼镜（类似Ray-Ban Meta）预计2027年出货。

查看原推 ↗

Replit ⠕@Replit · 6月4日67

You shipped your app. Now what? Your app may look great, but if no one can find it, it stays invisible Publishing is only the beginning Meet SEO Agent. It runs a scan for you and suggests fixes to help your app get discovered in web & AI search

译你发布了你的应用。然后呢？你的应用可能看起来很棒，但如果没人能找到它，它就依然不可见。发布只是开始。认识一下SEO Agent。它会为你运行一次扫描，并建议修复措施，帮助你的应用在网页搜索和AI搜索中被发现。

查看原推 ↗

🚨 AI News | TestingCatalog@testingcatalog · 6月4日74

Ideogram announced Ideogram 4.0, a new SOTA open image generation model! > Ideogram 4.0 lands in the 8th spot on LM Arena and the 5th spot on Design Arena in the text-to-image category, and is getting close to Nano Banana Pro's performance. > Ideogram 4.0 features dense, accurate text rendering, native 2K resolution, active background transparency, and precise layout control.

译Ideogram 4.0 开源图像生成模型发布，在 LM Arena 文生图类别排名第 8，Design Arena 第 5，评分 1204，成为该领域排名最高的开放模型，性能接近 Nano Banana Pro。主要特性包括密集准确的文本渲染、原生 2K 分辨率、活动背景透明度及精确布局控制。

查看原推 ↗

🚨 AI News | TestingCatalog@testingcatalog · 6月4日46

GOOGLE 🔥: A new Dreambeans experiment is now available in Google Labs for US-based Google AI Ultra users on the waitlist. This experiment uses Personal Intelligence to deliver daily stories based on the user's data context. Not a testing time for the most 👀

译GOOGLE 🔥: 一项新的 Dreambeans 实验现已于 Google Labs 上线，面向美国地区的 Google AI Ultra 用户（需加入候补名单）。该实验利用个人智能，根据用户的数据上下文提供每日故事。对大多数人来说，这并非测试时间👀

查看原推 ↗

Chubby♨️@kimmonismus · 6月4日75

Miso One is live: an open-weights voice model built to sound like a real person reading, with actual warmth and pacing where most TTS still goes flat. 8B params, free on GitHub, with one-shot voice cloning from a short sample at 110ms latency. Self-host it and your audio data never leaves your machine. No API needed, no lock-in. Type any line into the demo and hear it before you clone the repo.

译Miso One 正式发布，一个 8B 参数的开源权重语音模型（TTS），旨在模拟真实人类朗读的温暖与节奏。它支持一次语音克隆（只需短样本），推理延迟仅 110ms。模型权重已开源至 GitHub，无需 API 即可自托管，音频数据不离开本地。API 访问即将推出。演示已上线，可先试听再克隆仓库。

查看原推 ↗

Ethan Mollick@emollick · 6月4日60

Most people, including really accomplished people, don't have an accurate mental model of how LLMs operate (and why would they?) You see this in wide beliefs that AI is just copying from known sources, or that it only produces average answers, or that it can't generate new ideas

译大多数人，包括非常有成就的人，对LLM的运作方式没有准确的认知（他们凭什么有呢？）你可以从广泛的观念中看到这一点：认为AI只是从已知来源复制，或者它只能产生平均水平的答案，或者它不能产生新想法。

查看原推 ↗

StepFun@StepFun_ai · 6月4日44

Great demo by @atomic_chat_hq. Step 3.7 Flash was designed for real-world agentic coding tasks — not just generating code fast, but keeping logic, visuals, and execution coherent across complex outputs. Love seeing builders test it in creative ways!

译阶跃星辰（StepFun）称其 Step 3.7 Flash 在与 DeepSeek V4-Flash 的物理编程测试中全面胜出。测试要求在不使用库的情况下，生成一个包含高尔顿板、旋转六边形弹球和同步节拍器三个场景的自包含 HTML5 canvas 动画，并实现真实物理。Step 3.7 Flash 输出 59.6k tokens（耗时 9分57秒），DeepSeek V4-Flash 输出 52.5k tokens（耗时 6分21秒）。尽管 DeepSeek 更快，但 StepFun 模型在物理模拟、视觉效果和逻辑渲染上均占优。主推文指出 Step 3.7 Flash 专为真实世界 agentic 编码任务设计，能保持复杂输出中逻辑、视觉和执行的一致性。

查看原推 ↗

Microsoft Research@MSFTResearch · 6月4日62

A three‑month pilot in a Midwestern bottling plant shows what happens when AI moves beyond chat and into decision-making, where constraints shift, stakes are real, and answers must hold. https://msft.it/6015vjYUN

译一份在中西部装瓶厂进行的三个月试点显示，当AI超越聊天进入决策领域时会发生什么——约束条件变化、风险真实、答案必须可靠。 https://msft.it/6015vjYUN

查看原推 ↗

🚨 AI News | TestingCatalog@testingcatalog · 6月4日51

Perplexity Personal Computer is now available to Max and Enterprise Max users on Windows! Waitlist below 👀

译Perplexity Personal Computer 现面向 Max 和 Enterprise Max 用户开放 Windows 版本！等候名单如下 👀

查看原推 ↗

🚨 AI News | TestingCatalog@testingcatalog · 6月4日65

GOOGLE 🔥: A new Gemma 4 12B is now available on Huggingface under Apache 2.0 license! > Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. > This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution.

译Google 最新的 Gemma 4 12B 模型已上线 Hugging Face，采用 Apache 2.0 许可证。该模型与 Gemma 4 E2B/E4B 共享相同多模态能力，支持文本、音频、图像和视频输入，无需单独编码器即可实现原生音频和视觉理解。这种无编码器统一设计方案使其部署体积更小，非常适合消费级设备和本地执行环境。官方称其旨在弥合边缘效率与高级推理之间的差距。

查看原推 ↗

Chubby♨️@kimmonismus · 6月4日57

First hands-on with Microsoft’s new Surface Laptop Ultra. Microsoft is clearly positioning this as a new class of creator and AI laptop, powered by new NVIDIA silicon with an RTX GPU built for local AI, creative workflows, and gaming. A few standout specs: -New NVIDIA chip with RTX GPU -Up to 1 petaflop of AI compute -Up to 128GB unified memory -15-inch mini-LED PixelSense Ultra touchscreen -3:2 aspect ratio -262 PPI -Up to 2,000 nits peak HDR brightness -Less than 18mm thick

译首次上手微软新的 Surface Laptop Ultra。微软明确将其定位为面向创作者和 AI 的新品类笔记本电脑，由搭载 RTX GPU 的新 NVIDIA 芯片驱动，专为本地 AI、创意工作流和游戏打造。几个突出规格： - 带 RTX GPU 的新 NVIDIA 芯片 - 最高 1 petaflop AI 算力 - 最高 128GB 统一内存 - 15 英寸 mini-LED PixelSense Ultra 触摸屏 - 3:2 比例 - 262 PPI - 最高 2000 尼特峰值 HDR 亮度 - 厚度不足 18mm

查看原推 ↗

Google AI Developers@googleaidevs · 6月4日77

We’re launching Gemma 4 12B: Our unified, encoder-free model that brings powerful multimodal intelligence straight to your laptop 🚀 The model bridges the gap between our mobile E4B model and larger 26B MoE models, packaging frontier-class reasoning and native audio into a highly optimized footprint, all under a permissive Apache 2.0 license. Here’s what makes it unique: + Encoder-Less Architecture: We removed the multimodal encoders. The vision and audio inputs flow directly into the LLM backbone. + Agentic Performance (16GB VRAM): Run complex, multi-step workflows locally, with performance nearing our 26B model.

译Google发布Gemma 4 12B，一款无编码器的统一多模态模型，可直接将视觉和音频输入送入LLM主干，无需传统多模态编码器。该模型填补了移动端E4B模型与26B MoE模型之间的空白，封装前沿推理与原生音频能力，采用Apache 2.0许可。在16GB VRAM下即可本地运行复杂多步骤智能体工作流，性能接近26B模型。

查看原推 ↗

elvis@omarsar0 · 6月4日66

This SkillOpt paper from Microsoft is a must-read! (bookmark it) I was a bit skeptical of the results reported in the paper when I shared it a few days ago. However, I managed to integrate it into my agent orchestrator and ran a few experiments. The results are mindblowing. Essentially, all my agent skills now have a proper testing framework and a way to self-evolve. I have started to improve all my agent skills with this. One exciting result was when I applied it to my paper-figure-extraction skill, which requires an agent to do multimodal analysis. In particular, it improved quality by +20 points (0.73 → 0.93). I went to see the extracted tables and figures, and I was absolutely stunned by how much better my skill got at the task. Self-improving AI is in the early days, but I think this work is a clear example of the current ability of agents to self-improve. In this case, it was skills, but it's not hard to imagine how this scales to optimizing agent patterns, tool use, context engineering efforts, agentic search, workflows, evals, and even the harness itself. I already started with a few of these ideas inspired by SkillOpt. Stay tuned!

译DAIR.AI的Elvis Saravia将微软SkillOpt论文集成到智能体编排器中后，所有智能体技能获得测试框架与自我演化机制。应用于多模态论文图表提取技能时，质量评分从0.73提升至0.93（+20点），提取结果显著改善。Saravia认为这是自我改进AI的早期范例，该思路可扩展至智能体模式优化、工具使用、上下文工程、智能体搜索及工作流评估等环节。他已基于SkillOpt启动多项后续实验。

查看原推 ↗

Sundar Pichai@sundarpichai · 6月4日70

On Monday we announced an equity offering for Alphabet - part of our multi-year investment strategy to meet the AI opportunity ahead and support the demand we’re seeing from enterprises and consumers. Pleased to share the offering was well over-subscribed. We raised a total of ~$45B, with an additional $40B to come as part of an “at the market” program starting in Q3 (for a total of ~ $85B). A huge thank you to our investors, including Berkshire Hathaway who invested $10B.

译周一我们宣布了Alphabet的股权融资——这是我们多年投资策略的一部分，旨在抓住未来的AI机遇并支持我们看到的来自企业和消费者的需求。很高兴告诉大家，此次融资已大幅超额认购。我们共募集了约450亿美元，另将通过Q3启动的“按市价发行”计划再募集400亿美元（总计约850亿美元）。非常感谢我们的投资者，包括投资了100亿美元的伯克希尔·哈撒韦。

查看原推 ↗

Runway@runwayml · 6月4日73

Use Aleph 2.0 to turn any video into a green screen asset or clean plate, no rotoscoping required. Learn how with today's Runway Academy.

译使用 Aleph 2.0 将任何视频转换为绿幕资产或干净底板，无需旋转描摹。通过今天的 Runway Academy 学习操作方法。

查看原推 ↗

eric zakariasson@ericzakariasson · 6月4日74

http://x.com/i/article/2061967596568875008 # Don't let your agent guess, give it runtime context If you've ever watched an agent try to fix a bug, you've watched it guess. It reads the code, comes up with a theory, makes an edit, and hopes. Sometimes it's right. A lot of the time you get a fix that looks confident and quietly hides the real bug. Debug Mode is what we built for that. Instead of sitting there reasoning about the code, the agent goes and gets evidence about what the code does when it runs. Here's the loop 1. Agent comes up with multiple hypotheses, and starts to work on the most plausible first 1. Then, logging is added to test one hypothesis (without touching implementation) 1. A little debug server collects the runtime output to .cursor/debug.log while your program runs. 1. You reproduce the bug, and agent can now read the logs and understand what happened instead of having to guess 1. Cursor finds the root cause in the logs, makes the fix, and pulls out the logging it added. Here it is on a real bug, sped up to about a minute: ## How the team uses it Some interesting things that we've solved internally with debug mode: - A race condition that hit 1 in 20 runs. It was corrupting git metadata in our best-of-N runs. Debug Mode pinned it down in under an hour - A memory leak, traced in one pass. It came down to a misuse of our frontend framework. The fix was a single line. - A native crash deep in C++. An Electron crash people would normally route around. The logs made it findable. - An SSR flicker that had been given up on. A rendering bug nobody wanted to touch, fixed once the agent could see what the page was doing at runtime. Try it with Shift+Tab (it's in the CLI too, via /debug). I'm sure people are using it in ways I haven't thought of, so let me know!

译Cursor 发布 Debug Mode，解决 AI 智能体靠猜测修 Bug 的问题。工作流程：Agent 先生成多个假设，为最可能的假设添加日志（不修改代码）；调试服务器在程序运行时收集输出到 `.cursor/debug.log`；用户重现 Bug 后，Agent 读取日志而非猜测；最后 Cursor 从日志找到根因并修复，自动移除添加的日志。内部案例：追踪 1/20 概率出现的 git 元数据竞争条件（1 小时内定位）；一次单趟追踪内存泄漏（修复仅一行）；定位 Electron 中 C++ 原生崩溃；修复此前无人敢碰的 SSR 闪烁问题。用户可通过 Shift+Tab 或在 CLI 中使用 `/debug` 触发。

查看原推 ↗

6月4日

03:20

Chubby♨️@kimmonismus

65

微软新Surface Laptop Ultra上手体验

微软推出全新Surface Laptop Ultra，定位创作者和AI笔记本，搭载NVIDIA新芯片（RTX GPU），最高提供1 petaflop AI算力、128GB统一内存。配备15英寸mini-LED PixelSense Ultra触摸屏（3:2比例，262 PPI，峰值2000尼特HDR亮度），厚度不足18mm。作者在幕后参观中亲手检测，认为做工、散热、显示屏和芯片令人印象深刻，微软明确将目标对准MacBook Pro，意在直接挑战苹果。

Chubby♨️: First hands-on with Microsoft's new Surface Laptop Ultra. Microsoft is clearly positioning this as a new class of creato...

Microsoft产品更新端侧

03:20

Chubby♨️@kimmonismus

71

Google 开源 Gemma 4 12B：无编码器架构，本地 16GB VRAM 运行

Google 开源 Gemma 4 12B（密集参数，Apache 2.0 许可），采用全新无编码器架构：移除独立的视觉（550M 参数、27 层 Transformer）和音频（300M 参数、12 层 Conformer）编码器。视觉改为 35M 嵌入层（约缩小 15 倍），音频以 40ms 帧直接投影到大语言模型。模型在 16GB VRAM 笔记本上即可运行智能体推理、视觉和音频任务，性能接近 26B 参数模型。共享权重支持一次 LoRA 调优覆盖视觉、音频和文本。

Google: Today we're introducing Gemma 4 12B - our latest open model that brings advanced agentic reasoning, vision and audio dir...

Google多模态开源生态模型发布

03:20

Fei-Fei Li@drfeifei

精选78

世界模型的功能分类

World Labs团队与李飞飞发文，梳理“世界模型”这一被滥用的术语。对比语言模型学习文本统计，世界模型学习空间与时间统计（如光照、物理规律）。基于部分可观马尔可夫决策过程（POMDP）框架，智能体通过动作影响世界状态，观测是部分视图。当前被称为“世界模型”的不同系统本质上是同一循环的不同投影：第一类为渲染器，输出给人眼看的像素，以视觉保真度为核心。文章着重于概念分层，未给出具体模型名、参数或基准分数。

具身智能大佬观点现象/趋势

推荐理由：李飞飞亲手给纷乱的「世界模型」下了个三分类——渲染、模拟、规划，而且点破模拟才是根基。做机器人、空间智能的人，这篇是今年的坐标系。

03:08

OpenAI@OpenAI

28

是时候起飞了。

OpenAI产品更新

02:58

DogeDesigner@cb_doge

78

Grok Imagine 1.5 预览版已发布，即日起可在 API 中体验。SpaceXAI 正在发力。

Grok: Grok @Imagine 1.5 Preview is here Try it today in the API: http://x.ai/api/imagine

xAI图像生成模型发布

关联讨论 1 条

02:56

Anthropic@AnthropicAI

64

安全社区的技术在应对AI驱动的网络攻击方面表现如何？我们检查了832个恶意账户，并将其活动映射到一个长期存在的威胁行为者战术和技术数据库。以下是我们学到的：https：//www.anthropic.com/news/AI-enabled-cyber-threats-mitre-attack

Anthropic安全/对齐论文/研究

关联讨论 2 条

02:55

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes

38

AI安全倡导者账号指控，OpenAI与a16z支持的超级政治行动委员会（Super Pac）被曝开展虚假旗号行动：运营"傀儡账号"直接呼吁暴力，试图污名化AI安全阵营。引用推文显示，在将针对Sam Altman的暴力归咎于悲观言论后不到两周，@NathanLeamerDC的Build American AI似乎曾资助同一账号@jonathandoomer，该账号针对AI警告发布了暴力帖子。

Tyler Johnston: I find it unbelievable that, less than two weeks before blaming the violence against Sam Altman on doomer rhetoric, @Nat...

OpenAI安全/对齐行业动态

02:36

Demis Hassabis@demishassabis

精选74

Demis Hassabis 宣布 Gemma 4 系列下载量突破 1.5 亿，并正式发布新版 Gemma 4 12B 模型。该模型是一个统一的、无编码器的多模态模型，兼具边缘端效率与高级推理能力。尽管参数规模仅为 12B，但性能强劲，且足够小巧，可在仅需 16GB VRAM 的笔记本上本地运行。采用 Apache 2.0 开源许可证，方便开发者自由构建。

Google Gemma: Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to y...

Google开源生态模型发布端侧

关联讨论 2 条

推荐理由：Gemma 4 12B 用 Apache 2.0 许可把多模态模型压进笔记本，16GB 显存就能跑，端侧智能的性价比又一次被 Google 拉高，做本地推理的可以马上试试。

02:16

AYi@AYi_AInotes

65

Google 发布 Gemma 4 12B：无独立视觉编码器的统一多模态架构

Google 推出 Gemma 4 12B（Apache 2.0），采用无独立视觉编码器的统一多模态架构。仅用 35M 参数的轻量嵌入器，将图像切为 48×48 块、音频（16kHz 原始波形）切为 40ms 帧，直接作为 token 输入 Transformer。M4 Max 上 4-bit 量化识图延迟 1.2-1.5 秒，官方称 16GB 内存可用，但社区指出高分辨率多图会压线。该设计暗示：当基座模型足够大，专用子模块不再是必需，未来一个微调好的统一模型可能取代传统拼装 Whisper、LLaVa 等多模态 pipeline。

Google Gemma: Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to y...

Google多模态大佬观点端侧

02:16

AYi@AYi_AInotes

70

世界最好的开源图像模型，仅次于GPT-image-2和Nanobanana2

Ideogram: Introducing Ideogram 4.0: the best open image model in the world. Think it. Make it. Own it. Download the weights, fine-...

图像生成开源生态模型发布

02:15

Ethan Mollick@emollick

68

5月初，顶级超级预测者预计2026年底前最长METR 80%任务时间范围可达3-4小时。然而5月底，Anthropic的Claude Mythos模型在METR基准预览中即以80%成功率达到3小时6分钟，直接落在专家和超级预测者对2026年底的中位数预测范围内（3-4小时）。此前基线为1.5小时。此次突破表明AI能力进展速度远超预期。

Forecasting Research Institute: We also asked forecasters to predict the longest 80% success time horizon achieved by the end of 2026. All three groups ...

智能体Anthropic大佬观点

01:56

OpenCode@opencode

59

Qwen3.7 Plus 现已在 Go 中可用，支持文本和图像，1M 上下文，比 3.6 更便宜。

产品更新多模态编码

01:51

Artificial Analysis@ArtificialAnlys

71

Jensen Huang Computex 演讲引用 Artificial Analysis 基准介绍 Nemotron 3 Ultra 性能

Jensen Huang 在 Computex 主题演讲中引用 Artificial Analysis 的 Intelligence Index vs. Output Speed 图表，介绍 NVIDIA 新模型 Nemotron 3 Ultra 的性能。演讲还提及 GDPval-AA——Artificial Analysis 基于 OpenAI 的 GDPval 数据集评估模型在经济价值任务上的基准。NVIDIA 同时用 Artificial Analysis 的文生图和图生视频 Arena Elo 评分推广 Cosmos 3 模型族。

推理模型发布评测/基准

01:49

Krea@krea_ai

精选74

介绍 Ideogram v4.0。原生 2K 分辨率，出色的文字渲染，支持 JSON 提示词。立即在 Krea 中体验。

图像生成模型发布

关联讨论 1 条

推荐理由：图像生成模型的军备竞赛又添一员，Ideogram v4.0的2k原生分辨率和JSON prompt对接工作流，做设计生成的同学可以直接上手试试。

01:48

elvis@omarsar0

76

Miso One 8B开源情感TTS模型发布

Miso Labs 开源 8B 参数文本转语音模型 Miso One，专注于生成富有情感的表达，如温暖、犹豫或兴奋，告别机械音。模型专为短视频、播客和教育内容等旁白场景设计，推理延迟仅 110 毫秒，快于人类反应时间。模型权重完全开源，支持自托管、微调和数据私有化，API 即将开放。

Aoden Teo: Today, we're excited to introduce Miso One, the most emotive voice model in the world. Miso One is an 8-billion-paramete...

开源生态模型发布语音

01:47

Yuchen Jin@Yuchenj_UW

63

越来越多的工程师现在在AI token上花费的钱比他们的基本工资还要多。科技公司面临一个残酷的两难选择： > 让每个人尽情使用token并以AI速度前进 > 增加token预算并扼杀氛围 > 裁掉50%的人，给剩下的人无限token

现象/趋势

01:45

StepFun@StepFun_ai

56

在 @modal 上用 SGLang 部署 Step 3.7 Flash 🚀 Modal 是一个无服务器 AI 平台，用于部署和扩展计算密集型工作负载，无需管理基础设施。他们的新指南展示了如何在 Modal 上使用 SGLang 服务我们的开源权重 Step 3.7 Flash，采用 8×H100 GPU、Modal Volumes 以及兼容 OpenAI 的聊天补全端点。很高兴与 Modal 合作，让 StepFun 模型更易于构建者使用。 https：//modal.com/docs/examples/stepfun_inference

教程/实践部署/工程

01:36

Perplexity@perplexity_ai

56

Perplexity Computer 适用于成长型企业。它可连接超过400种工具，涵盖各类公司需求，包括Intuit QuickBooks、Vercel、Shopify、Canva等。了解更多关于企业如何使用Computer进行业务操作： https：//www.perplexity.ai/enterprise/use-cases/growing-businesses

智能体MCP/工具产品更新

01:20

Chubby♨️@kimmonismus

17

这大概是 GPT-5.6。要么明天，要么下周，我想。朋友们，准备好了。我们即将迎来一场狂野之旅！

leo 🐾: mercury-alpha

OpenAI其他

01:18

Rohan Paul@rohanpaul_ai

59

Nitrosend 发布 AI 邮件平台，Claude 单提示词控制全流程

Nitrosend 推出 AI 原生邮件平台，通过 MCP 协议与 Claude 连接。用户只需一条提示词，Claude 即可完成构建、设计、受众分组和发送完整邮件活动，而非仅生成草稿。该平台无传统仪表盘，Claude 直接控制系统工作流，包括设计、逻辑、目标定位和投递。引用推文显示，已有用户通过一条提示词成功向 10,000 人发送发布公告。

George Hartley ☄️: I just sent our launch announcement to 10,000 people. It took one prompt in Claude. Today we're launching @nitrosendx - ...

智能体AnthropicMCP/工具产品更新

01:08

xAI@xai

70

试试 @Vapi_AI 上最自然的TTS和性价比最高的STT API。来自 @xai 的Grok STT和Grok TTS现已在企业语音AI平台Vapi上线。基于Vapi构建自定义语音智能体，可让它们用客户的语言交流、在受监管的工作流中捕捉重要细节，并在每次通话中明显更具人性化。

Vapi: Grok STT and Grok TTS from @xai are now live on Vapi, the platform for enterprise voice AI. Build on Vapi to create cust...

xAI产品更新语音

关联讨论 1 条

01:05

Thariq@trq212

25

如果这个提示词让你觉得写得很好，那是因为Suzanne在业余时间是一名作家！你可以在这里阅读她的短篇小说《Mall of America》：https：//suzannewang.com/mall-of-america 这是我最喜欢的关于人类境况且恰好涉及AI的短篇小说之一。

Thariq: been asking others at Anthropic how they stay in the loop with Claude and fully understand the work being done this is o...

Anthropic其他

01:05

Josh Woodward@joshwoodward

44

Google Labs 发布实验性移动应用 Dreambeans。该应用利用 Personal Intelligence 连接用户 Google 应用，每天推送个性化故事集合，帮助用户发现可能错过的内容，并聚焦真正重要的事。团队将其理念描述为"希望滚动，而非末日滚动"。当前仅限美国符合条件的 Google AI Ultra 用户（18 岁以上）使用，同时开放公开等待名单。

Google Labs: 🚨 NEW EXPERIMENT 🚨 Dreambeans is a new, experimental mobile app that uses Personal Intelligence to connect to your Goo...

Google产品更新

01:00

郭明錤｜Ming-Chi Kuo@mingchikuo

65

苹果砍掉Vision Pro，智能眼镜路线图延迟至2027/2029

苹果分析师郭明錤更新预测：此前规划的XR头戴装置路线图已作废，目前仅两款智能眼镜设备有能见度。路线图大改由下一任CEO John Ternus拍板，Vision Pro系列被移除，资源转向智能眼镜。最新供应链调查显示，具有显示功能的AR/XR智能眼镜（光波导）推迟至2029年，无显示功能的AI眼镜（类似Ray-Ban Meta）仍预计2027年推出。郭明錤认为智能眼镜将带动下一波消费电子趋势。

郭明錤|Ming-Chi Kuo: Apple Vision系列與智慧眼鏡產品規劃預測 (2025-2028):智慧眼鏡可望帶動下一個消費電子趨勢全文連結:https://mingchikuo.craft.me/FgF89wv0af9Bpw

多模态端侧行业动态

01:00

郭明錤｜Ming-Chi Kuo@mingchikuo

63

苹果智能眼镜路线图更新：取消Vision Pro，AR眼镜推迟至2029

郭明錤更新苹果XR头显与智能眼镜路线图，原先版本已失效。目前仅剩两款智能眼镜产品在规划中，主要调整由苹果下任CEO John Ternus批准，取消了Vision Pro产品线，将资源转向更具大众市场潜力的智能眼镜。最新供应链调查显示，配备光学波导显示屏的AR/XR智能眼镜设备推迟至2029年；不带显示屏的AI眼镜（类似Ray-Ban Meta）预计2027年出货。

郭明錤|Ming-Chi Kuo: Apple Vision Series and Smart Glasses Roadmap (2025-2028): Smart Glasses Set to Drive the Next Wave in Consumer Electron...

端侧行业动态

00:58

Replit ⠕@Replit

精选67

你发布了你的应用。然后呢？你的应用可能看起来很棒，但如果没人能找到它，它就依然不可见。发布只是开始。认识一下SEO Agent。它会为你运行一次扫描，并建议修复措施，帮助你的应用在网页搜索和AI搜索中被发现。

产品更新部署/工程

推荐理由：Replit 把 SEO 优化做进了开发流程，对于靠内容获客的产品人，部署完直接跑一遍 SEO Agent 可能比手动改 meta 标签省心十倍。虽然不是什么底层突破，但解决的是真痛点。

00:55

🚨 AI News | TestingCatalog@testingcatalog

74

Ideogram 4.0 开源图像生成模型发布，在 LM Arena 文生图类别排名第 8，Design Arena 第 5，评分 1204，成为该领域排名最高的开放模型，性能接近 Nano Banana Pro。主要特性包括密集准确的文本渲染、原生 2K 分辨率、活动背景透明度及精确布局控制。

Arena.ai: New open model Ideogram-4.0-Quality has landed at #8 in the Text-to-Image Arena. This makes the new model by @ideogram_a...

图像生成开源生态模型发布

00:55

🚨 AI News | TestingCatalog@testingcatalog

46

GOOGLE 🔥：一项新的 Dreambeans 实验现已于 Google Labs 上线，面向美国地区的 Google AI Ultra 用户（需加入候补名单）。该实验利用个人智能，根据用户的数据上下文提供每日故事。对大多数人来说，这并非测试时间👀

Google产品更新

00:50

Chubby♨️@kimmonismus

精选75

Miso One 正式发布，一个 8B 参数的开源权重语音模型（TTS），旨在模拟真实人类朗读的温暖与节奏。它支持一次语音克隆（只需短样本），推理延迟仅 110ms。模型权重已开源至 GitHub，无需 API 即可自托管，音频数据不离开本地。API 访问即将推出。演示已上线，可先试听再克隆仓库。

Aoden Teo: Today, we're excited to introduce Miso One, the most emotive voice model in the world. Miso One is an 8-billion-paramete...

开源生态模型发布语音

推荐理由：Miso One这种8B参数、110ms延迟的情感TTS模型，直接把声音克隆和自托管做成了开箱即用，做语音产品的可以马上 clone 一个玩玩，比等 API 爽多了。

00:45

Ethan Mollick@emollick

60

大多数人，包括非常有成就的人，对LLM的运作方式没有准确的认知（他们凭什么有呢？）你可以从广泛的观念中看到这一点：认为AI只是从已知来源复制，或者它只能产生平均水平的答案，或者它不能产生新想法。

大佬观点现象/趋势

00:45

StepFun@StepFun_ai

44

阶跃星辰（StepFun）称其 Step 3.7 Flash 在与 DeepSeek V4-Flash 的物理编程测试中全面胜出。测试要求在不使用库的情况下，生成一个包含高尔顿板、旋转六边形弹球和同步节拍器三个场景的自包含 HTML5 canvas 动画，并实现真实物理。Step 3.7 Flash 输出 59.6k tokens（耗时 9分57秒），DeepSeek V4-Flash 输出 52.5k tokens（耗时 6分21秒）。尽管 DeepSeek 更快，但 StepFun 模型在物理模拟、视觉效果和逻辑渲染上均占优。主推文指出 Step 3.7 Flash 专为真实世界 agentic 编码任务设计，能保持复杂输出中逻辑、视觉和执行的一致性。

atomic.chat: StepFun Step 3.7 Flash smashed DeepSeek V4-Flash in a physics contest We gave two open-weight models the same task: writ...

DeepSeek编码评测/基准

00:33

Microsoft Research@MSFTResearch

62

一份在中西部装瓶厂进行的三个月试点显示，当AI超越聊天进入决策领域时会发生什么--约束条件变化、风险真实、答案必须可靠。 https：//msft.it/6015vjYUN

Microsoft推理论文/研究部署/工程

00:25

🚨 AI News | TestingCatalog@testingcatalog

51

Perplexity Personal Computer 现面向 Max 和 Enterprise Max 用户开放 Windows 版本！等候名单如下 👀

Perplexity: Join the waitlist for Personal Computer on Windows: https://www.perplexity.ai/hub/products/computer-for-windows

产品更新端侧

00:25

🚨 AI News | TestingCatalog@testingcatalog

65

Google 最新的 Gemma 4 12B 模型已上线 Hugging Face，采用 Apache 2.0 许可证。该模型与 Gemma 4 E2B/E4B 共享相同多模态能力，支持文本、音频、图像和视频输入，无需单独编码器即可实现原生音频和视觉理解。这种无编码器统一设计方案使其部署体积更小，非常适合消费级设备和本地执行环境。官方称其旨在弥合边缘效率与高级推理之间的差距。

Google Gemma: Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to y...

Google多模态模型发布端侧

00:20

Chubby♨️@kimmonismus

57

首次上手微软新的 Surface Laptop Ultra。微软明确将其定位为面向创作者和 AI 的新品类笔记本电脑，由搭载 RTX GPU 的新 NVIDIA 芯片驱动，专为本地 AI、创意工作流和游戏打造。几个突出规格： - 带 RTX GPU 的新 NVIDIA 芯片 - 最高 1 petaflop AI 算力 - 最高 128GB 统一内存 - 15 英寸 mini-LED PixelSense Ultra 触摸屏 - 3：2 比例 - 262 PPI - 最高 2000 尼特峰值 HDR 亮度 - 厚度不足 18mm

Microsoft产品更新端侧

00:19

Google AI Developers@googleaidevs

77

Google推出Gemma 4 12B无编码器多模态模型

Google发布Gemma 4 12B，一款无编码器的统一多模态模型，可直接将视觉和音频输入送入LLM主干，无需传统多模态编码器。该模型填补了移动端E4B模型与26B MoE模型之间的空白，封装前沿推理与原生音频能力，采用Apache 2.0许可。在16GB VRAM下即可本地运行复杂多步骤智能体工作流，性能接近26B模型。

Google多模态开源生态模型发布

关联讨论 5 条

00:17

elvis@omarsar0

66

微软SkillOpt论文：AI智能体技能实现自我进化

DAIR.AI的Elvis Saravia将微软SkillOpt论文集成到智能体编排器中后，所有智能体技能获得测试框架与自我演化机制。应用于多模态论文图表提取技能时，质量评分从0.73提升至0.93（+20点），提取结果显著改善。Saravia认为这是自我改进AI的早期范例，该思路可扩展至智能体模式优化、工具使用、上下文工程、智能体搜索及工作流评估等环节。他已基于SkillOpt启动多项后续实验。

智能体Microsoft多模态大佬观点

00:09

Sundar Pichai@sundarpichai

精选70

周一我们宣布了Alphabet的股权融资--这是我们多年投资策略的一部分，旨在抓住未来的AI机遇并支持我们看到的来自企业和消费者的需求。很高兴告诉大家，此次融资已大幅超额认购。我们共募集了约450亿美元，另将通过Q3启动的"按市价发行"计划再募集400亿美元（总计约850亿美元）。非常感谢我们的投资者，包括投资了100亿美元的伯克希尔·哈撒韦。

Google行业动态

推荐理由：850亿美金，伯克希尔押注10亿，这是AI军备竞赛以来最大单笔融资。谷歌在说：这场仗，我们准备打到2030年。

00:09

Runway@runwayml

73

使用 Aleph 2.0 将任何视频转换为绿幕资产或干净底板，无需旋转描摹。通过今天的 Runway Academy 学习操作方法。

产品更新教程/实践视频

关联讨论 3 条

00:01

eric zakariasson@ericzakariasson

74

Cursor 推出 Debug Mode：让 AI 智能体通过运行时日志修复 Bug

Cursor 发布 Debug Mode，解决 AI 智能体靠猜测修 Bug 的问题。工作流程：Agent 先生成多个假设，为最可能的假设添加日志（不修改代码）；调试服务器在程序运行时收集输出到 `.cursor/debug.log`；用户重现 Bug 后，Agent 读取日志而非猜测；最后 Cursor 从日志找到根因并修复，自动移除添加的日志。内部案例：追踪 1/20 概率出现的 git 元数据竞争条件（1 小时内定位）；一次单趟追踪内存泄漏（修复仅一行）；定位 Electron 中 C++ 原生崩溃；修复此前无人敢碰的 SSR 闪烁问题。用户可通过 Shift+Tab 或在 CLI 中使用 `/debug` 触发。

智能体产品更新编码部署/工程