AIHOT

All updates · X · 611 posts

SenseTime @SenseTime_AI · Jun 4 · 69

"𝗦𝗲𝗿𝗶𝗼𝘂𝘀𝗹𝘆 𝗶𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝘀𝘁𝘂𝗳𝗳". Thanks for the kind words, @gurru_tech — that's 𝗦𝗲𝗻𝘀𝗲𝗡𝗼𝘃𝗮 𝗨𝟭 turning prompts into professional infographics. Unified model that natively understands and generates text and images. Open-sourced. Run it yourself. 🎥Watch the video: https://youtu.be/HKz2e3STUwg 🎛️ SenseNova Studio: https://unify.light-ai.top/ (Try infographics; also join Discord for text-image interleaved gen) 🤗 https://huggingface.co/collections/sensenova/sensenova-u1 🛠️ https://github.com/OpenSenseNova/SenseNova-U1 👾 Discord: https://discord.com/invite/BuTXPHmQub

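Summary: SenseTime has released SenseNova U1, an open-source multimodal model that natively understands and generates both text and images and can turn a prompt into a professional infographic in one step. Developer @gurru_tech called it "seriously impressive." The project is open source, with online access through SenseNova Studio plus a Hugging Face model collection, a GitHub repository, and a Discord community.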

elvis @omarsar0 · Jun 4 · 74

NEW: NVIDIA ships 550B MoE open model for long-running agents. Very exciting times to see more open models to support local long-running coding agents.

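Summary: NVIDIA today released Nemotron 3 Ultra, a 550B-parameter MoE open model with frontier-level intelligence, built for long-running agents. Compared with other open frontier models, it delivers 5× faster inference and 30% lower cost on complex agentic tasks.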

Artificial Analysis @ArtificialAnlys · Jun 4 · 74

NVIDIA has just released Nemotron 3 Ultra, the new most intelligent US open weights model, with leading speed for its intelligence Nemotron 3 Ultra scores 47.7 on the Artificial Analysis Intelligence Index, well ahead of the next strongest US open weights models, Gemma 4 31B (39.2), Nemotron 3 Super (36.0) and gpt-oss-120b (33.3), but behind the Chinese-led open weights frontier (Kimi K2.6 at 53.9). We partnered with @NVIDIA to evaluate this model for intelligence and speed ahead of its public release. These figures use the final NVFP4 weights that NVIDIA recommends for inference, but our tests show minimal intelligence impact compared to BF16 testing, with higher precision resulting in an Artificial Analysis Intelligence Index score of 48.2 vs. the NVFP4 score of 47.7. Key Takeaways: ➤ Nemotron 3 Ultra leads in speed for its intelligence: through BlackBox AI ahead of release, Nemotron 3 Ultra is served at over 400 output tokens per second - this is slightly faster than the typical serving speed of gpt-oss-120b despite being >4X larger, and comes with significantly greater intelligence ➤ Largest Nemotron 3 model so far: with approximately 550 billion total parameters and 55 billion active, Nemotron 3 Ultra is significantly larger than its siblings and is the largest and most intelligent US open weights model release ever ➤ Nemotron 3 Ultra is the leading US open weights model on the Artificial Analysis Intelligence and Agentic Indexes by far, but Gemma 4 31B scores ~1 point higher on the Coding Index (comprised of Terminal-Bench Hard and SciCode)

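Summary: NVIDIA has released Nemotron 3 Ultra, currently the most intelligent US open-weights model. It scores 47.7 on the Artificial Analysis Intelligence Index. That puts it ahead of Gemma 4 31B (39.2), Nemotron 3 Super (36.0) and gpt-oss-120b (33.3), but behind the Chinese open-weights model Kimi K2.6 (53.9). The model has about 550B total parameters, 55B of them active. It is served at over 400 tokens/s, slightly faster than gpt-oss-120b, with significantly higher intelligence. It scores 47.7 at NVFP4 precision and 48.2 at BF16, a negligible difference.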
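
Nemotron 3 Ultra stores about 550B parameters but activates only about 55B per token, roughly 10%. The toy Python/NumPy sketch below illustrates the generic top-k mixture-of-experts routing that creates this gap. Several details are illustrative assumptions, not NVIDIA's actual configuration: the expert count, the choice of k = 1 out of 10 experts, the tiny dimensions and the `moe_forward` name. Real active-parameter counts also include shared layers such as attention and embeddings.

```python
import numpy as np

def moe_forward(x, router_w, experts, k):
    """Send one token through the k highest-scoring experts (toy top-k MoE layer).

    x:        (d,) hidden state of a single token
    router_w: (num_experts, d) router weights
    experts:  list of (d, d) expert weight matrices
    k:        number of experts activated for this token
    """
    logits = router_w @ x                     # one routing score per expert
    top = np.argsort(logits)[-k:]             # indices of the k best-scoring experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax over the selected experts only
    # Only the selected experts are evaluated; the others stay idle for this token.
    out = sum(g * (experts[i] @ x) for g, i in zip(gates, top))
    return out, top

rng = np.random.default_rng(0)
d, num_experts, k = 8, 10, 1                  # hypothetical toy sizes
experts = [rng.standard_normal((d, d)) for _ in range(num_experts)]
router_w = rng.standard_normal((num_experts, d))

out, used = moe_forward(rng.standard_normal(d), router_w, experts, k)
stored = num_experts * d * d                  # expert weights held in memory
active = k * d * d                            # expert weights actually used for this token
print(out.shape, list(used), active / stored)
```

Changing k or the number of experts moves the active/stored ratio independently of model size. This is why MoE models are described with two parameter counts.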

StepFun @StepFun_ai · Jun 4 · 77

Great to see Step 3.7 Flash live on @FireworksAI_HQ. Designed for inference from day one, Step 3.7 Flash combines a hardware-friendly architecture with MTP-assisted decoding to reach up to 400 tokens/s. Fast, multimodal, and ready to power capable agents in real-world workflows.

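Summary: StepFun's Step 3.7 Flash is now available on Fireworks AI. It is a 198B sparse MoE multimodal model (VLM) with a 196B language backbone and a 1.8B vision encoder. It was optimized for inference efficiency from the start: a hardware-friendly architecture and MTP-assisted decoding reach 400 tokens/s. It offers native multimodal understanding and action, reliable tool use and enhanced search, and targets real-world agent workloads. It is released under the Apache 2.0 license.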

StepFun @StepFun_ai · Jun 4 · 73

Thanks @ArtificialAnlys for the detailed independent evaluation. Step 3.7 Flash is built with a clear focus on the intelligence-speed frontier: MTP-assisted decoding, 400+ output tokens/s, stronger agentic performance, native multimodal capabilities, and Apache 2.0 open weights. This is the direction we believe matters for production agent workloads: capable, efficient, and deployable at scale.

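Summary: StepFun has released the open Step 3.7 Flash (Apache 2.0), a MoE model with 198B total and 11B active parameters. MTP-assisted decoding (3 prediction heads) pushes output speed above 400 tokens/s, more than double comparable models. It scores 42.6 on the Artificial Analysis Intelligence Index, 4 points above Step 3.5 Flash. Agentic capability improves markedly: GDPval-AA Elo rises to 1298 and TerminalBench Hard to 35.6%. A new 1.8B vision encoder yields an MMMU-Pro score of 75.3%. The context window is 256K tokens, and BF16, FP8 and NVFP4 versions are available. Weaknesses: AA-Omniscience accuracy is only 25.4%, with an 84.4% hallucination rate.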
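
MTP-assisted decoding is a form of speculative decoding. Cheap extra heads guess the next few tokens, the full model checks those guesses, and several tokens can be accepted per step. StepFun's exact procedure is not given in these posts. The Python sketch below shows the standard greedy draft-then-verify loop under these assumptions:

- A rule-based `draft` function stands in for the 3 trained MTP heads.
- The toy `target` model simply counts modulo 10.
- Verification is written as a loop, although a real model scores all drafted positions in one forward pass.

```python
from typing import Callable, List, Tuple

Target = Callable[[List[int]], int]            # context -> full model's greedy next token
Draft = Callable[[List[int], int], List[int]]  # (context, k) -> k guessed tokens

def speculative_decode(target: Target, draft: Draft, prompt: List[int],
                       n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (new tokens, number of verification steps)."""
    seq, steps = list(prompt), 0
    while len(seq) - len(prompt) < n_new:
        proposal = draft(seq, k)
        steps += 1                             # one full-model forward pass per step
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)      # fix the first wrong guess, discard the rest
                break
            accepted.append(tok)
        else:
            accepted.append(target(seq + accepted))  # all guesses right: one bonus token
        seq.extend(accepted)
    return seq[len(prompt):len(prompt) + n_new], steps

def target(ctx: List[int]) -> int:             # toy "full model": counts modulo 10
    return (ctx[-1] + 1) % 10

def draft(ctx: List[int], k: int) -> List[int]:  # toy drafter, wrong whenever 7 comes next
    out, last = [], ctx[-1]
    for _ in range(k):
        guess = (last + 1) % 10
        out.append(0 if guess == 7 else guess)
        last = out[-1]
    return out

tokens, steps = speculative_decode(target, draft, prompt=[0], n_new=12)
print(tokens, steps)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2] 4
```

Every accepted token is checked against the full model, so the output is identical to plain greedy decoding. The speedup depends only on how often the drafts are right. Here 12 tokens take 4 verification steps instead of 12.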

DogeDesigner @cb_doge · Jun 4 · 65

Grok Imagine Video 1.5 is now ranked #1 on the Video Arena Leaderboard. 🥇

Artificial Analysis @ArtificialAnlys · Jun 4 · 67

StepFun's Step 3.7 Flash sits on the Intelligence vs Output Speed Pareto frontier, scoring 43 on the Artificial Analysis Intelligence Index and is served at over 400 output tokens/s Step 3.7 Flash (open weights, Apache 2.0) is a significant upgrade on Step 3.5 Flash and stands out for its speed and gains in agentic performance (particularly GDPval-AA). 400 output tokens/s is more than double other models of a similar size class. Contributing to this speed is that the model has only 11B active parameters and the model ships with trained Multi-Token Prediction heads (3) that predict several tokens in a single forward pass, letting it decode multiple tokens at once using speculative decoding. Key results for Step 3.7 Flash with the high reasoning level: ➤ 4 point Intelligence Index improvement: Step 3.7 Flash scores 42.6 on the Artificial Analysis Intelligence Index, up 4 points from Step 3.5 Flash 2603 (38.5). It is equivalent to Qwen3.5 122B A10B (41.6) and trails MiniMax-M2.7 (49.6) and DeepSeek V4 Flash (Max Effort, 46.5) ➤ Speed-intelligence frontier: Step 3.7 Flash achieves ~400 output tokens/s on StepFun's first-party API, placing the model on the Intelligence vs Output Speed Pareto frontier. StepFun has released the weights for this model and we expect several third-party providers to serve this model ➤ Agentic capability improvements: Step 3.7 Flash improves over Step 3.5 Flash 2603 across our agentic evaluations, in both GDPval-AA (real-world agentic tasks) and TerminalBench Hard (agentic coding and terminal use). It achieves a GDPval-AA Elo of 1298, up from 1070 for Step 3.5 Flash 2603, and it's TerminalBench Hard score increases to 35.6% from 32.6%. AA-LCR (Long Context Reasoning) improves to 63.7% from 54.3%. Scores for other evals remain relatively flat ➤ Weaker on knowledge and hallucination than peers: While Step 3.7 Flash trails competitors overall on AA-Omniscience (-38), it improves from Step 3.5 Flash 2603 (-44). 
It has an AA-Omniscience accuracy of 25.4% and a hallucination rate of 84.4% ➤ Native multimodal support, new in this generation: Step 3.7 Flash introduces a 1.8B-parameter vision encoder for native image understanding, where Step 3.5 Flash was text-only. On MMMU-Pro (multimodal reasoning) it scores 75.3%, roughly matching Qwen3.5 122B A10B (75.0%). Among its same-size open weights peers, MiniMax-M2.7, DeepSeek V4 Flash, and gpt-oss-120b are text-only Key model details: ➤ Context window: 256K tokens ➤ Parameters: 198B total, 11B active (MoE). At BF16 native precision, Step 3.7 Flash requires ~400GB to store the weights. StepFun has also released FP8 (~200GB) and NVFP4 (~100GB) versions for lower-memory deployment ➤ License: Apache 2.0 ➤ Availability: Currently Step 3.7 Flash is available on @StepFun_ai 's first-party API

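Summary: StepFun has open-sourced Step 3.7 Flash (Apache 2.0): 198B total and 11B active parameters (MoE), with a 256K context window. It scores 42.6 on the Artificial Analysis Intelligence Index, 4 points above Step 3.5 Flash. It outputs more than 400 tokens/s, accelerated by Multi-Token Prediction (3 prediction heads). A new 1.8B vision encoder adds native multimodal support, scoring 75.3% on MMMU-Pro. Agentic capability improves: GDPval-AA Elo rises from 1070 to 1298, TerminalBench Hard reaches 35.6%, and AA-LCR 63.7%. Knowledge and hallucination remain weak: AA-Omniscience accuracy is 25.4% with an 84.4% hallucination rate. BF16, FP8 and NVFP4 weights are provided to lower deployment cost.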
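
The storage figures quoted above (~400GB at BF16, ~200GB at FP8, ~100GB at NVFP4) follow from parameters × bits per weight. The short Python sketch below reproduces them. The NVFP4 overhead is our assumption: we count one 8-bit scale per 16-value block (about 4.5 bits per weight) and ignore any per-tensor scales. KV cache and activations are not included. All 198B parameters must be resident even though only 11B are active per token.

```python
def weight_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 198e9   # Step 3.7 Flash total parameters; every expert must be stored

# Effective bits per weight (NVFP4 assumption: 4-bit values + one FP8 scale per 16 values).
FORMATS = {"BF16": 16, "FP8": 8, "NVFP4": 4 + 8 / 16}

for name, bits in FORMATS.items():
    print(f"{name:>5}: ~{weight_gb(TOTAL_PARAMS, bits):.0f} GB")
```

This prints roughly 396, 198 and 111 GB. The quoted ~100GB for NVFP4 is presumably rounded or leaves out the block-scale overhead counted here.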

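歸藏 (guizang.ai) @op7418 · Jun 4 · 61

Reve 2.0 is a strong image model with native 4K output. The main thing is that you can edit as if the image had already been split into layers, the way you would work in Photoshop. Click any part of the image and it is selected, with no intermediate processing needed because the model has already done it for you. Whatever part you want to edit, you just click it.

Summary: The Reve 2.0 image model supports native 4K output, and its headline feature is Photoshop-like layered editing. Clicking any part of the image selects that region for targeted editing, with no complex intermediate steps. This greatly simplifies local image edits.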

Jeff Dean @JeffDean · Jun 4 · 75

Check out our Gemma 4 12B model: it's a super capable open weights model that can run directly on your laptop.

MiniMax (official) @MiniMax_AI · Jun 4 · 71

M3 is back in the free tier on @opencode 🚀 Jump in and try it while it lasts!

Summary: MiniMax M3 is launching soon and can be tried free in OpenCode right now. M3 is back in the free tier, so try it while it lasts.

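小互 @xiaohu · Jun 4 · 73

Ideogram has released its first open-source AI image model, Ideogram 4.0, claiming to push text rendering and layout control to the ceiling of what open models can do. With traditional text-to-image, you write a prompt and pray the model puts things in the right place. Ideogram 4.0 introduces bounding-box control: you use coordinates to specify exactly which region of the frame each element goes in. Structured JSON prompts: Ideogram 4.0 accepts a structured JSON prompt format as well as plain-text prompts. Multilingual text rendering: English OCR accuracy reaches 0.97 (X-Omni benchmark). It also supports dense text rendering across languages, including non-Latin scripts such as Chinese, Japanese and Korean.

Summary: Ideogram has released its first open-source AI image model, Ideogram 4.0, focused on text rendering and layout control. Bounding-box control lets you place elements precisely with coordinates, and a structured JSON prompt format is supported in addition to plain text. It reaches 0.97 English OCR accuracy (X-Omni benchmark) and supports dense cross-lingual text rendering, including non-Latin scripts such as Chinese, Japanese and Korean.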
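
A layout-controlled prompt like the one described above can be assembled programmatically. The Python sketch below is hypothetical. The field names (`scene`, `elements`, `description`, `bbox`, `text`) and the use of normalized (x0, y0, x1, y1) boxes with the origin at the top-left are our assumptions, not Ideogram 4.0's documented schema. Check the official prompt format before relying on it.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple

@dataclass
class Element:
    description: str
    bbox: Tuple[float, float, float, float]   # normalized (x0, y0, x1, y1), origin top-left
    text: Optional[str] = None                # literal text to render inside the box, if any

def build_prompt(scene: str, elements: List[Element]) -> str:
    """Validate boxes and serialize a hypothetical structured prompt as JSON."""
    for el in elements:
        x0, y0, x1, y1 = el.bbox
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid bbox for {el.description!r}: {el.bbox}")
    payload = {"scene": scene, "elements": [asdict(el) for el in elements]}
    # ensure_ascii=False keeps CJK text readable instead of \u escapes
    return json.dumps(payload, ensure_ascii=False, indent=2)

prompt = build_prompt(
    "summer music festival poster, flat illustration",
    [
        Element("headline", (0.1, 0.05, 0.9, 0.2), text="夏日音乐节"),
        Element("crowd in front of a stage at sunset", (0.0, 0.25, 1.0, 0.85)),
        Element("time line", (0.2, 0.88, 0.8, 0.95), text="Saturday 8 PM"),
    ],
)
print(prompt)
```

Validating boxes before serialization catches inverted or out-of-range coordinates locally instead of producing an unexpected layout.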

Elon Musk @elonmusk · Jun 4 · 72

Grok Imagine on Vercel

Summary: Grok Imagine Video 1.5 is now available on Vercel's AI Gateway, offering image-to-video generation with synced audio in one pass. Example code: `await generateVideo({ model: 'xai/grok-imagine-video-1.5-preview', prompt: 'a rabbit sprinting through nyc' });`

Elon Musk @elonmusk · Jun 4 · 73

Iliad (Troy) trailer made by Grok Imagine 1.5, which was just released

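Berryxia.AI @berryxia · Jun 4 · 67

Everyone still treats audio AI as a sideshow to vision and text. Meanwhile, one open-source model has unified speech, music and ambient sound in a single model and beaten every closed-source option. It is worth actually testing, and it looks genuinely solid.

Say you are building an audio agent locally. You want the AI to understand speech and also tell apart background music and sound effects, or even edit podcasts automatically. Until now, every option was either closed-source and absurdly expensive, or split speech and music into two systems that were a mess to chain together. Today MOSS-Audio removes that pain point.

The OpenMOSS team's model just hit #1 on Hugging Face Trending. It achieves true unified audio-language modeling of Speech, Sound and Music. Give it a conversation with background music, and it can transcribe the speech, identify the ambient sounds and understand the mood of the music at the same time. It can then produce a text description or run downstream tasks directly. This is not just piling up data; it connects the audio world at the architectural level. It is open source and commercially usable, with the code released on Hugging Face and GitHub, so ordinary developers can pull it down and run it locally today.

This flips the industry's mainstream view. The next piece of the puzzle on the road to superintelligence is not more competition on vision plus text, but letting AI perceive the world of sound as humans do. Audio was never an accessory; it will be a sensory input as important as text, and whoever gets it right first will have the edge in the next generation of agents. We used to assume audio AI had to wait for the big closed-source labs to iterate slowly. Now the open-source community has solved the speech + sound + music three-in-one problem in a single model, and is actually ahead on speed and openness.

Summary: The OpenMOSS team has released MOSS-Audio, an open-source audio-language model that unifies speech, environmental sound and music, and it has reached #1 on Hugging Face Trending. The model connects the three audio domains at the architectural level. It can simultaneously transcribe dialogue, identify background sounds and understand musical mood, and it can generate text or perform downstream tasks. It is fully open source and commercially usable, with code and weights published on Hugging Face and GitHub, so developers can run it locally.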

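小互 @xiaohu · Jun 4 · 71

Google releases the open Gemma 4 12B model: omni-modal AI on a 16GB laptop. Gemma 4 12B uses an encoder-free architecture called "Unified" that feeds text, image, audio and video inputs directly into the same Transformer backbone, so the model processes raw images and sound directly.

An analogy makes this clear. Traditional multimodal models handle images and audio like a boss who only reads Chinese and has two translators: one for English (the vision encoder) and one for Japanese (the audio encoder). Whenever English or Japanese material arrives, a translator must first turn it into Chinese before the boss can read it. The translators take up desk space (memory), translation means waiting in line (latency), and the boss gets a processed version rather than the original. Gemma 4 12B lays off both translators and teaches the boss to read English and Japanese directly.

A few key numbers:
- It runs in 16GB of VRAM or unified memory, or as little as 8GB with 4-bit quantization; the goal is to run locally on an ordinary laptop.
- 256K-token context window.
- Support for 140+ languages.
- A built-in Thinking mode (step-by-step reasoning) and native function calling.

Summary: Google has released the open Gemma 4 12B model, which uses an encoder-free Unified architecture to process text, images, audio and video directly without separate encoders. It runs in 16GB of VRAM, or as little as 8GB with 4-bit quantization. It supports a 256K-token context, 140+ languages, a built-in Thinking mode and function calling.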
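
The "as little as 8GB with 4-bit quantization" figure comes down to storing each weight in 4 bits plus a small per-group scale. The Python/NumPy sketch below shows generic symmetric group-wise int4 quantization. The group size of 32, the float16 scales and the function names are assumptions, and Google's own quantized releases may use a different scheme. At 4 bits plus one 16-bit scale per 32 weights (4.5 bits per weight), 12B parameters take about 6.75 GB. That is consistent with an 8GB budget once some memory is reserved for activations and the KV cache. The source does not say which precision the 16GB figure assumes.

```python
import numpy as np

def quantize_int4(w: np.ndarray, group_size: int = 32):
    """Symmetric per-group 4-bit quantization of a 1-D weight vector."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # largest |w| in a group maps to 7
    scale[scale == 0] = 1.0                                   # all-zero groups: avoid 0/0
    # Signed 4-bit range [-8, 7]; held in int8 here for simplicity
    # (real kernels pack two 4-bit values per byte).
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s)
print("max abs error:", float(np.abs(w - w_hat).max()))

# 4 bits per weight + one 16-bit scale per 32 weights = 4.5 bits per weight
bits = 4 + 16 / 32
print("12B weights at int4:", 12e9 * bits / 8 / 1e9, "GB")   # 6.75 GB
```

Smaller groups track local weight ranges more closely but spend more bits on scales. The group size is therefore the main accuracy-versus-size knob in this kind of scheme.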

MiniMax (official) @MiniMax_AI · Jun 4 · 77

15.6× faster decoding at 1M tokens 🔥 Thanks @FireworksAI_HQ for powering the inference behind M3. Try it now 👇

Berryxia.AI @berryxia · Jun 4 · 69

Google released the multimodal Gemma 4 12B model last night, and it runs with at least 16GB of memory. It would be worth comparing it with Qwen's models to see how well it performs.

DogeDesigner @cb_doge · Jun 4 · 70

SpaceXAI keeps raising the bar. 🔥 Grok Imagine Video 1.5 preview is now live on the API, and the results look insanely cinematic. 📽️ Go try it yourself. 💻 Godspeed SpaceXAI. 🚀

MiniMax (official) @MiniMax_AI · Jun 4 · 78

Mem0 is an official launch partner for MiniMax M3! M3's 1M token context window + @mem0ai 's memory layer = AI apps that truly remember. Build personalized AI agents with persistent memory, now with 50% off M3 during launch week. Get started with Minimax → https://platform.minimax.io/docs/guides/models-intro Sign up with mem0 → http://app.mem0.ai/?utm_source=minimax_x_post

Greg Brockman @gdb · Jun 4 · 71

Major upgrade to GPT-Rosalind, with much better intelligence for drug discovery, analysis, design, and experimental workflows:

🚨 AI News | TestingCatalog @testingcatalog · Jun 4 · 53

Reve 2.0 is now available, and it landed in second place in the text-to-image arena, outranking Nano Banana 2. > We invented a new way to generate and edit any image using precise layouts. For the first time, it’s possible to create images you can touch. > Images are represented as code, so every part of an image becomes addressable, editable, and manipulable. > Every image in Reve is segmented and labeled, giving you precise control over every region and element.

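Summary: Reve 2.0 is now live and ranks second in the Text-to-Image arena, ahead of Nano Banana 2 and GPT-Image-1.5. It uses a new way to generate and edit images based on precise layouts, which makes image creation interactive. Images are represented as code, so every region is addressable, editable and manipulable. Every image is also automatically segmented and labeled, giving users fine-grained control over each element.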
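
Reve describes images as segmented, labeled, addressable regions that can be selected with a click. One simple way to model that idea is a list of labeled regions with a stacking order plus a hit test, as in the Python sketch below. This is a hypothetical illustration only. Reve has not published its representation in these posts, and a real system would likely use per-pixel segmentation masks rather than the rectangular boxes assumed here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Region:
    label: str
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels; x1 and y1 are exclusive
    z: int                            # stacking order: higher is drawn on top
    attrs: dict = field(default_factory=dict)

def region_at(regions: List[Region], x: int, y: int) -> Optional[Region]:
    """Return the topmost region containing (x, y), i.e. what a click would select."""
    hits = [r for r in regions
            if r.bbox[0] <= x < r.bbox[2] and r.bbox[1] <= y < r.bbox[3]]
    return max(hits, key=lambda r: r.z, default=None)

scene = [
    Region("background", (0, 0, 1024, 1024), z=0),
    Region("person", (300, 200, 700, 1000), z=1),
    Region("hat", (420, 150, 580, 300), z=2, attrs={"color": "red"}),
]

picked = region_at(scene, 500, 250)      # click on the hat
picked.attrs["color"] = "blue"          # edit only that element
print(picked.label, picked.attrs)        # hat {'color': 'blue'}
print(region_at(scene, 50, 50).label)    # background
```

Because each element carries its own label and attributes, an edit touches only the selected region. Everything else in the scene is left unchanged.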

OpenAI @OpenAI · Jun 4 · 67

We’re bringing new capabilities to GPT-Rosalind, a model series purpose-built for life sciences research at enterprise scale. It brings GPT-5.5’s agentic coding and tool use together with stronger intelligence for drug discovery, analysis, design, and experimental workflows. https://openai.com/index/introducing-new-capabilities-to-gpt-rosalind

fofr @fofrAI · Jun 4 · 61

Ideogram v4 > a scan of a page from my high school A3 art pad, highly original niche pencil piece working on the aura of unusual cross sections and fluidity of otherwise solid surfaces in human portraiture with offset recursion, not anatomical, the cross sections reveal something else, very detailed and complex, no other anatomy, no embellishments, no pencil shavings, no tea stains, clean white paper

Summary: Ideogram v4 performs well, with open weights. Images are crisp and feel fresh.

MiniMax (official) @MiniMax_AI · Jun 4 · 65

@mem0ai is an official launch partner for MiniMax M3! M3's 1M token context window + @mem0ai 's memory layer = AI apps that truly remember. Build personalized AI agents with persistent memory, now with 50% off M3 during launch week. Get started with Minimax → https://platform.minimax.io/docs/guides/models-intro Sign up with mem0 → http://app.mem0.ai/?utm_source=minimax_x_post

Sundar Pichai @sundarpichai · Jun 4 · 73

Our new Gemma 4 12B model hits a sweet spot between size + performance: it can run locally on a laptop, while enabling powerful multi-step reasoning and agentic workflows. Can’t wait to see what the community does with this one!

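Summary: Downloads of the Gemma 4 family have passed 150 million, and Google has added a new member, Gemma 4 12B. With only 12B parameters, it runs locally on a laptop with 16GB of VRAM. It balances size and performance and supports multi-step reasoning and agentic workflows. It is released to the community under the Apache 2.0 license.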

fofr @fofrAI · Jun 4 · 69

Ideogram v4 is really good, and open weights. Images are crisp and feel fresh.

Chubby♨️ @kimmonismus · Jun 4 · 71

Gemma 4 12B shipped today under the label "encoder-free." A local 12b model that shows really good results. I'm a big fan of Gemma Gemma 4 12B is out: a dense, fully open model (Apache 2.0) that runs on a 16GB laptop and does agentic reasoning, vision and audio at a quality Google puts near its 26B model. The reason a 12B can pull this off: Google removed the separate vision and audio encoders and feeds both straight into the model, which keeps the memory footprint small enough for consumer GPUs. For on-device assistants and private coding agents, that lowers the bar a lot. always look forward to the updates. 12b is a good sweet spot in terms of size. a few facts: Vision: the 550M encoder (27 transformer layers) is now a 35M embedder, one matmul on 48x48 pixel patches. Roughly 15x smaller. Audio: the 300M encoder (12 conformer layers) is gone. Raw 16kHz audio cut into 40ms frames, projected straight into the LLM. So encoding didn't vanish, it collapsed into the backbone. The payoff is real: one shared set of weights, so you LoRA-tune vision, audio and text in a single pass.

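Summary: Google has open-sourced Gemma 4 12B (dense, Apache 2.0). Its new encoder-free architecture removes the separate vision encoder (550M parameters, 27 Transformer layers) and audio encoder (300M parameters, 12 Conformer layers). Vision now uses a 35M embedder (about 15× smaller), and audio is cut into 40ms frames projected directly into the LLM. The model runs agentic reasoning, vision and audio tasks on a laptop with 16GB of VRAM, with performance close to the 26B model. Because the weights are shared, a single LoRA tuning pass covers vision, audio and text.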
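
The post describes two input paths. Images enter through a single matrix multiplication over 48x48-pixel patches. Audio enters as raw 40 ms frames of 16 kHz samples, which is 640 samples per frame. The NumPy sketch below shows what such an encoder-free input path can look like. Each modality is flattened and linearly projected to the same hidden size, and the resulting rows are joined with text embeddings into one sequence for the shared backbone. The tiny hidden size, random weights, vocabulary size and function names are assumptions for illustration; Google's exact layer is not described here.

```python
import numpy as np

D_MODEL = 64                             # toy hidden size (assumption; the real one is far larger)
PATCH = 48                               # 48x48 pixel patches, per the post
SAMPLE_RATE, FRAME_MS = 16_000, 40
FRAME = SAMPLE_RATE * FRAME_MS // 1000   # 640 samples per 40 ms frame
VOCAB = 1000                             # toy text vocabulary (assumption)

rng = np.random.default_rng(0)
W_img = rng.standard_normal((PATCH * PATCH * 3, D_MODEL)) * 0.01  # the one image matmul
W_aud = rng.standard_normal((FRAME, D_MODEL)) * 0.01              # audio frame projection
E_txt = rng.standard_normal((VOCAB, D_MODEL)) * 0.01              # text embedding table

def image_tokens(img: np.ndarray) -> np.ndarray:
    """(H, W, 3) image -> (num_patches, D_MODEL), one row per 48x48 patch."""
    h, w, c = img.shape
    if h % PATCH or w % PATCH:
        raise ValueError("resize or pad the image to a multiple of 48 first")
    patches = (img.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
                  .transpose(0, 2, 1, 3, 4)           # group pixels by patch
                  .reshape(-1, PATCH * PATCH * c))    # flatten each patch to 6912 values
    return patches @ W_img

def audio_tokens(wave: np.ndarray) -> np.ndarray:
    """1-D 16 kHz waveform -> (num_frames, D_MODEL); a trailing partial frame is dropped."""
    n = len(wave) // FRAME
    return wave[: n * FRAME].reshape(n, FRAME) @ W_aud

img = rng.random((96, 144, 3))             # 2 x 3 = 6 patches
wave = rng.standard_normal(SAMPLE_RATE)    # 1 second of audio -> 25 frames
text_ids = np.array([1, 42, 7, 99, 3])     # toy token ids

# One sequence for the shared backbone: no separate vision or audio encoder stacks.
sequence = np.concatenate([image_tokens(img), audio_tokens(wave), E_txt[text_ids]])
print(sequence.shape)   # (36, 64)
```

In this toy setup the projections and the backbone form one set of trainable weights. That is the property the post credits for tuning all three modalities in a single LoRA pass.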

DogeDesigner @cb_doge · Jun 4 · 78

SpaceXAI is cooking.

Summary: Grok Imagine 1.5 Preview has been released and can be tried in the API starting today. SpaceXAI is stepping up.

Demis Hassabis @demishassabis · Jun 4 · 74

Celebrating the milestone of a massive 150+ million downloads of Gemma 4 with the release of the new Gemma 4 12B model! It's incredibly powerful for such a small model and it’s tiny enough to run locally on a laptop with just 16GB VRAM. Apache 2.0 license - happy building!

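Summary: Demis Hassabis announced that the Gemma 4 family has passed 150 million downloads and released the new Gemma 4 12B model. It is a unified, encoder-free multimodal model that combines edge efficiency with advanced reasoning. Despite having only 12B parameters, it is powerful yet small enough to run locally on a laptop with just 16GB of VRAM. It is released under the Apache 2.0 license so developers can build freely.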

AYi @AYi_AInotes · Jun 4 · 70

The best open-source image model in the world, second only to GPT-image-2 and Nanobanana2

Krea @krea_ai · Jun 4 · 74

introducing Ideogram v4.0. 2k native resolution, excellent text rendering, and support for JSON prompts. try it now in Krea.

elvis @omarsar0 · Jun 4 · 76

Another banger open-source release. Miso One is an 8B text-to-speech model with real emotional range, so voiceovers carry warmth, hesitation, and excitement instead of sounding flat. It's purpose-built for voiceover work like shorts, podcasts, and educational content, and it runs at 110ms latency, which is faster than human reaction time. The best part is that the weights are fully open source, so you can clone the repo, self-host, fine-tune, and keep your data private. Worth checking out if you're building voice into your tools and products: http://github.com/MisoLabsAI/MisoTTS

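Summary: Miso Labs has open-sourced Miso One, an 8B-parameter text-to-speech model focused on emotionally expressive delivery, such as warmth, hesitation or excitement, instead of a robotic voice. It is designed for voiceover work such as short videos, podcasts and educational content. Inference latency is just 110ms, faster than human reaction time. The weights are fully open, supporting self-hosting, fine-tuning and private data, and an API is coming soon.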

🚨 AI News | TestingCatalog @testingcatalog · Jun 4 · 74

Ideogram announced Ideogram 4.0, a new SOTA open image generation model! > Ideogram 4.0 lands in the 8th spot on LM Arena and the 5th spot on Design Arena in the text-to-image category, and is getting close to Nano Banana Pro's performance. > Ideogram 4.0 features dense, accurate text rendering, native 2K resolution, active background transparency, and precise layout control.

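Summary: Ideogram 4.0, an open image generation model, has been released. It ranks #8 in the LM Arena text-to-image category and #5 on Design Arena, with a score of 1204. This makes it the highest-ranked open model in the category, with performance close to Nano Banana Pro. Key features include dense, accurate text rendering, native 2K resolution, active background transparency and precise layout control.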

Chubby♨️ @kimmonismus · Jun 4 · 75

Miso One is live: an open-weights voice model built to sound like a real person reading, with actual warmth and pacing where most TTS still goes flat. 8B params, free on GitHub, with one-shot voice cloning from a short sample at 110ms latency. Self-host it and your audio data never leaves your machine. No API needed, no lock-in. Type any line into the demo and hear it before you clone the repo.

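Summary: Miso One is officially live: an 8B-parameter open-weights voice model (TTS) designed to capture the warmth and pacing of a real person reading aloud. It supports one-shot voice cloning from a short sample with 110ms inference latency. The weights are on GitHub and can be self-hosted without an API, so audio data never leaves your machine. API access is coming soon, and a live demo lets you listen before cloning the repo.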

🚨 AI News | TestingCatalog @testingcatalog · Jun 4 · 65

GOOGLE 🔥: A new Gemma 4 12B is now available on Huggingface under Apache 2.0 license! > Built with the same multimodal functionality as Gemma 4 E2B and E4B (text, audio, image, and video inputs), it brings native audio and vision understanding directly to local environments without the need for separate encoders. > This unified approach to multimodality makes the model encoder-free, offering a deployment size that is perfect for consumer devices and streamlined local execution.

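Summary: Google's latest Gemma 4 12B model is now on Hugging Face under the Apache 2.0 license. It shares the multimodal capabilities of Gemma 4 E2B/E4B, accepting text, audio, image and video inputs. It delivers native audio and vision understanding without separate encoders. This encoder-free, unified design keeps the deployment size small, which suits consumer devices and local execution. Google says it aims to bridge the gap between edge efficiency and advanced reasoning.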

Google AI Developers @googleaidevs · Jun 4 · 77

We’re launching Gemma 4 12B: Our unified, encoder-free model that brings powerful multimodal intelligence straight to your laptop 🚀 The model bridges the gap between our mobile E4B model and larger 26B MoE models, packaging frontier-class reasoning and native audio into a highly optimized footprint, all under a permissive Apache 2.0 license. Here’s what makes it unique: + Encoder-Less Architecture: We removed the multimodal encoders. The vision and audio inputs flow directly into the LLM backbone. + Agentic Performance (16GB VRAM): Run complex, multi-step workflows locally, with performance nearing our 26B model.

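Summary: Google has released Gemma 4 12B, a unified, encoder-free multimodal model that feeds vision and audio inputs directly into the LLM backbone without traditional multimodal encoders. It fills the gap between the mobile E4B model and the 26B MoE model, packaging frontier-class reasoning and native audio under the Apache 2.0 license. Complex multi-step agentic workflows run locally in 16GB of VRAM, with performance close to the 26B model.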

SenseTime @SenseTime_AI · Jun 3 · 73

A plain sneaker image went in. Marketing visuals came out. #SenseNova U1 — see, think, create — all in one model. #OpenSourced. This is the architecture shift people keep talking about. Shoutout @AiLockup for the demo 🔥 🎥Watch the video: https://youtu.be/9IFgPqMWBGg Try it today: 🎛️ SenseNova Studio: https://unify.light-ai.top/ (Try infographics; also join Discord for text-image interleaved gen) 🤗 https://huggingface.co/collections/sensenova/sensenova-u1 🛠️ https://github.com/OpenSenseNova/SenseNova-U1 👾 Discord: https://discord.com/invite/BuTXPHmQub @huggingface @github

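Summary: SenseTime has open-sourced SenseNova U1, which it says unifies "see, think, create" in one model: a plain sneaker image goes in, and marketing visuals come out. SenseTime presents it as an architectural shift. Users can try it through SenseNova Studio, Hugging Face and GitHub.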

Alibaba Cloud @alibaba_cloud · Jun 3 · 71

Qwen: Foundation Models for the Agent Era with Steven Hoi, Head of Multimodal Interaction, Tongyi Large Model BU Qwen3.7 delivers major breakthroughs in reasoning, fully upgrading native agentic capabilities across tool use, coding, and long-horizon tasks.

Satya Nadella @satyanadella · Jun 3 · 82

With the new MAI models and Frontier Tuning capabilities we announced today, we're focused on helping every company move from just consuming a frontier model to fully participating at the frontier.

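Berryxia.AI @berryxia · Jun 3 · 74

An old tree blooms again: big brother Microsoft released new models today 😄 Grabbing some attention, haha, otherwise nobody would remember them. Microsoft AI just dropped seven brand-new MAI models. According to Microsoft, this is not a simple iteration but a whole family trained from scratch, with clean data lineage and zero distillation.

- MAI-Thinking-1 handles reasoning, MAI-Code-1-Flash handles coding, MAI-Image-2.5 handles images, MAI-Transcribe-1.5 handles transcription and MAI-Voice-2 handles voice, plus their respective Flash versions.
- The standout is MAI-Code-1-Flash. It scores 71.6 on SWE-Bench Verified, 5 points above Claude Haiku 4.5, and 16 points higher on the Pro leaderboard, while using 60% fewer tokens. It is rolling out gradually in Copilot.
- MAI-Image-2.5 ranks second in image editing and third in text-to-image on Arena. It faithfully preserves faces, logos and details, and is already built into PowerPoint and OneDrive.
- MAI-Transcribe-1.5 ranks first in both accuracy and speed across 43 languages and processes an hour of audio in 15 seconds.
- MAI-Voice-2 can control emotion, supports multilingual code-switching, and keeps speaker identity stable over long content.

The models are not meant to work in isolation; they are designed as a family that collaborates seamlessly. This time Microsoft did not go for "one big model that does everything." It split the tasks apart, trained each model from scratch on clean data, and published all the technical details and lessons learned. That reverses the industry's current mainstream path: everyone else is competing on parameter count and distilling other companies' outputs. Microsoft is saying that long-term competitiveness comes from a family of models that are built from scratch with clean lineage, specialized by task and able to work together. How well they work in practice still awaits testing. Looking forward to seeing the real-world performance!

Summary: At its Build conference, Microsoft announced a family of seven new MAI models, trained from scratch with "clean data lineage" and designed to be task-specialized and work together seamlessly. MAI-Code-1-Flash scores 71.6 on SWE-Bench Verified, 5 points above Claude Haiku 4.5, while using 60% fewer tokens. MAI-Transcribe-1.5 processes an hour of audio in just 15 seconds and leads in both speed and accuracy across 43 languages. With this release, Microsoft is showcasing its approach of building specialized, cooperating models from scratch.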

June 4

22:46

SenseTime@SenseTime_AI

Featured · 69

SenseNova U1 open-source unified model: native image and text generation

Image generation · Multimodal · Open-source ecosystem · Model release

1 related discussion

Why it's recommended: SenseTime has open-sourced a unified image-text model this time. SenseNova U1's infographic feature understands text and layout better than most text-to-image tools on the market, so content creators may want to give it a try.

22:22

elvis@omarsar0

74

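NVIDIA today released Nemotron 3 Ultra, a 550B-parameter MoE open model with frontier-level intelligence, built for long-running agents. Compared with other open frontier models, it delivers 5× faster inference and 30% lower cost on complex agentic tasks.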

NVIDIA AI: Today we're shipping Nemotron 3 Ultra. A 550B MoE frontier-intelligence open model built for long-running agents. It del...

Agents · Open-source ecosystem · Model release

21:54

Artificial Analysis@ArtificialAnlys

74

NVIDIA releases Nemotron 3 Ultra, setting a new intelligence bar for US open-weights models

Open-source ecosystem · Reasoning · Model release · Evaluation/Benchmarks

21:18

StepFun@StepFun_ai

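Featured · 77

StepFun's Step 3.7 Flash is now available on Fireworks AI. It is a 198B sparse MoE multimodal model (VLM) with a 196B language backbone and a 1.8B vision encoder. It was optimized for inference efficiency from the start: a hardware-friendly architecture and MTP-assisted decoding reach 400 tokens/s. It offers native multimodal understanding and action, reliable tool use and enhanced search, and targets real-world agent workloads. It is released under the Apache 2.0 license.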

Fireworks AI: Many research labs only consider inference efficiency after the fact. Step 3.7 Flash is a 198B sparse MoE VLM designed b...

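Multimodal · Reasoning · Model release

3 related discussions

Why it's recommended: A 198B sparse MoE with MTP decoding pushes speed to 400 tok/s, and it is open source under Apache 2.0. Those specs make it a good fit as the brain of an agent, and anyone building real-time applications may want to try it.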

12:17

StepFun@StepFun_ai

73

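StepFun has released the open Step 3.7 Flash (Apache 2.0), a MoE model with 198B total and 11B active parameters. MTP-assisted decoding (3 prediction heads) pushes output speed above 400 tokens/s, more than double comparable models. It scores 42.6 on the Artificial Analysis Intelligence Index, 4 points above Step 3.5 Flash. Agentic capability improves markedly: GDPval-AA Elo rises to 1298 and TerminalBench Hard to 35.6%. A new 1.8B vision encoder yields an MMMU-Pro score of 75.3%. The context window is 256K tokens, and BF16, FP8 and NVFP4 versions are available. Weaknesses: AA-Omniscience accuracy is only 25.4%, with an 84.4% hallucination rate.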

Artificial Analysis: StepFun's Step 3.7 Flash sits on the Intelligence vs Output Speed Pareto frontier, scoring 43 on the Artificial Analysis...

Agents · Multimodal · Reasoning · Model release

3 related discussions

12:00

DogeDesigner@cb_doge

65

Grok Imagine Video 1.5 is now ranked #1 on the Video Arena Leaderboard. 🥇

Elon Musk: Iliad (Troy) trailer made by Grok Imagine 1.5, which was just released

Multimodal · Model release · Video

11:52

Artificial Analysis@ArtificialAnlys

67

StepFun open-sources Step 3.7 Flash, advancing both performance and speed

Multimodal · Open-source ecosystem · Reasoning · Model release

11:00

歸藏(guizang.ai)@op7418

61

Reve 2.0 image model: native 4K output and layered editing

Image generation · Model release

10:23

Jeff Dean@JeffDean

75

Check out our Gemma 4 12B model: it's a super capable open weights model that can run directly on your laptop.

Google Gemma: Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to y...

Google · Multimodal · Open-source ecosystem · Model release

4 related discussions

09:58

MiniMax (official)@MiniMax_AI

71

MiniMax M3 is launching soon and can be tried free in OpenCode right now. M3 is back in the free tier, so try it while it lasts.

OpenCode: MiniMax M3 will be launching soon You can try it right now in OpenCode For free

Open-source ecosystem · Model release

11 related discussions

09:40

小互@xiaohu

73

Ideogram 4.0 goes open source: bounding-box control plus multilingual text rendering

Image generation · Open-source ecosystem · Model release

09:06

Elon Musk@elonmusk

Featured · 72

Grok Imagine Video 1.5 is now available on Vercel's AI Gateway, offering image-to-video generation with synced audio in one pass. Example code: `await generateVideo({ model: 'xai/grok-imagine-video-1.5-preview', prompt: 'a rabbit sprinting through nyc' });`

Vercel Developers: Grok Imagine Video 1.5 on AI Gateway. Image-to-video generation with synced audio in one pass. await generateVideo({ mod...

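xAI · Image generation · Model release · Video

Why it's recommended: Grok Imagine Video 1.5 adds synced audio to image-to-video, so a single prompt produces a short clip with sound. Short-video and creative teams can switch to this pipeline.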

09:06

Elon Musk@elonmusk

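Featured · 73

Iliad (Troy) trailer made by Grok Imagine 1.5, which was just released

xAI · Multimodal · Model release · Video

Why it's recommended: Elon personally demos Grok Imagine 1.5, and the quality of the generated Iliad trailer suggests another round of competition in video generation. Short-film makers may want to keep an eye on it.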

08:51

Berryxia.AI@berryxia

67

MOSS-Audio: an open-source audio-language model unifying speech, environmental sound and music tops HF Trending

MOSI: MOSS-Audio just hit #1 on @huggingface Trending. Speech. Sound. Music. One open audio-language model. Try it: Hugging Fa...

Multimodal · Model release · Speech

08:40

小互@xiaohu

71

Google releases the open Gemma 4 12B model

Google · Multimodal · Open-source ecosystem · Model release

07:58

MiniMax (official)@MiniMax_AI

77

15.6× faster decoding at 1M tokens 🔥 Thanks @FireworksAI_HQ for powering the inference behind M3. Try it now 👇

Fireworks AI: MiniMax M3 arrives with MiniMax Sparse Attention (MSA), 15.6x faster decoding at 1M tokens. We're partnering with @MiniM...

Reasoning · Model release

11 related discussions

07:51

Berryxia.AI@berryxia

69

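Google released the multimodal Gemma 4 12B model last night, and it runs with at least 16GB of memory. It would be worth comparing it with Qwen's models to see how well it performs.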

Google Gemma: Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to y...

Google · Multimodal · Model release · On-device

06:59

DogeDesigner@cb_doge

70

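SpaceXAI keeps raising the bar. 🔥 Grok Imagine Video 1.5 preview is now live on the API, and the results look insanely cinematic. 📽️ Go try it yourself. 💻 Godspeed SpaceXAI. 🚀

Multimodal · Model release · Video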

05:58

MiniMax (official)@MiniMax_AI

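Featured · 78

Mem0 is an official launch partner for MiniMax M3! M3's 1M token context window + @mem0ai 's memory layer = AI apps that truly remember. Build personalized AI agents with persistent memory, now with 50% off M3 during launch week. Get started with Minimax → https://platform.minimax.io/docs/guides/models-intro Sign up with mem0 → http://app.mem0.ai/?utm_source=minimax_x_post

Agents · MCP/Tools · Model release

11 related discussions

Why it's recommended: MiniMax pairs its 1M context with the Mem0 memory layer. Rather than just showing off specs, this gives agents a hard drive, and anyone building long-term memory products should take note.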

05:57

Greg Brockman@gdb

71

Major upgrade to GPT-Rosalind, with much better intelligence for drug discovery, analysis, design, and experimental workflows:

OpenAI: We're bringing new capabilities to GPT-Rosalind, a model series purpose-built for life sciences research at enterprise s...

Agents · OpenAI · Model release

05:57

🚨 AI News | TestingCatalog@testingcatalog

53

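Reve 2.0 is now live and ranks second in the Text-to-Image arena, ahead of Nano Banana 2 and GPT-Image-1.5. It uses a new way to generate and edit images based on precise layouts, which makes image creation interactive. Images are represented as code, so every region is addressable, editable and manipulable. Every image is also automatically segmented and labeled, giving users fine-grained control over each element.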

Reve: Our independent research lab ranks top 2 on @arena Text-to-Image, ahead of Nano Banana 2 and GPT-Image-1.5.

Image generation · Model release

05:39

OpenAI@OpenAI

67

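We’re bringing new capabilities to GPT-Rosalind, a model series purpose-built for life sciences research at enterprise scale. It brings GPT-5.5’s agentic coding and tool use together with stronger intelligence for drug discovery, analysis, design, and experimental workflows. https://openai.com/index/introducing-new-capabilities-to-gpt-rosalind

OpenAI · Reasoning · Model release

2 related discussions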

04:31

fofr@fofrAI

61

Ideogram v4 performs well, with open weights. Images are crisp and feel fresh.

fofr: Ideogram v4 is really good, and open weights. Images are crisp and feel fresh.

Image generation · Open-source ecosystem · Model release

04:28

MiniMax (official)@MiniMax_AI

65

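@mem0ai is an official launch partner for MiniMax M3! M3's 1M token context window + @mem0ai 's memory layer = AI apps that truly remember. Build personalized AI agents with persistent memory, now with 50% off M3 during launch week. Get started with Minimax → https://platform.minimax.io/docs/guides/models-intro Sign up with mem0 → http://app.mem0.ai/?utm_source=minimax_x_post

Agents · MCP/Tools · Model release

11 related discussions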

03:40

Sundar Pichai@sundarpichai

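Featured · 73

Downloads of the Gemma 4 family have passed 150 million, and Google has added a new member, Gemma 4 12B. With only 12B parameters, it runs locally on a laptop with 16GB of VRAM. It balances size and performance and supports multi-step reasoning and agentic workflows. It is released to the community under the Apache 2.0 license.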

Demis Hassabis: Celebrating the milestone of a massive 150+ million downloads of Gemma 4 with the release of the new Gemma 4 12B model! ...

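Google · Open-source ecosystem · Model release · On-device

1 related discussion

Why it's recommended: Gemma 4 12B fits multi-step reasoning into a model small enough to run on a laptop, and it is open source under Apache 2.0. For developers building local agents, that is real new ammunition, and small models are getting close to genuinely usable.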

03:31

fofr@fofrAI

69

Ideogram v4 is really good, and open weights. Images are crisp and feel fresh.

Ideogram: Introducing Ideogram 4.0: the best open image model in the world. Think it. Make it. Own it. Download the weights, fine-...

Image generation · Multimodal · Open source/Repos · Model release

03:20

Chubby♨️@kimmonismus

71

Google open-sources Gemma 4 12B: encoder-free architecture, runs locally in 16GB of VRAM

Google: Today we're introducing Gemma 4 12B - our latest open model that brings advanced agentic reasoning, vision and audio dir...

Google · Multimodal · Open-source ecosystem · Model release

02:58

DogeDesigner@cb_doge

78

Grok Imagine 1.5 Preview has been released and can be tried in the API starting today. SpaceXAI is stepping up.

Grok: Grok @Imagine 1.5 Preview is here Try it today in the API: http://x.ai/api/imagine

xAI · Image generation · Model release

1 related discussion

02:36

Demis Hassabis@demishassabis

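Featured · 74

Demis Hassabis announced that the Gemma 4 family has passed 150 million downloads and released the new Gemma 4 12B model. It is a unified, encoder-free multimodal model that combines edge efficiency with advanced reasoning. Despite having only 12B parameters, it is powerful yet small enough to run locally on a laptop with just 16GB of VRAM. It is released under the Apache 2.0 license so developers can build freely.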

Google Gemma: Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to y...

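Google · Open-source ecosystem · Model release · On-device

1 related discussion

Why it's recommended: Gemma 4 12B brings a multimodal model to laptops under the Apache 2.0 license and runs in just 16GB of VRAM. Google has once again raised the price-performance bar for on-device intelligence, and anyone doing local inference can try it right away.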

02:16

AYi@AYi_AInotes

70

The best open-source image model in the world, second only to GPT-image-2 and Nanobanana2

Ideogram: Introducing Ideogram 4.0: the best open image model in the world. Think it. Make it. Own it. Download the weights, fine-...

Image generation · Open-source ecosystem · Model release

01:49

Krea@krea_ai

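Featured · 74

introducing Ideogram v4.0. 2k native resolution, excellent text rendering, and support for JSON prompts. try it now in Krea.

Image generation · Model release

1 related discussion

Why it's recommended: Another entrant in the image-generation arms race. Ideogram v4.0's native 2K resolution and JSON prompts slot into existing workflows, so anyone working on design generation can start using it directly.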

01:48

elvis@omarsar0

76

Miso One: 8B open-source emotional TTS model released

Aoden Teo: Today, we're excited to introduce Miso One, the most emotive voice model in the world. Miso One is an 8-billion-paramete...

Open-source ecosystem · Model release · Speech

00:55

🚨 AI News | TestingCatalog@testingcatalog

74

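Ideogram 4.0, an open image generation model, has been released. It ranks #8 in the LM Arena text-to-image category and #5 on Design Arena, with a score of 1204. This makes it the highest-ranked open model in the category, with performance close to Nano Banana Pro. Key features include dense, accurate text rendering, native 2K resolution, active background transparency and precise layout control.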

Arena.ai: New open model Ideogram-4.0-Quality has landed at #8 in the Text-to-Image Arena. This makes the new model by @ideogram_a...

Image generation · Open-source ecosystem · Model release

00:50

Chubby♨️@kimmonismus

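Featured · 75

Miso One is officially live: an 8B-parameter open-weights voice model (TTS) designed to capture the warmth and pacing of a real person reading aloud. It supports one-shot voice cloning from a short sample with 110ms inference latency. The weights are on GitHub and can be self-hosted without an API, so audio data never leaves your machine. API access is coming soon, and a live demo lets you listen before cloning the repo.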

Aoden Teo: Today, we're excited to introduce Miso One, the most emotive voice model in the world. Miso One is an 8-billion-paramete...

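Open-source ecosystem · Model release · Speech

Why it's recommended: Miso One, an 8B-parameter emotional TTS model with 110ms latency, makes voice cloning and self-hosting work out of the box. Anyone building voice products can clone it and experiment right away, which beats waiting for an API.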

00:25

🚨 AI News | TestingCatalog@testingcatalog

65

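Google's latest Gemma 4 12B model is now on Hugging Face under the Apache 2.0 license. It shares the multimodal capabilities of Gemma 4 E2B/E4B, accepting text, audio, image and video inputs. It delivers native audio and vision understanding without separate encoders. This encoder-free, unified design keeps the deployment size small, which suits consumer devices and local execution. Google says it aims to bridge the gap between edge efficiency and advanced reasoning.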

Google Gemma: Meet Gemma 4 12B! A unified, encoder-free multimodal model designed to bring high-performance intelligence directly to y...

Google · Multimodal · Model release · On-device

00:19

Google AI Developers@googleaidevs

77

Google launches Gemma 4 12B, an encoder-free multimodal model

Google · Multimodal · Open-source ecosystem · Model release

4 related discussions

June 3

22:39

SenseTime@SenseTime_AI

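Featured · 73

SenseTime open-sources SenseNova U1: one model for visual understanding, reasoning and generation

GitHub · Hugging Face · Image generation · Multimodal

1 related discussion

Why it's recommended: SenseTime packs understanding, reasoning and creation into one model and open-sources it, so visual marketing teams no longer need to stitch together a toolchain.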

13:08

Alibaba Cloud@alibaba_cloud

71

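Qwen: Foundation Models for the Agent Era, presented by Steven Hoi, Head of Multimodal Interaction at the Tongyi Large Model BU. Qwen3.7 delivers major breakthroughs in reasoning, fully upgrading native agentic capabilities across tool use, coding and long-horizon tasks.

Agents · Reasoning · Model release

10 related discussions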

10:32

Satya Nadella@satyanadella

82

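With the new MAI models and Frontier Tuning capabilities we announced today, we're focused on helping every company move from just consuming a frontier model to fully participating at the frontier.

Microsoft · Data/Training · Model release

4 related discussions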

09:48

Berryxia.AI@berryxia

74

Microsoft releases seven new MAI models at Build

Microsoft AI: Seven new models launching at Build: let's go! Reasoning. Code. Image. Transcribe. Voice. Built from scratch on a clean ...

Microsoft · Image generation · Model release · Coding