NVIDIA 正式发布 Nemotron 3 Ultra,550B 总参数(55B 活跃)的完全开源 MoE 模型,权重、训练数据和完整配方全部公开。采用混合 Mamba-Attention 架构,专为长上下文快速解码和轻内存占用设计。在长输出智能体工作负载上,吞吐量约为可比开源模型的 6 倍(推理速度提升 5 倍),复杂智能体任务成本降低最多 30%。该模型在 4-bit(NVFP4)精度下预训练 20T tokens,后训练使用 MOPD 技术,由十余个专家教师模型蒸馏技能至学生模型。这是首个达到前沿水平且可完全复现的开源模型。
1/ NVIDIA shipped Nemotron 3 Ultra today, a fully open 550B model with 55B active params, with the weights, training data, and complete recipe all released openly. That alone is rare at this scale.
The headline however actually is speed. Ultra is a hybrid Mamba-Attention MoE, an architecture built for fast decoding and a light memory footprint over long contexts, and NVIDIA clocks it at roughly 6x (!) the throughput of comparable open models on long-output agent workloads while holding the same accuracy.
That's a serious engineering result, and it's aimed exactly where the industry is heading: autonomous agents that run long, multi-turn tasks where throughput per GPU is what actually costs money.
It was pre-trained in 4-bit (NVFP4) across 20T tokens, the largest stable run of its kind shown to date. And the post-training introduces MOPD, where ten-plus specialist teacher models distill their skills into the student on its own rollouts, sometimes pushing it past the teachers themselves.
The interesting aspect:This is a frontier-class model you can fully reproduce.