Google 推出开源实验性模型 DiffusionGemma,基于 Gemma 4 的文本扩散研究。该模型为 26B MoE 架构,仅激活 3.8B 参数,量化后可适配 18GB VRAM。核心突破在于每轮前向传播并行生成 256 个 token,实现推理速度提升 4 倍:H100 上可达 1000+ tokens/s,RTX 5090 达 700+ tokens/s。DiffusionGemma 通过初始化随机占位符画布并运行多轮并行去噪,同时生成整段文本,许可证为 Apache 2.0。
Great news for local LLMS. Google just released DiffusionGemma, an open experimental 26B MoE, activates only 3.8B.
Open model, Apache 2.0 license. fits within 18GB VRAM when quantized
The big deal is the speed, DiffusionGemma generates 256 tokens in parallel per forward pass.
This gives it up to 4x faster inference, with 1000+ tokens/s on an H100 and 700+ tokens/s on an RTX 5090.
Normal autoregressive LLMs behave like left-to-right printers, so each new token waits for the previous token, which makes local GPU inference slow for a single user.
DiffusionGemma initializes a 256-token canvas with random placeholder tokens, then runs multiple denoising passes that refine the whole canvas in parallel.