Gemma 4 的量化版把模型压到 1GB 以下,手机本地跑大模型的门槛又低了一大截。Google 这次没用传统的训练后量化,而是把压缩直接嵌进训练里,效果比 PTQ 好一截,搞端侧部署的可以拿 checkpoint 试起来了。
谷歌发布 Gemma 4 量化感知训练 (QAT) 检查点,支持在消费级 GPU 和移动设备上本地运行,质量损失极小。新检查点提供 GGUF(Q4_0)格式,覆盖所有尺寸及起草模型,实现最佳本地性能。自定义移动模式采用混合精度方案,将 Gemma 4 压缩至 1GB 以下,包含 2-bit 解码层、优化 KV 缓存和静态激活。通过在训练中模拟压缩(而非训练后量化),大幅降低内存占用并加速解码,同时保持推理质量。
New @GoogleGemma 4 QAT (Quantization-Aware Training) checkpoints are here, so you can run models locally on consumer GPUs and mobile devices with minimal quality loss.
What's new:
🔹 GGUF (Q4_0): Checkpoints: Max local performance across all sizes and drafter models 🔹 Custom Mobile Schema: We shrunk Gemma 4 down to less than 1GB for mobile devices by using a custom mixed precision schema designed for edge hardware (featuring targeted 2-bit decoding layers, optimized KV caches, and static activations)
By simulating compression during training rather than after (Post-Training Quantization), we've drastically reduced the memory footprint and accelerated decode speeds while preserving reasoning quality. https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4/