Scaling Laws for Model Merging in Large Language Models
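The study finds that language model merging follows a compact power law linking model size to the number of experts: the larger the model's capacity, the lower its loss floor, while the tail of merging gains shows clear diminishing returns as the number of experts grows. The law holds both in-domain and cross-domain, closely fits measured curves across architectures and merging methods, and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are added. A simple theory built on the law ties the floor and the tail to properties of the base model and to diversity across domains. The law makes predictive planning possible, for example estimating how many experts are needed to reach a target loss, or weighing a larger base model against more experts under a fixed budget. This turns model merging from a heuristic practice into an efficient approach that can be computed and planned.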
We study empirical scaling laws for language model merging, measured by cross-entropy. Despite its wide practical use, merging lacks a quantitative rule that predicts returns as we add experts or scale the model size. We identify a compact power law that links model size and the number of experts: the size-dependent floor decreases with model capacity, while the merging tail exhibits clear diminishing returns in the number of experts. The law holds in-domain and cross-domain, tightly fits measured curves across diverse architectures and methods (Average, Task Arithmetic (TA), TIES, DARE), and explains two robust regularities: most gains arrive early, and variability shrinks as more experts are included. Building on this, we present a simple theory that explains why gains fall roughly as 1/k, where k is the number of merged experts, and links the floor and tail to properties of the base model and the diversity across domains. This law enables predictive planning: estimate how many experts are needed to reach a target loss, decide when to stop adding experts, and trade off scaling the base model versus adding experts under a fixed budget. This turns merging from a heuristic practice into a computationally efficient, plannable alternative to multitask training. It also suggests a scaling principle for distributed generative AI: predictable gains can be achieved by composing specialists, offering a complementary path toward AGI-level systems.
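The planning use described in the abstract can be illustrated with a small sketch. The abstract does not state the fitted functional form, so the code below assumes, for a fixed model size, the simplest form consistent with gains falling roughly as 1/k: loss(k) ≈ floor + tail / k. The function names (`fit_floor_and_tail`, `predicted_loss`, `experts_needed`) and all numbers are hypothetical and purely illustrative; they are not results or code from the paper.

```python
import math


def fit_floor_and_tail(ks, losses):
    """Least-squares fit of loss ~ floor + tail / k.

    ASSUMPTION: this 1/k form is a simplification for illustration; the
    paper's actual law may include further terms (e.g. model-size dependence).
    The model is linear in x = 1/k, so ordinary least squares has a closed form.
    """
    xs = [1.0 / k for k in ks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    tail = sxy / sxx                 # slope with respect to 1/k
    floor = mean_y - tail * mean_x   # intercept: loss as k grows without bound
    return floor, tail


def predicted_loss(floor, tail, k):
    """Loss predicted by the assumed form for k merged experts."""
    return floor + tail / k


def experts_needed(floor, tail, target_loss):
    """Smallest integer k with floor + tail / k <= target_loss.

    Returns None when the target lies at or below the floor, i.e. no number
    of experts reaches it under the assumed form.
    """
    if target_loss <= floor:
        return None
    if tail <= 0:
        return 1
    return max(1, math.ceil(tail / (target_loss - floor)))


# Synthetic measurements (NOT from the paper): cross-entropy after merging
# k = 1, 2, 4, 8 experts at one fixed model size.
ks = [1, 2, 4, 8]
losses = [2.5, 2.25, 2.125, 2.0625]

floor, tail = fit_floor_and_tail(ks, losses)
k_target = experts_needed(floor, tail, target_loss=2.07)
```

Under this assumed form, the marginal gain from the k-th expert is tail / (k (k - 1)), which is one way to phrase a stopping rule: stop adding experts once that gain falls below whatever improvement justifies the cost of training one more specialist. Comparing the fitted floor across model sizes would then indicate whether a larger base model, rather than more experts, is needed to reach a given target.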