Can Muon Fine-Tune Adam-Pretrained Models?
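This study examines the performance degradation that occurs when the optimizer is switched directly to Muon for fine-tuning an Adam-pretrained model, and attributes it to an optimizer mismatch arising from the two optimizers' different implicit biases. The mismatch disrupts pretrained knowledge, and its impact scales with update strength. Experiments show that constraining updates with parameter-efficient fine-tuning methods such as LoRA effectively mitigates the problem. Across language and vision tasks, LoRA substantially narrows the performance gap between Adam and Muon seen under full fine-tuning. Further studies on LoRA rank, catastrophic forgetting, and LoRA variants confirm that mismatch severity correlates with update strength. The code is open source.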
Muon has emerged as an efficient alternative to Adam for pretraining, yet remains underused for fine-tuning. A key obstacle is that most open models are pretrained with Adam, and naively switching to Muon for fine-tuning leads to degraded performance due to an optimizer mismatch. We investigate this mismatch through controlled experiments and relate it to the distinct implicit biases of Adam and Muon. We provide evidence that the mismatch disrupts pretrained knowledge, and that this disruption scales with update strength. This leads us to hypothesize that constraining updates should mitigate the mismatch. We validate this with LoRA: across language and vision tasks, LoRA reduces the performance gap between Adam and Muon observed under full fine-tuning. Studies on LoRA rank, catastrophic forgetting, and LoRA variants further confirm that mismatch severity correlates with update strength. These results shed light on how optimizer mismatch affects fine-tuning and how it can be mitigated. Our code is available at https://github.com/XingyuQu/muon-finetune.
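To make the contrast between an unconstrained update and a LoRA-constrained one concrete, here is a minimal NumPy sketch. It is not the authors' code and does not reproduce their experiments. It assumes two things not stated in the abstract: that a Muon-style step can be approximated by the exact polar factor of the gradient (computed here by SVD, as a stand-in for the Newton-Schulz iteration Muon uses in practice), and that a LoRA update takes the usual form (alpha / r) * B @ A. The names `orthogonalize` and `lora_delta`, the matrix sizes, and the `alpha` value are hypothetical and chosen only for illustration.

```python
import numpy as np


def orthogonalize(g):
    """Return the semi-orthogonal polar factor U @ Vt of g.

    Assumption: exact SVD stands in for the approximate
    Newton-Schulz orthogonalization used by Muon.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def lora_delta(b, a, alpha, rank):
    """Weight change implied by LoRA factors: (alpha / rank) * B @ A."""
    return (alpha / rank) * (b @ a)


rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 32, 4  # illustrative sizes, not from the paper

# Full fine-tuning: the orthogonalized update touches every direction.
grad = rng.standard_normal((d_out, d_in))
full_update = orthogonalize(grad)

# LoRA: the weight change is confined to a rank-r subspace.
b = rng.standard_normal((d_out, rank))
a = rng.standard_normal((rank, d_in))
delta = lora_delta(b, a, alpha=8.0, rank=rank)

full_rank = np.linalg.matrix_rank(full_update)
lora_rank = np.linalg.matrix_rank(delta)
singular_values = np.linalg.svd(full_update, compute_uv=False)

print("rank of orthogonalized full update:", full_rank)
print("rank of LoRA update:", lora_rank)
print("singular values of full update all near 1:",
      bool(np.allclose(singular_values, 1.0)))
```

Under these assumptions, the orthogonalized full-matrix update has full rank with all singular values equal to 1, while the LoRA update cannot exceed rank r. This is one simple sense in which a low-rank adapter constrains the update. How the paper itself quantifies update strength is not specified in the abstract.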