Orthogonal Gradient Projection for Mitigating the Safety Alignment Tax

2026-05-12 08:00 · 35 days ago

AI Summary

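Safety post-training of large language models can weaken their general capabilities, producing an "alignment tax." This work treats the issue as a continual learning problem: gradients from safety training may interfere with the directions that support previously learned general capabilities. To address this, we propose Orthogonal Gradient Projection for Safety Alignment (OGPSA). The method estimates a reference subspace from gradients on a small amount of general-capability data and removes the component of each safety gradient that lies in that subspace, so that safety improves while general capability is preserved. In experiments across SFT, DPO, and SFT→DPO pipelines, the method improves the safety-utility trade-off; for example, under the sequential SFT→DPO pipeline, the average performance gain on Qwen2.5-7B-Instruct rises from 33.98% to 42.74%.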

Original text · untranslated

Safety post-training can reduce the harmfulness and improve the policy compliance of Large Language Models (LLMs), but it may also reduce general utility, a phenomenon often described as the alignment tax. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA improves the observed safety-utility trade-off over standard baselines. Under the sequential SFT→DPO pipeline, the average performance gain increases from 33.98% to 42.74% on Qwen2.5-7B-Instruct and from 19.74% to 32.98% on Llama3.1-8B-Instruct. Our code is open source at https://github.com/SunGL001/OGPSA.
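The projection step described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the function names (`orthonormal_basis`, `project_out`), the toy vectors, and the use of Gram-Schmidt as a stand-in for whatever low-rank subspace estimator the paper actually uses are all assumptions made for illustration. Gradients are treated as flat lists of floats.

```python
# Illustrative sketch only (hypothetical names, toy data); not the OGPSA codebase.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(ref_grads, tol=1e-10):
    """Build an orthonormal basis for the span of the reference gradients.

    Assumption: plain Gram-Schmidt stands in for the paper's low-rank
    subspace estimate. Vectors that are (near) linearly dependent on the
    basis built so far are dropped.
    """
    basis = []
    for g in ref_grads:
        r = list(g)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:
            basis.append([ri / norm for ri in r])
    return basis

def project_out(safety_grad, basis):
    """Remove from safety_grad its component inside span(basis)."""
    out = list(safety_grad)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out

# Toy example: two general-capability gradients and one safety gradient.
ref_grads = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
safety_grad = [0.5, -2.0, 3.0]

basis = orthonormal_basis(ref_grads)
projected = project_out(safety_grad, basis)

# The projected update has zero inner product with every reference gradient,
# so a small step along it leaves the reference losses unchanged to first order.
for g in ref_grads:
    print(abs(dot(projected, g)) < 1e-9)
```

In this toy case the reference gradients span the first two coordinates, so only the third coordinate of the safety gradient survives the projection. In a real pipeline the vectors would be flattened parameter gradients, and the reference basis would be refreshed periodically, as the abstract notes.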

Safety/Alignment · Data/Training · Papers/Research

HuggingFace Daily Papers (trending community papers)
