提示-激活对偶性:通过注意力层干预改进激活引导
针对传统残差流引导在多轮对话中因KV缓存污染导致连贯性下降的问题,本研究提出门控裁剪注意力差值引导(GCAD)方法。该方法从系统提示对自注意力的贡献中提取引导信号,并通过令牌级门控机制施加干预,从而避免累积性污染。在角色引导实验中,GCAD在保持特质控制的同时,显著提升了长程对话的连贯性。在多轮基准测试中,它将平均连贯性漂移从-18.6改善至-1.9,并将第10轮的特质表达率从78.0%提升至93.1%。结果表明,沿模型已有的提示介导路径进行干预,能使激活引导更为可靠。
Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.