Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

2026-05-11 08:00 · 36 days ago

AI Summary

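To address the coherence degradation that conventional residual-stream steering suffers in multi-turn dialogue because of KV-cache contamination, this work proposes Gated Cropped Attention-Delta steering (GCAD). The method extracts steering signals from the system prompt's contribution to self-attention and applies them through a token-level gating mechanism, which avoids cumulative contamination. In persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, it improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. The results suggest that intervening along the prompt-mediated pathways a model already uses makes activation steering more reliable.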

Original abstract · untranslated

Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
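The failure mode named in the abstract, steered token states being stored in the KV cache and reused, can be pictured with a toy sketch. This is not the paper's GCAD method or its code: it models only standard residual-stream steering in a single hypothetical layer with an identity key projection, and every name in it (`steer`, `run_dialogue`, `cached_steering_mass`, `alpha`) is an assumption made for illustration.

```python
# Toy illustration only. One hypothetical "layer" whose key projection is the
# identity, so the cached key equals the (possibly steered) token state.
# None of these names or values come from the paper.

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Standard residual-stream steering: h' = h + alpha * direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def run_dialogue(num_tokens, direction, alpha):
    """Process num_tokens tokens and return the cache of stored keys."""
    kv_cache = []
    for _ in range(num_tokens):
        hidden = [0.0, 1.0]  # toy token state, orthogonal to the direction
        hidden = steer(hidden, direction, alpha)
        kv_cache.append(hidden)  # the steered state is what gets cached
    return kv_cache

def cached_steering_mass(kv_cache, direction):
    """Total projection of all cached keys onto the steering direction."""
    return sum(dot(key, direction) for key in kv_cache)

direction = [1.0, 0.0]
clean = run_dialogue(10, direction, alpha=0.0)
steered = run_dialogue(10, direction, alpha=2.0)

print(cached_steering_mass(clean, direction))    # 0.0
print(cached_steering_mass(steered, direction))  # 20.0
```

In this toy setting the steering component held in the cache grows linearly with the number of processed tokens, which is one simple way to see how a per-token perturbation can become cumulative over a long dialogue. How GCAD computes its attention-level signal and its token-level gate is not specified in the abstract, so neither is modeled here.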

Safety/Alignment · Paper/Research

HuggingFace Daily Papers (community trending papers)

Read the original: arxiv.org