一篇论文系统研究了Transformer注意力中QKV投影的必要性,发现Key和Value可共享同一投影(Q-K=V变体),仅增加3.1%的困惑度,便将KV cache削减50%,大幅降低推理内存。最佳变体保留Query独立,使注意力保持方向性。与GQA和MQA结合时,可分别实现87.5%和96.9%的cache缩减。弱变体Q=K-V因导致因果注意力过于对称且无cache节省而无效。
Interesting, this paper shows that Transformers may not need separate key and value projections to work well.
This paper's design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed close.
A normal attention layer makes Query to ask what each token needs, Key to label what each token offers, and Value to carry the information sent back.
Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved.
The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like.
When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts, because it reduces projection storage while those methods reduce stored heads.
The weak variant is Q=K-V, because tying Query and Key makes attention too symmetric for causal language, and it gives no KV-cache savings.
----
Link - arxiv. org/abs/2606.04032v2
Title: "Do Transformers Need Three Projections? Systematic Study of QKV Variants"