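Lighthouse Attention is a subquadratic attention wrapper for speeding up long-context pretraining. At its core, a gradient-free hierarchical selection layer symmetrically compresses queries, keys, and values during training, wrapping standard SDPA attention while preserving causality. The key advantage is that the wrapper can be removed entirely in a short recovery phase late in training, so the deployed model still uses the original attention mechanism with no added inference overhead. Preliminary experiments suggest it shortens total training time and lowers final loss. Unlike most approaches, which change the architecture or sacrifice quality, this method is a purely training-time optimization that avoids both problems. If it scales, it could become an important speedup tool for long-context pretraining.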
Cool idea from Nous Research.
What if you could speed up long-context pretraining with a subquadratic wrapper that you remove before deployment?
That is the idea behind Lighthouse Attention.
The method wraps ordinary SDPA with a hierarchical, gradient-free selection layer that compresses and decompresses queries, keys, and values symmetrically, preserving left-to-right causality.
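To make the compress, attend, decompress pattern concrete, here is a toy sketch in plain Python. It is not the paper's algorithm: the selector (`select_positions`, a per-position key-norm threshold), the `min_score` parameter, and the zero-fill for unselected positions are all assumptions made for illustration, and the hierarchical part of the selection is left out entirely. The only things it tries to show are that queries, keys, and values are gathered with the same indices, and that keeping the selected positions in their original order preserves left-to-right causality.

```python
import math

def causal_attention(q, k, v):
    """Plain causal softmax attention over lists of vectors (a stand-in for SDPA)."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Position i may only attend to positions 0..i.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)                      # subtract the max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][c] for j in range(i + 1)) / z
                    for c in range(len(v[0]))])
    return out

def select_positions(k, min_score):
    """Hypothetical gradient-free selector (an assumption, not the paper's rule).

    Keeps position j when the squared norm of k[j] is at least min_score.
    The decision for j depends only on k[j], so it never looks at later tokens.
    Indices are returned in ascending (original) order.
    """
    return [j for j, kj in enumerate(k) if sum(x * x for x in kj) >= min_score]

def wrapped_attention(q, k, v, min_score):
    """Compress q, k, v with the same indices, attend, then scatter back."""
    idx = select_positions(k, min_score)
    if not idx:
        return [[0.0] * len(v[0]) for _ in q]
    # Symmetric compression: the same index set is applied to q, k, and v.
    qs = [q[i] for i in idx]
    ks = [k[i] for i in idx]
    vs = [v[i] for i in idx]
    ys = causal_attention(qs, ks, vs)
    # Decompression: write results back to their original positions.
    # Unselected positions are zero-filled in this toy version.
    out = [[0.0] * len(v[0]) for _ in q]
    for pos, y in zip(idx, ys):
        out[pos] = y
    return out
```

With `min_score=0.0` every position is kept and `wrapped_attention` reduces to `causal_attention`, which is the sense in which the wrapper sits around ordinary attention rather than replacing it.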
Crucially, the wrapper can be removed near the end of training in a short recovery phase, so the deployed model still runs vanilla attention with no architectural cost at inference.
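The wrap-then-remove schedule can be pictured as a simple switch on the training step. The sketch below is only an illustration: `use_wrapper`, `total_steps`, and `recovery_steps` are hypothetical names, and the step counts are made-up numbers, not values from the paper.

```python
def use_wrapper(step, total_steps, recovery_steps):
    """Return True while training should run the wrapped (subquadratic) attention.

    The last `recovery_steps` steps run plain full attention so the model
    can adapt to the attention it will use at deployment.
    """
    if not 0 <= recovery_steps <= total_steps:
        raise ValueError("recovery_steps must be between 0 and total_steps")
    return step < total_steps - recovery_steps

# Hypothetical numbers, chosen only for illustration.
total_steps = 10_000
recovery_steps = 500

modes = ["wrapped" if use_wrapper(s, total_steps, recovery_steps) else "full"
         for s in (0, 9_499, 9_500, 9_999)]
print(modes)  # ['wrapped', 'wrapped', 'full', 'full']
```

Because the switch lives in the training loop rather than in the model definition, the checkpoint produced at the end contains nothing but a standard attention model.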
Preliminary LLM experiments report faster total training time and lower final loss than full-attention baselines.
Why does it matter?
Most efficient-attention work either changes the deployment-time architecture or pays a quality tax. A training-only wrapper that can be cleanly removed in a recovery phase sidesteps both. If it scales, this could become an important training-time speedup for long-context pretraining.
Paper: https://arxiv.org/abs/2605.06554
Learn to build effective AI agents in our academy: https://academy.dair.ai/