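SubQ has been released: the first frontier LLM built on a fully sub-quadratic sparse attention (SSA) architecture, with a 12 million token context window. At 1 million tokens it is 52x faster than FlashAttention, at less than 5% of the cost of Opus. The model breaks from the traditional Transformer approach of computing attention over every token relationship and instead uses sparse attention to focus selectively on the important ones, cutting the compute for long-context processing by nearly 1,000x and significantly changing the cost curve and scaling behavior of LLMs.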
The first frontier model with a 12 million token context window just launched.
- 52x faster than FlashAttention at 1M tokens
- Less than 5% the cost of Opus
@subquadratic just announced a major breakthrough in changing the cost curve of attention in LLMs.
They introduced a frontier-scale LLM built entirely around sub-quadratic sparse attention: the model computes only the token relationships it selects as important, so very long context scales far more cheaply and quickly than with standard transformer attention.
In normal transformers, long context is painfully expensive because as context grows, the attention work grows roughly with the square of the sequence length.
A 1M-token document is not just "a long document" for a normal model; it is a massive grid of possible token relationships.
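To put a number on that grid, here is a rough back-of-the-envelope sketch. It only counts query-key pairs in one dense attention score matrix (per head, per layer) and ignores everything else in the model; the function name `attention_pairs` is made up for illustration.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs in one dense attention score grid."""
    return seq_len * seq_len


# Doubling the context quadruples the grid; 1M tokens gives 10**12 pairs.
for n in (1_000, 100_000, 1_000_000, 12_000_000):
    print(f"{n:>12,} tokens -> {attention_pairs(n):.2e} pairs")
```

Going from 1,000 to 1,000,000 tokens multiplies the context by 1,000 but the grid by 1,000,000.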
SubQ's key insight is that most of that grid is useless.
A legal contract does not need every comma to compare itself with every sentence from 400 pages ago.
A codebase does not need every variable name to attend equally to every unrelated file.
SubQ is saying: let the model find the few relationships that probably matter, then spend compute there.
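The general idea can be sketched with a toy top-k attention for a single query. This is not SubQ's actual mechanism, which the announcement does not describe; `topk_sparse_attention` and its parameters are hypothetical, and the selection step here is deliberately naive.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only the k highest-scoring keys.

    Illustrative assumption, not SubQ's method: this version still scores
    every key to choose the top k, so it is not sub-quadratic overall.
    A real system would need a cheaper way to pick candidates.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]

    # Keep only the indices of the k largest scores.
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]

    # Numerically stable softmax over the selected keys only.
    top = max(scores[i] for i in selected)
    weights = [math.exp(scores[i] - top) for i in selected]
    total = sum(weights)

    # Weighted sum of the selected value vectors.
    out = [0.0] * len(values[0])
    for w, i in zip(weights, selected):
        for d, v in enumerate(values[i]):
            out[d] += (w / total) * v
    return out, sorted(selected)


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 0.0]]
out, used = topk_sparse_attention([1.0, 0.0], keys, values, k=2)
print(used, out)
```

With `k` equal to the number of keys this reduces to ordinary dense attention for that query; the savings come from keeping `k` small and fixed as the context grows.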