EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments · AI HOT

Apple Machine Learning Research (RSS)

2026-05-19 08:00 · 27 days ago

AI summary

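Current large language models can handle very long conversations, but the KV cache grows linearly with the dialogue history, so memory usage quickly exceeds device limits. Most existing KV cache compression methods evict cache entries only after the full context has been processed, which leaves peak memory usage unbounded. In addition, query-dependent eviction narrows the cache semantics to a single query, which leads to failures in multi-turn conversations.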
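The linear growth mentioned above can be made concrete with a little arithmetic. The sketch below uses the standard size formula for a transformer KV cache (one key and one value vector per token, per layer, per KV head). The model shape in `cfg` is a hypothetical example chosen for illustration; it does not come from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for seq_len tokens."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len


# Hypothetical model shape (assumption, not taken from the paper); fp16 elements.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(n, **cfg) / 2**30
    print(f"{n:>7} tokens -> {gib:.2f} GiB")
```

Under these assumed dimensions each token costs 128 KiB of cache, so a 100,000-token history needs roughly 12 GiB for the cache alone, before counting model weights.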

Original text · not translated

EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model’s memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Additionally, query-dependent eviction narrows the cache semantics to a single query, leading to failure cases in multi-turn conversations. In this paper, we introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and performs episode-specific KV cache eviction. Across three LongConvQA benchmarks (LongMemEval, Realtalk, and LoCoMo), EpiCache improves accuracy by up to 30%, achieves near full-cache accuracy under 4-6x compression, and reduces latency and peak memory by up to 2.4x and 3.7x, respectively.
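To illustrate why evicting during prefill bounds peak memory while evicting afterwards does not, here is a toy simulation. It is a minimal sketch and not the authors' implementation: cache entries are plain strings, and `score` is a hypothetical placeholder for whatever importance signal an eviction policy uses (in EpiCache, per the abstract, eviction is episode-specific). The function names, `budget`, and `block_size` are likewise illustrative assumptions.

```python
from typing import Callable, List, Tuple


def top_k_in_order(cache: List[str], k: int, score: Callable[[str], float]) -> List[str]:
    """Keep the k highest-scoring entries, preserving their original order."""
    ranked = sorted(range(len(cache)), key=lambda i: score(cache[i]), reverse=True)
    return [cache[i] for i in sorted(ranked[:k])]


def blockwise_prefill(
    tokens: List[str], budget: int, block_size: int, score: Callable[[str], float]
) -> Tuple[List[str], int]:
    """Process the context block by block, evicting whenever the budget is exceeded."""
    cache: List[str] = []
    peak = 0
    for start in range(0, len(tokens), block_size):
        # Append one block of new entries, as a prefill step would.
        cache.extend(tokens[start:start + block_size])
        peak = max(peak, len(cache))
        if len(cache) > budget:
            cache = top_k_in_order(cache, budget, score)
    return cache, peak


def post_prefill_eviction(
    tokens: List[str], budget: int, score: Callable[[str], float]
) -> Tuple[List[str], int]:
    """Baseline: cache the whole context first, then evict once."""
    cache = list(tokens)
    peak = len(cache)  # peak grows with the full context length
    return top_k_in_order(cache, budget, score), peak


# Hypothetical history of 1,000 entries with an arbitrary deterministic score.
history = [f"t{i}" for i in range(1000)]
toy_score = lambda t: (int(t[1:]) * 37) % 101

kept_a, peak_a = blockwise_prefill(history, budget=64, block_size=32, score=toy_score)
kept_b, peak_b = post_prefill_eviction(history, budget=64, score=toy_score)
print(len(kept_a), peak_a)
print(len(kept_b), peak_b)
```

Both variants end with a cache of `budget` entries, but the block-wise variant never holds more than `budget + block_size` entries at once, whereas the baseline's peak equals the full history length. The two variants generally keep different entries, since the block-wise one decides with partial information at each step.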

Related readings and updates.

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

May 5, 2026 · Research areas: Methods and Algorithms, Speech and Natural Language Processing

Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the depth dimension offers…

CommVQ: Commutative Vector Quantization for KV Cache Compression

July 11, 2025 · Research area: Speech and Natural Language Processing · Conference: ICML

Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context lengths grow. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long context LLM inference. First, we leverage additive quantization by introducing a lightweight encoder and codebook to compress the KV…

Categories: Papers/Research, Deployment/Engineering
