Mela: A Test-Time Memory Consolidation Model Based on the Transformation Hypothesis
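Inspired by neuroscientific theories of memory consolidation and by the transformation hypothesis, this work proposes the Hierarchical Memory Module (HMM). The module comprises a low-frequency and a high-frequency sub-module, which produce abstract gist-level representations and fine-grained detail representations, respectively, and it combines the two into its output through dynamic reconstruction. Integrating HMM into a Transformer decoder yields the Mela family of models, which perform online memory consolidation at test time. The work also introduces MemStack, a method that distributes multi-granularity memory features across the early layers of the decoder. Experiments show that Mela outperforms Transformer baselines at every model scale and, with the pretraining context length fixed at 4K, maintains stable performance on significantly longer contexts, whereas the baselines degrade sharply once the training length is exceeded.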
Memory consolidation, the process by which transient experiences are transformed into stable, structured representations, is a foundational organizing principle of the human brain, yet it remains largely unexplored as a design principle for modern sequence models. In this work, we draw on established neuroscientific theories of memory consolidation and cross-frequency coupling to propose the Hierarchical Memory Module (HMM), a neural memory architecture composed of two functionally distinct sub-modules that operate at different update frequencies. Inspired by the transformation hypothesis, the low-frequency sub-module produces high-level representations that capture abstract, gist-level knowledge, while the high-frequency sub-module produces fine-grained representations that preserve richer episodic detail. The final memory output is dynamically reconstructed as a context-dependent combination of both representations, analogous to the reconstructive nature of human memory retrieval. We integrate HMM into a Transformer-based language decoder to form Mela, a family of memory-augmented language models that perform online memory consolidation at test time. To further exploit the multi-granularity memory representations produced by HMM, we introduce MemStack, a method that distributes different levels of memory features across the early layers of the decoder without introducing additional tokens. Language modeling experiments show that Mela outperforms Transformer baselines at every model size tested. Moreover, with the pretraining context length fixed at 4K, Mela maintains its performance on significantly longer contexts, whereas Transformer baselines degrade rapidly beyond their training length. Extensive ablation studies validate the contribution of each component and provide guidance for practical configuration.
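As a rough illustration of the two-timescale idea described above, the following toy sketch keeps a fast "detail" state that is updated on every step and a slow "gist" state that is updated only periodically, then reads out a query-dependent mixture of the two. It is not the HMM update rule, which the abstract does not specify: the class name `ToyHierarchicalMemory`, the exponential-moving-average updates, the sigmoid gate, and the parameters `slow_period`, `fast_rate`, and `slow_rate` are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


def _dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class ToyHierarchicalMemory:
    """Toy two-timescale memory. Illustrative only, not the paper's HMM."""
    dim: int
    slow_period: int = 4    # assumption: slow state updates every 4 steps
    fast_rate: float = 0.5  # assumption: EMA rate of the fast (detail) state
    slow_rate: float = 0.1  # assumption: EMA rate of the slow (gist) state
    step: int = 0
    fast: list = field(default_factory=list)
    slow: list = field(default_factory=list)

    def __post_init__(self):
        # Both memory states start empty (all zeros).
        self.fast = [0.0] * self.dim
        self.slow = [0.0] * self.dim

    def write(self, x):
        """Consolidate one input vector into memory."""
        self.step += 1
        # High-frequency state: updated on every step.
        self.fast = [(1 - self.fast_rate) * f + self.fast_rate * xi
                     for f, xi in zip(self.fast, x)]
        # Low-frequency state: updated only every `slow_period` steps,
        # absorbing the fast state into a slowly changing summary.
        if self.step % self.slow_period == 0:
            self.slow = [(1 - self.slow_rate) * s + self.slow_rate * f
                         for s, f in zip(self.slow, self.fast)]

    def read(self, query):
        """Return a query-dependent mixture of the detail and gist states."""
        # Sigmoid gate: the more the query aligns with the fast state
        # relative to the slow state, the more detail is used.
        logit = _dot(query, self.fast) - _dot(query, self.slow)
        gate = 1.0 / (1.0 + math.exp(-logit))
        return [gate * f + (1 - gate) * s
                for f, s in zip(self.fast, self.slow)]


mem = ToyHierarchicalMemory(dim=2)
for _ in range(4):
    mem.write([1.0, 0.0])
mixed = mem.read([1.0, 0.0])
```

In this sketch the slow state stays at zero for the first three writes and changes only on the fourth, while the fast state moves on every write. In the model described above, both sub-modules and the combination are learned, so the fixed rates and the hand-written gate here stand in for components whose actual form the abstract leaves open.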