arxiv:2605.28640

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Published on May 27 · Submitted by XiuyingWei on Jun 8

Abstract

The RAT+ memory module improves the accuracy of query-aware sparse inference methods in long-context language models across a range of sparse budgets.

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
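
As a rough illustration of the two ingredients named in the abstract, the sketch below shows a generic exponentially decaying memory and a generic query-aware block selection under a fixed budget. It is a toy under stated assumptions, not the RAT+ formulation and not the exact selection rule of Quest, MoBA, or SnapKV: the update `h_t = decay * h_{t-1} + (1 - decay) * x_t`, the mean-pooled block score, and the names `decayed_memory`, `topk_block_attention`, `block_size`, and `budget` are all hypothetical choices made for illustration. How the paper combines the memory with sparse attention is not described on this page, so the two functions are kept separate.

```python
import numpy as np


def decayed_memory(x, decay=0.9):
    """Running summary h_t = decay * h_{t-1} + (1 - decay) * x_t.

    x has shape (seq_len, dim). The decay form is an illustrative
    assumption, not the recurrence used by RAT+.
    """
    h = np.zeros(x.shape, dtype=float)
    state = np.zeros(x.shape[1], dtype=float)
    for t in range(x.shape[0]):
        state = decay * state + (1.0 - decay) * x[t]
        h[t] = state
    return h


def topk_block_attention(q, k, v, block_size=4, budget=2):
    """Attend only to the `budget` KV blocks that score highest for q.

    Blocks are scored by the dot product of q with the mean key of the
    block (an assumed, simplified query-aware criterion).
    Returns the attention output and the indices of the kept tokens.
    """
    n, d = k.shape
    n_blocks = (n + block_size - 1) // block_size
    block_scores = np.array([
        q @ k[b * block_size:(b + 1) * block_size].mean(axis=0)
        for b in range(n_blocks)
    ])
    # Keep the highest-scoring blocks, restored to positional order.
    keep = np.sort(np.argsort(block_scores)[-budget:])
    idx = np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n))
        for b in keep
    ])
    # Standard softmax attention restricted to the kept tokens.
    logits = k[idx] @ q / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v[idx], idx
```

Here `budget` plays the role of the sparse budget: a smaller value reads fewer KV blocks per query, trading accuracy for lower cache traffic.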

Community

Paper submitter

Rather than focusing solely on improving downstream inference methods, we can also design upstream architectures that are inherently better suited to sparse inference. This paper builds on our previous work, RAT (NeurIPS 2025) and RAT+ (ICML 2026), where we augment attention with an additional recurrence to support flexible dilated patterns at inference. Here, we further show that such an architecture also benefits other inference-time sparsity methods!

Get this paper in your agent:

hf papers read 2605.28640
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
