arxiv:2605.28640

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Published on May 27

Abstract

RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
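
To make the two ingredients named in the abstract concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes a fixed scalar decay for the memory (a plain exponential moving average) and a block-scoring rule in the spirit of Quest (a per-channel min/max upper bound on the query-key dot product). The function names `decaying_memory`, `select_blocks`, and `sparse_attention`, and the parameters `decay`, `block_size`, and `budget`, are hypothetical and chosen only for illustration; the actual RAT+ memory module and the exact selection rules of Quest, MoBA, and SnapKV may differ.

```python
import numpy as np

def decaying_memory(x, decay):
    """Exponential moving average over the sequence axis (assumed fixed decay).

    state_t = decay * state_{t-1} + (1 - decay) * x_t
    """
    out = np.zeros(x.shape, dtype=float)
    state = np.zeros(x.shape[1], dtype=float)
    for t in range(x.shape[0]):
        state = decay * state + (1.0 - decay) * x[t]
        out[t] = state
    return out

def select_blocks(query, keys, block_size, budget):
    """Pick the `budget` key blocks with the highest upper bound on q.k.

    For each block, the bound sums max(q_c * kmin_c, q_c * kmax_c) over channels c.
    Assumes the number of keys is a multiple of block_size.
    """
    n_blocks = keys.shape[0] // block_size
    blocks = keys.reshape(n_blocks, block_size, -1)
    k_min = blocks.min(axis=1)
    k_max = blocks.max(axis=1)
    scores = np.maximum(query * k_min, query * k_max).sum(axis=1)
    top = np.argsort(-scores)[:budget]
    return np.sort(top)

def sparse_attention(query, keys, values, block_size, budget):
    """Softmax attention restricted to the tokens of the selected blocks."""
    block_idx = select_blocks(query, keys, block_size, budget)
    token_idx = (block_idx[:, None] * block_size + np.arange(block_size)).ravel()
    k, v = keys[token_idx], values[token_idx]
    logits = k @ query / np.sqrt(query.shape[0])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v
```

With `budget` equal to the total number of blocks, `sparse_attention` reduces to dense softmax attention; smaller budgets read fewer KV entries per query, which is the trade-off the sparse budgets in the abstract refer to.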

Community

barpitf

Paper submitter about 5 hours ago

Beyond improving downstream inference methods, we can also design upstream architectures that are inherently better suited to sparse inference. This paper builds on our previous RAT (NeurIPS 2025) and RAT+ (ICML 2026), where we augment attention with an additional recurrence to support flexible dilated patterns at inference. In this paper, we further show that such an architecture boosts other forms of inference-time sparsity as well!
