Description

This repository hosts the models trained for the RAT+ paper. For the datasets, we release the tokenized 100BT version. The 200BT version was tokenized from the raw fineweb_edu_350B dataset; readers can reproduce it with our tokenization code.

Note that the models under the pretrain/ directory should not be evaluated directly. They must first undergo the resolution adaptation phase with the corresponding inference dilation size, local size, and initial size. Because many configurations are possible, we provide only one adapted example, D=16 and W=256, under the adapt/ directory, and leave adaptation to other configurations to the reader. Resolution adaptation is fast and stable: 1B tokens are sufficient for all pretrained models. We used a simple optimization scheme and found it to work well, and other optimization hyperparameters also worked well in our experiments.
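To fetch only the adapted checkpoint rather than the whole repository, a minimal sketch along the following lines may help. It assumes the `huggingface_hub` package is installed and that `barpitf/ratplus` is a regular model repository whose top-level directories are `pretrain/` and `adapt/`. The helper names (`patterns_for`, `is_selected`, `download`) are hypothetical and not part of the released code, and the file layout inside each directory is not assumed.

```python
from fnmatch import fnmatch

# Repository name as shown on the model page.
REPO_ID = "barpitf/ratplus"

# Top-level directories mentioned in the description above.
STAGES = ("pretrain", "adapt")


def patterns_for(stage: str) -> list[str]:
    """Return glob patterns that select one top-level directory of the repo."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    return [f"{stage}/*"]


def is_selected(path: str, patterns: list[str]) -> bool:
    """Check a repo-relative path against the patterns (fnmatch-style globbing)."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def download(stage: str = "adapt") -> str:
    """Download one directory of the repo and return the local snapshot path.

    The import is kept local so the helpers above work without huggingface_hub.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, allow_patterns=patterns_for(stage))


if __name__ == "__main__":
    # Only adapt/ is meant for direct evaluation; pretrain/ checkpoints
    # still need resolution adaptation first.
    print(download("adapt"))
```

Calling `download("pretrain")` instead would fetch the pretrained checkpoints as a starting point for adapting to a configuration other than the provided D=16, W=256 example.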

Citation

If you find this work useful, please consider citing the paper:

@article{wei2026rat+,
  title={RAT+: Train Dense, Infer Sparse--Recurrence Augmented Attention for Dilated Inference},
  author={Wei, Xiuying and Gulcehre, Caglar},
  journal={arXiv preprint arXiv:2602.18196},
  year={2026}
}

