StateSMix: Online Lossless Compression via Mamba State Space Models and Sparse N-gram Context Mixing Paper • 2605.02904 • Published Apr 5 • 8
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation Paper • 2604.10098 • Published Apr 11 • 81
InCoder-32B: Code Foundation Model for Industrial Scenarios Paper • 2603.16790 • Published Mar 17 • 311
FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach Paper • 2603.13364 • Published Mar 9 • 9
The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training Paper • 2603.10444 • Published Mar 11 • 12
Mixture of Attention Heads: Selecting Attention Heads Per Token Paper • 2210.05144 • Published Oct 11, 2022 • 3
MeKi: Memory-based Expert Knowledge Injection for Efficient LLM Scaling Paper • 2602.03359 • Published Feb 3 • 10