arXiv:2602.06563

TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders

Published on Feb 6

AI-generated summary

TokenMixer-Large addresses scalability limitations in recommendation models through enhanced residual connections, sparse MoE mechanisms, and stable gradient propagation for deep architectures.

Abstract

While scaling laws for recommendation models have gained significant traction, existing architectures such as Wukong, HiFormer, and DHEN often struggle with sub-optimal designs and hardware under-utilization, limiting their practical scalability. Our previous TokenMixer architecture (introduced in the RankMixer paper) improved both effectiveness and efficiency by replacing self-attention with a lightweight token-mixing operator; however, it faced critical bottlenecks in deeper configurations, including sub-optimal residual paths, vanishing gradients, incomplete MoE sparsification, and constrained scalability. In this paper, we propose TokenMixer-Large, a systematically evolved architecture designed for extreme-scale recommendation. By introducing a mixing-and-reverting operation, inter-layer residuals, and an auxiliary loss, we ensure stable gradient propagation even as model depth increases. Furthermore, we incorporate a Sparse Per-token MoE to enable efficient parameter expansion. TokenMixer-Large successfully scales to 7 billion parameters on online traffic and 15 billion in offline experiments. Currently deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains: an increase of +1.66% in orders and +2.98% in per-capita preview payment GMV for e-commerce, a +2.0% improvement in ADSS for advertising, and +1.4% revenue growth for live streaming.

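To make the abstract's components concrete, below is a minimal PyTorch sketch of a token-mixing block with residual paths, a per-token sparse MoE, and a deep stack with inter-layer residual connections. This is an illustration under assumptions, not the paper's implementation: every name (TokenMixingBlock, PerTokenSparseMoE, DeepStack, n_tokens, d_model, n_experts, top_k) is hypothetical, the mixing-and-reverting operation and the auxiliary loss from the abstract are omitted, and a production system would dispatch experts with sparse kernels rather than the dense loop shown here.

```python
# Illustrative sketch only: class and parameter names are assumptions,
# not the paper's API; the real TokenMixer-Large layers may differ.
import torch
import torch.nn as nn


class PerTokenSparseMoE(nn.Module):
    """Sparse mixture-of-experts where each token routes to its top-k experts."""

    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, n_tokens, d_model)
        logits = self.router(x)                          # (B, T, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # per-token routing
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        # Dense loop over experts for clarity; real kernels dispatch sparsely.
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                            # (B, T, top_k) bool
            if mask.any():
                gate = (weights * mask).sum(dim=-1, keepdim=True)  # (B, T, 1)
                out = out + gate * expert(x)
        return out


class TokenMixingBlock(nn.Module):
    """One block: lightweight token mixing in place of self-attention,
    then a per-token sparse MoE, each with its own residual path."""

    def __init__(self, n_tokens: int, d_model: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Token mixing: a learned linear map across the token axis.
        self.token_mix = nn.Linear(n_tokens, n_tokens)
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = PerTokenSparseMoE(d_model)

    def forward(self, x):
        # Transpose so the Linear acts across tokens, then transpose back.
        h = self.token_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + h                     # intra-block residual
        x = x + self.moe(self.norm2(x))
        return x


class DeepStack(nn.Module):
    """Deep stack with inter-layer residuals: each layer also receives a
    skip from the stack input, one plausible reading of the abstract."""

    def __init__(self, depth: int, n_tokens: int, d_model: int):
        super().__init__()
        self.layers = nn.ModuleList(
            TokenMixingBlock(n_tokens, d_model) for _ in range(depth)
        )

    def forward(self, x0):
        x = x0
        for layer in self.layers:
            x = layer(x) + x0         # input skip keeps gradients flowing at depth
        return x


if __name__ == "__main__":
    model = DeepStack(depth=12, n_tokens=16, d_model=512)
    y = model(torch.randn(4, 16, 512))
    print(y.shape)  # torch.Size([4, 16, 512])
```

The skip from the stack input to every layer is one simple way to keep gradient norms stable as depth grows; the paper's exact residual topology and auxiliary-loss formulation may differ.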