Papers
arxiv:2602.24289

Mode Seeking meets Mean Seeking for Fast Long Video Generation

Published on Feb 27
ยท Submitted by
Shengqu Cai
on Mar 2
ยท nvidia NVIDIA
Authors:
,
,
,
,
,
,
,
,

Abstract

A training paradigm combining mode seeking and mean seeking in a Decoupled Diffusion Transformer enables efficient generation of high-quality long videos by leveraging both global flow matching and local distribution matching techniques.

AI-generated summary

Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence based on a unified representation via a Decoupled Diffusion Transformer. Our approach utilizes a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos that learns long-range coherence and motions from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to a frozen short-video teacher, resulting in a few-step fast long video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion and long-range consistency. Project website: https://primecai.github.io/mmm/.

Community

Paper author Paper submitter

arXivLens breakdown of this paper ๐Ÿ‘‰ https://arxivlens.com/PaperView/Details/mode-seeking-meets-mean-seeking-for-fast-long-video-generation-9984-24b950b2

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.24289 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.24289 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.24289 in a Space README.md to link it from this page.

Collections including this paper 2