MUG-V 10B: Advancing High-Efficiency Training for Large Video Generation

TLDR: MUG-V 10B is a 10-billion-parameter video generation model with an open-sourced, high-efficiency training pipeline. It optimizes data processing, model architecture, training strategy, and infrastructure, leveraging Megatron-Core for scalable training. The model achieves state-of-the-art performance, particularly in e-commerce video generation, and aims to accelerate research by releasing its full stack.

Training large-scale video generation models is a complex and resource-intensive endeavor. These models face significant hurdles, including aligning text with video content, managing long video sequences, and capturing intricate spatiotemporal relationships. Addressing these challenges, researchers from Shopee Pte. Ltd. have introduced MUG-V 10B, a new training framework designed for high efficiency and improved performance in large video generation.

A Comprehensive Approach to Video Generation

The MUG-V 10B framework optimizes four critical areas: data processing, model architecture, training strategy, and infrastructure. These optimizations lead to substantial gains in efficiency and performance across various stages, from data preprocessing and video compression to parameter scaling, curriculum-based pretraining, and alignment-focused post-training.

Scalable Data Processing

The team developed a robust data processing pipeline to curate high-quality video clips from vast datasets. This pipeline involves rigorous video-level screening for licensing, privacy, and content diversity. It then employs advanced techniques like PySceneDetect and Color-Struct SVM for accurate video splitting, ensuring semantically coherent segments. Visual quality is maintained through a four-stage filtering process, including sharpness tests, aesthetic scoring, motion amplitude checks, and a multimodal LLM filter to remove heavily post-processed or altered footage. High-fidelity captions are generated using a fine-tuned Qwen2-VL-7B model, and data balancing and deduplication are performed to control bias and eliminate redundancies. A smaller, high-quality subset is also curated with human-verified labels for post-training, focusing on aspects like motion continuity, content stability, and visual fidelity, as well as human preference annotations for error-free generation and motion quality.
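The four-stage visual quality filter described above can be pictured as a short-circuiting cascade. The following is a minimal sketch, not the team's pipeline: the `Clip` fields, the stage names, and every threshold are hypothetical placeholders chosen for illustration, and the real scores would come from the sharpness, aesthetic, motion, and multimodal LLM models the article mentions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Clip:
    clip_id: str
    sharpness: float      # hypothetical sharpness score (higher is sharper)
    aesthetic: float      # hypothetical aesthetic score on a 0-10 scale
    motion: float         # hypothetical motion amplitude
    heavily_edited: bool  # hypothetical verdict from a multimodal LLM filter


# Stage order follows the article; all thresholds are illustrative assumptions.
STAGES: List[Tuple[str, Callable[[Clip], bool]]] = [
    ("sharpness", lambda c: c.sharpness >= 100.0),
    ("aesthetic", lambda c: c.aesthetic >= 4.5),
    ("motion", lambda c: 0.5 <= c.motion <= 20.0),
    ("mllm_edit_check", lambda c: not c.heavily_edited),
]


def filter_clips(clips: List[Clip]) -> Tuple[List[Clip], Dict[str, int]]:
    """Keep clips that pass every stage; count rejections by first failed stage."""
    kept: List[Clip] = []
    rejected = {name: 0 for name, _ in STAGES}
    for clip in clips:
        for name, passes in STAGES:
            if not passes(clip):
                rejected[name] += 1
                break  # later (more expensive) stages are skipped
        else:
            kept.append(clip)
    return kept, rejected
```

Ordering the stages from cheapest to most expensive means the multimodal LLM check only runs on clips that already passed the simpler tests, which is one plausible reason for staging the filter this way.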

Innovative Model Architecture

At the core of MUG-V 10B is a 10-billion-parameter Diffusion Transformer (DiT). The model is trained for text-to-video, image-to-video, and text-plus-image-to-video synthesis, unifying the different conditioning modalities. Ahead of the DiT, a Video Variational Autoencoder (Video VAE) compresses pixel-space video frames into a compact latent representation. The Video VAE compresses by 8x8x8 across time, height, and width, a 512-fold reduction in the number of spatiotemporal positions, while maintaining high reconstruction quality. The DiT itself uses a transformer block architecture similar to autoregressive language models, with full attention for global coherence and 3D Rotary Position Embedding (RoPE) for accurate positional cues. The model also introduces an image/frame conditioning scheme in which conditioned regions receive the given image/frame latent with zero noise, which improves fidelity to the provided visual content.
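Two ideas from this section lend themselves to a small worked example: the effect of 8x8x8 compression on the latent grid, and zero-noise conditioning. The sketch below is an illustration under stated assumptions, not the released implementation. It assumes input sizes divide evenly by the compression ratio, models each latent frame as a flat list of floats, and uses a linear interpolation between clean latent and Gaussian noise as the noising rule; the function names `latent_shape` and `noise_latents` are hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple


def latent_shape(frames: int, height: int, width: int,
                 ratio: Tuple[int, int, int] = (8, 8, 8)) -> Tuple[int, int, int]:
    """Latent grid size for a given (time, height, width) compression ratio.

    Assumes every dimension divides evenly; a real VAE may pad or treat
    the first frame specially.
    """
    dims = (frames, height, width)
    if any(d % r for d, r in zip(dims, ratio)):
        raise ValueError("each dimension must be divisible by its ratio")
    rt, rh, rw = ratio
    return (frames // rt, height // rh, width // rw)


def noise_latents(clean: Sequence[Sequence[float]], sigma: float,
                  cond_frames: Set[int],
                  rng: random.Random) -> Tuple[List[List[float]], List[float]]:
    """Noise every latent frame except the conditioned ones.

    Conditioned frames keep the given latent and get noise level 0.0,
    so the model sees them as already clean. The interpolation rule
    (1 - s) * x + s * eps is an assumption made for this sketch.
    """
    noisy, sigmas = [], []
    for t, frame in enumerate(clean):
        s = 0.0 if t in cond_frames else sigma
        noisy.append([(1.0 - s) * v + s * rng.gauss(0.0, 1.0) for v in frame])
        sigmas.append(s)
    return noisy, sigmas
```

For example, `latent_shape(64, 720, 1280)` gives an 8 x 90 x 160 latent grid, and calling `noise_latents` with `cond_frames={0}` leaves the first latent frame untouched, which corresponds to image-to-video conditioning on a first frame.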

Efficient Training Strategy

MUG-V 10B uses a training strategy built for efficiency and stability. It begins with parameter expansion: a smaller 2B-parameter model is trained first and then expanded to the 10B scale using a method similar to HyperCloning, which increases channel width while preserving functional behavior. A multi-stage pre-training curriculum then guides the model’s learning: Stage 1 mixes image data with low-resolution video, Stage 2 increases clip length at 360p, and Stage 3 uses high-resolution 720p clips. Post-training consists of annealed supervised fine-tuning (SFT) with a post-hoc EMA variant and preference optimization on human-annotated data with algorithms such as KTO and DPO, targeting error-free generation and motion quality.
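To make "increases channel width while preserving functional behavior" concrete, here is a minimal sketch of a function-preserving width expansion for a single linear layer. It is a simplified illustration in the spirit of HyperCloning, not the procedure used for MUG-V 10B: it covers only one dense layer, uses plain Python lists, and the helper names `matvec` and `clone_linear` are hypothetical. The idea is to tile the weight matrix and divide by the expansion factor, so that a duplicated input produces a duplicated copy of the original output.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute y = W x + b for a dense layer stored as nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def clone_linear(W: Matrix, b: Vector, factor: int = 2) -> Tuple[Matrix, Vector]:
    """Widen a linear layer by `factor` on both input and output channels.

    The new weight is the original tiled factor x factor times and divided
    by `factor`; the bias is repeated. Feeding the input repeated `factor`
    times then yields the original output repeated `factor` times.
    """
    # Tile across input channels, scaling so the duplicated input sums back to W x.
    wide_rows = [[w / factor for w in row] * factor for row in W]
    # Tile across output channels so each output copy is identical.
    W_big = [list(row) for _ in range(factor) for row in wide_rows]
    b_big = list(b) * factor
    return W_big, b_big
```

With `W = [[1.0, 2.0], [3.0, 4.0]]`, `b = [0.5, -0.5]`, and `x = [1.0, -1.0]`, the small layer and the widened layer applied to `x + x` agree channel by channel, so training can resume from the small model's behavior rather than from scratch.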

Robust Training Infrastructure

The training framework is built on Megatron-Core, a system designed for large-scale model training. It utilizes a hybrid parallelization scheme combining data parallelism (DP), tensor parallelism (TP), pipeline parallelism (PP), and sequence parallelism (SP) to maximize throughput and manage memory consumption for long video sequences. An asynchronous I/O pipeline with aggressive pre-fetching and caching ensures efficient data ingestion, while dynamic balanced sampling minimizes pipeline stalls. Furthermore, kernel fusion techniques, including merging linear-layer bias addition, per-pixel scale-and-shift modulation, and residual accumulation into single GPU kernels, significantly reduce memory overhead and increase arithmetic intensity, leading to end-to-end speed-ups.
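The kernel fusion described above can be illustrated by comparing three separate passes with one combined pass. This sketch only models the arithmetic in plain Python and does not run on a GPU. The exact modulation formula, `h * (1 + scale) + shift`, is an assumption borrowed from common DiT-style conventions, and the function names are hypothetical. The point is that the unfused version materializes two intermediate buffers, while the fused version reads each input once and writes the output once.

```python
from typing import List

Vector = List[float]


def unfused(x: Vector, bias: Vector, scale: Vector,
            shift: Vector, residual: Vector) -> Vector:
    """Three passes over memory, each producing a full intermediate buffer."""
    h = [xi + bi for xi, bi in zip(x, bias)]                       # pass 1: bias add
    h = [hi * (1.0 + s) + t for hi, s, t in zip(h, scale, shift)]  # pass 2: scale and shift
    return [r + hi for r, hi in zip(residual, h)]                  # pass 3: residual add


def fused(x: Vector, bias: Vector, scale: Vector,
          shift: Vector, residual: Vector) -> Vector:
    """One pass: the same arithmetic per element, with no intermediate buffers."""
    return [
        r + ((xi + bi) * (1.0 + s) + t)
        for xi, bi, s, t, r in zip(x, bias, scale, shift, residual)
    ]
```

Both functions return the same values; on a GPU, the single-pass form avoids writing and re-reading the intermediates, which is where the memory savings and higher arithmetic intensity come from.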

Performance and Real-World Applications

MUG-V 10B performs strongly in quantitative evaluations under the VBench protocol, ranking competitively among state-of-the-art models, especially on text-plus-image-to-video (TI2V) tasks. In human evaluations tailored to e-commerce video generation, MUG-V 10B surpasses leading open-source baselines in both pass rate and high-quality rate, indicating that it can generate realistic and consistent product videos. The researchers acknowledge that challenges remain in fine-grained appearance fidelity and in scaling to longer durations and higher resolutions.

Open-Sourcing for Community Advancement

The MUG-V 10B team has open-sourced the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. The team describes this as the first public release of large-scale video generation training code that leverages Megatron-Core for high training efficiency and near-linear multi-node scaling. The release aims to accelerate research and lower the barrier for practitioners exploring scalable visual world modeling. More details are available in the research paper.