
Simplifying Large-Scale Omni-Modal AI Model Training with VeOmni

TL;DR: VeOmni is a new training framework that makes it easier and more efficient to train large AI models capable of understanding and generating information across many data types (text, images, audio, video). It achieves this by decoupling model definition from parallel-processing logic and offering flexible distributed strategies such as Fully Sharded Data Parallel (FSDP), Sequence Parallelism (SP), and Expert Parallelism (EP). The framework also includes system optimizations such as dynamic batching and efficient kernels. This design enables scalable training of complex "omni-modal" models with high throughput and memory efficiency, as demonstrated by its performance on models of up to 72B parameters and context lengths of up to 192K tokens.

Recent advancements in artificial intelligence, particularly with large language models (LLMs), have led to impressive progress in understanding and generating information across various types of data, abilities collectively known as omni-modal capabilities. Models like GPT-4o are now capable of handling tasks that involve visual questions, image generation, and multimodal reasoning. However, training these sophisticated omni-modal LLMs presents significant challenges due to their complex and diverse architectures, which require highly efficient system designs for large-scale training.

Existing training frameworks often combine the model’s definition with the logic for parallel processing, which limits their scalability and increases the engineering effort needed for end-to-end omni-modal training. To address these limitations, researchers have introduced VeOmni, a new modular and efficient training framework designed to accelerate the development of omni-modal LLMs.

VeOmni’s Core Innovation: Model-Centric Distributed Recipes

VeOmni introduces a novel concept called “model-centric distributed recipes.” This approach fundamentally separates communication operations from computation, allowing for efficient 3D parallelism on omni-modal LLMs. This decoupling means that developers can define their models without needing to worry about the intricate details of how the training will be distributed across many computing units. This makes the framework highly flexible and reduces engineering overhead.
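To make the decoupling concrete, here is a minimal sketch of the principle using stock PyTorch FSDP rather than VeOmni's actual API: the model is written with no parallelism logic inside, and the distributed "recipe" is applied to it from the outside.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# The model is plain PyTorch: no communication or sharding logic inside.
class TinyOmniModel(nn.Module):
    def __init__(self, dim: int = 1024, vocab: int = 32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.Sequential(
            *[nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
              for _ in range(4)]
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        return self.head(self.blocks(self.embed(tokens)))

# The distributed "recipe" is applied from the outside, after definition;
# swapping in a different strategy requires no change to the model code.
def apply_recipe(model: nn.Module) -> nn.Module:
    # Requires torch.distributed to be initialized (e.g., via torchrun).
    return FSDP(model)
```

Because the recipe lives outside the model, the same model definition can be reused unchanged across single-GPU debugging runs and large multi-node jobs.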

The framework also boasts a flexible configuration interface, making it easy to integrate new data modalities (like images, audio, or video) with minimal changes to the existing code. This plug-and-play architecture allows any combination of multimodal encoders and decoders to be attached to a foundation model, creating a truly unified and extensible system.
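As an illustration of the plug-and-play idea (the field names below are hypothetical, not VeOmni's actual configuration schema), attaching a new modality can look like a declarative config change rather than model surgery:

```python
from dataclasses import dataclass, field

# Hypothetical config: modalities are attached declaratively, so adding one
# does not require editing the foundation model's code.
@dataclass
class OmniModelConfig:
    foundation: str = "qwen2-7b"            # language backbone
    encoders: dict = field(default_factory=lambda: {
        "image": "vit-encoder",
        "audio": "whisper-encoder",
    })
    decoders: dict = field(default_factory=lambda: {
        "image": "diffusion-decoder",
    })

# Adding video understanding becomes a one-line configuration change:
cfg = OmniModelConfig()
cfg.encoders["video"] = "video-vit-encoder"
```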

Key Distributed Training Strategies

VeOmni incorporates a comprehensive suite of distributed training strategies to handle the demands of large-scale omni-modal models:

  • Fully and Hybrid Sharded Data Parallel (FSDP/HSDP): FSDP significantly reduces memory usage on each GPU by distributing model parameters, gradients, and optimizer states across all available devices. HSDP further enhances efficiency by minimizing communication overhead through a 2D device mesh, combining FSDP within groups and Distributed Data Parallel (DDP) across groups. Both are non-intrusive, meaning they don’t require changes to the model’s architecture.

  • Sequence Parallelism (SP) for Long Context Training: As omni-modal LLMs handle longer sequences (e.g., high-resolution images or videos), memory and computational costs soar. VeOmni adopts DeepSpeed Ulysses, a sequence parallelism technique that splits activations along the sequence dimension and uses efficient all-to-all communication to remain scalable for ultra-long sequences, together with an enhanced FlashAttention integration for better performance. A minimal sketch of the core resharding step appears after this list.

  • Expert Parallelism (EP) for MoE Model Scaling: Mixture-of-Experts (MoE) architectures are crucial for scaling large models efficiently by activating only a subset of parameters. VeOmni provides a user-friendly interface for expert parallelism, allowing easy sharding of experts across devices. It also includes fine-grained communication-computation overlapping techniques to mitigate the communication bottleneck often seen in MoE training.
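The sketch below illustrates the core operation behind Ulysses-style sequence parallelism, referenced in the SP bullet above: an all-to-all that reshards activations from sequence-sharded to head-sharded, so each rank can run full-sequence attention on a subset of heads. It is a simplified illustration, not VeOmni's implementation.

```python
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """Reshard attention inputs from sequence-sharded [B, S/P, H, D]
    to head-sharded [B, S, H/P, D] across the SP group of size P."""
    p = dist.get_world_size(group)
    _, _, h, _ = x.shape
    assert h % p == 0, "number of heads must be divisible by the SP degree"
    # Rank r holds sequence shard r for all heads; send head-group i to rank i.
    send = [c.contiguous() for c in x.chunk(p, dim=2)]  # p x [B, S/P, H/P, D]
    recv = [torch.empty_like(send[0]) for _ in range(p)]
    dist.all_to_all(recv, send, group=group)
    # recv[j] is sequence shard j of this rank's head group: stitch on seq dim.
    return torch.cat(recv, dim=1)                       # [B, S, H/P, D]
```

After attention, a mirror-image all-to-all restores the sequence-sharded layout for the rest of the layer.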

These strategies are designed to be fully composable, meaning they can be flexibly applied to different components of an omni-modal LLM. For instance, a vision encoder might use FSDP, while the language backbone leverages a combination of EP for MoE layers and SP for long-context processing. This fine-grained control ensures efficient and scalable training across diverse model architectures.
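A rough sketch of what such composition might look like, again using stock PyTorch primitives and hypothetical module names (model.vision_encoder, model.llm are assumptions, not VeOmni's actual layout); real expert parallelism also needs token routing and dispatch, which are omitted here:

```python
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def apply_composable_recipe(model: nn.Module) -> nn.Module:
    world = dist.get_world_size()
    # 2D mesh: FSDP sharding inside each group, expert placement across groups.
    mesh = init_device_mesh("cuda", (world // 8, 8),
                            mesh_dim_names=("ep", "fsdp"))

    # Vision encoder: plain FSDP along the fsdp mesh dimension.
    model.vision_encoder = FSDP(model.vision_encoder, device_mesh=mesh["fsdp"])

    # Language backbone MoE layers: each EP group keeps only its slice of the
    # experts (real EP also routes tokens to remote experts via all-to-all).
    ep_rank, ep_size = mesh["ep"].get_local_rank(), mesh["ep"].size()
    for layer in model.llm.layers:
        if hasattr(layer, "experts"):
            n = len(layer.experts)
            lo, hi = ep_rank * n // ep_size, (ep_rank + 1) * n // ep_size
            layer.experts = layer.experts[lo:hi]
    return model
```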

Other System Optimizations

Beyond parallelism, VeOmni integrates various system-level optimizations, all decoupled from the model’s core logic for seamless integration:

  • Dynamic Batching: To improve training efficiency, VeOmni dynamically packs samples with varying sequence lengths into batches, minimizing padding overhead and maximizing GPU utilization (a simplified packing sketch appears after this list).

  • Efficient Kernels: The framework incorporates highly optimized operator kernels (like RMSNorm, LayerNorm, FlashAttention, and MoE-specific operations) for high performance across different transformer-based architectures.

  • Memory Optimization: Techniques such as layer-wise recomputation, activation offloading, and optimizer state offloading are used to reduce memory consumption, allowing for larger batch sizes and better communication-computation overlap.

  • Efficient Distributed Checkpointing: VeOmni leverages ByteCheckpoint for efficient saving and resuming of training across different distributed configurations, even for multimodal models.

  • Meta Device Initialization: Large models can be initialized on a meta device without allocating physical memory, significantly accelerating the initialization and loading processes.
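To illustrate the dynamic batching idea from the list above, here is a simplified greedy packing routine under a token budget; it is a stand-in for VeOmni's actual batching logic, which is only described at a high level.

```python
def pack_samples(samples, max_tokens: int):
    """Greedily pack variable-length samples into batches that stay under a
    token budget, reducing padding relative to fixed-size batching."""
    batches, current, current_len = [], [], 0
    # Sorting by length keeps same-batch sequences similar, shrinking padding.
    for sample in sorted(samples, key=len, reverse=True):
        if current and current_len + len(sample) > max_tokens:
            batches.append(current)
            current, current_len = [], 0
        current.append(sample)
        current_len += len(sample)
    if current:
        batches.append(current)
    return batches

# Example: token-id sequences of very different lengths.
seqs = [list(range(n)) for n in (3000, 1200, 800, 450, 60)]
print([[len(s) for s in b] for b in pack_samples(seqs, max_tokens=4096)])
# -> [[3000], [1200, 800, 450, 60]]
```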


Experimental Validation

Experiments conducted on GPU clusters ranging from 8 to 128 GPUs demonstrated VeOmni’s superior performance and scalability. It was evaluated on diverse models, including dense models like Qwen2-VL (7B and 72B parameters) and a 30B parameter Mixture-of-Experts (MoE) omni-modal LLM based on Qwen3-MoE. The framework showed strong performance in handling long-sequence training and scaling MoE models, supporting context lengths up to 192K tokens for a 7B model and 160K tokens for a 30B MoE model, all while maintaining competitive throughput.

The convergence studies on various omni-modal LLMs (Janus, LLaMA-Omni, Qwen3-MoE-Omni) also confirmed that VeOmni enables stable and robust training for both multimodal understanding and generation tasks across text, image, video, and audio modalities.

In conclusion, VeOmni represents a significant step forward in scaling any-modality model training. By offering a model-centric, composable, and highly optimized framework, it simplifies the development and deployment of the next generation of omni-modal LLMs. For more technical details, you can refer to the research paper.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
