
Simplifying Large-Scale Omni-Modal AI Model Training with VeOmni

TL;DR: VeOmni is a new training framework that makes it easier and more efficient to train large AI models capable of understanding and generating information across many data types (text, images, audio, video). It achieves this by decoupling model definition from parallel-processing logic and offering flexible distributed strategies such as Fully Sharded Data Parallel (FSDP), Sequence Parallelism (SP), and Expert Parallelism (EP). The framework also includes system optimizations such as dynamic batching and efficient kernels. This design enables scalable training of complex "omni-modal" models with high throughput and memory efficiency, as demonstrated by its performance on models of up to 72B parameters and context lengths of up to 192K tokens.

Recent advancements in artificial intelligence, particularly with large language models (LLMs), have led to impressive progress in understanding and generating information across various types of data, abilities collectively known as omni-modal capabilities. Models like GPT-4o are now capable of handling tasks that involve visual questions, image generation, and multimodal reasoning. However, training these sophisticated omni-modal LLMs presents significant challenges due to their complex and diverse architectures, which require highly efficient system designs for large-scale training.

Existing training frameworks often combine the model’s definition with the logic for parallel processing, which limits their scalability and increases the engineering effort needed for end-to-end omni-modal training. To address these limitations, researchers have introduced VeOmni, a new modular and efficient training framework designed to accelerate the development of omni-modal LLMs.

VeOmni’s Core Innovation: Model-Centric Distributed Recipes

VeOmni introduces a novel concept called “model-centric distributed recipes.” This approach fundamentally separates communication operations from computation, allowing for efficient 3D parallelism on omni-modal LLMs. This decoupling means that developers can define their models without needing to worry about the intricate details of how the training will be distributed across many computing units. This makes the framework highly flexible and reduces engineering overhead.
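To make the decoupling concrete, here is a minimal sketch of the principle using stock PyTorch FSDP rather than VeOmni's actual API: the model is written with no parallelism logic inside, and the distributed "recipe" is applied to it from the outside.

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# The model is plain PyTorch: no communication or sharding logic inside.
class TinyOmniModel(nn.Module):
    def __init__(self, dim: int = 1024, vocab: int = 32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.Sequential(
            *[nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
              for _ in range(4)]
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):
        return self.head(self.blocks(self.embed(tokens)))

# The distributed "recipe" is applied from the outside, after definition;
# swapping in a different strategy requires no change to the model code.
def apply_recipe(model: nn.Module) -> nn.Module:
    # Requires torch.distributed to be initialized (e.g., via torchrun).
    return FSDP(model)
```

Because the recipe lives outside the model, the same model definition can be reused unchanged across single-GPU debugging runs and large multi-node jobs.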

The framework also boasts a flexible configuration interface, making it easy to integrate new data modalities (like images, audio, or video) with minimal changes to the existing code. This plug-and-play architecture allows any combination of multimodal encoders and decoders to be attached to a foundation model, creating a truly unified and extensible system.
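As an illustration of the plug-and-play idea (the field names below are hypothetical, not VeOmni's actual configuration schema), attaching a new modality can look like a declarative config change rather than model surgery:

```python
from dataclasses import dataclass, field

# Hypothetical config: modalities are attached declaratively, so adding one
# does not require editing the foundation model's code.
@dataclass
class OmniModelConfig:
    foundation: str = "qwen2-7b"            # language backbone
    encoders: dict = field(default_factory=lambda: {
        "image": "vit-encoder",
        "audio": "whisper-encoder",
    })
    decoders: dict = field(default_factory=lambda: {
        "image": "diffusion-decoder",
    })

# Adding video understanding becomes a one-line configuration change:
cfg = OmniModelConfig()
cfg.encoders["video"] = "video-vit-encoder"
```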

Key Distributed Training Strategies

VeOmni incorporates a comprehensive suite of distributed training strategies to handle the demands of large-scale omni-modal models:

  • Fully and Hybrid Sharded Data Parallel (FSDP/HSDP): FSDP significantly reduces memory usage on each GPU by distributing model parameters, gradients, and optimizer states across all available devices. HSDP further enhances efficiency by minimizing communication overhead through a 2D device mesh, combining FSDP within groups and Distributed Data Parallel (DDP) across groups. Both are non-intrusive, meaning they don’t require changes to the model’s architecture.

  • Sequence Parallelism (SP) for Long Context Training: As omni-modal LLMs handle longer sequences (e.g., high-resolution images or videos), memory and computational costs soar. VeOmni adopts DeepSpeed Ulysses, a sequence parallelism technique that splits activations along the sequence dimension and uses efficient all-to-all communication to remain scalable for ultra-long sequences, together with an enhanced FlashAttention integration for better performance. A minimal sketch of the core resharding step appears after this list.

  • Expert Parallelism (EP) for MoE Model Scaling: Mixture-of-Experts (MoE) architectures are crucial for scaling large models efficiently by activating only a subset of parameters. VeOmni provides a user-friendly interface for expert parallelism, allowing easy sharding of experts across devices. It also includes fine-grained communication-computation overlapping techniques to mitigate the communication bottleneck often seen in MoE training.
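The sketch below illustrates the core operation behind Ulysses-style sequence parallelism, referenced in the SP bullet above: an all-to-all that reshards activations from sequence-sharded to head-sharded, so each rank can run full-sequence attention on a subset of heads. It is a simplified illustration, not VeOmni's implementation.

```python
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """Reshard attention inputs from sequence-sharded [B, S/P, H, D]
    to head-sharded [B, S, H/P, D] across the SP group of size P."""
    p = dist.get_world_size(group)
    _, _, h, _ = x.shape
    assert h % p == 0, "number of heads must be divisible by the SP degree"
    # Rank r holds sequence shard r for all heads; send head-group i to rank i.
    send = [c.contiguous() for c in x.chunk(p, dim=2)]  # p x [B, S/P, H/P, D]
    recv = [torch.empty_like(send[0]) for _ in range(p)]
    dist.all_to_all(recv, send, group=group)
    # recv[j] is sequence shard j of this rank's head group: stitch on seq dim.
    return torch.cat(recv, dim=1)                       # [B, S, H/P, D]
```

After attention, a mirror-image all-to-all restores the sequence-sharded layout for the rest of the layer.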

These strategies are designed to be fully composable, meaning they can be flexibly applied to different components of an omni-modal LLM. For instance, a vision encoder might use FSDP, while the language backbone leverages a combination of EP for MoE layers and SP for long-context processing. This fine-grained control ensures efficient and scalable training across diverse model architectures.
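A rough sketch of what such composition might look like, again using stock PyTorch primitives and hypothetical module names (model.vision_encoder, model.llm are assumptions, not VeOmni's actual layout); real expert parallelism also needs token routing and dispatch, which are omitted here:

```python
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def apply_composable_recipe(model: nn.Module) -> nn.Module:
    world = dist.get_world_size()
    # 2D mesh: FSDP sharding inside each group, expert placement across groups.
    mesh = init_device_mesh("cuda", (world // 8, 8),
                            mesh_dim_names=("ep", "fsdp"))

    # Vision encoder: plain FSDP along the fsdp mesh dimension.
    model.vision_encoder = FSDP(model.vision_encoder, device_mesh=mesh["fsdp"])

    # Language backbone MoE layers: each EP group keeps only its slice of the
    # experts (real EP also routes tokens to remote experts via all-to-all).
    ep_rank, ep_size = mesh["ep"].get_local_rank(), mesh["ep"].size()
    for layer in model.llm.layers:
        if hasattr(layer, "experts"):
            n = len(layer.experts)
            lo, hi = ep_rank * n // ep_size, (ep_rank + 1) * n // ep_size
            layer.experts = layer.experts[lo:hi]
    return model
```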

Other System Optimizations

Beyond parallelism, VeOmni integrates various system-level optimizations, all decoupled from the model’s core logic for seamless integration:

  • Dynamic Batching: To improve training efficiency, VeOmni dynamically packs samples with varying sequence lengths into batches, minimizing padding overhead and maximizing GPU utilization (a simplified packing sketch appears after this list).

  • Efficient Kernels: The framework incorporates highly optimized operator kernels (like RMSNorm, LayerNorm, FlashAttention, and MoE-specific operations) for high performance across different transformer-based architectures.

  • Memory Optimization: Techniques such as layer-wise recomputation, activation offloading, and optimizer state offloading are used to reduce memory consumption, allowing for larger batch sizes and better communication-computation overlap.

  • Efficient Distributed Checkpointing: VeOmni leverages ByteCheckpoint for efficient saving and resuming of training across different distributed configurations, even for multimodal models.

  • Meta Device Initialization: Large models can be initialized on a meta device without allocating physical memory, significantly accelerating the initialization and loading processes.
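To illustrate the dynamic batching idea from the list above, here is a simplified greedy packing routine under a token budget; it is a stand-in for VeOmni's actual batching logic, which is only described at a high level.

```python
def pack_samples(samples, max_tokens: int):
    """Greedily pack variable-length samples into batches that stay under a
    token budget, reducing padding relative to fixed-size batching."""
    batches, current, current_len = [], [], 0
    # Sorting by length keeps same-batch sequences similar, shrinking padding.
    for sample in sorted(samples, key=len, reverse=True):
        if current and current_len + len(sample) > max_tokens:
            batches.append(current)
            current, current_len = [], 0
        current.append(sample)
        current_len += len(sample)
    if current:
        batches.append(current)
    return batches

# Example: token-id sequences of very different lengths.
seqs = [list(range(n)) for n in (3000, 1200, 800, 450, 60)]
print([[len(s) for s in b] for b in pack_samples(seqs, max_tokens=4096)])
# -> [[3000], [1200, 800, 450, 60]]
```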


Experimental Validation

Experiments conducted on GPU clusters ranging from 8 to 128 GPUs demonstrated VeOmni’s superior performance and scalability. It was evaluated on diverse models, including dense models like Qwen2-VL (7B and 72B parameters) and a 30B parameter Mixture-of-Experts (MoE) omni-modal LLM based on Qwen3-MoE. The framework showed strong performance in handling long-sequence training and scaling MoE models, supporting context lengths up to 192K tokens for a 7B model and 160K tokens for a 30B MoE model, all while maintaining competitive throughput.

The convergence studies on various omni-modal LLMs (Janus, LLaMA-Omni, Qwen3-MoE-Omni) also confirmed that VeOmni enables stable and robust training for both multimodal understanding and generation tasks across text, image, video, and audio modalities.

In conclusion, VeOmni represents a significant step forward in scaling any-modality model training. By offering a model-centric, composable, and highly optimized framework, it simplifies the development and deployment of the next generation of omni-modal LLMs. For more technical details, you can refer to the research paper.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
