
ToMA: Practical Speedups for High-Fidelity Image Generation

TLDR: ToMA (Token Merge with Attention) is a new method designed to make diffusion models like SDXL and Flux faster and more efficient for high-fidelity image generation. It addresses the scalability issues of transformers by intelligently merging redundant tokens using a submodular optimization approach and GPU-friendly attention-like operations. Unlike previous methods that suffered from GPU-inefficient processes, ToMA achieves significant practical speedups (up to 24% for SDXL and 23% for Flux) with minimal impact on image quality, by also exploiting latent space locality and reusing merge patterns.

Diffusion models have become a cornerstone of high-fidelity image generation, producing stunning, realistic visuals. Their reliance on transformer architectures, however, particularly in backbones like U-ViT and DiT, introduces a significant challenge: self-attention, the mechanism these models depend on, has computational complexity that scales quadratically with the number of tokens. As images grow larger or more detailed, processing time and resource demands rise dramatically, creating a latency bottleneck in the denoising steps.
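To see why this bottleneck matters, here is a back-of-the-envelope sketch (not from the paper): a square latent grid of side n yields n × n tokens, and self-attention scores every token pair, so the cost grows with the fourth power of the grid side.

```python
# Illustrative only: count the token pairs self-attention must score
# for a square latent grid of side `latent_side`.
def attention_pairs(latent_side: int) -> int:
    tokens = latent_side * latent_side   # one token per latent position
    return tokens * tokens               # attention compares every pair

# Doubling the latent resolution multiplies the pair count by 16.
ratio = attention_pairs(128) // attention_pairs(64)
```

This is why reducing the token count, rather than only optimizing the attention kernel, is such an attractive lever.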

Previous attempts to address this scalability issue have fallen into two main categories: optimizing attention mechanisms (like FlashAttention) and reducing the number of tokens. While token reduction has been explored for discriminative tasks, generative models like diffusion models have stricter requirements. Tokens must be restored to their original count after merging (an ‘unmerging’ step) to maintain spatial consistency for the iterative refinement process. Earlier methods such as ToMeSD and ToFu tried to adapt token reduction for generative tasks, but they often relied on GPU-inefficient operations like sorting and scattered memory writes. These inefficiencies could negate any theoretical speedups, especially when paired with highly optimized attention implementations, causing the merging overhead to dominate computation time.

To bridge this gap between theoretical efficiency and practical speed, researchers have introduced Token Merge with Attention (ToMA). This off-the-shelf method rethinks token reduction for GPU-aligned efficiency, offering a robust way to accelerate diffusion models without compromising image quality. ToMA is a training-free framework built on three key contributions.

ToMA’s Core Innovations

Firstly, ToMA reformulates the token merge problem as a submodular optimization task. This mathematical approach allows for the selection of diverse and representative tokens, ensuring that critical information is preserved even when the token count is reduced. The greedy algorithm, a well-known method for submodular maximization, is employed here, guaranteeing a near-optimal subset selection with minimal information loss. This process is also optimized for GPU execution, making it highly efficient.
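The flavor of this step can be sketched with a classic greedy algorithm for a facility-location objective, a standard submodular function that rewards picking tokens every other token is similar to. The function names and the cosine-similarity choice below are illustrative assumptions, not ToMA's actual implementation, which runs this selection in GPU-friendly batched form.

```python
import math

def cosine(u, v):
    # Cosine similarity between two token vectors (hypothetical choice).
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)) or 1.0
    return num / den

def greedy_submodular_select(tokens, k):
    """Greedily pick k diverse, representative tokens by maximizing a
    facility-location objective: the sum, over all tokens, of each token's
    best similarity to the selected set. Greedy selection carries the
    classic (1 - 1/e) near-optimality guarantee for submodular functions."""
    n = len(tokens)
    best = [0.0] * n          # best similarity of each token to the selected set
    selected = []
    for _ in range(k):
        gains = []
        for j in range(n):
            if j in selected:
                gains.append(-1.0)   # never re-pick a selected token
                continue
            # Marginal gain: how much adding token j improves coverage.
            gains.append(sum(max(0.0, cosine(tokens[i], tokens[j]) - best[i])
                             for i in range(n)))
        j_star = max(range(n), key=gains.__getitem__)
        selected.append(j_star)
        best = [max(best[i], cosine(tokens[i], tokens[j_star])) for i in range(n)]
    return selected
```

On a toy set with two clusters of near-duplicate tokens, the greedy pass picks one representative from each cluster, which is exactly the "diverse and representative" behavior the merge step needs.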

Secondly, ToMA implements the merge and unmerge operations as attention-like linear transformations. Instead of GPU-inefficient sorting or scattered writes, ToMA uses GPU-friendly matrix operations. The destination tokens act as queries, and all input tokens serve as keys and values, with Scaled Dot-Product Attention (SDPA) generating similarity scores that guide the merging process. The unmerge step, which restores the tokens to their original resolution, is efficiently performed using a transpose-based approximation of the pseudo-inverse, which is both computationally inexpensive and empirically effective.
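A minimal sketch of this idea, assuming a softmax-normalized similarity matrix as the merge map (the function names, the temperature, and the plain-Python matrix helpers are illustrative, not ToMA's kernels): merging is one matrix product, and unmerging reuses the same matrix transposed.

```python
import math

def matmul(A, B):
    # Naive dense matrix product; stands in for a batched GPU matmul.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def merge_unmerge(X, dst_idx, scale=10.0):
    """Merge n tokens into k destination tokens with an attention-like map,
    then restore n tokens via the transpose of the merge matrix, a cheap
    stand-in for the pseudo-inverse in the unmerge step."""
    D = [X[i] for i in dst_idx]                 # destination tokens act as queries
    S = matmul(D, transpose(X))                 # dot-product similarity scores (k x n)
    M = []
    for row in S:
        m = max(v * scale for v in row)
        exps = [math.exp(v * scale - m) for v in row]
        z = sum(exps)
        M.append([e / z for e in exps])         # row-stochastic merge matrix
    merged = matmul(M, X)                       # k merged tokens (one matmul)
    restored = matmul(transpose(M), merged)     # transpose-based unmerge (n tokens)
    return merged, restored
```

Because both directions are plain matrix products, they map onto the same highly optimized GPU kernels as attention itself, avoiding the sorts and scattered writes that slowed earlier methods.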

Thirdly, ToMA exploits two intrinsic properties of diffusion models to further minimize overhead: latent space locality and sequential redundancy. Latent space locality recognizes that tokens in natural images exhibit strong spatial coherence, allowing for parallel merging within non-overlapping local windows (e.g., 8×8 patches). This significantly reduces computation. Sequential redundancy acknowledges that merge patterns often persist across adjacent denoising timesteps and consecutive transformer layers. By reusing these merge patterns, ToMA amortizes the overhead across multiple steps and layers, further boosting efficiency.
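The two ideas can be sketched together, with the caveat that `window_partition` and `MergePlanCache` are hypothetical names for illustration, not ToMA's API: locality means each window can be merged independently, and redundancy means an expensive merge plan can be computed once and reused for several steps.

```python
def window_partition(height, width, win=8):
    """Group flattened token indices of an H x W latent grid into
    non-overlapping win x win windows, so merging can run per window
    in parallel (index bookkeeping only)."""
    windows = []
    for wy in range(0, height, win):
        for wx in range(0, width, win):
            windows.append([(wy + y) * width + (wx + x)
                            for y in range(min(win, height - wy))
                            for x in range(min(win, width - wx))])
    return windows

class MergePlanCache:
    """Recompute the merge plan only every few denoising steps,
    amortizing the selection cost across the steps in between."""
    def __init__(self, refresh_every=4):
        self.refresh_every = refresh_every
        self.plan = None
        self.step = 0

    def get(self, compute_plan):
        if self.plan is None or self.step % self.refresh_every == 0:
            self.plan = compute_plan()   # the expensive part, run rarely
        self.step += 1
        return self.plan
```

With a refresh interval of four, eight denoising steps trigger only two plan computations; the exact reuse schedule across timesteps and layers is a tunable knob in the real system.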

Performance and Impact

The empirical validation of ToMA demonstrates its effectiveness. It reduces total generation latency by 24% for SDXL-base and 23% for Flux.1-dev, while image quality degrades negligibly (a DINO score change of less than 0.07). ToMA significantly outperforms prior methods like ToMeSD and ToFu, which either fail to accelerate modern attention implementations or introduce visual artifacts at comparable compression rates.

ToMA’s design ensures real-world speedup, not just theoretical FLOP reductions. It achieves at least 1.24x practical speedup when paired with FlashAttention2 and delivers state-of-the-art results across various diffusion models and GPU architectures (NVIDIA RTX6000, V100, RTX8000). The framework is also architecture-agnostic, meaning it can be readily extended to other diffusion models.

The research paper, available at arXiv:2509.10918, details the algorithmic innovations, system co-design, and extensive empirical validation that establish ToMA as a robust and deployable solution for efficient high-resolution image generation. By making diffusion models faster and more accessible, ToMA contributes to democratizing access to high-quality AI art creation, while also acknowledging the importance of responsible development and use of such powerful technologies.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
