
ToMA: Practical Speedups for High-Fidelity Image Generation

TLDR: ToMA (Token Merge with Attention) is a new method designed to make diffusion models like SDXL and Flux faster and more efficient for high-fidelity image generation. It addresses the scalability issues of transformers by intelligently merging redundant tokens using a submodular optimization approach and GPU-friendly attention-like operations. Unlike previous methods that suffered from GPU-inefficient processes, ToMA achieves significant practical speedups (up to 24% for SDXL and 23% for Flux) with minimal impact on image quality, by also exploiting latent space locality and reusing merge patterns.

Diffusion models have become a cornerstone of high-fidelity image generation, producing stunning, realistic visuals. Their reliance on transformer architectures, however, particularly in backbones like U-ViT and DiT, introduces a significant challenge: self-attention, the mechanism these models depend on, has computational complexity that scales quadratically with the number of tokens. As images grow larger or more detailed, processing time and resource demands rise dramatically, creating a latency bottleneck in the denoising steps.
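To see why this bottleneck matters, here is a back-of-the-envelope sketch (not from the paper): a square latent grid of side n yields n × n tokens, and self-attention scores every token pair, so the cost grows with the fourth power of the grid side.

```python
# Illustrative only: count the token pairs self-attention must score
# for a square latent grid of side `latent_side`.
def attention_pairs(latent_side: int) -> int:
    tokens = latent_side * latent_side   # one token per latent position
    return tokens * tokens               # attention compares every pair

# Doubling the latent resolution multiplies the pair count by 16.
ratio = attention_pairs(128) // attention_pairs(64)
```

This is why reducing the token count, rather than only optimizing the attention kernel, is such an attractive lever.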

Previous attempts to address this scalability issue have fallen into two main categories: optimizing attention mechanisms (like FlashAttention) and reducing the number of tokens. While token reduction has been explored for discriminative tasks, generative models like diffusion models have stricter requirements. Tokens must be restored to their original count after merging (an ‘unmerging’ step) to maintain spatial consistency for the iterative refinement process. Earlier methods such as ToMeSD and ToFu tried to adapt token reduction for generative tasks, but they often relied on GPU-inefficient operations like sorting and scattered memory writes. These inefficiencies could negate any theoretical speedups, especially when paired with highly optimized attention implementations, causing the merging overhead to dominate computation time.

To bridge this gap between theoretical efficiency and practical speed, researchers have introduced Token Merge with Attention (ToMA). This off-the-shelf method rethinks token reduction for GPU-aligned efficiency, offering a robust way to accelerate diffusion models without compromising image quality. ToMA is a training-free framework built on three key contributions.

ToMA’s Core Innovations

Firstly, ToMA reformulates the token merge problem as a submodular optimization task. This mathematical approach allows for the selection of diverse and representative tokens, ensuring that critical information is preserved even when the token count is reduced. The greedy algorithm, a well-known method for submodular maximization, is employed here, guaranteeing a near-optimal subset selection with minimal information loss. This process is also optimized for GPU execution, making it highly efficient.
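The flavor of this step can be sketched with a classic greedy algorithm for a facility-location objective, a standard submodular function that rewards picking tokens every other token is similar to. The function names and the cosine-similarity choice below are illustrative assumptions, not ToMA's actual implementation, which runs this selection in GPU-friendly batched form.

```python
import math

def cosine(u, v):
    # Cosine similarity between two token vectors (hypothetical choice).
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)) or 1.0
    return num / den

def greedy_submodular_select(tokens, k):
    """Greedily pick k diverse, representative tokens by maximizing a
    facility-location objective: the sum, over all tokens, of each token's
    best similarity to the selected set. Greedy selection carries the
    classic (1 - 1/e) near-optimality guarantee for submodular functions."""
    n = len(tokens)
    best = [0.0] * n          # best similarity of each token to the selected set
    selected = []
    for _ in range(k):
        gains = []
        for j in range(n):
            if j in selected:
                gains.append(-1.0)   # never re-pick a selected token
                continue
            # Marginal gain: how much adding token j improves coverage.
            gains.append(sum(max(0.0, cosine(tokens[i], tokens[j]) - best[i])
                             for i in range(n)))
        j_star = max(range(n), key=gains.__getitem__)
        selected.append(j_star)
        best = [max(best[i], cosine(tokens[i], tokens[j_star])) for i in range(n)]
    return selected
```

On a toy set with two clusters of near-duplicate tokens, the greedy pass picks one representative from each cluster, which is exactly the "diverse and representative" behavior the merge step needs.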

Secondly, ToMA implements the merge and unmerge operations as attention-like linear transformations. Instead of GPU-inefficient sorting or scattered writes, ToMA uses GPU-friendly matrix operations. The destination tokens act as queries, and all input tokens serve as keys and values, with Scaled Dot-Product Attention (SDPA) generating similarity scores that guide the merging process. The unmerge step, which restores the tokens to their original resolution, is efficiently performed using a transpose-based approximation of the pseudo-inverse, which is both computationally inexpensive and empirically effective.
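A minimal sketch of this idea, assuming a softmax-normalized similarity matrix as the merge map (the function names, the temperature, and the plain-Python matrix helpers are illustrative, not ToMA's kernels): merging is one matrix product, and unmerging reuses the same matrix transposed.

```python
import math

def matmul(A, B):
    # Naive dense matrix product; stands in for a batched GPU matmul.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def merge_unmerge(X, dst_idx, scale=10.0):
    """Merge n tokens into k destination tokens with an attention-like map,
    then restore n tokens via the transpose of the merge matrix, a cheap
    stand-in for the pseudo-inverse in the unmerge step."""
    D = [X[i] for i in dst_idx]                 # destination tokens act as queries
    S = matmul(D, transpose(X))                 # dot-product similarity scores (k x n)
    M = []
    for row in S:
        m = max(v * scale for v in row)
        exps = [math.exp(v * scale - m) for v in row]
        z = sum(exps)
        M.append([e / z for e in exps])         # row-stochastic merge matrix
    merged = matmul(M, X)                       # k merged tokens (one matmul)
    restored = matmul(transpose(M), merged)     # transpose-based unmerge (n tokens)
    return merged, restored
```

Because both directions are plain matrix products, they map onto the same highly optimized GPU kernels as attention itself, avoiding the sorts and scattered writes that slowed earlier methods.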

Thirdly, ToMA exploits two intrinsic properties of diffusion models to further minimize overhead: latent space locality and sequential redundancy. Latent space locality recognizes that tokens in natural images exhibit strong spatial coherence, allowing for parallel merging within non-overlapping local windows (e.g., 8×8 patches). This significantly reduces computation. Sequential redundancy acknowledges that merge patterns often persist across adjacent denoising timesteps and consecutive transformer layers. By reusing these merge patterns, ToMA amortizes the overhead across multiple steps and layers, further boosting efficiency.
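The two ideas can be sketched together, with the caveat that `window_partition` and `MergePlanCache` are hypothetical names for illustration, not ToMA's API: locality means each window can be merged independently, and redundancy means an expensive merge plan can be computed once and reused for several steps.

```python
def window_partition(height, width, win=8):
    """Group flattened token indices of an H x W latent grid into
    non-overlapping win x win windows, so merging can run per window
    in parallel (index bookkeeping only)."""
    windows = []
    for wy in range(0, height, win):
        for wx in range(0, width, win):
            windows.append([(wy + y) * width + (wx + x)
                            for y in range(min(win, height - wy))
                            for x in range(min(win, width - wx))])
    return windows

class MergePlanCache:
    """Recompute the merge plan only every few denoising steps,
    amortizing the selection cost across the steps in between."""
    def __init__(self, refresh_every=4):
        self.refresh_every = refresh_every
        self.plan = None
        self.step = 0

    def get(self, compute_plan):
        if self.plan is None or self.step % self.refresh_every == 0:
            self.plan = compute_plan()   # the expensive part, run rarely
        self.step += 1
        return self.plan
```

With a refresh interval of four, eight denoising steps trigger only two plan computations; the exact reuse schedule across timesteps and layers is a tunable knob in the real system.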

Performance and Impact

The empirical validation of ToMA demonstrates its effectiveness. It reduces total generation latency by 24% for SDXL-base and 23% for Flux.1-dev, while image quality degrades negligibly (a DINO score change of less than 0.07). ToMA significantly outperforms prior methods like ToMeSD and ToFu, which either fail to accelerate modern attention implementations or introduce visual artifacts at comparable compression rates.

ToMA’s design ensures real-world speedup, not just theoretical FLOP reductions. It achieves at least 1.24x practical speedup when paired with FlashAttention2 and delivers state-of-the-art results across various diffusion models and GPU architectures (NVIDIA RTX6000, V100, RTX8000). The framework is also architecture-agnostic, meaning it can be readily extended to other diffusion models.

The research paper, available at arXiv:2509.10918, details the algorithmic innovations, system co-design, and extensive empirical validation that establish ToMA as a robust and deployable solution for efficient high-resolution image generation. By making diffusion models faster and more accessible, ToMA contributes to democratizing access to high-quality AI art creation, while also acknowledging the importance of responsible development and use of such powerful technologies.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
