
COSE: Enhancing Speech in a Single Step with Average Velocity Flow Matching

TLDR: COSE is a novel one-step speech enhancement framework that utilizes average-velocity flow matching. It introduces a velocity composition identity to efficiently compute average velocity, thereby avoiding computationally expensive Jacobian–vector product (JVP) calculations. This approach leads to significantly faster sampling (up to 5x) and reduced training overhead (40% less cost and memory) compared to MeanFlow, all while maintaining competitive speech enhancement quality on standard benchmarks.

In the rapidly evolving field of artificial intelligence, speech enhancement (SE) plays a crucial role in improving the clarity and quality of audio, benefiting everything from human communication to automated systems like speech recognition. Traditionally, methods based on statistical signal processing have been used, but they often struggle with complex or unpredictable noise. The advent of deep learning has brought significant advancements, with generative models like diffusion and flow matching (FM) leading the charge in capturing intricate speech patterns and preserving both quality and intelligibility.

However, despite their impressive performance, these generative models face a significant hurdle: their reliance on multi-step generation. Sampling is computationally intensive, requiring a high number of function evaluations (NFE), and errors can accumulate across the many steps. This makes such models less suitable for real-time or resource-constrained applications.

A promising direction in generative modeling is the development of one-step generation techniques. Among these, the MeanFlow framework stands out by reformulating generative dynamics through the concept of average velocity fields, enabling direct one-step generation. While effective, MeanFlow itself introduces a new challenge: high training overhead due to complex Jacobian–vector product (JVP) computations.
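For context, MeanFlow defines the average velocity as the time average of the instantaneous flow-matching velocity over an interval, and its training identity follows from differentiating that definition (the notation below follows the MeanFlow formulation, with z_t denoting the state at time t):

```latex
u(z_t, r, t) \;\triangleq\; \frac{1}{t-r}\int_{r}^{t} v(z_\tau, \tau)\,\mathrm{d}\tau,
\qquad
u(z_t, r, t) \;=\; v(z_t, t) \;-\; (t-r)\,\frac{\mathrm{d}}{\mathrm{d}t}\,u(z_t, r, t).
```

The total derivative in the second equation expands into a Jacobian–vector product of the network with respect to its inputs, which must be evaluated at every training step. That per-step JVP is precisely the overhead COSE sets out to eliminate.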

This is where a new framework called COSE (Compose velOcity in Speech Enhancement) comes into play. Developed by researchers from the University of Electronic Science and Technology of China, COSE integrates the MeanFlow concept into speech enhancement, specifically designed for efficient, one-step generation. The core innovation of COSE lies in its ability to compute average velocity efficiently by introducing a ‘velocity composition identity’. This clever mathematical approach eliminates the need for expensive JVP computations, significantly reducing the training burden while maintaining theoretical consistency and achieving high-quality speech enhancement.
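To make the idea concrete, here is a minimal sketch of how a JVP-free training target can be built from a composition identity of the form (t − r)·u(z_t, r, t) = (t − s)·u(z_t, s, t) + (s − r)·u(z_s, r, s). Everything below is illustrative: the network, dimensions, and the use of random tensors in place of spectrogram features are assumptions, not the architecture or training recipe from the paper.

```python
import torch

# Hypothetical average-velocity network u_theta(z, r, t); a toy stand-in,
# not the model used in the COSE paper.
class AvgVelocityNet(torch.nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 2, 64), torch.nn.SiLU(),
            torch.nn.Linear(64, dim),
        )

    def forward(self, z, r, t):
        # Condition on both interval endpoints by concatenation.
        return self.net(torch.cat([z, r, t], dim=-1))

def compose_target(model, z_t, z_s, r, s, t):
    """JVP-free target from the velocity composition identity:
    (t - r) * u(z_t, r, t) = (t - s) * u(z_t, s, t) + (s - r) * u(z_s, r, s).
    The right-hand evaluations are detached (stop-gradient), so no
    Jacobian-vector product ever enters the training graph."""
    with torch.no_grad():
        u_st = model(z_t, s, t)   # average velocity over [s, t]
        u_rs = model(z_s, r, s)   # average velocity over [r, s]
    return ((t - s) * u_st + (s - r) * u_rs) / (t - r)

# Toy usage: random tensors stand in for points on the flow-matching path.
dim, batch = 16, 4
model = AvgVelocityNet(dim)
r = torch.zeros(batch, 1)
t = torch.ones(batch, 1)
s = torch.rand(batch, 1) * 0.8 + 0.1   # intermediate time in (0, 1)
z_t = torch.randn(batch, dim)          # state at time t
z_s = torch.randn(batch, dim)          # state at time s
target = compose_target(model, z_t, z_s, r, s, t)
loss = torch.mean((model(z_t, r, t) - target) ** 2)
loss.backward()                        # ordinary backprop; no JVP anywhere
```

The key design point is that the long-interval average velocity is supervised by two shorter-interval evaluations of the same network, rather than by differentiating the network through time as MeanFlow's identity requires.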

The benefits of COSE are substantial. Extensive experiments conducted on standard benchmarks like the VoiceBank-DEMAND and CHiME-4 datasets demonstrate that COSE delivers up to five times faster sampling compared to existing methods. Furthermore, it reduces training costs, including GPU memory usage and training time, by approximately 40% when compared to MeanFlow. Crucially, these efficiency gains are achieved without compromising the quality of the enhanced speech.

When compared to other leading models, COSE consistently shows superior one-step generation performance. It outperforms diffusion models like SGMSE+, StoRM, and VPIDM, even when those models use 15 steps. It also surpasses advanced flow-matching methods such as FlowSE and LARF. While diffusion-based models and instantaneous velocity models like FlowSE often see a rapid degradation in quality as the number of sampling steps decreases, COSE maintains high-quality results in just a single step, highlighting the effectiveness of its average velocity modeling in reducing cumulative errors.

The research paper, titled “COMPOSE YOURSELF: AVERAGE-VELOCITY FLOW MATCHING FOR ONE-STEP SPEECH ENHANCEMENT”, details how COSE leverages the properties of the underlying Ordinary Differential Equation (ODE) to decompose a displacement over an interval into the composition of two segment velocities. This avoids the computational overhead of JVPs while keeping the solution equivalent to other one-step generation methods, in line with the self-consistency principle underlying flow matching trajectories.
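Once an average-velocity model is trained, sampling collapses to a single function evaluation: the displacement over [r, t] is (t − r)·u(z_t, r, t), so the clean estimate is recovered in one step. The sketch below illustrates this with a hypothetical placeholder velocity field, not the trained COSE network:

```python
import numpy as np

# Placeholder average-velocity field u(z, r, t); purely illustrative.
def u(z, r, t):
    return -z  # toy field whose flow contracts toward zero

def one_step_enhance(z1):
    """Single NFE: the displacement over [0, 1] is (1 - 0) * u(z1, 0, 1),
    so the estimate at time 0 is z0 = z1 - (t - r) * u(z1, 0, 1)."""
    r, t = 0.0, 1.0
    return z1 - (t - r) * u(z1, r, t)

z1 = np.array([0.5, -1.0, 2.0])   # stand-in for a noisy sample at t = 1
z0 = one_step_enhance(z1)         # with this toy field, z0 == 2 * z1
```

A multi-step sampler would instead chain many such updates over short sub-intervals, which is where the cumulative error and the high NFE count of diffusion-style methods come from.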


In conclusion, COSE represents a significant step forward for practical speech enhancement applications. By offering competitive enhancement quality with dramatically improved efficiency through its innovative one-step flow matching framework, it paves the way for more accessible and powerful audio processing technologies. The researchers hope this work will encourage further development in one-step generation for speech enhancement and plan to explore its generalization across a broader range of model architectures and datasets in future work.

Nikhil Patel
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
