TLDR: TISDiSS is a new framework for audio source separation that offers flexible speed-performance trade-offs. It unifies early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions, allowing a single model to scale performance at inference time without retraining. It achieves state-of-the-art results with fewer parameters, making it efficient and practical for various audio processing applications.
The field of audio processing, particularly source separation, is crucial for applications ranging from real-time communication to creating cleaner datasets for generative AI models. Traditionally, achieving better separation performance has meant using increasingly large and complex neural networks, which drives up both training and deployment costs. However, a new approach called Training-Time and Inference-Time Scalable Discriminative Source Separation, or TISDiSS, offers a more efficient solution.
TISDiSS is a novel framework designed to provide flexible speed-performance trade-offs in source separation. It addresses a common tension: high-quality separation favors large, powerful models, while devices with limited computational resources need fast, lightweight processing. The core idea is to let a single trained model scale its performance at inference time, so quality and speed can be adjusted without training multiple separate models.
The framework unifies three key techniques: early-split multi-loss supervision, a shared-parameter design, and dynamic inference repetitions. Early-split multi-loss supervision guides learning by applying supervision at intermediate stages of the network, improving how well the model learns to represent audio features. The shared-parameter design reuses the same weights across repeated processing blocks, keeping the model lightweight and easy to deploy. Dynamic inference repetitions let the model adjust its effective computational depth at inference time, directly controlling the trade-off between speed and separation quality; the sketch below illustrates these last two ideas.
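To make weight sharing and dynamic repetitions concrete, here is a minimal PyTorch sketch. The class and argument names (`SharedBlock`, `SharedSeparator`, `n_reps`) and the block internals are illustrative assumptions, not taken from the paper's released code.

```python
import torch
import torch.nn as nn

class SharedBlock(nn.Module):
    """One processing block whose weights are reused across repetitions."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * 2),
            nn.GELU(),
            nn.Linear(dim * 2, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection, so repeated application refines the features.
        return x + self.net(x)

class SharedSeparator(nn.Module):
    """Applies the same block n_reps times; n_reps can change at inference."""
    def __init__(self, dim: int):
        super().__init__()
        self.block = SharedBlock(dim)  # one set of weights, any depth

    def forward(self, x: torch.Tensor, n_reps: int = 4) -> torch.Tensor:
        for _ in range(n_reps):
            x = self.block(x)
        return x

sep = SharedSeparator(dim=64)
feats = torch.randn(2, 100, 64)   # (batch, frames, features)
fast = sep(feats, n_reps=2)       # low-latency setting
best = sep(feats, n_reps=8)       # higher-quality setting, same weights
```

Because the block is residual, applying it more times progressively refines the features, which is what turns depth into a tunable inference-time knob.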
Unlike previous methods that apply some of these techniques in isolation, TISDiSS integrates all three. This joint design is what enables inference-time scaling: increasing the number of inference repetitions improves output quality without altering the model's parameters. Developers therefore do not need to retrain or deploy entirely new models for different performance requirements.
The TISDiSS framework processes audio through five components: an Encoder, a Separator, a Splitter, a Reconstructor, and a Decoder. The Encoder converts raw audio into a time-frequency representation. The Separator processes these features over multiple iterations using shared parameters. The Splitter then decomposes the output into per-speaker features, which the Reconstructor, also with shared parameters, further refines. Finally, the Decoder converts the refined features back into clean, separated waveforms; a simplified end-to-end sketch follows.
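The sketch below wires these five stages together under simplifying assumptions: a learned convolutional front end stands in for the time-frequency encoder, GRUs stand in for the actual separator and reconstructor blocks, and two speakers are assumed. Only the component names mirror the paper.

```python
import torch
import torch.nn as nn

class TISDiSSLike(nn.Module):
    """Hypothetical five-stage pipeline; internals are illustrative."""
    def __init__(self, dim: int = 64, n_src: int = 2):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, dim, kernel_size=16, stride=8)   # waveform -> features
        self.separator = nn.GRU(dim, dim, batch_first=True)          # shared across repetitions
        self.splitter = nn.Linear(dim, dim * n_src)                  # mixture -> per-speaker feats
        self.reconstructor = nn.GRU(dim, dim, batch_first=True)      # shared refinement
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=16, stride=8)

    def forward(self, mix: torch.Tensor, sep_reps: int = 4, rec_reps: int = 2):
        x = self.encoder(mix.unsqueeze(1)).transpose(1, 2)   # (B, frames, dim)
        for _ in range(sep_reps):                            # dynamic repetitions
            x, _ = self.separator(x)
        b, t, d = x.shape
        srcs = self.splitter(x).view(b, t, self.n_src, d)    # split into speakers
        outs = []
        for s in range(self.n_src):
            h = srcs[:, :, s]
            for _ in range(rec_reps):                        # shared reconstructor
                h, _ = self.reconstructor(h)
            outs.append(self.decoder(h.transpose(1, 2)).squeeze(1))
        return torch.stack(outs, dim=1)                      # (B, n_src, samples)

model = TISDiSSLike()
mix = torch.randn(2, 16000)       # two one-second mixtures at 16 kHz
est = model(mix, sep_reps=6)      # deeper inference for higher quality
```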
During training, TISDiSS uses a multi-loss supervision mechanism. This involves a “Final Output Loss” for the ultimate separated signals and “Intermediate Auxiliary Losses” applied at various stages within the Separator, Splitter, and Reconstructor. This comprehensive loss design helps the model learn more effectively and ensures robust performance across different inference depths.
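Here is one plausible shape for that objective, assuming the widely used negative SI-SNR as the training criterion (the paper reports SI-SNRi, the improvement of SI-SNR over the unprocessed mixture). The auxiliary weighting scheme and the omission of permutation-invariant training are simplifications for brevity.

```python
import torch

def si_snr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SNR, averaged over the batch."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference signal.
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()

def total_loss(final_est, intermediate_ests, ref, aux_weight: float = 0.3):
    """Final-output loss plus weighted auxiliary losses on intermediate estimates."""
    loss = si_snr_loss(final_est, ref)
    for est in intermediate_ests:            # e.g. outputs decoded after each
        loss = loss + aux_weight * si_snr_loss(est, ref)  # intermediate stage
    return loss
```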
Experiments on standard speech separation benchmarks, including WSJ0-2mix, Libri2Mix, and WHAMR!, show that TISDiSS achieves state-of-the-art performance with significantly fewer parameters than other leading models. For instance, one TISDiSS configuration reached higher SI-SNRi and SDRi scores than larger models even without dynamic mixing, a common data-augmentation strategy.
The research also examined how different aspects of TISDiSS contribute to its results. Training with more inference repetitions consistently improves performance, especially in low-latency, shallow-inference settings, so a model trained for higher quality still performs well at lower computational cost. Furthermore, fine-tuning an existing model with an increased number of repetitions can improve performance without retraining from scratch, a practical way to adapt a deployed model to new requirements; the sketch below outlines this recipe.
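Reusing the `TISDiSSLike` and `total_loss` sketches above, a hypothetical fine-tune-then-scale workflow might look like this; the checkpoint path, optimizer settings, and dummy data are all illustrative, and permutation handling is again omitted.

```python
import torch

model = TISDiSSLike()
# model.load_state_dict(torch.load("tisdiss_base.pt"))  # hypothetical checkpoint
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

mix = torch.randn(4, 16000)        # dummy batch: 4 one-second mixtures
refs = torch.randn(4, 2, 16000)    # dummy reference sources

for step in range(10):             # a few fine-tuning steps
    est = model(mix, sep_reps=8)   # fine-tune at a deeper repetition count
    loss = total_loss(est, [], refs)   # final-output loss only, for brevity
    opt.zero_grad()
    loss.backward()
    opt.step()

# At deployment, the same weights serve both regimes:
with torch.no_grad():
    fast = model(mix, sep_reps=2)  # low-latency inference
    best = model(mix, sep_reps=8)  # higher-quality inference, same weights
```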
In conclusion, TISDiSS represents a significant step forward in discriminative source separation. By combining these architectural and training strategies, it provides a unified, scalable, and practical framework for adaptive audio processing, delivering high performance with reduced computational overhead. For full technical details, refer to the original research paper.