TLDR: SpectroStream is a new neural audio codec that improves upon SoundStream, offering high-quality 48 kHz stereo music compression at low bit rates (4-16 kbps). It achieves this through a novel time-frequency domain architecture and a delayed-fusion strategy for multi-channel audio, demonstrating superior performance over existing codecs in both objective and subjective evaluations, especially at lower bit rates.
SpectroStream is a neural audio codec designed for high-quality, full-band, multi-channel audio, in particular 48 kHz stereo music. It builds on its predecessor, SoundStream, and extends that codec's capabilities beyond monophonic audio and lower sample rates.
At its core, SpectroStream uses a neural architecture that processes audio in the time-frequency domain. This approach is key to its audio quality, especially at higher sample rates. For multi-channel audio, such as stereo music, the model employs a "delayed-fusion" strategy: individual channels are first processed independently in the early parts of the model and are combined only in later stages. This balances per-channel acoustic quality against consistent phase across channels, which is vital for a natural stereo image.
SpectroStream is also designed for real-time streaming inference. Because it uses causal convolutions with minimal look-ahead, the codec has very low architectural latency, which suits applications that require immediate processing. This is achievable on a standard desktop CPU, without specialized hardware accelerators.
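To make the idea of a causal convolution with look-ahead concrete, here is a minimal sketch in plain Python. It is not SpectroStream's implementation: the function name `causal_conv1d`, the three-tap kernel, and the `lookahead` parameter are illustrative assumptions. The point is only that each output sample depends on past inputs plus, at most, `lookahead` future samples, which is what bounds the latency a streaming model adds.

```python
def causal_conv1d(signal, kernel, lookahead=0):
    """Toy 1-D convolution where output[t] sees inputs up to t + lookahead.

    Hypothetical helper for illustration; real codecs use learned,
    multi-channel convolution layers instead of a fixed kernel.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            # kernel[j] weights the input sample at index t + lookahead - j
            idx = t + lookahead - j
            if 0 <= idx < len(signal):  # samples outside the signal count as zero
                acc += weight * signal[idx]
        out.append(acc)
    return out


# An impulse at index 5 shows which outputs can "see" that sample.
impulse = [0.0] * 10
impulse[5] = 1.0

strict = causal_conv1d(impulse, [1.0, 1.0, 1.0], lookahead=0)
peek = causal_conv1d(impulse, [1.0, 1.0, 1.0], lookahead=1)

print([t for t, v in enumerate(strict) if v != 0.0])  # [5, 6, 7]
print([t for t, v in enumerate(peek) if v != 0.0])    # [4, 5, 6]
```

With `lookahead=0` no output before index 5 reacts to the impulse. With `lookahead=1` the response starts one sample earlier, meaning the layer must wait for one future sample before it can emit each output.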
How SpectroStream Works
The SpectroStream system comprises an encoder, a decoder, and a quantizer, with a discriminator used during the training phase. The process begins by converting the input audio into spectrograms, which represent the audio in both time and frequency. These spectrograms are then fed into the encoder, which compresses them into compact, discrete integer tokens using a technique called residual vector quantization (RVQ). These tokens form the compressed representation of the audio. During decoding, these tokens are dequantized back into embeddings, and the decoder reconstructs the time-domain waveform.
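The quantize and dequantize steps described above can be sketched with a toy residual vector quantizer in plain Python. Everything specific here is an assumption for illustration: the two tiny hand-written codebooks, the 2-dimensional embedding, and the function names `rvq_encode` and `rvq_decode`. In a real codec the codebooks are learned and far larger. The sketch shows the core idea only: each stage quantizes what the previous stages left over, and decoding sums the selected codebook entries.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for i, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index


def rvq_encode(vec, codebooks):
    """Turn one embedding into one integer token per quantizer stage."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        tokens.append(i)
        # The next stage only has to encode what this stage missed.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Dequantize: sum the codebook entries selected by the tokens."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Hypothetical codebooks: a coarse stage followed by a finer one.
coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]]
codebooks = [coarse, fine]

tokens = rvq_encode([1.2, 0.3], codebooks)
print(tokens)                         # [1, 3]
print(rvq_decode(tokens, codebooks))  # [1.25, 0.25]
```

Dropping trailing tokens yields a coarser reconstruction, which is how RVQ-based codecs can trade bit rate against quality. Here, decoding only the first token would give `[1.0, 0.0]`.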
The model’s ability to handle multi-channel audio effectively stems from its delayed fusion in the encoder and an “early splitting” strategy in the decoder. This allows the model to process channels independently in early stages for fidelity, and then jointly in later stages to maintain cross-channel consistency. The training of SpectroStream involves a combination of adversarial and reconstruction losses, utilizing a multi-scale STFT-based discriminator to capture fine details across different time-frequency resolutions.
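The data flow of delayed fusion can be illustrated with a deliberately simple plain-Python sketch. None of the stages below reflect the actual layers of the model: `per_channel_stage`, `fuse`, and `joint_stage` are hypothetical stand-ins (a fixed gain, frame-wise stacking, and a mid/side mix), chosen only to show the ordering of independent processing first and joint processing second.

```python
def per_channel_stage(frames, gain):
    """Stand-in for early encoder layers: sees a single channel only."""
    return [gain * x for x in frames]


def fuse(channel_features):
    """Stand-in for the fusion point: stack channel features frame by frame."""
    return [list(frame) for frame in zip(*channel_features)]


def joint_stage(fused):
    """Stand-in for later layers that see all channels at once.

    A mid/side mix is used purely as an example of a cross-channel
    operation; it is an assumption, not the model's computation.
    """
    return [[(left + right) / 2, (left - right) / 2] for left, right in fused]


def encode_stereo(left, right):
    # 1) Independent processing: each channel goes through the same early stage.
    features = [per_channel_stage(channel, gain=2.0) for channel in (left, right)]
    # 2) Delayed fusion: channels are combined only after the early stage.
    fused = fuse(features)
    # 3) Joint processing: later stages operate on both channels together.
    return joint_stage(fused)


print(encode_stereo([1.0, 2.0], [1.0, 0.0]))  # [[2.0, 0.0], [2.0, 2.0]]
```

A decoder with early splitting would mirror this order: joint stages first, then a split back into per-channel stages that reconstruct each waveform.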
Performance and Comparison
Extensive experiments were conducted to evaluate SpectroStream’s performance, comparing it against another state-of-the-art codec, Descript Audio Codec (DAC). The evaluation included both objective metrics, such as ViSQOL scores, and subjective listening tests involving human raters. The results consistently demonstrated SpectroStream’s superior performance across various bit rates, particularly at lower bit rates where the difference in quality was most pronounced.
For instance, at a low bit rate of 2.7 kbps per channel, SpectroStream achieved a significant gain in ViSQOL scores compared to DAC. Subjective listening tests further reinforced these findings, with listeners preferring SpectroStream over DAC in a substantial majority of cases, especially at lower bit rates. This indicates that SpectroStream not only achieves better objective quality but also delivers a perceptually more pleasing audio experience.
The Future of Audio Compression
SpectroStream represents a significant step forward in neural audio compression. Its ability to produce high-fidelity yet compact representations of full-band, multi-channel audio opens up new possibilities for applications such as more sophisticated audio generation and language modeling on audio. For more technical details, refer to the full research paper.


