TLDR: CoDiCodec is a new audio autoencoder that can create both continuous and discrete compressed audio representations from a single model. It combines Finite Scalar Quantization (FSQ) with FSQ-dropout and a novel parallel decoding strategy to deliver high audio quality, efficient compression, and flexibility for various AI audio tasks. In the paper's experiments, it achieves better audio quality than existing continuous and discrete autoencoders at similar bitrates.
In the rapidly evolving world of artificial intelligence and audio, efficiently representing sound signals in a compact format is crucial for creating advanced generative models. Traditionally, researchers and developers have faced a dilemma: choose between continuous embeddings, which are great for models like GANs and diffusion models, or discrete tokens, which are ideal for training language models. Each approach has its strengths and weaknesses, often forcing a trade-off between compression ratio, audio fidelity, and compatibility with different AI frameworks.
A new research paper introduces CoDiCodec, a groundbreaking audio autoencoder designed to bridge this gap. Developed by Marco Pasini, Stefan Lattner, and György Fazekas, CoDiCodec offers a unified solution, capable of producing both continuous and discrete compressed representations of audio from a single trained model. This flexibility is a significant step forward, allowing developers to use the same underlying technology for a wider range of downstream generative tasks without needing separate, specialized models.
What Makes CoDiCodec Unique?
CoDiCodec stands out by addressing several key challenges in audio compression. It efficiently encodes global audio features using what are called “summary embeddings,” which help reduce redundancy and improve fidelity at high compression ratios. Imagine capturing the essence of an entire audio clip in a few key descriptors, rather than a long, detailed sequence.
One of its core innovations is the use of Finite Scalar Quantization (FSQ) combined with a novel technique called FSQ-dropout. FSQ is a simple yet effective way to convert continuous values into discrete tokens without needing complex additional loss terms during training. However, standard FSQ can limit the expressiveness of continuous embeddings. FSQ-dropout cleverly bypasses the rounding step during training with a certain probability, encouraging the model to produce more informative continuous embeddings while still learning to generate discrete tokens. This means CoDiCodec can offer higher-quality continuous decoding when needed, alongside its discrete capabilities.
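To make the idea concrete, here is a minimal pure-Python sketch of FSQ with a dropout-style bypass of the rounding step. It is an illustration under stated assumptions, not the paper's implementation: the function names, the tanh bounding, the odd number of levels, the `p_skip` parameter, and the mixed-radix token mapping are all choices made here for the example. In a real training setup the rounding would also need a gradient workaround such as a straight-through estimator, which this sketch leaves out.

```python
import math
import random


def fsq_bound(z, levels):
    """Squash each value into the open range (-(levels-1)/2, (levels-1)/2).

    Assumes an odd number of levels so the grid is symmetric around zero.
    """
    half = (levels - 1) / 2
    return [math.tanh(v) * half for v in z]


def fsq_quantize(z, levels):
    """Plain FSQ forward pass: bound each value, then round to the nearest level."""
    return [float(round(v)) for v in fsq_bound(z, levels)]


def fsq_dropout(z, levels, p_skip, rng, training=True):
    """FSQ with a dropout-style bypass (hypothetical helper for illustration).

    During training, with probability p_skip the rounding step is skipped and
    the bounded continuous values are returned instead of the quantized ones.
    Outside training, the values are always quantized.
    """
    if training and rng.random() < p_skip:
        return fsq_bound(z, levels)
    return fsq_quantize(z, levels)


def to_token(codes, levels):
    """Map a vector of integer levels to one token index (mixed-radix encoding)."""
    half = (levels - 1) // 2
    index = 0
    for c in codes:
        index = index * levels + (int(c) + half)
    return index


if __name__ == "__main__":
    rng = random.Random(0)
    z = [0.3, -1.2, 2.5]
    print("continuous:", fsq_dropout(z, 5, p_skip=1.0, rng=rng))
    codes = fsq_dropout(z, 5, p_skip=0.0, rng=rng)
    print("discrete:  ", codes, "-> token", to_token(codes, 5))
```

The same encoder output can therefore be read either as a continuous embedding (rounding skipped) or as a discrete token (rounding applied), which is the dual use the article describes.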
The model is also trained end-to-end using a single “consistency loss,” simplifying the training process significantly compared to multi-stage or adversarial training methods often used in other autoencoders. This makes CoDiCodec more stable and easier to work with.
Faster and More Flexible Decoding
CoDiCodec supports two decoding strategies: autoregressive and a novel parallel decoding method. Autoregressive decoding processes audio sequentially, chunk by chunk, which suits real-time applications but can be slow for longer audio sequences. The new parallel decoding strategy tackles this by decoding adjacent pairs of compressed latents simultaneously and then shifting the pairing by one position in subsequent denoising steps. Because each latent is paired with a different neighbor from step to step, information propagates across the sequence, which prevents the artifacts that might arise from decoding every chunk completely independently. The result is faster decoding, especially for long audio samples, and the paper reports superior audio quality as well.
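The shifting-pairs idea can be sketched with a toy schedule. This is only an illustration of the pairing pattern, with no actual decoder involved: `pairs_for_step` and `propagate` are hypothetical helpers, and the choice to process an unpaired latent at the sequence boundary on its own is an assumption made here, since the article does not say how boundaries are handled.

```python
def pairs_for_step(n_latents, step):
    """Group latent indices into adjacent pairs, shifting by one on odd steps.

    Even steps pair (0,1), (2,3), ...; odd steps pair (1,2), (3,4), ...
    Any latent left without a partner at a boundary forms a group of one.
    """
    offset = step % 2
    groups = []
    if offset == 1 and n_latents > 0:
        groups.append((0,))
    i = offset
    while i + 1 < n_latents:
        groups.append((i, i + 1))
        i += 2
    if i < n_latents:
        groups.append((i,))
    return groups


def propagate(n_latents, n_steps):
    """Track which latents have influenced each position after n_steps.

    Every group is processed jointly, so its members share everything
    they have seen so far. All groups within one step are independent
    and could be decoded in parallel.
    """
    seen = [{i} for i in range(n_latents)]
    for step in range(n_steps):
        for group in pairs_for_step(n_latents, step):
            merged = set().union(*(seen[i] for i in group))
            for i in group:
                seen[i] = set(merged)
    return seen


if __name__ == "__main__":
    for steps in range(1, 4):
        print(steps, "step(s):", propagate(4, steps))
```

With four latents, each position has seen only its partner after one step, while after three alternating steps every position has been influenced by the whole sequence. This is the sense in which shifting the pairs lets context spread without decoding sequentially.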
Performance and Scalability
In experiments, CoDiCodec demonstrated superior audio quality compared to existing continuous and discrete autoencoders at similar bitrates, as measured by metrics like FAD (Fréchet Audio Distance) and FAD_clap. While some baselines might show higher scores on reconstruction-specific metrics, CoDiCodec prioritizes general audio quality, which is crucial for generative tasks. The paper also highlights an improved architecture that scales more easily, focusing on transformer layers, and achieves faster inference speeds than previous models like Music2Latent2.
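For readers unfamiliar with the metric, FAD is commonly defined as the Fréchet distance between two Gaussians fitted to embeddings of reference audio and of the audio under evaluation, with lower values indicating closer distributions. The notation below is chosen here for illustration and is not taken from the paper: \mu_r, \Sigma_r are the mean and covariance of the reference embeddings, and \mu_g, \Sigma_g those of the evaluated audio. That FAD_clap uses CLAP embeddings is an assumption based on the name.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```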
This work represents a significant step towards a unified approach to audio compression, bridging the gap between continuous and discrete generative modeling paradigms. For more technical details, you can refer to the full research paper: CoDiCodec: Unifying Continuous and Discrete Compressed Representations of Audio.


