TLDR: This research introduces the use of Autoregressive Diffusion Models (ADMs) to estimate musical surprisal from audio, a measure of how unexpected a musical event is. Unlike previous models, ADMs do not rely on strong assumptions about data distribution and can analyze surprisal at various “noise levels,” corresponding to different audio granularities. The study demonstrates that ADMs outperform prior methods in describing diverse music data and are more effective at capturing monophonic pitch surprisal and detecting segment boundaries. Crucially, the research finds that analyzing surprisal at moderate noise levels can better reflect higher-level musical features like pitch, while filtering out low-level details like timbre nuances, leading to improved performance in musical analysis tasks.
Understanding what makes music surprising or expected for human listeners has long been a fascinating area of research. Recently, a concept called “information content” (IC), or negative log-likelihood (NLL), derived from artificial intelligence models, has been used to estimate this musical surprisal directly from audio. This approach helps quantify how unexpected a musical event is, correlating with human perception of surprise and complexity in music.
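The relationship between IC and probability can be made concrete with a toy sketch (not taken from the paper): IC in bits is simply the negative base-2 logarithm of the probability a model assigns to an event, so rare events carry high surprisal. The probabilities below are invented for illustration; a real model would supply p(event | context) from its predictive distribution.

```python
import numpy as np

def information_content(p):
    """Surprisal in bits: -log2 of the model's probability for an event."""
    return -np.log2(p)

# An expected event (p = 0.5) yields 1 bit of surprisal;
# a surprising one (p = 0.01) yields far more.
ic_expected = information_content(0.5)
ic_surprising = information_content(0.01)
```

Averaging this quantity over a sequence recovers the model's negative log-likelihood, which is why NLL and IC are used interchangeably above.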
Previous attempts to model musical surprisal from audio, such as those using the Generative Infinite-Vocabulary Transformer (GIVT) model, faced limitations. These models often made strong assumptions about how musical data is distributed, which could hinder their effectiveness, especially with highly compressed audio representations.
A New Approach with Diffusion Models
This new research introduces Autoregressive Diffusion Models (ADMs) as a powerful alternative for estimating musical surprisal. Diffusion models have emerged as state-of-the-art tools in generative AI across various domains, including music. A significant advantage of ADMs is that they do not require rigid assumptions about data distribution. Furthermore, by formulating diffusion processes as ordinary differential equations (ODEs), these models can estimate the likelihood of any given data point at different stages of the diffusion process. These stages correspond to varying levels of “noise” or abstraction in the data.
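The ODE-based likelihood idea can be sanity-checked on a minimal 1-D example. The sketch below is not the paper's model: it uses toy Gaussian data (variance 4) under a VP-style diffusion with a constant beta of 1, for which the probability-flow ODE velocity is known in closed form, v(x, t) = 0.5 * x * (1/var_t - 1) with var_t = 1 + 3 * exp(-t). Integrating the ODE while accumulating its divergence recovers the exact data log-density, which is precisely the mechanism a learned model exploits.

```python
import numpy as np

def gauss_logpdf(x, var):
    """Log-density of a zero-mean Gaussian with variance `var`."""
    return -0.5 * (np.log(2 * np.pi * var) + x * x / var)

def ode_log_likelihood(x0, T=8.0, n_steps=80_000):
    """Estimate log p(x0) by Euler-integrating the probability-flow ODE
    forward to near-Gaussian noise while accumulating the divergence."""
    dt = T / n_steps
    x, log_det = x0, 0.0
    for i in range(n_steps):
        t = i * dt
        var_t = 1.0 + 3.0 * np.exp(-t)
        div = 0.5 * (1.0 / var_t - 1.0)  # dv/dx (scalar case)
        x += div * x * dt                # Euler step of the ODE
        log_det += div * dt              # change-of-variables term
    var_T = 1.0 + 3.0 * np.exp(-T)       # marginal at time T, ~ N(0, 1)
    return gauss_logpdf(x, var_T) + log_det
```

Because the toy velocity is exact, `ode_log_likelihood(x0)` matches the analytic `gauss_logpdf(x0, 4.0)` up to Euler discretization error; with a trained model the same recipe yields the NLL, and hence the IC, of real audio frames.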
The researchers specifically investigated two popular diffusion models: EDM and Rectified Flow (RFF). They empirically demonstrated that these diffusion models describe diverse music data more effectively, in terms of negative log-likelihood, than the GIVT model. This finding is consistent with their superior performance in audio generation tasks.
Exploring Surprisal at Different Audio Granularities
A key hypothesis explored in this paper is that surprisal estimated at different diffusion process noise levels corresponds to the surprisal of music and audio features present at different audio granularities. For instance, at moderate noise levels, the models might capture the surprisal of higher-level features like pitch, while filtering out the contributions of lower-level features such as subtle timbre nuances.
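A loose analogy, not the paper's method, illustrates why moderate noise can act as a granularity filter: take a signal built from a strong low-frequency "pitch" component and a faint high-frequency "timbre" component, then add Gaussian noise standing in for a diffusion noise level. All frequencies and amplitudes below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
pitch = np.sin(2 * np.pi * 5 * t)            # coarse structure
timbre = 0.1 * np.sin(2 * np.pi * 200 * t)   # fine detail
clean = pitch + timbre

sigma = 0.2                                  # a "moderate" noise level
noised = clean + sigma * rng.standard_normal(t.size)

# The noised signal still tracks the coarse pitch component closely,
# while the faint timbre term contributes almost nothing.
corr_pitch = np.corrcoef(noised, pitch)[0, 1]
corr_timbre = np.corrcoef(noised, timbre)[0, 1]
```

At this noise level the correlation with the pitch component stays high while the timbre correlation is small, mirroring the intuition that IC computed on noised data reflects coarse musical structure rather than low-level detail.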
To test the effectiveness of diffusion model IC in capturing surprisal, the study focused on two specific tasks:
- Capturing Monophonic Pitch Surprisal: This task relates to understanding tonality. The diffusion models were found to capture pitch surprisal better than the GIVT model. Notably, IC estimates from noised data (at intermediate noise levels) showed a higher correlation with perceptually validated pitch expectancy models, suggesting they are more invariant to timbre variations.
- Detecting Segment Boundaries in Multi-Track Audio: This task is related to identifying information changes in music. The research showed that peaks in the surprisal function align with segment boundaries. Furthermore, using IC estimated at coarser noise levels improved the precision and recall of segment boundary predictions.
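The boundary-detection step can be sketched with a toy peak picker (this is not the paper's pipeline; the IC curve, boundary positions, and threshold below are all invented). Spikes in a per-frame surprisal curve are treated as candidate segment boundaries.

```python
import numpy as np

rng = np.random.default_rng(0)
ic = 1.0 + 0.1 * rng.standard_normal(300)  # baseline surprisal, ~1 bit/frame
true_boundaries = [60, 140, 230]
ic[true_boundaries] += 3.0                 # surprisal spikes at section changes

def pick_peaks(curve, height):
    """Indices exceeding `height` and both immediate neighbours."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > height
            and curve[i] > curve[i - 1]
            and curve[i] > curve[i + 1]]

boundaries_pred = pick_peaks(ic, height=2.0)
```

In a real system the IC curve would come from the diffusion model's per-frame NLL estimates, and a library routine such as `scipy.signal.find_peaks` with a minimum peak distance would replace the hand-rolled picker.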
These findings support the hypothesis: when IC is estimated at an appropriately chosen noise level, performance on both surprisal tasks improves. Diffusion models therefore not only surpass GIVT in surprisal estimation but also offer an extra analytical dimension, the choice of audio granularity.
This groundbreaking work suggests that diffusion models can provide a more nuanced and accurate understanding of musical expectation and surprise. The code for this research is publicly available on github.com/SonyCSLParis/audioic, allowing other researchers to build upon these findings. For more in-depth technical details, you can refer to the full research paper available at arXiv:2508.05306.


