Unveiling PianoVAM: A New Multimodal Dataset for Piano Performance Analysis

TLDR: PianoVAM is a new, comprehensive multimodal dataset for piano performance research, featuring synchronized videos, audio, MIDI, hand landmarks, fingering labels, and metadata. Collected from amateur pianists during practice sessions, it addresses limitations of existing datasets by offering a rich combination of modalities. The dataset utilizes a semi-automated method for fingering annotation and has been benchmarked for piano transcription, demonstrating how visual information can significantly improve performance, especially in noisy or reverberant environments. It aims to advance Music Information Retrieval by providing detailed insights into the expressive sound creation process.

The world of music performance is rich and complex, involving much more than just sound. From the subtle movements of a musician’s hands to their posture, visual elements play a crucial role in how music is created and perceived. This multimodal nature of music has sparked a growing interest in collecting data that goes beyond just audio, especially within the Music Information Retrieval (MIR) community.

Addressing this need, a new and comprehensive dataset called PianoVAM has been introduced. Developed by a team of researchers including Yonghyun Kim, Junhyung Park, Joonhyung Bae, Kirak Kim, Taegyun Kwon, Alexander Lerch, and Juhan Nam, PianoVAM offers an unprecedented look into piano performances by combining various data types. This includes synchronized videos, high-quality audio, MIDI data, detailed hand landmarks, fingering labels, and extensive metadata.

The creation of PianoVAM involved recording amateur pianists during their everyday practice sessions using a Yamaha Disklavier piano. This setup captured both audio and MIDI information, alongside synchronized top-view videos, all under realistic and varied performance conditions. To enrich the dataset further, hand landmarks were extracted using a pre-trained hand pose estimation model, and fingering labels were generated through a semi-automated annotation algorithm. The researchers also openly discussed the challenges faced during data collection and the intricate process of aligning these diverse data modalities.
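The article does not name the hand pose model used; a common open-source choice for this kind of top-view hand tracking is MediaPipe Hands. Below is a minimal sketch of per-frame landmark extraction under that assumption (the video filename is hypothetical, and the actual PianoVAM pipeline may differ):

```python
import cv2
import mediapipe as mp

# Hypothetical input file; PianoVAM's actual videos are top-view practice recordings
cap = cv2.VideoCapture("practice_session.mp4")
hands = mp.solutions.hands.Hands(static_image_mode=False,
                                 max_num_hands=2,
                                 min_detection_confidence=0.5)

all_landmarks = []
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV decodes frames as BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    frame_hands = []
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            # 21 normalized (x, y, z) landmarks per detected hand
            frame_hands.append([(lm.x, lm.y, lm.z) for lm in hand.landmark])
    all_landmarks.append(frame_hands)

cap.release()
hands.close()
```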

What Makes PianoVAM Unique?

Compared to existing datasets, PianoVAM stands out for its comprehensiveness. While datasets like MAESTRO offer high-quality audio and MIDI, they lack visual components. Others, such as OMAPS2 and PianoYT, include video but have limited or pseudo-MIDI annotations. PianoVAM, on the other hand, provides real performance audio, synchronized MIDI, top-view videos, and unique fingering pseudo-labels, making it a truly multimodal resource for researchers.

For fingering data specifically, PianoVAM offers a large collection of over a million notes with fingering information, generated through a hybrid algorithm that combines automated detection with human refinement. This contrasts with datasets like PIG, which are smaller and manually annotated, or ThumbSet, which relies on crowd-sourced data with less clear annotation sources.

How the Data Was Collected and Processed

The data acquisition system was designed for unsupervised recording, allowing pianists to practice naturally. It involved an overhead webcam for video, a dedicated microphone for audio, and a Disklavier piano for MIDI signals. OBS Studio recorded the audio-video stream, while Logic Pro simultaneously captured the audio-MIDI stream. A common audio signal served as a reference for precise time alignment across all modalities.
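The article does not detail how the reference audio is used, but a standard way to estimate the offset between two recordings of the same signal is cross-correlation. A minimal sketch, with illustrative function and parameter names:

```python
import numpy as np
from scipy.signal import fftconvolve

def estimate_offset_seconds(ref, other, sr):
    """Estimate the lag of `other` relative to `ref` via cross-correlation.

    Both inputs are mono float arrays sampled at the same rate `sr`.
    Shifting `other` back by the returned number of seconds aligns
    it with `ref`.
    """
    # Cross-correlation computed as convolution with a time-reversed reference
    corr = fftconvolve(other, ref[::-1], mode="full")
    # Peak index, shifted so that zero lag maps to an offset of 0
    lag = int(np.argmax(corr)) - (len(ref) - 1)
    return lag / sr
```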

After collection, the data underwent rigorous pre-processing. This included refining the time alignment of audio and MIDI using techniques similar to those used for the MAESTRO dataset, involving down-mixing, resampling, and Dynamic Time Warping. Additionally, a loudness normalization procedure was applied to ensure consistency across recordings, especially given the varied recording conditions over time.
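As a rough illustration of the Dynamic Time Warping step, the sketch below aligns the real recording against audio synthesized from the MIDI, using librosa's DTW over chroma features. The exact MAESTRO-style pipeline may differ; the feature choice and parameters here are assumptions for illustration:

```python
import librosa
import numpy as np

def dtw_align(audio_path, midi_audio_path, sr=22050, hop=512):
    """Return a time mapping (seconds) between a recording and synthesized MIDI audio."""
    # Down-mix to mono and resample both signals to a common rate
    y_a, _ = librosa.load(audio_path, sr=sr, mono=True)
    y_m, _ = librosa.load(midi_audio_path, sr=sr, mono=True)
    # Chroma features are robust to timbre differences between real and synthesized audio
    c_a = librosa.feature.chroma_cqt(y=y_a, sr=sr, hop_length=hop)
    c_m = librosa.feature.chroma_cqt(y=y_m, sr=sr, hop_length=hop)
    # DTW returns an accumulated cost matrix and the optimal warping path
    _, wp = librosa.sequence.dtw(X=c_a, Y=c_m)
    # The path comes in reverse order; convert frame indices to seconds
    return np.asarray(wp)[::-1] * hop / sr
```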

Dataset Insights

The PianoVAM dataset comprises 106 solo piano recordings from 10 amateur performers, totaling approximately 21 hours. The repertoire is diverse, covering 38 composers from various eras, and includes improvisations. An interesting finding from the dataset statistics is the significantly higher use of the sustain pedal in PianoVAM compared to MAESTROv3, which the researchers attribute to the repertoire, amateur playing tendencies, and studio acoustics.

Fingering Annotation: A Hybrid Approach

A key innovation in PianoVAM is its method for generating fingering annotations. The algorithm processes performance videos to map hand landmarks to potential finger candidates for each MIDI note. For clear cases, fingering is determined automatically with high precision (around 95%). In ambiguous situations, a custom graphical user interface (GUI) allows a human annotator to make the final selection, ensuring complete and accurate labels.
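The authors' exact algorithm is not spelled out in this article, but the core idea, ranking fingers by fingertip-to-key distance at each note onset and deferring ambiguous cases to a human, can be sketched as follows. The landmark indices, thresholds, and key-registration step are all hypothetical:

```python
import numpy as np

# MediaPipe-style fingertip landmark indices (thumb=1 .. pinky=5); hypothetical mapping
FINGERTIPS = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20}

def finger_candidates(landmarks, key_center, max_dist=0.03, margin=1.5):
    """Rank fingers by fingertip distance to the pressed key's center.

    landmarks: (21, 2) normalized (x, y) for one hand at the note onset.
    key_center: (x, y) of the key, from a separate keyboard-registration step.
    Returns a single finger number when one candidate is clearly closest,
    otherwise a list of ambiguous candidates to defer to a GUI annotator.
    """
    dists = {f: np.linalg.norm(landmarks[i, :2] - key_center)
             for f, i in FINGERTIPS.items()}
    ranked = sorted(dists.items(), key=lambda kv: kv[1])
    (f1, d1), (_, d2) = ranked[0], ranked[1]
    if d1 < max_dist and d2 > margin * d1:
        return f1                                   # unambiguous: auto-label
    return [f for f, d in ranked if d < max_dist]   # ambiguous: manual review
```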

Benchmarking and Applications

To demonstrate its utility, PianoVAM was benchmarked on piano transcription tasks in both audio-only and audio-visual settings. The results showed that models trained on PianoVAM, especially when combined with MAESTROv3, significantly improved transcription performance. Crucially, the audio-visual experiments highlighted how visual information can enhance transcription, particularly under challenging acoustic conditions like noise and reverberation. By using visual cues to filter out physically implausible notes, the system improved onset prediction precision.
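The fusion architecture is not described in this article, but the filtering idea itself is simple: suppress acoustic onset candidates on keys where the video shows no hand nearby. A toy sketch, assuming a precomputed visual plausibility mask rather than the authors' actual model:

```python
import numpy as np

def filter_onsets(onset_probs, visual_press_mask, threshold=0.5):
    """Suppress onsets that the video deems physically implausible.

    onset_probs: (frames, 88) acoustic onset probabilities from a transcription model.
    visual_press_mask: (frames, 88) boolean, True where a hand is over or near
    the corresponding key in the matching video frame.
    """
    # Zero out onset probabilities for keys no hand could have struck
    gated = np.where(visual_press_mask, onset_probs, 0.0)
    return gated > threshold
```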

While PianoVAM offers significant advancements, the researchers acknowledge certain biases in performer identity, pedal usage, and composer representation, as well as challenges with visual ambiguities in fingering detection. Future work aims to expand the dataset with expert performances, multi-angle videos, and richer contextual data, alongside improvements in fingering detection precision using advanced hand pose estimation and 3D reconstruction models.

In conclusion, PianoVAM represents a significant step forward in multimodal music performance research, providing a rich resource for advancing Music Information Retrieval applications. More details about the dataset and its applications are available in the full research paper.
