Unveiling PianoVAM: A New Multimodal Dataset for Piano Performance Analysis

TLDR: PianoVAM is a new, comprehensive multimodal dataset for piano performance research, featuring synchronized videos, audio, MIDI, hand landmarks, fingering labels, and metadata. Collected from amateur pianists during practice sessions, it addresses limitations of existing datasets by offering a rich combination of modalities. The dataset utilizes a semi-automated method for fingering annotation and has been benchmarked for piano transcription, demonstrating how visual information can significantly improve performance, especially in noisy or reverberant environments. It aims to advance Music Information Retrieval by providing detailed insights into the expressive sound creation process.

The world of music performance is rich and complex, involving much more than just sound. From the subtle movements of a musician’s hands to their posture, visual elements play a crucial role in how music is created and perceived. This multimodal nature of music has sparked a growing interest in collecting data that goes beyond just audio, especially within the Music Information Retrieval (MIR) community.

Addressing this need, a new and comprehensive dataset called PianoVAM has been introduced. Developed by a team of researchers including Yonghyun Kim, Junhyung Park, Joonhyung Bae, Kirak Kim, Taegyun Kwon, Alexander Lerch, and Juhan Nam, PianoVAM offers an unprecedented look into piano performances by combining various data types. This includes synchronized videos, high-quality audio, MIDI data, detailed hand landmarks, fingering labels, and extensive metadata.

The creation of PianoVAM involved recording amateur pianists during their everyday practice sessions using a Yamaha Disklavier piano. This setup captured both audio and MIDI information, alongside synchronized top-view videos, all under realistic and varied performance conditions. To enrich the dataset further, hand landmarks were extracted using a pre-trained hand pose estimation model, and fingering labels were generated through a semi-automated annotation algorithm. The researchers also openly discussed the challenges faced during data collection and the intricate process of aligning these diverse data modalities.
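The article does not name the hand pose model used; a common open-source choice for this kind of top-view hand tracking is MediaPipe Hands. Below is a minimal sketch of per-frame landmark extraction under that assumption (the video filename is hypothetical, and the actual PianoVAM pipeline may differ):

```python
import cv2
import mediapipe as mp

# Hypothetical input file; PianoVAM's actual videos are top-view practice recordings
cap = cv2.VideoCapture("practice_session.mp4")
hands = mp.solutions.hands.Hands(static_image_mode=False,
                                 max_num_hands=2,
                                 min_detection_confidence=0.5)

all_landmarks = []
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # MediaPipe expects RGB input; OpenCV decodes frames as BGR
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    frame_hands = []
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            # 21 normalized (x, y, z) landmarks per detected hand
            frame_hands.append([(lm.x, lm.y, lm.z) for lm in hand.landmark])
    all_landmarks.append(frame_hands)

cap.release()
hands.close()
```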

What Makes PianoVAM Unique?

Compared to existing datasets, PianoVAM stands out for its comprehensiveness. While datasets like MAESTRO offer high-quality audio and MIDI, they lack visual components. Others, such as OMAPS2 and PianoYT, include video but have limited or pseudo-MIDI annotations. PianoVAM, on the other hand, provides real performance audio, synchronized MIDI, top-view videos, and unique fingering pseudo-labels, making it a truly multimodal resource for researchers.

For fingering data specifically, PianoVAM offers a large collection of over a million notes with fingering information, generated through a hybrid algorithm that combines automated detection with human refinement. This contrasts with datasets like PIG, which are smaller and manually annotated, or ThumbSet, which relies on crowd-sourced data with less clear annotation sources.

How the Data Was Collected and Processed

The data acquisition system was designed for unsupervised recording, allowing pianists to practice naturally. It involved an overhead webcam for video, a dedicated microphone for audio, and a Disklavier piano for MIDI signals. OBS Studio recorded the audio-video stream, while Logic Pro simultaneously captured the audio-MIDI stream. A common audio signal served as a reference for precise time alignment across all modalities.
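The article does not detail how the reference audio is used, but a standard way to estimate the offset between two recordings of the same signal is cross-correlation. A minimal sketch, with illustrative function and parameter names:

```python
import numpy as np
from scipy.signal import fftconvolve

def estimate_offset_seconds(ref, other, sr):
    """Estimate the lag of `other` relative to `ref` via cross-correlation.

    Both inputs are mono float arrays sampled at the same rate `sr`.
    Shifting `other` back by the returned number of seconds aligns
    it with `ref`.
    """
    # Cross-correlation computed as convolution with a time-reversed reference
    corr = fftconvolve(other, ref[::-1], mode="full")
    # Peak index, shifted so that zero lag maps to an offset of 0
    lag = int(np.argmax(corr)) - (len(ref) - 1)
    return lag / sr
```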

After collection, the data underwent rigorous pre-processing. This included refining the time alignment of audio and MIDI using techniques similar to those used for the MAESTRO dataset, involving down-mixing, resampling, and Dynamic Time Warping. Additionally, a loudness normalization procedure was applied to ensure consistency across recordings, especially given the varied recording conditions over time.
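As a rough illustration of the Dynamic Time Warping step, the sketch below aligns the real recording against audio synthesized from the MIDI, using librosa's DTW over chroma features. The exact MAESTRO-style pipeline may differ; the feature choice and parameters here are assumptions for illustration:

```python
import librosa
import numpy as np

def dtw_align(audio_path, midi_audio_path, sr=22050, hop=512):
    """Return a time mapping (seconds) between a recording and synthesized MIDI audio."""
    # Down-mix to mono and resample both signals to a common rate
    y_a, _ = librosa.load(audio_path, sr=sr, mono=True)
    y_m, _ = librosa.load(midi_audio_path, sr=sr, mono=True)
    # Chroma features are robust to timbre differences between real and synthesized audio
    c_a = librosa.feature.chroma_cqt(y=y_a, sr=sr, hop_length=hop)
    c_m = librosa.feature.chroma_cqt(y=y_m, sr=sr, hop_length=hop)
    # DTW returns an accumulated cost matrix and the optimal warping path
    _, wp = librosa.sequence.dtw(X=c_a, Y=c_m)
    # The path comes in reverse order; convert frame indices to seconds
    return np.asarray(wp)[::-1] * hop / sr
```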

Dataset Insights

The PianoVAM dataset comprises 106 solo piano recordings from 10 amateur performers, totaling approximately 21 hours. The repertoire is diverse, covering 38 composers from various eras, and includes improvisations. An interesting finding from the dataset statistics is the significantly higher use of the sustain pedal in PianoVAM compared to MAESTROv3, which the researchers attribute to the repertoire, amateur playing tendencies, and studio acoustics.

Fingering Annotation: A Hybrid Approach

A key innovation in PianoVAM is its method for generating fingering annotations. The algorithm processes performance videos to map hand landmarks to potential finger candidates for each MIDI note. For clear cases, fingering is determined automatically with high precision (around 95%). In ambiguous situations, a custom graphical user interface (GUI) allows a human annotator to make the final selection, ensuring complete and accurate labels.
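The authors' exact algorithm is not spelled out in this article, but the core idea, ranking fingers by fingertip-to-key distance at each note onset and deferring ambiguous cases to a human, can be sketched as follows. The landmark indices, thresholds, and key-registration step are all hypothetical:

```python
import numpy as np

# MediaPipe-style fingertip landmark indices (thumb=1 .. pinky=5); hypothetical mapping
FINGERTIPS = {1: 4, 2: 8, 3: 12, 4: 16, 5: 20}

def finger_candidates(landmarks, key_center, max_dist=0.03, margin=1.5):
    """Rank fingers by fingertip distance to the pressed key's center.

    landmarks: (21, 2) normalized (x, y) for one hand at the note onset.
    key_center: (x, y) of the key, from a separate keyboard-registration step.
    Returns a single finger number when one candidate is clearly closest,
    otherwise a list of ambiguous candidates to defer to a GUI annotator.
    """
    dists = {f: np.linalg.norm(landmarks[i, :2] - key_center)
             for f, i in FINGERTIPS.items()}
    ranked = sorted(dists.items(), key=lambda kv: kv[1])
    (f1, d1), (_, d2) = ranked[0], ranked[1]
    if d1 < max_dist and d2 > margin * d1:
        return f1                                   # unambiguous: auto-label
    return [f for f, d in ranked if d < max_dist]   # ambiguous: manual review
```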

Benchmarking and Applications

To demonstrate its utility, PianoVAM was benchmarked on piano transcription tasks in both audio-only and audio-visual settings. The results showed that models trained on PianoVAM, especially when combined with MAESTROv3, significantly improved transcription performance. Crucially, the audio-visual experiments highlighted how visual information can enhance transcription, particularly under challenging acoustic conditions like noise and reverberation. By using visual cues to filter out physically implausible notes, the system improved onset prediction precision.
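The fusion architecture is not described in this article, but the filtering idea itself is simple: suppress acoustic onset candidates on keys where the video shows no hand nearby. A toy sketch, assuming a precomputed visual plausibility mask rather than the authors' actual model:

```python
import numpy as np

def filter_onsets(onset_probs, visual_press_mask, threshold=0.5):
    """Suppress onsets that the video deems physically implausible.

    onset_probs: (frames, 88) acoustic onset probabilities from a transcription model.
    visual_press_mask: (frames, 88) boolean, True where a hand is over or near
    the corresponding key in the matching video frame.
    """
    # Zero out onset probabilities for keys no hand could have struck
    gated = np.where(visual_press_mask, onset_probs, 0.0)
    return gated > threshold
```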

While PianoVAM offers significant advancements, the researchers acknowledge certain biases in performer identity, pedal usage, and composer representation, as well as challenges with visual ambiguities in fingering detection. Future work aims to expand the dataset with expert performances, multi-angle videos, and richer contextual data, alongside improvements in fingering detection precision using advanced hand pose estimation and 3D reconstruction models.

In conclusion, PianoVAM represents a significant step forward in multimodal music performance research, providing a rich resource for advancing Music Information Retrieval applications. More details about the dataset and its applications are available in the full research paper.
