TLDR: This paper explores how Large Language Models (LLMs), primarily trained on text, can implicitly understand and generate symbolic music. The researchers created a dataset of LLM-generated MIDI files from text prompts and used it to train neural networks for music classification and melody completion. While these models do not outperform ones trained on human-composed music, the results show that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting their potential for cross-domain learning in music.
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human language, and their prowess has extended to other symbolic domains like computer code and mathematics. However, their inherent ability to model and perceive symbolic music has remained largely unexplored. A recent research paper delves into this fascinating area, investigating how these text-trained AI models implicitly represent musical concepts and whether the music they generate can be useful for training other AI systems.
The paper, titled “Large Language Models’ Internal Perception of Symbolic Music,” by Andrew Shin and Kunitake Kaneko from Keio University, addresses a significant gap in our understanding of LLMs’ cross-domain capabilities. Unlike models specifically trained on musical data, general-purpose LLMs learn patterns solely from vast text corpora. The core question is whether this text-based learning equips them with a rudimentary understanding of musical structures, such as melody, harmony, and rhythm.
Generating Music from Text
To explore this, the researchers embarked on an ambitious data generation process. They created a novel dataset of MIDI files, a common digital format for symbolic music, entirely generated by an LLM (specifically GPT-4). They designed textual prompts that instructed the LLM to create four-track MIDI sequences—melody, chords, bass, and rhythm—for various combinations of 13 genres (from the TOP-MAGD taxonomy) and 25 styles (from the MASD framework), augmented with a randomly selected mood like ‘happy’ or ‘sad’.
The LLM was asked to output a pure JSON string encoding 8-bar sequences with specific constraints on pitch, duration, velocity, and start times. For instance, drum pitches were limited to kick, snare, and hi-hat, aligning with MIDI percussion standards. This process resulted in a massive dataset of 16,250 unique MIDI files, totaling over 780,000 note events, all generated without any predefined musical templates or explicit musical training for the LLM. This approach directly tests the LLM’s raw generative capacity for symbolic music based solely on its interpretation of textual prompts.
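To make the setup concrete, here is a minimal sketch of how such LLM output might be turned into a standard MIDI file. The field names, the example note values, and the use of the pretty_midi library are illustrative assumptions; the paper does not publish its exact JSON schema or conversion code.

```python
# A rough sketch of turning an LLM-emitted JSON note list into a MIDI file.
# The field names ("pitch", "start", "duration", "velocity") and the
# four-track layout are assumptions based on the paper's description,
# not the authors' actual schema.
import json
import pretty_midi

llm_output = """
{
  "melody": [{"pitch": 64, "start": 0.0, "duration": 0.5, "velocity": 90}],
  "chords": [{"pitch": 48, "start": 0.0, "duration": 2.0, "velocity": 70}],
  "bass":   [{"pitch": 36, "start": 0.0, "duration": 1.0, "velocity": 80}],
  "rhythm": [{"pitch": 36, "start": 0.0, "duration": 0.25, "velocity": 100}]
}
"""

tracks = json.loads(llm_output)
pm = pretty_midi.PrettyMIDI()

for name, notes in tracks.items():
    # Put the rhythm track on the MIDI drum channel; others use a generic program.
    inst = pretty_midi.Instrument(program=0, is_drum=(name == "rhythm"), name=name)
    for n in notes:
        inst.notes.append(pretty_midi.Note(
            velocity=n["velocity"],
            pitch=n["pitch"],
            start=n["start"],
            end=n["start"] + n["duration"],
        ))
    pm.instruments.append(inst)

pm.write("llm_generated.mid")
```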
Testing Musical Perception and Utility
The study then conducted several experiments to evaluate the LLM’s musical perception and the utility of its generated data:
- Direct LLM Classification: Existing MIDI files were converted to JSON and fed directly into the LLM, which was prompted to classify their genre or style. This tested the LLM’s zero-shot ability to interpret symbolic music from its internal text-based representations.
- Neural Network Classification: A simple Convolutional Neural Network (CNN) was trained exclusively on the LLM-generated MIDI dataset to classify genres and styles, and its performance was benchmarked against established models trained on human-composed music (a rough sketch of such a classifier follows this list).
- Melody Completion: A transformer model was trained on the LLM-generated MIDI dataset to predict the next melodic phrase, assessing its ability to learn and generalize melodic structures (see the second sketch below).
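As a rough illustration of the second experiment, the sketch below shows what a small piano-roll CNN classifier could look like in PyTorch. The architecture, input shape, and 13-way genre output are assumptions made for illustration; the paper's exact network is not reproduced here.

```python
# Illustrative only: a small CNN over piano-roll tensors, in the spirit of the
# paper's classifier. Layer sizes and the 13-way genre output are assumptions.
import torch
import torch.nn as nn

class GenreCNN(nn.Module):
    def __init__(self, n_genres: int = 13):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_genres),
        )

    def forward(self, x):  # x: (batch, 1, 128 pitches, time steps)
        return self.classifier(self.features(x))

# Example forward pass on a dummy piano roll: 128 MIDI pitches x 256 time steps.
model = GenreCNN()
dummy = torch.zeros(8, 1, 128, 256)
logits = model(dummy)  # (8, 13) genre scores
```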
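Similarly, the melody-completion experiment can be pictured as next-note prediction with a small transformer. The sketch below is a hypothetical PyTorch version; the pitch-token vocabulary, context length, and model sizes are assumptions rather than the authors' configuration.

```python
# Illustrative only: a tiny transformer for next-note prediction over
# tokenized melodies (one token per MIDI pitch). Sizes are assumptions.
import torch
import torch.nn as nn

class MelodyTransformer(nn.Module):
    def __init__(self, vocab_size: int = 128, d_model: int = 64, max_len: int = 64):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):  # tokens: (batch, seq_len) of pitch indices
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.tok(tokens) + self.pos(pos)
        # Causal mask so each position only attends to earlier notes.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.encoder(x, mask=mask)
        return self.head(x)  # per-position logits over the next pitch

model = MelodyTransformer()
melody = torch.randint(0, 128, (4, 16))  # 4 melodies, 16 pitch tokens each
logits = model(melody)  # (4, 16, 128)
```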
Key Findings and Implications
The results did not outperform state-of-the-art models trained on human-composed music (nor was that the primary goal), but they yielded significant insights:
- The direct LLM classification, though limited, performed better than random chance, indicating some capacity to discern musical structure from text-based patterns. The LLM demonstrated an analytical approach, breaking down musical attributes like melody, chords, bass, and rhythm from the JSON data and synthesizing them into genre/style predictions.
- The CNN trained on the LLM-generated dataset significantly outperformed the direct LLM classification, affirming that supervised training with actual music data (even if AI-generated) provides richer clues for classification than an LLM relying solely on text-based reasoning. In some cases, it even surpassed certain baseline models trained on existing MIDI datasets for style classification.
- The transformer model for melody completion also exceeded random chance, suggesting it learned basic melodic patterns like pitch continuity and rhythmic flow from the LLM-generated data.
These findings highlight that LLMs can indeed infer rudimentary musical structures and temporal relationships from text, demonstrating their potential to implicitly encode musical patterns. The performance gap relative to models trained on human-crafted music underscores the limitations imposed by the lack of explicit musical context and the simplified nature of the generated data. However, the ability to exceed chance performance suggests an inherent capacity for cross-domain learning.
This research is not about creating the next chart-topping AI musician, but rather about understanding the fundamental capabilities of LLMs. It reveals their latent capacity to bridge text and symbolic music, positioning them as versatile learners capable of synthesizing knowledge across different structured systems. This opens new avenues for generative music systems driven by text-based AI and offers a novel perspective into the representational power of LLMs. For more details, you can read the full paper here.


