TLDR: This research paper investigates how Large Language Models (LLMs) compose meaningful word representations from subword units. Through geometry analysis and probing tasks, the study examines structural similarity, semantic decomposability, and form retention across six LLMs. It identifies three distinct compositional strategies among these models, showing variations in how they preserve content and form information across layers. The findings suggest that pre-training largely determines these strategies and that contextualization significantly impacts composition, especially for certain model families.
Large Language Models (LLMs) are at the forefront of artificial intelligence, capable of understanding and generating human-like text. A fundamental aspect of how these models process language involves breaking down words into smaller units called subwords. For instance, the word “unbreakable” might be split into “un”, “break”, and “able”. The big question then becomes: how do LLMs effectively combine these subword pieces to grasp the meaning of the entire word? This research paper, titled “Understanding Subword Compositionality of Large Language Models,” delves deep into this very question, exploring the intricate ways LLMs build word representations from their subword components.
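To see what subword splitting looks like in practice, here is a minimal sketch using a publicly available tokenizer from the Hugging Face `transformers` library. The exact segmentation depends on each model's vocabulary, so the `gpt2` checkpoint here is just an illustrative stand-in for the models in the study, and the splits it produces may differ from the "un" / "break" / "able" example above.

```python
# Minimal illustration of subword tokenization. The checkpoint is an
# illustrative choice; each model's vocabulary splits words differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["unbreakable", "dog", "prepared"]:
    # Tokenize each word in isolation to inspect its subword pieces.
    pieces = tokenizer.tokenize(word)
    print(f"{word} -> {pieces}")
# A longer word like "unbreakable" typically splits into several
# pieces, while a short root word like "dog" stays a single token.
```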
The study, conducted by Qiwei Peng, Yekun Chai, and Anders Søgaard from the University of Copenhagen and ETH Zurich, investigates three crucial aspects of subword compositionality: structural similarity, semantic decomposability, and form retention. Structural similarity examines how the geometric arrangement of combined subword representations relates to the representation of the whole word. Semantic decomposability probes whether the model understands if a word’s meaning can be broken down into its parts (like “unbreakable”) or if it’s a single, indivisible unit (like “dog”). Form retention looks at whether surface-level features, such as the length of a word, are preserved as subwords are combined.
To analyze these aspects, the researchers employed two main methodologies. First, they used Procrustes Analysis for the geometry analysis. This method finds the best orthogonal transformation (essentially a rotation) that aligns the composed subword vectors with the whole-word vectors, then measures how well the two point clouds match. Imagine trying to fit two shapes together: Procrustes Analysis finds the best way to align them and quantifies the remaining mismatch. The study found that simple addition of subword representations consistently yielded the highest structural similarity to the whole-word representations across all models and layers, as sketched below.
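To make this concrete, here is a minimal sketch of such an analysis using SciPy's orthogonal Procrustes solver, with synthetic vectors standing in for real composed and whole-word representations. The function name and scoring choice (mean cosine similarity after alignment) are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_similarity(composed: np.ndarray, whole: np.ndarray) -> float:
    """Rotate `composed` onto `whole` and return the mean cosine
    similarity after alignment (closer to 1 = more similar geometry)."""
    # Center both point clouds so the rotation is fit around the origin.
    composed = composed - composed.mean(axis=0)
    whole = whole - whole.mean(axis=0)
    # R is the orthogonal matrix minimizing ||composed @ R - whole||_F.
    R, _ = orthogonal_procrustes(composed, whole)
    aligned = composed @ R
    cos = (aligned * whole).sum(axis=1) / (
        np.linalg.norm(aligned, axis=1) * np.linalg.norm(whole, axis=1))
    return float(cos.mean())

# Toy demo: the whole-word vectors are a rotated, noisy copy of the
# additively composed vectors, so the score should be close to 1.
rng = np.random.default_rng(0)
composed = rng.standard_normal((200, 64))
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
whole = composed @ Q + 0.05 * rng.standard_normal((200, 64))
print(f"Procrustes similarity: {procrustes_similarity(composed, whole):.3f}")
```

In a real run, `composed` would be built by summing each word's subword vectors at a given layer and `whole` would hold the matching whole-word vectors from the same layer.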
Interestingly, the geometry analysis revealed that the LLMs fall into three distinct groups based on their compositional strategies. Aya-expanse and Gemma2 showed consistently high structural similarity, maintaining a strong alignment between composed and whole-word vectors throughout their layers. Falcon and Qwen2.5 exhibited good structural similarity in early layers that weakened in later layers. Llama3 and Llama3.1, on the other hand, displayed only moderate structural similarity, primarily in their initial embedding layer, with a rapid decline thereafter. The research also found that instruction tuning (fine-tuning models to follow natural-language instructions) had minimal impact on these core compositional patterns; the strategies are largely established during pre-training.
Further insights came from examining how different word types behave. Non-root words, which are decomposable into meaningful units (e.g., “prepared”), consistently showed higher structural similarity than root words (e.g., “dog”), which are semantic atoms. This suggests that LLMs find it easier to compose representations for words with an inherently modular structure. The study also explored the impact of contextualization, where a word’s subwords are encoded together in a single sequence rather than in isolation. When subwords were contextualized, all models showed stronger linear alignment, with the Llama models in particular reaching high levels of isometry in their middle layers. This indicates that some LLMs need the subwords to interact within a shared context before a linearly alignable composed representation emerges, hinting at different underlying composition mechanisms.
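One way to picture the contextualized-versus-isolated distinction is to extract hidden states both ways and compose them. The sketch below does this with a small public checkpoint standing in for the larger models studied; the `compose` helper is a hypothetical illustration, not the paper's code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; the extraction logic is the same for
# larger models such as Llama3 or Gemma2.
name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def compose(word: str, layer: int, contextualized: bool) -> torch.Tensor:
    """Sum the subword vectors of `word` at a given layer.

    contextualized=True runs all subwords through the model in one
    sequence, so the pieces attend to each other; False encodes each
    subword on its own, keeping them fully isolated."""
    ids = tok(word, return_tensors="pt")["input_ids"]  # (1, n_subwords)
    if contextualized:
        hidden = model(ids).hidden_states[layer]        # (1, n, d)
        return hidden[0].sum(dim=0)
    vecs = [model(ids[:, i:i + 1]).hidden_states[layer][0, 0]
            for i in range(ids.shape[1])]
    return torch.stack(vecs).sum(dim=0)

print(compose("unbreakable", layer=6, contextualized=True).shape)
```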
The second methodology involved probing analysis, where simple classification or regression models were trained to predict specific properties from the LLM’s representations. For semantic decomposability, the probe classified whether a word was a root or non-root word. The results were striking: composed representations reliably preserved this content information (over 80% F1) across all models and layers, regardless of the variations in structural similarity. This suggests that even when the geometric alignment is imperfect, the essential semantic meaning is still encoded.
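The general recipe for such a probe is straightforward. Here is a minimal sketch with scikit-learn, using synthetic features in place of real LLM representations; the function name and data are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def probe_decomposability(X: np.ndarray, y: np.ndarray) -> float:
    """Train a linear probe to predict root (0) vs non-root (1) from
    composed representations and return the held-out F1 score."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te))

# Toy demo with synthetic features; in practice X would hold the
# layer-wise composed subword vectors and y the root/non-root labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = rng.standard_normal((1000, 128)) + y[:, None] * 0.5
print(f"probe F1: {probe_decomposability(X, y):.2f}")
```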
For form retention, the task was to predict the length of the word. Here, a different pattern emerged. Word length information was strongest in the early layers, gradually decreased in the middle layers (as models abstract away surface features), and then surprisingly re-emerged in the final layers. This suggests a dynamic process where early layers capture explicit surface-level features, middle layers prioritize semantic abstraction, and later layers might re-integrate some form-related information. Crucially, the same three groups of LLMs identified in the geometry analysis also appeared in their word length prediction patterns, reinforcing the idea of distinct compositional strategies.
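A layer-wise version of the same recipe, switched to regression, could trace this rise-dip-rise trajectory. The sketch below fits a ridge regressor per layer on synthetic stand-in features; all names and data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def length_probe_by_layer(reps_by_layer, lengths):
    """For each layer, fit a ridge regressor from composed vectors to
    word length and report cross-validated R^2; a dip-and-recovery
    curve would match the early-strong / mid-weak / late-reemerging
    pattern described above."""
    scores = []
    for layer, X in enumerate(reps_by_layer):
        r2 = cross_val_score(Ridge(alpha=1.0), X, lengths,
                             cv=5, scoring="r2").mean()
        scores.append((layer, r2))
    return scores

# Toy demo: three fake "layers" where length information fades and
# then returns (real inputs would be per-layer composed vectors).
rng = np.random.default_rng(0)
lengths = rng.integers(3, 12, size=500).astype(float)
noise = [0.5, 5.0, 1.0]  # the middle "layer" is noisiest
layers = [lengths[:, None] + n * rng.standard_normal((500, 32))
          for n in noise]
for layer, r2 in length_probe_by_layer(layers, lengths):
    print(f"layer {layer}: R^2 = {r2:.2f}")
```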
In conclusion, this research offers valuable insight into the “black box” of LLM subword composition. It demonstrates that LLMs employ diverse strategies to construct word representations from subwords, strategies that fall broadly into three groups. These strategies shape how structural similarity, semantic content, and formal features like word length are preserved across a model’s layers. The findings underscore the role of the pre-training data mixture in shaping these compositional dynamics. For a deeper dive into the experimental details and results, you can read the full paper here.


