How Language Models Build Their Lexicon: Early Organization of Semantic and Syntactic Structures

TLDR: This research paper investigates the development of vocabulary embeddings in large language models (LLMs) during training. It finds that semantic and syntactic linguistic structures are rapidly organized early in the training process. High-frequency words stabilize their representations faster and overcome initial biases, while low-frequency words retain some influence from their random initializations. Embeddings continue to evolve after initial linguistic stabilization, primarily by refining morphological relationships between rare words. The study underscores the significant and distinct roles of word frequency and function in shaping the internal lexicon of LLMs.

Large language models (LLMs) map each input token to an embedding vector and then transform those vectors across many layers. A recent research paper asks how these input vocabulary representations are structured and how that structure evolves over the course of LLM training.

The study, titled “Vocabulary Embeddings Organize Linguistic Structure Early in Language Model Training,” was conducted by Isabel Papadimitriou from the University of British Columbia and Jacob Prince from Harvard University. Their work traces how word embeddings in LLMs develop and organize linguistic information during training.

To investigate this, the researchers employed Representational Similarity Analysis (RSA), a method that correlates the geometric structure of embeddings with various linguistic metrics, including semantic, syntactic, and frequency-based features. They analyzed two prominent open-source models, Pythia 12B and OLMo 7B, across numerous training checkpoints.
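To make the idea of RSA concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code: the cosine similarity measure, the Spearman rank correlation, the toy part-of-speech tags, and every function name below are assumptions chosen for illustration. The sketch builds a pairwise similarity matrix from the embeddings, builds a "hypothesis" matrix from a linguistic feature, and correlates the two over all word pairs.

```python
import numpy as np

def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between the rows of emb (n_words x dim)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def average_ranks(x):
    """Ranks starting at 1; tied values share the average of their ranks."""
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)        # highest rank inside each group of ties
    first = last - counts + 1       # lowest rank inside each group of ties
    return ((first + last) / 2.0)[inverse]

def rsa_score(emb, hypothesis):
    """Spearman correlation between embedding similarities and a hypothesis matrix."""
    iu = np.triu_indices(emb.shape[0], k=1)   # count each unordered word pair once
    model = average_ranks(cosine_similarity_matrix(emb)[iu])
    target = average_ranks(hypothesis[iu])
    return float(np.corrcoef(model, target)[0, 1])

# Toy data (hypothetical): six words, 0 = noun, 1 = verb.
rng = np.random.default_rng(0)
pos_tags = np.array([0, 0, 0, 1, 1, 1])
# Hypothesis: two words are "similar" (1.0) when they share a tag, else 0.0.
hypothesis = (pos_tags[:, None] == pos_tags[None, :]).astype(float)

# Embeddings with no structure vs. embeddings clustered by tag.
random_emb = rng.normal(size=(6, 16))
centers = np.eye(16)[:2]                      # one direction per tag
structured_emb = centers[pos_tags] + 0.05 * rng.normal(size=(6, 16))

random_score = rsa_score(random_emb, hypothesis)
structured_score = rsa_score(structured_emb, hypothesis)
print("random embeddings:    ", round(random_score, 3))
print("structured embeddings:", round(structured_score, 3))
```

Running the same score on the embedding matrix saved at each training checkpoint would give a curve of how strongly the geometry matches a given linguistic hypothesis over time. Because a binary hypothesis matrix contains many ties, the rank correlation cannot reach 1 even when the clusters are perfectly separated.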

A significant finding is that the vocabulary embedding geometry rapidly aligns with semantic and syntactic features early in the training cycle. For instance, semantic structures, which relate to word meanings, show high correlations and stabilize very quickly, often within the initial 10,000 training steps. Syntactic organization, such as part-of-speech categories and verb classes, also peaks early, although these correlations tend to be slightly lower than those for semantic features. Interestingly, a combined hypothesis using both part-of-speech and Wiktionary tags demonstrated a more gradual and sustained increase in correlation, suggesting that the model’s representations might capture more complex, interacting linguistic features.

Word frequency plays a distinct and continuous role in shaping these embeddings. High-frequency words, like common articles and prepositions (e.g., “the,” “of”), converge to their final vector representations much faster than less frequent or lexical words. However, when considering the impact per exposure, high-frequency words change more slowly with each update compared to low-frequency words, which undergo more significant shifts per occurrence despite being seen less often. The study also revealed that low-frequency words tend to retain some influence from their random initializations, whereas the most frequent words completely shed these early biases. Over time, the vocabulary embeddings progressively organize themselves to reflect frequency rank relationships, meaning words with similar frequency standings become geometrically closer.
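Two of the quantities described above, how close a word already is to its final vector and how much it still resembles its random initialization, can be measured with per-word cosine similarities across checkpoints. The sketch below is an illustration only: the trajectory is simulated, the per-word rates are hypothetical stand-ins for word frequency, and the function names are not from the paper. It does not model the per-exposure effect.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

def similarity_to_reference(checkpoints, reference):
    """Per-word cosine similarity to a reference matrix, one row per checkpoint."""
    return np.stack([rowwise_cosine(c, reference) for c in checkpoints])

# Simulated trajectory (an assumption, not data from the paper):
# every word moves from a random start toward a random target vector.
rng = np.random.default_rng(0)
n_words, dim = 4, 256
init = rng.normal(size=(n_words, dim))
target = rng.normal(size=(n_words, dim))

# Hypothetical rates: words 0-1 stand in for frequent words, words 2-3 for rare ones.
rate = np.array([1.0, 1.0, 0.05, 0.05])
steps = [0, 1, 5, 20]

checkpoints = []
for t in steps:
    mix = 1.0 - np.exp(-rate * t)   # fraction of the way from init to target
    checkpoints.append((1.0 - mix)[:, None] * init + mix[:, None] * target)

to_final = similarity_to_reference(checkpoints, checkpoints[-1])
to_init = similarity_to_reference(checkpoints, checkpoints[0])

print("similarity to final checkpoint (rows = steps, columns = words)")
print(np.round(to_final, 2))
print("similarity to initialization (rows = steps, columns = words)")
print(np.round(to_init, 2))
```

In this simulation the fast-moving words are near their final vectors after a single step and end up almost orthogonal to where they started, while the slow-moving words still overlap with their initialization at the last checkpoint. With real checkpoints, `checkpoints` would be replaced by the saved embedding matrices.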

Even after the initial stabilization of semantic and syntactic features, which occurs around 15% into training, the embeddings continue to evolve. The research shows that individual word embeddings continue to shift substantially in absolute terms. A qualitative analysis of these later changes revealed a consistent pattern: rare and technical nouns, along with their morphological inflections (e.g., “galaxy” and “galaxies”), move significantly closer to each other. This suggests that later stages of training are crucial for refining these specific, often less frequent, word relationships, potentially as a way for the model to learn their shared meaning.
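One simple way to look for this kind of late movement is to compare pairwise similarities at two checkpoints and list the word pairs whose similarity increased the most. The following sketch assumes cosine similarity and uses a five-word toy vocabulary with simulated vectors; the function names and the way "galaxies" is moved toward "galaxy" are hypothetical, and this is not presented as the paper's procedure.

```python
import numpy as np

def cosine_matrix(emb):
    """Pairwise cosine similarity between the rows of emb."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T

def pairs_that_moved_closer(vocab, emb_before, emb_after, top_k=3):
    """Word pairs ranked by their gain in cosine similarity between two checkpoints."""
    gain = cosine_matrix(emb_after) - cosine_matrix(emb_before)
    i_idx, j_idx = np.triu_indices(len(vocab), k=1)   # each unordered pair once
    order = np.argsort(-gain[i_idx, j_idx])[:top_k]   # largest gains first
    return [
        (vocab[i_idx[k]], vocab[j_idx[k]], float(gain[i_idx[k], j_idx[k]]))
        for k in order
    ]

# Toy vocabulary and simulated checkpoints (assumptions for illustration).
vocab = ["the", "of", "run", "galaxy", "galaxies"]
rng = np.random.default_rng(0)
emb_mid = rng.normal(size=(len(vocab), 64))

# Every word drifts a little between the two checkpoints...
emb_final = emb_mid + 0.05 * rng.normal(size=emb_mid.shape)
# ...and, hypothetically, "galaxies" is pulled toward "galaxy".
emb_final[4] = 0.3 * emb_mid[4] + 0.7 * emb_final[3]

top = pairs_that_moved_closer(vocab, emb_mid, emb_final)
for word_a, word_b, gain in top:
    print(f"{word_a} / {word_b}: similarity change {gain:+.2f}")
```

On a full vocabulary the pairwise matrix grows quadratically with the number of words, so in practice one would likely restrict the comparison to a sampled subset of words or to each word's nearest neighbors.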

This research offers a framework for understanding how linguistic structure is encoded in LLM embeddings and how that organization emerges during training. It highlights the distinct roles of word frequency and function in this process. The findings open the way for further work on how the evolution of vocabulary geometry contributes to specific capabilities in language models. For more detail, see the full paper, “Vocabulary Embeddings Organize Linguistic Structure Early in Language Model Training.”
