
Understanding Grammar in Language Models: A New Perspective on String Probability

TLDR: This research paper introduces a theoretical framework to explain how language models (LMs) learn grammar, proposing that string probability is determined by both the underlying message and the string’s grammaticality. It empirically validates three key predictions: LMs show strong correlations in probabilities for grammatical and ungrammatical sentences within meaning-matched minimal pairs, their probability differences align with human acceptability judgments in such pairs, and they exhibit poor separation between general grammatical and ungrammatical strings. The study provides a robust theoretical basis for using minimal pair comparisons to evaluate LM grammatical competence and sheds light on the complex relationship between statistical likelihood and linguistic correctness in AI.

The question of what language models (LMs) truly understand about grammar has been a subject of intense debate in the field of linguistics and artificial intelligence. While LMs are incredibly adept at generating human-like text, their internal grasp of grammatical rules, distinct from statistical likelihood, remains a complex puzzle. A recent research paper, “What Can String Probability Tell Us About Grammaticality?”, delves into this fundamental issue, offering a theoretical framework and empirical evidence to clarify the relationship between string probability, meaning, and grammaticality in LMs.

Traditionally, grammaticality and probability are treated as separate concepts in linguistics. Chomsky’s famous sentence “Colorless green ideas sleep furiously” is grammatically correct but highly improbable, while an ungrammatical string may appear often enough in real-world usage that an LM assigns it non-trivial probability. Because LMs are built to model language as it is actually used, they will always assign some probability to ungrammatical strings, which makes it difficult to read grammatical knowledge directly off raw probabilities.

The authors, Jennifer Hu, Ethan Gotlieb Wilcox, Siyuan Song, Kyle Mahowald, and Roger P. Levy, propose a framework where the probability of a string is influenced by two latent variables: its underlying ‘message’ and its ‘grammaticality’. This means that a string’s likelihood isn’t just about whether it’s grammatically correct, but also about how probable its conveyed meaning is. This distinction is crucial for understanding how LMs process language.
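One simple way to picture this framework (a sketch of the idea in generic notation, not necessarily the paper’s exact formulation) is to treat the message m and the grammaticality value g as latent variables that jointly give rise to the string s:

```latex
% Illustrative decomposition (an assumption for exposition, not the paper's exact notation):
% the string s is generated from a latent message m and a grammaticality value g.
p(s) \;=\; \sum_{m,\, g} p(s \mid m, g)\, p(m, g)
```

On this view, two strings that convey (approximately) the same message differ mainly through the grammaticality term, which is what makes controlled comparisons between them informative.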

The Role of Minimal Pairs

A common approach to evaluating LM grammar involves using “minimal pairs” – sentences that differ by a single grammatical feature, with one being grammatical and the other ungrammatical. For example, “The moon emerges” (grammatical) versus “*The moon emerge” (ungrammatical). The paper provides a formal argument for why this minimal pair approach is appropriate. When two sentences convey a sufficiently similar message, comparing their probabilities can reveal insights into the LM’s understanding of grammaticality, as the ‘message probability’ factor is largely controlled.
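To make the procedure concrete, here is a minimal sketch of a minimal-pair comparison using GPT-2 through the Hugging Face transformers library. The model choice, example sentences, and scoring details are illustrative assumptions, not the authors’ exact evaluation pipeline:

```python
# Minimal-pair log-probability comparison with GPT-2 (illustrative sketch).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities of the sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == input_ids, the returned loss is the mean negative
        # log-likelihood over the predicted tokens (all but the first), so
        # multiplying by that count recovers the summed log-probability.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

grammatical = "The moon emerges."
ungrammatical = "The moon emerge."

lp_gram = sentence_logprob(grammatical)
lp_ungram = sentence_logprob(ungrammatical)
print(f"log p(grammatical)   = {lp_gram:.2f}")
print(f"log p(ungrammatical) = {lp_ungram:.2f}")
print("Model prefers the grammatical member:", lp_gram > lp_ungram)
```

Because the helper sums token log-probabilities, longer sentences naturally receive lower scores, which is one practical reason length-matched minimal pairs are convenient for this kind of comparison.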

However, the framework also highlights that the probability of a message can sometimes overshadow the contribution of grammaticality. A model might assign a higher probability to an ungrammatical sentence if its message is far more common or plausible than the message of a grammatically correct but unusual sentence. This suggests that for a fair assessment, minimal pairs must be carefully constructed to ensure the messages are truly matched.

Three Key Predictions and Empirical Validation

The research paper outlines three main predictions derived from its theoretical framework, which were then tested empirically using 280,000 sentence pairs in both English and Chinese, and evaluated across various language models like GPT-2 and Llama-3.

The first prediction states that the log-probabilities of the grammatical and ungrammatical members of a minimal pair should be correlated across pairs. This is because, when the message is controlled, both strings are influenced by the same underlying message probability. The empirical results strongly confirmed this, showing a positive correlation that weakened as the ‘minimalness’ (semantic similarity) of the pairs decreased.
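In code, this prediction amounts to correlating the two scores across many pairs. The sketch below assumes the sentence_logprob helper defined above and a small, made-up list of pairs; real evaluations draw on large minimal-pair benchmarks such as BLiMP:

```python
# Prediction 1: correlate log p(grammatical) with log p(ungrammatical) across pairs.
from scipy.stats import pearsonr

# Hypothetical minimal pairs; real evaluations use large minimal-pair benchmarks.
pairs = [
    ("The moon emerges.", "The moon emerge."),
    ("The cats sleep.", "The cats sleeps."),
    ("She has eaten lunch.", "She has ate lunch."),
    ("These books are heavy.", "These books is heavy."),
]

gram_scores = [sentence_logprob(g) for g, _ in pairs]
ungram_scores = [sentence_logprob(u) for _, u in pairs]

r, p_value = pearsonr(gram_scores, ungram_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```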

The second prediction posits a correlation between the differences in log-probability assigned by models and human acceptability judgments within minimal pairs. If a model correctly captures grammatical distinctions, the probability gap between a grammatical and ungrammatical sentence should align with how humans perceive their acceptability. This prediction was largely validated, particularly for English datasets, suggesting that LMs’ probability differences can indeed reflect human grammatical intuitions when the message is controlled.
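The second prediction can be checked in the same way, by correlating each pair’s log-probability difference with a human acceptability rating for that pair. The ratings below are made up for illustration, and the gram_scores and ungram_scores lists come from the previous sketch:

```python
# Prediction 2: correlate per-pair log-probability differences with human ratings.
from scipy.stats import spearmanr

# Made-up human acceptability differences for the pairs above (illustrative only).
human_acceptability_diffs = [4.6, 4.2, 3.9, 4.4]

model_diffs = [g - u for g, u in zip(gram_scores, ungram_scores)]
rho, p_value = spearmanr(model_diffs, human_acceptability_diffs)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
```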

Finally, the third prediction addresses a long-standing observation by Chomsky: that grammatical and ungrammatical sentences are often scattered throughout a list ranked by statistical approximation to English, rather than being neatly separated. The paper’s framework explains this by showing that string probability alone, influenced by both message and grammaticality, does not inherently separate grammatical from ungrammatical strings. Empirical tests confirmed this, even with various normalizing transformations of probability, indicating substantial overlap between the scores of grammatical and ungrammatical sentences when not part of controlled minimal pairs.
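The third prediction can be probed by pooling grammatical and ungrammatical sentences that are not matched into pairs and asking how well a single probability threshold separates them, for example via ROC AUC. The sentence lists below are illustrative and reuse the sentence_logprob helper; raw log-probabilities could be replaced with normalized variants to mirror the transformations the paper tests:

```python
# Prediction 3: raw probability poorly separates unmatched grammatical and
# ungrammatical sentences; an AUC near 0.5 means the score distributions overlap.
from sklearn.metrics import roc_auc_score

# Illustrative, unmatched sentence sets (not drawn from the paper's data).
grammatical_sents = [
    "Colorless green ideas sleep furiously.",
    "The committee has approved the proposal.",
    "A quiet storm passed over the valley.",
]
ungrammatical_sents = [
    "The moon emerge over the hills.",
    "Committee the approved has proposal the.",
    "Her walked to store the quickly.",
]

scores = [sentence_logprob(s) for s in grammatical_sents + ungrammatical_sents]
labels = [1] * len(grammatical_sents) + [0] * len(ungrammatical_sents)

print("ROC AUC:", roc_auc_score(labels, scores))
```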


Implications for Evaluating Language Models

This research provides crucial theoretical grounding for the widespread practice of using minimal-pair probability comparisons to assess the grammatical knowledge of LMs. It clarifies that critiques based on the poor separation of general grammatical and ungrammatical strings do not necessarily invalidate the use of probability for grammatical evaluation, especially when controlled minimal pairs are used. The findings also highlight the importance of carefully designing evaluation procedures to factor out the influence of message probability when trying to isolate an LM’s sensitivity to grammatical rules.

The paper concludes by noting a fascinating tension: LMs are excellent at generating grammatical text, yet they struggle to discriminatively separate grammatical from ungrammatical strings based on raw probability. This observation connects to the broader “generative AI paradox,” where what an AI can create, it may not fully understand in a human-like cognitive sense. Ultimately, this work encourages a more nuanced approach to evaluating LMs, recognizing their unique computational architecture and the complex interplay between probability, meaning, and grammar.

