
Measuring How AI Embeddings Combine Meanings: A New Evaluation Framework

TL;DR: Researchers developed a two-step evaluation framework to quantify additive compositionality in AI embeddings (word, sentence, and knowledge graph). They found that embeddings across different models and training stages exhibit significant linear compositionality, allowing them to generalize to unseen attribute combinations. The framework also identifies where linear composition breaks down, pointing to areas for future work on non-linear approaches.

Understanding how artificial intelligence models combine basic units of meaning to form complex ideas is crucial for their ability to generalize and interpret novel expressions. This concept, known as compositionality, is at the heart of how language models process information. A recent research paper, titled “Quantifying Compositionality of Classic and State-of-the-Art Embeddings,” delves into this very challenge, proposing a robust framework to measure how well different types of AI embeddings exhibit additive compositionality.

The study, authored by Zhijin Guo, Chenhao Xue, Zhaozhen Xu, Hongbo Bo, Yuxuan Ye, Janet B. Pierrehumbert, and Martha Lewis, highlights a long-standing debate in AI. Early static word embeddings like Word2vec made strong claims about their compositional nature, often demonstrated by simple analogies like “king – man + woman = queen.” However, these claims faced criticism for being overly simplistic. On the other hand, modern generative transformer models (like BERT, GPT, and Llama) offer immense flexibility but often lack clear boundaries on how context can shift meaning, potentially obscuring their compositional structure.

A Two-Step Approach to Quantifying Compositionality

To address this, the researchers formalized a two-step, generalized evaluation pipeline. This method is designed to be modality-agnostic, meaning it can be applied to various types of embeddings, including words, sentences, and knowledge graphs.

The first step focuses on **quantifying linearity**. This involves measuring the linear relationship between known attributes of entities (e.g., demographic information for users, concepts in a sentence, or morphological features of a word) and their corresponding embeddings. This is achieved using Canonical Correlation Analysis (CCA), a statistical method that identifies and quantifies shared information between two sets of variables.

The second step, **quantifying additive generalization**, assesses whether these linear components can be combined to predict embeddings for unseen attribute combinations. This is done through a “Leave-One-Out” (LOO) experiment. In this setup, the model learns to associate attributes with embeddings from a subset of data, then attempts to reconstruct the embedding for a left-out entity based on its attributes. The accuracy of this reconstruction is measured using metrics like L2 loss (reconstruction error), cosine similarity (alignment between predicted and actual embeddings), and retrieval accuracy (how well the predicted embedding identifies the correct entity).
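The LOO step and its three metrics can be sketched like this, again on synthetic data. The two one-hot attribute groups loosely mimic categorical user attributes; the least-squares composer, group sizes, dimensionality, and noise level are illustrative assumptions rather than the paper's exact setup.

```python
from itertools import product

import numpy as np

rng = np.random.default_rng(1)

# Synthetic "entities": every combination of two categorical attributes
# (e.g. 4 age bands x 6 occupations), each encoded as a pair of one-hots.
n_a, n_b, dim = 4, 6, 64
combos = list(product(range(n_a), range(n_b)))
n = len(combos)
attrs = np.zeros((n, n_a + n_b))
for i, (a, b) in enumerate(combos):
    attrs[i, a] = 1.0
    attrs[i, n_a + b] = 1.0

# Embeddings that are additive in the attributes, plus small noise.
W = rng.normal(size=(n_a + n_b, dim))
emb = attrs @ W + 0.05 * rng.normal(size=(n, dim))

l2, cos, hits = [], [], 0
for i in range(n):
    train = np.arange(n) != i
    # Fit additive components on all other entities by least squares...
    W_hat, *_ = np.linalg.lstsq(attrs[train], emb[train], rcond=None)
    # ...then predict the held-out embedding from its attributes alone.
    pred = attrs[i] @ W_hat
    l2.append(float(np.linalg.norm(pred - emb[i])))
    cos.append(float(pred @ emb[i]
                     / (np.linalg.norm(pred) * np.linalg.norm(emb[i]))))
    # Retrieval: does the prediction pick out the true entity?
    hits += int(np.argmax(emb @ pred) == i)

print(f"mean L2: {np.mean(l2):.3f}  mean cosine: {np.mean(cos):.3f}  "
      f"retrieval acc: {hits / n:.2f}")
```

Because the synthetic embeddings are additive by construction, all three metrics come out near their best values here; on real embeddings, how far they fall short of that is exactly what the framework quantifies.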

Experiments Across Diverse Data Modalities

The framework was rigorously applied to three distinct data modalities:

  • Sentence Embeddings: The study evaluated SBERT, GPT, and Llama models using sentences annotated with concepts from the Schema-Guided Dialogue (SGD) dataset. This allowed them to see if sentence meanings could be additively decomposed into their constituent concepts.
  • Knowledge Graph Embeddings: Using the MovieLens 1M dataset, user embeddings (derived from movie preferences) were analyzed against demographic attributes (gender, age, occupation) to see if these attributes composed linearly within the user embeddings.
  • Word Embeddings: Word2vec embeddings were examined for their ability to capture both semantic (using WordNet) and morphological (using MorphoLex) information, specifically looking at how roots and suffixes combine.
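The word-level setup can be illustrated with a toy root-plus-suffix check. The vectors below are fabricated stand-ins for Word2vec embeddings (a real experiment would use trained vectors and MorphoLex annotations), and the mean-offset estimate of the suffix direction is one simple way to make the additive idea concrete:

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 32

# Toy vectors standing in for Word2vec embeddings: each derived word is
# its root vector plus a shared suffix offset, with noise (by construction).
roots = {w: rng.normal(size=dim) for w in ["teach", "sing", "build", "paint"]}
suffix_true = rng.normal(size=dim)  # the shared "-er" direction
derived = {w: v + suffix_true + 0.1 * rng.normal(size=dim)
           for w, v in roots.items()}

# Estimate the suffix vector as the mean offset over all pairs but one...
held_out = "paint"
offsets = [derived[w] - roots[w] for w in roots if w != held_out]
suffix_hat = np.mean(offsets, axis=0)

# ...and test additive generalization on the held-out root/derived pair.
pred = roots[held_out] + suffix_hat
target = derived[held_out]
cos = float(pred @ target / (np.linalg.norm(pred) * np.linalg.norm(target)))
print(f"cosine(root + suffix_hat, derived): {cos:.3f}")
```

A high cosine on the held-out pair is the word-level analogue of the LOO experiment: the suffix behaves as a reusable additive component rather than a word-specific idiosyncrasy.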

Key Findings and Insights

The experiments yielded several important insights:

  • Across all modalities, a significant linear correlation was found between embeddings and their semantic features, confirming the foundational assumption for compositionality.
  • Additive generalization was consistently observed. For instance, sentence embeddings could be reconstructed for unseen concept combinations, and user embeddings generalized additive relationships to new user attributes. Word2vec embeddings also showed decomposition into root and suffix combinations.
  • Compositionality signals strengthened over the course of training, both for models like MultiBERT and for knowledge graph embeddings, indicating that models learn more compositional structure over time.
  • Interestingly, in transformer-based models like SBERT, compositionality generalization increased through earlier layers, peaking around layers 4 or 5, but then showed an abrupt decline in the final layer. This suggests that later layers might specialize in task-specific representations, potentially moving away from purely additive compositional structures.

Understanding Compositional Failures and Future Directions

Beyond successful cases, the framework also quantifies instances where additive compositionality breaks down. These “failure cases” are crucial as they highlight the limitations of linear composition and point to semantic phenomena that require more complex, non-linear interactions or context-dependent meanings. For example, fluctuations in retrieval accuracy across transformer layers suggest challenges in accurately representing concepts in natural language.

The researchers emphasize that while current models retain a surprising degree of additive compositional structure, there are consistent residuals that signal unresolved semantic complexities. These findings underscore the need for future research into more expressive, non-linear approaches to compositional representation.

This work provides a unified and statistically robust diagnostic for evaluating compositionality, offering valuable opportunities for improving the interpretability of representation learning in AI. For more details, see the full paper, “Quantifying Compositionality of Classic and State-of-the-Art Embeddings.”

Ananya Rao
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach out to her at: [email protected]
