TLDR: Text embedding models, crucial for NLP, struggle to accurately capture nuanced numerical information in text, performing only slightly better than chance on a new financial dataset called EmbedNum-1K. LLM-based models show a slight advantage, but overall, models are sensitive to number formats, challenged by out-of-vocabulary numbers, and exhibit biases in numerical reasoning. The study reveals that general language understanding or frequent exposure to numbers doesn’t guarantee numerical precision, and rich context can obscure numerical details, highlighting a need for specialized designs to improve numerical capabilities.
Text embedding models are fundamental to many modern natural language processing (NLP) applications, transforming words and sentences into numerical vectors that machines can process. These vectors power everything from semantic search to advanced retrieval-augmented generation (RAG) systems. While these models have shown impressive performance on various benchmarks, a critical question has largely remained unaddressed: how well do they truly understand and encode nuanced numerical information within text?
Consider scenarios in finance or healthcare, where numerical precision is paramount. A statement like “Company X’s market share grew by 2%” carries a vastly different implication than “Company X’s market share grew by 20%.” Similarly, clinical notes detailing blood pressure readings of “120/80 mmHg” versus “180/110 mmHg” represent drastically different patient conditions. If embedding models fail to capture these subtle numerical distinctions, they could lead to misleading interpretations and potentially critical errors in decision-making systems.
Unveiling the Numeracy Gap with EmbedNum-1K
To address this crucial evaluation gap, researchers introduced EmbedNum-1K, a specialized dataset designed to rigorously test how well text embedding models preserve numerical information. This dataset, rooted in a financial context, comprises 1,000 synthetic samples. Each sample presents a question (Q) and two candidate answers (A+ and A-). The key is that A+ and A- differ only in their numerical values, with A+ being the correct answer based on the numerical condition in Q. For instance, if Q asks “Who owns over 15% of the company?”, A+ might be “Investor Alice owns a 20% stake,” while A- is “Investor Alice owns a 5% stake.” The task for the embedding model is to identify A+ as closer to Q in the embedding space.
The EmbedNum-1K dataset also features 17 variants, exploring how models handle numbers presented in different formats—integers, decimals, percentages, and even written forms like “six.” This allows for a fine-grained analysis of numerical understanding.
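To make the evaluation protocol concrete, here is a minimal sketch of the pairwise retrieval check, assuming a sentence-transformers model; the model name and the sample triple are illustrative placeholders, not the paper's actual evaluation code or data:

```python
# Minimal sketch of the EmbedNum-1K retrieval check. The model name and
# the sample triple are illustrative placeholders, not the paper's setup.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

samples = [
    {
        "q": "Which investor owns over 15% of the company?",
        "a_pos": "Investor Alice owns a 20% stake in the company.",
        "a_neg": "Investor Alice owns a 5% stake in the company.",
    },
]

correct = 0
for s in samples:
    q_emb, pos_emb, neg_emb = model.encode([s["q"], s["a_pos"], s["a_neg"]])
    # A sample counts as correct when the numerically valid answer (A+)
    # sits closer to the question than the incorrect one (A-).
    if cos_sim(q_emb, pos_emb) > cos_sim(q_emb, neg_emb):
        correct += 1

print(f"Retrieval accuracy: {correct / len(samples):.0%}")
```

Accuracy under this protocol, averaged over the 1,000 samples, is the metric reported in the results below, with 50% corresponding to random guessing.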
Key Findings: Models Struggle with Numbers
The empirical investigation, evaluating 13 widely used text embedding models (including transformer encoder-based, LLM-based, and commercial models), revealed a significant limitation: embedding models generally struggle to accurately capture numerical details in text. The average retrieval accuracy across all models was a mere 54%, only slightly better than random guessing (50%). This suggests that current training practices, which often prioritize overall semantic similarity, tend to overlook the fine-grained precision required for numerical content.
Interestingly, LLM-based embedding models showed a noticeable advantage, achieving about 5 percentage points higher accuracy than encoder-based models (56% vs. 51%). This superiority is likely linked to the advanced natural language understanding capabilities inherent in large language models.
The format of numbers also proved critical. Models treated mathematically equivalent values such as “8%” and “0.08” as distinct, with accuracy varying by up to 12 percentage points depending on the numeric format. They performed best on single-decimal numbers (e.g., 0.5, 0.8) but struggled significantly with four-digit integers and comma-separated numbers, often performing no better than random guessing.
Further analysis showed that simply improving a model’s “literacy” in a specific domain (e.g., fine-tuning on financial data) does not automatically translate into better “numeracy.” This highlights that numerical understanding requires specialized design considerations beyond general domain adaptation.
Out-of-vocabulary (OOV) numbers, such as complex decimals or comma-separated integers, posed a particular challenge. These numbers are often split into multiple sub-tokens during processing, which fragments the representation of their magnitude and makes them harder for models to interpret accurately. Moreover, mirroring human cognitive patterns, models also struggled with long, high-precision numbers, showing declining accuracy as the number of significant figures increased.
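A quick way to see this fragmentation is to run a few numbers through a subword tokenizer; the snippet below assumes a BERT-style WordPiece tokenizer, chosen purely for illustration:

```python
# Illustrative look at how subword tokenization fragments numbers.
# The tokenizer choice (BERT's WordPiece) is an assumption for demo purposes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for num in ["5", "0.5", "1,234", "3.14159", "20%"]:
    # Out-of-vocabulary numbers split into several sub-tokens, scattering
    # the digits that carry the value's magnitude across the sequence.
    print(f"{num!r} -> {tokenizer.tokenize(num)}")
```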
Deeper Insights into Embedding Numeracy
The study also uncovered several additional fascinating insights:
- Performance Asymmetry: Models systematically favored “greater-than” questions (e.g., “above 200”) over “less-than” questions (e.g., “below 200”), demonstrating a clear bias in numerical reasoning.
- Digit vs. Written Form: While converting numbers to their written form (e.g., “twenty-four” instead of “24”) offered a minor accuracy gain, the overall performance remained low. This suggests that the limitation is inherent to how models represent numerical content, not just an OOV issue.
- Frequency of Exposure: Surprisingly, numbers that appeared more frequently in training data did not consistently lead to better model performance. This indicates that simply increasing numerical data in pretraining might not be an efficient path to improved numeracy.
- Contextual Dilemma: A significant finding was the “granularity dilemma.” In context-rich sentences, the embedding’s capacity was largely consumed by the semantic context, weakening the representation of fine-grained numerical details. Models performed better in context-reduced settings, implying that additional context can sometimes “hide” the numbers.
- Probing vs. Task Performance: Traditional probing tests, which attempt to decode numerical values directly from embeddings, showed that encoder-based models yielded higher adjusted R² scores (meaning values were easier to decode), yet these same models performed worse on the actual retrieval task than LLM-based models. This challenges the reliability of probing as a metric for real-world numeracy in context-rich settings (a simplified probing sketch follows this list).
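For reference, a simplified version of such a probe might look like the sketch below: embed number-bearing sentences, fit a linear regressor from embeddings back to the underlying values, and score with adjusted R². The model, probe, and data here are all illustrative assumptions, not the paper's exact setup.

```python
# Simplified numeracy probe (illustrative; not the paper's exact protocol).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

# Synthetic number-bearing sentences; the template is a stand-in.
values = np.arange(1, 2001)
texts = [f"The company reported revenue of {v} million dollars." for v in values]
X = model.encode(texts)

X_tr, X_te, y_tr, y_te = train_test_split(X, values, test_size=0.25, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)

# Adjusted R² penalizes the probe for the (large) number of embedding
# dimensions p; the test set must satisfy n > p + 1 for it to be meaningful.
r2 = r2_score(y_te, probe.predict(X_te))
n, p = X_te.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R² = {r2:.3f}, adjusted R² = {adj_r2:.3f}")
```

A high adjusted R² here only says the value is linearly decodable from the embedding; as the study's comparison shows, that does not guarantee the model ranks numerically correct answers higher in retrieval.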
Moving Forward
This research underscores a critical limitation in current text embedding models: their struggle to accurately preserve and interpret nuanced numerical information. The findings suggest that simply scaling up models or training data is insufficient. Future research needs to explore dedicated design considerations and specialized architectures that explicitly account for fine-grained numerical details. Addressing this “numeracy gap” is vital for advancing NLP applications in number-intensive domains like finance, healthcare, and scientific research, where numerical precision is not just important, but often critical for reliable decision-making.


