TLDR: Text embedding models, crucial for NLP, struggle to accurately capture nuanced numerical information in text, performing only slightly better than chance on a new financial dataset called EmbedNum-1K. LLM-based models show a slight advantage, but overall, models are sensitive to number formats, challenged by out-of-vocabulary numbers, and exhibit biases in numerical reasoning. The study reveals that general language understanding or frequent exposure to numbers doesn’t guarantee numerical precision, and rich context can obscure numerical details, highlighting a need for specialized designs to improve numerical capabilities.
Text embedding models are fundamental to many modern natural language processing (NLP) applications, transforming words and sentences into numerical vectors that machines can process. These vectors power everything from semantic search to advanced retrieval-augmented generation (RAG) systems. While these models have shown impressive performance on various benchmarks, a critical question has largely remained unaddressed: how well do they truly understand and encode nuanced numerical information within text?
Consider scenarios in finance or healthcare, where numerical precision is paramount. A statement like “Company X’s market share grew by 2%” carries a vastly different implication than “Company X’s market share grew by 20%.” Similarly, clinical notes detailing blood pressure readings of “120/80 mmHg” versus “180/110 mmHg” represent drastically different patient conditions. If embedding models fail to capture these subtle numerical distinctions, they could lead to misleading interpretations and potentially critical errors in decision-making systems.
Unveiling the Numeracy Gap with EmbedNum-1K
To address this crucial evaluation gap, researchers introduced EmbedNum-1K, a specialized dataset designed to rigorously test how well text embedding models preserve numerical information. This dataset, rooted in a financial context, comprises 1,000 synthetic samples. Each sample presents a question (Q) and two candidate answers (A+ and A-). The key is that A+ and A- differ only in their numerical values, with A+ being the correct answer based on the numerical condition in Q. For instance, if Q asks “Who owns over 15% of the company?”, A+ might be “Investor Alice owns a 20% stake,” while A- is “Investor Alice owns a 5% stake.” The task for the embedding model is to identify A+ as closer to Q in the embedding space.
The EmbedNum-1K dataset also features 17 variants, exploring how models handle numbers presented in different formats—integers, decimals, percentages, and even written forms like “six.” This allows for a fine-grained analysis of numerical understanding.
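To make the evaluation protocol concrete, here is a minimal sketch of the pairwise retrieval check, assuming a sentence-transformers model; the model name and the sample triple are illustrative placeholders, not the paper's actual evaluation code or data:

```python
# Minimal sketch of the EmbedNum-1K retrieval check. The model name and
# the sample triple are illustrative placeholders, not the paper's setup.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

samples = [
    {
        "q": "Which investor owns over 15% of the company?",
        "a_pos": "Investor Alice owns a 20% stake in the company.",
        "a_neg": "Investor Alice owns a 5% stake in the company.",
    },
]

correct = 0
for s in samples:
    q_emb, pos_emb, neg_emb = model.encode([s["q"], s["a_pos"], s["a_neg"]])
    # A sample counts as correct when the numerically valid answer (A+)
    # sits closer to the question than the incorrect one (A-).
    if cos_sim(q_emb, pos_emb) > cos_sim(q_emb, neg_emb):
        correct += 1

print(f"Retrieval accuracy: {correct / len(samples):.0%}")
```

Accuracy under this protocol, averaged over the 1,000 samples, is the metric reported in the results below, with 50% corresponding to random guessing.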
Key Findings: Models Struggle with Numbers
The empirical investigation, evaluating 13 widely used text embedding models (including transformer encoder-based, LLM-based, and commercial models), revealed a significant limitation: embedding models generally struggle to accurately capture numerical details in text. The average retrieval accuracy across all models was a mere 54%, only slightly better than random guessing (50%). This suggests that current training practices, which often prioritize overall semantic similarity, tend to overlook the fine-grained precision required for numerical content.
Interestingly, LLM-based embedding models showed a noticeable advantage, achieving about 5 percentage points higher accuracy than encoder-based models (56% vs. 51%). This superiority is likely linked to the advanced natural language understanding capabilities inherent in large language models.
The format of numbers also proved critical. Models treated mathematically equivalent values such as “8%” and “0.08” as distinct, with accuracy varying by up to 12 percentage points depending on the numeric format. They performed best on single-decimal numbers (e.g., 0.5, 0.8) but struggled significantly with four-digit integers and comma-separated numbers, often performing no better than random guessing.
Further analysis showed that simply improving a model’s “literacy” in a specific domain (e.g., fine-tuning on financial data) does not automatically translate into better “numeracy.” This highlights that numerical understanding requires specialized design considerations beyond general domain adaptation.
Out-of-vocabulary (OOV) numbers, such as complex decimals or comma-separated integers, posed a particular challenge. These numbers are often split into multiple sub-tokens during processing, which fragments the representation of their magnitude and makes them harder for models to interpret accurately. Moreover, mirroring human cognitive patterns, models also struggled with long, high-precision numbers, showing declining accuracy as the number of significant figures increased.
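A quick way to see this fragmentation is to run a few numbers through a subword tokenizer; the snippet below assumes a BERT-style WordPiece tokenizer, chosen purely for illustration:

```python
# Illustrative look at how subword tokenization fragments numbers.
# The tokenizer choice (BERT's WordPiece) is an assumption for demo purposes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for num in ["5", "0.5", "1,234", "3.14159", "20%"]:
    # Out-of-vocabulary numbers split into several sub-tokens, scattering
    # the digits that carry the value's magnitude across the sequence.
    print(f"{num!r} -> {tokenizer.tokenize(num)}")
```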
Deeper Insights into Embedding Numeracy
The study also uncovered several additional fascinating insights:
- Performance Asymmetry: Models systematically favored “greater-than” questions (e.g., “above 200”) over “less-than” questions (e.g., “below 200”), demonstrating a clear bias in numerical reasoning.
- Digit vs. Written Form: While converting numbers to their written form (e.g., “twenty-four” instead of “24”) offered a minor accuracy gain, the overall performance remained low. This suggests that the limitation is inherent to how models represent numerical content, not just an OOV issue.
- Frequency of Exposure: Surprisingly, numbers that appeared more frequently in training data did not consistently lead to better model performance. This indicates that simply increasing numerical data in pretraining might not be an efficient path to improved numeracy.
- Contextual Dilemma: A significant finding was the “granularity dilemma.” In context-rich sentences, the embedding’s capacity was largely consumed by the semantic context, weakening the representation of fine-grained numerical details. Models performed better in context-reduced settings, implying that additional context can sometimes “hide” the numbers.
- Probing vs. Task Performance: Traditional probing tests, which attempt to decode numerical values directly from embeddings, showed that encoder-based models yielded higher adjusted R² scores (meaning values were easier to decode), yet these same models performed worse on the actual retrieval task than LLM-based models. This challenges the reliability of probing as a metric for real-world numeracy in context-rich settings (a simplified probing sketch follows this list).
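For reference, a simplified version of such a probe might look like the sketch below: embed number-bearing sentences, fit a linear regressor from embeddings back to the underlying values, and score with adjusted R². The model, probe, and data here are all illustrative assumptions, not the paper's exact setup.

```python
# Simplified numeracy probe (illustrative; not the paper's exact protocol).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

# Synthetic number-bearing sentences; the template is a stand-in.
values = np.arange(1, 2001)
texts = [f"The company reported revenue of {v} million dollars." for v in values]
X = model.encode(texts)

X_tr, X_te, y_tr, y_te = train_test_split(X, values, test_size=0.25, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)

# Adjusted R² penalizes the probe for the (large) number of embedding
# dimensions p; the test set must satisfy n > p + 1 for it to be meaningful.
r2 = r2_score(y_te, probe.predict(X_te))
n, p = X_te.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"R² = {r2:.3f}, adjusted R² = {adj_r2:.3f}")
```

A high adjusted R² here only says the value is linearly decodable from the embedding; as the study's comparison shows, that does not guarantee the model ranks numerically correct answers higher in retrieval.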
Moving Forward
This research underscores a critical limitation in current text embedding models: their struggle to accurately preserve and interpret nuanced numerical information. The findings suggest that simply scaling up models or training data is insufficient. Future research needs to explore dedicated design considerations and specialized architectures that explicitly account for fine-grained numerical details. Addressing this “numeracy gap” is vital for advancing NLP applications in number-intensive domains like finance, healthcare, and scientific research, where numerical precision is not just important, but often critical for reliable decision-making.


