TLDR: ToMMeR is a lightweight model that efficiently detects entity mentions in text by probing early layers of large language models. It achieves high recall (93% zero-shot) and precision (90%+) across 13 NER benchmarks, demonstrating that LLMs naturally encode entity boundaries as an emergent capability. The model can also be extended to achieve competitive Named Entity Recognition performance, offering an efficient and transferable solution for information extraction.
A new research paper introduces ToMMeR, an efficient approach to identifying entity mentions in text using large language models (LLMs). The result matters for information extraction, a foundational natural language processing task whose existing systems are typically large and expensive to train.
Traditionally, identifying the text spans that refer to entities, a task known as mention detection, has been treated as complex and often conflated with entity typing (classifying a mention as a person, organization, or location, for example). Existing systems typically require extensive training on task-specific annotations and run to hundreds of millions of parameters. Recent evidence, however, suggests that LLMs may already encode entity-like spans as a byproduct of pretraining.
Introducing ToMMeR: A Lightweight Solution
ToMMeR, which stands for Token Matching for Mention Recognition, is a lightweight model designed to probe and extract these inherent mention detection capabilities from the early layers of any LLM backbone. With fewer than 300,000 parameters, ToMMeR is remarkably efficient and can be trained in a matter of hours, without modifying the underlying LLM.
The core idea behind ToMMeR is to leverage the latent binding signals within LLM representations. It uses a simple feed-forward head that aggregates token-matching and token-value features. This allows it to score spans directly from the frozen LLM’s representations, eliminating the need for schema input, prompting, or text generation.
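To make the setup concrete, here is a minimal sketch of probing a frozen backbone, assuming a Hugging Face `transformers` model. The choice of `gpt2` and of layer 1 is illustrative, not the paper's configuration.

```python
# Minimal sketch: obtain early-layer hidden states from a frozen LLM.
# Model and layer choices are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
backbone = AutoModel.from_pretrained("gpt2").eval()  # frozen: never fine-tuned

text = "Barack Obama visited Paris."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():  # the backbone receives no gradient updates
    outputs = backbone(**inputs, output_hidden_states=True)

# ToMMeR probes an early layer; layer 1 is used here for illustration.
h = outputs.hidden_states[1].squeeze(0)  # (seq_len, hidden_dim)

# A lightweight probe (under 300K parameters) maps span features derived
# from `h` to a mention probability; see the scoring sketch below.
```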
How ToMMeR Works
ToMMeR operates by analyzing how tokens within an LLM’s early layers relate to each other. It adapts the transformer’s attention mechanism to quantify the association between token pairs, using a cosine similarity metric. This helps identify internal token bindings within a potential entity span. Complementing this, token-level information is incorporated to provide crucial cues about a span’s boundaries and context. A logistic model then predicts the probability of a span being a valid entity mention based on these matching scores and token values.
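The paragraph above maps naturally onto a small scoring module. The sketch below is an illustration of that shape, not the paper's exact implementation: mean pairwise cosine similarity supplies the matching score, a learned linear map supplies per-token value features, and a logistic function combines them into a span probability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanScorer(nn.Module):
    """Illustrative span scorer: cosine token matching + token values."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.value = nn.Linear(hidden_dim, 1)          # per-token value feature
        self.w_match = nn.Parameter(torch.tensor(1.0))  # weight on matching score
        self.w_value = nn.Parameter(torch.tensor(1.0))  # weight on value score
        self.bias = nn.Parameter(torch.tensor(0.0))

    def forward(self, h: torch.Tensor, start: int, end: int) -> torch.Tensor:
        # h: (seq_len, hidden_dim) frozen hidden states; span is [start, end].
        span = h[start : end + 1]
        # Matching score: mean cosine similarity over all token pairs in the
        # span (self-pairs included), a proxy for internal token binding.
        normed = F.normalize(span, dim=-1)
        match = (normed @ normed.T).mean()
        # Value score: mean of per-token value features over the span.
        value = self.value(span).mean()
        # Logistic model over the aggregated features.
        logit = self.w_match * match + self.w_value * value + self.bias
        return torch.sigmoid(logit)  # P(span is an entity mention)
```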
The model is trained on Pile-NER, a diverse dataset of GPT-3.5 annotations from The Pile, which offers broad semantic coverage. To address the common issue of class imbalance in mention detection, ToMMeR employs a Balanced Binary Cross-Entropy loss function, ensuring fair contribution from both entity and non-entity spans.
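One common formulation of a balanced loss averages the positive-class and negative-class cross-entropy terms separately, so each class contributes equally no matter how rare mentions are. A minimal sketch of that formulation follows; whether it matches the paper's exact weighting is an assumption.

```python
import torch

def balanced_bce(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Balanced BCE: average positive and negative losses separately,
    so mentions and non-mentions contribute equally despite class imbalance."""
    eps = 1e-8
    pos = labels.bool()
    pos_loss = -torch.log(probs[pos] + eps).mean() if pos.any() else probs.new_zeros(())
    neg_loss = -torch.log(1 - probs[~pos] + eps).mean() if (~pos).any() else probs.new_zeros(())
    return 0.5 * (pos_loss + neg_loss)
```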
Key Findings and Performance
ToMMeR demonstrates impressive performance across various benchmarks:
- High Recall and Precision: Across 13 diverse Named Entity Recognition (NER) benchmarks, ToMMeR achieves a remarkable 93% zero-shot recall. Its precision, validated by an LLM-as-a-judge evaluation, stands at over 90%, indicating that it rarely produces incorrect predictions despite its high coverage.
- Emergent Capability: A cross-model analysis involving LLMs ranging from 14 million to 15 billion parameters revealed that diverse architectures converge on similar mention boundaries (Dice scores > 0.75; a minimal Dice computation is sketched after this list). This strongly suggests that mention detection is a shared, emergent capability of language modeling rather than an artifact of a specific dataset or architecture.
- Early Layer Detection: The ability to detect mentions emerges very early in the LLM’s computational process, with near-optimal performance achieved using representations from the first transformer layer. This suggests that entity-related signals are established almost immediately and then preserved through the model’s depth.
- Extension to Full NER: When extended with span classification heads, ToMMeR achieves competitive performance (80-87% F1 score) on standard NER benchmarks, proving its utility as a foundational component for complete information extraction pipelines.
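For reference, the Dice agreement between two models' predicted mention sets can be computed as below. Treating spans as exact (start, end) matches is an assumption of this sketch; the paper may instead score token-level overlap.

```python
def dice_score(spans_a: set[tuple[int, int]], spans_b: set[tuple[int, int]]) -> float:
    """Dice coefficient between two sets of (start, end) mention spans.
    Exact-match agreement; 1.0 means identical predictions."""
    if not spans_a and not spans_b:
        return 1.0
    overlap = len(spans_a & spans_b)
    return 2 * overlap / (len(spans_a) + len(spans_b))

# Example: two backbones agreeing on 3 of their 4 predicted spans each.
a = {(0, 1), (5, 6), (10, 12), (20, 21)}
b = {(0, 1), (5, 6), (10, 12), (30, 31)}
print(dice_score(a, b))  # 0.75
```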
Impact and Future Directions
The introduction of ToMMeR offers both practical and conceptual contributions. Practically, it provides a lightweight, transferable, and high-coverage method for mention detection that can be integrated into any LLM, enabling real-time streaming deployment with minimal overhead. Conceptually, it offers compelling evidence that LLMs develop structured entity representations in their early layers, which can be efficiently recovered through simple probing mechanisms.
This work positions ToMMeR at the forefront of efficient probing methods and practical information extraction systems, paving the way for more modular and schema-agnostic extraction pipelines. For more details, see the full research paper.