TLDR: CodeNER is a novel method that uses code-based prompting to enhance Named Entity Recognition (NER) capabilities of Large Language Models (LLMs). By embedding detailed BIO schema instructions within structured code prompts, CodeNER helps LLMs better understand and perform NER, overcoming limitations of traditional text-based prompting. Experimental results show CodeNER consistently outperforms text-based methods across various languages and models, demonstrating improved accuracy in identifying entity boundaries and handling complex text structures.
Named Entity Recognition, or NER, is a fundamental task in natural language processing (NLP) that involves identifying and classifying named entities in text, such as people, locations, and organizations. Traditionally, NER has been framed as a sequence labeling problem, where a model assigns a tag to each token in a sentence, typically following the BIO scheme (Beginning, Inside, Outside), to mark entity boundaries. While conventional methods achieve high performance, they often require extensive labeled datasets for training.
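To make the BIO scheme concrete, here is a small invented example (the sentence and entity types are illustrative, not drawn from the paper's datasets):

```python
# Illustrative BIO tagging of a tokenized sentence (invented example).
# B-X marks the first token of an entity of type X, I-X a continuation,
# and O a token outside any entity.
tokens = ["Barack", "Obama", "visited", "Berlin", "yesterday", "."]
bio_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]

for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```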
Recently, large language models (LLMs) have shown remarkable capabilities in various NLP tasks, including NER, through in-context learning and zero-shot task-solving. However, applying LLMs to NER using traditional text-based prompting methods presents a challenge. LLMs typically operate on a “text-in-text-out” schema, which doesn’t naturally align with the “text-in-span-out” nature of NER, where the goal is to identify specific spans of text as entities. This mismatch can lead to difficulties in accurately identifying entity boundaries and handling the sequential aspects of NER.
To address these limitations, the authors propose CodeNER, a method that leverages code-based prompting to improve LLMs' understanding and performance on NER. Instead of relying solely on natural language instructions, CodeNER embeds detailed BIO schema labeling instructions within structured code prompts, typically written in Python. This exploits LLMs' inherent ability to comprehend programming language structures, helping them identify entity boundaries and process text sequentially.
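The paper's exact prompt is not reproduced here, but a minimal sketch of what such a code-style prompt might look like follows; the variable names and wording are assumptions for illustration:

```python
# Hypothetical sketch of a code-style NER prompt (the paper's actual
# format may differ). This text is sent to the LLM, which is asked to
# complete the function so that each token receives one BIO tag.
code_prompt = '''
sentence = "Barack Obama visited Berlin yesterday ."
tokens = sentence.split()

# Allowed BIO tag labels for this dataset.
labels = ["B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "O"]

def tag_tokens(tokens):
    """Return exactly one tag from `labels` for each token, in order."""
    tags = []
    for token in tokens:
        # <the LLM fills in the tagging decision for each token>
        ...
    return tags
'''
```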
The core idea behind CodeNER is to provide explicit guidance for sequential processing, which is crucial for accurate NER, especially in zero-shot and few-shot scenarios where direct supervision is minimal. By defining variables for sentences and NER tag labels, and including a function that iterates through tokens to apply BIO tags, CodeNER guides the LLM to dynamically define and populate an entity dictionary. This structured approach helps overcome issues like misinterpretation and variability often encountered with purely text-based prompts.
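As a rough sketch of what populating such an entity dictionary could look like once BIO tags are produced (an illustrative reconstruction under the assumptions above, not the paper's actual code):

```python
# Illustrative post-processing: collect BIO-tagged tokens into a
# dictionary mapping each entity span to its type.
def build_entity_dict(tokens, tags):
    """Collect BIO-tagged tokens into {entity_text: entity_type}."""
    entities = {}
    span, span_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):             # a new entity begins
            if span:
                entities[" ".join(span)] = span_type
            span, span_type = [token], tag[2:]
        elif tag.startswith("I-") and span:  # the current entity continues
            span.append(token)
        else:                                # O (or a stray I-): close any open span
            if span:
                entities[" ".join(span)] = span_type
            span, span_type = [], None
    if span:                                 # flush an entity ending the sentence
        entities[" ".join(span)] = span_type
    return entities

tokens = ["Barack", "Obama", "visited", "Berlin", "yesterday", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
print(build_entity_dict(tokens, tags))
# {'Barack Obama': 'PER', 'Berlin': 'LOC'}
```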
Experiments were conducted on ten benchmark datasets spanning English, Arabic, Finnish, Danish, and German, using both closed models (GPT-4 and GPT-4 Turbo) and open models (Llama-3-8B and Phi-3-mini-128k-instruct). The results consistently showed that CodeNER outperforms conventional text-based prompting. For instance, CodeNER delivered notable gains on datasets such as FIN and MIT Restaurant, and improved average F1 scores across all ten datasets for both GPT-4 and GPT-4 Turbo. Broken down by entity label (Person, Location, Organization, Miscellaneous), CodeNER was generally stronger, particularly on the Miscellaneous category.
Further analysis with the open models Phi-3 and Llama-3-8B also confirmed CodeNER's advantage over vanilla text-based prompts and over other partially code-based prompting approaches such as GoLLIE and GNER. The study highlighted that CodeNER's effectiveness is closely tied to an LLM's ability to interpret programming language instructions, and that integrating the BIO schema into the code-based prompts is particularly important for accurately identifying entity span boundaries.
Case studies revealed that CodeNER is more robust in complex scenarios. For example, it correctly recognized long tokens such as website URLs as single units, whereas text-based methods tended to break them apart. It also handled duplicate and repeated tokens effectively, ensuring each occurrence was labeled correctly without overlapping classifications, and it captured special characters attached to words more reliably than vanilla prompts, which often missed them.
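As a small invented illustration of the URL case (not one of the paper's actual examples):

```python
# Invented illustration: the URL is a single long token and should
# receive one tag rather than being split into pieces.
tokens = ["Book", "a", "table", "at", "https://example.com/menu", "tonight"]
tags = ["O", "O", "O", "O", "B-MISC", "O"]  # the whole URL is one unit
```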
The researchers also explored combining CodeNER with Chain-of-Thought (CoT) prompting, a technique that encourages step-by-step reasoning. This combination further improved performance in zero-shot settings, suggesting that structured, programming-language-style prompts can enhance LLMs' understanding of long-range scopes. Interestingly, testing CodeNER with other programming languages, such as C++, yielded results comparable to Python on some datasets, indicating that dataset characteristics can matter more than the choice of programming language. Some languages, however, such as Java, proved less effective, possibly because LLMs acquire less knowledge of Java during pre-training.
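A rough sketch of how CoT might be layered on top of the code-style prompt (the instruction wording is an assumption, and `code_prompt` refers to the hypothetical sketch above):

```python
# Hypothetical CoT + CodeNER combination: a step-by-step reasoning
# instruction is prepended to the code-style prompt sketched earlier.
cot_instruction = (
    "Let's think step by step: consider each token in order, decide "
    "whether it begins, continues, or lies outside an entity, and only "
    "then produce the final BIO tag sequence."
)
combined_prompt = cot_instruction + "\n\n" + code_prompt
```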
In summary, CodeNER offers a significant advantage by bridging the gap between LLMs’ text-in-text-out schema and NER’s text-in-span-out nature. It provides a structured, sequential approach to labeling that reduces errors like overlapping tags and improves the recognition of individual tokens. While highly effective, CodeNER may be less advantageous for datasets with very long sentences containing many function words, where a simpler, context-focused approach might sometimes perform better. The research paper, available at https://arxiv.org/pdf/2507.20423, concludes that CodeNER consistently outperforms text-based prompting methods, demonstrating the effectiveness of explicitly structuring NER instructions within a code-based framework.


