TLDR: This research introduces Combo-Eval, a novel evaluation framework for assessing how Large Language Models (LLMs) convert tabular database results into natural language representations (NLRs). It combines traditional metrics with LLM-as-a-judge, achieving superior alignment with human judgment while cutting LLM calls by 25% to 61%. The paper also presents NLR-BIRD, the first dedicated dataset for NLR benchmarking, and shows that LLMs struggle to generate accurate NLRs for larger result sets, with incomplete information being the most common error.
In today’s fast-paced digital world, where conversational AI agents are becoming increasingly common, the ability of these systems to interact with databases using natural language is crucial. This interaction often involves two main steps: converting a user’s natural language question into a structured SQL query, and then transforming the tabular results of that query back into a user-friendly natural language representation (NLR). While Large Language Models (LLMs) are typically employed for this latter task, the accuracy and completeness of these NLRs have largely remained unexplored.
A recent research paper, titled “Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs,” delves into this critical area. Authored by Jyotika Singh, Weiyi Sun, Amit Agarwal, Viji Krishnamurthy, Yassine Benajiba, Sujith Ravi, and Dan Roth from Oracle AI, the study introduces a groundbreaking evaluation method called Combo-Eval. This method aims to provide a more accurate and efficient way to judge the quality of LLM-generated NLRs.
The Challenge of Narrating Tabular Data
Imagine asking an AI agent a question about your sales data and, instead of a clear, concise answer, getting back a raw, unformatted table. This is where NLRs come in, turning raw tabular output into understandable text. However, the paper highlights that LLMs often struggle with this task, especially on larger and more complex result sets. The most common issue identified is incomplete information, where crucial details from the table are omitted, followed by hallucinations (fabricated information), results presented out of order, skipped null values, and inconsistent formatting.
The research found a clear trend: the quality of LLM-generated NLRs drops significantly as the size of the result set grows. Larger LLMs generally hold up better on these more challenging, larger result sets, but even top models such as GPT-4o, which excels on smaller data, decline in performance as tables get bigger.
Introducing Combo-Eval: A Hybrid Approach to Evaluation
To address the limitations of existing evaluation methods, the authors propose Combo-Eval. This framework combines the strengths of traditional metrics (such as ROUGE scores, which measure text similarity) with the nuanced judgment capabilities of LLMs themselves (LLM-as-a-judge). The core idea is to first use simple, fast metrics to filter out clearly correct or clearly incorrect NLRs; only the more ambiguous cases are then passed to a computationally intensive LLM-as-a-judge for a finer-grained assessment.
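To make the mechanism concrete, here is a minimal sketch of such a metric-gated pipeline in Python. The choice of ROUGE-L, the threshold values, and the judge interface are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a metric-gated hybrid pipeline in the spirit of
# Combo-Eval. ROUGE-L, the thresholds, and the judge interface are
# illustrative assumptions.
from typing import Callable

from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Hypothetical cut-offs: NLRs scoring above HIGH are accepted outright,
# those below LOW are rejected outright; only the ambiguous middle band
# is escalated to the expensive LLM judge.
HIGH, LOW = 0.8, 0.2

def combo_eval(
    reference: str,
    candidate: str,
    llm_judge: Callable[[str, str], bool],
) -> bool:
    """Return True if the candidate NLR is judged correct."""
    f1 = scorer.score(reference, candidate)["rougeL"].fmeasure
    if f1 >= HIGH:   # clearly similar to the reference: accept cheaply
        return True
    if f1 <= LOW:    # clearly dissimilar: reject cheaply
        return False
    # Ambiguous band: fall back to LLM-as-a-judge for a finer verdict.
    return llm_judge(reference, candidate)
```

Only candidates that land in the ambiguous middle band trigger a judge call; the fraction of cases resolved by the cheap metric alone is what drives the reduction in LLM calls reported below.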
This hybrid approach offers significant advantages. The paper demonstrates that Combo-Eval achieves superior alignment with human judgments compared to using metrics or LLM-as-a-judge alone. Crucially, it also leads to a substantial reduction in LLM calls, ranging from 25% to 61%, making the evaluation process much more cost-effective and efficient, particularly for large-scale industrial applications. This efficiency is especially pronounced when using smaller judge models, offering a practical solution without compromising accuracy.
NLR-BIRD: A New Benchmark Dataset
Accompanying the Combo-Eval method is NLR-BIRD, the first dedicated dataset for benchmarking NLR generation. This dataset covers 11 diverse domains, from finance to sports, and includes human-labeled ground truth NLRs. It provides a robust resource for researchers and developers to test and compare the performance of different LLMs in generating natural language representations from database outputs.
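To illustrate what an NLR benchmark example needs to contain, a record along the following lines pairs a question and its query result with a human-written narration. The field names and values below are hypothetical, not the dataset's published schema.

```python
# A hypothetical NLR-BIRD-style record. Field names and values are
# illustrative assumptions, not the dataset's actual schema.
example = {
    "domain": "finance",
    "user_question": "Which three products had the highest revenue in 2023?",
    "sql": (
        "SELECT product, revenue FROM sales "
        "WHERE year = 2023 ORDER BY revenue DESC LIMIT 3;"
    ),
    "result_set": [
        {"product": "Widget A", "revenue": 1200000},
        {"product": "Widget B", "revenue": 950000},
        {"product": "Widget C", "revenue": 870000},
    ],
    # Human-labeled reference narration of the result set above.
    "ground_truth_nlr": (
        "The three highest-revenue products in 2023 were Widget A "
        "($1,200,000), Widget B ($950,000), and Widget C ($870,000)."
    ),
}
```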
Real-World Scenarios and Future Directions
The study explores two main evaluation scenarios: comparing model-generated NLRs against human-annotated Ground Truth (GT) NLRs, and comparing them against the User Question and raw Database Result-Set (UQDB) when ground truth is unavailable. While GT-based evaluations generally yield higher accuracy, UQDB proves to be a viable and practical alternative for real-world industry applications where human-annotated references might not exist.
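As a rough illustration of how the two scenarios differ in practice, the sketch below constructs a judge input for each; the prompt wording is an assumption, not the paper's actual template.

```python
# Illustrative judge inputs for the two evaluation scenarios.
# The prompt wording is an assumption, not the paper's exact template.

def gt_judge_input(gt_nlr: str, candidate: str) -> str:
    """Reference-based (GT): compare the candidate NLR against a
    human-annotated ground-truth NLR."""
    return (
        "Does the candidate convey the same information as the reference?\n"
        f"Reference: {gt_nlr}\n"
        f"Candidate: {candidate}"
    )

def uqdb_judge_input(question: str, result_set: str, candidate: str) -> str:
    """Reference-free (UQDB): when no ground truth exists, check the
    candidate NLR against the user question and the raw result set."""
    return (
        "Given the user question and the database result set, is the "
        "candidate answer complete and faithful?\n"
        f"Question: {question}\n"
        f"Result set: {result_set}\n"
        f"Candidate: {candidate}"
    )
```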
This research lays a strong foundation for the systematic evaluation of LLMs in narrating tabular data. The insights gained, particularly regarding the prevalence of incomplete information as an error source and the performance drop with larger datasets, open up avenues for improving LLM training strategies. Future work could extend this framework beyond Text-to-SQL systems to other areas requiring the narration of structured data, such as schema enrichment or enhancing interactive systems. For more details, you can read the full research paper here.