TLDR: This research introduces Combo-Eval, a novel evaluation framework for assessing how Large Language Models (LLMs) convert tabular database results into natural language representations (NLRs). It combines traditional metrics with LLM-as-a-judge, achieving superior alignment with human judgment while cutting LLM calls by 25% to 61%. The paper also presents NLR-BIRD, the first dedicated dataset for NLR benchmarking, and shows that LLMs struggle to generate accurate NLRs for larger result sets, with incomplete information being the most common error.
In today’s fast-paced digital world, where conversational AI agents are becoming increasingly common, the ability of these systems to interact with databases using natural language is crucial. This interaction often involves two main steps: converting a user’s natural language question into a structured SQL query, and then transforming the tabular results of that query back into a user-friendly natural language representation (NLR). While Large Language Models (LLMs) are typically employed for this latter task, the accuracy and completeness of these NLRs have largely remained unexplored.
A recent research paper, titled “Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs,” delves into this critical area. Authored by Jyotika Singh, Weiyi Sun, Amit Agarwal, Viji Krishnamurthy, Yassine Benajiba, Sujith Ravi, and Dan Roth from Oracle AI, the study introduces a groundbreaking evaluation method called Combo-Eval. This method aims to provide a more accurate and efficient way to judge the quality of LLM-generated NLRs.
The Challenge of Narrating Tabular Data
Imagine asking an AI agent a question about your sales data and, instead of a clear, concise answer, getting back a raw, unformatted table. This is where NLRs come in, turning raw tabular output into understandable text. However, the paper highlights that LLMs often struggle with this task, especially on larger and more complex result sets. The most common issue identified is incomplete information, where crucial details from the table are omitted, followed by hallucinations (fabricated information), results presented out of order, skipped null values, and inconsistent formatting.
The research found a clear trend: the quality of LLM-generated NLRs drops significantly as the size of the result set grows. Larger LLMs generally hold up better on these more challenging, larger result sets, but even top models such as GPT-4o, which excels on smaller data, decline in performance as tables get bigger.
Introducing Combo-Eval: A Hybrid Approach to Evaluation
To address the limitations of existing evaluation methods, the authors propose Combo-Eval. This framework combines the strengths of traditional metrics (such as ROUGE scores, which measure text similarity) with the nuanced judgment capabilities of LLMs themselves (LLM-as-a-judge). The core idea is to first use simple, fast metrics to filter out clearly correct or clearly incorrect NLRs; only the more ambiguous cases are then passed to a computationally intensive LLM-as-a-judge for a finer-grained assessment.
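To make the mechanism concrete, here is a minimal sketch of such a metric-gated pipeline in Python. The choice of ROUGE-L, the threshold values, and the judge interface are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a metric-gated hybrid pipeline in the spirit of
# Combo-Eval. ROUGE-L, the thresholds, and the judge interface are
# illustrative assumptions.
from typing import Callable

from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

# Hypothetical cut-offs: NLRs scoring above HIGH are accepted outright,
# those below LOW are rejected outright; only the ambiguous middle band
# is escalated to the expensive LLM judge.
HIGH, LOW = 0.8, 0.2

def combo_eval(
    reference: str,
    candidate: str,
    llm_judge: Callable[[str, str], bool],
) -> bool:
    """Return True if the candidate NLR is judged correct."""
    f1 = scorer.score(reference, candidate)["rougeL"].fmeasure
    if f1 >= HIGH:   # clearly similar to the reference: accept cheaply
        return True
    if f1 <= LOW:    # clearly dissimilar: reject cheaply
        return False
    # Ambiguous band: fall back to LLM-as-a-judge for a finer verdict.
    return llm_judge(reference, candidate)
```

Only candidates that land in the ambiguous middle band trigger a judge call; the fraction of cases resolved by the cheap metric alone is what drives the reduction in LLM calls reported below.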
This hybrid approach offers significant advantages. The paper demonstrates that Combo-Eval achieves superior alignment with human judgments compared to using metrics or LLM-as-a-judge alone. Crucially, it also leads to a substantial reduction in LLM calls, ranging from 25% to 61%, making the evaluation process much more cost-effective and efficient, particularly for large-scale industrial applications. This efficiency is especially pronounced when using smaller judge models, offering a practical solution without compromising accuracy.
NLR-BIRD: A New Benchmark Dataset
Accompanying the Combo-Eval method is NLR-BIRD, the first dedicated dataset for benchmarking NLR generation. This dataset covers 11 diverse domains, from finance to sports, and includes human-labeled ground truth NLRs. It provides a robust resource for researchers and developers to test and compare the performance of different LLMs in generating natural language representations from database outputs.
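To illustrate what an NLR benchmark example needs to contain, a record along the following lines pairs a question and its query result with a human-written narration. The field names and values below are hypothetical, not the dataset's published schema.

```python
# A hypothetical NLR-BIRD-style record. Field names and values are
# illustrative assumptions, not the dataset's actual schema.
example = {
    "domain": "finance",
    "user_question": "Which three products had the highest revenue in 2023?",
    "sql": (
        "SELECT product, revenue FROM sales "
        "WHERE year = 2023 ORDER BY revenue DESC LIMIT 3;"
    ),
    "result_set": [
        {"product": "Widget A", "revenue": 1200000},
        {"product": "Widget B", "revenue": 950000},
        {"product": "Widget C", "revenue": 870000},
    ],
    # Human-labeled reference narration of the result set above.
    "ground_truth_nlr": (
        "The three highest-revenue products in 2023 were Widget A "
        "($1,200,000), Widget B ($950,000), and Widget C ($870,000)."
    ),
}
```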
Real-World Scenarios and Future Directions
The study explores two main evaluation scenarios: comparing model-generated NLRs against human-annotated Ground Truth (GT) NLRs, and comparing them against the User Question and raw Database Result-Set (UQDB) when ground truth is unavailable. While GT-based evaluations generally yield higher accuracy, UQDB proves to be a viable and practical alternative for real-world industry applications where human-annotated references might not exist.
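As a rough illustration of how the two scenarios differ in practice, the sketch below constructs a judge input for each; the prompt wording is an assumption, not the paper's actual template.

```python
# Illustrative judge inputs for the two evaluation scenarios.
# The prompt wording is an assumption, not the paper's exact template.

def gt_judge_input(gt_nlr: str, candidate: str) -> str:
    """Reference-based (GT): compare the candidate NLR against a
    human-annotated ground-truth NLR."""
    return (
        "Does the candidate convey the same information as the reference?\n"
        f"Reference: {gt_nlr}\n"
        f"Candidate: {candidate}"
    )

def uqdb_judge_input(question: str, result_set: str, candidate: str) -> str:
    """Reference-free (UQDB): when no ground truth exists, check the
    candidate NLR against the user question and the raw result set."""
    return (
        "Given the user question and the database result set, is the "
        "candidate answer complete and faithful?\n"
        f"Question: {question}\n"
        f"Result set: {result_set}\n"
        f"Candidate: {candidate}"
    )
```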
This research lays a strong foundation for the systematic evaluation of LLMs in narrating tabular data. The insights gained, particularly regarding the prevalence of incomplete information as an error source and the performance drop with larger datasets, open up avenues for improving LLM training strategies. Future work could extend this framework beyond Text-to-SQL systems to other areas requiring the narration of structured data, such as schema enrichment or enhancing interactive systems. For more details, you can read the full research paper here.