TLDR: A new study evaluates the effectiveness of structured decoding versus one-shot prompting for text-to-table generation using large language models across three datasets: E2E, Rotowire, and Livesum. The findings indicate that structured decoding significantly improves table validity and numerical alignment, particularly on datasets like Rotowire. However, it can degrade performance when dealing with densely packed textual information (E2E) or complex aggregation over long contexts (Livesum). The research emphasizes that the optimal decoding strategy is highly task-dependent and highlights the need for more suitable evaluation metrics and future advancements in schema inference and complex table generation.
Converting unstructured text into organized tables is a crucial task in artificial intelligence, enabling advancements in areas like knowledge base construction, document summarization, and improving web chatbot readability. This process, known as text-to-table generation, has seen significant evolution, from early sequence-to-sequence models to modern approaches leveraging large language models (LLMs) with various prompting techniques.
A recent study, titled Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets, delves into a less explored aspect of this field: the impact of enforcing structural constraints during table generation. While previous research often focused on unconstrained table generation, this paper systematically compares schema-guided (structured) decoding with standard one-shot prompting across three diverse benchmarks using open-source LLMs of up to 32 billion parameters.
Understanding the Approaches
The research primarily investigates two methods for LLMs to generate structured tables from text:
- Free-form (Unstructured) Generation: Here, LLMs generate a markdown sequence for a given input text. A one-shot instruction prompt guides the model to adhere to a desired table output format, specifying the header cells and the number of columns and rows. However, LLMs can struggle with rigid instructions, potentially adding extra prose, breaking tables, or generating multiple distinct tables. Robust post-processing is then required to extract and validate well-formed tabular data.
- Schema-guided (Structured) Decoding: This approach enforces stricter structural guarantees by using decoding constraints defined by a provided JSON schema. A schema builder dynamically constructs a nested JSON schema based on the table layout, with cell values constrained to nullable integers and headers to predefined values. This method simplifies post-processing as the LLM directly emits Pydantic-safe JSON.
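A schema builder along the lines the paper describes might look like the following sketch. The function name and exact schema layout are assumptions for illustration; the key properties from the study are that cell values are nullable integers and headers come from a predefined set:

```python
def build_table_schema(headers: list[str], n_rows: int) -> dict:
    """Dynamically build a nested JSON schema for a table layout.

    Hypothetical reconstruction: each row is an object keyed by the
    predefined headers, and every cell value is a nullable integer.
    """
    row_schema = {
        "type": "object",
        "properties": {h: {"type": ["integer", "null"]} for h in headers},
        "required": headers,
        "additionalProperties": False,
    }
    return {
        "type": "object",
        "properties": {
            "rows": {
                "type": "array",
                "items": row_schema,
                # Pin the row count so the model cannot drop or invent rows
                "minItems": n_rows,
                "maxItems": n_rows,
            }
        },
        "required": ["rows"],
    }
```

A schema like this can be handed to a constrained-decoding backend (libraries such as Outlines, or inference servers with guided-JSON support) so that every token the model emits stays inside the schema, which is what makes the output directly parseable without markdown cleanup.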
The Datasets and Evaluation
The study utilized three distinct datasets, each presenting unique challenges:
- E2E: Sourced from the restaurant domain, featuring short textual descriptions and simple two-row tables. It focuses on extracting textual information from concise texts.
- Rotowire: From the sports domain, containing game summaries paired with tables of basketball player and team statistics. This dataset requires identifying and assigning sparsely mentioned numerical statistics.
- Livesum: Also from the sports domain, comprising live soccer commentaries and team statistics. This demands truthful aggregation of atomic extraction units distributed throughout longer texts, requiring enhanced reasoning capabilities.
Evaluation was conducted at cell, row, and table levels, using metrics such as F1 scores, Levenshtein ratio, ROUGE-L, and Root Mean Square Error (RMSE) for numerical tables. The presence rate of valid tables was also tracked.
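Two of these metrics can be sketched in a few lines, assuming cell alignment between prediction and reference is already given; the paper's exact normalization may differ:

```python
import math

def cell_rmse(pred: list[int], gold: list[int]) -> float:
    """Root Mean Square Error over aligned numeric cells."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

def levenshtein_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] from edit distance, normalized by the longer
    string (one common definition of the Levenshtein ratio)."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```

RMSE penalizes a predicted 25-point stat line that should read 23 only slightly, whereas an exact-match metric would score both that near-miss and a wildly wrong value as equally incorrect, which is the sense in which the study finds RMSE more informative for numerical tables.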
Key Findings and Insights
The evaluation revealed that the effectiveness of structured decoding is highly task-dependent:
- Increased Table Presence: Structured decoding consistently boosted the presence of valid tables across all three benchmarks, significantly reducing malformed outputs compared to one-shot prompting.
- Rotowire Success: For the Rotowire dataset, structured decoding consistently improved model performance. It excelled in scenarios demanding precise numerical alignment, with even the smallest models showing significant gains. Errors like column mismatches were drastically reduced.
- E2E and Livesum Challenges: Conversely, structured decoding proved counterproductive on the E2E and Livesum datasets. For E2E, where critical attributes are densely packed in short texts, the freedom of unconstrained decoding allowed larger models to capture subtle lexical cues more effectively. For Livesum, a reasoning-heavy task requiring aggregation over long contexts, structured decoding ensured table coverage but did not enhance quality metrics, suggesting it might hinder complex reasoning.
- Model Size Influence: Generally, a positive relationship was observed between model size and table generation quality. However, exceptions were noted, particularly with the largest Qwen2.5-32B model, which sometimes showed reduced table presence and quality on certain tasks.
- Evaluation Metrics: The study highlighted limitations of common string-based NLP metrics, which can overestimate table quality. Exact match metrics were found to be too strict, while RMSE provided a more informative assessment for numerical tables. Positional cell-level Levenshtein was deemed useful for expressing performance differences across model sizes. Ultimately, human evaluation remains crucial for robustly assessing table quality.
Conclusion and Future Directions
The research underscores that while schema-guided decoding significantly improves table validity and reduces malformed outputs, its impact on table quality varies based on the task. It is beneficial for extracting numerical information from sparsely spread text but can hinder performance when dealing with densely packed textual information or tasks requiring extensive aggregation and reasoning over long contexts.
Future work should explore methods for schema inference, generate more complex table types (e.g., multi-line headers, merged cells), develop advanced constrained decoding strategies beyond standard JSON schemas, and create novel evaluation metrics and datasets, ideally complemented by human assessments, to better capture real-world use cases.


