TLDR: The LyS system introduces a zero-shot approach for Tabular Question Answering (Tabular QA) by using a Large Language Model to generate Python code that extracts information from tables. Its modular pipeline includes a column selector, a code generator, and an iterative error-handling module that refines code based on execution failures. The system performed well in SemEval 2025 Task 8, demonstrating the effectiveness of zero-shot code generation for Tabular QA, despite challenges with highly complex data types.
In the evolving landscape of artificial intelligence, the ability of machines to understand and answer questions over structured data, a task known as Tabular Question Answering (Tabular QA), is becoming increasingly vital. This field holds immense potential for real-world applications, from analyzing financial reports and business intelligence to exploring scientific datasets. Unlike traditional question answering, which deals with unstructured text, Tabular QA requires systems to navigate tables, understand column relationships, and handle various data types to extract precise information.
Historically, Tabular QA systems often relied on complex supervised methods, involving structured prediction or sequence-to-sequence models that required extensive training on large annotated datasets. However, with the emergence of powerful instruction-based Large Language Models (LLMs), a new paradigm has taken hold: zero-shot generation. This approach allows models to generate answers without prior task-specific fine-tuning, significantly reducing the need for vast amounts of labeled data.
A recent paper, “LyS at SemEval 2025 Task 8: Zero-Shot Code Generation for Tabular QA,” explores this zero-shot approach by leveraging LLMs to dynamically generate functional code. This code is designed to extract relevant information from tabular data based on a user’s input question. The team behind LyS developed a modular pipeline to enhance accuracy and reliability, consisting of three main stages.
The LyS System: A Modular Approach
The LyS system is built around a core idea: using an LLM to generate executable code. To support this, it incorporates additional components that refine the process and improve robustness:
- Column Selector: This initial module uses an instruction-based LLM to identify the most relevant columns in a table for a given question. Instead of relying on predefined rules, it intelligently determines which parts of the table are essential for answering the query.
- Answer Generator: Once the relevant columns are identified, this component instructs another LLM to generate Python code. Python was chosen due to its widespread use in data analysis and strong support for tabular data processing through libraries like Pandas. This generated code is then executed to retrieve the answer from the tabular source.
- Code Fixer: A crucial part of the pipeline, this module captures any execution errors that might occur due to incorrect syntax or data mismatches. If an error is detected, the error message and context are fed back into the LLM, prompting it to regenerate a corrected version of the code. This iterative refinement process significantly enhances the system’s reliability.
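The three stages above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the `answer(df)` function convention, and the `max_fixes` limit are all assumptions made for the example, and a scripted stand-in replaces a real model so the retry loop can be run end to end.

```python
from typing import Callable, List

import pandas as pd

# Hypothetical LLM interface (an assumption): prompt text in, generated text out.
LLM = Callable[[str], str]


def select_columns(llm: LLM, df: pd.DataFrame, question: str) -> List[str]:
    """Column Selector: ask the LLM which columns matter for the question."""
    prompt = (
        f"Question: {question}\n"
        f"Columns: {list(df.columns)}\n"
        "Return the relevant column names, comma-separated."
    )
    chosen = [c.strip() for c in llm(prompt).split(",")]
    valid = [c for c in chosen if c in df.columns]
    return valid or list(df.columns)  # fall back to all columns


def run_code(code: str, df: pd.DataFrame):
    """Execute generated code that is assumed to define `answer(df)`."""
    namespace = {"pd": pd}
    exec(code, namespace)  # never run untrusted code outside a sandbox
    return namespace["answer"](df)


def answer_question(llm: LLM, df: pd.DataFrame, question: str, max_fixes: int = 2):
    """Answer Generator plus Code Fixer: generate, execute, retry on errors."""
    columns = select_columns(llm, df, question)
    prompt = (
        "Write a Python function answer(df) that answers the question.\n"
        f"Question: {question}\nRelevant columns: {columns}"
    )
    code = llm(prompt)
    for attempt in range(max_fixes + 1):
        try:
            return run_code(code, df)
        except Exception as err:
            if attempt == max_fixes:
                raise
            # Feed the failing code and the error message back to the LLM.
            code = llm(
                f"{prompt}\nPrevious code:\n{code}\n"
                f"Error: {err!r}\nReturn a corrected function."
            )


def make_scripted_llm(replies):
    """Stand-in for a real model: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


df = pd.DataFrame({"name": ["Ana", "Bo", "Cy"], "age": [31, 25, 40]})
fake_llm = make_scripted_llm([
    "age",                                            # column selection
    "def answer(df):\n    return df['Age'].max()",    # wrong case: KeyError
    "def answer(df):\n    return df['age'].max()",    # corrected code
])
result = answer_question(fake_llm, df, "What is the maximum age?")
```

In this sketch the first generated function fails on a misspelled column name, the error is passed back, and the second attempt succeeds. Whether the selected columns are used only in the prompt or also to subset the table is a design choice the summary above does not settle; here the full table is passed to the generated code.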
The system also includes a preprocessing step that standardizes column names and infers common data schemas, which helps prevent errors in the code generation phase.
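As a rough idea of what such preprocessing might look like, the sketch below normalizes column names and summarizes column types with pandas. The helper names `standardize_columns` and `describe_schema` and the exact cleaning rules are assumptions for illustration; the paper's actual rules may differ.

```python
import re

import pandas as pd


def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase, underscore-separated column names."""
    def clean(name) -> str:
        # Replace runs of non-alphanumeric characters with a single underscore.
        cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", str(name)).strip("_").lower()
        return cleaned or "column"

    out = df.copy()
    # Note: distinct originals can collide after cleaning; not handled here.
    out.columns = [clean(c) for c in df.columns]
    return out


def describe_schema(df: pd.DataFrame) -> dict:
    """Map each column to the dtype pandas inferred, as a string."""
    return {col: str(dtype) for col, dtype in df.dtypes.items()}


raw = pd.DataFrame({"Full Name ": ["Ana", "Bo"], "Age (years)": [31, 25]})
clean_df = standardize_columns(raw)
schema = describe_schema(clean_df)
```

A schema summary like this could be included in the prompt so the model refers to columns by predictable names and knows which ones are numeric.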
Performance and Insights
The LyS team participated in the SemEval 2025 Task 8, a competition that provided a diverse dataset of real-world tabular data. Their zero-shot approach meant no explicit training or fine-tuning was conducted; instead, they validated different open-source LLMs on a development dataset to select the best performer. Models like Qwen-2.5-Coder (7B and 32B versions), Mistral-7B, and Codestral-22B were tested, with Qwen-2.5-Coder 32B showing superior performance.
During the development phase, the LyS system consistently outperformed the baseline, demonstrating the viability of zero-shot code generation for Tabular QA. Adding the Column Selector module led to a clear improvement in accuracy, highlighting the importance of pre-selecting relevant attributes. Furthermore, the Code Fixer module, especially when combined with enhanced column selection, significantly boosted performance, particularly for Subtask 1, which involved larger databases. This showed that incorporating error feedback helps the LLM generate better queries.
In the final test phase of the competition, the best-performing configuration of LyS ranked 33rd out of 53 participants. Accuracy dropped noticeably compared with the development phase, which the authors attributed to the increased complexity of data types in the test tables, such as lists not enclosed by brackets or dictionaries with variable keys. This indicates that while the system is robust, handling highly complex and inconsistently formatted data types remains a challenge.
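To make the data-type problem concrete, the sketch below shows one possible way to normalize list-like cells that may or may not be wrapped in brackets. It is not taken from the paper, which reports this as an open difficulty; the function name `parse_list_cell` and the comma-splitting fallback are assumptions for illustration.

```python
import ast


def parse_list_cell(cell) -> list:
    """Parse a cell that should hold a list, with or without brackets."""
    text = str(cell).strip()
    if text.startswith("[") and text.endswith("]"):
        try:
            # Well-formed Python literal, e.g. "['a', 'b']".
            return ast.literal_eval(text)
        except (ValueError, SyntaxError):
            # Bracketed but not a valid literal, e.g. "[a, b]".
            text = text[1:-1]
    # Fallback: split on commas and strip whitespace and stray quotes.
    return [item.strip().strip("'\"") for item in text.split(",") if item.strip()]
```

A fallback like this is lossy (items that themselves contain commas would be split), which hints at why inconsistently formatted cells are hard to handle in generated code.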
Looking Ahead
The LyS system demonstrates that zero-shot code generation is a valid and promising approach for Tabular QA, capable of adapting to different dataset schemas without extensive training. Future work aims to further refine prompt templates, improve schema adaptation, optimize execution efficiency, and potentially incorporate a voting system with multiple LLMs. Enhancing the detection and handling of complex data types is also a critical area for improvement, as it will make the system more generalizable to the vast amount of less structured online data. For more technical details, refer to the full research paper.


