TLDR: The text2SQL4PM dataset is a new bilingual (Portuguese-English) benchmark for text-to-SQL translation in the process mining domain. Developed by Bruno Y. Yamate et al., it features 1,655 natural language utterances, 205 SQL statements, and ten qualifiers, addressing the unique challenges of specialized vocabulary and event log structures. A baseline study with GPT-3.5 Turbo demonstrated the dataset’s utility, highlighting areas of difficulty like case-level queries and temporal ordering, while showing that models can often produce functionally correct SQL even if structurally different from the gold standard.
A new research paper introduces a significant resource for bridging the gap between natural language and database queries, particularly within the complex field of process mining. Titled “Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation,” this work by Bruno Y. Yamate, Thais R. Neubauer, Marcelo Fantinato, and Sarajane M. Peres presents text2SQL4PM, a novel bilingual dataset designed to enhance text-to-SQL capabilities.
The ability to convert natural language questions into Structured Query Language (SQL) statements is crucial for democratizing access to data. It lets users without SQL expertise retrieve information from databases and boosts the productivity of experienced developers. Existing text-to-SQL datasets cover domains such as cars, flights, and music, but they often fall short in specialized areas such as process mining, which has its own vocabulary and data structures.
Introducing text2SQL4PM: A Specialized Bilingual Dataset
The text2SQL4PM dataset is a benchmark specifically crafted for the process mining domain. It addresses the unique challenges of this field, which deals with event logs – sequential records of activities within business processes. Unlike typical relational databases, event logs, when converted for SQL use, often result in single, non-normalized tables with specialized terminology. This dataset is bilingual, supporting both Portuguese and English, and comprises 1,655 natural language utterances, including human-generated paraphrases, 205 SQL statements, and ten qualifiers that categorize the complexity and nature of the queries.
The creation of text2SQL4PM involved a rigorous three-phase methodology. Initially, undergraduate and graduate students generated content. This content was then meticulously curated by process mining experts who adapted it to a specific business process (authorization requests for academic travel expenses) and associated it with various qualifiers. Finally, the dataset was augmented with human-generated paraphrases and professional translations into English, ensuring its richness and utility for diverse research purposes.
Why Process Mining is a Unique Challenge for Text-to-SQL
Process mining presents several hurdles for text-to-SQL solutions. The domain’s specialized vocabulary, the non-normalized single-table structure of event logs, and the critical importance of temporal ordering of events make it difficult for general-purpose models to perform effectively. For instance, understanding concepts like ‘case level’ versus ‘event level’ analysis, or correctly interpreting temporal sequences of activities, requires deep domain knowledge that traditional models often lack.
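To make the distinction between event-level, case-level, and temporal questions concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name event_log, its columns (case_id, activity, timestamp), and the sample rows are hypothetical illustrations and are not taken from the text2SQL4PM dataset.

```python
import sqlite3

# Hypothetical single-table event log: one row per event (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (case_id TEXT, activity TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?, ?)",
    [
        ("c1", "Request submitted", "2024-01-01T09:00:00"),
        ("c1", "Request approved", "2024-01-02T10:00:00"),
        ("c1", "Payment handled", "2024-01-05T08:30:00"),
        ("c2", "Request submitted", "2024-01-03T11:00:00"),
        ("c2", "Request rejected", "2024-01-04T15:00:00"),
    ],
)

# Event-level question: how many events recorded an approval?
event_level = conn.execute(
    "SELECT COUNT(*) FROM event_log WHERE activity = 'Request approved'"
).fetchone()[0]

# Case-level question: how many events does each case contain?
# Rows must be grouped by case before counting.
case_level = conn.execute(
    "SELECT case_id, COUNT(*) FROM event_log GROUP BY case_id ORDER BY case_id"
).fetchall()

# Temporal ordering: cases where an approval happened before a payment.
# A self-join compares two events of the same case by timestamp
# (ISO 8601 strings sort chronologically).
approved_then_paid = conn.execute(
    """
    SELECT DISTINCT a.case_id
    FROM event_log AS a
    JOIN event_log AS b ON a.case_id = b.case_id
    WHERE a.activity = 'Request approved'
      AND b.activity = 'Payment handled'
      AND a.timestamp < b.timestamp
    """
).fetchall()

print(event_level, case_level, approved_then_paid)
```

Because every event lives in the same table, the case-level and temporal questions need grouping or a self-join, which is the kind of structure a general-purpose model has to infer from domain wording alone.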
Baseline Performance with GPT-3.5 Turbo
To establish a baseline, the researchers tested the dataset with GPT-3.5 Turbo, a large language model. Using a zero-shot prompt engineering approach, the model was tasked with generating SQL statements from the natural language utterances. The evaluation used two key indicators: ‘exact set match without values’ (structure indicator) and ‘execution accuracy’ (run indicator).
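The run indicator can be pictured with the following sketch, which treats two queries as equivalent when they return the same rows regardless of how the SQL is written. This is one simplified reading of execution accuracy, not the authors' evaluation code; the function name same_execution_result, the event_log table, and the sample queries are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def same_execution_result(conn, gold_sql, predicted_sql):
    """Return True if both queries yield the same multiset of rows.

    Simplification (assumed): row order is ignored, column order is not.
    """
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A predicted query that fails to run counts as a miss.
        return False
    return Counter(gold_rows) == Counter(predicted_rows)


# Hypothetical event log used only to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (case_id TEXT, activity TEXT)")
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?)",
    [("c1", "Request submitted"), ("c1", "Request approved"), ("c2", "Request submitted")],
)

gold = "SELECT COUNT(*) FROM event_log WHERE activity = 'Request submitted'"
# Structurally different from the gold query, yet it returns the same value.
different_structure = (
    "SELECT COUNT(case_id) FROM event_log WHERE activity LIKE 'Request submitted'"
)
# References a table that does not exist, so execution fails.
broken = "SELECT COUNT(*) FROM events"

matches = same_execution_result(conn, gold, different_structure)
fails = same_execution_result(conn, gold, broken)
print(matches, fails)
```

A structure-based indicator would reject the second query because its clauses differ from the gold statement, while an execution-based check accepts it. That difference is why the two indicators can diverge for the same model output.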
The results showed that while the structure indicator had success rates of around 31-32% for both languages, the run indicator achieved higher rates of 44-47%. This suggests that GPT-3.5 Turbo often generates SQL statements that, despite not perfectly matching the ‘gold standard’ structure, still produce the correct results when executed. Challenges were particularly noted in queries involving case-level analysis, domain-specific vocabulary, and complex temporal ordering. For example, queries asking for the “greatest number of events” might be misinterpreted by the model, leading to a single result instead of all instances that meet the criteria.
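The "greatest number of events" pitfall can be illustrated with a small sketch. The event_log table, its rows, and both queries below are hypothetical; they show one plausible way a single-result answer can differ from an answer that keeps every tied case, and are not taken from the model outputs reported in the paper.

```python
import sqlite3

# Hypothetical event log in which two cases tie for the most events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (case_id TEXT, activity TEXT)")
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?)",
    [
        ("c1", "A"), ("c1", "B"), ("c1", "C"),
        ("c2", "A"), ("c2", "B"), ("c2", "C"),
        ("c3", "A"),
    ],
)

# Assumed failure mode: sorting and keeping one row silently drops ties.
single_result_sql = """
SELECT case_id, COUNT(*) AS n_events
FROM event_log
GROUP BY case_id
ORDER BY n_events DESC, case_id
LIMIT 1
"""

# Keeps every case whose event count equals the maximum count.
all_ties_sql = """
SELECT case_id, COUNT(*) AS n_events
FROM event_log
GROUP BY case_id
HAVING COUNT(*) = (
    SELECT MAX(n)
    FROM (SELECT COUNT(*) AS n FROM event_log GROUP BY case_id) AS counts
)
ORDER BY case_id
"""

single_rows = conn.execute(single_result_sql).fetchall()
tie_rows = conn.execute(all_ties_sql).fetchall()
print(single_rows)  # [('c1', 3)]
print(tie_rows)     # [('c1', 3), ('c2', 3)]
```

Both queries run without error, so only a comparison of the returned rows against the gold result reveals that the first one misses a case.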
Impact and Future Directions
The text2SQL4PM dataset is a valuable contribution to the fields of natural language processing and process mining. It serves as a precise benchmark for evaluating text-to-SQL implementations in a specialized domain and offers a rich resource for semantic parsing, machine translation, and paraphrase generation tasks. Its bilingual nature, human-curated content, and detailed qualifiers make it particularly useful for fine-tuning models for Portuguese language processing. While the dataset currently focuses on exploratory information retrieval, future research could extend it to more advanced process mining tasks or integrate it with process query languages.


