Unlocking Process Mining Data: A New Bilingual Dataset for Text-to-SQL Translation

TLDR: The text2SQL4PM dataset is a new bilingual (Portuguese-English) benchmark for text-to-SQL translation in the process mining domain. Developed by Bruno Y. Yamate et al., it features 1,655 natural language utterances, 205 SQL statements, and ten qualifiers, addressing the unique challenges of specialized vocabulary and event log structures. A baseline study with GPT-3.5 Turbo demonstrated the dataset’s utility, highlighting areas of difficulty like case-level queries and temporal ordering, while showing that models can often produce functionally correct SQL even if structurally different from the gold standard.

A new research paper introduces a significant resource for bridging the gap between natural language and database queries, particularly within the complex field of process mining. Titled “Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation,” this work by Bruno Y. Yamate, Thais R. Neubauer, Marcelo Fantinato, and Sarajane M. Peres presents text2SQL4PM, a novel bilingual dataset designed to enhance text-to-SQL capabilities.

The ability to convert natural language questions into Structured Query Language (SQL) statements is crucial for democratizing access to data. It allows users without technical SQL expertise to retrieve information from databases and also boosts the productivity of experienced developers. Existing text-to-SQL datasets cover domains such as cars, flights, and music, but they often fall short in specialized areas such as process mining, which has its own vocabulary and data structures.

Introducing text2SQL4PM: A Specialized Bilingual Dataset

The text2SQL4PM dataset is a benchmark specifically crafted for the process mining domain. It addresses the unique challenges of this field, which deals with event logs – sequential records of activities within business processes. Unlike typical relational databases, event logs, when converted for SQL use, often result in single, non-normalized tables with specialized terminology. This dataset is bilingual, supporting both Portuguese and English, and comprises 1,655 natural language utterances, including human-generated paraphrases, 205 SQL statements, and ten qualifiers that categorize the complexity and nature of the queries.
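To make the single-table structure concrete, here is a minimal sketch of an event log stored as one non-normalized table. The table name, column names, and rows are all hypothetical and chosen for illustration; they are not taken from the dataset itself.

```python
import sqlite3

# Hypothetical schema and rows: illustrative only, not the dataset's actual table.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE event_log (
        case_id   TEXT,  -- one process instance, e.g. one travel request
        activity  TEXT,  -- name of the step that was executed
        timestamp TEXT,  -- ISO 8601 text, which sorts chronologically
        resource  TEXT   -- who performed the step
    )
    """
)
rows = [
    ("c1", "Request submitted", "2024-01-02T09:00:00", "employee"),
    ("c1", "Request approved",  "2024-01-03T14:30:00", "supervisor"),
    ("c2", "Request submitted", "2024-01-04T10:15:00", "employee"),
    ("c2", "Request rejected",  "2024-01-05T11:00:00", "supervisor"),
    ("c2", "Request submitted", "2024-01-06T08:45:00", "employee"),
]
conn.executemany("INSERT INTO event_log VALUES (?, ?, ?, ?)", rows)

# Every row is an event, so the case identifier repeats across rows.
# Counting cases therefore needs DISTINCT, while counting events does not.
n_events = conn.execute("SELECT COUNT(*) FROM event_log").fetchone()[0]
n_cases = conn.execute(
    "SELECT COUNT(DISTINCT case_id) FROM event_log"
).fetchone()[0]
print(n_events, n_cases)
```

Because there is no separate table of cases, any question about cases has to be answered by grouping or deduplicating event rows.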

The creation of text2SQL4PM involved a rigorous three-phase methodology. Initially, undergraduate and graduate students generated content. This content was then meticulously curated by process mining experts who adapted it to a specific business process (authorization requests for academic travel expenses) and associated it with various qualifiers. Finally, the dataset was augmented with human-generated paraphrases and professional translations into English, ensuring its richness and utility for diverse research purposes.

Why Process Mining is a Unique Challenge for Text-to-SQL

Process mining presents several hurdles for text-to-SQL solutions. The domain’s specialized vocabulary, the non-normalized single-table structure of event logs, and the critical importance of temporal ordering of events make it difficult for general-purpose models to perform effectively. For instance, understanding concepts like ‘case level’ versus ‘event level’ analysis, or correctly interpreting temporal sequences of activities, requires deep domain knowledge that traditional models often lack.
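The difference between event-level and case-level questions, and the role of temporal ordering, can be sketched with a few queries. This is an illustration under assumed names: the `event_log` table, its columns, and the sample rows are hypothetical and not drawn from the dataset.

```python
import sqlite3

# Hypothetical single-table event log, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (case_id TEXT, activity TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?, ?)",
    [
        ("c1", "Request submitted", "2024-01-02T09:00:00"),
        ("c1", "Request approved",  "2024-01-03T14:30:00"),
        ("c2", "Request submitted", "2024-01-04T10:15:00"),
        ("c2", "Request rejected",  "2024-01-05T11:00:00"),
        ("c2", "Request submitted", "2024-01-06T08:45:00"),
        ("c3", "Request submitted", "2024-01-07T09:30:00"),
        ("c3", "Request rejected",  "2024-01-08T16:00:00"),
    ],
)

# Event level: how many submission events were recorded?
event_level = conn.execute(
    "SELECT COUNT(*) FROM event_log WHERE activity = 'Request submitted'"
).fetchone()[0]

# Case level: how many cases contain at least one submission?
case_level = conn.execute(
    "SELECT COUNT(DISTINCT case_id) FROM event_log "
    "WHERE activity = 'Request submitted'"
).fetchone()[0]

# Temporal ordering: which cases were submitted again after a rejection?
# The self-join compares timestamps of two events within the same case.
resubmitted = [
    row[0]
    for row in conn.execute(
        """
        SELECT DISTINCT a.case_id
        FROM event_log AS a
        JOIN event_log AS b ON a.case_id = b.case_id
        WHERE a.activity = 'Request rejected'
          AND b.activity = 'Request submitted'
          AND a.timestamp < b.timestamp
        ORDER BY a.case_id
        """
    )
]
print(event_level, case_level, resubmitted)
```

The first two queries differ only by `DISTINCT case_id`, yet they answer different questions. A model that does not recognize which level an utterance refers to can return a plausible but wrong number.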

Baseline Performance with GPT-3.5 Turbo

To establish a baseline, the researchers tested the dataset with GPT-3.5 Turbo, a large language model. Using a zero-shot prompt engineering approach, the model was tasked with generating SQL statements from the natural language utterances. The evaluation used two key indicators: ‘exact set match without values’ (structure indicator) and ‘execution accuracy’ (run indicator).
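The idea behind the run indicator can be sketched as follows: execute the gold query and the predicted query against the same database and compare the rows they return. This is a simplified illustration, not the evaluation code used in the study. The helper names, table, and queries are hypothetical, and row order is ignored here, which would be too lenient for queries whose answer depends on ordering. The structure indicator is not reproduced.

```python
import sqlite3
from collections import Counter

def run(conn, sql):
    """Execute a query and return its rows as a multiset (row order ignored)."""
    return Counter(conn.execute(sql).fetchall())

def execution_match(conn, gold_sql, predicted_sql):
    """True when both queries execute and return the same rows."""
    try:
        return run(conn, gold_sql) == run(conn, predicted_sql)
    except sqlite3.Error:
        return False  # a query that fails to execute counts as a miss

# Hypothetical event log, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (case_id TEXT, activity TEXT)")
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?)",
    [
        ("c1", "Request submitted"), ("c1", "Request approved"),
        ("c2", "Request submitted"), ("c2", "Request rejected"),
        ("c2", "Request rejected"),
        ("c3", "Request submitted"), ("c3", "Request rejected"),
    ],
)

# Two differently structured queries for "how many cases were rejected?"
gold = (
    "SELECT COUNT(DISTINCT case_id) FROM event_log "
    "WHERE activity = 'Request rejected'"
)
predicted = (
    "SELECT COUNT(*) FROM (SELECT case_id FROM event_log "
    "WHERE activity = 'Request rejected' GROUP BY case_id)"
)

same_result = execution_match(conn, gold, predicted)
print(same_result)
```

The two queries differ in structure, so a structural comparison would not treat them as identical, yet they return the same rows on this data. That is the kind of case in which an execution-based score can exceed a structure-based one.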

The results showed that while the structure indicator had success rates of around 31-32% for both languages, the run indicator achieved higher rates of 44-47%. This suggests that GPT-3.5 Turbo often generates SQL statements that, despite not perfectly matching the ‘gold standard’ structure, still produce the correct results when executed. Challenges were particularly noted in queries involving case-level analysis, domain-specific vocabulary, and complex temporal ordering. For example, queries asking for the “greatest number of events” might be misinterpreted by the model, leading to a single result instead of all instances that meet the criteria.
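One way such a misreading of “greatest number of events” could look is shown below. This is a hypothetical reconstruction for illustration: the queries, table, and rows are assumptions, not examples taken from the paper or the dataset.

```python
import sqlite3

# Hypothetical event log in which two cases tie for the most events.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (case_id TEXT, activity TEXT)")
conn.executemany(
    "INSERT INTO event_log VALUES (?, ?)",
    [
        ("c1", "A"), ("c1", "B"), ("c1", "C"),
        ("c2", "A"), ("c2", "B"), ("c2", "C"),
        ("c3", "A"),
    ],
)

# Misreading: sort by event count and keep one row, which drops tied cases.
top_one = conn.execute(
    """
    SELECT case_id, COUNT(*) AS n
    FROM event_log
    GROUP BY case_id
    ORDER BY n DESC
    LIMIT 1
    """
).fetchall()

# Intended reading: return every case whose event count equals the maximum.
all_top = conn.execute(
    """
    SELECT case_id, COUNT(*) AS n
    FROM event_log
    GROUP BY case_id
    HAVING COUNT(*) = (
        SELECT MAX(n)
        FROM (SELECT COUNT(*) AS n FROM event_log GROUP BY case_id)
    )
    ORDER BY case_id
    """
).fetchall()
print(len(top_one), all_top)
```

The `LIMIT 1` version returns a single case, and which of the tied cases it picks is not guaranteed. The `HAVING` version returns both.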

Impact and Future Directions

The text2SQL4PM dataset is a valuable contribution to the fields of natural language processing and process mining. It serves as a precise benchmark for evaluating text-to-SQL implementations in a specialized domain and offers a rich resource for semantic parsing, machine translation, and paraphrase generation tasks. Its bilingual nature, human-curated content, and detailed qualifiers make it particularly useful for fine-tuning models for Portuguese language processing. While currently focused on exploratory information retrieval, future research could explore extending its utility to more advanced process mining tasks or integrating with process query languages. More details are available in the research paper, “Text-to-SQL Oriented to the Process Mining Domain: A PT-EN Dataset for Query Translation.”
