TLDR: A new framework automates schema lineage extraction from complex, multilingual enterprise data pipelines to address ‘semantic drift.’ It introduces a composite evaluation metric (SLiCE) and a benchmark dataset of 1,700 manually annotated lineages from real-world scripts. Experiments show that extraction accuracy improves with model size and with more sophisticated prompting, especially Chain-of-Thought. A 32B open-source model guided by a single reasoning trace performs comparably to GPT-series models under standard prompting, offering a cost-effective solution for data governance and AI applications.
In today’s data-driven world, large enterprises rely on complex data pipelines to transform raw information into valuable insights. These pipelines, however, often involve multiple programming languages and intricate transformations, leading to a significant challenge known as ‘semantic drift.’ This means the original meaning and context of data can get lost as it moves through various processing stages, making it difficult to trace where data came from, ensure its accuracy, and govern its use effectively. This issue also impacts the performance of advanced AI services like retrieval-augmented generation (RAG) and text-to-SQL systems.
To tackle this problem, researchers have introduced a framework for automatically extracting fine-grained schema lineage from these multilingual enterprise pipeline scripts. Schema lineage maps out the journey of data, showing how each field in a final dataset was derived from its original sources. The proposed method identifies four components: source schemas (the original data fields), source tables (where the data originated), transformation logic (the operations applied to the data), and aggregation operations (like summing or counting data).
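As a rough illustration of the four components, a single lineage record could be modeled as follows. This is a minimal sketch, not the paper's representation: the class name, field names, and the example values for a hypothetical ‘TotalAmountSpent’ column are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaLineage:
    """Lineage of one output column (hypothetical field names)."""

    target_column: str
    source_schemas: tuple[str, ...]  # the original data fields
    source_tables: tuple[str, ...]   # where the data originated
    transformation: str              # the operations applied to the data
    aggregation: str = ""            # e.g. a sum or count; empty if none


# Hypothetical record; the table and column names are invented for this example.
lineage = SchemaLineage(
    target_column="TotalAmountSpent",
    source_schemas=("amount", "customer_id"),
    source_tables=("orders",),
    transformation="keep completed orders, then group by customer_id",
    aggregation="SUM(amount)",
)

print(lineage.source_tables, lineage.aggregation)
```

Keeping the record immutable (frozen) makes it safe to use as a reference annotation that extraction output is later compared against.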
This framework creates a standardized way to represent data transformations, making it easier to understand and manage. To rigorously evaluate the quality of the extracted lineage, the paper introduces a new metric called Schema Lineage Composite Evaluation (SLiCE). SLiCE is designed to assess both the structural correctness of the lineage (does it follow the right format?) and its semantic fidelity (does it accurately reflect the data’s meaning?). This comprehensive metric provides a unified score while also offering detailed diagnostics for each component of the lineage.
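The exact SLiCE formula is not given in this summary, so the following is only a toy sketch of the general idea: check structure first, then score each component and combine the results into one number. The function names, the equal weighting, and the use of set overlap and character-level similarity are assumptions, not the paper's definitions.

```python
from difflib import SequenceMatcher

COMPONENTS = ("source_schemas", "source_tables", "transformation", "aggregation")


def is_well_formed(lineage: dict) -> bool:
    """Structural check: all four components present with the expected types."""
    return (
        all(key in lineage for key in COMPONENTS)
        and isinstance(lineage["source_schemas"], list)
        and isinstance(lineage["source_tables"], list)
        and isinstance(lineage["transformation"], str)
        and isinstance(lineage["aggregation"], str)
    )


def set_overlap(predicted: list, reference: list) -> float:
    """Jaccard overlap of two name lists; two empty lists count as a match."""
    a, b = set(predicted), set(reference)
    return 1.0 if not a and not b else len(a & b) / len(a | b)


def text_similarity(predicted: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a crude stand-in for semantic fidelity."""
    return SequenceMatcher(None, predicted.lower(), reference.lower()).ratio()


def toy_composite_score(predicted: dict, reference: dict) -> dict:
    """Per-component diagnostics plus an overall score (assumed: equal weights)."""
    if not is_well_formed(predicted):
        return {"overall": 0.0}  # malformed output fails the structural gate
    parts = {
        "source_schemas": set_overlap(predicted["source_schemas"], reference["source_schemas"]),
        "source_tables": set_overlap(predicted["source_tables"], reference["source_tables"]),
        "transformation": text_similarity(predicted["transformation"], reference["transformation"]),
        "aggregation": text_similarity(predicted["aggregation"], reference["aggregation"]),
    }
    overall = sum(parts.values()) / len(parts)
    return {**parts, "overall": overall}


# Hypothetical reference annotation and a prediction with one extra source table.
reference = {
    "source_schemas": ["amount", "customer_id"],
    "source_tables": ["orders"],
    "transformation": "group by customer_id",
    "aggregation": "SUM(amount)",
}
predicted = dict(reference, source_tables=["orders", "customers"])
print(toy_composite_score(predicted, reference))
```

Returning the per-component scores alongside the overall value mirrors the article's point that a composite metric can give both a unified score and diagnostics for each part of the lineage.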
A significant contribution of this research is a new benchmark dataset. This dataset comprises 1,700 manually annotated lineages, derived from 50 real-world industrial scripts written in SQL, Python, and C#. These scripts represent diverse business domains and varying levels of complexity, providing a high-fidelity standard for testing and improving schema lineage extraction models.
The researchers conducted extensive experiments using 12 different language models, ranging from smaller models (1.3 billion parameters) to large language models (LLMs) like GPT-4o and GPT-4.1. They explored three prompting strategies: base (minimal instructions), few-shot (providing examples), and Chain-of-Thought (CoT) (including step-by-step reasoning traces). The results clearly demonstrate that the performance of schema lineage extraction improves significantly with larger model sizes and more sophisticated prompting techniques.
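The difference between the three prompting strategies can be sketched as a small prompt builder. This is an illustrative assumption about how such prompts might be assembled; the instruction wording, the worked example, and the function name are hypothetical and are not taken from the paper.

```python
INSTRUCTION = (
    "Extract the schema lineage of the requested column as JSON with keys "
    "source_schemas, source_tables, transformation, aggregation."
)

# Hypothetical worked example reused by the few-shot and CoT strategies.
EXAMPLE_SCRIPT = (
    "SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id"
)
EXAMPLE_REASONING = (
    "total is defined as SUM(amount); amount is read from orders; "
    "rows are grouped by customer_id."
)
EXAMPLE_ANSWER = (
    '{"source_schemas": ["amount"], "source_tables": ["orders"], '
    '"transformation": "group by customer_id", "aggregation": "SUM(amount)"}'
)


def build_prompt(script: str, column: str, strategy: str = "base") -> str:
    """Assemble a prompt for one of: 'base', 'few_shot', 'cot'."""
    if strategy not in ("base", "few_shot", "cot"):
        raise ValueError(f"unknown strategy: {strategy}")

    sections = [INSTRUCTION]
    if strategy in ("few_shot", "cot"):
        example = [f"Example script:\n{EXAMPLE_SCRIPT}", "Example column: total"]
        if strategy == "cot":
            # CoT adds a step-by-step reasoning trace before the answer.
            example.append(f"Reasoning: {EXAMPLE_REASONING}")
        example.append(f"Example answer: {EXAMPLE_ANSWER}")
        sections.append("\n".join(example))

    sections.append(f"Script:\n{script}\nColumn: {column}\nAnswer:")
    return "\n\n".join(sections)


print(build_prompt("SELECT COUNT(*) AS n FROM visits", "n", strategy="cot"))
```

In this sketch the only thing separating few-shot from CoT is the reasoning line attached to the worked example, which matches the article's description of CoT as including step-by-step reasoning traces.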
Notably, a 32-billion-parameter open-source model, when guided by a single reasoning trace, achieved performance comparable to the proprietary GPT-series models under standard prompting. This finding suggests a scalable and economical approach for deploying ‘schema-aware’ AI agents in practical business applications. Such agents could automatically understand and document data transformations, improving data literacy and governance within organizations.
The implications of this work are far-reaching. Accurate schema lineage extraction can lead to the automated creation of high-quality documentation for dynamic data pipelines, which can then serve as a knowledge base for RAG systems. This means AI can generate precise, contextual business statements about data, such as explaining how a ‘TotalAmountSpent’ metric was calculated. It can also improve text-to-SQL tasks by providing AI with precise definitions and relevant business context, making AI-driven analytical workflows more efficient and reliable. The full research paper has further details.


