TLDR: SI-LLM is a novel framework that uses Large Language Models to automatically infer concise conceptual schemas for tabular data. It analyzes column headers and cell values to identify hierarchical entity types, attributes, and inter-type relationships without relying on bespoke training or domain ontologies. This three-step process significantly improves data understanding and exploration for heterogeneous datasets.
In the vast and ever-growing landscape of digital information, tabular data—found in everything from spreadsheets to massive data lakes—often presents a significant challenge. These datasets, frequently collected from diverse sources, are rarely perfectly organized. They suffer from inconsistent representations and sparse metadata, making it incredibly difficult for data scientists and analysts to understand and utilize them effectively.
While previous efforts have focused on discovering and exploring datasets, the crucial task of schema inference—understanding the underlying structure and meaning of the data—has remained a hurdle, especially when metadata is limited. This is where a new framework, SI-LLM (Schema Inference using Large Language Models), steps in.
Developed by Zhenyu Wu, Jiaoyan Chen, and Norman W. Paton from the University of Manchester, SI-LLM offers a novel, end-to-end approach to infer a concise conceptual schema for tabular data. What makes SI-LLM particularly innovative is its reliance solely on column headers and cell values, completely bypassing the need for bespoke training data or pre-existing domain ontologies. The inferred schema is rich, comprising hierarchical entity types, their attributes, and the relationships between these types.
How SI-LLM Works: A Three-Step Process
SI-LLM operates through a systematic, prompt-based framework, leveraging the power of Large Language Models (LLMs) to make sense of complex tabular data:
1. Inferring Type Hierarchy: The first step constructs a conceptual type hierarchy for each individual dataset. Imagine a family tree for your data, starting from a generic root like “Thing” and branching out to more specific categories like “CreativeWork” and then “Movie.” These per-dataset hierarchies are then merged into a unified global hierarchy, with inconsistent or erroneous connections pruned. The LLMs are prompted to infer these full type hierarchies directly, rather than building them incrementally, which helps capture more coherent structures (see the first sketch after this list).
2. Inferring Conceptual Attributes: Once the type hierarchy is established, SI-LLM identifies conceptual attributes for each type. For instance, for a “Movie” type, attributes like “Movie Title” or “Production Company” are inferred from the various column headers and sample cell values across relevant tables. The system also resolves different phrasings (e.g., “Producer,” “Release Company,” and “Studio” all becoming “Production Company”) and can even propagate attributes from child types up to their parents in the hierarchy (second sketch below).
3. Discovering Relationships Between Types: The final step is to uncover semantic relationships between different conceptual types. This is achieved by examining the values within attributes. For example, if the “production company” attribute of a “Movie” type frequently contains values like “Warner Bros.” or “Walt Disney Studios,” SI-LLM recognizes these as instances of a “Company” type. This insight allows it to infer a relationship, such as “ProducedBy,” linking “Movie” to “Company” (third sketch below).
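The paper’s exact prompts aren’t reproduced here, but a minimal sketch of step 1 might look like the following, assuming an OpenAI-compatible chat client. The prompt wording, model name, and helper functions are illustrative assumptions, not the authors’ actual implementation.

```python
# Step 1 (sketch): prompt an LLM for a full root-to-leaf type hierarchy per
# table, then union the per-table paths into one global hierarchy.
# Prompt text, model choice, and helper names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def infer_type_hierarchy(headers: list[str], rows: list[list[str]]) -> str:
    """Ask the LLM for a single type path such as 'Thing > CreativeWork > Movie'."""
    preview = "\n".join([" | ".join(headers)] + [" | ".join(r) for r in rows[:5]])
    prompt = (
        "Infer a conceptual type hierarchy for the entities in this table, "
        "from a generic root ('Thing') down to the most specific type. "
        "Answer as one path, e.g. 'Thing > CreativeWork > Movie'.\n\n" + preview
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def merge_hierarchies(paths: list[str]) -> dict[str, set[str]]:
    """Union per-table paths into a parent -> children map. The pruning of
    inconsistent or erroneous edges described above is omitted in this sketch."""
    tree: dict[str, set[str]] = {}
    for path in paths:
        types = [t.strip() for t in path.split(">")]
        for parent, child in zip(types, types[1:]):
            tree.setdefault(parent, set()).add(child)
    return tree
```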
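Step 2 can be sketched the same way: given the columns of all tables assigned to one type, the model is asked to group synonymous headers under canonical attribute names. Again, the prompt and the JSON output contract are assumptions for illustration.

```python
# Step 2 (sketch): consolidate raw column headers of one inferred type into
# canonical conceptual attributes. Prompt wording and output format assumed.
import json
from openai import OpenAI

client = OpenAI()

def infer_attributes(type_name: str, columns: dict[str, list[str]]) -> dict[str, str]:
    """Return a mapping from raw header to canonical attribute, e.g.
    {'Producer': 'Production Company', 'Studio': 'Production Company'}."""
    described = "\n".join(
        f"- {header}: sample values {values[:3]}" for header, values in columns.items()
    )
    prompt = (
        f"These columns come from tables about the type '{type_name}':\n{described}\n\n"
        "Group columns that denote the same conceptual attribute and return a "
        "JSON object mapping each raw header to one canonical attribute name."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(resp.choices[0].message.content)
```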
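Finally, step 3 checks, attribute by attribute, whether the values look like instances of another inferred type and, if so, asks for a relationship name. The function below is again a hypothetical sketch, with the article’s “ProducedBy” example as a usage illustration.

```python
# Step 3 (sketch): test whether an attribute's values are instances of another
# type and, if so, name the relationship. Illustrative prompt and helper names.
from openai import OpenAI

client = OpenAI()

def infer_relationship(source_type: str, attribute: str,
                       values: list[str], candidate_type: str) -> str | None:
    """Return a relationship name such as 'ProducedBy', or None if none holds."""
    prompt = (
        f"Type '{source_type}' has an attribute '{attribute}' with values like "
        f"{', '.join(values[:5])}. Are these values instances of the type "
        f"'{candidate_type}'? If yes, reply only with a short relationship name "
        "(e.g. 'ProducedBy'); otherwise reply 'None'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content.strip()
    return None if answer.lower().startswith("none") else answer

# Example from the article: Movie.production company -> Company
# infer_relationship("Movie", "Production Company",
#                    ["Warner Bros.", "Walt Disney Studios"], "Company")
# would be expected to yield something like "ProducedBy".
```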
Performance and Impact
Extensive evaluations on two diverse datasets, WDC (web tables) and GDS (open government data), show that SI-LLM performs well. It consistently achieves high purity in identifying top-level types, meaning tables are correctly assigned to their overarching categories. While its Rand Index (a measure of clustering quality) is competitive, its strength lies in producing rich, fine-grained hierarchies with many types, offering a more detailed semantic model than many existing embedding-based approaches.
For attribute inference, SI-LLM demonstrates robust performance, often outperforming or matching strong baselines. In relationship discovery, it significantly improves recall and F1 scores compared to traditional methods that rely on column similarity, because it directly infers conceptual-level relationships from attribute names and values.
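For readers who want the metric definitions (these are the standard formulations, not specific to the paper): with $N$ tables, predicted clusters $c_k$, ground-truth classes $t_j$, and $TP$/$TN$ counting pairs of tables correctly grouped together/apart,

\[
\text{Purity} = \frac{1}{N}\sum_{k}\max_{j}\,\lvert c_k \cap t_j\rvert,
\qquad
\text{RI} = \frac{TP + TN}{\binom{N}{2}},
\qquad
F_1 = \frac{2PR}{P + R},
\]

where $P$ and $R$ denote precision and recall.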
A case study on the GDS benchmark further illustrates SI-LLM’s capabilities, showing how it can integrate heterogeneous tables into a coherent schema, recovering a high percentage of annotated types and aligning well with ground-truth relationships.
The Future of Data Understanding
SI-LLM represents a significant step forward in making complex tabular data more accessible and understandable. By leveraging the advanced reasoning capabilities of Large Language Models, it automates a traditionally challenging task without requiring extensive manual curation or external knowledge bases. This framework holds immense potential for data scientists and engineers, enabling them to more efficiently discover, explore, and extract value from the ever-growing repositories of tabular data.
For more technical details, see the full research paper.