TLDR: SI-LLM is a novel framework that uses Large Language Models to automatically infer concise conceptual schemas for tabular data. It analyzes column headers and cell values to identify hierarchical entity types, attributes, and inter-type relationships without relying on bespoke training or domain ontologies. This three-step process significantly improves data understanding and exploration for heterogeneous datasets.
In the vast and ever-growing landscape of digital information, tabular data—found in everything from spreadsheets to massive data lakes—often presents a significant challenge. These datasets, frequently collected from diverse sources, are rarely perfectly organized. They suffer from inconsistent representations and sparse metadata, making it incredibly difficult for data scientists and analysts to understand and utilize them effectively.
While previous efforts have focused on discovering and exploring datasets, the crucial task of schema inference—understanding the underlying structure and meaning of the data—has remained a hurdle, especially when metadata is limited. This is where a new framework, SI-LLM (Schema Inference using Large Language Models), steps in.
Developed by Zhenyu Wu, Jiaoyan Chen, and Norman W. Paton from the University of Manchester, SI-LLM offers a novel, end-to-end approach to infer a concise conceptual schema for tabular data. What makes SI-LLM particularly innovative is its reliance solely on column headers and cell values, completely bypassing the need for bespoke training data or pre-existing domain ontologies. The inferred schema is rich, comprising hierarchical entity types, their attributes, and the relationships between these types.
How SI-LLM Works: A Three-Step Process
SI-LLM operates through a systematic, prompt-based framework, leveraging the power of Large Language Models (LLMs) to make sense of complex tabular data:
1. Inferring Type Hierarchy: The first step constructs a conceptual type hierarchy for each individual dataset. Imagine a family tree for your data, starting from a generic root like “Thing” and branching out to more specific categories like “CreativeWork” and then “Movie.” These per-dataset hierarchies are then merged into a unified global hierarchy, with inconsistent or erroneous connections pruned. The LLMs are prompted to infer these full type hierarchies directly, rather than building them incrementally, which helps capture more coherent structures (see the first sketch after this list).
2. Inferring Conceptual Attributes: Once the type hierarchy is established, SI-LLM identifies conceptual attributes for each type. For instance, for a “Movie” type, attributes like “Movie Title” or “Production Company” are inferred from the various column headers and sample cell values across relevant tables. The system also resolves different phrasings (e.g., “Producer,” “Release Company,” and “Studio” all becoming “Production Company”) and can even propagate attributes from child types up to their parents in the hierarchy (second sketch below).
3. Discovering Relationships Between Types: The final step is to uncover semantic relationships between different conceptual types. This is achieved by examining the values within attributes. For example, if the “production company” attribute of a “Movie” type frequently contains values like “Warner Bros.” or “Walt Disney Studios,” SI-LLM recognizes these as instances of a “Company” type. This insight allows it to infer a relationship, such as “ProducedBy,” linking “Movie” to “Company” (third sketch below).
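The paper’s exact prompts aren’t reproduced here, but a minimal sketch of step 1 might look like the following, assuming an OpenAI-compatible chat client. The prompt wording, model name, and helper functions are illustrative assumptions, not the authors’ actual implementation.

```python
# Step 1 (sketch): prompt an LLM for a full root-to-leaf type hierarchy per
# table, then union the per-table paths into one global hierarchy.
# Prompt text, model choice, and helper names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def infer_type_hierarchy(headers: list[str], rows: list[list[str]]) -> str:
    """Ask the LLM for a single type path such as 'Thing > CreativeWork > Movie'."""
    preview = "\n".join([" | ".join(headers)] + [" | ".join(r) for r in rows[:5]])
    prompt = (
        "Infer a conceptual type hierarchy for the entities in this table, "
        "from a generic root ('Thing') down to the most specific type. "
        "Answer as one path, e.g. 'Thing > CreativeWork > Movie'.\n\n" + preview
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def merge_hierarchies(paths: list[str]) -> dict[str, set[str]]:
    """Union per-table paths into a parent -> children map. The pruning of
    inconsistent or erroneous edges described above is omitted in this sketch."""
    tree: dict[str, set[str]] = {}
    for path in paths:
        types = [t.strip() for t in path.split(">")]
        for parent, child in zip(types, types[1:]):
            tree.setdefault(parent, set()).add(child)
    return tree
```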
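Step 2 can be sketched the same way: given the columns of all tables assigned to one type, the model is asked to group synonymous headers under canonical attribute names. Again, the prompt and the JSON output contract are assumptions for illustration.

```python
# Step 2 (sketch): consolidate raw column headers of one inferred type into
# canonical conceptual attributes. Prompt wording and output format assumed.
import json
from openai import OpenAI

client = OpenAI()

def infer_attributes(type_name: str, columns: dict[str, list[str]]) -> dict[str, str]:
    """Return a mapping from raw header to canonical attribute, e.g.
    {'Producer': 'Production Company', 'Studio': 'Production Company'}."""
    described = "\n".join(
        f"- {header}: sample values {values[:3]}" for header, values in columns.items()
    )
    prompt = (
        f"These columns come from tables about the type '{type_name}':\n{described}\n\n"
        "Group columns that denote the same conceptual attribute and return a "
        "JSON object mapping each raw header to one canonical attribute name."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # request parseable JSON
    )
    return json.loads(resp.choices[0].message.content)
```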
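Finally, step 3 checks, attribute by attribute, whether the values look like instances of another inferred type and, if so, asks for a relationship name. The function below is again a hypothetical sketch, with the article’s “ProducedBy” example as a usage illustration.

```python
# Step 3 (sketch): test whether an attribute's values are instances of another
# type and, if so, name the relationship. Illustrative prompt and helper names.
from openai import OpenAI

client = OpenAI()

def infer_relationship(source_type: str, attribute: str,
                       values: list[str], candidate_type: str) -> str | None:
    """Return a relationship name such as 'ProducedBy', or None if none holds."""
    prompt = (
        f"Type '{source_type}' has an attribute '{attribute}' with values like "
        f"{', '.join(values[:5])}. Are these values instances of the type "
        f"'{candidate_type}'? If yes, reply only with a short relationship name "
        "(e.g. 'ProducedBy'); otherwise reply 'None'."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content.strip()
    return None if answer.lower().startswith("none") else answer

# Example from the article: Movie.production company -> Company
# infer_relationship("Movie", "Production Company",
#                    ["Warner Bros.", "Walt Disney Studios"], "Company")
# would be expected to yield something like "ProducedBy".
```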
Performance and Impact
Extensive evaluations on two diverse datasets, WDC (web tables) and GDS (open government data), show that SI-LLM performs well. It consistently achieves high purity in identifying top-level types, meaning tables are correctly assigned to their overarching categories. While its Rand Index (a measure of clustering quality) is competitive, its strength lies in producing rich, fine-grained hierarchies with many types, offering a more detailed semantic model than many existing embedding-based approaches.
For attribute inference, SI-LLM demonstrates robust performance, often outperforming or matching strong baselines. In relationship discovery, it significantly improves recall and F1 scores compared to traditional methods that rely on column similarity, because it directly infers conceptual-level relationships from attribute names and values.
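For readers who want the metric definitions (these are the standard formulations, not specific to the paper): with $N$ tables, predicted clusters $c_k$, ground-truth classes $t_j$, and $TP$/$TN$ counting pairs of tables correctly grouped together/apart,

\[
\text{Purity} = \frac{1}{N}\sum_{k}\max_{j}\,\lvert c_k \cap t_j\rvert,
\qquad
\text{RI} = \frac{TP + TN}{\binom{N}{2}},
\qquad
F_1 = \frac{2PR}{P + R},
\]

where $P$ and $R$ denote precision and recall.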
A case study on the GDS benchmark further illustrates SI-LLM’s capabilities, showing how it can integrate heterogeneous tables into a coherent schema, recovering a high percentage of annotated types and aligning well with ground-truth relationships.
The Future of Data Understanding
SI-LLM represents a significant step forward in making complex tabular data more accessible and understandable. By leveraging the advanced reasoning capabilities of Large Language Models, it automates a traditionally challenging task without requiring extensive manual curation or external knowledge bases. This framework holds immense potential for data scientists and engineers, enabling them to more efficiently discover, explore, and extract value from the ever-growing repositories of tabular data.
For more technical details, see the full research paper.