TLDR: RELRaE is a novel framework that leverages Large Language Models (LLMs) to automate and enhance the process of converting semi-structured XML data into explicit knowledge graphs. It employs a multi-stage approach for extracting, labeling, refining, and evaluating relationships within XML schemas, significantly reducing the manual effort required by domain experts and improving the accuracy of generated ontology labels for better data interoperability.
In today’s data-rich world, laboratories, especially those utilizing robots, generate an immense volume of information, often stored in semi-structured formats like XML. While these formats connect concepts implicitly, the true power of data lies in explicit, machine-readable semantics, typically found in knowledge graphs defined by ontologies. Bridging this gap – translating XML schemas into ontologies – is a crucial but often time-consuming and expert-dependent process.
Traditional methods for converting XML data into knowledge graphs rely heavily on domain experts, leading to bottlenecks and significant manual effort. This is particularly challenging in specialized fields like analytical chemistry, where instrument data recorded in formats such as the Analytical Information Markup Language (AnIML) must be precisely understood and structured.
Introducing RELRaE: A Hybrid Approach to Ontology Building
A new framework called RELRaE (Relationship Extraction, Labelling, Refinement, and Evaluation) has been developed to address these limitations by integrating Large Language Models (LLMs) into the XML schema-to-ontology translation process. The goal is to reduce the workload on domain experts and ontology engineers while creating a robust ‘skeleton ontology’ that represents the inter-concept relationships within an XML schema, enriched with domain knowledge.
RELRaE operates through four distinct stages:
- Concept Relationship Extraction: This initial stage identifies hierarchical relationships between concept pairs within the XML schema.
- Rule-based Label Generation: Based on the extracted structural information, a rule-based approach proposes initial labels for these relationships.
- Label Refinement: An LLM is then used to refine these initial labels, taking into account schema-based contextual information to ensure accuracy.
- Automatic Label Evaluation: Finally, a different LLM acts as a proxy for a human domain expert, assessing the suitability of each refined label on a five-point Likert scale. Labels rated ‘Likely’ or ‘Yes’ are accepted; otherwise, the original rule-based label is used.
This multi-stage process aims to produce a foundational ontology that can then be further enriched.
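To make the first two stages more concrete, here is a minimal Python sketch of how parent-child concept pairs might be pulled from an XML schema and given rule-based starting labels. It is an illustration under stated assumptions, not the paper's implementation: the toy schema, the function names `extract_pairs` and `rule_based_label`, and the "has" + child-name rule are all invented for this example.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by ElementTree for XML Schema elements.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Toy schema fragment, invented for illustration (not taken from AnIML).
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Experiment">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Sample"/>
        <xs:element name="Result">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Unit"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
""".strip()


def extract_pairs(node, parent=None):
    """Yield (parent, child) name pairs for nested xs:element declarations."""
    for child in node:
        if child.tag == XS + "element" and "name" in child.attrib:
            name = child.attrib["name"]
            if parent is not None:
                yield (parent, name)
            # Elements declared below this one become its children.
            yield from extract_pairs(child, name)
        else:
            # Pass through wrappers such as xs:complexType and xs:sequence.
            yield from extract_pairs(child, parent)


def rule_based_label(parent, child):
    """Hypothetical naming rule: prefix the child concept with 'has'."""
    return "has" + child


root = ET.fromstring(SCHEMA)
pairs = list(extract_pairs(root))
labels = {pair: rule_based_label(*pair) for pair in pairs}

for (parent, child), label in labels.items():
    print(f"{parent} --{label}--> {child}")
```

A real schema would also need handling for type references, attributes, and imported schemas, which this sketch deliberately ignores.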
Key Findings and Benefits
Empirical evaluations using the AnIML schema demonstrated that RELRaE significantly improves the accuracy of relationship labels compared to purely rule-based or LLM-only methods. The hybrid approach, combining rule-based generation with LLM refinement, consistently yielded superior results. This suggests that providing an initial, structured starting point for the LLM, rather than asking it to generate labels from scratch, leads to higher quality and more consistent outcomes.
Furthermore, the research explored the effectiveness of using an LLM as an evaluator. The findings indicate that LLMs show promise in automatically assessing the suitability of generated labels, potentially reducing the need for extensive human expert review. This capability is vital for identifying and mitigating potential ‘hallucinations’ or inaccuracies that LLMs might produce.
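The refine-then-evaluate step with its fallback to the rule-based label can be sketched in a few lines of Python. This is a hedged illustration only: the two LLM calls are replaced by stub functions (`fake_refine` and `fake_evaluate`), the function name `choose_label` is hypothetical, and the only Likert ratings taken from the description of the framework are the two accepted ones, 'Likely' and 'Yes'.

```python
# Ratings that lead to acceptance of the refined label.
ACCEPTED = {"Likely", "Yes"}


def choose_label(pair, rule_label, refine, evaluate):
    """Return the refined label if the evaluator accepts it, else the rule label.

    `refine` and `evaluate` are callables standing in for the two LLMs.
    """
    refined = refine(pair, rule_label)
    rating = evaluate(pair, refined)
    return refined if rating in ACCEPTED else rule_label


def fake_refine(pair, rule_label):
    """Stub for the refinement LLM: a fixed lookup with a poor default."""
    table = {("Result", "Unit"): "isMeasuredIn"}
    return table.get(pair, "containsThing")


def fake_evaluate(pair, label):
    """Stub for the evaluator LLM.

    'No' is a placeholder for any rating outside the accepted set.
    """
    return "Yes" if label == "isMeasuredIn" else "No"


accepted = choose_label(("Result", "Unit"), "hasUnit", fake_refine, fake_evaluate)
fallback = choose_label(("Experiment", "Sample"), "hasSample", fake_refine, fake_evaluate)
print(accepted)  # refined label was accepted
print(fallback)  # refined label was rejected, rule-based label kept
```

Passing the rule-based label into the refinement call mirrors the idea that the LLM starts from a structured suggestion rather than generating a label from scratch, and the fallback ensures a rejected refinement never leaves a relationship unlabelled.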
The RELRaE framework offers a valuable contribution to the field of ontology engineering by demonstrating how LLMs can effectively support the semi-automatic generation of ontologies, particularly in complex, domain-intensive scenarios like lab automation. By making implicit semantics explicit, this framework enhances data interoperability and lays the groundwork for more sophisticated knowledge-driven applications. You can read the full research paper here.


