TLDR: This paper addresses Association Rule Mining (ARM) in datasets with many features but few samples, a setting common in biomedicine. It shows that the neurosymbolic method Aerial+ runs one to two orders of magnitude faster than other state-of-the-art methods on high-dimensional data. It also proposes two fine-tuning techniques, built on a tabular foundation model, that improve the quality of the rules discovered in low-data settings, offering a path to knowledge discovery that is both scalable and high in quality.
Association Rule Mining (ARM) is a powerful technique used to uncover hidden patterns and relationships within datasets, often expressed as ‘if-then’ rules. These rules are crucial for both discovering new knowledge and building interpretable machine learning models, especially in critical decision-making scenarios. However, as datasets grow in complexity, particularly with a large number of features (high-dimensional data), traditional ARM methods face significant challenges like an overwhelming number of rules (rule explosion) and heavy computational demands.
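To make the ‘if-then’ idea concrete, here is a minimal sketch of how a single rule can be scored by support and confidence, two standard ARM measures. The toy data and the item names (gene_A, gene_B, disease) are invented for illustration and do not come from the paper.

```python
# Toy dataset: each row is the set of items (feature=value pairs) seen in one sample.
# All names and values are made up for illustration.
transactions = [
    {"gene_A=high", "gene_B=high", "disease=yes"},
    {"gene_A=high", "gene_B=high", "disease=yes"},
    {"gene_A=high", "gene_B=low", "disease=no"},
    {"gene_A=low", "gene_B=high", "disease=no"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= row for row in rows) / len(rows)


def confidence(antecedent, consequent, rows):
    """Estimated P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    antecedent_support = support(antecedent, rows)
    if antecedent_support == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(both, rows) / antecedent_support


# Rule: IF gene_A=high AND gene_B=high THEN disease=yes
rule_if = {"gene_A=high", "gene_B=high"}
rule_then = {"disease=yes"}

rule_support = support(rule_if | rule_then, transactions)
rule_confidence = confidence(rule_if, rule_then, transactions)
print(f"support={rule_support}, confidence={rule_confidence}")
```

Traditional algorithms score very large numbers of such candidate rules, which is where the computational cost described above comes from.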
A recent paper, accepted at the ECAI 2025 Workshop: 1st International Workshop on Advanced Neuro-Symbolic Applications (ANSyA), addresses these challenges head-on. The paper, “Discovering Association Rules in High-Dimensional Small Tabular Data,” by Erkan Karabulut, Daniel Daza, Paul Groth, and Victoria Degeler, introduces new approaches to make ARM more efficient and effective.
Neurosymbolic methods, which combine the strengths of neural networks with symbolic reasoning, have emerged as a promising solution to the rule explosion problem. One such method, Aerial+, has shown great potential. However, like all neural network-based approaches, Aerial+ can struggle when there isn’t much data available, a situation known as a low-data regime. This is particularly common in fields like biomedicine, where datasets might have thousands of features (e.g., genes) but only a handful of samples (e.g., patients).
The paper makes three significant contributions. First, it empirically demonstrates that Aerial+ is remarkably scalable, performing one to two orders of magnitude faster than other state-of-the-art algorithmic and neurosymbolic methods on high-dimensional datasets. This means it can process much larger and more complex datasets in a fraction of the time.
Second, the researchers formally introduce and tackle the problem of ARM in high-dimensional, low-data settings. This is a crucial area, as many real-world datasets, such as gene expression data with around 18,000 features and only 50 samples, fall into this category. The paper highlights that while neurosymbolic methods offer scalability, they need longer training to find high-quality rules in these challenging scenarios.
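A rough back-of-the-envelope count shows why this regime is hard. The sketch below assumes, purely for illustration, that each of the roughly 18,000 features contributes a single item; real discretized data would yield more items per feature, so these numbers are a lower bound rather than a figure from the paper.

```python
from math import comb

n_features = 18_000  # approximate feature count quoted above
n_samples = 50       # approximate sample count quoted above


def candidate_itemsets(n_items, max_size):
    """Number of distinct non-empty itemsets containing at most `max_size` items."""
    return sum(comb(n_items, k) for k in range(1, max_size + 1))


pairs = comb(n_features, 2)
up_to_three = candidate_itemsets(n_features, 3)

print(f"candidate item pairs:      {pairs:,}")
print(f"itemsets of size <= 3:     {up_to_three:,}")
print(f"candidate pairs per sample: {pairs // n_samples:,}")
```

With so many candidate patterns and only a few dozen rows to support them, exhaustive enumeration is expensive and individual patterns are backed by very little evidence.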
Third, to overcome the limitations of neurosymbolic methods in low-data regimes, the paper proposes two innovative fine-tuning strategies for Aerial+ that leverage tabular foundation models. Foundation models are large neural networks pre-trained on vast amounts of data to capture general patterns, which can then be adapted for specific tasks. The proposed methods, called Weight Initialization (Aerial+WI) and Double Loss (Aerial+DL), utilize embeddings from TabPFN, a tabular foundation model, to significantly improve the quality of the discovered rules.
The Weight Initialization approach uses the foundation model to create a semantically meaningful starting point for Aerial+’s neural network, guiding it to learn better representations from the outset. The Double Loss strategy, on the other hand, integrates the foundation model’s insights directly into Aerial+’s training process, ensuring that the rules learned are not only accurate but also semantically consistent with the broader patterns captured by the foundation model.
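The Double Loss idea can be illustrated with a small sketch. This is not the paper's formulation, which is not reproduced in this article: it assumes a network that reconstructs its input, and it adds a hypothetical alignment term that pulls the network's internal representation toward a reference embedding from the foundation model. The function names, the cosine-distance choice, and the weight `alpha` are all assumptions.

```python
import math


def mse(xs, ys):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def double_loss(reconstruction, target, latent, reference_embedding, alpha=0.5):
    """Hypothetical combined objective: reconstruction error plus
    alpha times the misalignment with a foundation-model embedding."""
    reconstruction_term = mse(reconstruction, target)
    alignment_term = cosine_distance(latent, reference_embedding)
    return reconstruction_term + alpha * alignment_term


# Illustrative values only.
target = [1.0, 0.0, 1.0]
reconstruction = [0.9, 0.1, 0.8]
latent = [0.2, 0.7]
reference_embedding = [0.1, 0.9]  # stands in for an embedding from a foundation model

print(double_loss(reconstruction, target, latent, reference_embedding))
```

Under these assumptions, `alpha` controls how strongly the foundation model's view of the data shapes training relative to plain reconstruction. Weight Initialization, by contrast, would use such embeddings only once, to set the starting weights before training begins.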
Experimental results on five real-world gene expression datasets confirm the effectiveness of these fine-tuning methods. Both Aerial+WI and Aerial+DL consistently produced rules with higher confidence and association strength compared to the default Aerial+ version. While these fine-tuned methods generated fewer rules and sometimes covered less data, this is an expected outcome, as they prioritize rules with stronger, more meaningful associations, effectively filtering out less significant ones. Crucially, the fine-tuning process added only a negligible increase to the overall execution time.
This research marks a significant step forward for ARM, particularly in domains characterized by complex, high-dimensional data with limited samples. It underscores the immense potential of integrating neurosymbolic AI with powerful foundation models to achieve both scalable and high-quality knowledge discovery. The authors invite further exploration into how other forms of prior knowledge and advanced models can be incorporated into neurosymbolic ARM to unlock even greater insights from data. You can read the full research paper here: Discovering Association Rules in High-Dimensional Small Tabular Data.


