TLDR: A new framework, MAFG, automates scientific data feature generation using multi-agent reinforcement learning to create high-order feature combinations and integrates large language models for interpreting and validating these features. Tested on diverse scientific datasets, MAFG significantly improves machine learning model performance and provides explainable insights, addressing challenges of manual feature engineering and enhancing data mining.
In the era of big data, scientific datasets are growing in size and complexity, making it crucial to effectively utilize the information within their features. Feature generation, a key preprocessing step for tabular scientific data, aims to improve the predictive power of original data by creating higher-order feature combinations and removing redundant ones. Traditional methods, however, face significant challenges: they require extensive domain-specific expertise and struggle with the exponentially expanding search space as feature combinations increase.
Inspired by advancements in Data-Centric Artificial Intelligence (DCAI), a new research paper introduces the Multi-agent Feature Generation (MAFG) framework. This innovative framework redefines the conventional feature generation workflow by automating the process and significantly enhancing various downstream scientific data mining tasks. You can find the full research paper here.
How MAFG Works: A Collaborative Approach
The MAFG framework employs three collaborating reinforcement learning agents to dynamically generate and optimize features. Working together in an iterative exploration stage, the agents construct mathematical transformation equations, synthesize new features, and identify feature combinations with high information content. Through a reinforcement learning mechanism, they continuously evolve their strategies, learning from interactions with the data and optimizing their parameters based on performance feedback.
Specifically, the framework includes two feature clustering agents (Agent_C1, Agent_C2) and one operation selection agent (Agent_Op). Agent_C1 selects an initial feature subset, Agent_Op chooses a transformation operator (like addition, multiplication, square root, or normalization), and Agent_C2, when needed for binary operations, selects a second feature subset. This collaborative decision-making process allows for the exploration of complex feature combinations.
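To make the division of labor concrete, here is a minimal sketch of one exploration step. The agent names come from the paper, but the policies are stubbed with random choices and the operator set is illustrative; the real agents are learned reinforcement-learning policies.

```python
import numpy as np

# Hypothetical sketch of one MAFG exploration step. Agent_C1, Agent_Op, and
# Agent_C2 are stubbed with random choices; in the paper they are RL policies.
rng = np.random.default_rng(0)

UNARY_OPS = {"sqrt": lambda x: np.sqrt(np.abs(x)), "square": lambda x: x ** 2}
BINARY_OPS = {"add": np.add, "multiply": np.multiply}

def generate_feature(X):
    """X: (n_samples, n_features) array holding the current feature pool."""
    n_features = X.shape[1]
    # Agent_C1: select an initial feature subset (stub: one random column).
    f1 = X[:, rng.integers(n_features)]
    # Agent_Op: choose a transformation operator.
    op_name = rng.choice(list(UNARY_OPS) + list(BINARY_OPS))
    if op_name in UNARY_OPS:
        new_feature = UNARY_OPS[op_name](f1)
    else:
        # Agent_C2: binary operators need a second feature subset.
        f2 = X[:, rng.integers(n_features)]
        new_feature = BINARY_OPS[op_name](f1, f2)
    # Append the generated feature to the pool for the next iteration.
    return np.column_stack([X, new_feature])

X = rng.normal(size=(100, 4))
X_new = generate_feature(X)
print(X_new.shape)  # the pool grows by one column per step
```

Iterating this step and rewarding the agents with downstream model performance is what lets the framework explore the exponentially large space of feature combinations selectively rather than exhaustively.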
Integrating Large Language Models for Interpretability
A unique aspect of MAFG is its integration of Large Language Models (LLMs). After the exploration phase, LLMs are used to interpretatively evaluate the generated features, especially those that lead to significant improvements in model performance. This addresses a common limitation of traditional reinforcement learning-based feature engineering: the difficulty in explaining the high-order transformed features. The LLM module helps quantify the impact of features, cross-validates their scientific rationality with existing domain knowledge, and helps remove overly complex or unexplainable features, ensuring the scientific validity and practical usability of the generated features.
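The shape of such an interpretability check can be sketched as a review prompt sent to an LLM for each high-impact feature. The prompt wording and the KEEP/DROP protocol below are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical sketch of the LLM interpretability step: ask the model whether
# a generated feature is scientifically plausible. The prompt format and the
# KEEP/DROP convention are assumptions for illustration.
def build_review_prompt(feature_expr, metric_gain, domain):
    return (
        f"Domain: {domain}\n"
        f"Generated feature: {feature_expr}\n"
        f"Performance gain (1-RAE): {metric_gain:+.1%}\n"
        "Question: Is this feature combination consistent with established "
        "domain knowledge? Answer KEEP if it is plausible and interpretable, "
        "or DROP if it is overly complex or unexplainable, with one sentence "
        "of justification."
    )

prompt = build_review_prompt(
    "temperature * precipitation", 0.185, "epidemiology (HFMD incidence)"
)
print(prompt)
```

Features the reviewer flags as DROP would then be pruned from the pool, which is how overly complex or unexplainable combinations get filtered out.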
Demonstrated Effectiveness Across Scientific Domains
The MAFG framework was rigorously tested on three diverse scientific datasets from the Science Data Bank: a hand-foot-mouth disease incidence dataset (meteorological and environmental factors), a student learning engagement and teacher-student relationship dataset, and a cystatin C and frailty risk dataset (renal function indicators). Experimental results consistently demonstrated that MAFG significantly improved the performance of downstream machine learning models compared to using original data alone.
For instance, in the hand-foot-mouth disease dataset, a generated feature combining “temperature × precipitation” led to an 18.5% improvement in the 1-RAE performance metric. For student learning, a complex feature involving “dedication cubed × teacher-student relationship squared / absorption” reduced the RMSE by 69.2%. In the frailty risk dataset, a feature combining “(newcrp² + newhba1c) × cystatinc” showed a remarkable 1650% increase in 1-RAE, highlighting the framework’s ability to uncover powerful, interpretable relationships.
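The three reported combinations are ordinary arithmetic over the original columns, which is part of what makes them interpretable. A minimal reconstruction, with variable names as illustrative stand-ins for the datasets' actual fields:

```python
# The three generated features reported above, written out as plain
# arithmetic. Parameter names are illustrative stand-ins for the real fields.

def hfmd_feature(temperature, precipitation):
    # HFMD incidence: temperature x precipitation
    return temperature * precipitation

def engagement_feature(dedication, relationship, absorption):
    # Student learning: dedication cubed x relationship squared / absorption
    return dedication ** 3 * relationship ** 2 / absorption

def frailty_feature(newcrp, newhba1c, cystatinc):
    # Frailty risk: (newcrp squared + newhba1c) x cystatinc
    return (newcrp ** 2 + newhba1c) * cystatinc

print(hfmd_feature(25.0, 3.2))            # 80.0
print(engagement_feature(4.0, 3.0, 3.6))  # 160.0
print(frailty_feature(1.0, 6.0, 1.0))     # 7.0
```

Because each feature is a closed-form expression over named inputs, a domain expert can sanity-check it directly, unlike an opaque learned embedding.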
The research also compared different deep reinforcement learning algorithms, with DuelingDDQN consistently achieving the best performance across all datasets. Furthermore, the study showed that integrating feature selection components within MAFG effectively reduced redundancy and improved the utilization of newly generated, high-value features, accelerating the feature generation process.
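The distinguishing idea in DuelingDDQN is the dueling aggregation: the Q-network splits into a state-value stream V(s) and an advantage stream A(s, a) before recombining. A minimal sketch of that aggregation (the numbers are placeholders, not values from the study):

```python
import numpy as np

# Minimal sketch of the dueling aggregation at the heart of DuelingDDQN:
# Q(s, a) = V(s) + A(s, a) - mean_a A(s, a). The inputs here are placeholder
# numbers, standing in for the outputs of the two network streams.
def dueling_q_values(value, advantages):
    """Combine a scalar state value with per-action advantage estimates."""
    return value + advantages - advantages.mean()

advantages = np.array([0.5, -0.2, 0.1])
q = dueling_q_values(1.0, advantages)
print(q)                # centered around V(s) = 1.0
print(int(q.argmax()))  # 0 -- the action with the largest advantage
```

Subtracting the mean advantage makes the decomposition identifiable, so the value stream can learn how good a state is independently of which transformation the agent picks, which helps in large action spaces like operator-and-subset selection.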
Conclusion
The MAFG framework offers a novel perspective for feature engineering in scientific datasets, overcoming the limitations of traditional methods. By combining multi-agent reinforcement learning for automated feature discovery and large language models for scientific interpretation, it provides an effective, robust, and explainable solution for enhancing scientific data mining and analysis. This approach not only boosts predictive model performance but also generates valuable, interpretable knowledge from complex scientific data, paving the way for more intelligent data processing in scientific research.


