TLDR: MINERVA is a novel supervised feature selection method that uses neural networks to estimate mutual information between features and targets. It employs a two-stage process with a specialized loss function and sparsity-inducing regularizers to identify relevant features, especially those involved in complex, higher-order interactions. Experiments on synthetic and real-world fraud datasets demonstrate MINERVA’s superior ability to perform exact feature selection and improve predictive performance compared to existing methods.
In the world of machine learning and data analysis, dealing with vast amounts of information, often called high-dimensional data, is a common challenge. This data frequently contains features that are either irrelevant or redundant, leading to increased storage needs, higher computational costs, and less effective predictive models. The process of identifying and selecting only the most important features is known as feature selection, a crucial step in building efficient and accurate machine learning systems.
Traditional feature selection methods, often called ‘filters,’ typically rely on simple statistical measures to understand how individual features relate to the target variable we’re trying to predict. However, these methods can fall short when the target variable depends on more intricate, higher-order interactions between multiple features, rather than just the contribution of each feature on its own. Imagine trying to predict fraud; a single transaction detail might not be suspicious, but a combination of two seemingly unrelated details (like two independent transactions sharing the same device ID but from different users) could be a strong indicator. Existing methods often struggle to capture such complex dependencies.
Introducing MINERVA: A Novel Approach
To address these limitations, researchers Taurai Muvunza, Egor Kraev, Pere Planell-Morell, and Alexander Y. Shestopaloff have introduced a new method called MINERVA: Mutual Information Neural Estimation Regularized Vetting Algorithm. This innovative approach to supervised feature selection leverages neural networks to estimate the mutual information between features and targets. Mutual information is a powerful concept from information theory that quantifies how much information one random variable provides about another. Essentially, it measures the strength of the relationship between them.
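As a minimal illustration of what mutual information measures, the sketch below computes a plug-in estimate from paired discrete samples. It is not MINERVA's neural estimator; the function name `mutual_information` is chosen here for illustration only.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Y is an exact copy of a fair bit X: knowing X removes all uncertainty about Y.
copied = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])

# X and Y take every combination equally often: X says nothing about Y.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])

print(copied, independent)
```

The first value is one full bit and the second is zero, which matches the intuition that mutual information measures the strength of a relationship rather than its form.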
MINERVA’s core strength lies in approximating mutual information with neural networks. Its loss function is augmented with ‘sparsity-inducing regularizers,’ which push the weights of less important features towards zero so that only the most relevant features keep a non-negligible weight.
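The article does not spell out the exact loss, so the following is only a sketch under stated assumptions. It assumes a Donsker-Varadhan lower bound on mutual information, as used in MINE-style neural estimators, and an L1 penalty as the sparsity-inducing regularizer. The names `dv_lower_bound`, `regularized_loss`, `feature_weights`, and `reg_coef` are hypothetical, and the critic outputs are passed in as plain lists rather than produced by a network.

```python
import math

def dv_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginals[exp(T)].

    t_joint:    critic outputs on samples drawn from the joint distribution.
    t_marginal: critic outputs on samples from the product of the marginals
                (for example, targets shuffled against features).
    """
    mean_joint = sum(t_joint) / len(t_joint)
    mean_exp = sum(math.exp(t) for t in t_marginal) / len(t_marginal)
    return mean_joint - math.log(mean_exp)

def regularized_loss(t_joint, t_marginal, feature_weights, reg_coef):
    """Negative MI bound plus an L1 penalty on per-feature weights (assumed form)."""
    l1_penalty = sum(abs(w) for w in feature_weights)
    return -dv_lower_bound(t_joint, t_marginal) + reg_coef * l1_penalty

# Toy critic outputs: higher on joint samples than on shuffled samples.
loss = regularized_loss(
    t_joint=[1.0, 1.0],
    t_marginal=[0.0, 0.0],
    feature_weights=[0.5, -0.5, 0.0],
    reg_coef=0.1,
)
print(loss)
```

Minimizing such a loss trades off a tighter mutual information estimate against the total magnitude of the feature weights, which is what drives uninformative weights towards zero.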
A Two-Stage Process for Better Generalization
A key aspect of MINERVA is its two-stage implementation. This design separates the process of learning data representations from the actual feature selection. In the first stage, MINERVA explores the dependencies between all features and the target without any selection constraints, allowing the neural network to learn the underlying relationships stably. In the second stage, the learned knowledge is fine-tuned, and the sparsity-inducing regularizers are introduced to select the important features. This decoupling improves the model’s ability to generalize to new data and provides a more accurate understanding of feature importance.
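The two-stage idea can be illustrated on a deliberately simplified stand-in. The sketch below swaps the neural mutual information estimator for a linear least-squares model and uses an L1 proximal step (soft thresholding) as the sparsity-inducing regularizer. All names, step counts, and coefficients are assumptions made for illustration, not values from the paper.

```python
def soft_threshold(v, thr):
    """Proximal operator of the L1 norm: shrink v toward zero by thr."""
    if v > thr:
        return v - thr
    if v < -thr:
        return v + thr
    return 0.0

def gradient(X, y, w):
    """Gradient of the mean squared error (1 / 2n) * ||Xw - y||^2."""
    n = len(X)
    grad = [0.0] * len(w)
    for row, target in zip(X, y):
        resid = sum(wj * xj for wj, xj in zip(w, row)) - target
        for j, xj in enumerate(row):
            grad[j] += resid * xj / n
    return grad

def two_stage_fit(X, y, warmup_steps=20, select_steps=50, lr=0.5, reg_coef=0.1):
    w = [0.5] * len(X[0])
    # Stage 1: learn the dependencies with no selection constraint.
    for _ in range(warmup_steps):
        g = gradient(X, y, w)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    # Stage 2: fine-tune from the stage-1 weights with the sparsity penalty on.
    for _ in range(select_steps):
        g = gradient(X, y, w)
        w = [soft_threshold(wj - lr * gj, lr * reg_coef) for wj, gj in zip(w, g)]
    return w

# Three mutually orthogonal features; only feature 0 drives the target.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2 * row[0] for row in X]

weights = two_stage_fit(X, y)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
print(selected, weights)
```

In this toy setting the penalty is switched on only after the unconstrained fit has settled, so the second stage prunes the two irrelevant weights to exactly zero and keeps feature 0.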
Capturing Complex Dependencies
The researchers demonstrated MINERVA’s effectiveness through experiments on both synthetic and real-life fraud datasets. On synthetic data, they created scenarios where the target variable depended on subtle interactions, such as whether two independent discrete random variables were equal. This type of dependence is often overlooked by traditional methods that only look at pairwise relationships. MINERVA successfully captured these complex feature-target relationships by evaluating feature subsets as an ensemble, meaning it considers how features work together rather than just individually.
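The equality scenario can be checked exactly on a small case. The sketch below enumerates two independent uniform variables on four values (a choice made here for illustration, not the paper's setup) and compares the mutual information of a single feature with the target against that of the feature pair. It uses a plug-in estimate, not MINERVA's neural estimator.

```python
import math
from collections import Counter
from itertools import product

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

m = 4  # number of values each variable can take (illustrative choice)
pairs = list(product(range(m), repeat=2))  # every (x1, x2) equally likely
x1 = [a for a, _ in pairs]
y = [int(a == b) for a, b in pairs]        # target: are the two values equal?

pairwise_mi = mutual_information(x1, y)    # one feature alone vs. the target
joint_mi = mutual_information(pairs, y)    # both features together vs. the target

print(pairwise_mi, joint_mi)  # 0 bits alone, about 0.811 bits together
```

Each feature on its own carries no information about the target, while the pair determines it completely, which is why a filter that scores features one at a time would discard both.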
In Experiment A, 30 features were generated and only two (features 3 and 8) were expected to be relevant. MINERVA and FOCI were the only methods to achieve an exact selection; the other benchmark methods (KSG, Boruta, HSIC Lasso, RFE, and Random Forest) failed, selecting all 30 features. In Experiment B, involving continuous features and nonlinear functions, MINERVA again performed an exact selection of the 10 expected features, significantly outperforming all baselines. Furthermore, when predictive performance was evaluated with a gradient boosting model, MINERVA achieved the highest out-of-sample R² score, 84.69%, demonstrating its ability to select features that are crucial for accurate prediction.
Real-World Application: Fraud Detection
MINERVA was also tested on a challenging real-world fraud dataset from a financial company, comprising 3 million samples and 214 processed features. This dataset was highly imbalanced, with fraud cases making up only 0.1% of the observations. After addressing the data imbalance using the Synthetic Minority Over-sampling Technique (SMOTE), MINERVA showed strong performance. With a regularization coefficient of 10³, MINERVA selected 160 features and achieved the highest out-of-sample recall of 0.573, indicating its effectiveness in identifying fraudulent transactions. While other methods like KSG and HSIC Lasso also performed well, MINERVA consistently demonstrated robust performance across various metrics.
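To see why recall is the metric of interest at a 0.1% fraud rate, consider the small sketch below. The labels and the trivial "never flag fraud" predictor are made up for illustration and are not taken from the dataset described above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of true positives (fraud cases) that were actually flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# 1,000 transactions with a single fraud case, i.e. a 0.1% fraud rate.
y_true = [0] * 999 + [1]

# A useless model that never flags fraud.
y_pred = [0] * 1000

print(accuracy(y_true, y_pred))  # 0.999
print(recall(y_true, y_pred))    # 0.0
```

A model can look almost perfect on accuracy while catching no fraud at all, so out-of-sample recall is the more informative number for this kind of dataset.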
Conclusion
MINERVA represents a significant advancement in feature selection, particularly for datasets where target variables depend on complex, higher-order feature interactions. By combining neural estimation of mutual information with a carefully designed two-stage training process and sparsity-inducing regularizers, MINERVA can accurately identify the most informative features. Its proven efficacy on both synthetic and real-world fraud datasets highlights its potential to enhance predictive performance and reduce the challenges associated with high-dimensional data.


