TLDR: This research addresses the challenge of detecting cryptocurrency pump-and-dump (P&D) schemes, which are rare events and therefore cause severe class imbalance in training data. The study applied the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset and evaluated five ensemble learning models. SMOTE improved every model’s ability to detect P&D events, most visibly in recall. XGBoost and LightGBM emerged as top performers, combining high recall and strong F1-scores with short training times, which makes them well suited to near real-time market surveillance.
Cryptocurrency markets, while offering innovative financial opportunities, also expose investors to significant risks, particularly from manipulative schemes like pump-and-dump (P&D). These schemes involve coordinated groups artificially inflating the price of a low-liquidity digital asset through concentrated buying and aggressive promotion, only to sell off their holdings at the peak, causing prices to crash and leaving late investors with substantial losses. Such practices not only harm individual investors but also undermine the integrity and trust in the broader digital asset ecosystem.
A recent study, “Improving Cryptocurrency Pump-and-Dump Detection through Ensemble-Based Models and Synthetic Oversampling Techniques” by Jieun Yu, Minjung Park, and Sangmi Chai, addresses the challenge of detecting these P&D events. The core difficulty is that P&D events are scarce compared to normal trading activity, which produces a severe class imbalance in datasets. This imbalance often biases traditional detection models towards the majority (normal) class, making them ineffective at identifying the rare but impactful manipulative events.
Addressing Data Imbalance with SMOTE
To counter this imbalance, the researchers employed the Synthetic Minority Oversampling Technique (SMOTE). SMOTE generates synthetic examples of the minority class (P&D events) by interpolating between an existing minority sample and one of its nearest minority-class neighbors, which yields a more balanced dataset for model training. The technique is widely used to improve the sensitivity of models to rare events in fraud detection applications.
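The interpolation idea behind SMOTE can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation: the function name smote_sketch, the parameter defaults, and the four 2-D feature vectors are hypothetical and chosen only for illustration. In practice a maintained implementation such as the SMOTE class in the imbalanced-learn library would normally be used; the article does not say which implementation the study relied on.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position on the segment base -> neighbor, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical 2-D feature vectors for four P&D samples (illustrative values).
pump_events = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.4)]
new_points = smote_sketch(pump_events, n_new=6, k=2)
```

Because every synthetic point lies on a line segment between two real minority samples, the new points stay inside the region already occupied by the minority class rather than being exact duplicates of existing events.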
Leveraging Ensemble Learning Models
The study also evaluated advanced ensemble learning models, which combine multiple individual models to achieve better predictive accuracy and robustness. Five tree-based ensemble algorithms were tested: Random Forest, AdaBoost, Gradient Boosting Machine (GBM), XGBoost, and LightGBM. These models are particularly well-suited for complex, non-linear patterns found in financial transaction data and can adapt to rapidly evolving manipulative behaviors.
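To make "combining multiple individual models" concrete, here is a small sketch of bagging, the idea underlying Random Forest: many weak models are each trained on a bootstrap resample and then vote. It is deliberately simplified (single-split decision stumps instead of full trees, no feature subsampling) and is not how any of the five libraries is implemented; boosting methods such as AdaBoost, GBM, XGBoost, and LightGBM instead train their trees sequentially. All function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            preds = [1 if r[f] > t else 0 for r in rows]
            errors = sum(p != y for p, y in zip(preds, labels))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]


def fit_bagged_stumps(rows, labels, n_models=25, seed=0):
    """Train each stump on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict(models, row):
    """Majority vote across all stumps (1 = P&D, 0 = normal)."""
    votes = Counter(1 if row[f] > t else 0 for f, t in models)
    return votes.most_common(1)[0][0]


# Hypothetical features: (volume spike ratio, short-term price change in %).
rows = [(1.0, 0.1), (1.2, 0.0), (0.9, 0.2), (1.1, 0.1),
        (6.0, 2.5), (7.5, 3.1), (5.5, 2.8), (8.0, 3.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
models = fit_bagged_stumps(rows, labels)
```

With these toy values, calling predict(models, (9.0, 4.0)) votes for the P&D class and predict(models, (0.8, 0.1)) votes for the normal class.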
Experimental Setup and Key Findings
The empirical analysis used an open dataset of 317 documented P&D events from the Binance exchange, alongside 481,840 records of normal trading activity, a ratio of roughly one P&D event to every 1,500 normal records. The data was aggregated into 25-second chunks, with features generated over a 7-hour sliding time window. Experiments were conducted both with the original imbalanced dataset and with a SMOTE-augmented dataset.
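The chunking and windowing described above can be sketched as follows. The article does not list the actual features or say how the window is aligned, so several choices here are assumptions: trades are represented as (unix_seconds, volume) pairs, the window is taken to be the chunks immediately preceding the current one, and mean chunk volume stands in for whatever features the study computed.

```python
from collections import defaultdict

CHUNK_SECONDS = 25
WINDOW_CHUNKS = 7 * 3600 // CHUNK_SECONDS  # 7 hours of 25-second chunks = 1008


def aggregate_chunks(trades):
    """Sum traded volume per 25-second chunk; trades are (unix_seconds, volume)."""
    volume = defaultdict(float)
    for ts, vol in trades:
        volume[ts // CHUNK_SECONDS] += vol
    return dict(volume)


def rolling_mean_volume(chunks, chunk_id, window=WINDOW_CHUNKS):
    """Mean chunk volume over the `window` chunks that precede chunk_id.

    Chunks with no trades count as zero volume.
    """
    total = sum(chunks.get(c, 0.0) for c in range(chunk_id - window, chunk_id))
    return total / window


# Hypothetical trades; a tiny window of 2 chunks keeps the example readable.
trades = [(0, 1.0), (10, 2.0), (30, 4.0), (60, 3.0)]
chunks = aggregate_chunks(trades)
baseline = rolling_mean_volume(chunks, chunk_id=2, window=2)
spike_ratio = chunks[2] / baseline  # current chunk volume relative to the baseline
```

Comparing the current chunk against a long trailing baseline is one plausible way to expose the sudden bursts of activity that characterize a pump.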
SMOTE had a clear effect on the results. Models trained on the SMOTE-balanced dataset showed a consistent and substantial improvement in recall across all five ensemble methods. Recall measures the proportion of actual P&D events that were correctly identified, which is crucial for early detection, where missing a manipulative event carries a high cost. For instance, the Random Forest model’s recall increased from 88.46% to 93.59% after applying SMOTE, a gain of about 5.1 percentage points.
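A short calculation shows why recall, rather than accuracy, is the metric to watch on data this imbalanced. The sketch below uses the dataset counts quoted in this article and scores a degenerate model that labels every record as normal. Applying the counts to the whole dataset is a simplification for illustration; the study's actual train/test split is not described here.

```python
N_PUMP = 317        # documented P&D events (positive class)
N_NORMAL = 481_840  # normal trading records (negative class)


def recall(tp, fn):
    """Share of actual positives that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, tn, fp, fn):
    """Share of all records that were labeled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# A degenerate model that labels every record "normal" never flags a P&D event.
always_normal_accuracy = accuracy(tp=0, tn=N_NORMAL, fp=0, fn=N_PUMP)
always_normal_recall = recall(tp=0, fn=N_PUMP)
print(f"accuracy={always_normal_accuracy:.4%} recall={always_normal_recall:.2%}")
```

The degenerate model scores above 99.9% accuracy while detecting none of the P&D events, which is exactly the failure mode that class imbalance encourages.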
Among the tested models, XGBoost and LightGBM stood out. They not only achieved high recall rates (94.87% and 93.59% respectively with SMOTE) and strong F1-scores (a balanced measure of precision and recall) but also exhibited remarkable computational efficiency. LightGBM completed training in just over 3 seconds, and XGBoost in approximately 14 seconds, making them highly suitable for near real-time surveillance applications in fast-paced cryptocurrency markets. Other models like Random Forest and GBM took significantly longer, up to 8 and 20 minutes respectively.
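For reference, the F1-score mentioned above is the harmonic mean of precision and recall. The sketch below uses abstract input values, not figures from the study, to show how the harmonic mean penalizes a model that buys recall by sacrificing precision.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only: perfect recall but only half of the alerts are real.
lopsided = f1_score(precision=0.5, recall=1.0)
# Same arithmetic mean (0.75), but balanced between the two metrics.
balanced = f1_score(precision=0.75, recall=0.75)
```

Both pairs average to 0.75, yet the lopsided pair gets the lower F1, so a strong F1-score indicates that a recall gain did not come with a flood of false alarms.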
Implications for Market Integrity
These findings highlight that integrating data balancing techniques like SMOTE with advanced ensemble methods significantly enhances the early detection of manipulative activities. This approach offers a robust and practically applicable methodology for decision support systems used by exchanges and regulatory authorities. By providing timely alerts, these models can help mitigate the impact of P&D schemes on retail investors and contribute to a fairer, more transparent, and stable cryptocurrency market.
While the study focused on a single exchange and specific P&D events, the framework provides a strong foundation for future research. Expanding datasets to include multiple exchanges, incorporating social media signals or blockchain network metrics, and exploring more advanced data augmentation techniques could further refine detection capabilities. Ultimately, this research represents a significant step towards developing operational tools that can protect market participants in the rapidly evolving digital asset ecosystem.


