From Testers to ML Practitioners: A Practical Guide to Machine Learning for Enterprise Software

TL;DR: This paper details the practical machine learning journey of a group of software testers, providing a non-technical, step-by-step guide to applying ML in enterprise software testing. It covers the entire ML workflow, from data gathering and cleaning to feature engineering, model selection, training, testing, and evaluation. The authors share lessons learned, common pitfalls, and successes, emphasizing the critical role of data, the importance of consulting subject matter experts, and understanding key evaluation metrics like precision and recall beyond simple accuracy. The paper serves as a valuable resource for beginners looking to apply ML techniques in any domain.

A recent paper titled “Machine Learning Experiences” by Michael Cohoon and Debbie Furman offers a unique perspective on the journey of applying machine learning (ML) in enterprise software testing. This insightful work is particularly valuable for those new to the field, as it chronicles the practical experiences of a group of software testers, not trained data scientists, as they navigate the complexities of ML. [https://arxiv.org/pdf/2507.22064]

The authors emphasize that their journey mirrors a standard ML workflow, akin to the CRISP-DM process, which can be universally applied to any ML project. This workflow involves several key stages: gathering data, cleaning it, performing feature engineering, splitting data for training and testing, choosing a suitable ML model, training the model, and finally, testing and evaluating its performance. The paper aims to demystify these steps, introducing common ML terminology in an easy-to-understand manner and highlighting both the challenges and successes encountered along the way.

The Crucial Role of Data

A central theme of the paper is the paramount importance of data in any machine learning endeavor. The authors stress that for ML to be effective, one must have a clear problem to solve and the appropriate data to address it. Their initial problem statements revolved around predicting errors in fixes for field defects and identifying duplicate or invalid defects during the development cycle. They learned early on that even with a well-defined problem, ML is impossible without sufficient, relevant data to train the algorithms.

The process of gathering data itself presented several hurdles, including discovering data locations, obtaining access, and exporting it to a common platform. For continuously growing datasets, questions arose about automating data extraction and refresh frequency. A key lesson was the benefit of taking point-in-time snapshots of data to test models and combat ‘data drift’ – changes in data characteristics over time. They also learned the hard way that simply throwing raw data at an ML model without prior review leads to poor results, underscoring the need for careful data examination.
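In practice, a point-in-time snapshot can be as simple as writing a dated copy of the exported data to disk. The sketch below assumes the data has already been exported to a CSV and loaded into a pandas DataFrame; the file names are illustrative, not taken from the paper:

```python
import pandas as pd
from datetime import date

# Assume the live defect data has been exported to a common platform as CSV.
df = pd.read_csv("defects_export.csv")  # illustrative file name

# Save a point-in-time snapshot so models can be retrained and compared
# against the same data later, even as the source data grows and drifts.
snapshot_path = f"defects_snapshot_{date.today().isoformat()}.csv"
df.to_csv(snapshot_path, index=False)
```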

Cleaning and Engineering Features

Raw data is rarely in a format directly consumable by ML algorithms, making data cleaning a critical and often time-consuming step. This involves identifying features (input values describing characteristics), converting data types (e.g., text to numbers), handling missing information, and removing irrelevant data. The authors strongly advocate for consulting Subject Matter Experts (SMEs) to understand the dataset’s nuances, identify unreliable, redundant, misleading, or biased fields, and discover opportunities to create new, more valuable features.
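The cleaning steps described above map naturally onto a few pandas operations. This is a minimal sketch with hypothetical column names chosen to resemble a defect-tracking dataset; which fields an SME would actually flag as unreliable depends entirely on the data at hand:

```python
import pandas as pd

df = pd.read_csv("defects_snapshot.csv")  # hypothetical snapshot file

# Drop fields an SME flagged as unreliable, redundant, or irrelevant.
df = df.drop(columns=["internal_notes", "legacy_id"])

# Convert a text field to a numeric type (non-numeric entries become NaN).
df["lines_changed"] = pd.to_numeric(df["lines_changed"], errors="coerce")

# Handle missing information: fill missing severities with a placeholder
# category rather than discarding the records outright.
df["severity"] = df["severity"].fillna("unknown")
```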

Feature engineering, the process of transforming or adding features, was highlighted as essential for normalizing input, cleaning existing data, and turning seemingly valueless fields into important predictors. An example given was transforming complex code-level identifiers into simpler ‘DEV’ or ‘GA’ categories to reduce noise and increase the number of examples per category. They also discussed handling highly correlated features, which can lead to redundancy or inaccurate model behavior, suggesting techniques like using heatmaps to visualize correlations.
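A hedged sketch of both ideas follows: collapsing a noisy identifier into coarser categories, then plotting a correlation heatmap to spot redundant features. The mapping rule here (identifiers containing "ga" become ‘GA’) is invented for illustration; the paper does not specify one:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("defects_snapshot.csv")  # hypothetical file

# Collapse a complex code-level identifier into 'DEV' vs 'GA'.
# The rule below is purely illustrative.
df["code_level"] = df["code_level_id"].apply(
    lambda s: "GA" if "ga" in str(s).lower() else "DEV"
)

# Visualize pairwise correlations among numeric features to spot
# redundant (highly correlated) columns.
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()
```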

The paper also delves into methods for addressing missing data, from simply removing records or features to imputing (filling in) values or creating new categories for missing data. Encoding, particularly OneHot Encoding, was presented as a vital technique for converting categorical text values into numerical formats that ML models can process, by creating new binary features for each unique category.
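Both techniques are available in scikit-learn. The sketch below imputes a missing numeric value with the column median and turns a categorical column, including its missing values, into binary one-hot features; the toy data is made up for illustration, and `sparse_output` assumes a recent scikit-learn version:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data with a missing category and a missing number.
df = pd.DataFrame({"component": ["ui", "db", None, "ui"],
                   "age_days": [3, None, 7, 2]})

# Impute the missing numeric value with the column median.
df[["age_days"]] = SimpleImputer(strategy="median").fit_transform(df[["age_days"]])

# Treat missing categories as their own category, then one-hot encode:
# each unique value becomes a new binary (0/1) feature.
df["component"] = df["component"].fillna("missing")
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[["component"]])
print(encoder.get_feature_names_out())  # e.g. component_db, component_missing, component_ui
```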

Choosing and Training Models

With data prepared, the next step is model selection. The authors, focusing on classification problems (predicting a label, like ‘problem’ or ‘no problem’), initially chose a Decision Tree Classifier due to its high explainability and visual nature, which was beneficial for their learning process.
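Part of the appeal of a decision tree is that the fitted model can be rendered as a readable flowchart. A minimal sketch on synthetic data (the features stand in for defect attributes and are not the authors' data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Toy binary classification data standing in for defect records.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# The fitted tree can be drawn as a flowchart of yes/no splits,
# which is what makes this model family easy to explain.
plot_tree(tree, filled=True, class_names=["no problem", "problem"])
plt.show()
```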

Training involves splitting the data, typically 80% for training and 20% for testing. They used scikit-learn’s `train_test_split` function for this. After training, they examined feature importance metrics to understand which data points were most impactful to the model’s decisions, aligning these with domain expertise or seeking new insights.
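Concretely, the split and the feature-importance inspection look like this in scikit-learn. The feature names are placeholders added for readability:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = ["severity", "component", "lines_changed", "age_days"]  # placeholders

# 80/20 split: the model never sees the test rows during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# feature_importances_ shows how much each input drove the model's splits.
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```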

Testing and Evaluating Performance

Evaluating a model’s performance is crucial. The paper introduces four possible outcomes for binary classification: True Positive (correctly predicting positive), True Negative (correctly predicting negative), False Positive (incorrectly predicting positive), and False Negative (incorrectly predicting negative). These outcomes are visualized using Confusion Matrices and summarized in Classification Reports.
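In scikit-learn, both views come directly from the test-set predictions; a minimal sketch continuing the synthetic setup from above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```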

While initial high accuracy might seem promising, the authors learned it can be misleading, especially with imbalanced datasets (e.g., many ‘good fixes’ but few ‘bad fixes’). They then focused on precision (when the model predicts positive, how often is it correct?) and recall (how often does the model correctly identify positive cases?). Understanding the trade-off between precision and recall, based on business objectives, was a significant lesson. The F1-score, a combined measure of precision and recall, was also discussed as a more balanced metric.
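These metrics reduce to simple ratios over the four outcome counts. A quick worked example with made-up numbers shows how accuracy can look strong on an imbalanced dataset while precision and recall tell a different story:

```python
# Made-up counts: 90 actual 'good fixes', only 10 actual 'bad fixes'
# (the positive class), mirroring an imbalanced dataset.
tp, fp, fn, tn = 6, 3, 4, 87

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.93 -- looks great...
precision = tp / (tp + fp)   # 0.67 -- of predicted 'bad', how many were bad
recall = tp / (tp + fn)      # 0.60 -- of actual 'bad', how many were caught
f1 = 2 * precision * recall / (precision + recall)  # balances the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```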

Their exploration extended to other models like Random Forest, Naive Bayes, and Support Vector Machines, noting that while training was similar, output analysis varied. Finally, they experimented with ensembles, specifically a Voting Classifier, to combine the strengths of multiple models, which led to improved overall performance and consistency.
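A hedged sketch of such an ensemble, combining the three model families named above with scikit-learn's `VotingClassifier` (hard voting lets the majority prediction win; the synthetic data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each model votes on the label; the majority class wins ('hard' voting).
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
        ("svm", SVC(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print(f"ensemble accuracy: {ensemble.score(X_test, y_test):.2f}")
```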

Future Directions

The paper concludes by outlining next steps for their ML journey, including tuning hyperparameters (model settings that can improve performance), implementing Cross Validation (a technique to ensure models generalize well to unseen data by training and testing on different subsets), and considering model deployment to integrate their ML solutions into larger systems and workflows.
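In scikit-learn, the first two of those steps often arrive together: `GridSearchCV` tunes hyperparameters while cross-validating every candidate setting. A minimal sketch with an illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Try several tree depths; each candidate is scored with 5-fold
# cross-validation (train on four folds, test on the held-out fold).
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.2f}")
```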

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
