From Testers to ML Practitioners: A Practical Guide to Machine Learning for Enterprise Software

TL;DR: This paper details the practical machine learning journey of a group of software testers, providing a non-technical, step-by-step guide to applying ML in enterprise software testing. It covers the entire ML workflow, from data gathering and cleaning to feature engineering, model selection, training, testing, and evaluation. The authors share lessons learned, common pitfalls, and successes, emphasizing the critical role of data, the importance of consulting subject matter experts, and understanding key evaluation metrics like precision and recall beyond simple accuracy. The paper serves as a valuable resource for beginners looking to apply ML techniques in any domain.

A recent paper titled “Machine Learning Experiences” by Michael Cohoon and Debbie Furman offers a unique perspective on the journey of applying machine learning (ML) in enterprise software testing. This insightful work is particularly valuable for those new to the field, as it chronicles the practical experiences of a group of software testers, not trained data scientists, as they navigate the complexities of ML. [https://arxiv.org/pdf/2507.22064]

The authors emphasize that their journey mirrors a standard ML workflow, akin to the CRISP-DM process, which can be universally applied to any ML project. This workflow involves several key stages: gathering data, cleaning it, performing feature engineering, splitting data for training and testing, choosing a suitable ML model, training the model, and finally, testing and evaluating its performance. The paper aims to demystify these steps, introducing common ML terminology in an easy-to-understand manner and highlighting both the challenges and successes encountered along the way.

The Crucial Role of Data

A central theme of the paper is the paramount importance of data in any machine learning endeavor. The authors stress that for ML to be effective, one must have a clear problem to solve and the appropriate data to address it. Their initial problem statements revolved around predicting errors in fixes for field defects and identifying duplicate or invalid defects during the development cycle. They learned early on that even with a well-defined problem, ML is impossible without sufficient, relevant data to train the algorithms.

The process of gathering data itself presented several hurdles, including discovering data locations, obtaining access, and exporting it to a common platform. For continuously growing datasets, questions arose about automating data extraction and refresh frequency. A key lesson was the benefit of taking point-in-time snapshots of data to test models and combat ‘data drift’ – changes in data characteristics over time. They also learned the hard way that simply throwing raw data at an ML model without prior review leads to poor results, underscoring the need for careful data examination.
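In practice, a point-in-time snapshot can be as simple as writing a dated copy of the exported data to disk. The sketch below assumes the data has already been exported to a CSV and loaded into a pandas DataFrame; the file names are illustrative, not taken from the paper:

```python
import pandas as pd
from datetime import date

# Assume the live defect data has been exported to a common platform as CSV.
df = pd.read_csv("defects_export.csv")  # illustrative file name

# Save a point-in-time snapshot so models can be retrained and compared
# against the same data later, even as the source data grows and drifts.
snapshot_path = f"defects_snapshot_{date.today().isoformat()}.csv"
df.to_csv(snapshot_path, index=False)
```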

Cleaning and Engineering Features

Raw data is rarely in a format directly consumable by ML algorithms, making data cleaning a critical and often time-consuming step. This involves identifying features (input values describing characteristics), converting data types (e.g., text to numbers), handling missing information, and removing irrelevant data. The authors strongly advocate for consulting Subject Matter Experts (SMEs) to understand the dataset’s nuances, identify unreliable, redundant, misleading, or biased fields, and discover opportunities to create new, more valuable features.
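The cleaning steps described above map naturally onto a few pandas operations. This is a minimal sketch with hypothetical column names chosen to resemble a defect-tracking dataset; which fields an SME would actually flag as unreliable depends entirely on the data at hand:

```python
import pandas as pd

df = pd.read_csv("defects_snapshot.csv")  # hypothetical snapshot file

# Drop fields an SME flagged as unreliable, redundant, or irrelevant.
df = df.drop(columns=["internal_notes", "legacy_id"])

# Convert a text field to a numeric type (non-numeric entries become NaN).
df["lines_changed"] = pd.to_numeric(df["lines_changed"], errors="coerce")

# Handle missing information: fill missing severities with a placeholder
# category rather than discarding the records outright.
df["severity"] = df["severity"].fillna("unknown")
```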

Feature engineering, the process of transforming or adding features, was highlighted as essential for normalizing input, cleaning existing data, and turning seemingly valueless fields into important predictors. An example given was transforming complex code-level identifiers into simpler ‘DEV’ or ‘GA’ categories to reduce noise and increase the number of examples per category. They also discussed handling highly correlated features, which can lead to redundancy or inaccurate model behavior, suggesting techniques like using heatmaps to visualize correlations.
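A hedged sketch of both ideas follows: collapsing a noisy identifier into coarser categories, then plotting a correlation heatmap to spot redundant features. The mapping rule here (identifiers containing "ga" become ‘GA’) is invented for illustration; the paper does not specify one:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("defects_snapshot.csv")  # hypothetical file

# Collapse a complex code-level identifier into 'DEV' vs 'GA'.
# The rule below is purely illustrative.
df["code_level"] = df["code_level_id"].apply(
    lambda s: "GA" if "ga" in str(s).lower() else "DEV"
)

# Visualize pairwise correlations among numeric features to spot
# redundant (highly correlated) columns.
sns.heatmap(df.select_dtypes("number").corr(), annot=True, cmap="coolwarm")
plt.show()
```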

The paper also delves into methods for addressing missing data, from simply removing records or features to imputing (filling in) values or creating new categories for missing data. Encoding, particularly OneHot Encoding, was presented as a vital technique for converting categorical text values into numerical formats that ML models can process, by creating new binary features for each unique category.
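Both techniques are available in scikit-learn. The sketch below imputes a missing numeric value with the column median and turns a categorical column, including its missing values, into binary one-hot features; the toy data is made up for illustration, and `sparse_output` assumes a recent scikit-learn version:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy data with a missing category and a missing number.
df = pd.DataFrame({"component": ["ui", "db", None, "ui"],
                   "age_days": [3, None, 7, 2]})

# Impute the missing numeric value with the column median.
df[["age_days"]] = SimpleImputer(strategy="median").fit_transform(df[["age_days"]])

# Treat missing categories as their own category, then one-hot encode:
# each unique value becomes a new binary (0/1) feature.
df["component"] = df["component"].fillna("missing")
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[["component"]])
print(encoder.get_feature_names_out())  # e.g. component_db, component_missing, component_ui
```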

Choosing and Training Models

With data prepared, the next step is model selection. The authors, focusing on classification problems (predicting a label, like ‘problem’ or ‘no problem’), initially chose a Decision Tree Classifier due to its high explainability and visual nature, which was beneficial for their learning process.
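Part of the appeal of a decision tree is that the fitted model can be rendered as a readable flowchart. A minimal sketch on synthetic data (the features stand in for defect attributes and are not the authors' data):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Toy binary classification data standing in for defect records.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# The fitted tree can be drawn as a flowchart of yes/no splits,
# which is what makes this model family easy to explain.
plot_tree(tree, filled=True, class_names=["no problem", "problem"])
plt.show()
```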

Training involves splitting the data, typically 80% for training and 20% for testing. They used scikit-learn’s `train_test_split` function for this. After training, they examined feature importance metrics to understand which data points were most impactful to the model’s decisions, aligning these with domain expertise or seeking new insights.
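Concretely, the split and the feature-importance inspection look like this in scikit-learn. The feature names are placeholders added for readability:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
feature_names = ["severity", "component", "lines_changed", "age_days"]  # placeholders

# 80/20 split: the model never sees the test rows during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# feature_importances_ shows how much each input drove the model's splits.
for name, importance in zip(feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```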

Testing and Evaluating Performance

Evaluating a model’s performance is crucial. The paper introduces four possible outcomes for binary classification: True Positive (correctly predicting positive), True Negative (correctly predicting negative), False Positive (incorrectly predicting positive), and False Negative (incorrectly predicting negative). These outcomes are visualized using Confusion Matrices and summarized in Classification Reports.
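In scikit-learn, both views come directly from the test-set predictions; a minimal sketch continuing the synthetic setup from above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```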

While initial high accuracy might seem promising, the authors learned it can be misleading, especially with imbalanced datasets (e.g., many ‘good fixes’ but few ‘bad fixes’). They then focused on precision (when the model predicts positive, how often is it correct?) and recall (how often does the model correctly identify positive cases?). Understanding the trade-off between precision and recall, based on business objectives, was a significant lesson. The F1-score, a combined measure of precision and recall, was also discussed as a more balanced metric.
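These metrics reduce to simple ratios over the four outcome counts. A quick worked example with made-up numbers shows how accuracy can look strong on an imbalanced dataset while precision and recall tell a different story:

```python
# Made-up counts: 90 actual 'good fixes', only 10 actual 'bad fixes'
# (the positive class), mirroring an imbalanced dataset.
tp, fp, fn, tn = 6, 3, 4, 87

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.93 -- looks great...
precision = tp / (tp + fp)   # 0.67 -- of predicted 'bad', how many were bad
recall = tp / (tp + fn)      # 0.60 -- of actual 'bad', how many were caught
f1 = 2 * precision * recall / (precision + recall)  # balances the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```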

Their exploration extended to other models like Random Forest, Naive Bayes, and Support Vector Machines, noting that while training was similar, output analysis varied. Finally, they experimented with ensembles, specifically a Voting Classifier, to combine the strengths of multiple models, which led to improved overall performance and consistency.
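A hedged sketch of such an ensemble, combining the three model families named above with scikit-learn's `VotingClassifier` (hard voting lets the majority prediction win; the synthetic data is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each model votes on the label; the majority class wins ('hard' voting).
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
        ("svm", SVC(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print(f"ensemble accuracy: {ensemble.score(X_test, y_test):.2f}")
```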

Future Directions

The paper concludes by outlining next steps for their ML journey, including tuning hyperparameters (model settings that can improve performance), implementing Cross Validation (a technique to ensure models generalize well to unseen data by training and testing on different subsets), and considering model deployment to integrate their ML solutions into larger systems and workflows.
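In scikit-learn, the first two of those steps often arrive together: `GridSearchCV` tunes hyperparameters while cross-validating every candidate setting. A minimal sketch with an illustrative parameter grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Try several tree depths; each candidate is scored with 5-fold
# cross-validation (train on four folds, test on the held-out fold).
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, 10, None]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.2f}")
```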

Karthik Mehta
https://blogs.edgentiq.com
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
