
Quantifying Human-like Behavior in Cyber Simulation Environments

TL;DR: PHASE (Passive Human Activity Simulation Evaluation) is a machine learning framework that analyzes network logs to distinguish human from non-human activity with over 90% accuracy. Operating passively, it helps evaluate the realism of synthetic user personas in cybersecurity simulations. By using SHAP analysis, PHASE identifies key behavioral patterns indicative of genuine human users, allowing for the refinement of synthetic user behaviors to be more human-like, as demonstrated in a case study with the MITRE Caldera Human Plugin.

In the evolving landscape of cybersecurity, simulation environments like cyber ranges, honeypots, and sandboxes are crucial for research, adversarial testing, and developing automated defenses. However, for these simulations to be truly effective and provide meaningful insights, they must accurately replicate the complex and dynamic behaviors of real human users. Without realistic human activity, models trained in these environments might not generalize well to real-world scenarios, leading to potentially brittle defenses.

The challenge lies in the fact that including real users in large-scale simulations is often impractical or unethical. Consequently, many environments rely on Synthetic User Personas (SUPs) to emulate human behavior. While SUPs have been integrated into significant research efforts, there has been a notable absence of a standardized, quantitative framework to evaluate how closely the activity generated by these synthetic users matches that of genuine human users. This lack of a robust fidelity measure makes it difficult for researchers to refine and improve SUP behaviors or compare different systems effectively.

PHASE: A New Approach to Evaluating Synthetic User Behavior

To address this critical gap, researchers from the University of Virginia – Steven Lamp, Jason D. Hiser, Anh Nguyen-Tuong, and Jack W. Davidson – have introduced PHASE (Passive Human Activity Simulation Evaluation). This innovative machine learning framework is designed to quantitatively assess the realism of simulated user activity. What makes PHASE particularly groundbreaking is its entirely passive operation: it analyzes standard network monitoring data, specifically Zeek connection logs, without requiring any user-side instrumentation or visible signs of surveillance. This passive design ensures that the evaluation process itself does not introduce any artificial traffic or disrupt the integrity of the simulation environment.

PHASE distinguishes human from non-human activity with over 90% accuracy. Beyond just classification, it employs SHAP (SHapley Additive exPlanations) analysis to uncover the temporal and behavioral signatures that are indicative of genuine human users. This interpretability is vital, as it helps identify specific non-human patterns in synthetic activity that undermine realism, providing clear guidance for improvement.

How PHASE Works

At its core, PHASE utilizes a deep neural network (DNN) model optimized to detect temporal and behavioral patterns in network activity. The framework consists of three main components: a hybrid DNN architecture that combines convolutional, recurrent (BiLSTM), and attention-based layers to capture both short and long-range temporal dependencies; a preprocessing pipeline that transforms raw Zeek connection logs into structured time series data, handling irregular temporal resolution and encoding categorical features; and a supervised training process using labeled real-world data to teach the model to differentiate between human and non-human activity.
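The article does not publish the exact layer sizes or hyperparameters of PHASE's hybrid network, but the described stack (convolutional layers for short-range patterns, a BiLSTM for long-range dependencies, and attention for pooling) can be sketched in PyTorch. All dimensions below are illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class HybridClassifier(nn.Module):
    """Conv1d -> BiLSTM -> attention pooling -> binary logit.
    Layer sizes are illustrative, not the paper's."""
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        # Local (short-range) patterns across per-minute feature vectors.
        self.conv = nn.Conv1d(n_features, hidden, kernel_size=3, padding=1)
        # Long-range temporal dependencies in both directions.
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        # Simple additive attention that pools the sequence to one vector.
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):            # x: (batch, time, features)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)          # (batch, time, 2 * hidden)
        w = torch.softmax(self.attn(h), dim=1)
        pooled = (w * h).sum(dim=1)  # attention-weighted summary of the day
        return self.head(pooled)     # logit: human vs. non-human

# One simulated day per device: 1440 one-minute intervals, e.g. 8 features each.
model = HybridClassifier(n_features=8)
logit = model(torch.randn(2, 1440, 8))
print(logit.shape)  # torch.Size([2, 1])
```

The attention pooling lets the classifier weight informative minutes (e.g. bursts of interactive traffic) more heavily than idle stretches when summarizing a full day of activity.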

The preprocessing steps are crucial. Network logs are often irregular and bursty, reflecting natural fluctuations in user behavior. PHASE addresses this by dividing device activity into daily sequences of one-minute intervals, aggregating logs within each minute. Categorical features, like protocol types or connection states, are converted into numerical forms. To prevent bias, all numerical attributes are normalized, and the system ensures consistency by saving and reusing encoders and scalers for new data.
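The binning step above can be sketched with pandas. The field names below are a small stand-in for real Zeek conn.log columns, and the aggregations are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for Zeek conn.log rows; real logs carry many more fields.
logs = pd.DataFrame({
    "ts": pd.to_datetime(["2025-01-06 09:00:12", "2025-01-06 09:00:40",
                          "2025-01-06 09:03:05"]),
    "conn_state": ["SF", "S0", "SF"],
    "orig_bytes": [1200, 0, 5300],
})

# Encode the categorical connection state numerically; the vocabulary would
# be saved so the same encoding can be reapplied to new data.
codes, conn_state_vocab = pd.factorize(logs["conn_state"])
logs["conn_state_code"] = codes

# Aggregate into one-minute intervals; empty minutes become zero activity.
per_minute = (logs.set_index("ts")
                  .resample("1min")
                  .agg({"conn_state_code": "count", "orig_bytes": "sum"})
                  .rename(columns={"conn_state_code": "connections",
                                   "orig_bytes": "total_bytes"})
                  .fillna(0))
print(per_minute)
```

Resampling to a fixed one-minute grid turns bursty, irregularly timed log rows into the regular time series the DNN expects, with quiet minutes represented explicitly as zeros rather than missing rows.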

Building the Foundation: Data Collection and Labeling

A significant challenge for developing PHASE was the absence of suitable, publicly available datasets with clearly labeled human and non-human activity. To overcome this, the researchers constructed three custom datasets, collectively known as the CSNET Datasets (Fall 2024, Summer 2024, and Spring 2025). These datasets were collected from a live academic institution environment using a dedicated data collection system that mirrored network traffic to a Zeek Network Sensor. Ethical considerations, including IRB approval and the anonymization of IP addresses, were strictly adhered to, ensuring privacy and generalizability.

Labeling the data at a fine-grained, device-level resolution was a novel undertaking. An expert panel, comprising faculty, network administrators, research technicians, and a graduate student, identified IP addresses associated with human-operated devices (like laptops and desktops used for internet access) and non-human devices (such as printers, virtual machines, server clusters, and IoT devices like thermostats) by correlating DNS records and IP assignment records with the physical layout of the academic environment.

Unveiling Human-like Patterns: Model Performance and Interpretability

The PHASE models consistently achieved over 90% accuracy and balanced accuracy across all three CSNET datasets, demonstrating their strong capability in distinguishing human from non-human network activity, even with class imbalances. The F1-scores and AUC values further validated the models’ robustness across different academic periods.
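These metrics are standard and can be computed with scikit-learn. The labels and scores below are toy values for an imbalanced split, not the CSNET results:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

# Illustrative predictions on an imbalanced sample (7 human, 3 non-human).
y_true  = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1]
y_pred  = [1, 1, 1, 0, 1, 1, 0, 0, 1, 1]
y_score = [0.9, 0.8, 0.7, 0.4, 0.95, 0.85, 0.2, 0.1, 0.6, 0.75]

# Balanced accuracy averages per-class recall, so a model cannot score
# well by simply predicting the majority class.
bal_acc = balanced_accuracy_score(y_true, y_pred)
f1      = f1_score(y_true, y_pred)
auc     = roc_auc_score(y_true, y_score)
print(round(bal_acc, 3), round(f1, 3), round(auc, 3))
```

Balanced accuracy is the key choice here: with far more human than non-human devices (or vice versa) on a real network, plain accuracy alone would overstate performance.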

To understand what the model considers realistic human activity, SHAP analysis was applied. This technique attributes the model’s output to individual input features, quantifying their contribution to predictions. The analysis revealed that feature importance varies throughout the day, aligning with real-world human routines. For instance, periods of heightened importance were observed during typical work hours, with declines after the academic workday. Features like `conn_state` (connection state), `missed_bytes` (unobserved bytes), and the `history` string (sequence of packet-level events) emerged as highly influential, indicating that specific patterns of packet exchange and connection termination are key signals of human-driven activity.
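The idea behind SHAP is Shapley attribution: each feature's contribution is its average marginal effect on the prediction over all feature orderings. Production SHAP libraries approximate this efficiently for deep models; as a minimal, exact sketch of the underlying computation on a hypothetical linear "human-likeness" score:

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attribution for a small feature set: 'absent' features
    are replaced by a baseline value, and each feature is credited with its
    weighted average marginal contribution across all subsets."""
    n = len(x)
    def f(subset):
        return predict([x[i] if i in subset else baseline[i] for i in range(n)])
    phi = []
    for i in range(n):
        total = 0.0
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for s in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (f(set(s) | {i}) - f(set(s)))
        phi.append(total)
    return phi

# Hypothetical linear score over three features (weights are ours).
score = lambda v: 0.5 * v[0] + 2.0 * v[1] - 1.0 * v[2]
phi = shapley_values(score, x=[4.0, 1.0, 2.0], baseline=[0.0, 0.0, 0.0])
print([round(v, 6) for v in phi])  # [2.0, 2.0, -2.0]
```

For a linear model with a zero baseline, each attribution reduces to weight times feature value, which makes this toy case easy to verify; the same additive decomposition is what lets PHASE say which log features pushed a device toward "human" at each minute of the day.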

Putting it to the Test: The MITRE Caldera Human Plugin Case Study

As a proof of concept, PHASE was used to evaluate the MITRE Caldera Human Plugin (MCHP), a widely used Synthetic User Persona. The default MCHP configuration was deployed on virtual machines within an isolated server cluster. The evaluation showed that the default MCHP consistently scored low on the PHASE scale, indicating behavior that was both inconsistent and significantly divergent from real human users. SHAP analysis revealed that the default MCHP exhibited minimal data exchange, a pattern typically associated with non-human activity.

Guided by these insights, the researchers developed an “MCHP Enhanced” configuration. This revised setup introduced a key behavioral adjustment: after completing a randomized number of tasks, the SUP would enter a one-hour idle period, mimicking natural human rhythms like breaks. This modification resulted in consistently higher PHASE scores, demonstrating improved alignment with human-like activity. This case study highlights PHASE’s utility in guiding the iterative refinement of SUP behavior, allowing synthetic personas to be systematically tuned to more closely mimic realistic human activity.
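The described pacing change (randomized task bursts separated by hour-long idle periods) can be sketched as a simple driver loop. The function names and burst range here are our own illustration, not the plugin's API:

```python
import random

def run_enhanced_persona(run_task, idle, total_tasks: int,
                         idle_seconds: int = 3600):
    """Sketch of the 'MCHP Enhanced' pacing: complete a randomized number
    of tasks, then idle for an hour to mimic human breaks."""
    done = 0
    while done < total_tasks:
        burst = random.randint(3, 10)              # randomized tasks per burst
        for _ in range(min(burst, total_tasks - done)):
            run_task()
            done += 1
        if done < total_tasks:
            idle(idle_seconds)                     # hour-long break

# Record the schedule instead of really sleeping, so the pacing is visible.
events = []
run_enhanced_persona(run_task=lambda: events.append("task"),
                     idle=lambda s: events.append(("idle", s)),
                     total_tasks=12)
print(events.count("task"))  # 12
```

Injecting the `idle` callback keeps the sketch testable; a real persona would sleep, whereas from PHASE's passive viewpoint only the resulting gaps in network traffic matter.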

Looking Ahead

Future work for PHASE includes evaluating its generalizability across global contexts and different time zones to assess its ability to recognize human behavioral patterns that vary across cultural, geographic, and temporal boundaries. Additionally, the framework could be extended to classify more granular types of non-human traffic, such as malware, IoT devices, and automated services.

In conclusion, PHASE provides a crucial quantitative method for evaluating the behavioral fidelity of synthetic user personas in cybersecurity simulations. By offering a robust measure of realism, it enhances the credibility of simulation-based cyber defense training and testing, ensuring these environments more faithfully replicate real-world user behavior.

Nikhil Patel
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him out at: [email protected]
