TLDR: Researchers developed a new window-based F1 metric (F1w) for evaluating stress detection in time series data from wearable devices. This metric accounts for the gradual nature of stress by allowing temporal tolerance, unlike traditional metrics that often misrepresent performance in real-world, imbalanced datasets. Empirical analysis showed F1w reveals significant model performance patterns that conventional metrics miss, especially in “in-the-wild” datasets, providing a more robust assessment for healthcare applications.
The way we evaluate stress detection in real-world scenarios, especially with wearable devices, often falls short. Traditional methods struggle because stress isn’t a single, sudden event; it’s a gradual process that unfolds over time. This research introduces a new way to measure how well models detect stress, called the window-based F1 metric (F1w), which is designed to be more accurate for these complex, real-world situations.
Current evaluation metrics, such as the standard F1 score or the point-adjusted F1 (F1pa), often give misleading results, particularly on datasets where stress events are rare and annotated only as single points in time. A point annotation marks one instant of an episode that actually unfolds over a longer period, so a metric that scores each time step against that single instant misses predictions that land slightly before or after it.
The core idea behind the F1w metric is to introduce “temporal tolerance.” This means it acknowledges that a prediction doesn’t have to be perfectly aligned with the exact moment a stress event is annotated to be considered correct. Instead, it looks for predictions within a certain time window around the actual event. This approach better reflects the diffused nature of physiological stress responses.
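As a rough illustration of temporal tolerance, the sketch below implements one plausible reading of a window-based F1 in Python. The function name `window_f1`, the symmetric half-width `w` (measured in samples), and the matching rule are assumptions made for illustration; the paper's exact definition of F1w may differ.

```python
import numpy as np

def window_f1(y_true, y_pred, w):
    """Hypothetical window-based F1 (an assumed reading, not the paper's code).

    y_true, y_pred: binary sequences of equal length (1 = stress).
    w: tolerance in samples on either side of an annotated event.
    """
    true_idx = np.flatnonzero(np.asarray(y_true, dtype=bool))
    pred_idx = np.flatnonzero(np.asarray(y_pred, dtype=bool))
    if true_idx.size == 0 or pred_idx.size == 0:
        return 0.0

    # Pairwise distances between annotated events and predicted positives.
    # This is quadratic in memory, which is fine for a small sketch.
    within = np.abs(true_idx[:, None] - pred_idx[None, :]) <= w

    # Recall: share of annotated events with a prediction inside the window.
    recall = within.any(axis=1).mean()
    # Precision: share of predictions that fall inside some event's window.
    precision = within.any(axis=0).mean()

    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))


# One annotated event at index 10, one prediction two samples later.
y_true = np.zeros(20, dtype=int)
y_true[10] = 1
y_pred = np.zeros(20, dtype=int)
y_pred[12] = 1

strict = window_f1(y_true, y_pred, w=0)    # exact alignment required
tolerant = window_f1(y_true, y_pred, w=2)  # two samples of tolerance
```

In this toy case the strict score is 0.0 and the tolerant score is 1.0, which is the kind of gap the article describes between conventional metrics and F1w on single-point annotations.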
To test their new metric, the researchers applied it to three different physiological datasets: ADARP, Wrist Angel, and ROAD. ADARP and Wrist Angel are “in-the-wild” datasets, meaning data was collected from people going about their daily lives, self-reporting stress. The ROAD dataset, on the other hand, was collected in a controlled experiment where stress was induced during driving.
For the ADARP and Wrist Angel datasets, which are highly imbalanced and carry sparse, single-point stress annotations, traditional metrics often reported near-zero performance for TimesFM, the model used for prediction. Taken at face value, this suggests the model has no predictive ability. When evaluated with F1w, however, and especially at larger window sizes, the model showed statistically significant improvements over random baselines. The model was making meaningful predictions that conventional metrics could not capture.
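The summary does not say which statistical test the researchers used. As one hedged possibility, a comparison against random baselines can be expressed as an empirical (Monte Carlo) p-value: score many random predictors with the same metric and count how often they match or beat the model. The function name `empirical_p_value` and the example scores are hypothetical.

```python
import numpy as np

def empirical_p_value(model_score, baseline_scores):
    """Share of random-baseline scores at least as high as the model's score.

    The +1 terms count the model itself as one draw, so the estimate
    is never exactly zero.
    """
    baseline_scores = np.asarray(baseline_scores, dtype=float)
    n_at_least = int(np.sum(baseline_scores >= model_score))
    return (1 + n_at_least) / (1 + baseline_scores.size)


# Made-up scores for illustration only; these are not results from the paper.
example_baselines = [0.1, 0.2, 0.3]
p_clear_win = empirical_p_value(0.5, example_baselines)
p_no_better = empirical_p_value(0.1, example_baselines)
```

In practice the baseline list would hold hundreds or thousands of scores, each computed on a freshly drawn random prediction sequence.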
The ROAD dataset presented a different challenge. Because it had long, continuous stress segments, even a random baseline model scored highly with many metrics. This highlights a crucial point: the choice of evaluation metric must be carefully considered based on the nature of the dataset and its annotations. Overly tolerant metrics can overestimate performance if the events are already dense or continuous.
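The effect of label density on a random baseline is easy to reproduce on synthetic data. The sketch below uses ordinary point-wise F1 and made-up label sequences (80% positive versus 1% positive); none of the numbers come from ROAD, ADARP, or Wrist Angel.

```python
import numpy as np

def pointwise_f1(y_true, y_pred):
    """Standard point-wise F1 for binary sequences."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


rng = np.random.default_rng(0)
n = 10_000
# A coin-flip predictor that knows nothing about the signal.
random_pred = rng.random(n) < 0.5

# Synthetic "dense" labels: one long continuous segment covering 80% of the series.
dense = np.zeros(n, dtype=bool)
dense[1_000:9_000] = True

# Synthetic "sparse" labels: isolated single-point events, 1% of the series.
sparse = np.zeros(n, dtype=bool)
sparse[::100] = True

dense_f1 = pointwise_f1(dense, random_pred)
sparse_f1 = pointwise_f1(sparse, random_pred)
print(f"random baseline, dense labels:  {dense_f1:.3f}")
print(f"random baseline, sparse labels: {sparse_f1:.3f}")
```

In expectation the coin-flip predictor scores roughly 0.6 on the dense labels and roughly 0.02 on the sparse ones, so a high score on densely labeled data says little unless it is compared against such a baseline.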
A key advantage of F1w is that it allows for direct, post-hoc assessment using the original annotations, without needing to alter the ground truth labels. This improves reproducibility across studies. Furthermore, the window size in F1w is directly interpretable as a time tolerance, which can be adapted based on domain knowledge. For example, a 10-second window might be appropriate for acute physiological changes, while a 20-minute window might be better for prolonged emotional states.
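Because the window is a time tolerance, turning a domain choice into a sample count is simple arithmetic. The helper below is a hypothetical illustration; the 4 Hz sampling rate is an assumed value, not one taken from the paper, and whether the tolerance is a full width or a half-width depends on how F1w is defined.

```python
def tolerance_to_samples(tolerance_seconds: float, sampling_rate_hz: float) -> int:
    """Convert a time tolerance in seconds to a whole number of samples."""
    if tolerance_seconds < 0 or sampling_rate_hz <= 0:
        raise ValueError("tolerance must be >= 0 and sampling rate must be > 0")
    return round(tolerance_seconds * sampling_rate_hz)


SAMPLING_RATE_HZ = 4.0  # assumed sensor rate, for illustration only

acute = tolerance_to_samples(10, SAMPLING_RATE_HZ)           # 10-second window
prolonged = tolerance_to_samples(20 * 60, SAMPLING_RATE_HZ)  # 20-minute window
print(acute, prolonged)  # prints: 40 4800
```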
The researchers used TimesFM, a decoder-only foundation model, for zero-shot forecasting of stress events. While this setup allowed them to investigate the F-metrics effectively, they acknowledge that in a real-world deployment, explicit stress labels might not be available as input, suggesting future work could focus on models relying solely on physiological signals.
In conclusion, this research provides a valuable new tool for evaluating event detection in time series, particularly for healthcare applications like stress monitoring. The F1w metric offers a more robust and practical approach, revealing model performance that traditional metrics miss, and providing guidance for developing more reliable interventions in real-world settings. You can find more details about this work in the full research paper.


