TL;DR: A user study with sleep medicine experts found that AI assistance, especially when transparent (white-box) and used for quality control, significantly improves diagnostic accuracy and consistency in identifying sleep arousal events. Transparent AI takes more time, but it greatly enhances user trust and acceptance, suggesting that, with careful design, there is a promising path for integrating AI into clinical workflows.
Artificial intelligence (AI) increasingly matches or even surpasses human experts at interpreting biomedical signals, particularly in fields like sleep medicine. However, high predictive accuracy alone isn’t enough for successful integration into clinical practice: clinicians need to understand when and why to trust algorithmic recommendations. A recent study, “Assessing the Real-World Utility of Explainable AI for Arousal Diagnostics: An Application-Grounded User Study,” by Stefan Kraft, Andreas Theissler, Vera Wienhausen-Wilke, Gjergji Kasneci, and Hendrik Lensch, examines exactly this question: how do the type and timing of AI assistance affect diagnostic performance, efficiency, and user experience in sleep arousal scoring?
The research focused on a key diagnostic task in sleep medicine: scoring nocturnal arousal events in polysomnographic (PSG) data. This task is traditionally time-consuming and prone to high variability among human scorers. The study involved eight professional sleep medicine practitioners who scored arousal events under three conditions: manual scoring, black-box (BB) AI assistance (where the AI provides suggestions without explanations), and transparent white-box (WB) AI assistance (where explanations are provided). Additionally, the timing of assistance was varied: either from the start of scoring or as a post-hoc quality-control (QC) review.
The AI System and Its Explanations
The AI system at the heart of this study utilized a modified DeepSleep architecture, a state-of-the-art tool for arousal detection. The AI was optimized to minimize missed events, which is crucial in clinical contexts. For the transparent (white-box) condition, the system offered several types of explanations:
- Local explanations: These showed which specific data points and signals were most important for the AI’s decision on an individual arousal event.
- Global explanations: A bar chart summarized the overall importance of each data channel across the entire dataset, giving a broader understanding of the AI’s reasoning.
- Confidence scores: A visual channel displayed the AI’s confidence level for arousal onset at each time point.
- Decision threshold: The threshold used by the AI to classify an event as an arousal was explicitly shown.
- Most probable onset: A clear marker indicated the most likely start time of an arousal within a predicted interval.
These features were designed to enhance user understanding and trust, addressing the common challenge of AI’s lack of interpretability in clinical settings.
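To make these display elements concrete, here is a minimal Python sketch, under stated assumptions, of how a per-timestep confidence trace and an explicit decision threshold could be turned into candidate arousal intervals with a “most probable onset” marker. The function and parameter names (`propose_arousal_events`, `confidence`, `threshold`, `fs`) are illustrative, not taken from the paper’s code:

```python
import numpy as np

def propose_arousal_events(confidence, threshold=0.5, fs=1.0):
    """Turn a per-timestep confidence trace into candidate arousal events.

    Assumptions: `confidence` holds the model's arousal probability per
    sample, `threshold` is the decision threshold shown to users, and
    `fs` is the sampling rate of the trace in Hz.
    """
    above = confidence >= threshold
    # Find rising and falling edges of the thresholded trace.
    edges = np.diff(above.astype(int))
    starts = np.where(edges == 1)[0] + 1
    ends = np.where(edges == -1)[0] + 1
    if above[0]:
        starts = np.r_[0, starts]
    if above[-1]:
        ends = np.r_[ends, len(confidence)]

    events = []
    for s, e in zip(starts, ends):
        # "Most probable onset": the confidence peak within the interval.
        peak = s + int(np.argmax(confidence[s:e]))
        events.append({
            "start_s": s / fs,
            "end_s": e / fs,
            "most_probable_onset_s": peak / fs,
            "peak_confidence": float(confidence[peak]),
        })
    return events
```

Marking the confidence peak as the onset mirrors the idea of surfacing the single most likely start time within each predicted interval, alongside the threshold that gates the event itself.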
Evaluating Performance: Two Perspectives
To thoroughly assess the AI’s utility, the researchers used two different reference standards, or “ground truths.” The first was a “consensus ground truth,” derived from the collective manual scoring of the study participants themselves, aiming for a neutral benchmark. The second was the “CPS ground truth,” which was the clinical standard used to train the AI model. This dual approach allowed the team to see if AI assistance improved performance against a neutral expert consensus and also how well it helped clinicians align with a specific, established clinical standard.
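To see how one set of scored events can be checked against two references, consider a hedged sketch of event-level agreement: an event counts as a hit if it overlaps a reference event, and precision and recall are combined into an F1 score. The greedy matching rule and the `event_f1` name are illustrative assumptions; the study’s actual matching protocol may differ:

```python
def event_f1(predicted, reference, min_overlap_s=0.0):
    """Event-level F1 between two lists of (start_s, end_s) intervals.

    A predicted event is a true positive if it overlaps an as-yet
    unmatched reference event by more than `min_overlap_s` seconds.
    This greedy one-to-one matching is a simplification.
    """
    def overlap(a, b):
        return min(a[1], b[1]) - max(a[0], b[0])

    matched = set()
    tp = 0
    for p in predicted:
        for i, r in enumerate(reference):
            if i not in matched and overlap(p, r) > min_overlap_s:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

The same scorer can then be run twice per session, once against the consensus events and once against the CPS events, which is the spirit of the study’s dual-reference evaluation.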
Key Findings on Performance and Efficiency
The study yielded several important insights:
- AI vs. Unaided Humans: When evaluated against the consensus ground truth, the AI model performed at a similar level to human solo annotators. However, when measured against the CPS ground truth (the AI’s training standard), both the AI alone and human-AI teams significantly outperformed unaided human experts. This suggests that AI assistance is particularly beneficial when the goal is to align with a specific, established clinical standard.
- Transparency Matters: Transparent (white-box) AI assistance consistently outperformed opaque (black-box) assistance. White-box AI led to an average 18% improvement in performance; the benefit was most pronounced during the quality-control phase, where it reached a 30% improvement over black-box assistance.
- Timing is Crucial: For clinically relevant count-based measures (like the total number of arousals), the timing of AI assistance was a dominant factor. Quality-control (QC) workflows, where AI was used for review after initial manual scoring, substantially outperformed assistance provided from the start: QC regimes reduced count error rates roughly four-fold and significantly curbed the systematic under-counting bias. (A sketch of such a count-based measure follows after this list.)
- Time Demands: While transparent AI and quality-control workflows led to better performance, they also required more time. White-box phases took approximately twice as long as black-box phases, and QC sessions were roughly twice as long as start-time assistance sessions. However, the study noted that with increased familiarity, some participants achieved similar speeds with white-box as with black-box support.
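As a rough illustration of the count-based measures this finding refers to, the short sketch below computes a percentage count error and a signed bias over a batch of recordings, where a negative bias indicates systematic under-counting. The metric choices (mean absolute percentage error, mean signed bias) are assumptions for illustration, not the study’s exact definitions:

```python
import numpy as np

def count_error_and_bias(scored_counts, reference_counts):
    """Count-based agreement for arousal totals per recording.

    Assumes `scored_counts` and `reference_counts` are per-recording
    arousal totals with nonzero reference counts. Returns the mean
    absolute percentage error and the mean signed difference
    (negative values indicate under-counting).
    """
    scored = np.asarray(scored_counts, dtype=float)
    ref = np.asarray(reference_counts, dtype=float)
    mape = float(np.mean(np.abs(scored - ref) / ref)) * 100.0
    bias = float(np.mean(scored - ref))
    return mape, bias
```

Under a measure like this, the QC workflows’ four-fold error reduction and diminished under-counting would show up directly as a smaller percentage error and a bias closer to zero.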
User Experience and Acceptance
Participants overwhelmingly favored transparent AI assistance. They rated white-box AI significantly higher across all key dimensions: usefulness, confidence, trust, ease of validation, and enjoyment. On a 1-10 usefulness scale, transparent AI narrowed the gap to the ideal score of 10 by over 40% compared to black-box AI (for instance, if black-box assistance were rated 6, four points short of ideal, a 40% narrowing would put white-box above roughly 7.6; the actual ratings are reported in the paper). Most participants (seven out of eight) expressed a willingness to adopt the system with minor or no modifications, highlighting a high degree of acceptance.
Clinicians also gained valuable insights into the AI’s reasoning, identifying recurring patterns in how the AI recognized arousals (e.g., based on pulse rate, leg movements, breathing patterns, EEG changes). While most preferred starting with AI assistance for efficiency, the objective data showed that quality-control timing yielded more accurate results. This exposes a tension between user preference for speed and the objective benefits of a more thorough, quality-control approach.
Future Directions for AI in Sleep Diagnostics
The study concludes that human-AI collaboration in arousal detection significantly enhances team alignment with a reference standard, making scoring more accurate and consistent. Transparency is a key enabler, transforming AI from a mere suggestion provider into a source of actionable evidence. The most reliable results were achieved when transparent support was applied as a targeted quality-control step.
The findings suggest that for routine scoring, a faster, start-time assistance might suffice, while for auditing, educational purposes, or complex cases, the more time-consuming quality-control approach is beneficial. This points to the need for configurable workflows that balance accuracy and efficiency. With minor interface and model tweaks, such as adjusting decision thresholds, narrowing visual ranges for arousal onset, and emphasizing EEG features, the system could move from a promising prototype to a trusted co-scorer in routine sleep-laboratory workflows. For more details, you can read the full paper here.


