TL;DR: A new study uses survival analysis to evaluate LLM robustness in multi-turn dialogues, finding that abrupt semantic shifts (prompt-to-prompt drift) dramatically increase failure risk, while gradual, cumulative semantic drift is surprisingly protective, enabling longer, more stable conversations. Accelerated Failure Time (AFT) models proved superior for predicting these dynamic failures.
Large Language Models (LLMs) have transformed how we interact with AI, but understanding their reliability in ongoing conversations has been a significant challenge. Traditional methods for evaluating these models often focus on single interactions, failing to capture how their performance might degrade over longer, multi-turn dialogues, especially when faced with challenging or adversarial questions.
A recent research paper, “Time-to-Inconsistency: A Survival Analysis of Large Language Model Robustness to Adversarial Attacks” by Yubo Li, Ramayya Krishnan, and Rema Padman of Carnegie Mellon University, introduces a novel approach to this problem. Instead of looking at isolated instances of failure, their work treats conversational breakdown as a “time-to-event” process, much as survival analysis in medical studies tracks how long a patient survives before a specific event occurs. This allows for a dynamic understanding of LLM robustness over the course of a dialogue.
Understanding Conversational Failure
The researchers analyzed 36,951 conversation turns across nine leading LLMs, defining “failure” as the point when a model first produces an incorrect answer during a multi-turn exchange. They measured “time” in discrete conversation rounds, up to an observation horizon of eight turns. To understand what drives these failures, they engineered several key features from the dialogue text:
- Prompt-to-Prompt Drift (Dp2p): This measures the immediate semantic shift between one conversation turn and the next. A large jump here means the topic or intent changed abruptly.
- Context-to-Prompt Drift (Dc2p): This captures how much the current prompt deviates from the overall accumulated conversation context.
- Cumulative Drift (Dcum): This tracks the total semantic distance covered throughout the conversation.
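As a concrete sketch, all three drift measures can be computed from per-turn embedding vectors using cosine distance. The details below are assumptions for illustration rather than the paper's exact recipe; in particular, a running mean of the earlier prompt embeddings stands in for the "accumulated conversation context."

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus cosine similarity between two embedding vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def drift_features(turn_embeddings):
    """Compute drift features for turns 1..n-1 of a conversation.

    turn_embeddings: sequence of embedding vectors, one per prompt.
    Returns a list of dicts with Dp2p, Dc2p, and Dcum per turn.
    The running mean of prior embeddings approximates the accumulated
    context (an assumption; the paper may define context differently).
    """
    feats = []
    cum = 0.0
    for t in range(1, len(turn_embeddings)):
        prev, cur = turn_embeddings[t - 1], turn_embeddings[t]
        d_p2p = cosine_distance(prev, cur)            # immediate shift
        context = np.mean(turn_embeddings[:t], axis=0)
        d_c2p = cosine_distance(context, cur)         # deviation from context
        cum += d_p2p                                  # total distance covered
        feats.append({"Dp2p": d_p2p, "Dc2p": d_c2p, "Dcum": cum})
    return feats
```

A conversation that repeats itself yields Dp2p near zero, while a hard topic switch pushes Dp2p toward its maximum even if Dcum is still small.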
A New Approach to Modeling Robustness
The study employed a complementary set of survival models, including Cox proportional hazards, Accelerated Failure Time (AFT) models, and Random Survival Forests. A crucial finding was that the standard “proportional hazards” assumption (that the ratio of failure risks between any two conversations stays constant over time) was systematically violated for the key semantic drift features. In other words, an LLM's risk of failing isn't static; it changes as the conversation progresses, especially under adversarial pressure.
This violation highlighted the superiority of Accelerated Failure Time (AFT) models. These models are particularly well-suited to capture the time-varying nature of risk, leading to more accurate predictions and better calibration, especially in the later stages of a dialogue.
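To make the modeling concrete, here is a minimal Weibull AFT fit on synthetic data, written with only NumPy and SciPy rather than a survival library. The data-generating setup (a binary "high drift" covariate that accelerates failure, with censoring at the paper's eight-turn horizon) is illustrative and is not the authors' pipeline.

```python
import numpy as np
from scipy.optimize import minimize

def weibull_aft_nll(params, t, event, x):
    """Negative log-likelihood of a Weibull AFT model with one covariate.

    Model: log T = mu + beta * x + sigma * eps, with eps ~ Gumbel(min),
    so a negative beta means the covariate shortens time-to-failure.
    """
    mu, beta, log_sigma = params
    z = (np.log(t) - mu - beta * x) / np.exp(log_sigma)
    log_surv = -np.exp(z)                    # log S(t)
    log_haz = z - log_sigma - np.log(t)      # log h(t)
    return -np.sum(event * log_haz + log_surv)

def fit_weibull_aft(t, event, x):
    """Fit (mu, beta, log_sigma) by maximum likelihood."""
    res = minimize(weibull_aft_nll, x0=np.zeros(3), args=(t, event, x),
                   method="Nelder-Mead", options={"maxiter": 2000})
    return res.x

# Synthetic example: "high drift" (x = 1) accelerates failure (beta < 0).
rng = np.random.default_rng(0)
n = 500
x = rng.integers(0, 2, size=n).astype(float)
scale = np.exp(1.5 - 0.8 * x)                # true mu = 1.5, beta = -0.8
t_true = scale * rng.weibull(2.0, size=n)
horizon = 8.0                                # eight-turn observation window
event = (t_true <= horizon).astype(float)    # 1 = failure observed
t_obs = np.minimum(t_true, horizon)          # censored at the horizon
mu_hat, beta_hat, log_sigma_hat = fit_weibull_aft(t_obs, event, x)
```

The fitted beta should come out clearly negative, recovering the built-in acceleration effect; in an AFT reading, exp(beta) multiplies the expected time-to-failure.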
Surprising Insights into Semantic Drift
The research revealed extraordinary temporal dynamics in LLM robustness:
- Abrupt Shifts are Catastrophic: Prompt-to-prompt drift (Dp2p) emerged as the dominant driver of failure. Sudden, immediate shifts in topic or intent dramatically increased the hazard of conversational failure across all LLMs tested; for some models, these abrupt changes more than quadrupled the risk.
- Gradual Drift is Protective: Counterintuitively, higher cumulative drift (Dcum) over the course of a conversation was associated with a lower risk of failure. This suggests that when a conversation evolves gradually, the model can adapt to the changing topic or adversarial pressure, becoming more resilient over time. This challenges the common belief that any deviation from the initial topic is detrimental.
These findings indicate that the velocity of semantic change—how quickly the topic shifts—is more critical for conversational integrity than the total distance the conversation has drifted.
Practical Implications
The insights from this survival analysis framework have immediate practical applications. The strong link between abrupt Dp2p drift and failure provides a clear direction for developing real-time monitoring and early warning systems. By detecting these acute conversational shocks, AI systems could proactively intervene, perhaps by gracefully changing the topic, escalating to a human agent, or adjusting their response strategy before user trust is broken. The high discriminative accuracy (concordance index up to 0.874) and strong calibration (integrated Brier score below 0.18) of the AFT models mean such systems could identify at-risk conversations with high confidence.
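The early-warning idea can be sketched as a minimal turn-level monitor. Everything here is illustrative: the 0.5 cosine-distance threshold and the action names are hypothetical choices, not from the paper, and a production system would score turns with a fitted survival model rather than a raw cutoff.

```python
import numpy as np

class DriftMonitor:
    """Flags conversations whose prompt-to-prompt drift spikes abruptly.

    The default threshold (0.5 cosine distance) is a hypothetical cutoff
    for illustration; in practice it would be tuned on held-out dialogues.
    """

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.prev_embedding = None

    def observe(self, embedding):
        """Return 'intervene' when the new prompt jumps too far semantically."""
        embedding = np.asarray(embedding, dtype=float)
        action = "continue"
        if self.prev_embedding is not None:
            sim = np.dot(self.prev_embedding, embedding) / (
                np.linalg.norm(self.prev_embedding) * np.linalg.norm(embedding))
            if 1.0 - sim > self.threshold:   # abrupt Dp2p shock detected
                action = "intervene"
        self.prev_embedding = embedding
        return action
```

On an "intervene" signal, the surrounding application could escalate to a human agent or soften its response strategy, per the interventions described above.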
In conclusion, this work establishes a powerful new way to evaluate LLM robustness, moving beyond static benchmarks to understand the dynamic, time-dependent nature of conversational failure. It provides concrete insights for designing more resilient and reliable AI agents by focusing on the temporal dynamics of semantic drift.


