Predicting Transit Delays: How AI Learns from Unstructured Alerts

TLDR: This research explores using Reinforcement Learning from Verifiable Rewards (RLVR) with Large Language Models (LLMs) to predict public transit incident durations from text alerts. It introduces a novel tolerance-based, shaped reward function that grants partial credit for predictions within an error margin, overcoming the limitations of binary rewards for continuous forecasting. The study found that general-purpose, instruction-tuned LLMs outperformed specialized math-reasoning models, achieving a 35% relative improvement in 5-minute accuracy over baselines, particularly excelling in high-precision, early-stage predictions despite classical regressors minimizing overall MAE/MSE.

Public transit delays are a common headache in urban areas, causing frustration for commuters and logistical challenges for transit agencies. Predicting how long these disruptions will last, especially from early, unstructured text alerts, is a critical but incredibly difficult task. Imagine a subway signal problem or a bus detour – knowing how long it will take to resolve could significantly improve how agencies respond and how riders plan their journeys.

Traditionally, predicting incident durations has been challenging due to several factors. The text alerts themselves often contain specialized jargon not found in general language, making it hard for standard language models to understand. Furthermore, the ‘ground truth’ duration of an incident can be noisy and uncertain, as initial estimates often differ from the actual resolution time. This task also involves predicting a continuous value (duration in minutes), not a simple ‘yes’ or ‘no’ answer, which complicates many machine learning approaches.

One popular method for training large language models (LLMs) is Supervised Fine-Tuning (SFT), where models learn from examples of ideal input-output pairs. However, SFT struggles with the inherent noise and continuous nature of transit incident data. Another advanced technique, Reinforcement Learning from Verifiable Rewards (RLVR), has shown great success in tasks with clear, binary correct answers, like solving math problems or writing code. The big question this research paper aimed to answer was whether the powerful mathematical and logical reasoning capabilities of LLMs, when trained with RLVR, could be adapted to the messy, real-world problem of predicting transit incident durations.

A Novel Approach to Continuous Forecasting

This research introduces a framework that connects RLVR-based LLM training to forecasting problems in public transit. The key innovation lies in adapting RLVR for continuous, noisy targets. Instead of demanding a single, exact correct answer, the researchers developed a ‘tolerance-based, shaped reward function’. This function grants partial credit to predictions that fall within a continuous error margin, making the training process more stable and effective for real-world scenarios.
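To make the idea of partial credit concrete, here is a minimal Python sketch that contrasts an all-or-nothing reward with a tolerance-based one. The linear decay, the 15-minute default tolerance, and the function names are illustrative assumptions; the article does not give the exact formula the authors used.

```python
def binary_reward(pred_minutes: float, true_minutes: float) -> float:
    """All-or-nothing reward: credit only when the prediction matches exactly."""
    return 1.0 if pred_minutes == true_minutes else 0.0


def shaped_reward(pred_minutes: float, true_minutes: float,
                  tolerance: float = 15.0) -> float:
    """Hypothetical shaped reward (assumed form, not the paper's formula).

    Returns 1.0 for an exact prediction and decays linearly to 0.0 once the
    absolute error reaches `tolerance` minutes.
    """
    if tolerance <= 0:
        raise ValueError("tolerance must be positive")
    error = abs(pred_minutes - true_minutes)
    return max(0.0, 1.0 - error / tolerance)


if __name__ == "__main__":
    true_duration = 30.0
    for guess in (30.0, 33.0, 40.0, 60.0):
        print(guess,
              binary_reward(guess, true_duration),
              round(shaped_reward(guess, true_duration), 2))
```

With a binary reward, a guess that is 3 minutes off earns the same zero as one that is 30 minutes off, so the model gets no signal about which direction is better. The shaped version still ranks near misses above wild guesses, which is the property the article credits for more stable training.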

The team systematically evaluated their framework using a carefully curated dataset of New York City MTA service alerts. This dataset, which links GTFS-rt service alerts to actual incident durations, is a significant contribution in itself, providing a robust resource for future research.

Surprising Findings and Performance Gains

The study yielded several important and somewhat unexpected findings:

General-Purpose LLMs Outperform Math-Focused Models: Counter-intuitively, general-purpose, instruction-tuned LLMs significantly outperformed specialized math-reasoning models. The math-focused models struggled with the ambiguous and nuanced language often found in real-world transit alerts, highlighting that robust natural language understanding is more crucial than pure mathematical reasoning for this task.
Shaped Rewards are Critical: The research empirically demonstrated that a simple binary reward (either perfectly right or perfectly wrong) was unstable and degraded performance. In contrast, their shaped reward design, which offers partial credit, was essential for the model’s success, allowing it to excel even on the most challenging metrics.
RLVR Excels at Early, High-Precision Predictions: Classical regression models (like Support Vector Regressors) were superior at minimizing overall Mean Absolute Error (MAE) or Mean Squared Error (MSE), but the RLVR approach was strongest on tight-tolerance accuracy. The model achieved a 35% relative improvement in 5-minute accuracy (Acc@5), the share of predictions that land within 5 minutes of the actual duration, over the strongest baseline. For early-stage decisions where a small error margin matters, RLVR provides substantial gains.
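The difference between average-error metrics and tolerance-based accuracy is easier to see with numbers. The sketch below computes MAE, MSE, and Acc@k on a toy example; the function names and all values are made up for illustration and are not results from the paper.

```python
from typing import Sequence


def mae(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error, in minutes."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(targets)


def mse(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error, in minutes squared."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def acc_at_k(preds: Sequence[float], targets: Sequence[float],
             k: float = 5.0) -> float:
    """Fraction of predictions within k minutes of the true duration."""
    hits = sum(abs(p - t) <= k for p, t in zip(preds, targets))
    return hits / len(targets)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, e.g. 0.35 means 35%."""
    return (new - baseline) / baseline


# Toy data (hypothetical): true durations in minutes.
targets = [10.0, 20.0, 30.0, 120.0]
model_a = [18.0, 28.0, 38.0, 112.0]  # always 8 minutes off
model_b = [10.0, 20.0, 30.0, 80.0]   # exact three times, one large miss

if __name__ == "__main__":
    for name, preds in (("A", model_a), ("B", model_b)):
        print(name, mae(preds, targets), mse(preds, targets),
              acc_at_k(preds, targets))
```

In this toy case model A has the lower MAE and MSE (8.0 and 64.0 versus 10.0 and 400.0), yet model B lands within 5 minutes on three of four incidents while model A never does. That is the kind of split the article describes between classical regressors and the RLVR-trained model. As a hypothetical illustration of the headline figure, an Acc@5 that rose from 0.20 to 0.27 would be a 35% relative improvement.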

The study also explored the impact of prompt design, finding that while detailed prompts could provide a strong initial performance, simpler prompts (like P2, which asked the model to infer a category before predicting duration) ultimately led to higher accuracy after RLVR training. This suggests that too much initial guidance might hinder the model’s ability to learn from the reward signals during the reinforcement learning process.

Looking Ahead

This research marks a significant step forward in applying advanced LLM techniques to complex, real-world forecasting problems. It demonstrates that RLVR can be successfully adapted to noisy, continuous prediction tasks, provided the reward system is designed to reflect the continuous nature of the problem. However, the authors acknowledge limitations, such as the reliance on LLM-assisted ‘ground truth’ durations and the use of data exclusively from the NYC MTA system. Future work will focus on extending the framework to predict spatial impact (which routes and stations are affected) and testing its generalizability across different cities and transit systems. For more details, see the full paper.
