TLDR: A study utilized Multiple Linear Regression (MLR) and Random Forest (RF) algorithms with California Highway 78 traffic data to predict traffic flow. Analyzing data from 30-second to 15-minute intervals, the research found that MLR performed best with 10-minute intervals, while RF continued to improve up to 15-minute intervals. This work provides insights into optimal data granularity for AI-driven traffic management.
Traffic congestion is a persistent global challenge, leading to significant environmental and economic costs. For instance, a mere 10-mile-per-hour decrease in speed due to congestion can increase CO2 emissions by approximately 100 grams per mile. Furthermore, American drivers annually lose an average of 42 hours to traffic, equivalent to a full workweek, according to a 2023 report by INRIX.
Addressing this critical issue, a recent study titled “Prediction of Highway Traffic Flow Based on Artificial Intelligence Algorithms Using California Traffic Data” proposes a machine learning-based model designed to predict highway traffic flow. This research aims to contribute to more effective traffic management and future solutions for congestion.
Understanding the Data
The study utilized extensive traffic data from California Highway 78, specifically a 7.24-kilometer westbound stretch connecting “Melrose Dr” and “El-Camino Real” in the San Diego area. Data was collected over five months, from July to November 2022, with measurements recorded every 30 seconds around the clock. This raw dataset, provided by the California Department of Transportation (Caltrans), included details such as measurement date and time, detector identification numbers, and crucial metrics like the number of passing vehicles (traffic volume) and roadway occupancy for each lane.
Before analysis, the data underwent a three-step preprocessing procedure. First, raw data from the individual detectors was integrated and reorganized. Second, the original 30-second interval data was aggregated into coarser time resolutions: 1-minute, 2-minute, 5-minute, 10-minute, and 15-minute intervals. This allowed the researchers to examine how different time granularities affected prediction accuracy. Finally, to ensure consistent traffic patterns, only weekday data was retained, as weekend traffic follows different patterns.
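The second preprocessing step, aggregating 30-second counts into coarser intervals, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; the function name and sample data are assumptions.

```python
# Sketch: aggregate 30-second traffic volumes into coarser intervals.
# Illustrative only -- the function name and sample data are assumptions,
# not the study's actual preprocessing code.

def resample_volumes(counts_30s, interval_minutes):
    """Sum consecutive 30-second vehicle counts into interval_minutes buckets."""
    per_bucket = interval_minutes * 2  # two 30-second samples per minute
    return [
        sum(counts_30s[i:i + per_bucket])
        for i in range(0, len(counts_30s), per_bucket)
    ]

# Example: one hour of 30-second counts (120 samples of 5 vehicles each).
counts = [5] * 120
five_min_totals = resample_volumes(counts, 5)
print(five_min_totals)  # twelve 5-minute totals
```

Since traffic volume is a count, coarser intervals are simple sums; a rate-based metric like occupancy would instead be averaged over each bucket.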
Artificial Intelligence at Work
The researchers employed two prominent artificial intelligence algorithms for traffic flow prediction: Multiple Linear Regression (MLR) and Random Forest (RF).
- Multiple Linear Regression (MLR): This statistical technique uses multiple input variables to predict a single output. It assigns a weight to each input and sums the weighted inputs to produce the final prediction. It's a straightforward method for capturing linear relationships within data.
- Random Forest (RF): An ensemble machine learning algorithm, Random Forest combines the predictions of many decision trees. It's known for its stable performance across diverse datasets and its effectiveness with complex, non-linear relationships, while also having mechanisms to mitigate overfitting.
To evaluate the performance of these models, standard metrics were used: R-squared (R²), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE). R² indicates how well the model explains the data’s variance, with values closer to 1 suggesting a better fit. MAE measures the average absolute difference between predicted and actual values, providing an intuitive understanding of prediction accuracy in the original units. RMSE is similar but places more emphasis on larger errors, making it sensitive to significant prediction inaccuracies.
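The three evaluation metrics above are simple to compute directly. A minimal sketch, using made-up numbers purely for illustration:

```python
# Sketch: the three evaluation metrics used in the study.
# The sample values below are made up for illustration.
import math

def mae(actual, predicted):
    """Mean Absolute Error: average absolute difference, in original units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: like MAE, but penalizes large errors more."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def r_squared(actual, predicted):
    """R-squared: fraction of the data's variance explained by the model."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [10, 20, 30, 40]       # hypothetical observed volumes
predicted = [12, 18, 33, 39]    # hypothetical model predictions
print(mae(actual, predicted))        # 2.0
print(round(rmse(actual, predicted), 3))
print(round(r_squared(actual, predicted), 3))
```

Note how RMSE (about 2.12 here) exceeds MAE (2.0) because the single 3-unit miss is weighted more heavily once squared.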
Key Findings and Optimal Intervals
The study trained and tested both MLR and RF models using an 80/20 split of the preprocessed data into training and testing sets. A key aspect of the analysis was observing how model performance changed with varying data collection intervals.
For the Multiple Linear Regression (MLR) model, performance, particularly in terms of scaled MAE and RMSE, showed improvement up to a 10-minute data collection interval. Beyond this, at 15-minute intervals, a noticeable degradation in performance was observed. This suggests that 10 minutes is the optimal data collection interval for MLR in this context.
In contrast, the Random Forest (RF) model demonstrated continued performance improvement as the data collection interval increased, even up to 15 minutes. This indicates that RF might be more robust or better suited for capturing patterns over longer time aggregations in this specific traffic prediction scenario.
Conclusion and Future Directions
This research successfully demonstrated the application of machine learning algorithms, MLR and RF, for predicting highway traffic flow using real-world California traffic data. By analyzing data at various time intervals and employing robust performance metrics, the study identified optimal collection intervals for each algorithm, specifically 10 minutes for MLR and at least 15 minutes for RF.
The findings are expected to be valuable for developing more accurate traffic prediction models, ultimately aiding in the development of solutions for traffic congestion and enhancing efficient traffic management systems. Future research aims to expand this analysis by incorporating data from a greater number of detector IDs and exploring even longer collection time intervals to further optimize the RF model’s performance.