TLDR: This research compares two Reinforcement Learning methods, Value Iteration (model-based) and Proximal Policy Optimisation (model-free), for optimizing call routing in call centers to minimize client waiting time and staff idle time. While Value Iteration is faster, Proximal Policy Optimisation, despite longer training, proves more effective and adaptable in a simulated stochastic environment, achieving better overall performance by learning directly from experience.
Call centers are a crucial part of many businesses, but managing them efficiently can be a complex challenge. The goal is often to keep customers happy by minimizing their waiting time while also ensuring that staff members are busy and not sitting idle. This delicate balance is what researchers Kwong Ho LI and Wathsala Karunarathne explored in their paper, “Optimising Call Centre Operations using Reinforcement Learning: Value Iteration versus Proximal Policy Optimisation.” Their work delves into how artificial intelligence, specifically Reinforcement Learning (RL), can be used to make smarter decisions about routing calls.
The core idea behind this research is to use RL to teach a system how to route incoming calls to available staff members in the most effective way. The authors compared two approaches: Value Iteration (VI), which is a “model-based” method, and Proximal Policy Optimisation (PPO), a “model-free” method. Model-based methods require a complete description of how the call center behaves, including the probability of every possible outcome and the reward for every decision. That description is hard to obtain in the real world, where conditions change constantly and unpredictably. Model-free methods, on the other hand, learn directly from experience, much like a human learns by trial and error, without needing a pre-defined model of the environment.
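To make the distinction concrete, here are the two update rules in their standard textbook forms. They are not quoted from the paper, and the symbols (P for transition probabilities, R for rewards, γ for the discount factor, ε for the clipping range, Â for the advantage estimate) are the usual conventions rather than the authors' notation.

```latex
% Value Iteration: every backup needs the model P and R explicitly.
V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma V_k(s') \bigr]

% PPO (clipped surrogate): only sampled states, actions and advantages appear.
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\bigl( r_t(\theta)\,\hat{A}_t,\;
    \operatorname{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat{A}_t \bigr) \right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

The first rule cannot be evaluated without P and R, which is why VI needs the full model. The second is an expectation over sampled experience, so it can be estimated from simulated or logged interactions alone.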
To test their ideas, the researchers set up a simulated call center environment. This simulation mimicked an 8-hour workday with two staff members (Staff 0 and Staff 1) and two types of customer inquiries (Type 0 and Type 1). Each staff member had their own queue, and calls were routed using a Skills-Based Routing (SBR) strategy, meaning the system considered both how long queues were and what type of inquiry was coming in. For example, Staff 0 was more efficient with Type 0 inquiries, and Staff 1 with Type 1. The system was designed as a Markov Decision Process (MDP), a mathematical framework that helps model decision-making in environments with elements of randomness.
The theoretical model used Value Iteration, assuming it had full knowledge of how the call center would behave. Its reward system was designed to heavily penalize routing calls to full queues or to a busy staff member when another was idle, while also aiming to minimize client waiting time. This approach is very fast if you have all the information upfront.
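As a rough illustration of what Value Iteration looks like on a problem of this shape, here is a minimal sketch on a toy two-queue routing MDP. Everything numeric in it is an assumption made for the example: the queue capacity, discount factor, service probabilities, penalty sizes and the waiting-time proxy are not taken from the paper, and the transition model is far simpler than the authors' one.

```python
import itertools

CAP = 3                                   # assumed queue capacity per staff member
GAMMA = 0.95                              # assumed discount factor
P_TYPE = (0.5, 0.5)                       # assumed share of each inquiry type
SERVE = (0.7, 0.7)                        # assumed chance a busy staff member
                                          # finishes one call between arrivals
MEAN_SERVICE = ((1.0, 2.0), (2.0, 1.0))   # assumed: staff s is faster on type s

# State = (queue length 0, queue length 1, type of the arriving call).
STATES = list(itertools.product(range(CAP + 1), range(CAP + 1), (0, 1)))
ACTIONS = (0, 1)                          # route the call to Staff 0 or Staff 1


def reward(state, a):
    q = (state[0], state[1])
    ctype = state[2]
    if q[a] == CAP:
        return -100.0                     # heavy penalty: queue already full
    r = -(q[a] + MEAN_SERVICE[a][ctype])  # crude proxy for the client's wait
    if q[a] > 0 and q[1 - a] == 0:
        r -= 10.0                         # busy staff chosen while the other is idle
    return r


def transitions(state, a):
    """Return [(probability, next_state), ...] under the assumed model."""
    q = [state[0], state[1]]
    if q[a] < CAP:
        q[a] += 1                         # the call joins the chosen queue
    out = []
    for done in itertools.product((0, 1), repeat=2):
        p, nxt = 1.0, []
        for s in (0, 1):
            if q[s] == 0:
                p *= 0.0 if done[s] else 1.0   # nothing to finish
            else:
                p *= SERVE[s] if done[s] else 1.0 - SERVE[s]
            nxt.append(q[s] - done[s])
        if p > 0.0:
            for ctype, p_type in enumerate(P_TYPE):
                out.append((p * p_type, (nxt[0], nxt[1], ctype)))
    return out


def q_value(V, state, a):
    return reward(state, a) + GAMMA * sum(p * V[s2] for p, s2 in transitions(state, a))


def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            best = max(q_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - V[s]))
            V[s] = best                   # in-place Bellman backup
        if delta < tol:
            break
    policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
    return V, policy


V, policy = value_iteration()
print(policy[(CAP, 0, 0)])                # routing choice when queue 0 is full
```

With only 32 states the sweep converges almost instantly, which matches the article's point that VI is fast once the model is known. The catch is that `transitions` and `reward` have to be written down in full before any learning can start.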
In contrast, the simulation model was built for the model-free PPO approach. This involved integrating a Discrete Event Simulation (DES) system with the OpenAI Gym framework, which is a popular toolkit for developing and comparing RL algorithms. This setup allowed for a more realistic representation of a call center, with client arrivals, service times, and even clients abandoning calls due to long waits, all happening stochastically (randomly). The PPO agent learned by interacting with this dynamic environment, making routing decisions and receiving feedback (rewards or penalties) based on its actions. The reward function for the simulation model penalized assigning calls to full queues, client abandonment, staff idle time, and client waiting time. This trial-and-error learning is more computationally intensive than VI, but the resulting policy is shaped by the same randomness it will face when deployed, rather than by a fixed model that may not match reality.
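The sketch below shows roughly how a small event-driven simulator can be wrapped in a Gym-style interface. It does not import Gym; it only mirrors the classic `reset()` / `step()` shape, returning `(observation, reward, done, info)`. The class name, the penalty weights, the arrival, service and patience parameters, and the `shortest_queue` baseline are all hypothetical choices for illustration, not the authors' implementation. As in the paper, the reward is computed only at arrival events.

```python
import random
from collections import deque

DAY = 480.0            # minutes in an 8-hour day (from the article)
CAP = 5                # assumed queue capacity per staff member
MEAN_GAP = 1.0         # assumed mean minutes between arrivals
MEAN_PATIENCE = 5.0    # assumed mean minutes before a waiting client abandons
# MEAN_SERVICE[s][t]: assumed mean service minutes, staff s on inquiry type t
MEAN_SERVICE = ((1.5, 3.0), (3.0, 1.5))
# Assumed penalty weights: full queue, abandonment, idle minute, waiting minute
W_FULL, W_ABANDON, W_IDLE, W_WAIT = 10.0, 5.0, 0.1, 0.5


class CallCentreEnv:
    """Toy event-driven simulator with Gym-like reset()/step() methods."""

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self.queues = [deque(), deque()]   # waiting clients per staff member
        self.free_at = [0.0, 0.0]          # time each staff member is next free
        self.now = self.rng.expovariate(1.0 / MEAN_GAP)   # first arrival
        self.ctype = self.rng.randrange(2)
        self.stats = {"served": 0, "abandoned": 0, "rejected": 0, "arrivals": 1}
        return self._obs()

    def _obs(self):
        # Load = clients waiting, plus one if the staff member is mid-call.
        load = [len(q) + (self.free_at[s] > self.now)
                for s, q in enumerate(self.queues)]
        return (load[0], load[1], self.ctype)

    def _advance(self, s, until):
        """Serve staff s's queue up to time `until`; return the penalty."""
        q, penalty = self.queues[s], 0.0
        while q and self.free_at[s] <= until:
            arrived, ctype, patience = q.popleft()
            start = max(self.free_at[s], arrived)
            if start - arrived > patience:          # client gave up first
                self.stats["abandoned"] += 1
                penalty += W_ABANDON
                continue
            penalty += W_IDLE * (start - self.free_at[s])
            penalty += W_WAIT * (start - arrived)
            mean = MEAN_SERVICE[s][ctype]
            self.free_at[s] = start + self.rng.expovariate(1.0 / mean)
            self.stats["served"] += 1
        # Clients still waiting whose patience has already run out leave now.
        gone = [c for c in q if c[0] + c[2] < until]
        for c in gone:
            q.remove(c)
        self.stats["abandoned"] += len(gone)
        return penalty + W_ABANDON * len(gone)

    def step(self, action):
        penalty = 0.0
        if len(self.queues[action]) >= CAP:
            self.stats["rejected"] += 1
            penalty += W_FULL
        else:
            patience = self.rng.expovariate(1.0 / MEAN_PATIENCE)
            self.queues[action].append((self.now, self.ctype, patience))
        self.now += self.rng.expovariate(1.0 / MEAN_GAP)   # next arrival
        done = self.now >= DAY
        until = min(self.now, DAY)
        penalty += self._advance(0, until) + self._advance(1, until)
        if not done:
            self.ctype = self.rng.randrange(2)
            self.stats["arrivals"] += 1
        return self._obs(), -penalty, done, dict(self.stats)


def shortest_queue(obs):
    """Hypothetical baseline: send the call to the less loaded staff member."""
    return 0 if obs[0] <= obs[1] else 1


def run_episode(policy, seed=0):
    env = CallCentreEnv()
    obs, total, done = env.reset(seed), 0.0, False
    while not done:
        obs, reward, done, info = env.step(policy(obs))
        total += reward
    return total, info


total, info = run_episode(shortest_queue, seed=0)
print(total, info)
```

A PPO agent would take the place of `shortest_queue`, choosing an action from each observation and updating its policy from the rewards it collects. Note that nothing here exposes transition probabilities: the agent only ever sees samples, which is what makes the setup model-free.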
Key Findings and Performance
The evaluation of the different routing policies was conducted over 1,000 simulation runs. The results clearly showed the strengths and weaknesses of each approach. The “Random” policy, as expected, performed the worst across all metrics, leading to the highest waiting times and client abandonment.
The Value Iteration (VI) policy, while better than the random approach, showed limitations. It improved overall rewards and reduced abandonment, and was particularly good at minimizing idle time for Staff 0. However, because it relies on a fixed model, it struggled to adapt, leading to inefficiencies and higher waiting times for Staff 1 in certain situations.
The Proximal Policy Optimisation (PPO) policy emerged as the top performer. Despite requiring significantly longer training time (about 40 minutes compared to VI’s mere seconds), PPO achieved the highest overall reward, served the most clients, and drastically reduced both client abandonment and average waiting time. It also demonstrated a better balance of idle time across both staff members, showcasing its adaptive capabilities in a dynamic environment. Within this simulated setting, the results point to PPO as a practical and effective strategy for optimizing call routing when arrivals, service times, and abandonment are unpredictable.
Looking Ahead
The study concludes that model-free RL, particularly PPO, offers greater adaptability and practical applicability for real-world call center operations where system dynamics are often unknown or highly variable. While model-based methods like VI are theoretically optimal with full knowledge, this assumption rarely holds true in practice. The researchers also noted some limitations, such as the reward function being calculated only at arrival events, which could introduce some approximation errors, and the simplified two-staff setup. However, they emphasize that the PPO framework is scalable and can be extended to handle more staff, inquiry types, and complex routing constraints in future work. This research highlights the significant potential of model-free reinforcement learning as a powerful tool for decision-making in dynamic service systems.


