TLDR: This research paper introduces a novel model-based reinforcement learning framework, DA-Dreamer, designed to handle random observation delays in Partially Observable Markov Decision Processes (POMDPs). Unlike previous methods that assume full observability or fixed delays, DA-Dreamer uses a latent-space filtering process to sequentially update an agent’s belief state, effectively processing out-of-sequence observations. Experiments show that DA-Dreamer consistently outperforms existing baselines across MuJoCo and Meta-World environments, demonstrates robustness to stochasticity, and generalizes well to unseen delay distributions, making it well suited to real-world applications like robotics and autonomous driving where unpredictable delays are common.
Reinforcement Learning (RL) has achieved remarkable success in various domains, from game playing to robotics. However, a fundamental assumption in most standard RL algorithms is that the agent perceives the environment instantaneously, without any delays. In the real world, this is rarely the case. Delays are a pervasive and often unavoidable aspect of practical systems, particularly in areas like robotics, autonomous driving, and distributed control.
These delays can manifest in different forms, such as feedback delays (the time lag in receiving observations) and execution delays (the delay between an action being chosen and its actual execution). While these are common, they are frequently ignored or oversimplified in the RL literature. Current workarounds, like issuing “no-op” actions to wait for observations, are often impractical or even unsafe in critical situations, such as an autonomous vehicle needing to react immediately to an obstacle.
Even when delays are considered, existing approaches often make simplifying assumptions. They might assume a fully observable environment, as in Markov Decision Processes (MDPs), or fixed delays in Partially Observable Markov Decision Processes (POMDPs). However, real-world systems often combine partial observability with random delays. This combination introduces a unique challenge: observations may arrive out-of-sequence (OOS). Unlike MDPs, where the most recent observation is usually sufficient, POMDPs require the agent to integrate past observations to maintain a belief about the environment’s true state. With random delays, relying solely on the latest observation is insufficient for effective decision-making.
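To make the out-of-sequence problem concrete, here is a minimal Python sketch of how random per-observation delays can reorder arrivals. The function names (`sample_delays`, `arrival_order`, `is_out_of_sequence`) and the uniform delay distribution are illustrative assumptions, not taken from the paper.

```python
import random


def sample_delays(num_steps, max_delay, seed=0):
    """Draw one integer delay per emitted observation (uniform is an assumption)."""
    rng = random.Random(seed)
    return [rng.randint(0, max_delay) for _ in range(num_steps)]


def arrival_order(delays):
    """Return emission timesteps sorted by arrival time (ties keep emission order)."""
    return sorted(range(len(delays)), key=lambda t: (t + delays[t], t))


def is_out_of_sequence(order):
    """True if any observation arrives after one that was emitted later."""
    return any(later < earlier for earlier, later in zip(order, order[1:]))


# Observation 0 is delayed by 2 steps, observation 1 arrives immediately.
order = arrival_order([2, 0, 1])
print(order)                      # [1, 0, 2]
print(is_out_of_sequence(order))  # True
```

With a fixed delay, for example `arrival_order([3, 3, 3])`, the order stays `[0, 1, 2]`, which is why fixed-delay methods never have to deal with reordering.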
A new research paper, titled “Model-Based Reinforcement Learning under Random Observation Delays”, by Armin Karamzade, Kyungmin Kim, JB Lanier, Davide Corsi, and Roy Fox from the University of California, Irvine, tackles this complex problem. The authors propose a novel framework that specifically addresses random observation delays in POMDPs, a setting previously unaddressed in RL.
The core of their solution is a model-based filtering process that sequentially updates the agent’s belief state based on an incoming stream of observations, even when they arrive out-of-sequence. This approach leverages a “world model” trained within the delayed environment to form a coherent understanding of the current latent state, given only the observations that have actually arrived. This belief state then acts as a sufficient summary of information for the agent to learn and execute its policy, ensuring actions are informed solely by available inputs.
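The filtering idea can be illustrated with a small tabular analogue. The sketch below is not DA-Dreamer's learned latent-space filter: it assumes a discrete hidden state, known transition and emission tables, and no actions, and every name in it is hypothetical. At each step the belief is propagated through the dynamics, and a Bayes correction is applied only for timesteps whose observation has actually arrived.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def predict(belief, transition):
    """One-step prediction: b'(j) = sum_i b(i) * T[i][j]."""
    n = len(belief)
    return [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]


def correct(belief, emission, obs):
    """Bayes update using the likelihood of `obs` in each state."""
    return normalize([b * emission[s][obs] for s, b in enumerate(belief)])


def filter_belief(prior, transition, emission, arrived, current_step):
    """Belief over the state at `current_step`, using only arrived observations.

    `arrived` maps emission timestep -> observation. Steps whose observation
    is still in flight get a prediction-only update.
    """
    belief = list(prior)
    for t in range(current_step + 1):
        if t > 0:
            belief = predict(belief, transition)
        if t in arrived:
            belief = correct(belief, emission, arrived[t])
    return belief


# Hypothetical two-state model, used only for illustration.
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [[0.8, 0.2], [0.3, 0.7]]
prior = [0.5, 0.5]

before = filter_belief(prior, transition, emission, {0: 0}, current_step=2)
# The observation emitted at step 1 arrives late and is folded in at its own timestep.
after = filter_belief(prior, transition, emission, {0: 0, 1: 1}, current_step=2)
```

Recomputing the belief from the start whenever a late observation arrives is the simplest correct choice for a sketch; a practical implementation would more plausibly cache intermediate beliefs and replay only from the affected timestep.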
The researchers integrated this delay-aware framework into Dreamer, a prominent model-based RL algorithm. The training procedure involves training the world model on complete, ordered trajectories (after all pending observations have arrived), while the policy is trained on belief states inferred from the partially observed sequences that the agent experiences in real-time. This decoupling allows the system to learn robust dynamics while making decisions under uncertainty.
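The decoupling described above can be sketched as two views of the same episode. The helper names below are hypothetical, and the data layout is an assumption for illustration rather than the paper's implementation: the policy sees, at each decision step, only the observations that have arrived so far, while the world model receives the complete ordered trajectory once the last pending observation has landed.

```python
def available_at(delays, step):
    """Emission timesteps whose observation has arrived by decision time `step`."""
    return [s for s, d in enumerate(delays) if s <= step and s + d <= step]


def policy_inputs(observations, delays):
    """Per-step view for the policy: the observation, or None if still in flight."""
    views = []
    for step in range(len(observations)):
        arrived = set(available_at(delays, step))
        views.append(
            [observations[s] if s in arrived else None for s in range(step + 1)]
        )
    return views


def world_model_sequence(observations, delays):
    """Complete ordered trajectory and the time at which it becomes available."""
    ready_at = max(s + d for s, d in enumerate(delays))
    return ready_at, list(observations)


observations = ["o0", "o1", "o2"]
delays = [2, 0, 1]  # hypothetical per-observation delays, in steps

views = policy_inputs(observations, delays)
# views[1] == [None, "o1"]: at step 1 the policy has o1 but is still waiting for o0.
ready_at, full_sequence = world_model_sequence(observations, delays)
```

In this toy episode the full trajectory is only available at step 3, after the episode's last decision, which is why the policy cannot simply wait for it.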
Extensive experiments were conducted on both simulated robotic tasks (MuJoCo environments) and more realistic Meta-World environments with visual inputs. The results demonstrate that their method, referred to as DA-Dreamer, consistently outperforms existing delay-aware baselines designed for MDPs. Notably, DA-Dreamer was the only method capable of effectively handling more realistic, partially observable scenarios with longer delays.
Furthermore, the approach showed strong generalization capabilities. When trained on a wide range of delay distributions, DA-Dreamer performed significantly better under shorter test-time delays and experienced minimal performance degradation under longer ones. This robustness to delay distribution shifts during deployment is a crucial feature for real-world applications where delay patterns are often unpredictable or nonstationary. In Meta-World tasks, DA-Dreamer also significantly outperformed practical heuristics like simply waiting for observations or using only the latest available observation.
This work represents a significant step forward in making reinforcement learning more applicable to real-world systems where delays are a constant factor. By explicitly modeling and filtering out-of-sequence observations in partially observable environments, the proposed framework enables AI agents to make more informed and reliable decisions under conditions of uncertainty.


