TLDR: This paper introduces Confounded Causal Imitation Learning (C2L), a new framework that addresses the problem of unmeasured confounders biasing policies in imitation learning. Unlike previous methods, C2L can handle confounders that affect actions across arbitrarily many timesteps. It uses a two-stage process: first, identifying a valid instrumental variable (IV) using a novel criterion, and then learning a debiased policy either with a simulator or purely offline. Experiments show C2L accurately identifies IVs and significantly outperforms existing methods in policy learning across various environments and data conditions.
Imitation learning, where autonomous agents learn by mimicking expert demonstrations, has shown great promise in various fields like robotics and autonomous driving. However, a significant challenge arises from “confounding effects” – hidden or unmeasured variables that simultaneously influence both the expert’s observed states and actions. If these confounders are ignored, the learned policies can be biased and perform poorly in real-world scenarios.
Traditional imitation learning methods often struggle with this issue, especially when confounders persist across multiple time steps. For instance, a driver’s fatigue or environmental distractions can affect both vehicle speed and steering over an extended period. Existing solutions, such as the Temporally Correlated Noise (TCN) model, typically assume that a confounder affects only two consecutive actions, an assumption that breaks down in many realistic situations.
Introducing Confounded Causal Imitation Learning (C2L)
To address these limitations, researchers have proposed a novel framework called Confounded Causal Imitation Learning (C2L). This model is designed to handle confounders that influence actions across an arbitrary number of timesteps, better reflecting real-world complexity. The core idea behind C2L is to leverage “instrumental variables” (IVs) to identify and eliminate the bias introduced by these unmeasured confounders.
The C2L framework operates in two main stages:
Stage I: Identifying the Valid Instrumental Variable
The first crucial step is to identify a valid instrumental variable from the available observational data. Loosely speaking, an instrumental variable is a variable that is correlated with the confounded input (here, the state), affects the expert’s action only through that input, and is independent of the unmeasured confounders. In C2L, the researchers developed an “Auxiliary-Based testing Criterion” (AB Criterion) that determines whether a candidate past state can serve as a valid IV. The criterion provides clear conditions for IV validity, even in complex, non-linear settings, by testing the independence between a suitably defined auxiliary residual variable and the candidate IV.
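To make this concrete, here is a minimal sketch of what such an independence-based test could look like. The regression used to form the residual, the HSIC permutation test, and all names (`rbf_gram`, `hsic`, `ab_criterion`) are illustrative assumptions; the paper defines its own auxiliary residual variable and testing procedure.

```python
# Hypothetical AB-Criterion-style check: accept a candidate past state z
# as an IV if an auxiliary residual of the action is independent of z.
# The linear first stage and the HSIC test are assumptions for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

def rbf_gram(x, sigma=1.0):
    """Pairwise RBF kernel matrix for an (n, d) sample."""
    x = x.reshape(len(x), -1)
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(x, y):
    """Biased HSIC estimate: close to zero when x and y are independent."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    K, L = H @ rbf_gram(x) @ H, H @ rbf_gram(y) @ H
    return np.trace(K @ L) / (n - 1) ** 2

def ab_criterion(z, s, a, n_perm=200, alpha=0.05, seed=0):
    """Permutation HSIC test between candidate IV z and the residual of
    regressing actions a on states s. Returns True if z passes."""
    rng = np.random.default_rng(seed)
    resid = a - LinearRegression().fit(s, a).predict(s)
    stat = hsic(z, resid)
    null = [hsic(z, resid[rng.permutation(len(resid))]) for _ in range(n_perm)]
    p_value = np.mean([v >= stat for v in null])
    return p_value > alpha    # fail to reject independence -> candidate passes
```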
Stage II: Learning the Optimal Policy
Once a valid instrumental variable is identified, the C2L framework offers two distinct approaches for learning a debiased policy. The first is the simulator-based approach (C2L): for environments where a simulator is available, this method first learns an initial policy, then uses the identified IV to generate “confounder-free” synthetic states within the simulator. Training the final policy on these clean synthetic states paired with the observed expert actions effectively removes the confounding bias.
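As a rough illustration only, the simulator-based stage might be organized as below. The gym-style `sim` interface, the ridge-regression policies, and the pairing of rollout states with expert actions are all assumptions; in particular, the paper’s use of the identified IV in generating synthetic states is glossed over here.

```python
# Simplified two-pass sketch of the simulator-based idea (not the paper's code):
# clone the expert, then retrain on confounder-free states from the simulator.
import numpy as np
from sklearn.linear_model import Ridge

def simulator_based_c2l(sim, trajectories):
    """trajectories: list of (states, actions) arrays from the confounded expert."""
    # Pass 1: initial policy via ordinary behavioral cloning (still biased).
    S = np.concatenate([s for s, _ in trajectories])
    A = np.concatenate([a for _, a in trajectories])
    pi0 = Ridge().fit(S, A)

    # Pass 2: roll pi0 through the simulator. States visited there carry
    # no hidden confounder, unlike the states in the expert logs.
    synth_S, synth_A = [], []
    for _, actions in trajectories:
        s = sim.reset()
        for a_expert in actions:
            synth_S.append(s)
            synth_A.append(a_expert)      # pair clean state with expert action
            s, _, done, _ = sim.step(pi0.predict(s[None])[0])
            if done:
                break

    # Final policy: trained on confounder-free states + observed actions.
    return Ridge().fit(np.array(synth_S), np.array(synth_A))
```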
The second is the offline approach (C2L*): when a simulator is not accessible, C2L* employs a game-theoretic, adversarial learning strategy. It reformulates policy learning as a minimax optimization problem, learning a robust, debiased policy purely from offline data: the policy minimizes its prediction error on the expert’s actions while a “discriminator,” conditioned on the instrumental variable, tries to expose any remaining error.
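A hedged sketch of such a minimax objective is shown below, written as a conditional-moment game in PyTorch. The network sizes, optimizers, stabilizing penalty, and alternating updates are illustrative choices, not the paper’s exact algorithm; inputs are assumed to be 2-D float tensors.

```python
# Illustrative adversarial moment-matching game: the policy pi drives the
# instrumented residual moment E[f(z) * (a - pi(s))] to zero while the
# discriminator f, which sees only the IV z, tries to blow it up.
import torch
import torch.nn as nn

def offline_c2l_star(z, s, a, steps=2000, lam=1.0):
    pi = nn.Sequential(nn.Linear(s.shape[1], 64), nn.ReLU(), nn.Linear(64, a.shape[1]))
    f = nn.Sequential(nn.Linear(z.shape[1], 64), nn.ReLU(), nn.Linear(64, a.shape[1]))
    opt_pi = torch.optim.Adam(pi.parameters(), lr=1e-3)
    opt_f = torch.optim.Adam(f.parameters(), lr=1e-3)
    for _ in range(steps):
        # Discriminator step: ascend the moment (with an L2 penalty on f
        # to keep the inner maximization well-behaved).
        resid = a - pi(s)
        moment = (f(z) * resid).sum(dim=1).mean()
        loss_f = -moment + lam * (f(z) ** 2).mean()
        opt_f.zero_grad(); loss_f.backward(); opt_f.step()

        # Policy step: descend the same moment against the updated f.
        resid = a - pi(s)
        moment = (f(z) * resid).sum(dim=1).mean()
        opt_pi.zero_grad(); moment.backward(); opt_pi.step()
    return pi
```

At the saddle point no function of the instrument can predict the policy’s residual, which is exactly the debiasing condition the IV is meant to enforce.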
Experimental Validation and Performance
The effectiveness of the C2L framework was rigorously tested across three diverse environments: LunarLander, HalfCheetah, and AntBulletEnv. The experiments evaluated both the accuracy of IV identification and the performance of the learned policies. The results consistently showed that C2L accurately identified valid instrumental variables, even with varying numbers of expert trajectories, different confounding durations, and various confounder distributions.
Furthermore, in terms of policy learning, both the simulator-based C2L and the offline C2L* approaches significantly outperformed existing baseline methods like Behavioral Cloning (BC), ResiduIL, and DoubIL. This superior performance was particularly noticeable when the amount of demonstration data was limited, highlighting the robustness of the C2L methods. The research paper, titled “Confounded Causal Imitation Learning with Instrumental Variables,” provides a detailed explanation of these findings and the underlying theory. You can read the full paper here: https://arxiv.org/pdf/2507.17309.
In conclusion, C2L represents a significant step forward in making imitation learning more robust and reliable for real-world applications by effectively tackling the pervasive problem of unmeasured confounding effects, especially those that persist over time.


