TLDR: A new research paper introduces “Compute DQN,” a reinforcement learning agent that learns to reason about and control its own computation. By folding computation cost into the agent’s reward and giving it actions that vary its decision rate, Compute DQN achieves better performance on 75% of Atari games while making 3.4 times fewer decisions per second on average, demonstrating adaptive, game-specific compute efficiency.
In the rapidly evolving world of artificial intelligence, reinforcement learning (RL) agents have achieved remarkable feats, often surpassing human performance in complex tasks. However, a significant difference remains between how humans and AI agents learn and improve: efficiency. While humans naturally become more computationally efficient as they master a skill, AI agents typically maintain a fixed computational footprint, regardless of their proficiency.
The Problem: Inefficient AI
Imagine a human learning to play a video game. Initially, they might concentrate intensely, processing every detail. As they get better, they learn to anticipate, focus on key information, and make decisions with less mental effort. Current RL agents, by contrast, are typically designed with fixed computational processes for sensing, acting, learning, and planning. They don’t adapt their compute usage to the demands of the moment, which wastes energy and leaves no headroom for other processes, such as more advanced planning.
Traditionally, it’s been up to human designers to optimize an agent’s computational processes. For example, in the Arcade Learning Environment (ALE), agents might process only every fifth frame instead of every single one, speeding up execution without much performance loss. But what if agents could make these choices themselves, learning to manage their own computational resources?
A Novel Approach: Reasoning About Compute
A new research paper, “Toward Agents That Reason About Their Computation”, explores this very question. Authored by Adrian Orenstein, Jessica Chen, Gwyneth Anne Delos Santos, Bayley Sapara, and Michael Bowling, the work introduces a novel way to integrate computational awareness directly into the agent’s learning process. The core idea is to treat computation itself as a cost, making it an explicit part of the agent’s objective.
Consider a solar panel tracking station. An agent’s goal is to maximize the power exported to the grid. While actuating the panel to track the sun earns rewards, the agent’s own computation consumes energy. By framing the reward as the energy gathered minus the compute cost, the agent is incentivized to balance task performance with computational efficiency. Furthermore, the researchers provide agents with special “compute actions” that don’t directly affect the environment but control how often they process observations and make decisions.
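To make the framing concrete, here is a minimal sketch of a compute-aware reward in Python. It is not the paper’s implementation; the per-decision cost constant and the function names are illustrative assumptions.

```python
# Minimal sketch of a compute-aware reward, assuming a fixed (hypothetical)
# energy cost for each decision the agent computes. Not the paper's code.

COMPUTE_COST_PER_DECISION = 0.05  # hypothetical energy cost of one forward pass


def compute_aware_reward(energy_exported: float, decision_made: bool) -> float:
    """Task reward (energy sent to the grid) minus compute spent this step."""
    cost = COMPUTE_COST_PER_DECISION if decision_made else 0.0
    return energy_exported - cost


# The agent exports 0.3 units of energy this step. Making a fresh decision
# costs 0.05, so it only pays off if it actually improves tracking.
print(compute_aware_reward(0.3, decision_made=True))   # 0.25
print(compute_aware_reward(0.3, decision_made=False))  # 0.3
```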
How It Works: Compute DQN
The researchers extended the popular Deep Q-Network (DQN) architecture, calling their new agent “Compute DQN.” Instead of a fixed decision rate, Compute DQN is given a set of “options” that combine an action with a duration – essentially, how long to repeat that action before processing the next observation and making another decision. This gives the agent coarse control over its decision rate, allowing it to operate at various frequencies, from DQN’s standard 12 Hz down to 1.5 Hz.
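The option set might be built as in the sketch below. The article gives the range of decision rates (12 Hz down to 1.5 Hz) but not the exact durations, so the 5-frame base skip and the duration multipliers here are assumptions chosen to match that range.

```python
# Sketch of an action-duration option space for a Compute DQN-style agent.
# The duration multipliers are illustrative; they reproduce the 12 Hz to
# 1.5 Hz range mentioned in the article, not the paper's exact configuration.
from itertools import product

NUM_ACTIONS = 18          # ALE's full action set
BASE_FRAMES = 5           # 60 fps / 5 frames per decision = 12 Hz baseline
DURATIONS = [1, 2, 4, 8]  # repeat factors -> 12, 6, 3, and 1.5 Hz

# Each option pairs a game action with how many emulator frames to hold it.
OPTIONS = [(action, d * BASE_FRAMES)
           for action, d in product(range(NUM_ACTIONS), DURATIONS)]


def decision_rate_hz(frames_held: int, fps: int = 60) -> float:
    """Decisions per second if the agent keeps selecting this duration."""
    return fps / frames_held


print(len(OPTIONS))                       # 72 options instead of 18 actions
print(decision_rate_hz(8 * BASE_FRAMES))  # 1.5 Hz for the longest option
```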
Crucially, both the standard DQN and Compute DQN were trained with the exact same computational budget. The cost of computation for each game was dynamically set based on the baseline DQN’s performance, ensuring a meaningful trade-off for the agent to learn.
Key Findings: Smarter, Not Just Faster
The results were compelling. Compute DQN not only learned to adapt its compute usage but also achieved better performance on 75% of the Atari games in the ALE suite. On average, Compute DQN made 3.4 times fewer decisions per second than the standard DQN (3.6 Hz vs. 12 Hz). This suggests that giving agents control over their compute doesn’t necessarily hinder performance; in some cases, it can even improve it, possibly due to better credit assignment or exploration.
The learning process itself showed interesting dynamics. In games like Pong, agents only reduced their decision rate significantly after their policy was already strong. In contrast, in games like Breakout and Asterix, decision rates dropped sharply early in training: the agents conserved compute before a strong action sequence was found, then ramped their compute back up as performance improved.
Game-Specific Strategies
The agents didn’t just reduce compute uniformly; they learned game-specific strategies. In Pong, the agent conserved compute when the ball was far away, increasing its decision rate just before hitting the ball or repositioning. In Breakout, the agent’s decision rate increased as the ball sped up and the paddle shrank, requiring more precise control; it also learned to conserve compute when the ball was stuck behind blocks. Asterix agents concentrated compute during dense waves of collectibles and lowered it between waves, adapting to the varying pace of the game.
Adapting to Cost
Further experiments showed that agents are responsive to the explicit cost of computation. When the per-decision compute cost was increased, agents learned policies that chose longer options, significantly reducing their decision rate. Conversely, when compute became cheaper, agents operated closer to the higher 12 Hz rate. This demonstrates that agents can effectively balance performance and compute efficiency based on the designer’s specified cost.
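A toy calculation illustrates why a higher per-decision cost favors longer options. The reward and cost numbers below are made up for illustration; they are not taken from the paper.

```python
# Toy illustration (made-up numbers): as the per-decision cost rises, deciding
# at 12 Hz stops paying for itself and a 1.5 Hz policy nets more reward.

def net_return_per_second(task_reward_per_s: float, decision_rate_hz: float,
                          cost_per_decision: float) -> float:
    """Task reward earned per second minus compute spent on decisions."""
    return task_reward_per_s - decision_rate_hz * cost_per_decision


for cost in (0.01, 0.1):
    fast = net_return_per_second(1.00, 12.0, cost)  # decide on every observation
    slow = net_return_per_second(0.85, 1.5, cost)   # slower, slightly weaker play
    best = "12 Hz" if fast > slow else "1.5 Hz"
    print(f"cost={cost}: 12 Hz nets {fast:.2f}/s, "
          f"1.5 Hz nets {slow:.2f}/s -> prefer {best}")
```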
Looking Ahead
This research represents a significant step toward creating agents that can reason about and control their own computational processes. The implications are far-reaching: more energy-efficient AI, cheaper and faster RL experiments, agents that can optimally utilize available compute resources, and long-lived agents that can continually adapt to changing resource availability and task demands throughout their lifetime. Future work could explore giving agents even richer compute actions, such as adaptively selecting neural network sizes or sampling frequencies, further enhancing their autonomy and efficiency.


