TLDR: This research introduces Horizon-DQN (H-DQN) and adapts Quantile Regression DQN (QR-DQN) to tackle delayed, sparse rewards in reinforcement learning, using the game 2048 as a testbed. H-DQN, a novel architecture that combines several advanced RL techniques, significantly outperforms standard DQN and PPO, reaching the 2048 and 4096 tiles and demonstrating the effectiveness of distributional and multi-step learning in long-horizon tasks.
Reinforcement Learning (RL) has achieved remarkable success in games with immediate, clear feedback, such as Atari, Go, and chess. However, many real-world scenarios, like clinical decision-making or autonomous driving, present a significant hurdle: rewards are often sparse, delayed, or even misleading. This makes it difficult for RL agents to determine which early actions are responsible for benefits that appear much later.
The 2048 Game: A Perfect Testbed for Delayed Rewards
The popular sliding-tile game 2048 serves as an excellent, compact environment for studying this “long-horizon credit assignment problem.” While each merge yields a small immediate score, building high-value tiles like 1024 or 2048 demands foresight and strategic planning. Greedy, short-term actions often lead to fragmented boards and suboptimal outcomes, highlighting the tension between immediate gains and long-term strategy.
Previous attempts to conquer 2048 with AI often relied on handcrafted features or specific game knowledge. This research, however, explores whether general-purpose deep RL architectures can learn effective strategies from scratch, without such manual encoding.
Introducing Advanced RL Agents: QR-DQN and Horizon-DQN
The study focuses on two advanced RL algorithms: Quantile Regression DQN (QR-DQN) and a novel architecture called Horizon-DQN (H-DQN).
QR-DQN is a state-of-the-art distributional RL algorithm. Instead of just predicting the average future reward, it models the entire distribution of possible future rewards. This allows it to better understand the uncertainty and potential range of outcomes, especially in environments with high variability. For 2048, the researchers adapted QR-DQN with a lightweight convolutional encoder to process the game board’s spatial information effectively.
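To make the distributional idea concrete, here is a minimal sketch of the quantile Huber loss that QR-DQN minimizes, written in PyTorch. The tensor shapes and function name are illustrative assumptions for this summary, not the paper’s code:

```python
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, kappa=1.0):
    """Quantile Huber loss from QR-DQN (illustrative sketch).

    pred_quantiles:   (batch, N) quantile estimates for the chosen actions
    target_quantiles: (batch, N) Bellman targets, one per target quantile
    kappa:            Huber threshold (must be > 0)
    """
    batch_size, n = pred_quantiles.shape
    # Quantile midpoints tau_hat_i = (2i + 1) / (2N).
    tau_hat = (torch.arange(n, device=pred_quantiles.device,
                            dtype=torch.float32) + 0.5) / n

    # Pairwise TD errors td[b, i, j] = target_j - pred_i, shape (batch, N, N).
    td = target_quantiles.unsqueeze(1) - pred_quantiles.unsqueeze(2)

    # Elementwise Huber loss.
    huber = torch.where(td.abs() <= kappa,
                        0.5 * td.pow(2),
                        kappa * (td.abs() - 0.5 * kappa))

    # Asymmetric quantile weighting |tau_hat_i - 1{td < 0}|.
    weight = (tau_hat.view(1, -1, 1) - (td.detach() < 0).float()).abs()

    # Average over target quantiles, sum over predicted quantiles.
    return (weight * huber / kappa).mean(dim=2).sum(dim=1).mean()
```

The asymmetric weighting is what makes each output head converge to a different quantile of the return distribution rather than to the mean.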
Horizon-DQN (H-DQN) is a new, composite architecture specifically designed for long-horizon planning. It builds upon the “Rainbow” agent, which combines several key innovations in deep RL, and adds two crucial mechanisms for sparse-reward domains: sequence-level prioritized replay and a recurrent LSTM encoder. The LSTM helps the agent remember past actions and their long-term consequences, while prioritized replay focuses learning on the most informative sequences of actions. H-DQN also incorporates dueling networks, double Q-learning, multi-step TD updates, and NoisyNet exploration for robust and stable learning.
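The multi-step and double-Q ingredients can be sketched compactly as well. The following illustrative function computes an n-step double-Q target of the kind such an agent would regress towards; the function signature and default discount are assumptions for the example, not values reported in the paper:

```python
import torch

def n_step_double_q_target(rewards, next_state, done, online_net, target_net,
                           gamma=0.99):
    """Multi-step TD target with double Q-learning (illustrative sketch).

    rewards:    list of n per-step reward tensors r_t, ..., r_{t+n-1}
    next_state: batched tensor for state s_{t+n}
    done:       tensor of 1.0 where the episode ended within the n steps
    """
    n = len(rewards)
    # Discounted sum of the intermediate rewards: sum_k gamma^k * r_{t+k}.
    g = sum((gamma ** k) * r for k, r in enumerate(rewards))

    with torch.no_grad():
        # Double Q: the online network selects the action ...
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        # ... and the target network evaluates it, reducing overestimation.
        bootstrap = target_net(next_state).gather(1, best_action).squeeze(1)

    return g + (gamma ** n) * (1.0 - done) * bootstrap
```

Propagating reward over n steps at once is what lets a sparse payoff (a big merge many moves away) reach earlier states in far fewer updates.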
Experimental Results: A Clear Hierarchy of Performance
The researchers benchmarked H-DQN and QR-DQN against standard Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) baselines under identical training conditions in the Gymnasium-2048 environment (a minimal interaction sketch follows the list below). The results showed a clear performance hierarchy:
- Standard DQN and PPO agents plateaued at significantly lower scores (average 1,443 and 1,831 respectively) and rarely reached the 512 tile.
- QR-DQN performed much better, achieving an average score of 3,478 and reaching the 1024 tile, demonstrating its ability to handle sparse rewards more effectively.
- H-DQN emerged as the top performer, achieving an average score of 5,693 and a maximum score of 18,210, consistently reaching the 2048 tile.
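For context, interacting with a Gymnasium-style 2048 environment looks like the random-agent loop below. The package import and environment id are assumptions based on common Gymnasium conventions; check the Gymnasium-2048 documentation for the exact registered names:

```python
import gymnasium as gym
import gymnasium_2048  # registers the 2048 env (assumed package name)

# Environment id is an assumption; see the package README for the exact id.
env = gym.make("gymnasium_2048/TwentyFortyEight-v0")

obs, info = env.reset(seed=0)
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random baseline policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"episode return: {total_reward}")
env.close()
```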
Further scaling of H-DQN’s training from 5,000 to 9,000 episodes yielded even more impressive results: a 14.8% increase in average score (to 6,536), a jump in the maximum tile from 2048 to 4096, and a peak score of 41,828. This indicates that H-DQN continues to benefit significantly from extended training, suggesting considerable untapped potential.
Learned Strategies and Future Directions
A fascinating observation was the “corner-locking” strategy adopted by the stronger H-DQN models. These agents developed a strong bias towards moving tiles in specific directions (e.g., Left and Down, or Right and Up) to anchor the largest tile in a corner. This dramatically reduced board fragmentation and improved long-term play; weaker models, by contrast, spread their moves roughly uniformly across the four directions and produced chaotic boards.
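One simple way to detect such a bias is to log the empirical move distribution of a trained policy over evaluation episodes: a histogram concentrated on two adjacent directions is the corner-locking signature. The sketch below assumes a hypothetical greedy_action(obs) helper exposed by the trained agent:

```python
from collections import Counter

def action_distribution(env, greedy_action, episodes=100):
    """Empirical move frequencies of a trained policy (illustrative helper)."""
    counts = Counter()
    for _ in range(episodes):
        obs, _ = env.reset()
        done = False
        while not done:
            a = greedy_action(obs)  # assumed: agent's greedy action for obs
            counts[a] += 1
            obs, _, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
    total = sum(counts.values())
    return {a: c / total for a, c in sorted(counts.items())}

# Roughly 0.25 per direction suggests no bias; a corner-locking agent
# concentrates most of its mass on two adjacent directions.
```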
The study concludes that modeling the full return distribution and propagating rewards across multiple steps provides a robust way to tackle tasks with delayed rewards. While promising, the researchers acknowledge challenges such as hyperparameter sensitivity and the significant computational resources required for high performance. Future work could explore integrating model-based planning, curriculum learning, and distributed training architectures to further enhance efficiency and generalizability. For more details, you can refer to the full research paper: 2048: Reinforcement Learning in a Delayed Reward Environment.


