TLDR: This research introduces a novel parallel CPU-GPU framework for Cost-Bounded Depth-First Search (CB-DFS), designed to efficiently integrate neural network heuristics into classic search algorithms like IDA* and BTS. By batching heuristic evaluations on the GPU while CPU threads explore the search space, the framework amortizes the cost of neural network inference and substantially reduces search time. The paper demonstrates how this approach maintains optimality guarantees and scales with multiple GPUs, offering a new direction for leveraging modern hardware in AI search.
The landscape of artificial intelligence is constantly evolving, with powerful hardware like Graphics Processing Units (GPUs) and modern Central Processing Units (CPUs) offering unprecedented parallel processing capabilities. This advancement opens new avenues for enhancing classic search algorithms, which are fundamental to many AI problems, from game playing to pathfinding.
Traditional search algorithms, such as Iterative Deepening A* (IDA*) and Budgeted Tree Search (BTS), have long been cornerstones of AI. However, they haven’t fully leveraged the massive parallelism offered by contemporary hardware. While some progress has been made, like using GPUs for compressing large pattern database (PDB) heuristics or in algorithms like Batch A*, there’s still a gap in designing search algorithms that truly exploit both CPU and GPU power during the search process itself.
A recent research paper introduces a novel approach to bridge this gap: a parallel CPU-GPU framework for Cost-Bounded Depth-First Search (CB-DFS). This method is specifically designed to efficiently integrate neural network heuristics into the search process. The paper demonstrates its effectiveness through extensions like Batch IDA* and Batch BTS, which build upon the principles of Asynchronous Parallel IDA* (AIDA*) while maintaining guarantees of finding optimal solutions.
How It Works: Combining CPU and GPU Power
The core innovation lies in the parallel implementation of CB-DFS. The Batch IDA* algorithm, for instance, operates in two distinct, yet interconnected, phases:
The first phase is the Search Phase, which is handled by the CPU. Similar to AIDA*, this phase uses CPU parallelism to asynchronously generate and explore different subtrees of the search space. Multiple CPU threads work in parallel, each processing a portion of the search tree.
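The cost-bounded DFS at the heart of this phase can be sketched as follows. This is a minimal, single-threaded Python sketch for intuition only; the function names and the recursive formulation are illustrative, not taken from the paper:

```python
def cost_bounded_dfs(state, g, bound, h, neighbors, goal):
    """Depth-first search that prunes any node whose f = g + h exceeds bound.

    Returns (solution_cost, next_bound): solution_cost is None if no goal was
    found within the bound; next_bound is the smallest f-value that exceeded
    the bound, which IDA* uses as the limit for its next iteration.
    """
    f = g + h(state)
    if f > bound:
        return None, f
    if goal(state):
        return g, f
    next_bound = float("inf")
    for succ, cost in neighbors(state):
        sol, nb = cost_bounded_dfs(succ, g + cost, bound, h, neighbors, goal)
        if sol is not None:
            return sol, nb
        next_bound = min(next_bound, nb)
    return None, next_bound


def ida_star(start, h, neighbors, goal):
    """Repeatedly run CB-DFS with an increasing cost bound."""
    bound = h(start)
    while True:
        sol, next_bound = cost_bounded_dfs(start, 0, bound, h, neighbors, goal)
        if sol is not None:
            return sol
        if next_bound == float("inf"):
            return None  # search space exhausted, no solution
        bound = next_bound
```

In the parallel setting described above, each CPU thread would run a search like this over its own subtree, with the bound shared across threads.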
The second phase is the Heuristic Evaluation Phase, which leverages the GPU. As the CPU threads generate new nodes, these nodes are collected into batches. Once a batch reaches a certain size, it’s sent to the GPU. The GPU then efficiently evaluates the heuristic (an estimate of the cost to reach the goal) for all nodes in the batch simultaneously using a neural network model. After the evaluations are complete, the results are sent back to the CPU, allowing the search to continue.
This batching mechanism is crucial because GPUs excel at performing many independent computations in parallel. By grouping heuristic evaluations, the system can make full use of the GPU’s capabilities, significantly speeding up a traditionally bottlenecked part of the search process. The paper also proves that this approach guarantees finding optimal solutions, a critical property for many applications.
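As a rough illustration of the batching idea (a sketch, not the paper's implementation), the class below buffers heuristic requests coming from multiple search threads and evaluates a full batch in one call. `batch_eval` is a hypothetical stand-in for a batched neural network forward pass on the GPU:

```python
import threading
from concurrent.futures import Future


class BatchedHeuristic:
    """Buffers heuristic requests from CPU search threads and evaluates them
    together, mimicking a batched GPU forward pass."""

    def __init__(self, batch_eval, batch_size):
        self.batch_eval = batch_eval  # list of states -> list of h-values
        self.batch_size = batch_size
        self.lock = threading.Lock()
        self.pending = []  # list of (state, Future) pairs

    def evaluate(self, state):
        """Called by a search thread; blocks until its batch is processed."""
        fut = Future()
        with self.lock:
            self.pending.append((state, fut))
            if len(self.pending) >= self.batch_size:
                batch, self.pending = self.pending, []  # claim a full batch
            else:
                batch = None
        if batch is not None:
            # The thread that fills the batch runs the (stand-in) GPU call
            # and distributes results to all waiting threads.
            states = [s for s, _ in batch]
            for (_, f), value in zip(batch, self.batch_eval(states)):
                f.set_result(value)
        return fut.result()
```

A production version would also flush partially filled batches after a timeout so that threads never wait indefinitely when the search frontier shrinks.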
Scaling and Performance
The framework also addresses scalability. While a single GPU can provide substantial speedups, the paper introduces MultiGPU Batch IDA*, which allows the system to utilize multiple GPUs. By distributing CPU search threads and assigning dedicated batch-processing threads and GPUs to each, the system can process even more batches concurrently, further enhancing performance.
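One simple way to picture this distribution is a round-robin dispatch of batches across per-GPU worker threads, as in the sketch below. The `evaluators` here are hypothetical stand-ins for per-device neural network inference calls; the actual assignment scheme in the paper may differ:

```python
import threading
from queue import Queue


def run_multi_gpu(batches, evaluators):
    """Round-robin batches of states across per-GPU worker threads.

    Each worker drains its own queue, so batches assigned to different
    devices are processed concurrently. Returns {batch_id: h_values}.
    """
    queues = [Queue() for _ in evaluators]
    results = {}

    def worker(q, evaluate):
        while True:
            item = q.get()
            if item is None:  # sentinel: shut down this worker
                break
            batch_id, states = item
            results[batch_id] = evaluate(states)

    threads = [threading.Thread(target=worker, args=(q, ev))
               for q, ev in zip(queues, evaluators)]
    for t in threads:
        t.start()
    for i, batch in enumerate(batches):
        queues[i % len(queues)].put((i, batch))  # round-robin dispatch
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return results
```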
Experiments conducted on challenging problems like the 3×3 Rubik’s Cube and the 4×4 sliding tile puzzle demonstrated significant performance gains. For instance, increasing the batch size in SingleGPU Batch IDA* drastically reduced solution times. Using two GPUs further accelerated the process, achieving nearly ten times faster performance than Batch A* in some cases. The research also explored the impact of neural network model size and various hardware configurations, showing that while larger models can be more expensive to evaluate, the batching approach still provides benefits.
This work lays a strong foundation for future research on building and deploying neural heuristics more effectively within classic search algorithms. For a more in-depth understanding, see the full research paper.


