Building Better AI Systems: A New Framework for Multi-Accelerator Efficiency

TLDR: SNAX is an open-source hardware-software framework that simplifies the integration and management of multiple AI accelerators in computer systems. It uses a unique “hybrid-coupling” approach, combining flexible control with fast data access, and includes a smart compiler to automate complex tasks. This leads to significant performance improvements (over 10x faster) and high utilization (over 90%) for AI workloads, making it easier to develop and deploy efficient heterogeneous computing platforms.

In the rapidly evolving world of artificial intelligence, specialized hardware accelerators are becoming crucial for handling complex AI tasks efficiently. However, integrating multiple accelerators into a single system often creates significant challenges. These difficulties typically stem from inefficient data movement and compatibility issues between hardware and software, preventing a unified approach that balances high performance with ease of use.

Understanding the Challenge

Traditional methods for integrating accelerators often adopt a “hardware-centric” view. This means the hardware is designed first, and then software developers are left to figure out how to manage the system using low-level instructions. This approach complicates the creation of compilers for these diverse systems, leading to underutilized accelerators and reduced overall system efficiency. Imagine having powerful tools but no clear instructions on how to use them together effectively.

Accelerators can be “tightly coupled,” meaning they are deeply integrated with the main processor for very fast, single-cycle interactions. While this offers low latency, it can cause the main processor to wait (stall) if the accelerator takes multiple cycles, limiting parallel operations. On the other hand, “loosely coupled” accelerators connect via a common bus, allowing for more flexible, asynchronous operations where the main processor and other accelerators can work in parallel. However, this flexibility often comes with overheads in data transfer and synchronization, potentially slowing things down if not managed carefully.
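The trade-off between the two coupling styles can be made concrete with a toy cycle-count model. This is only a sketch under simplifying assumptions: the function names, the two-stage overlap model, and all cycle counts are hypothetical and are not taken from the SNAX paper.

```python
def tightly_coupled_cycles(host_work, accel_latency, n_tasks):
    """Host stalls while the accelerator runs, so the two never overlap."""
    return n_tasks * (host_work + accel_latency)


def loosely_coupled_cycles(host_work, accel_latency, n_tasks, transfer_overhead):
    """Host prepares task i+1 while the accelerator executes task i.

    Every accelerator invocation additionally pays a bus-transfer and
    synchronization cost (transfer_overhead).
    """
    if n_tasks == 0:
        return 0
    accel_cost = accel_latency + transfer_overhead
    steady_state = max(host_work, accel_cost)
    # Fill the pipeline once, run (n - 1) overlapped steps, then drain.
    return host_work + (n_tasks - 1) * steady_state + accel_cost


if __name__ == "__main__":
    print(tightly_coupled_cycles(10, 40, 100))
    print(loosely_coupled_cycles(10, 40, 100, transfer_overhead=5))
    print(loosely_coupled_cycles(10, 40, 100, transfer_overhead=20))
```

With a small transfer overhead the overlapped (loosely coupled) version finishes sooner, but once the overhead grows large enough it is slower than simply stalling, which is the tension the paragraph above describes.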

Introducing SNAX: A Unified Approach

To address these challenges, researchers from MICAS-ESAT, KU Leuven, Belgium, have developed SNAX, an open-source integrated hardware-software framework. SNAX aims to enable efficient multi-accelerator platforms through a novel “hybrid-coupling” scheme. This scheme combines loosely coupled asynchronous control, for flexible task management, with tightly coupled data access, for high-speed data movement.

SNAX provides reusable hardware modules designed to boost the utilization of compute accelerators. It also features a customizable compiler, called SNAX-MLIR, which automates key system management tasks. Together, these components enable rapid development and deployment of customized multi-accelerator compute clusters. The full details are in the research paper, “An Open-Source HW-SW Co-Development Framework Enabling Efficient Multi-Accelerator Systems.”

How SNAX Works: Hybrid Coupling

The core innovation of SNAX lies in its hybrid-coupling concept, which influences both hardware and software development.

The Hardware Side: SNAX Cluster

The SNAX hardware architecture, known as the SNAX Cluster, implements this hybrid approach. For control, it uses one or more lightweight RISC-V cores as management units. These cores send simple “fire-and-forget” control signals to accelerators, allowing them to operate independently and in parallel. This standardized control interface makes it easy to integrate any custom accelerator.
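The “fire-and-forget” control pattern can be illustrated in software. In the sketch below, Python threads stand in for hardware accelerators; the `Accelerator` class, its `launch`/`wait` methods, and the toy kernels are hypothetical and do not represent SNAX's actual control interface.

```python
from concurrent.futures import ThreadPoolExecutor


class Accelerator:
    """Hypothetical stand-in for an accelerator with a fire-and-forget control port."""

    def __init__(self, name, kernel):
        self.name = name
        self._kernel = kernel
        self._worker = ThreadPoolExecutor(max_workers=1)
        self._pending = None

    def launch(self, config):
        """Hand over the configuration and return immediately (no stall)."""
        self._pending = self._worker.submit(self._kernel, config)

    def wait(self):
        """Explicit synchronization point: block until the last launch is done."""
        if self._pending is None:
            raise RuntimeError(f"{self.name}: nothing was launched")
        result = self._pending.result()
        self._pending = None
        return result

    def close(self):
        self._worker.shutdown()


def run_demo():
    """The 'management core' starts both accelerators, then synchronizes once."""
    conv = Accelerator("conv", lambda cfg: sum(cfg["data"]))
    pool = Accelerator("maxpool", lambda cfg: max(cfg["data"]))
    try:
        conv.launch({"data": [1, 2, 3]})   # returns at once
        pool.launch({"data": [4, 9, 2]})   # runs in parallel with conv
        return conv.wait(), pool.wait()
    finally:
        conv.close()
        pool.close()


if __name__ == "__main__":
    print(run_demo())
```

The key point is that `launch` never blocks the caller; the manager only pays for synchronization where it explicitly calls `wait`.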

For data, SNAX employs a tightly coupled, shared, multi-banked scratchpad memory (SPM). This allows accelerators to read and write data in a single cycle with parallel access, eliminating the need for slow data transfers between accelerators. To further optimize data flow, SNAX uses “data streamers” at the accelerator-memory interface. These streamers autonomously manage data loading and storing, ensuring a continuous, smooth data stream into the accelerators.
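To see why a multi-banked memory supports parallel single-cycle access, and what a data streamer roughly does, consider the following sketch. It assumes word-interleaved banking (bank = address modulo number of banks) and a nested-loop address pattern; both are common design choices used here for illustration, not details confirmed by the article, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product


def bank_of(word_addr, num_banks):
    """Assumed word-interleaved mapping: consecutive words hit consecutive banks."""
    return word_addr % num_banks


def cycles_for_parallel_access(word_addrs, num_banks):
    """Each bank serves one request per cycle, so the busiest bank sets the cost."""
    per_bank = Counter(bank_of(a, num_banks) for a in word_addrs)
    return max(per_bank.values()) if per_bank else 0


def streamer_addresses(base, bounds, strides):
    """Generate addresses for a nested loop, outermost loop first.

    address = base + sum(index[i] * strides[i])
    """
    for idx in product(*(range(b) for b in bounds)):
        yield base + sum(i * s for i, s in zip(idx, strides))


if __name__ == "__main__":
    # Four consecutive words land in four different banks: one cycle.
    print(cycles_for_parallel_access([0, 1, 2, 3], num_banks=4))
    # A stride equal to the bank count maps every request to bank 0.
    print(cycles_for_parallel_access([0, 4, 8, 12], num_banks=4))
    # Two rows of three words, with rows 8 words apart.
    print(list(streamer_addresses(0, bounds=(2, 3), strides=(8, 1))))
```

Conflict-free access patterns complete in one cycle, while a poorly chosen stride serializes on a single bank. This is one reason data layout matters so much for keeping accelerators fed.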

The Software Side: SNAX-MLIR

Managing multiple accelerators in a pipelined, streaming fashion for high utilization is complex. This is where SNAX-MLIR, the MLIR-based compiler toolchain, comes in. It is highly customizable to different SNAX Cluster configurations and accelerator combinations. SNAX-MLIR automates critical tasks such as:

Device Placement: Assigning computation sections to the most suitable accelerator based on the workload.
Static Memory Allocation: Efficiently allocating buffers in shared memory to support seamless data flow between accelerators without intermediate transfers.
Asynchronous Scheduling: Simplifying the management of parallel execution by inserting synchronization points where needed, ensuring tasks run in the correct order while maximizing concurrency.
Device Programming: Generating the accelerator-specific instructions that the RISC-V hosts execute to configure each accelerator, standardizing how accelerators’ computation and data flow tasks are managed.

This automation allows developers to focus on their accelerator’s core functionality, knowing that the compiler will handle the complex system-level management.
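The first three compiler tasks listed above can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not SNAX-MLIR's implementation: the `Op` record, the device names, the bump allocator (no buffer reuse), and the scheduling rule (wait only on read-after-write dependencies and on busy devices) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    kind: str        # e.g. "conv", "maxpool", "add"
    inputs: list     # names of buffers this op reads
    output: str      # name of the buffer this op writes
    out_bytes: int


# Hypothetical mapping from operation kind to accelerator.
SUPPORTED = {"conv": "conv_accel", "maxpool": "pool_accel"}


def place(ops):
    """Device placement: use an accelerator if one supports the op, else the core."""
    return {op.name: SUPPORTED.get(op.kind, "riscv_core") for op in ops}


def allocate(ops, spm_bytes, align=64):
    """Static allocation: give every output buffer a fixed, aligned offset."""
    offsets, cursor = {}, 0
    for op in ops:
        cursor = (cursor + align - 1) // align * align
        offsets[op.output] = cursor
        cursor += op.out_bytes
    if cursor > spm_bytes:
        raise MemoryError(f"need {cursor} bytes, scratchpad has {spm_bytes}")
    return offsets


def schedule(ops, placement):
    """Asynchronous scheduling: launch eagerly, await only when required."""
    program = []
    producer = {}   # buffer -> in-flight op that writes it
    running = {}    # device -> op currently in flight on it

    def await_op(op_name):
        program.append(("await", op_name))
        for table in (producer, running):
            for key in [k for k, v in table.items() if v == op_name]:
                del table[key]

    for op in ops:
        device = placement[op.name]
        for buf in op.inputs:
            if buf in producer:          # input not ready yet
                await_op(producer[buf])
        if device in running:            # device still busy
            await_op(running[device])
        program.append(("launch", device, op.name))
        producer[op.output] = op.name
        running[device] = op.name
    for op_name in list(running.values()):
        await_op(op_name)
    return program


EXAMPLE_OPS = [
    Op("conv_a", "conv", ["x"], "a", 100),
    Op("pool_b", "maxpool", ["y"], "b", 100),
    Op("add_c", "add", ["a", "b"], "c", 50),
]
```

For `EXAMPLE_OPS`, the two independent operations are launched back to back on different accelerators, and the scheduler inserts waits only before `add_c`, which needs both results. A real compiler would also reuse buffers whose lifetimes have ended and handle more dependency types.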

Demonstrated Impact and Efficiency

Through extensive experimentation on a low-power heterogeneous System-on-Chip (SoC), SNAX has demonstrated remarkable efficiency and flexibility. When accelerating a mini neural network, SNAX achieved a 152x performance boost for convolutional layers and an additional 6.9x improvement by accelerating max-pooling layers. By enabling pipelined execution, where different parts of the network run in parallel on different accelerators, SNAX delivered an additional 3.18x throughput increase.

Crucially, the original program’s source code remained unchanged; only new programming instructions for the accelerators were needed. The compiler managed dispatching, synchronization, data movement, and pipelined execution. SNAX also maintained high accelerator utilization: over 90% in full system operation, with 92% reported for high-intensity workloads and 79% for low-intensity workloads, showing that it keeps accelerators busy across a range of computational demands.
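The link between pipelined execution, throughput, and utilization can be worked through with a small model. The stage latencies below are made-up numbers chosen for easy arithmetic, not measurements from the paper, and the function names are hypothetical.

```python
def sequential_cycles(stage_cycles, n_inputs):
    """Each input passes through every stage before the next input starts."""
    return n_inputs * sum(stage_cycles)


def pipelined_cycles(stage_cycles, n_inputs):
    """Stages run on different accelerators; the slowest stage sets the pace."""
    if n_inputs == 0:
        return 0
    return sum(stage_cycles) + (n_inputs - 1) * max(stage_cycles)


def steady_state_utilization(stage_cycles):
    """Fraction of each pipeline interval that a stage spends doing work."""
    interval = max(stage_cycles)
    return [c / interval for c in stage_cycles]


if __name__ == "__main__":
    stages = [100, 60, 40]  # hypothetical cycles per stage
    print(sequential_cycles(stages, 10))
    print(pipelined_cycles(stages, 10))
    print(steady_state_utilization(stages))
```

In this model the throughput gain from pipelining approaches the ratio of total stage time to the slowest stage, and only the slowest stage is fully utilized. Balancing work across accelerators is therefore what pushes utilization toward the high figures reported above.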

Compared to other state-of-the-art multi-accelerator architectures, SNAX showed significant gains, outperforming some AI-accelerated devices by 7.5x to 15x in certain benchmarks, despite its compact memory and simpler components. This efficiency is attributed to its optimized data access, asynchronous parallel execution, and compiler-managed data layout and scheduling.

Conclusion

SNAX represents a significant step forward in building efficient multi-accelerator systems for AI workloads. By providing a flexible, open-source hardware-software framework with its unique hybrid-coupling approach and intelligent MLIR-based compiler, SNAX simplifies the integration and deployment of diverse accelerators. This co-development strategy not only enhances performance and energy efficiency but also ensures high utilization of specialized hardware, paving the way for more powerful and accessible AI computing platforms.