TLDR: A new universal one-sided algorithm for distributed matrix multiplication is introduced, capable of efficiently handling any combination of data partitionings and replication factors. This algorithm, which uses “slicing” for local computations and direct GPU-to-GPU communication, performs competitively with highly optimized libraries like PyTorch DTensor, simplifying the development of large-scale AI and scientific applications by reducing the need for multiple specialized implementations.
Distributed matrix multiplication is a fundamental operation powering many advanced applications in science, data analytics, and artificial intelligence. From training massive AI models to processing large datasets, the efficiency of these operations directly impacts performance and scalability. However, a significant challenge in this field has been the need for a multitude of specialized algorithms, each tailored to specific data layouts and distribution strategies.
Researchers Benjamin Brock and Renato Golin from Intel Corporation have introduced a novel solution to this problem: a universal one-sided algorithm for distributed matrix multiplication. This new approach aims to simplify the complex landscape of distributed computing by providing a single, adaptable algorithm that can efficiently handle virtually any combination of data partitionings and replication factors.
The Challenge with Existing Methods
Current distributed matrix multiplication algorithms are often limited to a subset of possible data arrangements. This means that developers frequently need to implement multiple algorithms or resort to costly data redistribution steps if their data doesn’t fit a supported partitioning. For instance, in the context of AI models, where weight matrices often exceed the memory capacity of a single GPU, various partitioning strategies (like 1D, 2D, 1.5D, and 2.5D) are employed. Existing systems, often referred to as “SPMD” (Single Program, Multiple Data) systems, dispatch operations to a limited set of algorithms, leading to inefficiencies when a suitable implementation isn’t available.
A Unified Approach: Slicing for Universal Compatibility
The core innovation of this new algorithm lies in its “slicing” mechanism. Instead of relying on predefined data movement patterns, the algorithm uses index arithmetic to compute the exact sets of overlapping data “tiles” that need to be multiplied together locally. This dynamic approach allows it to support all combinations of partitionings and replication factors, including those with non-aligned tiles, which are often problematic for traditional methods.
The algorithm operates by first designating one of the matrices (A, B, or C in the operation C = AB) as “stationary.” Each processing unit (typically a GPU) then identifies the local matrix multiplication tasks it needs to perform based on its stationary data. For remote data, it uses one-sided communication primitives like “remote get” to fetch necessary tiles and “remote accumulate” to update results directly in other GPUs’ memories. This direct GPU-to-GPU communication, facilitated by intra-node interconnects like NVLink or Xe Link, is a key factor in its efficiency.
Optimized Execution and Performance
To ensure high performance, the algorithm incorporates several optimizations:
- Iteration Offset: Balances network load by varying the order in which processes execute tasks.
- Prefetching: Overlaps communication and computation by asynchronously fetching upcoming data tiles.
- Asynchronous Execution: Launches all work asynchronously, allowing local computations and remote accumulations to run in parallel.
- Memory Pool: Reduces overheads associated with frequent GPU memory allocations.
The researchers evaluated their algorithm against PyTorch DTensor, a state-of-the-art distributed tensor library widely used in AI. Experiments were conducted on systems equipped with Intel GPUs (Ponte Vecchio) and Nvidia H100 GPUs, using matrix sizes characteristic of GPT-like transformer models. The results showed that the universal algorithm achieved performance competitive with, and in some cases even surpassed, DTensor across a wide range of partitionings and replication factors. This demonstrates its practical viability for real-world AI workloads.
Also Read:
- Exploring Second-Order Optimization Limits for Large Language Models
- Optimizing Wireless Communication with AI: A New Approach to Precoding in FDD Systems
Advancing Distributed Computing
This universal algorithm significantly simplifies the development of large-scale distributed applications by removing the need for developers to manually select or implement different matrix multiplication algorithms for various data distributions. It opens up new possibilities for exploring the full design space of distributed tensor operations, which is a critical priority for advancing AI research. While the paper focuses on the algorithm itself, future work aims to integrate it into production SPMD systems and combine it with techniques for automatically selecting optimal data partitionings.
For more in-depth technical details, you can read the full research paper: Slicing Is All You Need: Towards A Universal One-Sided Algorithm for Distributed Matrix Multiplication.


