TLDR: This research introduces a method to significantly improve the performance of AI workloads on RISC-V processors equipped with the Vector Extension (RVV). By integrating RVV into TVM’s MetaSchedule framework, the authors developed a system that automatically tunes tensor operations for the target hardware. The approach outperforms traditional compiler autovectorization and existing hand-crafted libraries like muRISCV-NN, delivering mean speedups of 84% for matrix multiplications and 46% for full AI models while also reducing the code memory footprint, making it well suited to a wide range of RISC-V-based AI applications, especially in embedded systems.
Artificial Intelligence (AI) models are becoming increasingly prevalent, from powerful data centers to compact embedded devices. The RISC-V Instruction Set Architecture (ISA), known for its open-source nature and scalability, is an excellent candidate for accelerating these AI workloads across diverse hardware platforms.
While RISC-V’s Vector Extension (RVV) has gained support in various commercial and research platforms, writing software that uses these vector units efficiently for AI workloads remains a challenge. Existing solutions, such as compiler autovectorization (e.g., GCC, LLVM) or hand-crafted libraries like muRISCV-NN, often fall short: autovectorization does not always maximize vector unit efficiency, and hand-crafted libraries struggle to adapt to different hardware configurations, leading to suboptimal performance.
This research introduces a novel approach to optimize AI workloads for RISC-V vector units by integrating the RVV extension into TVM’s MetaSchedule framework. MetaSchedule is a probabilistic program framework designed for tuning tensor operations, allowing for an efficient exploration of various mapping possibilities for AI workloads onto RISC-V vector units.
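To give a feel for the probabilistic-program idea (this is a toy sketch in pure Python, not TVM’s actual API), the snippet below randomly samples tile sizes for a matrix multiplication and keeps the candidate that scores best under a simple, made-up cost proxy. MetaSchedule does the same at scale, with real schedules and measurements on the target hardware; the constants and cost function here are illustrative assumptions only.

```python
import random

# Toy stand-in for MetaSchedule's probabilistic search: sample tile
# sizes for an N x N matmul and score them with a simple cost proxy.
# All names and numbers here are illustrative, not TVM's real API.

N = 512
VLEN_ELEMS = 8          # assumed vector width in fp32 elements

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def cost(tile_i, tile_j):
    # Toy proxy: prefer tiles whose inner dimension fills whole
    # vectors and whose working set stays small (a stand-in for
    # real measurements on the target).
    vector_waste = (-tile_j) % VLEN_ELEMS          # padding elements
    working_set = tile_i * tile_j                  # elements touched
    return vector_waste * 1000 + abs(working_set - 256)

def sample_schedule(rng):
    # A "probabilistic program": each run makes random choices,
    # yielding a different point in the schedule design space.
    return rng.choice(divisors(N)), rng.choice(divisors(N))

rng = random.Random(0)
best = min((sample_schedule(rng) for _ in range(200)),
           key=lambda t: cost(*t))
print("best tile:", best, "cost:", cost(*best))
```

In the real framework the "cost" is obtained by compiling and running each candidate schedule, and the sampled choices cover loop splits, orderings, and intrinsic matches rather than just two tile sizes.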
The core of the approach is extending MetaSchedule with specialized “tensor intrinsics” that map onto RVV instructions. These intrinsics define small, fixed-shape tensor operations that the target hardware can accelerate directly. By using probabilistic sampling, MetaSchedule explores a vast design space of candidate schedules for each tensor operation, identifying the most efficient ones.
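To make the intrinsic idea concrete, here is a hedged pure-Python sketch: a tiled matmul whose innermost step is a micro-kernel of a fixed shape, which is the role a tensor intrinsic plays. On real hardware that body would be RVV vector loads, multiply-accumulates, and stores; the tile size and function names here are assumptions for illustration.

```python
# Illustrative sketch: a 4x4 micro-kernel stands in for an RVV-backed
# tensor intrinsic; the outer loops tile the matmul down to its shape.
TILE = 4

def micro_kernel(A, B, C, i0, j0, K):
    # Fixed-shape inner operation (the "tensor intrinsic"); on RVV
    # hardware this body would be vector instructions, not Python.
    for i in range(i0, i0 + TILE):
        for k in range(K):
            a = A[i][k]
            for j in range(j0, j0 + TILE):
                C[i][j] += a * B[k][j]

def tiled_matmul(A, B, M, N, K):
    # Assumes M and N are multiples of TILE, for brevity.
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, TILE):
        for j0 in range(0, N, TILE):
            micro_kernel(A, B, C, i0, j0, K)
    return C
```

The scheduler’s job is then to arrange the surrounding loops (tiling, ordering, parallelism) so that the fixed-shape inner step is invoked as efficiently as possible.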
A key challenge addressed is the flexibility of RVV, particularly its variable vector lengths. The researchers tackled this by registering multiple versions of the same tensor intrinsic within TVM, each configured with a different vector length. This allows MetaSchedule to match and accelerate both large and small tensor operations effectively.
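One way to picture the multi-length registration (again a pure-Python sketch with hypothetical names, not TVM’s registration API): several variants of the same kernel are generated for different vector lengths, and the widest one that evenly divides the operation’s size is selected, so large and small tensors each get a matching intrinsic.

```python
# Sketch: register one "intrinsic" variant per vector length and pick
# the widest one that evenly divides the row length. Names and the
# selection rule are illustrative assumptions.
def make_axpy(vl):
    # y[j:j+vl] += a * x[j:j+vl], processed vl elements at a time,
    # standing in for an RVV kernel built for that vector length.
    def kernel(a, x, y):
        for j in range(0, len(x), vl):
            for t in range(vl):
                y[j + t] += a * x[j + t]
    return kernel

INTRINSICS = {vl: make_axpy(vl) for vl in (16, 8, 4, 1)}  # widest first

def select_intrinsic(n):
    # Widest registered length dividing n; vl=1 is the scalar
    # fallback, so a match always exists.
    return next(INTRINSICS[vl] for vl in INTRINSICS if n % vl == 0)

x = [1.0] * 12
y = [0.0] * 12
select_intrinsic(len(x))(2.0, x, y)   # n=12 selects the vl=4 variant
print(y)
```

In the actual workflow this matching happens inside MetaSchedule’s search, which can also weigh the candidates against each other rather than applying a fixed rule.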
The evaluation of this new workflow involved implementing various RISC-V Systems-on-Chip (SoCs) on an FPGA and also testing on a commercially available Banana Pi BPI-F3 board. A wide range of AI workloads, including matrix multiplications and complete neural networks (like MobileNetV2, BERT, and MobileLLM), were tuned and compared against existing methods.
The results are compelling. For single matrix multiplications, the proposed solution demonstrated a mean improvement of 84% compared to GCC’s autovectorization and 50% against muRISCV-NN. For complete AI models, the improvements were 46% against GCC’s autovectorization and 29% against muRISCV-NN. On the Banana Pi board, the solution provided a 35% speedup for complete AI models over standard LLVM autovectorization.
Furthermore, an analysis of instruction traces revealed that the optimized schedules generated by this approach utilize vector registers more efficiently, leading to fewer instructions executed and a significantly smaller code memory footprint (around 90% reduction in many cases) compared to muRISCV-NN. This makes the resulting binaries more suitable for embedded devices with limited memory.
While the tuning process requires some time, the significant performance gains achieved make it a worthwhile investment for deploying AI workloads on RISC-V platforms. The open-source nature of this work encourages further expansion to other RISC-V extensions, benefiting both embedded devices and high-performance computing applications. For more technical details, you can refer to the original research paper: Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs.


