TLDR: SimdBench is the first benchmark to evaluate Large Language Models (LLMs) on generating SIMD-intrinsic code, which is crucial for high-performance computing. The study found that LLMs struggle more with vectorized code than with scalar code, yet top models like DeepSeek-R1 can generate correct and significantly faster SIMD code. Common failures are compilation errors caused by outdated knowledge of intrinsics, and logical bugs. The research highlights the potential for LLMs to optimize software with hardware-level features and suggests improving training data and generation strategies.
Large Language Models (LLMs) have shown impressive capabilities in generating code, but their performance in creating highly optimized, hardware-specific code, particularly using SIMD (Single Instruction Multiple Data) intrinsics, has been largely unexplored. A new research paper introduces SimdBench, the first benchmark specifically designed to evaluate how well LLMs can generate this specialized type of code.
SIMD instructions are a crucial feature in modern processors, allowing them to perform the same operation on multiple data items simultaneously. This parallel processing significantly accelerates performance-critical tasks. While compilers can sometimes automatically vectorize code, explicit SIMD intrinsic programming offers finer control for maximum efficiency, a technique widely used in developing high-performance libraries like OpenCV and TensorFlow.
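To make the contrast between scalar and intrinsic code concrete, here is a minimal sketch of element-wise float addition written both ways. It is an illustration only and not a task from SimdBench; the function names `add_scalar` and `add_simd` are made up for this example. The SSE path is used only when the compiler defines `__SSE__`, and otherwise the function falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h> /* SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps */
#endif

/* Scalar baseline: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* Intrinsic version (hypothetical name): four additions per iteration
 * where SSE is available, followed by a scalar loop for the tail. */
void add_simd(const float *a, const float *b, float *out, size_t n) {
    assert(n == 0 || (a != NULL && b != NULL && out != NULL));
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i); /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* add and store 4 lanes */
    }
#endif
    /* Leftover elements, or every element when SSE is unavailable. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Even in this small case the intrinsic version has to handle vector width, load and store variants, and the leftover tail by hand, which is the kind of low-level detail the article describes.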
However, writing SIMD intrinsic code is challenging. It involves complex interfaces, low readability due to embedded low-level details, manual data alignment, and intricate control and data flow. Current code generation benchmarks for LLMs primarily focus on general-purpose, scalar code, leaving a significant gap in understanding LLMs’ ability to handle vectorized code.
To address this, researchers from Peking University; The Chinese University of Hong Kong, Shenzhen; The Hong Kong University of Science and Technology; and DAMO Academy, Alibaba Group, developed SimdBench. The benchmark comprises 136 carefully crafted tasks targeting five SIMD intrinsic sets: SSE and AVX for x86 architectures, Neon and SVE for ARM, and RVV for RISC-V. The tasks come from two sources, hand-crafted operations based on intrinsic documentation and modified problems from existing benchmarks like HumanEval, which keeps them diverse and relevant to vectorization.
SimdBench is unique in its comprehensive approach. Each task includes a detailed functional description, a signature for the target function, and robust test cases for both correctness and performance. The correctness tests use differential testing against a canonical scalar solution, while performance tests leverage the Google Benchmark library to measure speedup on large-scale data, ensuring precise and reliable results.
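The idea behind differential testing can be sketched in a few lines: run a candidate and a trusted scalar reference on the same generated inputs and flag any disagreement. The code below is an assumption-laden illustration, not SimdBench's actual harness. All names (`sum_reference`, `sum_lanes`, `sum_lanes_no_tail`, `differential_test`) are hypothetical, the input generator is a simple linear congruential generator, and plain C loops stand in for intrinsic code so the example runs on any platform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t (*sum_fn)(const int32_t *data, size_t n);

/* Canonical scalar solution: the trusted reference. */
int64_t sum_reference(const int32_t *data, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

/* Candidate written in a "vectorized" style: four lanes plus a tail loop. */
int64_t sum_lanes(const int32_t *data, size_t n) {
    int64_t lane[4] = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (size_t j = 0; j < 4; ++j) lane[j] += data[i + j];
    }
    int64_t total = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; ++i) total += data[i]; /* tail elements */
    return total;
}

/* Buggy candidate: forgets the tail when n is not a multiple of 4. */
int64_t sum_lanes_no_tail(const int32_t *data, size_t n) {
    int64_t lane[4] = {0, 0, 0, 0};
    for (size_t i = 0; i + 4 <= n; i += 4) {
        for (size_t j = 0; j < 4; ++j) lane[j] += data[i + j];
    }
    return lane[0] + lane[1] + lane[2] + lane[3];
}

/* Small deterministic pseudo-random generator (LCG). */
static uint32_t lcg_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Returns true if the candidate matches the reference on every trial.
 * Lengths cycle through 0..63 so every tail size is exercised; values are
 * kept in [1, 65536] so a dropped element always changes the sum. */
bool differential_test(sum_fn candidate, size_t trials, uint32_t seed) {
    assert(candidate != NULL);
    int32_t buf[64];
    uint32_t state = seed;
    for (size_t t = 0; t < trials; ++t) {
        size_t n = t % 64;
        for (size_t i = 0; i < n; ++i) {
            buf[i] = (int32_t)(lcg_next(&state) >> 16) + 1;
        }
        if (candidate(buf, n) != sum_reference(buf, n)) return false;
    }
    return true;
}
```

Under these assumptions, `differential_test(sum_lanes, 200, 1u)` accepts the correct candidate, while `sum_lanes_no_tail` is rejected as soon as a length that is not a multiple of four is tried.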
The systematic evaluation of 18 representative LLMs on SimdBench revealed several findings. `pass@k`, which estimates the chance that at least one of k sampled solutions passes all correctness tests, dropped across the board for SIMD-intrinsic code generation compared to scalar code, highlighting the inherent difficulty of this task for current LLMs. Among the evaluated models, DeepSeek-R1 performed best, achieving an average `pass@5` of 75.44% across the five intrinsic types and notably excelling in the more complex SVE and RVV scenarios.
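The article does not say how `pass@k` is computed. A common choice, assumed here, is the unbiased estimator popularized with HumanEval: with n samples per task of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. The sketch below evaluates that expression as a running product to avoid large binomial coefficients; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* pass@k for one task: n samples drawn, c of them correct.
 * Equals 1 - C(n-c, k) / C(n, k), computed as a product for stability. */
double pass_at_k(unsigned n, unsigned c, unsigned k) {
    assert(k >= 1 && k <= n && c <= n);
    if (n - c < k) {
        return 1.0; /* too few failures to fill k draws with failures only */
    }
    double all_fail = 1.0; /* probability that all k draws are incorrect */
    for (unsigned i = n - c + 1; i <= n; ++i) {
        all_fail *= 1.0 - (double)k / (double)i;
    }
    return 1.0 - all_fail;
}

/* Benchmark-level score: mean pass@k over tasks, where correct[t] is the
 * number of passing samples for task t and every task has n samples. */
double mean_pass_at_k(const unsigned *correct, size_t tasks,
                      unsigned n, unsigned k) {
    if (tasks == 0) {
        return 0.0;
    }
    double sum = 0.0;
    for (size_t t = 0; t < tasks; ++t) {
        sum += pass_at_k(n, correct[t], k);
    }
    return sum / (double)tasks;
}
```

For example, with n = 10 samples of which c = 3 are correct, `pass_at_k(10, 3, 5)` evaluates 1 - 21/252, roughly 0.917.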
Despite the challenges, the study found that valid SIMD-intrinsic code generated by LLMs often resulted in significant performance improvements compared to scalar code, even when the latter was optimized by compilers. This suggests that LLM-assisted vectorized programming can indeed overcome the limitations of compiler auto-vectorization and achieve higher peak performance.
The analysis of invalid cases pointed to two primary obstacles: compilation errors, particularly “use of undeclared identifier,” and logical bugs in the generated code. Errors related to undeclared identifiers were more prevalent for SVE and RVV, often due to outdated or incomplete training data in LLMs regarding the latest intrinsic definitions. Logical bugs were more common for SSE, AVX, and Neon, indicating the complexity of correctly implementing vectorized operations like data alignment.
The researchers propose promising directions for future advancements, including developing high-quality, up-to-date training datasets for SIMD intrinsics, incorporating retrieval-augmented generation (RAG) to allow LLMs to access external documentation, and adopting a step-by-step generation strategy where LLMs first generate scalar code and then vectorize it. This research paves the way for LLMs to assist developers in optimizing performance-critical libraries, improving cross-platform portability, and enhancing the security and reliability of SIMD toolchains.
For more details, you can read the full research paper: SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation.


