TLDR: ACCeLLiuM introduces two fine-tuned Large Language Models (LLMs) and a specialized dataset to automate the generation of OpenACC directives for GPU parallelization. While base LLMs struggle, the fine-tuned ACCeLLiuM models significantly improve accuracy, generating correct OpenACC pragmas for data-parallel loops with high syntactic validity, making GPU acceleration more accessible for developers.
The increasing power of Graphics Processing Units (GPUs) has made them essential for accelerating computations across many systems, from supercomputers to local clusters. However, programming these powerful devices efficiently remains a significant challenge due to their complex hardware and diverse parallel programming frameworks. While directive-based standards like OpenACC aim to simplify GPU programming by abstracting away low-level complexities, developers still need considerable expertise to use these directives effectively.
A new research initiative, ACCeLLiuM, introduces a novel approach to tackle this challenge. Developed by Samyak Jhaveri, Vanessa Klotzmann, and Crista Lopes of the University of California, Irvine, ACCeLLiuM consists of two open-weights Large Language Models (LLMs) fine-tuned specifically to generate expert-quality OpenACC directives for data-parallel loops. The release also includes the supervised fine-tuning dataset used to train these models.
The core problem ACCeLLiuM addresses is the difficulty in manually identifying parallelizable loops and crafting the correct set of directives and clauses for efficient GPU offloading. This process is often time-consuming, error-prone, and requires a deep understanding of data dependencies, memory access patterns, and OpenACC specifications. Existing automated solutions, primarily based on static analysis and compiler tools, have struggled with the complexities of real-world code and often produce suboptimal results.
ACCeLLiuM bridges a critical gap in research, as previous efforts in LLM-driven parallelization have largely focused on OpenMP for multi-core CPUs, leaving OpenACC for GPUs relatively unexplored. The ACCeLLiuM resource bundle is an end-to-end solution, comprising:
A Curated Dataset
The ACCeLLiuM SFT dataset is a collection of 4,033 OpenACC pragma-loop pairs, meticulously mined from public GitHub C/C++ repositories. This dataset is split into 3,223 pairs for training and 810 for testing. The data curation process involved several stages, including sourcing, extraction, filtering out noisy or incompatible loops (like empty loops or those with problematic control flow), and deduplication, ensuring a high-quality dataset for training.
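The filtering and deduplication stages can be pictured with a small sketch. This is not the authors' pipeline: the helper names (`loop_body`, `is_noisy`, `curate`), the choice of `goto`, `return`, and `break` as "problematic control flow", and the whitespace-based deduplication key are all assumptions made for illustration.

```python
import re

def normalize(code):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(code.split())

def loop_body(loop):
    """Return the text after the closing parenthesis of the `for (...)` header."""
    start = loop.index("(")
    depth = 0
    for i in range(start, len(loop)):
        if loop[i] == "(":
            depth += 1
        elif loop[i] == ")":
            depth -= 1
            if depth == 0:
                return loop[i + 1:].strip()
    return ""

def is_noisy(loop):
    """Assumed heuristic: empty bodies, or control flow that leaves the loop early."""
    body = loop_body(loop)
    if re.sub(r"\s+", "", body) in {"", ";", "{}"}:
        return True
    return re.search(r"\b(goto|return|break)\b", body) is not None

def curate(pairs):
    """Filter noisy loops, then drop duplicates while keeping first occurrences."""
    seen, kept = set(), []
    for pragma, loop in pairs:
        if is_noisy(loop):
            continue
        key = (normalize(pragma), normalize(loop))
        if key not in seen:
            seen.add(key)
            kept.append((pragma, loop))
    return kept

# Hypothetical pragma-loop pairs, invented for this example.
sample_pairs = [
    ("#pragma acc parallel loop", "for (int i = 0; i < n; i++) a[i] = b[i];"),
    ("#pragma acc parallel loop", "for (int i = 0; i < n; i++)   a[i] = b[i];"),
    ("#pragma acc loop", "for (int i = 0; i < n; i++);"),
    ("#pragma acc loop", "for (int i = 0; i < n; i++) { if (a[i]) break; }"),
]
curated = curate(sample_pairs)
print(len(curated))  # 1: one whitespace duplicate, one empty loop, one early exit removed
```

A real pipeline would likely parse the C/C++ source properly rather than rely on string heuristics, but the order of operations (filter, then deduplicate) matches the stages described above.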
Fine-Tuned Large Language Models
The researchers fine-tuned two distinct open-weights LLMs: Llama 3.1 70B, a general-purpose foundation model, and CodeLlama 34B, a variant of Llama 2 pre-trained on a large corpus of code data. The objective was to train these LLMs to annotate data-parallel loops with the most appropriate OpenACC pragmas, focusing on devising directives for loops already identified as potentially benefiting from GPU acceleration.
An Open-Source Pipeline
ACCeLLiuM also provides an open-source pipeline for dataset creation, model fine-tuning, and evaluation, aiming to establish a reproducible benchmark for LLM-powered OpenACC pragma generation and lower the barrier to automated GPU offloading.
Significant Performance Improvements
The experimental evaluations revealed a stark contrast between the base LLMs and their fine-tuned versions. Out-of-the-box LLMs failed to consistently generate valid OpenACC pragmas, with base Llama 3.1 achieving 0% exact match accuracy and base CodeLlama only 0.01%. However, after supervised fine-tuning on the ACCeLLiuM dataset, the models showed pronounced improvements:
- The fine-tuned CodeLlama model achieved a 50.4% exact-match accuracy on the held-out test set.
- It generated valid pragmas with the correct directive type for 87.3% of the data-parallel loops.
- The fine-tuned Llama 3.1 model also performed strongly, with 43% exact-match accuracy and the correct directive type for 89% of loops.
- Both fine-tuned models demonstrated high syntactic validity, with over 80% of generated pragmas successfully compiling with an OpenACC-compliant compiler, comparable to human-written reference pragmas.
Even when not an exact match, generated pragmas frequently incorporated correct clauses in a different order or included additional clauses that offered practical value beyond strict string-matching, such as finer control over parallel execution and data movement.
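The gap between strict string matching and more forgiving comparisons can be illustrated with a short sketch. The three functions below are assumed definitions written for this article, not the paper's evaluation code, and `CONSTRUCTS` is a deliberately partial list. The tokenizer also assumes no space between a clause name and its opening parenthesis.

```python
import re

# Construct names that may open a directive; a partial list chosen for this sketch.
CONSTRUCTS = {"parallel", "kernels", "serial", "loop", "data"}

def split_pragma(pragma):
    """Split '#pragma acc ...' into (directive words, set of clauses)."""
    body = re.sub(r"^\s*#\s*pragma\s+acc\s+", "", pragma.strip())
    tokens, current, depth = [], "", 0
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch.isspace():
            # Whitespace ends a token only outside parentheses; inside, it is dropped.
            if depth == 0 and current:
                tokens.append(current)
                current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    split_at = 0
    while split_at < len(tokens) and tokens[split_at] in CONSTRUCTS:
        split_at += 1
    return tuple(tokens[:split_at]), frozenset(tokens[split_at:])

def exact_match(reference, generated):
    """Strict comparison after collapsing whitespace."""
    return " ".join(reference.split()) == " ".join(generated.split())

def directive_match(reference, generated):
    """Same directive type, e.g. both 'parallel loop', regardless of clauses."""
    return split_pragma(reference)[0] == split_pragma(generated)[0]

def clause_set_match(reference, generated):
    """Same directive and same clauses, ignoring clause order."""
    return split_pragma(reference) == split_pragma(generated)

# Hypothetical reference and model output that differ only in clause order.
ref = "#pragma acc parallel loop copyin(a[0:n]) copy(b[0:n])"
gen = "#pragma acc parallel loop copy(b[0:n]) copyin(a[0:n])"
print(exact_match(ref, gen))       # False
print(directive_match(ref, gen))   # True
print(clause_set_match(ref, gen))  # True
```

Under these assumed definitions, a reordered but otherwise identical pragma fails exact match while passing both looser checks, which is the kind of case the paragraph above describes.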
Error Patterns and Future Directions
The most common errors in the fine-tuned models involved incorrect or partially correct clause generation, often due to the models only receiving the data-parallel loops as input, lacking full program context. For instance, a model might suggest `copyin` instead of `present` for data already on the GPU, leading to redundant transfers. Clause reordering was also common but often semantically correct, highlighting the limitations of strict string-based metrics.
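A `copyin`-versus-`present` mix-up of this kind can be surfaced by comparing which data clause each variable appears under. The sketch below is one possible way to do that and is not taken from the paper; `DATA_CLAUSES` is a partial list, the function names are hypothetical, and the regular expression assumes clause arguments contain no nested parentheses.

```python
import re

# A partial list of OpenACC data clauses, chosen for this sketch.
DATA_CLAUSES = {"copy", "copyin", "copyout", "create", "present"}

def data_clause_map(pragma):
    """Map each variable name to the data clause that mentions it."""
    mapping = {}
    for clause, args in re.findall(r"\b(\w+)\s*\(([^()]*)\)", pragma):
        if clause not in DATA_CLAUSES:
            continue
        for arg in args.split(","):
            var = re.match(r"\s*(\w+)", arg)  # leading identifier, e.g. 'a' in 'a[0:n]'
            if var:
                mapping[var.group(1)] = clause
    return mapping

def data_clause_mismatches(reference, generated):
    """List (variable, reference clause, generated clause) where the two disagree."""
    ref_map = data_clause_map(reference)
    gen_map = data_clause_map(generated)
    return [
        (var, ref_map[var], gen_map.get(var))
        for var in sorted(ref_map)
        if gen_map.get(var) != ref_map[var]
    ]

# Hypothetical case: 'a' already lives on the GPU, but the model asks to copy it in again.
reference = "#pragma acc parallel loop present(a[0:n], b[0:n])"
generated = "#pragma acc parallel loop copyin(a[0:n]) present(b[0:n])"
print(data_clause_mismatches(reference, generated))  # [('a', 'present', 'copyin')]
```

A report like this separates semantically meaningful differences, such as a redundant transfer, from harmless ones like clause reordering, which never shows up as a mismatch here.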
The researchers acknowledge that the current study focuses on annotating single, pre-identified data-parallel loops and uses static evaluation. Future work will involve dynamic evaluation with HPC benchmark suites to measure real-world speedup, extending capabilities to reason about larger code regions, and exploring reinforcement learning strategies for more robust parallelization assistants.
By publicly releasing the code, models, and dataset as ACCeLLiuM, the researchers hope to establish a reproducible benchmark and significantly lower the barrier to automated GPU offloading for developers and scientists who may not be experts in parallel programming. You can find more details about this research in the full paper: ACCeLLiuM: Supervised Fine-Tuning for Automated OpenACC Pragma Generation.


