TLDR: GEAK is an AMD framework that uses AI agents and Large Language Models (LLMs) to automatically generate highly efficient GPU kernels in the Triton language for AMD Instinct GPUs. It significantly outperforms direct LLM prompting and other methods, achieving higher correctness and speedup by employing an iterative refinement process with Generator, Evaluator, Reflector, and Optimizer modules. The framework also introduces new, robust benchmarks for evaluating AI-generated GPU code and has been open-sourced to foster community collaboration.
The world of artificial intelligence (AI) is constantly evolving, and with it, the demand for highly efficient and specialized software that can run on powerful Graphics Processing Units (GPUs). As AI workloads become more complex and diverse, there’s a growing need to automate the creation of low-level GPU programs, known as kernels, to ensure top-notch performance and productivity.
Traditionally, developing these kernels requires significant manual effort and expert knowledge to optimize them for specific hardware. However, major tech companies and research institutions are now heavily investing in AI-driven code generation for GPUs, aiming to reduce this manual work while achieving performance levels comparable to human experts.
One language that has gained popularity for this kind of AI-driven kernel generation is Triton, a Python-based language for GPU programming that strikes a good balance between performance and ease of coding.
Introducing GEAK: AMD’s Innovative AI Agent
In this landscape, Advanced Micro Devices, Inc. (AMD) has introduced a groundbreaking framework called GEAK (Generating Efficient AI-centric GPU Kernels). GEAK is an agent-based system that leverages cutting-edge Large Language Models (LLMs) to automatically generate high-performing Triton code specifically for AMD GPUs, including the powerful AMD Instinct™ MI300X and MI250.
GEAK stands out because it uses a sophisticated reasoning loop, inspired by Reflexion-style feedback mechanisms, to refine the generated code. This means the AI agent doesn’t just generate code once; it iteratively improves it based on evaluation and reflection.
How GEAK Works: A Modular Approach
The GEAK system is built with four core modules that work together in a pipeline (a minimal sketch of the loop follows this list):
- Generator: This module creates the initial code based on a user’s request and any relevant context.
- Evaluator: It tests the generated code for correctness and performance. If the code fails, it provides error traces.
- Reflector: This module analyzes failed code and error traces to identify issues and suggest solutions, feeding this feedback back to the Generator.
- Optimizer: For functionally correct code, the Optimizer module devises strategies to enhance its performance, focusing on speed and efficiency.
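To make the flow concrete, here is a minimal Python sketch of how such a generate-evaluate-reflect-optimize loop could be wired together. All class names, method names, and the control flow are illustrative assumptions, not GEAK's actual API:

```python
# Minimal sketch of a Generator/Evaluator/Reflector/Optimizer loop.
# All names and fields here are hypothetical, not GEAK's actual API.

def agent_loop(task, generator, evaluator, reflector, optimizer,
               max_iters=10):
    feedback = None
    for _ in range(max_iters):
        # Generator produces (or revises) a Triton kernel candidate.
        code = generator.generate(task, feedback=feedback)

        # Evaluator checks correctness and measures performance.
        result = evaluator.run(code)

        if not result.passed:
            # Reflector turns error traces into actionable feedback
            # that is fed into the next generation attempt.
            feedback = reflector.analyze(code, result.error_trace)
            continue

        # For functionally correct code, the Optimizer proposes
        # performance-oriented rewrites (e.g., better memory access).
        return optimizer.improve(code, result.perf_profile)

    return None  # no correct kernel found within the iteration budget
```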
To further boost its capabilities, GEAK incorporates several techniques (a sketch of the prompt assembly and debugging trap follows this list):
- 1-shot prompting: A similar, existing Triton code example is included in the prompt to guide the LLM.
- Knowledge Injection: The prompt is enriched with domain-specific guidance on writing efficient Triton kernels and with hardware specifications.
- Reflexion: Enables self-correction and iterative refinement based on evaluation feedback.
- LLM as Optimizer: An LLM component identifies performance improvements for functionally correct code.
- Debugging Trap: A mechanism that prevents the agent from getting stuck on persistent bugs.
- Parallel Scaling: Multiple GEAK instances run simultaneously to generate diverse, and potentially better, candidate kernels.
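The first two techniques amount to careful prompt construction, and the debugging trap is essentially a stopping rule. The sketch below shows one plausible way to implement both; the prompt layout, guideline text, and the repeated-error threshold are assumptions for illustration, not GEAK's exact implementation:

```python
# Hypothetical sketch of 1-shot prompting with knowledge injection,
# plus a simple "debugging trap" stopping rule. Layout and thresholds
# are illustrative assumptions, not GEAK's exact implementation.

TRITON_GUIDELINES = """\
- Use masks on tl.load/tl.store to guard out-of-bounds accesses.
- Prefer contiguous (coalesced) memory access within a block.
"""

def build_prompt(task_description, similar_kernel_example, hw_specs):
    # Knowledge injection: guidelines + hardware specs go into the prompt.
    # 1-shot prompting: one similar existing kernel guides the LLM.
    return (
        f"You write efficient Triton kernels for AMD GPUs.\n"
        f"Hardware: {hw_specs}\n"
        f"Guidelines:\n{TRITON_GUIDELINES}\n"
        f"Example of a similar kernel (1-shot):\n{similar_kernel_example}\n"
        f"Task:\n{task_description}\n"
    )

MAX_SAME_ERROR = 3  # debugging trap: give up on a stuck bug

def should_abandon(error_history):
    # If the last few attempts failed with the identical error, abandon
    # this line of refinement instead of looping on the same bug.
    recent = error_history[-MAX_SAME_ERROR:]
    return len(recent) == MAX_SAME_ERROR and len(set(recent)) == 1
```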
Performance and Benchmarks
The researchers evaluated GEAK using two benchmark suites: a revised version of TritonBench (TritonBench-revised) and a new set of real-world kernels from open-source AMD ROCm repositories, called the ROCm Triton Benchmark. These benchmarks measure three key aspects (a computation sketch follows this list):
- Call Accuracy: How often the generated kernels compile and run without errors.
- Execution Accuracy: The percentage of kernels that pass all unit tests.
- Speedup: How much faster the AI-generated kernels run compared to reference kernels.
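As a rough illustration, the three metrics could be aggregated from per-kernel results as sketched below. The field names, and the assumption that speedup is averaged over correct kernels only, are illustrative, not taken from the paper:

```python
# Sketch of aggregating the three benchmark metrics from per-kernel
# results. Field names and averaging conventions are assumptions.

def summarize(results):
    n = len(results)
    call_acc = sum(r["compiled_and_ran"] for r in results) / n
    exec_acc = sum(r["passed_all_tests"] for r in results) / n
    # Speedup relative to the reference implementation's latency,
    # computed here over correct kernels only.
    correct = [r for r in results if r["passed_all_tests"]]
    speedups = [r["ref_latency"] / r["gen_latency"] for r in correct]
    avg_speedup = sum(speedups) / len(speedups) if speedups else 0.0
    return call_acc, exec_acc, avg_speedup
```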
The results are impressive. GEAK significantly outperformed direct prompting of state-of-the-art LLMs (like GPT-4.1, Gemini 2.5 Pro, and Claude 3.7 Sonnet). While direct prompting often yielded less than 15% correctness, GEAK achieved up to 54.89% execution accuracy on TritonBench-revised and 63.33% on the ROCm Triton Benchmark. Furthermore, GEAK-generated kernels demonstrated an average speedup of up to 2.59 times over their reference counterparts.
A detailed study on a specific kernel, ‘test_triton_flip.py’ from the ROCm Triton Benchmark, showed GEAK’s generated code achieved a 2.26x speedup. This was attributed to GEAK’s optimized memory access patterns, better memory efficiency, explicit masking, and coalesced memory access, which reduced memory bandwidth usage and register pressure compared to expert-written code.
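For intuition about what "explicit masking" and "coalesced memory access" look like in Triton, here is a generic kernel that reverses a 1D tensor. This is an illustrative sketch, not the code GEAK generated for `test_triton_flip.py`:

```python
# Illustrative Triton flip kernel (not GEAK's generated code):
# contiguous loads, explicit tail masking, mirrored stores.
import torch
import triton
import triton.language as tl

@triton.jit
def flip_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements               # explicit masking at the tail
    x = tl.load(x_ptr + offsets, mask=mask)   # contiguous (coalesced) load
    # Each element lands at its mirrored index; within a block the
    # destination addresses still form one contiguous, reversed range.
    tl.store(out_ptr + (n_elements - 1 - offsets), x, mask=mask)

def flip(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    flip_kernel[grid](x, out, n, BLOCK_SIZE=1024)
    return out
```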
The study also highlighted the benefits of increasing computational resources during inference. Both sequential scaling (more iterations of refinement) and parallel scaling (running multiple GEAK instances) led to substantial improvements in accuracy and performance, demonstrating the framework’s flexibility and robustness across different hardware platforms.
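Parallel scaling, in particular, is straightforward to sketch: run several independent agent instances and keep the fastest correct kernel. The `run_geak_instance` function and candidate fields below are hypothetical stand-ins for one full pipeline run:

```python
# Sketch of parallel scaling: launch several independent agent runs
# and keep the fastest correct candidate. `run_geak_instance` and the
# candidate attributes are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor

def parallel_scale(task, run_geak_instance, n_instances=8):
    with ThreadPoolExecutor(max_workers=n_instances) as pool:
        candidates = list(pool.map(
            lambda seed: run_geak_instance(task, seed=seed),
            range(n_instances),
        ))
    correct = [c for c in candidates if c is not None and c.passed]
    # Pick the candidate with the lowest measured latency.
    return min(correct, key=lambda c: c.latency) if correct else None
```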
Conclusion and Future Outlook
GEAK represents a significant step forward in automating the generation of efficient GPU kernels. By combining advanced LLMs with a structured, agent-based framework, it iteratively refines code for both correctness and performance without needing additional training. The introduction of new, robust benchmarks further solidifies the evaluation of AI-generated GPU code.
AMD has open-sourced the GEAK agent implementation and evaluation framework, inviting the open-source community to contribute and accelerate the development of GPU kernels. This initiative aims to foster innovation and collaboration, ultimately improving the efficiency of training and inference for large-scale AI models. You can find more details in the original research paper.