TLDR: CudaForge is a novel, training-free AI framework that uses a multi-agent system (Coder and Judge) to automatically generate and optimize CUDA kernels. By mimicking human expert workflows and integrating real-time hardware feedback from tools like Nsight Compute, CudaForge achieves high correctness and significant speedups for AI applications, including LLMs. It demonstrates strong generalization across various GPUs and base models while being remarkably cost-efficient compared to existing methods.
Efficient software for Graphics Processing Units (GPUs) is crucial for today’s Artificial Intelligence (AI) applications, especially for training large language models (LLMs). The GPU routines that do this work, known as CUDA kernels, are the backbone of high-performance computing on NVIDIA GPUs. However, writing these kernels by hand is complex, time-consuming, and expensive, and it requires deep expertise in GPU architecture and parallel programming.
Existing automated methods that use LLMs for code generation often fall short. They tend to produce inefficient kernels, incur high computational costs, and struggle to adapt to different hardware setups. The generated code may run, but it is often slow, and a kernel tuned for one GPU may perform poorly on another.
Introducing CudaForge: A Smarter Way to Optimize CUDA Kernels
A new approach called CudaForge aims to overcome these challenges. CudaForge is a training-free, multi-agent system designed for generating and optimizing CUDA kernels. Its design is inspired by the iterative process human experts follow: developing an initial kernel, testing it for correctness, analyzing hardware performance feedback, and then making iterative improvements.
CudaForge employs two specialized LLM agents: a ‘Coder’ and a ‘Judge’. The Coder is responsible for generating the CUDA kernels based on task instructions and feedback. The Judge, on the other hand, evaluates these kernels by testing their correctness and, crucially, by analyzing hardware feedback. This feedback includes detailed metrics from tools like Nsight Compute (NCU), which reveal how efficiently the kernel is running on the GPU. The Judge then uses this information to identify performance bottlenecks and provide targeted optimization guidance back to the Coder.
How CudaForge Works
The process is iterative. The Coder generates a kernel, which is then tested. If there are errors (like compilation failures or incorrect outputs), the Judge provides ‘correction feedback’. Once a kernel is functionally correct, the Judge switches to ‘optimization mode’. It profiles the kernel with NCU, looking at metrics like memory usage, processor utilization, and warp efficiency. Based on these hardware insights, the Judge pinpoints the main performance bottleneck (e.g., if the kernel is limited by memory access or computation power) and suggests a specific optimization strategy to the Coder. This cycle of generation, testing, and feedback continues for several rounds, leading to progressively more efficient kernels.
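The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not CudaForge's actual implementation: the names `forge`, `Feedback`, and `Result` are hypothetical, and the Coder, the Judge, the correctness check, and the profiler are passed in as plain callables so that the control flow stays visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Feedback:
    mode: str     # "correction" or "optimization"
    message: str  # guidance the Judge hands back to the Coder


@dataclass
class Result:
    kernel: Optional[str]  # fastest correct kernel found, or None
    runtime_ms: float      # its runtime (inf if no kernel was correct)


def forge(
    task: str,
    coder: Callable[[str, Optional[str], Optional[Feedback]], str],
    check: Callable[[str], Optional[str]],
    profile: Callable[[str], Tuple[float, Dict[str, float]]],
    judge: Callable[[str, Optional[str], Dict[str, float]], str],
    rounds: int = 10,  # hypothetical default, not taken from the paper
) -> Result:
    """Run a Coder/Judge loop and return the fastest correct kernel."""
    best = Result(kernel=None, runtime_ms=float("inf"))
    kernel: Optional[str] = None
    feedback: Optional[Feedback] = None
    for _ in range(rounds):
        # Coder: write or revise the kernel using the latest feedback only.
        kernel = coder(task, kernel, feedback)
        # Correctness first: `check` returns an error message or None.
        error = check(kernel)
        if error is not None:
            feedback = Feedback("correction", judge(kernel, error, {}))
            continue
        # Kernel is correct: profile it and keep it if it is the fastest so far.
        runtime_ms, metrics = profile(kernel)
        if runtime_ms < best.runtime_ms:
            best = Result(kernel, runtime_ms)
        # Judge: turn hardware metrics into one targeted suggestion.
        feedback = Feedback("optimization", judge(kernel, None, metrics))
    return best
```

In a real setup, `check` would compile the kernel and compare its output against a reference, and `profile` would wrap a profiler run; here they are left abstract. Passing only the previous kernel and the latest feedback to the Coder is a design choice in this sketch that keeps each prompt short.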
Key to CudaForge’s success are three design choices: a two-agent system that separates code generation from evaluation, an iterative optimization process that refines kernels over time, and the explicit integration of hardware feedback. This hardware awareness allows CudaForge to tailor optimizations to the specific GPU it’s running on, making it highly adaptable.
Impressive Results and Cost Efficiency
CudaForge was evaluated on the KernelBench benchmark, which includes tasks of varying difficulty, from basic operations to complex neural network architectures. It achieved a 97.6% correctness rate for generated kernels and an average speedup of 1.68× over PyTorch baselines. These results surpass other state-of-the-art models, including OpenAI-o3 and the RL-based Kevin-32B.
Beyond performance, CudaForge also demonstrates strong generalization across different GPUs (like A100, RTX 6000, 4090, 3090) and various base LLMs (OpenAI-o3, GPT-5, Claude-Sonnet-4, etc.). This means it’s not tied to a single model or hardware. Furthermore, CudaForge is remarkably cost-effective. Generating an optimized kernel takes about 26.5 minutes on an RTX 6000 GPU and incurs an average API cost of only $0.3 per kernel. This is significantly cheaper than other agentic approaches, which can cost much more in terms of GPU hours and API expenses.
The efficiency comes from several factors: the Judge’s targeted feedback, the use of a curated subset of critical NCU metrics (instead of overwhelming the system with all data), and a lightweight memory design for the agents, which reduces redundant context and computational overhead.
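To make the idea of a curated metric subset concrete, here is a small Python sketch. It is an assumption-laden illustration: the three metric names follow Nsight Compute's naming style, but this particular selection, the 60% threshold, and the helper names `select_metrics` and `classify_bottleneck` are hypothetical and are not taken from CudaForge.

```python
from typing import Dict

# Illustrative subset only; CudaForge's actual metric list may differ.
KEY_METRICS = (
    "dram__throughput.avg.pct_of_peak_sustained_elapsed",  # memory pressure
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",    # compute pressure
    "sm__warps_active.avg.pct_of_peak_sustained_active",   # achieved occupancy
)


def select_metrics(report: Dict[str, float]) -> Dict[str, float]:
    """Keep only the curated metrics that appear in a full profiler report."""
    return {name: report[name] for name in KEY_METRICS if name in report}


def classify_bottleneck(metrics: Dict[str, float], threshold: float = 60.0) -> str:
    """Rough heuristic: label the kernel by whichever unit is near saturation."""
    dram = metrics.get(KEY_METRICS[0], 0.0)
    sm = metrics.get(KEY_METRICS[1], 0.0)
    if dram >= threshold and dram >= sm:
        return "memory-bound"
    if sm >= threshold:
        return "compute-bound"
    # Neither unit is saturated, so stalls or low occupancy likely dominate.
    return "latency-bound"
```

Sending the Judge a handful of values like these, rather than a full profiler report, keeps its prompt small, which is consistent with the cost argument made above.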
CudaForge represents a significant step forward in automated CUDA kernel optimization, offering a practical, cost-effective, and highly performant solution for accelerating AI workloads. You can find the code and learn more about this project at the CudaForge GitHub repository.


