CudaForge: AI Agents Streamline CUDA Kernel Optimization with Hardware Insights

TLDR: CudaForge is a novel, training-free AI framework that uses a multi-agent system (Coder and Judge) to automatically generate and optimize CUDA kernels. By mimicking human expert workflows and integrating real-time hardware feedback from tools like Nsight Compute, CudaForge achieves high correctness and significant speedups for AI applications, including LLMs. It demonstrates strong generalization across various GPUs and base models while being remarkably cost-efficient compared to existing methods.

Developing efficient software for Graphics Processing Units (GPUs) is crucial for today’s advanced Artificial Intelligence (AI) applications, especially for training large language models (LLMs). The specialized GPU routines at the heart of this software, known as CUDA kernels, are the backbone of high-performance computing on NVIDIA GPUs. However, writing these kernels by hand is complex, time-consuming, and expensive, and it requires deep expertise in GPU architecture and parallel programming.

Existing automated methods that use LLMs for code generation often fall short. They tend to produce inefficient kernels, incur high computational costs, and struggle to adapt to different hardware setups, so a kernel generated for one GPU may perform poorly on another.

Introducing CudaForge: A Smarter Way to Optimize CUDA Kernels

A new approach called CudaForge aims to overcome these challenges. CudaForge is a training-free, multi-agent system designed for generating and optimizing CUDA kernels. Its design is inspired by the iterative process human experts follow: developing an initial kernel, testing it for correctness, analyzing hardware performance feedback, and then making iterative improvements.

CudaForge employs two specialized LLM agents: a ‘Coder’ and a ‘Judge’. The Coder is responsible for generating the CUDA kernels based on task instructions and feedback. The Judge, on the other hand, evaluates these kernels by testing their correctness and, crucially, by analyzing hardware feedback. This feedback includes detailed metrics from tools like Nsight Compute (NCU), which reveal how efficiently the kernel is running on the GPU. The Judge then uses this information to identify performance bottlenecks and provide targeted optimization guidance back to the Coder.

How CudaForge Works

The process is iterative. The Coder generates a kernel, which is then tested. If there are errors (like compilation failures or incorrect outputs), the Judge provides ‘correction feedback’. Once a kernel is functionally correct, the Judge switches to ‘optimization mode’. It profiles the kernel with NCU, looking at metrics like memory usage, processor utilization, and warp efficiency. Based on these hardware insights, the Judge pinpoints the main performance bottleneck (e.g., if the kernel is limited by memory access or computation power) and suggests a specific optimization strategy to the Coder. This cycle of generation, testing, and feedback continues for several rounds, leading to progressively more efficient kernels.
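The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not CudaForge's actual code: the names `KernelCheck`, `optimize_kernel`, `coder`, `run_tests`, `judge_fix`, and `judge_optimize` are hypothetical, and the agents are passed in as plain callables so the control flow can be read (and run) without any LLM or GPU.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class KernelCheck:
    """Outcome of compiling and testing one candidate kernel (hypothetical type)."""
    compiled: bool
    correct: bool
    runtime_ms: Optional[float] = None  # only meaningful when the kernel is correct
    error: str = ""


def optimize_kernel(
    task: str,
    coder: Callable[[str, str], str],        # (task, feedback) -> kernel source
    run_tests: Callable[[str], KernelCheck],  # compile and compare with a reference
    judge_fix: Callable[[str, str], str],     # (kernel, error) -> correction feedback
    judge_optimize: Callable[[str], str],     # (kernel) -> optimization feedback
    max_rounds: int = 10,
) -> Tuple[Optional[str], float]:
    """Run the generate/test/feedback cycle and return the fastest correct kernel."""
    feedback = ""
    best_kernel, best_runtime = None, float("inf")
    for _ in range(max_rounds):
        kernel = coder(task, feedback)
        check = run_tests(kernel)
        if not (check.compiled and check.correct):
            # Correction mode: the Judge explains what went wrong.
            feedback = judge_fix(kernel, check.error)
            continue
        # The kernel is functionally correct, so remember it if it is the fastest so far.
        if check.runtime_ms is not None and check.runtime_ms < best_runtime:
            best_kernel, best_runtime = kernel, check.runtime_ms
        # Optimization mode: the Judge profiles the kernel and suggests one strategy.
        feedback = judge_optimize(kernel)
    return best_kernel, best_runtime
```

Keeping the best correct kernel seen so far is a design choice of this sketch: a later round can regress, and the loop should not lose an earlier, faster result.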

Key to CudaForge’s success are three design choices: a two-agent system that separates code generation from evaluation, an iterative optimization process that refines kernels over time, and the explicit integration of hardware feedback. This hardware awareness allows CudaForge to tailor optimizations to the specific GPU it’s running on, making it highly adaptable.
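To make the idea of hardware feedback concrete, here is a small sketch of how a bottleneck could be inferred from two throughput percentages. The metric names follow Nsight Compute's naming scheme, but which metrics CudaForge actually uses is not stated in this article. The function name `classify_bottleneck` and the 60% threshold are illustrative assumptions, and in CudaForge the diagnosis is made by an LLM Judge rather than a fixed rule.

```python
# Metric names in Nsight Compute style; the choice of these two is an assumption.
SM_PCT = "sm__throughput.avg.pct_of_peak_sustained_elapsed"
DRAM_PCT = "dram__throughput.avg.pct_of_peak_sustained_elapsed"


def classify_bottleneck(metrics: dict, high: float = 60.0) -> str:
    """Label a kernel from its compute and memory throughput (percent of peak).

    The threshold `high` is an illustrative value, not one taken from CudaForge.
    """
    sm = metrics[SM_PCT]
    dram = metrics[DRAM_PCT]
    if sm < high and dram < high:
        # Neither unit is busy: the kernel is likely stalled on latency or
        # launched with too little parallelism.
        return "latency-bound"
    return "memory-bound" if dram >= sm else "compute-bound"
```

A memory-bound label would typically lead to suggestions such as coalescing accesses or tiling through shared memory, while a compute-bound label points toward reducing instruction count or using tensor cores.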

Impressive Results and Cost Efficiency

CudaForge was evaluated on the KernelBench benchmark, which includes tasks of varying difficulty, from basic operations to complex neural network architectures. It achieved a 97.6% correctness rate for generated kernels and an average speedup of 1.68× over PyTorch baselines. These results surpass other state-of-the-art models, including OpenAI-o3 and the RL-based Kevin-32B.
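For readers unfamiliar with these two numbers, the sketch below shows one way they can be computed from per-task measurements. It assumes speedup is the baseline runtime divided by the kernel runtime, and that the average is an arithmetic mean over correct kernels only. The article does not say how the reported figures were aggregated, so treat this as an illustration; the function name `summarize` is hypothetical.

```python
from typing import List, Tuple


def summarize(runs: List[Tuple[bool, float, float]]) -> Tuple[float, float]:
    """Compute (correctness_rate, mean_speedup) from per-task results.

    Each run is (correct, baseline_ms, kernel_ms). Speedup is baseline_ms / kernel_ms,
    averaged over correct kernels only (an assumption of this sketch).
    """
    if not runs:
        return 0.0, 0.0
    speedups = [baseline / kernel for ok, baseline, kernel in runs if ok]
    rate = len(speedups) / len(runs)
    mean_speedup = sum(speedups) / len(speedups) if speedups else 0.0
    return rate, mean_speedup
```

Under this definition, a speedup above 1.0 means the generated kernel is faster than the PyTorch baseline, and a value below 1.0 means it is slower.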

Beyond performance, CudaForge also demonstrates strong generalization across different GPUs (like A100, RTX 6000, 4090, 3090) and various base LLMs (OpenAI-o3, GPT-5, Claude-Sonnet-4, etc.). This means it’s not tied to a single model or hardware platform. Furthermore, CudaForge is remarkably cost-effective. Generating an optimized kernel takes about 26.5 minutes on an RTX 6000 GPU and incurs an average API cost of only $0.30 per kernel. Other agentic approaches can cost considerably more in both GPU hours and API expenses.

The efficiency comes from several factors: the Judge’s targeted feedback, the use of a curated subset of critical NCU metrics (instead of overwhelming the system with all data), and a lightweight memory design for the agents, which reduces redundant context and computational overhead.
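Two of these factors, the curated metric subset and the lightweight memory, can be illustrated with a short sketch. The list `KEY_METRICS` is an assumed selection written in Nsight Compute's naming style, not CudaForge's actual list, and `select_metrics` and `build_judge_prompt` are hypothetical helpers. The prompt is built only from the current kernel and its current profile, which is one possible reading of "lightweight memory".

```python
# Assumed subset of profiler metrics; the real curated list is not given in this article.
KEY_METRICS = (
    "gpu__time_duration.sum",
    "sm__throughput.avg.pct_of_peak_sustained_elapsed",
    "dram__throughput.avg.pct_of_peak_sustained_elapsed",
    "sm__warps_active.avg.pct_of_peak_sustained_active",
)


def select_metrics(full_report: dict, keys=KEY_METRICS) -> dict:
    """Keep only the chosen metrics, preserving the order of `keys`."""
    return {name: full_report[name] for name in keys if name in full_report}


def build_judge_prompt(kernel_src: str, full_report: dict) -> str:
    """Build a Judge prompt from the current kernel and its filtered profile only.

    No earlier rounds are included, so the context stays small in every iteration.
    """
    metrics = select_metrics(full_report)
    lines = [f"{name}: {value}" for name, value in metrics.items()]
    return "Kernel:\n" + kernel_src + "\n\nProfile:\n" + "\n".join(lines)
```

Because a full profiler report can contain hundreds of metrics, filtering before prompting keeps token usage, and therefore API cost, roughly constant per round.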

CudaForge represents a significant step forward in automated CUDA kernel optimization, offering a practical, cost-effective, and highly performant solution for accelerating AI workloads. You can find the code and learn more about this project at the CudaForge GitHub repository.
