PerfDojo: Automating Machine Learning Library Optimization for Diverse Hardware

TLDR: PerfDojo introduces a method for automatically optimizing machine learning libraries across hardware architectures (CPUs, GPUs, accelerators). It uses a human-readable code representation that guarantees semantic validity during transformations. Coupled with PerfLLM, a system that combines Large Language Models and Reinforcement Learning, it learns to discover high-performance code optimizations without prior hardware-specific knowledge, reaching geometric mean speedups of up to 6.65 times over PyTorch and 13.65 times over TVM on the CPUs tested.

The world of machine learning is constantly evolving, with models becoming more complex and hardware architectures more diverse. This rapid advancement, however, brings a significant challenge: optimizing machine learning libraries to achieve peak performance across various CPUs, GPUs, and specialized accelerators. Traditionally, this optimization has been a time-consuming and highly specialized task, often requiring deep hardware knowledge and manual tuning.

A new research paper introduces an approach to this problem: PerfLLM, a methodology that uses Large Language Models (LLMs) and Reinforcement Learning (RL) for automated optimization. At the heart of PerfLLM is an environment called PerfDojo, which frames code optimization as a game. The game uses a human-readable, mathematically inspired code representation in which every transformation preserves the original meaning of the code.

The Challenge of Optimization

Modern machine learning models demand immense computational power. To meet this demand, a wide array of hardware, from NVIDIA A100 GPUs to Google TPU v4 and RISC-V processors, has emerged. Each of these architectures has unique instruction sets, memory layouts, and specialized requirements for different data types and model features like sparsity or quantization. Manually optimizing code for such a heterogeneous landscape is incredibly resource-intensive. Existing automated tools often rely on complex, hardware-specific rules and obscure intermediate representations, which makes them difficult to adapt and understand.

PerfDojo: A New Approach to Code Representation

PerfDojo addresses these limitations by providing a flexible and interpretable way to represent programs and their transformations. Imagine code as a set of mathematical formulas, where each step of optimization is a transformation that is guaranteed to be semantically valid. This means the code’s original function remains intact, even as its structure is changed for performance. This human-centric design not only helps engineers understand and debug the optimization process but also allows RL agents to explore and apply code transformations more effectively without needing prior hardware knowledge.

The system ensures correctness by embedding validity checks directly into the transformation logic. For example, if a transformation like ‘dimension reuse’ is applied, PerfDojo automatically verifies that it won’t break the code’s meaning. This eliminates the need for users to manually verify correctness, allowing the RL agent to focus solely on finding performance improvements.
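The idea of building the validity check into the transformation itself can be illustrated with a minimal sketch. Everything here is hypothetical: the `Loop` class, `split`, and `iteration_points` are illustrative names and not PerfDojo's actual representation or API, and loop splitting stands in for transformations such as 'dimension reuse', whose details the article does not give.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Loop:
    """One loop of a nest (hypothetical representation, not PerfDojo's)."""
    dim: str     # original dimension this loop iterates over
    extent: int  # trip count
    stride: int  # how far one step moves along the original dimension


def iteration_points(loops):
    """Enumerate the original index tuples visited by the nest, in order."""
    dims = sorted({loop.dim for loop in loops})
    points = []
    for steps in itertools.product(*(range(loop.extent) for loop in loops)):
        index = dict.fromkeys(dims, 0)
        for loop, step in zip(loops, steps):
            index[loop.dim] += step * loop.stride
        points.append(tuple(index[d] for d in dims))
    return points


def split(loops, i, factor):
    """Split loop i into an outer and an inner loop.

    The applicability check lives inside the transformation: a factor
    that does not divide the extent is rejected, so an agent can never
    produce a nest that skips or repeats iterations.
    """
    loop = loops[i]
    if factor <= 1 or loop.extent % factor != 0:
        raise ValueError(f"cannot split extent {loop.extent} by {factor}")
    outer = Loop(loop.dim, loop.extent // factor, loop.stride * factor)
    inner = Loop(loop.dim, factor, loop.stride)
    return loops[:i] + (outer, inner) + loops[i + 1:]


nest = (Loop("i", 8, 1), Loop("j", 4, 1))
tiled = split(nest, 0, 4)
# The transformed nest visits exactly the same index tuples.
print(sorted(iteration_points(tiled)) == sorted(iteration_points(nest)))
```

Because invalid moves are rejected at the point of application, the search procedure only ever sees semantically equivalent programs and can be scored on performance alone.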

PerfLLM: Learning to Optimize with AI

PerfLLM builds on PerfDojo by using LLMs to understand the program's state and RL to navigate the vast space of possible transformations. The LLM encodes the program's representation into a numerical vector that captures its current configuration. The RL agent then learns to select a sequence of transformations that improves performance. Traditional Q-learning maximizes the expected cumulative reward collected along a trajectory. PerfLLM instead uses a 'Max Q-learning' approach, which targets the single best outcome reachable from a state. This suits code optimization, where only the fastest program found matters, not the sum of the intermediate gains along the way.
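The difference between the two objectives can be seen in the update target. The sketch below is an assumption based on the maximum-reward formulation found in the RL literature, where the target is the larger of the immediate reward and the discounted best future value. The article does not state PerfLLM's exact update rule, and the function names and the discount factor `gamma` are illustrative.

```python
def q_target(reward, next_q_values, gamma=0.9):
    """Standard Q-learning target: immediate reward plus discounted future value."""
    best_next = max(next_q_values, default=0.0)
    return reward + gamma * best_next


def max_q_target(reward, next_q_values, gamma=0.9):
    """Max-reward target (assumed form): the best single reward reachable,
    either the one just received or the discounted best one still ahead."""
    best_next = max(next_q_values, default=0.0)
    return max(reward, gamma * best_next)


# Hypothetical values: reward 1.0 now, best future estimate 4.0.
print(q_target(1.0, [2.0, 4.0], gamma=0.5))      # sums the two contributions
print(max_q_target(1.0, [2.0, 4.0], gamma=0.5))  # keeps only the larger one
```

Under the max-style target, a sequence of transformations is valued by the best program it reaches, so a path with one very fast result is preferred over a path with many small gains.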

The reward system in PerfLLM is designed to incentivize actions that directly improve kernel runtime, rather than relying on relative speedups that could lead to unstable learning. This continuous feedback helps the agent learn efficiently, even in complex scenarios.

Impressive Performance Gains

The results of PerfLLM are compelling. On the GH200 CPU (Arm architecture), PerfLLM achieved a geometric mean speedup of 6.65 times over PyTorch and 13.65 times over TVM. On the AMD MI300A CPU (x86 architecture), it achieved a geometric mean speedup of 1.56 times over PyTorch and 1.80 times over TVM. These gains demonstrate PerfLLM's ability to discover highly optimized implementations across diverse hardware without explicit hardware-specific heuristics.
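For readers unfamiliar with the metric, a geometric mean speedup is the n-th root of the product of the per-kernel speedups, which keeps a single outlier kernel from dominating the summary. The short sketch below shows the calculation. The timings are made-up illustrative values, not measurements from the paper, and the function name is hypothetical.

```python
import math


def geometric_mean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-kernel speedups (baseline time / optimized time)."""
    ratios = [b / o for b, o in zip(baseline_times, optimized_times)]
    # Average in log space, then exponentiate; avoids overflow in long products.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Illustrative timings in milliseconds for three kernels (not from the paper).
baseline = [8.0, 2.0, 1.0]
optimized = [1.0, 1.0, 1.0]
print(geometric_mean_speedup(baseline, optimized))
```

With per-kernel speedups of 8, 2, and 1, the arithmetic mean would report about 3.67, while the geometric mean reports roughly 2.52, a more conservative summary.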

For instance, in an element-wise multiplication task, the RL-discovered variant outperformed PyTorch by 1.62 times and TVM by 1.22 times on MI300A. This was achieved by applying common optimization techniques like vectorizing the innermost loop for efficient data loading. In batch normalization, PerfLLM’s implementation on MI300A surpassed PyTorch by 1.12 times and TVM by 1.76 times by intelligently managing temporary computations and block sizes.

Looking Ahead

While the search process with PerfLLM is more computationally intensive than heuristic-guided methods, the one-time investment in optimizing a full library of operators represents a substantial saving compared to the manual engineering effort required to achieve similar performance levels on new hardware. This work paves the way for a future where machine learning libraries can be automatically generated and optimized for any new hardware architecture, significantly reducing development time and boosting performance across the board.

You can read the full research paper here.
