TLDR: Regression Language Models (RLMs) offer a novel, unified approach to predicting numerical outcomes of code execution, such as memory usage, latency, and neural network performance. By treating code-to-metric regression as a text-to-text problem, RLMs can process code directly as text and generate numerical predictions without extensive feature engineering. The models demonstrate strong performance across various programming languages and neural architecture search benchmarks, simplifying complex computational graph analysis into a next-token prediction task.
Predicting the numerical outcomes of code execution, such as how much memory a program will use or how fast it will run, has traditionally been a complex challenge. This task, often referred to as code-to-metric regression, typically required extensive and specialized “feature engineering” – essentially, hand-crafting specific characteristics from the code to feed into a prediction model.
However, a recent research paper introduces a groundbreaking approach: Regression Language Models (RLMs) for Code. These models offer a unified and simplified way to predict these crucial metrics directly from the code’s text, bypassing the need for laborious feature engineering.
A Unified Approach to Code Prediction
The core idea behind RLMs is to treat the prediction of numerical outcomes as a “text-to-text regression” problem. This means the model takes code as input text and directly generates the numerical prediction as output text. This is a significant departure from older methods that struggled with the open-ended and graph-like nature of programming languages.
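To make that framing concrete, here is a minimal sketch of what a training pair could look like under this setup. The prompt layout and the metric name are illustrative assumptions, not the paper's exact format:

```python
# Hypothetical training pair for code-to-metric regression.
code = """
def solve():
    data = [i * i for i in range(10_000)]
    return sum(data)
"""

# Input: the raw program text, optionally prefixed with the metric being requested.
model_input = f"predict peak_memory_kb:\n{code}"

# Target: the measured outcome written out as plain text, so the decoder can
# generate it like any other token sequence.
model_target = "11492"
```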
The researchers demonstrate that a single RLM can simultaneously predict a variety of metrics across different programming languages and computational structures. For instance, it can estimate:
- The memory footprint of code written in high-level languages like Python and C++.
- The latency (speed) of specialized Triton GPU kernels.
- The accuracy and speed of trained neural networks represented in ONNX format.
A relatively compact RLM with 300 million parameters, initialized from T5Gemma, achieved impressive results. It obtained a Spearman rank correlation above 0.9 when predicting peak memory usage for competitive programming submissions from the APPS dataset. Furthermore, a single unified model achieved an average Spearman rank correlation above 0.5 across 17 different languages from the CodeNet dataset. In the realm of Neural Architecture Search (NAS), the RLM even surpassed traditional graph neural networks, achieving the highest average Kendall-Tau of 0.46 across five classic NAS design spaces while simultaneously predicting architecture latencies on multiple hardware platforms.
How RLMs Work
At its heart, an RLM is structured as an encoder-decoder model. The encoder processes the input code (or computational graph represented as text), leveraging the flexibility of strings. The decoder then generates the numerical output. A key innovation is the use of explicit digit-by-digit numeric tokenization for the output, which avoids issues like numeric instabilities or the need to normalize values across vastly different scales (e.g., from 10⁻² to 10⁶).
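The exact output vocabulary isn't spelled out in this summary, but one plausible digit-by-digit scheme looks like the sketch below: a value is rendered as sign, mantissa-digit, and exponent tokens, so tiny and huge magnitudes share the same representation. The token names here are assumptions for illustration:

```python
import math

def to_digit_tokens(value: float, mantissa_digits: int = 4) -> list[str]:
    """Render a float as sign, mantissa-digit, and exponent tokens.

    This mirrors digit-by-digit numeric tokenization in spirit only; the exact
    token vocabulary used in the paper is an assumption here.
    """
    sign = "<+>" if value >= 0 else "<->"
    value = abs(value)
    if value == 0.0:
        exponent = 0
        digits = "0" * mantissa_digits
    else:
        exponent = math.floor(math.log10(value))
        mantissa = value / 10 ** exponent            # in [1, 10)
        scaled = round(mantissa * 10 ** (mantissa_digits - 1))
        if scaled >= 10 ** mantissa_digits:          # handle rounding overflow, e.g. 9.9999
            scaled //= 10
            exponent += 1
        digits = str(scaled).zfill(mantissa_digits)
    return [sign] + [f"<{d}>" for d in digits] + [f"<E{exponent}>"]

print(to_digit_tokens(4.2e-4))  # ['<+>', '<4>', '<2>', '<0>', '<0>', '<E-4>']
print(to_digit_tokens(1.5e6))   # ['<+>', '<1>', '<5>', '<0>', '<0>', '<E6>']
```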
This design also naturally supports multi-task regression, allowing a single model to be trained on diverse datasets and predict different metrics. It also enables multi-objective modeling, where the model can predict multiple related metrics sequentially, capturing their inherent correlations – for example, understanding how a neural network’s latency might relate to its accuracy.
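As a rough illustration, a multi-objective target might simply be several metrics serialized one after another, so that later predictions are conditioned on the tokens already decoded for earlier ones. The field names and ordering below are assumptions, not the paper's exact format:

```python
# Hypothetical multi-objective target for a single neural architecture.
metrics = {"accuracy": 0.934, "latency_ms": 12.7}

# Metrics are decoded left to right, so the latency prediction is conditioned on
# the accuracy tokens already emitted, letting the model capture correlations
# between the two objectives.
target = " ".join(f"{name} = {value:.4g}" for name, value in metrics.items())
print(target)   # accuracy = 0.934 latency_ms = 12.7
```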
Broad Applications and Future Impact
The research highlights the versatility of RLMs by testing them on a wide array of datasets. For high-level programming languages, they used APPS, KernelBook (for Triton kernel latency), and CodeNet. For neural network architectures, they converted over 520,000 unique architectures from various NAS benchmarks into a unified text-based ONNX intermediate representation.
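The paper's exact intermediate representation isn't reproduced here, but the sketch below shows one way an ONNX graph could be flattened into operator-by-operator text using the standard onnx Python package. The serialization format and the example file name are assumptions for illustration:

```python
import onnx
from onnx import helper

def onnx_to_text(path: str) -> str:
    """Flatten an ONNX graph into one line of text per operator (illustrative format)."""
    model = onnx.load(path)
    lines = []
    for node in model.graph.node:
        # Attributes such as kernel_shape or strides often drive latency and accuracy,
        # so they are kept inline with the operator.
        attrs = ", ".join(
            f"{a.name}={helper.get_attribute_value(a)}" for a in node.attribute
        )
        lines.append(
            f"{node.op_type}({', '.join(node.input)}) -> {', '.join(node.output)}"
            + (f" [{attrs}]" if attrs else "")
        )
    return "\n".join(lines)

# Example with a hypothetical file:
# print(onnx_to_text("nas_arch_00042.onnx"))
# Conv(input, conv1.weight) -> relu_in [kernel_shape=[3, 3], strides=[1, 1]]
# Relu(relu_in) -> pool_in
# ...
```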
The findings from ablation studies further underscore the robustness of RLMs. Pretraining the models on general language data, or even on synthetic regression tasks, significantly improves convergence and performance. Decoder-based numeric outputs consistently outperformed traditional MSE-based regression heads, and larger pretrained encoders yielded better regression performance. Custom tokenization tailored to ONNX graphs and longer sequence lengths also contributed to improved accuracy.
This work paves the way for a future where complex computational graph regression can be simplified into a generic “next-token prediction” problem, aligning seamlessly with the modern paradigm of large language models. This could have profound implications for speeding up program search, optimizing hardware-software co-design, and enhancing compiler optimization. For more in-depth technical details, you can refer to the full research paper: Regression Language Models for Code.


