TLDR: Regression Language Models (RLMs) offer a novel, unified approach to predicting numerical outcomes of code execution, such as memory usage, latency, and neural network performance. By treating code-to-metric regression as a text-to-text problem, RLMs can process code directly as text and generate numerical predictions without extensive feature engineering. The models demonstrate strong performance across various programming languages and neural architecture search benchmarks, simplifying complex computational graph analysis into a next-token prediction task.
Predicting the numerical outcomes of code execution, such as how much memory a program will use or how fast it will run, has traditionally been a complex challenge. This task, often referred to as code-to-metric regression, typically required extensive and specialized “feature engineering” – essentially, hand-crafting specific characteristics from the code to feed into a prediction model.
However, a recent research paper introduces a groundbreaking approach: Regression Language Models (RLMs) for Code. These models offer a unified and simplified way to predict these crucial metrics directly from the code’s text, bypassing the need for laborious feature engineering.
A Unified Approach to Code Prediction
The core idea behind RLMs is to treat the prediction of numerical outcomes as a “text-to-text regression” problem. This means the model takes code as input text and directly generates the numerical prediction as output text. This is a significant departure from older methods that struggled with the open-ended and graph-like nature of programming languages.
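To make that framing concrete, here is a minimal sketch of what a training pair could look like under this setup. The prompt layout and the metric name are illustrative assumptions, not the paper's exact format:

```python
# Hypothetical training pair for code-to-metric regression.
code = """
def solve():
    data = [i * i for i in range(10_000)]
    return sum(data)
"""

# Input: the raw program text, optionally prefixed with the metric being requested.
model_input = f"predict peak_memory_kb:\n{code}"

# Target: the measured outcome written out as plain text, so the decoder can
# generate it like any other token sequence.
model_target = "11492"
```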
The researchers demonstrate that a single RLM can simultaneously predict a variety of metrics across different programming languages and computational structures. For instance, it can estimate:
- The memory footprint of code written in high-level languages like Python and C++.
- The latency (speed) of specialized Triton GPU kernels.
- The accuracy and speed of trained neural networks represented in ONNX format.
A relatively compact RLM with 300 million parameters, initialized from T5Gemma, achieved impressive results. It obtained a Spearman rank correlation above 0.9 when predicting peak memory usage for competitive programming submissions from the APPS dataset. Furthermore, a single unified model achieved an average Spearman rank correlation above 0.5 across 17 different languages from the CodeNet dataset. In the realm of Neural Architecture Search (NAS), the RLM even surpassed traditional graph neural networks, achieving the highest average Kendall-Tau of 0.46 across five classic NAS design spaces while simultaneously predicting architecture latencies on multiple hardware platforms.
How RLMs Work
At its heart, an RLM is structured as an encoder-decoder model. The encoder processes the input code (or computational graph represented as text), leveraging the flexibility of strings. The decoder then generates the numerical output. A key innovation is the use of explicit digit-by-digit numeric tokenization for the output, which avoids issues like numeric instabilities or the need to normalize values across vastly different scales (e.g., from 10⁻² to 10⁶).
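The exact output vocabulary isn't spelled out in this summary, but one plausible digit-by-digit scheme looks like the sketch below: a value is rendered as sign, mantissa-digit, and exponent tokens, so tiny and huge magnitudes share the same representation. The token names here are assumptions for illustration:

```python
import math

def to_digit_tokens(value: float, mantissa_digits: int = 4) -> list[str]:
    """Render a float as sign, mantissa-digit, and exponent tokens.

    This mirrors digit-by-digit numeric tokenization in spirit only; the exact
    token vocabulary used in the paper is an assumption here.
    """
    sign = "<+>" if value >= 0 else "<->"
    value = abs(value)
    if value == 0.0:
        exponent = 0
        digits = "0" * mantissa_digits
    else:
        exponent = math.floor(math.log10(value))
        mantissa = value / 10 ** exponent            # in [1, 10)
        scaled = round(mantissa * 10 ** (mantissa_digits - 1))
        if scaled >= 10 ** mantissa_digits:          # handle rounding overflow, e.g. 9.9999
            scaled //= 10
            exponent += 1
        digits = str(scaled).zfill(mantissa_digits)
    return [sign] + [f"<{d}>" for d in digits] + [f"<E{exponent}>"]

print(to_digit_tokens(4.2e-4))  # ['<+>', '<4>', '<2>', '<0>', '<0>', '<E-4>']
print(to_digit_tokens(1.5e6))   # ['<+>', '<1>', '<5>', '<0>', '<0>', '<E6>']
```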
This design also naturally supports multi-task regression, allowing a single model to be trained on diverse datasets and predict different metrics. It also enables multi-objective modeling, where the model can predict multiple related metrics sequentially, capturing their inherent correlations – for example, understanding how a neural network’s latency might relate to its accuracy.
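As a rough illustration, a multi-objective target might simply be several metrics serialized one after another, so that later predictions are conditioned on the tokens already decoded for earlier ones. The field names and ordering below are assumptions, not the paper's exact format:

```python
# Hypothetical multi-objective target for a single neural architecture.
metrics = {"accuracy": 0.934, "latency_ms": 12.7}

# Metrics are decoded left to right, so the latency prediction is conditioned on
# the accuracy tokens already emitted, letting the model capture correlations
# between the two objectives.
target = " ".join(f"{name} = {value:.4g}" for name, value in metrics.items())
print(target)   # accuracy = 0.934 latency_ms = 12.7
```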
Broad Applications and Future Impact
The research highlights the versatility of RLMs by testing them on a wide array of datasets. For high-level programming languages, they used APPS, KernelBook (for Triton kernel latency), and CodeNet. For neural network architectures, they converted over 520,000 unique architectures from various NAS benchmarks into a unified text-based ONNX intermediate representation.
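The paper's exact intermediate representation isn't reproduced here, but the sketch below shows one way an ONNX graph could be flattened into operator-by-operator text using the standard onnx Python package. The serialization format and the example file name are assumptions for illustration:

```python
import onnx
from onnx import helper

def onnx_to_text(path: str) -> str:
    """Flatten an ONNX graph into one line of text per operator (illustrative format)."""
    model = onnx.load(path)
    lines = []
    for node in model.graph.node:
        # Attributes such as kernel_shape or strides often drive latency and accuracy,
        # so they are kept inline with the operator.
        attrs = ", ".join(
            f"{a.name}={helper.get_attribute_value(a)}" for a in node.attribute
        )
        lines.append(
            f"{node.op_type}({', '.join(node.input)}) -> {', '.join(node.output)}"
            + (f" [{attrs}]" if attrs else "")
        )
    return "\n".join(lines)

# Example with a hypothetical file:
# print(onnx_to_text("nas_arch_00042.onnx"))
# Conv(input, conv1.weight) -> relu_in [kernel_shape=[3, 3], strides=[1, 1]]
# Relu(relu_in) -> pool_in
# ...
```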
The findings from ablation studies further underscore the robustness of RLMs. Pretraining the models on general language data, or even on synthetic regression tasks, significantly improves convergence and performance. Decoder-based numeric outputs consistently outperformed traditional MSE-based regression heads, and larger pretrained encoders yielded better regression performance. Custom tokenization tailored to ONNX graphs and longer sequence lengths also contributed to improved accuracy.
This work paves the way for a future where complex computational graph regression can be simplified into a generic “next-token prediction” problem, aligning seamlessly with the modern paradigm of large language models. This could have profound implications for speeding up program search, optimizing hardware-software co-design, and enhancing compiler optimization. For more in-depth technical details, you can refer to the full research paper: Regression Language Models for Code.


