TLDR: This research introduces a framework for optimizing Large Language Model (LLM) performance during inference by dynamically selecting generation strategies and allocating compute resources. Unlike previous methods, which primarily focused on token usage and parallel generation, this approach explicitly accounts for both token cost and wall-clock latency, and it covers incremental decoding methods such as beam search. Experiments show that this query-adaptive strategy consistently achieves better accuracy-cost trade-offs than static methods, making LLMs more efficient and responsive, especially for agentic workflows.
Large Language Models (LLMs) perform markedly better when they can generate several candidate answers and pick the best one. This approach, known as inference-time scaling, significantly boosts performance on complex tasks such as mathematical reasoning and coding. The gain comes at a considerable computational cost, however, and managing that cost efficiently is a major challenge.
Traditional methods for optimizing this process often focus solely on the number of tokens generated, which is a measure of computational load. They also tend to overlook incremental generation techniques such as beam search. Most importantly, they frequently ignore a factor that shapes user experience: wall-clock latency, the actual time it takes to generate a response. Latency matters most in interactive or ‘agentic’ systems, where a model must make many decisions in quick succession.
A New Approach to LLM Efficiency
A recent research paper, “Latency and Token-Aware Test-Time Compute,” introduces a novel framework that addresses these limitations. The authors propose treating inference-time scaling as a problem of dynamically allocating computational resources and selecting the best strategy for each individual query. Their framework explicitly considers both the token cost (how much computation is used) and the wall-clock latency (how long it takes), aiming to strike a better balance between accuracy and efficiency.
The core idea is to decide, for every query, which generation strategy to use and how much compute to give it. The paper explores different inference scaling methods:
- Sampling-based methods like Majority Voting and Best-of-N, which can generate multiple candidate responses in parallel, meaning latency doesn’t increase dramatically with more candidates.
- Beam Search, an incremental method where partial solutions are built step-by-step. This requires synchronization at each step, which can lead to higher latency.
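The latency difference between the two families can be illustrated with a toy timing model. This is only a sketch under simplifying assumptions that are not taken from the paper: candidates in a sampling-based method decode fully in parallel, each beam search step waits for its slowest beam before pruning, and reward-model scoring time is ignored. The function names and all timings are hypothetical.

```python
def parallel_sampling_latency(candidate_seconds):
    """Candidates decode concurrently, so the slowest one sets the wall-clock time."""
    return max(candidate_seconds)


def beam_search_latency(step_seconds_per_beam):
    """Each step waits for its slowest beam before pruning, so the step maxima add up."""
    return sum(max(step) for step in step_seconds_per_beam)


# Hypothetical decode times in seconds (illustrative only, not measurements).
four_samples = [3.0, 3.5, 2.75, 3.25]
eight_samples = four_samples + [3.25, 3.0, 2.5, 3.5]

# Three beam search steps, each with four beams expanded in parallel.
beam_steps = [
    [0.75, 1.0, 0.5, 1.0],
    [1.25, 1.0, 0.75, 1.0],
    [1.5, 1.25, 1.0, 0.75],
]

print(parallel_sampling_latency(four_samples))   # 3.5
print(parallel_sampling_latency(eight_samples))  # 3.5, doubling N leaves latency unchanged here
print(beam_search_latency(beam_steps))           # 3.75 = 1.0 + 1.25 + 1.5
```

In this model, doubling the number of parallel samples doubles the token cost but leaves latency unchanged, while every additional beam search step adds its slowest beam's time to the total. Real systems with limited parallel capacity will sit somewhere between these two extremes.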
To make these decisions, the system uses a ‘utility’ function that weighs accuracy against token cost and latency, based on user-defined preferences. Since the actual accuracy, token count, and latency aren’t known before generation, the framework trains lightweight predictors to estimate these values in advance. An ‘accuracy model’ estimates the probability of a correct answer, while ‘cost models’ use precomputed average token counts and execution times for different strategies.
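The selection step can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes the utility is linear (predicted accuracy minus a token penalty minus a latency penalty), and the strategy names, average costs, predicted accuracies, and penalty weights are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    avg_tokens: float     # precomputed average token count (hypothetical value)
    avg_latency_s: float  # precomputed average wall-clock seconds (hypothetical value)


def utility(p_correct, strategy, lambda_tokens, lambda_latency):
    """Assumed linear utility: predicted accuracy minus weighted token and latency costs."""
    return (p_correct
            - lambda_tokens * strategy.avg_tokens
            - lambda_latency * strategy.avg_latency_s)


def select_strategy(predicted_accuracy, strategies, lambda_tokens, lambda_latency):
    """Return the strategy with the highest estimated utility for one query."""
    return max(
        strategies,
        key=lambda s: utility(predicted_accuracy[s.name], s, lambda_tokens, lambda_latency),
    )


# Hypothetical cost table: one entry per (method, compute budget) pair.
STRATEGIES = [
    Strategy("majority_vote_n4", avg_tokens=1600, avg_latency_s=4.0),
    Strategy("best_of_n_n8", avg_tokens=3200, avg_latency_s=5.0),
    Strategy("beam_search_w4", avg_tokens=4000, avg_latency_s=15.0),
]

# Stand-in for the accuracy model's output on a single query.
predicted_accuracy = {
    "majority_vote_n4": 0.55,
    "best_of_n_n8": 0.62,
    "beam_search_w4": 0.70,
}

low_penalty_choice = select_strategy(predicted_accuracy, STRATEGIES, 1e-6, 1e-3)
high_penalty_choice = select_strategy(predicted_accuracy, STRATEGIES, 5e-5, 2e-2)

print(low_penalty_choice.name)   # beam_search_w4
print(high_penalty_choice.name)  # majority_vote_n4
```

With small penalty weights, the most accurate but slowest option wins; with larger weights, the same query is routed to the cheapest option. In a real system the `predicted_accuracy` dictionary would come from the trained accuracy model rather than being hard-coded.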
Experimental Insights
The researchers tested their query-adaptive strategy on the NuminaMath-CoT dataset, a benchmark for mathematical reasoning. They used Alibaba’s Qwen2.5-1.5B-Instruct as the generator and Qwen2.5-Math-PRM-7B as a reward model for evaluation. Across the experiments, the query-adaptive strategy consistently achieved better trade-offs between accuracy and cost than static, fixed strategies.
The experiments also revealed how the system adapts its choices. When penalties for latency and token usage were low, the adaptive method frequently opted for more compute-intensive strategies like beam search, prioritizing higher accuracy. As these penalties increased, the system shifted towards lighter, lower-cost options, significantly reducing latency and token usage while still maintaining competitive accuracy.
Looking Ahead
This work highlights the importance of considering both computational load and responsiveness in LLM inference. By dynamically adapting strategies based on query difficulty and user preferences for cost and latency, the framework offers a practical way to improve the efficiency of LLMs, especially in complex agentic workflows where models must handle multiple queries efficiently. Future work aims to extend this approach to other domains like coding and dialogue, and to further refine the accuracy prediction models.
You can read the full research paper here: Latency and Token-Aware Test-Time Compute.


