TL;DR: A new research paper introduces ‘Power Sampling,’ an iterative, training-free algorithm that lets base language models reach reasoning performance comparable to, and sometimes exceeding, that of reinforcement learning (RL) post-trained models. Inspired by Markov chain Monte Carlo (MCMC) techniques, Power Sampling uses the base model’s own likelihoods to sample from a ‘power distribution,’ improving single-shot and multi-shot reasoning on benchmarks such as MATH500, HumanEval, and GPQA while avoiding the diversity collapse characteristic of RL. The results suggest that base models hold significant untapped reasoning potential that smarter inference-time sampling can unlock.
Large Language Models (LLMs) have shown remarkable reasoning abilities across many fields, largely due to post-training methods like reinforcement learning (RL). However, a key question has been whether these enhanced capabilities are truly novel behaviors learned during RL, or if they are simply a ‘sharpened’ version of what the base models already possess. A new research paper, titled “Reasoning with Sampling: Your Base Model is Smarter Than You Think,” explores this question from a fresh perspective.
Authored by Aayush Karan and Yilun Du from Harvard University, this paper introduces a surprising finding: comparable reasoning capabilities can be drawn from base models at inference time through pure sampling, without any additional training. This approach challenges the notion that extensive post-training is always necessary to unlock advanced reasoning.
Unlocking Latent Reasoning with Power Sampling
The researchers propose a simple iterative sampling algorithm, inspired by Markov chain Monte Carlo (MCMC) techniques, which leverages the base models’ own likelihoods. They call this method ‘Power Sampling.’ The core idea is to sample from a ‘power distribution,’ which effectively reweights the base model’s distribution, giving more emphasis to high-likelihood regions and less to low-likelihood ones. This sharpening effect is similar to what RL aims to achieve, but Power Sampling does it without any training.
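In symbols (a brief sketch; the notation here is ours, not necessarily the paper’s): if the base model assigns probability p(x) to a complete sequence x, the power distribution with exponent α > 1 simply raises that likelihood to the power α and renormalizes over sequences:

```latex
% Power distribution over complete sequences x, with sharpening exponent \alpha
\pi_\alpha(x) = \frac{p(x)^{\alpha}}{\sum_{x'} p(x')^{\alpha}}
% \alpha = 1 recovers the base model; larger \alpha concentrates mass on high-likelihood sequences
```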
Unlike traditional low-temperature sampling, which can sometimes favor tokens with many low-likelihood future paths, Power Sampling is designed to encourage sampling tokens that lead to fewer but higher-likelihood future paths. This behavior is particularly valuable for complex reasoning tasks, where choosing the ‘right’ pivotal tokens can significantly impact the correctness of the output.
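To make the distinction concrete, here is a small toy calculation (the numbers are invented for illustration and are not from the paper). It compares what next-token, temperature-style sampling sees with the sequence-level mass the power distribution assigns when one token leads to a single strong continuation and another leads to many weak ones:

```python
# Toy illustration: sequence-level sharpening vs. next-token (temperature) sampling.
# Two-step sequences: a first token ("A" or "B") followed by one continuation.

ALPHA = 4.0  # sharpening exponent (the paper reports alpha = 4.0 works well)

# Hand-picked base-model probabilities, for illustration only.
p_first = {"A": 0.4, "B": 0.6}
p_cont = {
    "A": [0.9, 0.1],   # one dominant, high-likelihood continuation
    "B": [0.1] * 10,   # many equally mediocre continuations
}

for tok in ("A", "B"):
    marginal = p_first[tok]  # what next-token (low-temperature) sampling looks at
    # Unnormalized sequence-level mass under p(x)^alpha, summed over continuations.
    power_mass = sum((p_first[tok] * c) ** ALPHA for c in p_cont[tok])
    print(f"{tok}: next-token prob = {marginal:.2f}, power mass = {power_mass:.6f}")

# "B" wins on immediate next-token probability (0.60 vs. 0.40), but "A" carries
# far more mass under the power distribution because it leads to a single
# high-likelihood path rather than many low-likelihood ones.
```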
Key Advantages Over Reinforcement Learning
Power Sampling offers several significant advantages:
- Training-Free: It requires no additional training, curated datasets, or verifier, which are common requirements and potential weaknesses of RL methods. This makes it broadly applicable, even in domains where ground-truth verification is difficult.
- Enhanced Performance: The algorithm delivers substantial gains in reasoning performance, nearly matching and sometimes outperforming RL post-training on a variety of single-shot tasks. These include MATH500 (mathematics), HumanEval (coding), and GPQA (science), as well as the non-verifiable AlpacaEval 2.0 benchmark for general helpfulness.
- Maintained Diversity: RL post-training commonly collapses generation diversity across multiple samples. Power Sampling avoids this collapse, delivering strong multi-shot reasoning performance as measured by pass@k accuracy (see the estimator below).
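For reference, pass@k is usually reported with the standard unbiased estimator introduced for code benchmarks such as HumanEval (this is general background, not something specific to the paper): draw n ≥ k samples per problem, count the c correct ones, and average

```latex
% Unbiased pass@k estimator: n samples per problem, c of which are correct
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right]
```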
For instance, on MATH500, an in-domain task for RL, Power Sampling achieves accuracies on par with Group Relative Policy Optimization (GRPO), a standard RL algorithm. On out-of-domain tasks like HumanEval and AlpacaEval 2.0, Power Sampling consistently outperforms GRPO, showcasing its generalizability.
How It Works: An Iterative Process
The algorithm works by progressively sampling from a series of intermediate distributions. It initializes a Metropolis-Hastings procedure, an MCMC algorithm, by extending a prefix with a proposal LLM, then iteratively resamples token subsequences based on their base model likelihoods, accepting or rejecting each candidate so that the chain converges toward the target power distribution. Although this involves multiple inference calls, the method is still ‘single-shot’: it ultimately returns one high-quality sequence, and every decision along the way relies only on base model likelihoods rather than an external verifier or reranker.
The researchers found that an intermediate ‘alpha’ value of 4.0 for the power distribution and a moderate number of MCMC steps (around 10) yielded optimal performance. This approach essentially expends additional computational resources during inference to obtain a higher-quality, higher-likelihood sample, a concept the authors refer to as ‘inference-time scaling.’
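As a rough illustration of the loop described above, here is a minimal sketch of Metropolis-Hastings sampling targeting the power distribution. It assumes, as a simplification, that each proposal resamples a suffix of the current completion from the base model, and it treats the prompt and completion as token lists; `sample_from_base` and `logprob_from_base` are hypothetical stand-ins for calls to an actual LLM, not functions from the paper or any library:

```python
import math
import random

ALPHA = 4.0      # sharpening exponent; the paper reports alpha = 4.0 works well
NUM_STEPS = 10   # roughly the number of MCMC refinement steps the paper uses

def power_sample(prompt, sample_from_base, logprob_from_base):
    """Sketch: Metropolis-Hastings sampling from the power distribution p(x)^ALPHA.

    sample_from_base(prefix)          -> list of tokens completing the prefix (hypothetical helper)
    logprob_from_base(prefix, tokens) -> base-model log-probability of those tokens (hypothetical helper)
    """
    # Initialize the chain with an ordinary completion from the base model.
    tokens = sample_from_base(prompt)

    for _ in range(NUM_STEPS):
        # Proposal: pick a position and resample the suffix from the base model.
        i = random.randrange(len(tokens))
        prefix = tokens[:i]
        old_suffix = tokens[i:]
        new_suffix = sample_from_base(prompt + prefix)

        # Base-model log-likelihoods of the old and new suffixes given the prefix.
        logp_old = logprob_from_base(prompt + prefix, old_suffix)
        logp_new = logprob_from_base(prompt + prefix, new_suffix)

        # With the base model as the proposal and p(x)^ALPHA as the target, the
        # Metropolis-Hastings acceptance probability reduces to the suffix
        # likelihood ratio raised to (ALPHA - 1).
        log_accept = (ALPHA - 1.0) * (logp_new - logp_old)
        if math.log(random.random()) < log_accept:
            tokens = prefix + new_suffix  # accept the proposed rewrite

    return tokens
```

Note that this sketch collapses the paper’s progression through intermediate distributions into a single loop and resamples whole suffixes for readability; the acceptance rule shown is the standard Metropolis-Hastings ratio specialized to these simplifying assumptions.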
Also Read:
- Unlocking LLM Evaluation: How Confidence Scores Can Transform Reward Models
- Boosting LLM Reasoning with Last-Token Self-Rewarding
Implications for LLM Development
The success of Power Sampling suggests that existing base models possess far greater latent reasoning capabilities than previously understood, capabilities that standard sampling methods may not fully surface. The findings point to a strong correlation between the high-likelihood regions of the base model and robust reasoning ability. This research opens a promising new direction for extending the scope of LLM reasoning, particularly beyond easily verifiable domains, by focusing on smarter, training-free sampling techniques.
For more technical details, you can read the full research paper here.


