TLDR: A new research paper introduces “Local Naturalness,” a method for improving how smaller language models learn complex reasoning from larger “teacher” LLMs. Unlike previous methods that assess an entire reasoning trace at once, Local Naturalness evaluates the student model’s confidence over short, sequential reasoning steps. This localized approach proves more reliable for long reasoning tasks and multi-teacher settings, significantly boosting student performance on math, science, and code benchmarks by enabling better teacher selection and more effective data curation from diverse sources.
The field of artificial intelligence has seen remarkable advancements in large language models (LLMs), with a growing focus on their ability to perform complex reasoning. A common strategy to equip smaller, more efficient LLMs with sophisticated reasoning capabilities is through a process called supervised fine-tuning (SFT). This involves distilling long, detailed reasoning steps, often referred to as chain-of-thought (CoT), from powerful ‘teacher’ models into ‘student’ models. While this approach is practical and efficient, a significant challenge has remained underexplored: how to select the best reasoning response when a student model has access to multiple outputs from different teacher models for the same prompt.
Traditionally, researchers have often focused on selecting prompts, assuming a single, fixed teacher response. However, in real-world scenarios, multiple teachers might offer diverse responses, varying in logical depth and clarity. The existing method for response selection, known as ‘global naturalness,’ attempts to pick the response that the student model finds most ‘natural’ by assigning it the highest overall log-probability. This intuition suggests that data the model already understands well would be easiest to learn from.
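The global-naturalness criterion described above can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation: `LogProbFn` is a hypothetical interface standing in for a real student model’s forward pass, and `toy_logprobs` is a stand-in scorer for demonstration only.

```python
from typing import Callable, List

# Hypothetical interface: returns per-token log-probabilities for `response`
# conditioned on `prompt`. A real implementation would run the student model
# and read log-softmaxed logits at each response token position.
LogProbFn = Callable[[str, str], List[float]]

def global_naturalness(prompt: str, response: str, logprobs: LogProbFn) -> float:
    """Sum of token log-probabilities over the whole response: the 'global' score."""
    return sum(logprobs(prompt, response))

def select_response_globally(prompt: str, responses: List[str],
                             logprobs: LogProbFn) -> str:
    """Pick the teacher response the student assigns the highest global log-probability."""
    return max(responses, key=lambda r: global_naturalness(prompt, r, logprobs))

# Toy stand-in scorer: every whitespace-separated token costs -1.0 nats,
# so longer responses accumulate lower (more negative) global scores.
def toy_logprobs(prompt: str, response: str) -> List[float]:
    return [-1.0 for _ in response.split()]

best = select_response_globally("Q: 2+2?", ["4", "The answer is 4"], toy_logprobs)
```

Note the length bias visible even in the toy scorer: summing log-probabilities over the full response means longer traces tend to score lower overall, which foreshadows the problem the paper identifies with long reasoning traces.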
However, recent research, including the paper by Hoang Anh Just, Myeongseob Ko, and Ruoxi Jia from Virginia Tech, reveals a critical flaw in this global approach, especially when dealing with long reasoning traces (over 10,000 tokens) and multiple teacher models. The problem arises because student models, often trained with shorter context windows, struggle to maintain information consistency over extended input lengths. Consequently, a high global log-probability for a long response doesn’t necessarily correlate with improved performance after fine-tuning. In fact, a response with a lower global likelihood might sometimes yield superior downstream accuracy.
Introducing Local Naturalness
To overcome this limitation, the researchers propose a novel method called ‘Local Naturalness.’ Instead of evaluating the entire response at once, Local Naturalness scores a response by measuring the student’s log-probabilities over short, sequential reasoning steps, such as individual sentences. Each step is evaluated based on a small, localized context window of preceding steps. This approach aligns with the idea that effective reasoning often emerges from a ‘locality of experience,’ where models learn to chain accurate local inferences.
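The local scoring idea can be sketched as follows. The step segmenter, the `window` size, and the averaging are illustrative assumptions (the paper uses sentence-level steps; the exact aggregation may differ), and `toy_logprobs` is again a stand-in for a real student model.

```python
from typing import Callable, List

# Hypothetical interface: per-token log-probabilities for `text` given `context`.
LogProbFn = Callable[[str, str], List[float]]

def split_steps(response: str) -> List[str]:
    """Naive sentence splitter, standing in for a real reasoning-step segmenter."""
    return [s.strip() + "." for s in response.split(".") if s.strip()]

def local_naturalness(prompt: str, response: str, logprobs: LogProbFn,
                      window: int = 2) -> float:
    """Average per-step log-probability, each step conditioned only on the
    prompt plus the previous `window` steps, never the full trace."""
    steps = split_steps(response)
    scores = []
    for i, step in enumerate(steps):
        local_context = prompt + " " + " ".join(steps[max(0, i - window):i])
        scores.append(sum(logprobs(local_context, step)))
    return sum(scores) / len(scores) if scores else float("-inf")

# Toy stand-in scorer: each token costs -1.0 nats, ignoring context.
def toy_logprobs(context: str, text: str) -> List[float]:
    return [-1.0 for _ in text.split()]

score = local_naturalness("Q:", "First, note x. Then, derive y. So z.",
                          toy_logprobs, window=1)
```

Because every step is scored inside a short context window, the student never has to judge a 10,000-token trace in one pass, which is what makes the score reliable even when the trace exceeds the student’s effective context.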
The shift from a global to a local assessment offers a more reliable measure of reasoning quality. Individual logical steps are shorter and less complex, making them easier for student models to accurately assess. By focusing on these ‘stepping stones’ of the reasoning process, Local Naturalness can directly evaluate the quality and suitability of intermediate components for the student model, mitigating biases inherent in global log-probability assessments.
Two Key Applications
Local Naturalness enables two significant applications:
- Teacher Selection: By aggregating local scores across prompts, the method reliably identifies the most helpful teacher model for a specific student, a task where global scoring fails outright.
- Response Selection from a Mixed-Teacher Dataset: When combining answers from various teachers, Local Naturalness proves highly effective. Experiments showed that using this method boosted a 32-billion-parameter student model’s accuracy on math benchmarks by 9.4% compared to global-naturalness-based selection. Remarkably, it even surpassed the performance achieved by training on data from the single best teacher.
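The teacher-selection application reduces to a simple aggregation once per-prompt local scores exist. A minimal sketch, assuming local scores have already been computed by a local-naturalness-style function (the dictionary shape, teacher names, and mean aggregation here are illustrative assumptions, not the paper’s exact protocol):

```python
from typing import Dict, List

def select_teacher(local_scores: Dict[str, List[float]]) -> str:
    """Pick the teacher whose responses receive the highest mean
    local-naturalness score from the student across the prompt set.
    `local_scores` maps teacher name -> one local score per prompt."""
    return max(local_scores,
               key=lambda t: sum(local_scores[t]) / len(local_scores[t]))

# Illustrative scores (log-probabilities, so closer to zero is better).
scores = {
    "teacher_a": [-1.2, -0.9, -1.1],
    "teacher_b": [-0.7, -0.8, -0.9],
}
best = select_teacher(scores)
```

For response selection from a mixed-teacher dataset, the same local score is applied per response rather than per teacher: for each prompt, keep the candidate answer with the highest local-naturalness score, regardless of which teacher produced it.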
Experimental Validation and Generalizability
The research rigorously evaluated Local Naturalness using various student models (such as Qwen2.5-7B-Instruct, Qwen2.5-32B-Instruct, and Llama-3.1-8B-Instruct) and teacher models (including Qwen3-32B-Instruct, DeepSeek-R1, and QwQ-32B) across a suite of math and science benchmarks. The results consistently demonstrated that local log-probabilities, particularly those derived from shorter contexts, offer a robust and efficient method for identifying optimal teacher models and curating high-quality reasoning data.
Furthermore, the generalizability of Local Naturalness was tested beyond mathematics. It showed significant improvements in scientific reasoning on the GPQA-Diamond benchmark and in code reasoning on the LiveCodeBench v2 benchmark. This suggests that the core principle of Local Naturalness is broadly applicable to other domains requiring complex, step-by-step inference.
In conclusion, this work highlights the power of localized data-quality evaluation and data mixing for more effective reasoning distillation. By focusing on the ‘naturalness’ of individual reasoning steps rather than the entire response, Local Naturalness provides a more nuanced and reliable assessment, ultimately leading to better-performing student LLMs. For more in-depth information, see the full research paper.