TLDR: A new research paper introduces “Local Naturalness,” a method for improving how smaller language models learn complex reasoning from larger “teacher” LLMs. Unlike previous methods that assess an entire reasoning trace at once, Local Naturalness evaluates the student model’s confidence over short, sequential reasoning steps. This localized approach proves more reliable for long reasoning tasks and multi-teacher settings, significantly boosting student performance on math, science, and code benchmarks by enabling better teacher selection and more effective data curation from diverse sources.
The field of artificial intelligence has seen remarkable advancements in large language models (LLMs), with a growing focus on their ability to perform complex reasoning. A common strategy to equip smaller, more efficient LLMs with sophisticated reasoning capabilities is through a process called supervised fine-tuning (SFT). This involves distilling long, detailed reasoning steps, often referred to as chain-of-thought (CoT), from powerful ‘teacher’ models into ‘student’ models. While this approach is practical and efficient, a significant challenge has remained underexplored: how to select the best reasoning response when a student model has access to multiple outputs from different teacher models for the same prompt.
Traditionally, researchers have often focused on selecting prompts, assuming a single, fixed teacher response. However, in real-world scenarios, multiple teachers might offer diverse responses, varying in logical depth and clarity. The existing method for response selection, known as ‘global naturalness,’ attempts to pick the response that the student model finds most ‘natural’ by assigning it the highest overall log-probability. This intuition suggests that data the model already understands well would be easiest to learn from.
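The global-naturalness criterion described above can be sketched in a few lines. This is a minimal illustration, not the paper’s implementation: `LogProbFn` is a hypothetical interface standing in for a real student model’s forward pass, and `toy_logprobs` is a stand-in scorer for demonstration only.

```python
from typing import Callable, List

# Hypothetical interface: returns per-token log-probabilities for `response`
# conditioned on `prompt`. A real implementation would run the student model
# and read log-softmaxed logits at each response token position.
LogProbFn = Callable[[str, str], List[float]]

def global_naturalness(prompt: str, response: str, logprobs: LogProbFn) -> float:
    """Sum of token log-probabilities over the whole response: the 'global' score."""
    return sum(logprobs(prompt, response))

def select_response_globally(prompt: str, responses: List[str],
                             logprobs: LogProbFn) -> str:
    """Pick the teacher response the student assigns the highest global log-probability."""
    return max(responses, key=lambda r: global_naturalness(prompt, r, logprobs))

# Toy stand-in scorer: every whitespace-separated token costs -1.0 nats,
# so longer responses accumulate lower (more negative) global scores.
def toy_logprobs(prompt: str, response: str) -> List[float]:
    return [-1.0 for _ in response.split()]

best = select_response_globally("Q: 2+2?", ["4", "The answer is 4"], toy_logprobs)
```

Note the length bias visible even in the toy scorer: summing log-probabilities over the full response means longer traces tend to score lower overall, which foreshadows the problem the paper identifies with long reasoning traces.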
However, recent research, including the paper by Hoang Anh Just, Myeongseob Ko, and Ruoxi Jia from Virginia Tech, reveals a critical flaw in this global approach, especially when dealing with long reasoning traces (over 10,000 tokens) and multiple teacher models. The problem arises because student models, often trained with shorter context windows, struggle to maintain information consistency over extended input lengths. Consequently, a high global log-probability for a long response doesn’t necessarily correlate with improved performance after fine-tuning. In fact, a response with a lower global likelihood might sometimes yield superior downstream accuracy.
Introducing Local Naturalness
To overcome this limitation, the researchers propose a novel method called ‘Local Naturalness.’ Instead of evaluating the entire response at once, Local Naturalness scores a response by measuring the student’s log-probabilities over short, sequential reasoning steps, such as individual sentences. Each step is evaluated based on a small, localized context window of preceding steps. This approach aligns with the idea that effective reasoning often emerges from a ‘locality of experience,’ where models learn to chain accurate local inferences.
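The local scoring idea can be sketched as follows. The step segmenter, the `window` size, and the averaging are illustrative assumptions (the paper uses sentence-level steps; the exact aggregation may differ), and `toy_logprobs` is again a stand-in for a real student model.

```python
from typing import Callable, List

# Hypothetical interface: per-token log-probabilities for `text` given `context`.
LogProbFn = Callable[[str, str], List[float]]

def split_steps(response: str) -> List[str]:
    """Naive sentence splitter, standing in for a real reasoning-step segmenter."""
    return [s.strip() + "." for s in response.split(".") if s.strip()]

def local_naturalness(prompt: str, response: str, logprobs: LogProbFn,
                      window: int = 2) -> float:
    """Average per-step log-probability, each step conditioned only on the
    prompt plus the previous `window` steps, never the full trace."""
    steps = split_steps(response)
    scores = []
    for i, step in enumerate(steps):
        local_context = prompt + " " + " ".join(steps[max(0, i - window):i])
        scores.append(sum(logprobs(local_context, step)))
    return sum(scores) / len(scores) if scores else float("-inf")

# Toy stand-in scorer: each token costs -1.0 nats, ignoring context.
def toy_logprobs(context: str, text: str) -> List[float]:
    return [-1.0 for _ in text.split()]

score = local_naturalness("Q:", "First, note x. Then, derive y. So z.",
                          toy_logprobs, window=1)
```

Because every step is scored inside a short context window, the student never has to judge a 10,000-token trace in one pass, which is what makes the score reliable even when the trace exceeds the student’s effective context.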
The shift from a global to a local assessment offers a more reliable measure of reasoning quality. Individual logical steps are shorter and less complex, making them easier for student models to accurately assess. By focusing on these ‘stepping stones’ of the reasoning process, Local Naturalness can directly evaluate the quality and suitability of intermediate components for the student model, mitigating biases inherent in global log-probability assessments.
Two Key Applications
Local Naturalness enables two significant applications:
- Teacher Selection: By aggregating local scores across prompts, the method reliably identifies the most helpful teacher model for a specific student, a task where global scoring fails outright.
- Response Selection from a Mixed-Teacher Dataset: When combining answers from various teachers, Local Naturalness proves highly effective. Experiments showed that using this method boosted a 32-billion-parameter student model’s accuracy on math benchmarks by 9.4% compared to global-naturalness-based selection. Remarkably, it even surpassed the performance achieved by training on data from the single best teacher.
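The teacher-selection application reduces to a simple aggregation once per-prompt local scores exist. A minimal sketch, assuming local scores have already been computed by a local-naturalness-style function (the dictionary shape, teacher names, and mean aggregation here are illustrative assumptions, not the paper’s exact protocol):

```python
from typing import Dict, List

def select_teacher(local_scores: Dict[str, List[float]]) -> str:
    """Pick the teacher whose responses receive the highest mean
    local-naturalness score from the student across the prompt set.
    `local_scores` maps teacher name -> one local score per prompt."""
    return max(local_scores,
               key=lambda t: sum(local_scores[t]) / len(local_scores[t]))

# Illustrative scores (log-probabilities, so closer to zero is better).
scores = {
    "teacher_a": [-1.2, -0.9, -1.1],
    "teacher_b": [-0.7, -0.8, -0.9],
}
best = select_teacher(scores)
```

For response selection from a mixed-teacher dataset, the same local score is applied per response rather than per teacher: for each prompt, keep the candidate answer with the highest local-naturalness score, regardless of which teacher produced it.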
Experimental Validation and Generalizability
The research rigorously evaluated Local Naturalness using various student models (such as Qwen2.5-7B-Instruct, Qwen2.5-32B-Instruct, and Llama-3.1-8B-Instruct) and teacher models (including Qwen3-32B-Instruct, DeepSeek-R1, and QwQ-32B) across a suite of math and science benchmarks. The results consistently demonstrated that local log-probabilities, particularly those derived from shorter contexts, offer a robust and efficient method for identifying optimal teacher models and curating high-quality reasoning data.
Furthermore, the generalizability of Local Naturalness was tested beyond mathematics. It showed significant improvements in scientific reasoning on the GPQA-Diamond benchmark and in code reasoning on the LiveCodeBench v2 benchmark. This suggests that the core principle of Local Naturalness is broadly applicable to other domains requiring complex, step-by-step inference.
In conclusion, this work highlights the power of localized data-quality evaluation and data mixing for more effective reasoning distillation. By focusing on the ‘naturalness’ of individual reasoning steps rather than the entire response, Local Naturalness provides a more nuanced and reliable assessment, ultimately leading to better-performing student LLMs. For more in-depth information, see the full research paper.