Unlocking Deeper Exploration in Language Models with SESA

TLDR: SESA (SEquential SAmpling) is a framework that improves exploration in large language models (LLMs) trained with reinforcement learning (RL). It addresses the limited exploration and ‘policy collapse’ seen with traditional parallel sampling by generating diverse solution sketches sequentially, conditioning each new output on the previous ones. Its two-stage approach, sequential method drafting followed by parallel guided solution generation, increases output diversity, helps models discover new strategies, and leads to sustained performance improvements across agent benchmarks, Sudoku, and mathematical reasoning tasks. SESA can also revive collapsed policies, so that training keeps making progress instead of stagnating.

Large Language Models (LLMs) have made major strides in reasoning, often thanks to Reinforcement Learning (RL). However, a persistent challenge in RL training for LLMs is the tendency for models to get stuck in a rut, repeatedly exploiting a narrow set of solutions. This issue, known as limited exploration or entropy collapse, prevents models from discovering new and potentially better strategies and ultimately limits their performance.

Traditional RL methods often use ‘parallel sampling,’ where multiple outputs are generated independently from the same distribution. While seemingly efficient, this approach can lead to outputs that are too similar, causing the model to converge prematurely to a few high-reward solutions and lose diversity. Once this ‘policy collapse’ occurs, further training becomes ineffective as the model has no new strategies to explore.

Introducing SESA: A New Approach to Exploration

To tackle this, researchers Shijia Kang and Muhan Zhang from Peking University have proposed a novel framework called SESA (SEquential SAmpling). SESA fundamentally shifts the sampling paradigm by generating diverse solution sketches sequentially, with each new output conditioned on the ones that came before it. This ensures that every new candidate is distinct from its predecessors, actively promoting diversity and preventing the model from falling into a policy collapse.
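To make the contrast with parallel sampling concrete, here is a minimal toy sketch. It is not the authors' implementation: the `POLICY` table is a made-up, sharply peaked distribution over five named strategies, and "conditioning on earlier outputs" is modeled crudely by excluding strategies that were already drawn. In an actual LLM the conditioning would happen through the prompt or context instead.

```python
import random

# Hypothetical toy policy: a peaked distribution over five named strategies.
POLICY = {"A": 0.90, "B": 0.04, "C": 0.03, "D": 0.02, "E": 0.01}


def parallel_sample(policy, n, rng):
    """Draw n outputs independently from the same distribution."""
    names = list(policy)
    weights = [policy[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def sequential_sample(policy, n, rng):
    """Draw outputs one at a time, each conditioned on the earlier ones.

    Assumption of this sketch: conditioning is modeled by removing the
    strategies already drawn and sampling from what remains.
    """
    drawn = []
    for _ in range(n):
        remaining = {s: p for s, p in policy.items() if s not in drawn}
        if not remaining:
            break  # every strategy has been used once
        names = list(remaining)
        weights = [remaining[name] for name in names]
        drawn.append(rng.choices(names, weights=weights, k=1)[0])
    return drawn


rng = random.Random(0)
par = parallel_sample(POLICY, 4, rng)
seq = sequential_sample(POLICY, 4, rng)
print("parallel:  ", par, "distinct:", len(set(par)))
print("sequential:", seq, "distinct:", len(set(seq)))
```

With a peaked policy, independent draws tend to repeat the dominant strategy, while the sequential variant returns four different strategies by construction.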

For complex real-world tasks, SESA employs a two-stage procedure to maintain both diversity and efficiency:

Stage I: Sequential Method Drafting: The model first generates several concise ‘method sketches’ sequentially. These sketches are brief plans or strategies, and because they are short, this stage adds minimal latency and doesn’t strain the model’s context window. Each sketch is designed to be different from the ones already generated.
Stage II: Guided Solution Generation: After the sketches are created, each one is expanded into a full solution in parallel. This means that while the initial plans are diversified sequentially, the detailed execution of those plans happens simultaneously. This parallel expansion restores throughput while ensuring that each final solution is unique and self-contained, anchored to its distinct initial plan.
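The two stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's code: `draft_sketch` and `expand_sketch` are hypothetical stand-ins for model calls, and a thread pool stands in for whatever batched inference a real training setup would use.

```python
from concurrent.futures import ThreadPoolExecutor


def draft_sketch(problem, previous):
    """Hypothetical stand-in for a model call that sees the earlier sketches."""
    return f"method {len(previous) + 1} for {problem}"


def expand_sketch(problem, sketch):
    """Hypothetical stand-in for a model call that writes out a full solution."""
    return f"[solution to {problem} following: {sketch}]"


def two_stage_sample(problem, n):
    # Stage I: draft short sketches one at a time, each one conditioned
    # on the sketches produced so far.
    sketches = []
    for _ in range(n):
        sketches.append(draft_sketch(problem, list(sketches)))

    # Stage II: expand every sketch independently, in parallel.
    with ThreadPoolExecutor(max_workers=n) as pool:
        solutions = list(pool.map(lambda s: expand_sketch(problem, s), sketches))
    return sketches, solutions


sketches, solutions = two_stage_sample("puzzle", 3)
for sketch, solution in zip(sketches, solutions):
    print(sketch, "->", solution)
```

Only the short sketches are produced one after another. The longer expansions do not depend on each other, which is why they can run at the same time.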

Demonstrated Benefits Across Tasks

The effectiveness of SESA has been rigorously tested across various benchmarks. In a synthetic ‘Path Exploration’ task, sequential sampling consistently outperformed parallel sampling, uncovering strategies that the latter failed to discover and retaining a significantly larger proportion of correct solutions. While parallel sampling quickly plateaued, SESA continued to find new paths, demonstrating its superior exploration capabilities.

On three classic RL agent benchmarks (Sokoban, Countdown, and FrozenLake), SESA showed substantial improvements in success rates. For instance, on Sokoban, SESA raised the success rate by 0.25 over the base model, an improvement 211% larger than the one achieved by baseline RL methods. Similar gains were observed on FrozenLake and Countdown, highlighting SESA’s ability to preserve diversity and enhance exploration during training.

Beyond agent tasks, SESA also proved beneficial in general reasoning tasks like Sudoku and mathematical problems from AIME24. In Sudoku, SESA improved the success rate by 6% over the baseline. For math problems, it achieved comparable performance in Pass@1 (first attempt success) but significantly improved Pass@k (success within k attempts) by 9%, indicating a greater diversity of correct outputs.
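For readers unfamiliar with the metric, Pass@k can be estimated from n sampled attempts of which c are correct. The sketch below uses a widely used combinatorial estimator, 1 - C(n-c, k) / C(n, k). The article does not say how the paper computes Pass@k, so treat this as an assumption. The function name `pass_at_k` and its arguments are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k attempts is correct.

    n: total attempts sampled, c: how many of them were correct,
    k: attempt budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # too few incorrect attempts to fill k slots
    # Probability that a random size-k subset contains no correct attempt,
    # subtracted from one.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

For k = 1 the estimate reduces to the plain fraction of correct attempts, c / n. Larger k rewards a model whose correct answers are spread across more problems, which is why Pass@k is sensitive to output diversity.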

Reviving ‘Dead Policies’

One of SESA’s most compelling advantages is its ability to recover models from a ‘dead policy’ state. When parallel sampling leads to policy collapse, the model’s outputs become nearly identical, and further training yields no progress. Researchers demonstrated that by resuming training with sequential sampling from such a collapsed state, the model’s diversity gradually increased, and its performance recovered, proving that SESA can revitalize exploration and prevent stagnation.
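One simple way to watch for this kind of collapse is to track how varied a batch of sampled outputs is. The sketch below computes two generic measures: the fraction of distinct outputs and the empirical Shannon entropy of the output counts. The article does not state which diversity measure the researchers used, so both functions are illustrative assumptions.

```python
from collections import Counter
from math import log


def distinct_ratio(outputs):
    """Fraction of outputs in the batch that are unique."""
    return len(set(outputs)) / len(outputs)


def empirical_entropy(outputs):
    """Shannon entropy (in nats) of the empirical output distribution."""
    counts = Counter(outputs)
    total = len(outputs)
    return -sum((c / total) * log(c / total) for c in counts.values())


collapsed = ["plan A", "plan A", "plan A", "plan A"]
diverse = ["plan A", "plan B", "plan C", "plan D"]
print(distinct_ratio(collapsed), empirical_entropy(collapsed))
print(distinct_ratio(diverse), empirical_entropy(diverse))
```

A batch of identical outputs has zero entropy and the lowest possible distinct ratio. A recovering policy would show both numbers rising over training.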

By introducing a structured approach to exploration, SESA offers a robust method for sustained performance gains in RL-trained LLMs, ensuring they can discover a broader range of valid strategies and continue learning effectively. The full research paper has more details.
