TLDR: SESA (SEquential SAmpling) is a framework for improving exploration in large language models (LLMs) trained with reinforcement learning (RL). It targets the limited exploration and ‘policy collapse’ seen with traditional parallel sampling by generating diverse solution sketches sequentially, conditioning each new sketch on the previous ones. The two-stage approach, sequential method drafting followed by parallel guided solution generation, boosts output diversity, helps models discover new strategies, and yields sustained performance improvements on agent benchmarks, Sudoku, and mathematical reasoning tasks. SESA can also revive collapsed policies, so training continues to make progress instead of stagnating.
Large Language Models (LLMs) have made incredible strides in reasoning, often thanks to Reinforcement Learning (RL). However, a persistent challenge in RL training for LLMs is the tendency for models to get stuck in a rut, repeatedly exploiting a narrow set of solutions. This issue, known as limited exploration or entropy collapse, prevents models from discovering new and potentially better strategies, ultimately hindering their performance.
Traditional RL methods often use ‘parallel sampling,’ where multiple outputs are generated independently from the same distribution. While seemingly efficient, this approach can lead to outputs that are too similar, causing the model to converge prematurely to a few high-reward solutions and lose diversity. Once this ‘policy collapse’ occurs, further training becomes ineffective as the model has no new strategies to explore.
Introducing SESA: A New Approach to Exploration
To tackle this, researchers Shijia Kang and Muhan Zhang from Peking University have proposed a novel framework called SESA (SEquential SAmpling). SESA shifts the sampling paradigm: instead of drawing every output independently, it generates diverse solution sketches sequentially, with each new output conditioned on the ones that came before it. Because the model sees its earlier candidates, each new candidate is steered away from its predecessors, which actively promotes diversity and helps keep the model from falling into policy collapse.
For complex real-world tasks, SESA employs a clever two-stage procedure to maintain both diversity and efficiency:
- Stage I: Sequential Method Drafting: The model first generates several concise ‘method sketches’ sequentially. These sketches are brief plans or strategies, and because they are short, this stage adds minimal latency and doesn’t strain the model’s context window. Each sketch is designed to be different from the ones already generated.
- Stage II: Guided Solution Generation: After the sketches are created, each one is expanded into a full solution in parallel. This means that while the initial plans are diversified sequentially, the detailed execution of those plans happens simultaneously. This parallel expansion restores throughput while ensuring that each final solution is unique and self-contained, anchored to its distinct initial plan.
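The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Generate` callable, the prompt wording, and the `toy_generate` stand-in are all assumptions made so the example runs without a model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
Generate = Callable[[str], str]


def draft_sketches(generate: Generate, task: str, n: int) -> List[str]:
    """Stage I: draft n short method sketches, one after another.

    Every call is shown the sketches drafted so far and asked for a new one.
    """
    sketches: List[str] = []
    for _ in range(n):
        previous = "\n".join(f"- {s}" for s in sketches) or "(none)"
        prompt = (
            f"Task: {task}\n"
            f"Methods already proposed:\n{previous}\n"
            "Propose one short method that differs from all of the above."
        )
        sketches.append(generate(prompt).strip())
    return sketches


def expand_sketches(generate: Generate, task: str, sketches: List[str]) -> List[str]:
    """Stage II: expand each sketch into a full solution, in parallel."""
    prompts = [
        f"Task: {task}\nFollow this method: {s}\nWrite the full solution."
        for s in sketches
    ]
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        # pool.map returns results in the same order as the prompts.
        return list(pool.map(generate, prompts))


def toy_generate(prompt: str) -> str:
    """Stand-in for an LLM call (assumption) so the sketch is runnable."""
    if "Propose one short method" in prompt:
        options = ["brute-force search", "backtracking", "constraint propagation"]
        # Pick the next unused option based on how many sketches are listed.
        used = sum(1 for line in prompt.splitlines() if line.startswith("- "))
        return options[used % len(options)]
    method = prompt.split("Follow this method: ")[1].splitlines()[0]
    return f"solution via {method}"


if __name__ == "__main__":
    plans = draft_sketches(toy_generate, "solve a Sudoku", 3)
    for solution in expand_sketches(toy_generate, "solve a Sudoku", plans):
        print(solution)
```

Only the short sketches are produced one at a time; the longer solutions do not depend on each other, which is why they can run concurrently.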
Demonstrated Benefits Across Tasks
The effectiveness of SESA has been rigorously tested across various benchmarks. In a synthetic ‘Path Exploration’ task, sequential sampling consistently outperformed parallel sampling, uncovering strategies that the latter failed to discover and retaining a significantly larger proportion of correct solutions. While parallel sampling quickly plateaued, SESA continued to find new paths, demonstrating its superior exploration capabilities.
On three classic RL agent benchmarks (Sokoban, Countdown, and FrozenLake), SESA showed substantial improvements in success rates. For instance, on Sokoban, SESA raised the success rate by 0.25 over the base model, an improvement 211% larger than the one achieved by baseline RL methods. Similar gains were observed in FrozenLake and Countdown, highlighting SESA’s ability to preserve diversity and enhance exploration during training.
Beyond agent tasks, SESA also proved beneficial in general reasoning tasks like Sudoku and mathematical problems from AIME24. In Sudoku, SESA improved the success rate by 6% over the baseline. For math problems, it achieved comparable performance in Pass@1 (first attempt success) but significantly improved Pass@k (success within k attempts) by 9%, indicating a greater diversity of correct outputs.
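To see why Pass@k can rise while Pass@1 stays flat, it helps to compute both. The sketch below uses the combinatorial estimator commonly used for sampling-based benchmarks (n samples per problem, c of them correct); the article does not say which estimator the paper uses, and the per-problem counts here are invented purely for illustration.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n samples of which c are correct, is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical correct counts per problem, out of n = 8 samples each.
concentrated = [4, 0]  # all correct samples land on one problem
spread = [2, 2]        # correct samples are spread over both problems

for name, counts in [("concentrated", concentrated), ("spread", spread)]:
    p1 = mean(pass_at_k(8, c, 1) for c in counts)
    p4 = mean(pass_at_k(8, c, 4) for c in counts)
    print(f"{name}: pass@1={p1:.3f} pass@4={p4:.3f}")
```

Both hypothetical models have the same mean Pass@1, but the one whose correct samples cover more problems scores higher at k = 4, which is the pattern the article attributes to greater output diversity.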
Reviving ‘Dead Policies’
One of SESA’s most compelling advantages is its ability to recover models from a ‘dead policy’ state. When parallel sampling leads to policy collapse, the model’s outputs become nearly identical, and further training yields no progress. Researchers demonstrated that by resuming training with sequential sampling from such a collapsed state, the model’s diversity gradually increased, and its performance recovered, proving that SESA can revitalize exploration and prevent stagnation.
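A collapsed policy can be spotted by measuring how varied a batch of sampled outputs is. The article does not say which diversity measure the researchers used, so the two below (share of distinct outputs and empirical entropy over exact-match outputs) are assumptions chosen for simplicity.

```python
from collections import Counter
from math import log2
from typing import Sequence


def distinct_ratio(outputs: Sequence[str]) -> float:
    """Fraction of the batch made up of unique outputs."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    return len(set(outputs)) / len(outputs)


def empirical_entropy(outputs: Sequence[str]) -> float:
    """Shannon entropy (bits) of the empirical output distribution."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    total = len(outputs)
    probs = [count / total for count in Counter(outputs).values()]
    return sum(p * log2(1 / p) for p in probs)


# Toy batches (illustrative): a collapsed policy repeats one answer.
collapsed = ["plan A"] * 8
diverse = ["plan A", "plan B", "plan C", "plan D"] * 2

print(distinct_ratio(collapsed), empirical_entropy(collapsed))
print(distinct_ratio(diverse), empirical_entropy(diverse))
```

Tracking either number across training steps would show the pattern described above: values near their minimum while the policy is collapsed, then rising as diversity returns.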
By introducing a structured approach to exploration, SESA offers a robust method for sustained performance gains in RL-trained LLMs, ensuring they can discover a broader range of valid strategies and continue learning effectively. More details are available in the full research paper.


