TLDR: Self-Questioning Language Models (SQLM) is a new framework where LLMs improve reasoning skills by generating their own questions and answers. It uses an “asymmetric self-play” setup in which a “proposer” creates problems and a “solver” solves them. Both are trained via reinforcement learning, with the reward mechanism adapted to the task (majority voting for math, unit tests for coding). This self-supervised approach significantly boosts performance on arithmetic, algebra, and coding benchmarks without external data.
The paper introduces a fascinating new approach to improving large language models (LLMs) without relying on extensive, hand-curated datasets. This method, called Self-Questioning Language Models (SQLM), uses an asymmetric self-play framework where an LLM essentially teaches itself by generating its own questions and then attempting to answer them.
At its core, SQLM involves two main components: a “proposer” and a “solver.” Both of these are instances of the language model itself. The proposer’s role is to create new problems or questions based on a given topic, such as “algebra word problems.” The solver then takes these generated questions and tries to find the answers. This dynamic interaction allows the models to continuously learn and refine their reasoning abilities.
The training process for both the proposer and solver is based on reinforcement learning. The proposer receives a reward if the question it generates is neither too easy nor too difficult for the solver. This encourages the proposer to create a curriculum of increasingly challenging yet solvable problems. For the solver, the reward mechanism varies depending on the type of problem. For tasks like arithmetic or algebra, where verifying an answer is as hard as solving it, the system uses a “majority voting” approach: the solver attempts each question several times, the most common answer is treated as a proxy for the correct one, and attempts that match it are rewarded.
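To make the two reward signals concrete, here is a minimal sketch in Python. The function names are hypothetical, and the exact cutoffs for “too easy” (every sampled answer agrees) and “too hard” (no answer is shared by more than one sample) are assumptions chosen for illustration, not the paper's stated thresholds.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (majority_answer, rewards) for several solver attempts at one question.

    Each attempt gets reward 1.0 if it matches the most common answer, else 0.0.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority_answer, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority_answer else 0.0 for a in answers]
    return majority_answer, rewards


def proposer_reward(answers):
    """Reward the proposer only for questions of intermediate difficulty.

    Assumption (illustrative): a question is "too easy" when all attempts agree
    and "too hard" when no answer is shared by more than one attempt.
    """
    _, rewards = majority_vote_rewards(answers)
    agree = sum(rewards)
    too_easy = agree == len(answers)
    too_hard = agree <= 1
    return 0.0 if (too_easy or too_hard) else 1.0


# Four attempts at a hypothetical multiplication question.
samples = ["56", "56", "54", "56"]
majority, solver_rewards = majority_vote_rewards(samples)
print(majority, solver_rewards, proposer_reward(samples))
```

Note that no ground-truth label appears anywhere: the majority answer stands in for it, which is why the signal can be wrong when the solver is consistently mistaken.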
However, for tasks like coding, where verifying a solution (e.g., through unit tests) is often easier than generating the correct code, the proposer also generates unit tests along with the problem. The solver’s reward is then based on how many of these unit tests its solution passes. This clever design adapts to the inherent “generator-verifier gap” of different problem types.
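A test-based reward of this kind can be sketched as the fraction of proposer-written tests that a candidate solution passes. The representation below (tests as pairs of arguments and expected output, the solution as a Python callable) and the name `unit_test_reward` are assumptions made for illustration. A real pipeline would also need to run model-generated code in a sandbox, which this sketch omits.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical test format: (positional arguments, expected return value).
TestCase = Tuple[tuple, Any]


def unit_test_reward(solution: Callable[..., Any], tests: Sequence[TestCase]) -> float:
    """Return the fraction of unit tests the candidate solution passes."""
    if not tests:
        return 0.0  # no tests means no evidence of correctness
    passed = 0
    for args, expected in tests:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            # A crashing solution simply fails that test.
            pass
    return passed / len(tests)


# A deliberately buggy "absolute value" that fails on negative inputs.
def buggy_abs(x):
    return x


tests = [((3,), 3), ((-2,), 2), ((0,), 0)]
print(unit_test_reward(buggy_abs, tests))  # passes 2 of 3 tests
```

A fractional score gives the solver partial credit for nearly correct programs, in contrast to the all-or-nothing signal from majority voting.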
The researchers evaluated SQLM on three distinct benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. A key finding was that even starting with just a single prompt describing the task and no example problems or labeled data, the models showed significant improvements. For instance, the Qwen2.5-3B-Instruct model improved its accuracy by 14% on Arithmetic and 16% on Algebra, and the Qwen2.5-Coder-3B-Instruct model improved by 7% on Coding. These gains demonstrate that LLMs can indeed enhance their reasoning skills through this self-supervised learning process.
The study also explored how the frequency of proposer updates affects training. The authors found that updating the proposer every five steps provided a good balance, leading to better performance and lower variance across training runs. This iterative generation of problems ensures a continuous flow of diverse and appropriately challenging tasks.
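The update schedule can be pictured with a small sketch. It assumes the solver is updated at every training step while the proposer is updated only at every fifth step. The function name and the choice to count steps from 1 are illustrative, not details taken from the paper.

```python
def update_schedule(num_steps, proposer_update_every=5):
    """List, for each training step, which roles receive an update.

    Assumption (illustrative): the solver updates every step; the proposer
    updates only when the step number is a multiple of proposer_update_every.
    """
    schedule = []
    for step in range(1, num_steps + 1):
        roles = ["solver"]
        if step % proposer_update_every == 0:
            roles.append("proposer")
        schedule.append(roles)
    return schedule


for step, roles in enumerate(update_schedule(10), start=1):
    print(step, roles)
```

Under these assumptions, the proposer's question distribution stays fixed for several solver updates before it shifts.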
This self-questioning framework represents a significant step towards more autonomous language model refinement, reducing the heavy reliance on human-curated datasets that traditionally demand substantial effort and supervision. While the method still requires some initial prompt tuning to guide the generation space, it opens up exciting possibilities for LLMs to become active agents in their own training and development. For more in-depth details, refer to the full research paper.


