TLDR: A new research paper introduces Unary Feedback as Observation (UFO), a method that significantly improves Large Language Models’ (LLMs) ability to reason and revise answers in multi-turn interactions. By using simple ‘Try Again’ feedback during reinforcement learning, UFO helps LLMs overcome the tendency to repeat mistakes, leading to up to 14% higher success rates in multi-turn problem-solving and better generalization across various tasks. The approach also incorporates reward decay and repetition penalties to encourage efficient and diverse reasoning.
Large Language Models (LLMs) have made incredible strides in solving complex tasks, from generating code to tackling advanced math problems. Much of this progress is thanks to Reinforcement Learning (RL), which trains these models to maximize rewards for correct answers. However, a significant challenge remains: these powerful models often struggle with multi-turn problem-solving, where they need to reflect on their previous attempts and revise their answers based on feedback. Instead of adapting, models trained with traditional single-turn RL tend to repeat the same incorrect responses, leading to a frustrating user experience.
A recent research paper, titled “A Simple ‘Try Again’ Can Elicit Multi-Turn LLM Reasoning,” by Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, and Manling Li, introduces a surprisingly simple yet effective solution to this problem. The core idea is to train LLMs using multi-turn reinforcement learning with only “unary feedback” – a minimal, generic signal like “Let’s try again” when an answer is wrong. This approach is called Unary Feedback as Observation (UFO).
The Problem with Single-Turn Training
Imagine teaching someone to solve a puzzle by grading only their first answer and never giving them a second attempt. That is roughly how many LLMs are trained with single-turn RL. While they become excellent at providing a single correct answer, they often lose the ability to learn from in-context feedback. The researchers observed that in many cases, models trained with single-turn RL would generate the exact same incorrect answer across multiple turns, even when prompted to try again. This highlights a critical gap: real-world applications like chatbots and educational tools require models to adapt and refine their reasoning iteratively.
Introducing Unary Feedback as Observation (UFO)
UFO addresses this by reframing the problem-solving process as a multi-turn interaction. Instead of needing complex, detailed feedback, the model simply receives a generic “Try Again” signal if its answer is incorrect. If the answer is correct, the interaction ends. This simple mechanism allows existing single-turn datasets to be transformed into multi-turn training scenarios without requiring expensive human annotations or complex execution environments.
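The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `run_episode`, `toy_policy`, and `FEEDBACK`, the prompt layout, and the exact-match correctness check are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

FEEDBACK = "Try Again"  # the only feedback signal the model ever receives


def run_episode(question: str, gold: str,
                policy: Callable[[str], str],
                max_turns: int = 5) -> Tuple[List[str], bool]:
    """Roll out one multi-turn episode from a single-turn (question, gold) pair.

    `policy` is a hypothetical stand-in for the LLM: it maps the full
    interaction history to the next answer.
    """
    history = f"Question: {question}"
    attempts: List[str] = []
    for _ in range(max_turns):
        answer = policy(history)
        attempts.append(answer)
        # Assumption: correctness is checked by exact string match.
        if answer.strip() == gold.strip():
            return attempts, True  # correct answer ends the interaction
        # Wrong answer: append the attempt and the unary feedback to the context.
        history += f"\nAttempt: {answer}\nFeedback: {FEEDBACK}"
    return attempts, False


def toy_policy(history: str) -> str:
    """Toy policy that moves to a new guess each time it sees the feedback."""
    guesses = ["5", "6", "7"]
    return guesses[min(history.count(FEEDBACK), len(guesses) - 1)]


attempts, solved = run_episode("What is 3 + 4?", "7", toy_policy)
print(attempts, solved)
```

Because the loop needs only a question and a reference answer, any existing single-turn dataset can feed it without extra annotation.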
During training, the model learns to condition its responses on the full history of past attempts and the unary feedback. This encourages context-sensitive behaviors like error correction and hypothesis refinement. To further guide the model towards efficient and diverse reasoning, the researchers designed a clever reward structure. They introduced a “reward decay” that gives higher rewards for solving problems in fewer turns, promoting conciseness. Additionally, an “answer repetition penalty” discourages the model from generating identical responses, encouraging it to explore different strategies when it makes a mistake.
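One way to combine the two reward-shaping ideas is sketched below. The exponential form of the decay follows the article's description, but the function name `shaped_reward`, the default values of `gamma` and `penalty`, and the way the repetition penalty is computed (the fraction of attempts that duplicate an earlier answer) are assumptions for illustration and may differ from the paper's exact formulation.

```python
from typing import List


def shaped_reward(attempts: List[str], solved: bool,
                  gamma: float = 0.5, penalty: float = 0.1) -> float:
    """Episode-level reward with turn decay and a repetition penalty.

    gamma and penalty are hypothetical hyperparameters, not values
    taken from the paper.
    """
    turns = len(attempts)
    if turns == 0:
        return 0.0
    # Reward decay: solving on turn 1 earns 1.0, turn 2 earns gamma, and so on.
    success = gamma ** (turns - 1) if solved else 0.0
    # Repetition penalty: count attempts that repeat an earlier answer.
    repeated = turns - len(set(attempts))
    return success - penalty * repeated / turns


print(shaped_reward(["7"], True))            # solved immediately
print(shaped_reward(["5", "6", "7"], True))  # solved on the third turn
print(shaped_reward(["5", "5", "5"], False)) # repeated the same wrong answer
```

Under this sketch, an episode solved in fewer turns always scores higher than a slower success, and an episode that repeats the same wrong answer scores below one that fails with distinct attempts.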
Impressive Results and Generalization
The experimental results are compelling. Models trained with UFO improved multi-turn reasoning accuracy by up to 14% compared to previous single-turn RL approaches. This gain did not come at the expense of single-turn ability: the paper reports that UFO-trained models keep their single-turn performance. The benefits extended across various domains, including mathematical reasoning, question answering, and general knowledge tasks, demonstrating strong cross-task generalization.
The study also confirmed that explicit feedback prompts are crucial for effective revision. Models performed significantly better when they received a “Please think again” type of prompt compared to no feedback. Furthermore, the reward shaping strategies proved effective: exponential reward decay led to models solving problems in fewer turns, indicating more efficient problem-solving, and the repetition penalty successfully encouraged the generation of more diverse answers over time.
Conclusion
The UFO framework offers a lightweight, generalizable, and effective method for training LLMs to excel in multi-turn interactive problem-solving. By leveraging simple unary feedback and smart reward design, LLMs can learn to self-correct, explore diverse reasoning paths, and ultimately provide more accurate and efficient responses in conversational settings. This work highlights a crucial step towards building more adaptive and human-like AI assistants. You can find the full research paper here: A Simple “Try Again” Can Elicit Multi-Turn LLM Reasoning.


