Enabling Language Models to Learn from 'Try Again' Feedback

TLDR: A new research paper introduces Unary Feedback as Observation (UFO), a method that significantly improves Large Language Models’ (LLMs) ability to reason and revise answers in multi-turn interactions. By using simple ‘Try Again’ feedback during reinforcement learning, UFO helps LLMs overcome the tendency to repeat mistakes, leading to up to 14% higher success rates in multi-turn problem-solving and better generalization across various tasks. The approach also incorporates reward decay and repetition penalties to encourage efficient and diverse reasoning.

Large Language Models (LLMs) have made incredible strides in solving complex tasks, from generating code to tackling advanced math problems. Much of this progress is thanks to Reinforcement Learning (RL), which trains these models to maximize rewards for correct answers. However, a significant challenge remains: these powerful models often struggle with multi-turn problem-solving, where they need to reflect on their previous attempts and revise their answers based on feedback. Instead of adapting, models trained with traditional single-turn RL tend to repeat the same incorrect responses, leading to a frustrating user experience.

A recent research paper, titled “A Simple ‘Try Again’ Can Elicit Multi-Turn LLM Reasoning,” by Licheng Liu, Zihan Wang, Linjie Li, Chenwei Xu, Yiping Lu, Han Liu, Avirup Sil, and Manling Li, introduces a surprisingly simple yet effective solution to this problem. The core idea is to train LLMs using multi-turn reinforcement learning with only “unary feedback”: a minimal, generic signal like “Let’s try again” when an answer is wrong. This approach is called Unary Feedback as Observation (UFO).

The Problem with Single-Turn Training

Imagine trying to teach someone to solve a puzzle, but giving them exactly one attempt per puzzle: you tell them whether the answer was right or wrong, and they never get to try again after a miss. That’s similar to how many LLMs are trained. While they become excellent at providing a single correct answer, they often lose the ability to learn from in-context feedback. The researchers observed that in many cases, models trained with single-turn RL would generate the exact same incorrect answer across multiple turns, even when prompted to try again. This highlights a critical gap: real-world applications like chatbots and educational tools require models to adapt and refine their reasoning iteratively.

Introducing Unary Feedback as Observation (UFO)

UFO addresses this by reframing the problem-solving process as a multi-turn interaction. Instead of needing complex, detailed feedback, the model simply receives a generic “Try Again” signal if its answer is incorrect. If the answer is correct, the interaction ends. This simple mechanism allows existing single-turn datasets to be transformed into multi-turn training scenarios without requiring expensive human annotations or complex execution environments.
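To make the mechanism concrete, here is a minimal sketch of how a single question-and-answer pair might be unrolled into a multi-turn episode. It is an illustration under stated assumptions, not the paper's implementation: the names `run_episode`, `toy_policy`, and `max_turns`, the exact-match correctness check, and the literal feedback string are all hypothetical.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, text)

FEEDBACK = "Try Again"  # generic unary signal; the exact wording is an assumption


def run_episode(
    question: str,
    gold_answer: str,
    policy: Callable[[List[Message]], str],
    max_turns: int = 5,  # hypothetical turn budget
) -> Tuple[bool, int, List[str]]:
    """Unroll one single-turn example into a multi-turn interaction.

    Returns (solved, turns_used, attempts).
    """
    history: List[Message] = [("user", question)]
    attempts: List[str] = []
    for turn in range(1, max_turns + 1):
        # The policy sees the full history: the question, its own past
        # attempts, and the unary feedback that followed each of them.
        answer = policy(history)
        attempts.append(answer)
        history.append(("assistant", answer))
        # Assumed correctness check: exact string match with the gold answer.
        if answer.strip() == gold_answer.strip():
            return True, turn, attempts  # correct answer ends the interaction
        if turn < max_turns:
            history.append(("user", FEEDBACK))  # no hint about what went wrong
    return False, max_turns, attempts


def toy_policy(history: List[Message]) -> str:
    """Stand-in for an LLM: tries a different guess on each turn."""
    guesses = ["5", "6", "4"]
    turns_so_far = sum(1 for role, _ in history if role == "assistant")
    return guesses[min(turns_so_far, len(guesses) - 1)]


solved, turns, attempts = run_episode("What is 2 + 2?", "4", toy_policy)
print(solved, turns, attempts)
```

Note that the only data needed is the original question and its reference answer, which is why existing single-turn datasets can be reused without extra annotation.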

During training, the model learns to condition its responses on the full history of past attempts and the unary feedback. This encourages context-sensitive behaviors like error correction and hypothesis refinement. To further guide the model towards efficient and diverse reasoning, the researchers designed a clever reward structure. They introduced a “reward decay” that gives higher rewards for solving problems in fewer turns, promoting conciseness. Additionally, an “answer repetition penalty” discourages the model from generating identical responses, encouraging it to explore different strategies when it makes a mistake.
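One way to picture the two reward-shaping ideas is the sketch below. The functional form is an assumption made for illustration: the article only says that earlier solutions earn more and that repeated answers are discouraged, so the exponential factor `gamma`, the per-repeat cost `repeat_penalty`, and the function name `shaped_reward` are hypothetical rather than the paper's exact formula.

```python
from typing import List


def shaped_reward(
    attempts: List[str],
    solved: bool,
    gamma: float = 0.5,           # assumed decay factor per extra turn
    repeat_penalty: float = 0.1,  # assumed cost per repeated answer
) -> float:
    """Episode-level reward with turn decay and a repetition penalty."""
    # Reward decay: solving on turn 1 earns 1.0, turn 2 earns gamma, and so on.
    base = gamma ** (len(attempts) - 1) if solved else 0.0
    # Repetition penalty: count attempts that duplicate an earlier answer.
    repeats = len(attempts) - len(set(attempts))
    return base - repeat_penalty * repeats


print(shaped_reward(["4"], solved=True))             # solved immediately
print(shaped_reward(["5", "6", "4"], solved=True))   # solved on the third turn
print(shaped_reward(["5", "5", "4"], solved=True))   # same, but repeated a guess
print(shaped_reward(["5", "5", "5"], solved=False))  # never solved, only repeats
```

Under these assumed values, an episode that repeats a wrong answer scores lower than one that explores a new answer and solves the problem on the same turn, which is the behavior the two terms are meant to encourage.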

Impressive Results and Generalization

The experimental results are compelling. Models trained with UFO showed a significant improvement in multi-turn reasoning accuracy, with success rates up to 14% higher than those of previous single-turn RL approaches. What’s more, this improvement wasn’t limited to multi-turn scenarios; UFO also enhanced single-turn performance, suggesting that the revision skills learned in multi-turn training carry over to first-attempt answers. The benefits extended across various domains, including mathematical reasoning, question answering, and general knowledge tasks, demonstrating strong cross-task generalization.

The study also confirmed that explicit feedback prompts are crucial for effective revision. Models performed significantly better when they received a “Please think again” type of prompt compared to no feedback. Furthermore, the reward shaping strategies proved effective: exponential reward decay led to models solving problems in fewer turns, indicating more efficient problem-solving, and the repetition penalty successfully encouraged the generation of more diverse answers over time.
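Claims like “fewer turns” and “more diverse answers” can be tracked with simple episode-level statistics. The sketch below shows one plausible way to compute them. The metric definitions and the name `summarize` are assumptions for illustration, not the paper's evaluation code, and the sample episodes are made-up toy data rather than reported results.

```python
from statistics import mean
from typing import Dict, List, Tuple

Episode = Tuple[bool, int, List[str]]  # (solved, turns_used, attempts)


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate success rate, turns-to-solve, and answer diversity."""
    solved_turns = [turns for solved, turns, _ in episodes if solved]
    return {
        # Fraction of episodes solved within the turn budget.
        "success_rate": mean(1.0 if solved else 0.0 for solved, _, _ in episodes),
        # Average number of turns among solved episodes (lower is more efficient).
        "avg_turns_when_solved": mean(solved_turns) if solved_turns else float("nan"),
        # Share of distinct answers per episode (1.0 means no repeats).
        "unique_answer_ratio": mean(len(set(a)) / len(a) for _, _, a in episodes),
    }


# Toy data for illustration only.
toy_episodes: List[Episode] = [
    (True, 1, ["4"]),
    (True, 3, ["5", "6", "4"]),
    (False, 3, ["5", "5", "5"]),
]
print(summarize(toy_episodes))
```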

Conclusion

The UFO framework offers a lightweight, generalizable, and effective method for training LLMs to excel in multi-turn interactive problem-solving. By leveraging simple unary feedback and smart reward design, LLMs can learn to self-correct, explore diverse reasoning paths, and ultimately provide more accurate and efficient responses in conversational settings. This work highlights a crucial step towards building more adaptive and human-like AI assistants. You can find the full research paper here: A Simple “Try Again” Can Elicit Multi-Turn LLM Reasoning.
