TLDR: A new study reveals that while Large Language Models (LLMs) show promise in decision-making tasks, providing them with additional feedback, such as past actions or rewards, often leads to a decline in their performance, especially in complex environments. This counter-intuitive finding suggests that LLMs struggle with integrating and reasoning from too much context, highlighting limitations in their planning abilities without specific fine-tuning or advanced guidance.
Large Language Models, or LLMs, have shown considerable promise in understanding and generating human-like text. This capability naturally raises questions about their potential in complex decision-making scenarios, especially in autonomous systems. A recent research paper titled “Feedback-Induced Performance Decline in LLM-Based Decision-Making” explores this very topic, examining how these models behave in environments that require sequential decision-making under uncertainty, which are typically formalized as Markov Decision Processes (MDPs).
The study, conducted by Xiao Yang, Juxi Leitner, and Michael Burke from Monash University, set out to investigate whether LLMs could leverage their vast pre-trained knowledge for faster adaptation in these decision-making tasks, potentially outperforming traditional methods like Reinforcement Learning (RL). RL typically relies on extensive trial-and-error exploration, which can be slow and inefficient in real-world applications.
Unexpected Findings on Feedback
The researchers initially hypothesized that LLMs, guided by structured prompting strategies, could excel in these scenarios. However, their findings revealed a surprising and counter-intuitive outcome: while LLMs showed improved initial performance in simpler environments, they struggled significantly with planning and reasoning in more complex situations. Even more notably, feedback mechanisms, which are usually intended to help improve decision-making, often introduced confusion and led to a decline in performance.
This means that simply giving the LLM more information about its previous actions, the environment’s changes, or the rewards it received, did not necessarily make it smarter. In fact, it often made the model perform worse. This suggests that LLMs, on their own, might not be able to effectively plan or reason, and that naive prompting strategies with additional context can actually hinder their effectiveness.
Key Contributions of the Research
The paper makes several important contributions. It provides a thorough evaluation of LLM-based decision-making policies in various MiniGrid environments, comparing them against classical RL methods. The study highlights that despite their extensive prior knowledge, LLMs lack the fundamental grounding and reasoning skills to effectively use this knowledge for problem-solving without extra guidance. Crucially, it demonstrates that incorporating feedback can lead to a degradation of the AI’s policy, where irrelevant or misleading information distracts the model from making good decisions.
How the Study Was Conducted
The researchers formulated the problem as a Markov Decision Process, where an AI agent receives observations from an environment and selects actions to maximize cumulative rewards over time. They tested different prompting strategies for the LLMs, ranging from providing only the current state to including memory of past interactions, immediate reward feedback, and even cumulative reward and policy feedback. The goal was to see how different levels of contextual information influenced the LLM’s decision-making.
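To make the idea of escalating feedback concrete, here is a minimal sketch of how a prompt could grow from state-only to cumulative-reward feedback. It is not the paper's prompt format: the level names, the `History` class, the `build_prompt` function, and the wording of the prompt are all hypothetical, chosen only to illustrate how each added layer of context lengthens what the model has to read.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical feedback levels, loosely mirroring the strategies described
# in the article (state only -> memory -> reward -> cumulative reward).
LEVELS = ("state_only", "memory", "reward", "cumulative")


@dataclass
class History:
    """Stores past (observation, action, reward) triples for one episode."""
    steps: List[Tuple[str, str, float]] = field(default_factory=list)

    def add(self, obs: str, action: str, reward: float) -> None:
        self.steps.append((obs, action, reward))


def build_prompt(obs: str, history: History, level: str, max_steps: int = 5) -> str:
    """Assemble a text prompt; higher levels append more feedback."""
    if level not in LEVELS:
        raise ValueError(f"unknown feedback level: {level}")
    rank = LEVELS.index(level)

    lines = [
        "You control an agent in a grid world. Reach the goal.",
        f"Current observation: {obs}",
    ]

    # Level 1 and above: include a window of recent interactions.
    recent = history.steps[-max_steps:]
    if rank >= 1 and recent:
        lines.append("Past steps:")
        for i, (past_obs, action, reward) in enumerate(recent, start=1):
            entry = f"{i}. saw '{past_obs}', did '{action}'"
            # Level 2 and above: attach the immediate reward to each step.
            if rank >= 2:
                entry += f", reward {reward:.2f}"
            lines.append(entry)

    # Level 3: add the running total over the whole episode.
    if rank >= 3:
        total = sum(reward for _, _, reward in history.steps)
        lines.append(f"Cumulative reward so far: {total:.2f}")

    lines.append("Reply with exactly one action: left, right, or forward.")
    return "\n".join(lines)
```

Calling `build_prompt` with the same observation and history at each level in `LEVELS` shows how quickly the context expands, which is the kind of added material the study found could distract the model rather than help it.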
Experiments were conducted using the MiniGrid environment, which offers grid-based worlds of varying complexity, from a simple 5×5 grid to a more challenging 9×9 grid with internal walls. The LLMs used were Llama 3.1 8B and Qwen 2.5 1.5B. Each approach was evaluated over 100 episodes, measuring cumulative reward and success rate.
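The evaluation protocol described above (run a policy for a fixed number of episodes, then report average cumulative reward and success rate) can be sketched as follows. This is an illustration under stated assumptions, not the study's code: `ToyGrid` is a hypothetical stand-in for MiniGrid with its own simplified reward, and `random_policy` and `greedy_policy` are placeholder policies where an LLM-backed or RL-trained policy would go.

```python
import random
from typing import Callable, Tuple

Position = Tuple[int, int]


class ToyGrid:
    """Tiny open grid, start top-left, goal bottom-right.

    A hypothetical stand-in for illustration only; it is not MiniGrid.
    """
    MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

    def __init__(self, size: int = 5, max_steps: int = 50) -> None:
        self.size = size
        self.max_steps = max_steps
        self.pos: Position = (0, 0)
        self.t = 0

    def reset(self) -> Position:
        self.pos = (0, 0)
        self.t = 0
        return self.pos

    def step(self, action: str) -> Tuple[Position, float, bool, bool]:
        dx, dy = self.MOVES[action]
        # Clamp to the grid so moves into the border leave the agent in place.
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos = (x, y)
        self.t += 1
        success = self.pos == (self.size - 1, self.size - 1)
        # Assumed toy reward: reaching the goal pays more when done quickly.
        reward = 1.0 - 0.9 * self.t / self.max_steps if success else 0.0
        done = success or self.t >= self.max_steps
        return self.pos, reward, done, success


def evaluate(policy: Callable[[Position], str], env: ToyGrid,
             episodes: int = 100) -> Tuple[float, float]:
    """Return (average cumulative reward, success rate) over the episodes."""
    total_reward = 0.0
    successes = 0
    for _ in range(episodes):
        obs = env.reset()
        done = False
        success = False
        while not done:
            obs, reward, done, success = env.step(policy(obs))
            total_reward += reward
        successes += int(success)
    return total_reward / episodes, successes / episodes


def random_policy(obs: Position) -> str:
    """Baseline that ignores the observation entirely."""
    return random.choice(list(ToyGrid.MOVES))


def greedy_policy(obs: Position) -> str:
    """Hand-written policy for the default 5x5 toy grid: go right, then down."""
    return "right" if obs[0] < 4 else "down"
```

Running `evaluate(random_policy, ToyGrid())` and `evaluate(greedy_policy, ToyGrid())` side by side gives the same two metrics the article reports, and swapping in a different policy function is all that is needed to compare approaches on equal terms.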
Results: LLMs vs. Traditional RL
The results showed that traditional Reinforcement Learning policies achieved near-perfect success rates and high average rewards across all configurations. In contrast, LLM-based policies, while sometimes better than a random policy, generally performed worse than the RL baseline. The most striking observation was the consistent performance decline as more forms of feedback were provided to the LLMs, especially in more complex environments. For instance, in the most complex configuration with internal obstacles, all of the LLMs performed poorly, and adding feedback worsened their performance further.
Even advanced ‘reasoning models’ tested with one-shot prompting showed mixed results, succeeding in simpler tasks but failing to generate effective plans for more complex ones. This further supports the idea that LLMs struggle with genuine planning and reasoning, often relying more on memory retrieval than true problem-solving.
Looking Ahead
The findings underscore the limitations of current prompt-based methods for LLMs in complex sequential decision-making. Simply adding more feedback can dilute the model’s attention, diverting its focus from critical task-relevant signals and ultimately reducing its effectiveness. This research suggests a need for further exploration into hybrid strategies, fine-tuning, and advanced memory integration to truly enhance LLM-based decision-making capabilities. For more details, see the full research paper, “Feedback-Induced Performance Decline in LLM-Based Decision-Making.”


