TLDR: This research investigates how Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) shape Large Language Models’ (LLMs) exploration strategies in multi-armed bandit tasks. By introducing novel reward signals for RL, the study shows that trained LLMs achieve strong performance and robust generalization, comparable to optimal baselines. However, a behavioral analysis reveals that these gains often come with an emergent, greedier exploitation bias that leads to premature abandonment of exploration. The findings highlight the need for tailored reward design and evaluation beyond average regret to ensure robust exploratory behavior in LLM agents.
Large Language Models, or LLMs, are increasingly seen as the foundation for future autonomous agents. However, a significant hurdle remains: their limited ability to explore effectively in situations requiring sequential decision-making. This challenge is particularly evident in the classic ‘multi-armed bandit’ problem, where an agent must repeatedly choose between several options, each offering a different, often uncertain, reward. LLMs frequently fall into the trap of being too ‘greedy,’ focusing on immediate, known rewards rather than exploring new options that might yield better long-term gains.
Recent efforts to improve this have focused on two main training approaches: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). SFT involves training an LLM to mimic the behavior of an expert algorithm, like Upper Confidence Bound (UCB), by showing it many examples of optimal decisions. RL, on the other hand, allows the model to learn directly from rewards received from its environment. When trained this way, LLMs essentially become ‘meta-bandit agents,’ capable of applying learned exploration strategies to new and unfamiliar environments.
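To make this concrete, here is a minimal Python sketch (not the authors’ code) of a Gaussian bandit together with the UCB rule that serves as the expert in the SFT setup. The arm means, horizon, noise level, and exploration constant are illustrative assumptions.

```python
import math
import random

def pull(arm_means, arm, noise_std=1.0):
    """Sample a noisy Gaussian reward for the chosen arm."""
    return random.gauss(arm_means[arm], noise_std)

def ucb_action(counts, sums, t, c=2.0):
    """UCB1-style rule: empirical mean plus a bonus that shrinks with pulls."""
    for arm, n in enumerate(counts):
        if n == 0:  # pull every arm once before using the bonus
            return arm
    scores = [
        sums[arm] / counts[arm] + math.sqrt(c * math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)

def run_ucb(arm_means, horizon=100):
    """Run the UCB expert for a fixed horizon and return total reward and pull counts."""
    counts = [0] * len(arm_means)
    sums = [0.0] * len(arm_means)
    total = 0.0
    for t in range(1, horizon + 1):
        arm = ucb_action(counts, sums, t)
        r = pull(arm_means, arm)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return total, counts

if __name__ == "__main__":
    print(run_ucb([0.2, 0.5, 0.8]))
```

In SFT, trajectories produced by an expert like this become the demonstrations the LLM is trained to imitate; in RL, the model instead acts in the environment and learns from the rewards it collects.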
This research delves into both SFT and RL, exploring how these methods shape an LLM’s exploration strategies and how well these strategies generalize. The team trained LLMs using SFT on expert demonstrations and RL with a variety of specially designed reward signals. Beyond the standard bandit rewards, they introduced two innovative reward types for RL:
- Strategic Reward (RL-STR): This reward is based on ‘regret,’ which measures the difference between the optimal possible reward and the reward the agent actually received. By focusing on regret, this method helps stabilize the learning process, especially in environments with highly variable rewards.
- Algorithmic Reward (RL-ALG): This is a simpler, binary reward given when the LLM’s action matches the decision of an expert algorithm, such as UCB. This approach simplifies the ‘credit assignment problem’ – where it’s hard to tell which past actions led to a good or bad outcome – making learning more efficient. Both reward signals are sketched in the code below.
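The sketch below shows how these two signals could be computed, based purely on the descriptions above. The function names and the use of negative per-step regret for RL-STR are assumptions of this illustration, not the paper’s implementation.

```python
def strategic_reward(chosen_mean, best_mean):
    """RL-STR (as described above): a regret-based signal, i.e. the gap between
    the best arm's expected reward and the chosen arm's expected reward.
    Returning the negative regret as the reward is an assumption of this sketch."""
    return -(best_mean - chosen_mean)

def algorithmic_reward(llm_action, expert_action):
    """RL-ALG: binary reward for matching the expert (e.g. UCB) decision."""
    return 1.0 if llm_action == expert_action else 0.0
```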
The findings were quite promising. The trained LLM agents significantly outperformed pre-trained models and achieved performance comparable to established optimal strategies like UCB and Thompson Sampling. They also showed impressive generalization, performing well even when faced with tasks six times longer than their training horizon and across entirely different types of bandit problems (e.g., from Gaussian to Bernoulli reward distributions).
Specifically, the RL-ALG approach, which imitated the UCB oracle, consistently delivered the best results among the learned policies. The strategic reward (RL-STR) also proved beneficial, improving training efficiency in environments with high reward variance. While SFT policies were competitive, RL agents demonstrated more robust generalization across different bandit families. Interestingly, smaller LLMs (3B parameters) struggled with RL when relying solely on environmental rewards but improved significantly when guided by a teacher signal, as in RL-ALG or SFT.
However, the research uncovered a crucial insight: the performance gains often came at a cost. A detailed behavioral analysis revealed that these LLM agents, despite their sophistication, developed a more ‘greedy’ exploitation bias. They were more prone to ‘early catastrophic failure’ – prematurely abandoning exploration of potentially better options – compared to pre-trained models. For instance, agents trained to imitate UCB sometimes learned to outperform their teacher by adopting more exploitative variants of the algorithm, stopping exploration of an arm if it didn’t yield satisfactory short-term rewards.
The LLMs’ internal reasoning, when examined, often showed templated heuristics that prioritized the arm with the highest average reward. While some exploration was driven by UCB-like calculations, the learned UCB variants often depended only on the number of times a specific arm was pulled, rather than the total number of pulls. This subtle difference allowed for premature abandonment of arms. SFT policies, while initially mimicking the teacher more closely, were also susceptible to overfitting and even exhibited systematic arithmetic errors when encountering negative rewards, leading to fragile generalization.
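The difference is easy to see in the bonus terms. In standard UCB, the bonus for a neglected arm keeps growing as the total pull count rises, so the arm eventually gets revisited; a bonus that depends only on the arm’s own pull count stays frozen once the arm is ignored. The exact form of the learned variant below is an assumption, chosen only to illustrate that effect.

```python
import math

def ucb_bonus(total_pulls_t, arm_pulls_n, c=2.0):
    """Standard UCB bonus: keeps growing for a neglected arm as t increases,
    so that arm is eventually re-tried."""
    return math.sqrt(c * math.log(total_pulls_t) / arm_pulls_n)

def learned_variant_bonus(arm_pulls_n, c=1.0):
    """Illustrative bonus depending only on the arm's own pull count:
    once the arm stops being pulled, its bonus never grows again,
    allowing premature, permanent abandonment."""
    return c / math.sqrt(arm_pulls_n)

# A neglected arm that was pulled only 3 times early on:
for t in (10, 100, 1000):
    print(t, round(ucb_bonus(t, 3), 3), round(learned_variant_bonus(3), 3))
# The standard bonus rises with t; the learned variant's bonus stays flat.
```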
This study, titled *When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training*, highlights that while SFT and RL can dramatically improve LLM performance in sequential decision-making, they can also inadvertently foster short-sighted, exploitative behaviors. The emergent greediness arises because exploration signals are easily overshadowed by the far more frequent exploitation steps in the training data. The findings emphasize the need for carefully designed reward functions and evaluation metrics that look beyond average performance to truly promote robust and effective exploratory behavior in LLMs.