TLDR: Reinforcement learning (RL) in large language models (LLMs) improves their ability to navigate and search existing hierarchical knowledge; it neither degrades memorized knowledge nor adds new facts. A new study demonstrates that RL-enhanced models outperform base models on structured knowledge recall tasks. This is supported by experiments showing that structured prompting can mimic some RL gains, and by internal analysis revealing that RL primarily transforms how models process queries, not the factual knowledge itself.
A recent study challenges the conventional wisdom that reinforcement learning (RL) in large language models (LLMs) comes at the cost of degrading memorized knowledge. Instead, the research suggests that RL-enhanced models consistently outperform their base and supervised fine-tuned (SFT) counterparts on tasks requiring pure knowledge recall, especially when that knowledge is hierarchical and structured, like medical codes.
The paper, titled “Reinforcement Learning Improves Traversal of Hierarchical Knowledge in LLMs” by Renfei Zhang, Manasa Kaniselvan, and Niloofar Mireshghallah, proposes a compelling hypothesis: these performance gains don’t come from acquiring new data or facts. Rather, they stem from improved procedural skills in navigating and searching through the existing knowledge hierarchies already embedded within the model’s parameters.
Unpacking the Hypothesis: Navigation Over New Knowledge
To test this idea, the researchers conducted three key experiments. The first focused on whether explicit prompting could bridge the performance gap between SFT and RL models. They found that structured prompting, which explicitly guides SFT models through hierarchical traversal, significantly reduced the performance difference. For instance, on the MedConceptsQA dataset, structured prompting cut the gap from 24 percentage points to just 7 percentage points for DeepSeek-V3/R1. This suggests that the necessary knowledge is present in SFT models but requires better navigation strategies to be accessed effectively.
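As a rough illustration of structured prompting, the sketch below builds a prompt that walks a model down a hierarchy one level at a time before asking for the final answer. The function name `build_structured_prompt`, the level names, and the prompt wording are all assumptions made for this example; the article does not reproduce the study's actual prompts.

```python
def build_structured_prompt(code: str, levels: list[str]) -> str:
    """Build a prompt that asks for each hierarchy level before the answer.

    `levels` lists the hierarchy from broadest to narrowest. The level names
    used below are hypothetical, not taken from the paper.
    """
    # One instruction per hierarchy level, numbered from 1.
    steps = [
        f"Step {i}: Identify the {level} that code {code} belongs to."
        for i, level in enumerate(levels, start=1)
    ]
    # Only after the traversal steps is the model asked for the final answer.
    final = (
        f"Step {len(levels) + 1}: Using the steps above, "
        f"state what code {code} refers to."
    )
    return "\n".join([f"Question: What is code {code}?", *steps, final])


prompt = build_structured_prompt("57.95", ["chapter", "category", "subcategory"])
print(prompt)
```

The point of the design is that the traversal is spelled out in the prompt, so the model is not left to discover the navigation strategy on its own.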
The second experiment examined how reasoning models handle deeper hierarchies. Using an expanded International Patent Classification (IPC) dataset, the researchers introduced a “Path Matching Score” to measure how accurately a model traverses the hierarchy. As retrieval grew more complex (requiring more steps in the hierarchy), RL-enhanced models showed higher path recall accuracy, and the gap widened from 5 percentage points on simpler tasks to 9 percentage points on more complex ones. This indicates that the advantage of reasoning models grows with the depth of the hierarchy they must navigate.
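The article does not give the exact formula for the Path Matching Score, so the following is only one plausible way to score a hierarchical traversal. It is not the paper's definition. It counts how many levels of the correct path a predicted path reproduces, starting at the root and stopping at the first mismatch. The function name and the example paths are illustrative assumptions.

```python
def path_match_score(predicted: list[str], gold: list[str]) -> float:
    """Fraction of the gold path matched from the root down.

    Assumed scoring rule for illustration only: matching stops at the
    first level where the predicted path leaves the gold path.
    """
    if not gold:
        return 0.0
    matched = 0
    for pred_node, gold_node in zip(predicted, gold):
        if pred_node != gold_node:
            break
        matched += 1
    return matched / len(gold)


# Illustrative IPC-style path: section, class, subclass, group.
gold_path = ["A", "A61", "A61K", "A61K 31/00"]
predicted_path = ["A", "A61", "A61P", "A61P 31/00"]

score = path_match_score(predicted_path, gold_path)
print(score)  # 0.5: the first two of four levels match
```

A prefix-based rule like this rewards getting the upper levels right even when the final leaf is wrong, which is why a path score can separate models more finely than exact-match accuracy.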
Internal Insights: Query Processing vs. Factual Knowledge
The third and perhaps most insightful experiment involved a layer-wise internal activation analysis. By comparing how SFT and RL models process both factual statements (e.g., “code 57.95 refers to urinary infection”) and interrogative queries (e.g., “what is code 57.95”), the researchers uncovered a crucial distinction. They found that factual representations maintained high cosine similarity between SFT and RL models, meaning the underlying knowledge itself remained largely unchanged. However, query representations diverged noticeably. This suggests that RL primarily transforms how models process questions and traverse knowledge, rather than altering the knowledge representation itself.
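The comparison described above can be sketched as a per-layer cosine similarity between the two models' activations for the same input. The version below is a minimal pure-Python sketch under stated assumptions: each model is represented by one activation vector per layer, and the numbers are toy values chosen for illustration, not measurements from the study.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, chosen for this sketch
    return dot / (norm_u * norm_v)


def layerwise_similarity(
    acts_a: list[list[float]], acts_b: list[list[float]]
) -> list[float]:
    """Cosine similarity per layer between two models' activations."""
    if len(acts_a) != len(acts_b):
        raise ValueError("both models must expose the same number of layers")
    return [cosine_similarity(a, b) for a, b in zip(acts_a, acts_b)]


# Toy two-layer activations (hypothetical values, not real model outputs).
sft_statement = [[1.0, 0.0], [0.0, 2.0]]
rl_statement = [[2.0, 0.0], [0.0, 1.0]]  # same directions as SFT
sft_query = [[1.0, 0.0], [1.0, 0.0]]
rl_query = [[1.0, 0.0], [0.0, 1.0]]  # second layer points elsewhere

statement_sims = layerwise_similarity(sft_statement, rl_statement)
query_sims = layerwise_similarity(sft_query, rl_query)
print(statement_sims, query_sims)
```

In this toy setup the statement activations stay aligned at every layer while the query activations diverge at the second layer, which mirrors the qualitative pattern the researchers report.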
Implications for LLM Development
These findings carry significant implications for how we understand and train LLMs. They challenge the notion of an “alignment tax,” where RLHF (Reinforcement Learning from Human Feedback) is thought to degrade factual memorization. Instead, this research suggests that RL enhances a model’s “cognitive scaffolding” – its ability to systematically navigate structures already encoded during pretraining. This aligns with recent work indicating that RL helps surface existing intelligence within models.
The study encourages future research to explore these phenomena across broader domains and to develop RL mechanisms that explicitly optimize for hierarchical navigation. For the full methodology and results, see the research paper.


