TLDR: A new RAG method for hierarchical data (like code repositories) generates implicit, aggregated summaries from the bottom up instead of using raw data. This approach achieves comparable response quality while significantly reducing the number of documents (over 68% fewer) in the vector database, making RAG more efficient and scalable for complex structures.
Large Language Models (LLMs) have transformed many applications, from chatbots to advanced analytics, primarily due to their ability to learn from information provided in their context. A powerful technique that enhances LLMs is Retrieval-Augmented Generation (RAG). RAG first retrieves relevant documents from a knowledge base in response to a user’s query; the LLM then uses the retrieved information to generate an answer. While RAG is effective with unstructured data like plain text, applying it to complex, structured data, especially hierarchical structures like file trees, presents unique challenges.
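As a rough illustration of the retrieve-then-generate flow, here is a minimal sketch. It assumes a toy bag-of-words similarity in place of a real embedding model, and it stops at building the prompt rather than calling an LLM. The function names (`embed`, `cosine`, `retrieve`, `build_prompt`) and the sample documents are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_prompt(query: str, docs: list, k: int = 2) -> str:
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical knowledge base entries.
docs = [
    "The filter module removes sensor noise",
    "The README explains project setup",
    "PID gains control the throttle loop",
]
query = "how is sensor noise removed"
prompt = build_prompt(query, docs, k=1)
print(prompt)
```

In a real pipeline, the prompt would be passed to an LLM and the similarity search would run against a vector database rather than an in-memory list.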
The core issue lies in how to best represent this retrieved knowledge for generating accurate and high-quality responses on structured data. Simply taking raw, unstructured data from a hierarchical system, such as a GitHub repository, and feeding it into a RAG system can be inefficient and may lead to a loss of crucial contextual information. Raw code, for instance, can be very lengthy and token-heavy, making it difficult to capture its full essence efficiently.
Researchers at General Motors have introduced a novel approach to tackle this problem. Their work, detailed in the paper “Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-Based Structures”, proposes a bottom-up method to linearize knowledge from tree-like structures. This method involves generating implicit, aggregated summaries at each hierarchical level. Imagine a file system: instead of just storing every file’s raw content, this approach creates summaries for individual files (leaf nodes) and then uses those summaries to create higher-level summaries for their parent folders, and so on, all the way up to the root. This distilled, implicit knowledge is then stored in a knowledge base, ready for use with RAG.
This innovative method aims to create optimized, token-limited documents for the tree’s components by distilling their core content and context, rather than relying on raw data. The process ensures that a parent node (like a folder) gains a holistic understanding of its entire sub-tree by aggregating the implicit knowledge from its children.
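The bottom-up aggregation described above can be sketched as a post-order traversal of the tree. This is only an illustration under stated assumptions: `summarize` is a placeholder that truncates text where the real system would call an LLM, each node yields exactly one stored document, and the `Node` class, function names, and sample repository are all hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A file (leaf with content) or a folder (node with children)."""
    name: str
    content: str = ""
    children: List["Node"] = field(default_factory=list)


def summarize(text: str, max_words: int = 12) -> str:
    """Placeholder for an LLM summarization call: keeps the first max_words words."""
    return " ".join(text.split()[:max_words])


def build_summaries(
    node: Node, path: str = "", store: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Summarize children first, then build the parent summary from theirs."""
    if store is None:
        store = {}
    full_path = f"{path}/{node.name}" if path else node.name
    if not node.children:
        # Leaf: summarize the raw file content.
        store[full_path] = summarize(node.content)
    else:
        # Folder: recurse first so every child summary already exists.
        for child in node.children:
            build_summaries(child, full_path, store)
        child_text = " ".join(
            f"{child.name}: {store[full_path + '/' + child.name]}"
            for child in node.children
        )
        store[full_path] = summarize(child_text)
    return store


# Hypothetical repository layout.
repo = Node("repo", children=[
    Node("controls", children=[
        Node("pid.m", content="Tunes PID gains for the throttle loop"),
        Node("filter.m", content="Low-pass filter for sensor noise"),
    ]),
    Node("README.md", content="Project overview and setup notes"),
])

store = build_summaries(repo)
for path, summary in store.items():
    print(f"{path} -> {summary}")
```

Because the traversal is post-order, the root summary is produced last and is built only from summaries, never from raw file content. The resulting `store` holds one token-limited entry per file and folder, which is what would be indexed for retrieval in this sketch.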
The study compared this new implicit knowledge generation method against a baseline RAG approach that directly indexed and stored the raw content of code files and their associated metadata. For the experiments, the authors used a proprietary code repository from General Motors, which contained various MATLAB Simulink scripts organized hierarchically but lacking a cohesive logical structure, making it an ideal real-world test case.
The results were compelling. While the quality of generated responses was comparable between both methods, the proposed implicit knowledge approach demonstrated significant gains in efficiency. It required over 68% fewer documents in the retriever, leaving less than a third as many documents in the vector database. This suggests that leveraging implicit, linearized knowledge is a highly effective and scalable strategy for handling complex, hierarchical data structures, especially for “folder-level” queries that demand a broader contextual understanding.
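A reduction quoted as a percentage and one quoted as a "times fewer" factor are easy to mix up. As a quick arithmetic check (the helper name is hypothetical, and 68% is the lower bound reported above):

```python
def reduction_factor(fraction_removed: float) -> float:
    """How many times smaller the index is after removing this fraction of documents."""
    if not 0.0 <= fraction_removed < 1.0:
        raise ValueError("fraction_removed must be in [0, 1)")
    return 1.0 / (1.0 - fraction_removed)


factor_at_68 = reduction_factor(0.68)  # about 3.1
factor_at_75 = reduction_factor(0.75)  # 4.0
print(f"68% fewer: {factor_at_68:.2f}x, 75% fewer: {factor_at_75:.2f}x")
```

So removing 68% of the documents corresponds to roughly a 3.1-fold reduction, while a full four-fold reduction would require removing 75%.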
The findings indicate that LLMs do not necessarily require raw, verbose data to generate high-quality responses. Instead, providing them with high-quality implicit knowledge, distilled from the raw data, can lead to equal or even superior performance while using significantly less data. This directly addresses a major challenge in RAG pipelines where an increasing number of documents can lead to performance degradation.
This research opens up exciting avenues for future exploration, including generalizing this implicit knowledge generation method to non-hierarchical structures such as knowledge graphs, and further optimizing the representation of implicit knowledge. Ultimately, this work suggests that structure-aware RAG methods can pave the way for more accurate, efficient, and scalable retrieval systems.


