Unlocking Large Codebases: A Vector Graph System for Smarter File Retrieval

TLDR: This research introduces a system that transforms large software repositories into a vectorized knowledge graph. The graph captures a project’s architectural and semantic structure using LLM-derived summaries and vector embeddings. A hybrid retrieval pipeline combines semantic search, graph traversal, and LLM-based discovery to find the files relevant to a natural language query, such as a bug description or feature request, which supports the automation of software development tasks.

Large software repositories, often containing thousands of files and millions of lines of code, pose a significant challenge for developers and modern AI models alike. The sheer scale of these codebases far exceeds the context window limitations of even the most advanced Large Language Models (LLMs), making it difficult to apply AI for tasks like automatic bug fixes or feature additions. A new research paper introduces an innovative solution: a system that converts these vast repositories into a vectorized knowledge graph, designed to mirror the project’s architectural and semantic structure.

This system, detailed in the paper “Vector Graph-Based Repository Understanding for Issue-Driven File Retrieval”, aims to bridge the gap between complex codebases and the power of LLMs. The core idea is to create a comprehensive knowledge graph that encodes syntactic relationships such as containment, implementation, references, calls, and inheritance. Crucially, this graph is enriched with LLM-derived summaries and vector embeddings for its nodes, capturing both the structural and semantic essence of the code.

The Dynamic Knowledge Graph

At the heart of this approach is the concept of a Dynamic Knowledge Graph (DKG). Unlike a static graph, a DKG evolves over time with every new commit to a repository, ensuring it remains a current and accurate representation of the codebase. The graph initialization process involves parsing a repository at a specific commit ID, building a skeleton of folder and file nodes, and then extracting programming language elements like classes, functions, and their interdependencies. Docstrings, comments, and raw source code are all incorporated, and LLMs are used to generate concise, intent-focused natural-language summaries and vector embeddings for these entities.

The graph’s schema defines various node types, including Folders, Files, Classes, Functions, MemberFunctions, and a Root node for the repository itself. Edge types represent the relationships between these nodes, such as ‘Contains’ (connecting files, folders, functions, classes), ‘Inherits’ (between classes), ‘Tests’ (connecting files, functions), ‘Implements’ (connecting files, classes, functions), and ‘Calls’ (between functions). This rich structure allows the system to uncover even the most hidden dependencies within a repository.
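The schema described above can be sketched as a small data structure. This is an illustrative sketch only, not the paper’s implementation: the class names (Node, KnowledgeGraph), the field names, and the example node identifiers are assumptions chosen for readability, while the node and edge type names are taken from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    ROOT = "Root"
    FOLDER = "Folder"
    FILE = "File"
    CLASS = "Class"
    FUNCTION = "Function"
    MEMBER_FUNCTION = "MemberFunction"


class EdgeType(Enum):
    CONTAINS = "Contains"
    INHERITS = "Inherits"
    TESTS = "Tests"
    IMPLEMENTS = "Implements"
    CALLS = "Calls"


@dataclass
class Node:
    node_id: str
    node_type: NodeType
    summary: str = ""  # placeholder for an LLM-derived summary
    embedding: list[float] = field(default_factory=list)  # placeholder vector


@dataclass
class KnowledgeGraph:
    nodes: dict[str, Node] = field(default_factory=dict)
    # Each edge is stored as (source_id, edge_type, target_id).
    edges: list[tuple[str, EdgeType, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, edge_type: EdgeType, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append((source, edge_type, target))

    def neighbors(self, source: str, edge_type: EdgeType) -> list[str]:
        """Return targets reachable from `source` over one edge of `edge_type`."""
        return [t for s, e, t in self.edges if s == source and e == edge_type]


# Hypothetical repository skeleton: root -> folder -> file -> function.
graph = KnowledgeGraph()
graph.add_node(Node("repo", NodeType.ROOT))
graph.add_node(Node("src", NodeType.FOLDER))
graph.add_node(Node("src/parser.py", NodeType.FILE))
graph.add_node(Node("src/parser.py::parse", NodeType.FUNCTION))
graph.add_edge("repo", EdgeType.CONTAINS, "src")
graph.add_edge("src", EdgeType.CONTAINS, "src/parser.py")
graph.add_edge("src/parser.py", EdgeType.CONTAINS, "src/parser.py::parse")
```

Storing edges as typed triples keeps the sketch small; a production system would more likely use a graph database or an indexed adjacency structure.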

Intelligent Search and Retrieval

The primary application of this DKG is an advanced retrieval pipeline that helps developers find the most relevant source code files for a given natural language query, such as a bug description or a new feature request. This “Search Relevant” algorithm combines several powerful techniques:

Query Preprocessing: An LLM normalizes and enriches the user’s input query before retrieval, a step the paper’s evaluation credits with improving recall.
Semantic Search: The preprocessed query is converted into a vector embedding, which is then used to perform a semantic search across all nodes in the knowledge graph. This involves calculating the cosine similarity between the query embedding and the embeddings of each node, resulting in a ranked list of potentially relevant code entities.
Graph Traversal: To further refine and expand the search results, the system traverses the knowledge graph. This stage enriches the initial list of files by following specific relation types (e.g., ‘Calls’, ‘Implements’) and node types, allowing it to discover closely related and connected files that might not have been found through semantic similarity alone.
LLM Discovery: Recognizing that user queries sometimes explicitly mention file names or paths, an LLM is employed to extract these directly from the input. These explicitly mentioned files are then combined with the results from the semantic search and graph traversal stages.
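The semantic search, graph traversal, and merging stages listed above can be sketched in a few functions. This is a minimal sketch under stated assumptions, not the paper’s algorithm: embeddings are tiny hand-written vectors, nodes are whole files rather than finer-grained entities, the list of explicitly mentioned files stands in for the LLM extraction step, and all function names and file paths are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_embedding, node_embeddings, top_k):
    """Rank nodes by cosine similarity to the query and keep the best top_k."""
    scored = [
        (cosine_similarity(query_embedding, emb), node_id)
        for node_id, emb in node_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node_id for _, node_id in scored[:top_k]]


def expand_by_traversal(seeds, edges, allowed_relations, max_hops=1):
    """Follow allowed relation types outward from the seed nodes."""
    found = list(seeds)
    seen = set(seeds)
    frontier = list(seeds)
    for _ in range(max_hops):
        next_frontier = []
        for source, relation, target in edges:
            if source in frontier and relation in allowed_relations and target not in seen:
                seen.add(target)
                found.append(target)
                next_frontier.append(target)
        frontier = next_frontier
    return found


def search_relevant(query_embedding, mentioned_files, node_embeddings, edges, top_k):
    """Merge explicitly mentioned files with semantic and traversal results."""
    seeds = semantic_search(query_embedding, node_embeddings, top_k)
    expanded = expand_by_traversal(seeds, edges, {"Calls", "Implements"})
    merged = []
    for node_id in list(mentioned_files) + expanded:
        if node_id not in merged:
            merged.append(node_id)
    return merged


# Hypothetical data: three files with toy 3-dimensional embeddings.
node_embeddings = {
    "auth/login.py": [0.9, 0.1, 0.0],
    "auth/session.py": [0.1, 0.9, 0.0],
    "db/users.py": [0.0, 0.1, 0.9],
}
edges = [
    ("auth/login.py", "Calls", "db/users.py"),
    ("ui/theme.py", "Calls", "auth/session.py"),
]
results = search_relevant(
    query_embedding=[1.0, 0.0, 0.0],
    mentioned_files=["config/settings.py"],  # stand-in for LLM-extracted paths
    node_embeddings=node_embeddings,
    edges=edges,
    top_k=1,
)
print(results)  # ['config/settings.py', 'auth/login.py', 'db/users.py']
```

Here db/users.py has almost no similarity to the query but is still returned, because it is reached over a ‘Calls’ edge from the top semantic hit. That is the kind of file the traversal stage is meant to recover.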

The researchers evaluated their algorithm using GitHub issues and the corresponding pull requests as ground truth across several repositories, including React, Poetry, pytest, junit-framework, and ESLint. Their findings indicate that the combined approach, particularly with query preprocessing and LLM discovery, significantly improves the recall of relevant files.
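To make the evaluation setup concrete, recall can be computed by comparing the retrieved files against the files changed in the pull request that resolved the issue. The sketch below assumes a recall-at-k formulation; the exact metric definition and cutoff used in the paper are not stated here, and the file names are hypothetical.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Fraction of ground-truth files that appear in the top-k retrieved files."""
    truth = set(ground_truth)
    if not truth:
        return 0.0
    hits = set(retrieved[:k]) & truth
    return len(hits) / len(truth)


# Hypothetical example: the pull request touched three files.
retrieved = ["a.py", "b.py", "c.py", "d.py"]
changed_in_pr = ["b.py", "d.py", "e.py"]

print(recall_at_k(retrieved, changed_in_pr, k=3))  # one of three files found
print(recall_at_k(retrieved, changed_in_pr, k=4))  # two of three files found
```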

Beyond Retrieval: Clustering and Applications

The system also incorporates repository clustering to group similar files. The idea is that files that frequently change together or share semantic similarities are often related. Three complementary clustering methods are used to provide multi-view representations of the codebase: semantic (embedding-based) clustering, Louvain (graph-based community detection), and label propagation. These views help in understanding the codebase and can further enhance retrieval by prioritizing files from the same cluster as the query context.
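Of the three methods, label propagation is the simplest to illustrate. The sketch below is a deterministic toy variant, not the paper’s implementation: it assumes edges weighted by co-change counts, visits nodes in sorted order, and breaks ties by choosing the smallest label. Standard label propagation usually visits nodes in random order. The file names and weights are hypothetical.

```python
from collections import defaultdict


def label_propagation(weighted_edges, max_iters=20):
    """Cluster nodes by repeatedly adopting the heaviest label among neighbors."""
    neighbors = defaultdict(dict)
    for u, v, weight in weighted_edges:
        neighbors[u][v] = weight
        neighbors[v][u] = weight

    # Every node starts in its own cluster.
    labels = {node: node for node in neighbors}
    for _ in range(max_iters):
        changed = False
        for node in sorted(neighbors):
            votes = defaultdict(int)
            for other, weight in neighbors[node].items():
                votes[labels[other]] += weight
            best = max(votes.values())
            # Deterministic tie-break: smallest label among the top-voted ones.
            winner = min(label for label, score in votes.items() if score == best)
            if labels[node] != winner:
                labels[node] = winner
                changed = True
        if not changed:
            break

    clusters = defaultdict(set)
    for node, label in labels.items():
        clusters[label].add(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical co-change counts: two tightly coupled groups and one weak link.
co_changes = [
    ("auth/login.py", "auth/session.py", 3),
    ("auth/login.py", "auth/token.py", 3),
    ("auth/session.py", "auth/token.py", 3),
    ("auth/token.py", "ui/button.py", 1),
    ("ui/button.py", "ui/form.py", 3),
    ("ui/button.py", "ui/theme.py", 3),
    ("ui/form.py", "ui/theme.py", 3),
]
clusters = label_propagation(co_changes)
print(clusters)
```

With these weights the weak link between auth/token.py and ui/button.py is outvoted on both sides, so the two groups stay separate.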

The DKG is integrated into an agent-based coding assistant, demonstrating its practical utility in day-to-day engineering workflows. This assistant helps with tasks like impact analysis (identifying the scope of a change), bug localization (narrowing down suspect files), and architecture conformance. By translating natural language intents into deterministic, read-only graph queries, the system provides safe and interpretable answers, complete with provenance links to verifiable sources.
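Impact analysis is a good example of a read-only graph query. A minimal sketch, assuming the graph exposes ‘Calls’ edges as (caller, callee) pairs: walking those edges in reverse from a changed function yields every direct and transitive caller. The function name impacted_by and the identifiers in the example are hypothetical, and the assistant’s actual query language is not described here.

```python
from collections import deque


def impacted_by(changed, call_edges):
    """Return all direct and transitive callers of `changed`, sorted by name."""
    callers = {}
    for caller, callee in call_edges:
        callers.setdefault(callee, []).append(caller)

    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in seen and caller != changed:
                seen.add(caller)
                queue.append(caller)
    return sorted(seen)


# Hypothetical call graph: cli.main -> api.handle_request -> auth.check_token -> crypto.verify
call_edges = [
    ("api.handle_request", "auth.check_token"),
    ("auth.check_token", "crypto.verify"),
    ("cli.main", "api.handle_request"),
    ("ui.render", "ui.theme"),
]
print(impacted_by("crypto.verify", call_edges))
# ['api.handle_request', 'auth.check_token', 'cli.main']
```

The query never modifies the graph, and each returned name maps back to a concrete node, which is what makes the answer traceable to its source.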

Challenges and Future Directions

While promising, the system faces challenges such as maintaining graph freshness at scale, handling the heterogeneity of programming languages, and dealing with incomplete or ambiguous user-provided issue descriptions. Future work aims to address these limitations by incorporating dynamic signals from program execution, constructing temporal graphs to model co-change patterns, training supervised re-rankers on historical data, and enriching the graph with diverse signals from pull-request metadata and CI/CD scripts.

In conclusion, this research presents a robust pipeline for converting complex software repositories into vectorized knowledge graphs, significantly enhancing the ability to retrieve relevant files using natural language queries. By combining semantic embeddings, LLM-based summarization, and intelligent graph traversal, the system offers a powerful tool for automating and simplifying software development tasks, ultimately reducing manual effort and increasing efficiency.
