Fine-Tuning LLMs to Navigate Code Repositories with Precision

TL;DR: Researchers fine-tuned a Qwen3-8B LLM to accurately retrieve relevant file paths from code repositories in response to natural language queries. They developed six code-aware strategies to automatically generate diverse training data, covering everything from single-file details to cross-module interactions. The model achieved high accuracy (up to 91% exact match) on small-to-medium Python projects and remained effective on a large codebase like PyTorch, showing that LLMs can act as precise, repository-aware code navigators.

In the fast-evolving world of software development, navigating complex codebases to find specific files or functionalities can be a daunting task. Developers and AI coding assistants often struggle with questions like “How does this feature work?” or “Where is this bug located?” Traditional code search methods, relying on keywords, frequently miss the semantic context and relationships between different files. While large language models (LLMs) are excellent at understanding natural language, they typically lack specific knowledge about a particular code repository, sometimes leading to incorrect or “hallucinated” file suggestions.

A recent research paper, “Repository-Aware File Path Retrieval via Fine-Tuned LLMs”, proposes an innovative solution to this challenge. Authored by Vasudha Yanuganti, Ishaan Puri, Swapnil Chhatre, Mantinder Singh, Ashok Jallepalli, Hritvik Shrivastava, and Pradeep Kumar Sharma from Persistent Systems, the paper introduces a method to fine-tune a powerful LLM to directly predict relevant file paths given a natural language query about a codebase. Instead of generating code or explanations, the model’s primary role is to act as an intelligent pointer, guiding developers to the exact files they need to inspect.

Bridging the Gap: LLMs and Code Repositories

The core idea is to make LLMs “repository-aware.” This means training them to internalize the structure and content of a specific codebase. Unlike conventional fine-tuning that expands an LLM’s general knowledge, this approach binds the model to a repository snapshot, transforming its parameters into a compact, searchable index of file paths. This method offers several advantages, including single-forward-pass latency during inference and stable, deterministic path predictions, avoiding the need for external retrieval steps common in other systems.
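To make the inference behavior concrete, here is a minimal sketch of how a repository-tuned model could be queried. The checkpoint name, prompt template, and JSON shape are illustrative assumptions, not the paper's exact setup; the key point is that a single generation call returns file paths directly, with no external retrieval step.

    import json
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical name for a repository-tuned checkpoint (assumption).
    MODEL_PATH = "my-org/qwen3-8b-flask-paths"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
    )

    query = "Where is request routing implemented?"
    prompt = f"Question about the repository: {query}\nRelevant file paths (JSON):"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding keeps the predicted paths stable across runs.
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    paths = json.loads(completion)  # e.g. ["src/flask/app.py", "src/flask/blueprints.py"]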

Crafting the Training Data: Six Code-Aware Strategies

A significant challenge in this approach is generating a high-quality dataset of natural language questions paired with their relevant file paths. Manually labeling such data is impractical. To overcome this, the researchers developed an automated pipeline using Qwen, a strong LLM, to create synthetic question-answer pairs directly from the codebase. They introduced six novel “code-aware” strategies, each designed to capture different granularities of repository knowledge:

  • Per-File QA (S1): Generates questions and answers specific to a single file’s content.
  • Hierarchical Level 1 (Repo Summary, S2): Focuses on high-level repository structure, generating questions about broad functionality and module roles.
  • Hierarchical Level 2 (Mid-Level AST, S3): Summarizes mid-level code structures like class and function names, yielding questions about their contributions and interactions.
  • Hierarchical Level 3 (Fine AST details, S4): Extracts fine-grained details (functions, methods, docstrings) for very specific implementation-level queries.
  • High-Level Repo Structure + File Summary (S5): Combines high-level and file-specific summaries to generate questions requiring synthesis across multiple files, reflecting real developer inquiries about module interactions or end-to-end flows.
  • Git Ingest Batch Mode (S6): For very large repositories, this strategy processes code in manageable chunks to generate localized QA pairs, ensuring scalability while preventing context overflow.

By merging and balancing the QA pairs from these diverse strategies, the researchers created a comprehensive training set that covers a wide spectrum of question types, from specific function behaviors to cross-module interactions.
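To illustrate the raw material these strategies work from, here is a minimal sketch of the kind of mid-level AST pass that strategies S3 and S4 imply, using Python's standard ast module. The summarization and question-generation prompts themselves are assumptions; the paper's exact prompts are not reproduced here.

    import ast
    from pathlib import Path

    def mid_level_summary(path: str) -> dict:
        """Collect class and function names from one Python source file."""
        tree = ast.parse(Path(path).read_text())
        classes = [n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]
        funcs = [n.name for n in ast.walk(tree)
                 if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
        return {"file": path, "classes": classes, "functions": funcs}

    # Each summary can then be handed to a generator LLM with an instruction
    # such as "write a developer question this file answers", yielding
    # (question, file path) training pairs. That instruction wording is an
    # illustrative assumption, not the paper's prompt.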

Efficient Fine-Tuning and Impressive Results

The Qwen3-8B model was chosen for its strong performance and manageable size. To ensure efficiency, the researchers employed QLoRA (4-bit Low-Rank Adaptation) and Unsloth optimizations, allowing fine-tuning on a modest setup of two GPUs with significant speedups. The model was trained to output relevant file paths in a structured JSON format, implicitly learning the repository’s file names and their associations.
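As a rough picture of what that training setup could look like, here is a minimal QLoRA sketch using Unsloth's FastLanguageModel API. The checkpoint name, adapter rank, and example pair are placeholder assumptions rather than the authors' reported configuration.

    from unsloth import FastLanguageModel

    # Load Qwen3-8B quantized to 4 bits (the QLoRA base).
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen3-8B",  # assumed checkpoint name
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Attach low-rank adapters; rank and alpha here are illustrative values.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )

    # One training example: the completion is the structured JSON the
    # model learns to emit.
    example = {
        "prompt": "Where is the command-line interface defined?",
        "completion": '{"files": ["src/flask/cli.py"]}',
    }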

Experiments on various open-source Python repositories demonstrated remarkable accuracy. For small-to-medium codebases like Flask, Click, and Jinja2, the fine-tuned model achieved exact match accuracies of up to 91% and recall rates of 93%. In other words, the model often reproduced the entire set of relevant files for a query, not merely one correct file. A crucial insight was the importance of balancing the training data: excluding the overly simplistic per-file questions (S1) significantly improved the model's ability to handle complex, multi-file queries.
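A minimal sketch of how these two metrics could be computed over predicted versus gold path sets (the paper's scoring code is not shown; the set-based definitions below are the standard ones):

    def exact_match(pred: set[str], gold: set[str]) -> bool:
        # Counts only if the predicted and gold path sets are identical.
        return pred == gold

    def recall(pred: set[str], gold: set[str]) -> float:
        # Fraction of gold paths the model actually retrieved.
        return len(pred & gold) / len(gold) if gold else 1.0

    assert exact_match({"src/flask/app.py"}, {"src/flask/app.py"})
    assert recall({"a.py"}, {"a.py", "b.py"}) == 0.5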

Even on a large-scale codebase like PyTorch (nearly 4,000 Python files), the model remained effective, achieving an exact match of 47.85% and a recall of 59.02%. While lower than for smaller projects, this performance is still promising given the immense complexity and size of PyTorch, highlighting the scalability of the approach.

Future Directions and Impact

This research marks a significant step toward more intelligent, context-aware coding assistants. By pointing developers directly at the relevant files, such models can save considerable time in code comprehension and debugging. While the current approach fine-tunes a model per repository, future work could explore training a single model across multiple projects or integrating with LLMs that have larger context windows. Handling code updates and supporting multi-language repositories are also key areas for continued development.

Ultimately, this work combines the strengths of LLMs (natural language understanding) with structured code analysis, paving the way for developer tools that can truly understand a codebase and guide engineers to the information they need, transforming how we interact with complex software.
