Fine-Tuning LLMs to Navigate Code Repositories with Precision

TL;DR: Researchers fine-tuned a Qwen3-8B LLM to accurately retrieve relevant file paths from code repositories in response to natural language queries. They developed six code-aware strategies to automatically generate diverse training data, covering everything from single-file details to cross-module interactions. The model achieved high accuracy (up to 91% exact match) on small-to-medium Python projects and remained effective on a large codebase like PyTorch, showing that LLMs can act as precise, repository-aware code navigators.

In the fast-evolving world of software development, navigating complex codebases to find specific files or functionalities can be a daunting task. Developers and AI coding assistants often struggle with questions like “How does this feature work?” or “Where is this bug located?” Traditional code search methods, relying on keywords, frequently miss the semantic context and relationships between different files. While large language models (LLMs) are excellent at understanding natural language, they typically lack specific knowledge about a particular code repository, sometimes leading to incorrect or “hallucinated” file suggestions.

A recent research paper, “Repository-Aware File Path Retrieval via Fine-Tuned LLMs”, proposes an innovative solution to this challenge. Authored by Vasudha Yanuganti, Ishaan Puri, Swapnil Chhatre, Mantinder Singh, Ashok Jallepalli, Hritvik Shrivastava, and Pradeep Kumar Sharma from Persistent Systems, the paper introduces a method to fine-tune a powerful LLM to directly predict relevant file paths given a natural language query about a codebase. Instead of generating code or explanations, the model’s primary role is to act as an intelligent pointer, guiding developers to the exact files they need to inspect.

Bridging the Gap: LLMs and Code Repositories

The core idea is to make LLMs “repository-aware.” This means training them to internalize the structure and content of a specific codebase. Unlike conventional fine-tuning that expands an LLM’s general knowledge, this approach binds the model to a repository snapshot, transforming its parameters into a compact, searchable index of file paths. This method offers several advantages, including single-forward-pass latency during inference and stable, deterministic path predictions, avoiding the need for external retrieval steps common in other systems.
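To make the inference behavior concrete, here is a minimal sketch of how a repository-tuned model could be queried. The checkpoint name, prompt template, and JSON shape are illustrative assumptions, not the paper's exact setup; the key point is that a single generation call returns file paths directly, with no external retrieval step.

    import json
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Hypothetical name for a repository-tuned checkpoint (assumption).
    MODEL_PATH = "my-org/qwen3-8b-flask-paths"

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
    )

    query = "Where is request routing implemented?"
    prompt = f"Question about the repository: {query}\nRelevant file paths (JSON):"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Greedy decoding keeps the predicted paths stable across runs.
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    paths = json.loads(completion)  # e.g. ["src/flask/app.py", "src/flask/blueprints.py"]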

Crafting the Training Data: Six Code-Aware Strategies

A significant challenge in this approach is generating a high-quality dataset of natural language questions paired with their relevant file paths. Manually labeling such data is impractical. To overcome this, the researchers developed an automated pipeline using Qwen, a strong LLM, to create synthetic question-answer pairs directly from the codebase. They introduced six novel “code-aware” strategies, each designed to capture different granularities of repository knowledge:

  • Per-File QA (S1): Generates questions and answers specific to a single file’s content.
  • Hierarchical Level 1 (Repo Summary, S2): Focuses on high-level repository structure, generating questions about broad functionality and module roles.
  • Hierarchical Level 2 (Mid-Level AST, S3): Summarizes mid-level code structures like class and function names, yielding questions about their contributions and interactions.
  • Hierarchical Level 3 (Fine AST details, S4): Extracts fine-grained details (functions, methods, docstrings) for very specific implementation-level queries.
  • High-Level Repo Structure + File Summary (S5): Combines high-level and file-specific summaries to generate questions requiring synthesis across multiple files, reflecting real developer inquiries about module interactions or end-to-end flows.
  • Git Ingest Batch Mode (S6): For very large repositories, this strategy processes code in manageable chunks to generate localized QA pairs, ensuring scalability while preventing context overflow.

By merging and balancing the QA pairs from these diverse strategies, the researchers created a comprehensive training set that covers a wide spectrum of question types, from specific function behaviors to cross-module interactions.
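To illustrate the raw material these strategies work from, here is a minimal sketch of the kind of mid-level AST pass that strategies S3 and S4 imply, using Python's standard ast module. The summarization and question-generation prompts themselves are assumptions; the paper's exact prompts are not reproduced here.

    import ast
    from pathlib import Path

    def mid_level_summary(path: str) -> dict:
        """Collect class and function names from one Python source file."""
        tree = ast.parse(Path(path).read_text())
        classes = [n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]
        funcs = [n.name for n in ast.walk(tree)
                 if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
        return {"file": path, "classes": classes, "functions": funcs}

    # Each summary can then be handed to a generator LLM with an instruction
    # such as "write a developer question this file answers", yielding
    # (question, file path) training pairs. That instruction wording is an
    # illustrative assumption, not the paper's prompt.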

Efficient Fine-Tuning and Impressive Results

The Qwen3-8B model was chosen for its strong performance and manageable size. To ensure efficiency, the researchers employed QLoRA (4-bit Low-Rank Adaptation) and Unsloth optimizations, allowing fine-tuning on a modest setup of two GPUs with significant speedups. The model was trained to output relevant file paths in a structured JSON format, implicitly learning the repository’s file names and their associations.
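As a rough picture of what that training setup could look like, here is a minimal QLoRA sketch using Unsloth's FastLanguageModel API. The checkpoint name, adapter rank, and example pair are placeholder assumptions rather than the authors' reported configuration.

    from unsloth import FastLanguageModel

    # Load Qwen3-8B quantized to 4 bits (the QLoRA base).
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen3-8B",  # assumed checkpoint name
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Attach low-rank adapters; rank and alpha here are illustrative values.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )

    # One training example: the completion is the structured JSON the
    # model learns to emit.
    example = {
        "prompt": "Where is the command-line interface defined?",
        "completion": '{"files": ["src/flask/cli.py"]}',
    }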

Experiments on various open-source Python repositories demonstrated remarkable accuracy. For small-to-medium codebases like Flask, Click, and Jinja2, the fine-tuned model achieved exact match accuracies of up to 91% and recall rates of 93%. In other words, the model often reproduced the entire set of relevant files for a query, not merely one correct file. A crucial insight was the importance of balancing the training data: excluding the overly simplistic per-file questions (S1) significantly improved the model's ability to handle complex, multi-file queries.
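A minimal sketch of how these two metrics could be computed over predicted versus gold path sets (the paper's scoring code is not shown; the set-based definitions below are the standard ones):

    def exact_match(pred: set[str], gold: set[str]) -> bool:
        # Counts only if the predicted and gold path sets are identical.
        return pred == gold

    def recall(pred: set[str], gold: set[str]) -> float:
        # Fraction of gold paths the model actually retrieved.
        return len(pred & gold) / len(gold) if gold else 1.0

    assert exact_match({"src/flask/app.py"}, {"src/flask/app.py"})
    assert recall({"a.py"}, {"a.py", "b.py"}) == 0.5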

Even on a large-scale codebase like PyTorch (nearly 4,000 Python files), the model remained effective, achieving an exact match of 47.85% and a recall of 59.02%. While lower than for smaller projects, this performance is still promising given the immense complexity and size of PyTorch, highlighting the scalability of the approach.

Future Directions and Impact

This research marks a significant step toward more intelligent, context-aware coding assistants. By pointing developers directly at the relevant files, such models can save considerable time in code comprehension and debugging. While the current approach fine-tunes a model per repository, future work could explore training a single model across multiple projects or integrating with LLMs that have larger context windows. Handling code updates and supporting multi-language repositories are also key areas for continued development.

Ultimately, this work combines the strengths of LLMs (natural language understanding) with structured code analysis, paving the way for developer tools that can truly understand a codebase and guide engineers to the information they need, transforming how we interact with complex software.
