
Enhancing Code Completion with Smart Context Retrieval and Code Chunking

TL;DR: This research paper introduces a novel context collection strategy for improving repository-level code completion with large language models (LLMs). The method preprocesses code into smaller chunks, then employs a hybrid retrieval system that combines syntactic (BM25) and semantic (FAISS) similarity. A key innovation is ‘relative positioning’: the system retrieves not only chunks similar to a given prefix and suffix but also their subsequent (‘next’) and preceding (‘previous’) chunks, respectively. This comprehensive context, which also includes the completion file and recently opened files, significantly boosts LLM performance on fill-in-the-middle completion, earning the solution top placements in a code completion competition.

Code completion has become an indispensable feature in modern integrated development environments (IDEs), significantly boosting developer efficiency. However, despite its widespread availability, there’s an ongoing challenge in determining what constitutes truly effective context for large language models (LLMs) to perform optimally in code completion tasks.

A recent research paper, Relative Positioning Based Code Chunking Method For Rich Context Retrieval In Repository Level Code Completion Task With Code Language Model, by Imranur Rahman and Md Rayhanur Rahman, addresses this critical issue. The authors propose an innovative context collection strategy designed to enhance LLM performance in repository-level code completion.

The core of their strategy involves two key ideas: preprocessing the entire code repository into smaller, manageable ‘code chunks’ and then using a retrieval mechanism that considers both syntactic and semantic similarity, along with a novel concept of ‘relative positioning’ of these chunks.

The Problem and Approach

The research was motivated by a competition organized by JetBrains Research, co-located with the Automated Software Engineering (ASE 2025) conference. Participants were tasked with creating an effective context collection strategy to supplement completion points with useful information from across a whole repository. The setup provided the full source code of the repository, recently used files, the completion file, and the prefix and suffix of a code snippet, with the goal of having an LLM ‘fill in the middle’.

The authors’ solution, which secured third place in the Kotlin track and fourth in the Python track of the competition, focuses on intelligently selecting and arranging relevant code snippets.

How the System Works

The process begins with **Preprocessing**. Each source file in the repository is split into ‘line chunks’ with a small overlap between consecutive chunks. Importantly, the system keeps track of ‘previous’ and ‘next’ pointers for each chunk within a file. These chunks are stored in a database for later retrieval. Simultaneously, vector embeddings are computed for each chunk using an embedding model (specifically, `sentence-transformers/all-MiniLM-L6-v2` for its efficiency), and these embeddings are stored in a vector database.
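
To make the preprocessing step concrete, here is a minimal sketch in Python. The chunk size, overlap, repository path, and the plain-dict ‘database’ are illustrative assumptions (the paper does not publish these parameters); the embedding and vector-store side appears in the retriever sketch below.

```python
from pathlib import Path

CHUNK_LINES = 20   # assumed chunk size in lines; the paper does not fix a value
OVERLAP = 2        # assumed overlap between consecutive chunks

def chunk_file(path: str, text: str) -> list[dict]:
    """Split one source file into overlapping line chunks with prev/next pointers."""
    lines = text.splitlines()
    step = CHUNK_LINES - OVERLAP
    chunks = []
    for i, start in enumerate(range(0, max(len(lines), 1), step)):
        chunks.append({
            "id": f"{path}#{i}",
            "text": "\n".join(lines[start:start + CHUNK_LINES]),
            "prev": f"{path}#{i - 1}" if i > 0 else None,
            "next": None,
        })
    for cur, nxt in zip(chunks, chunks[1:]):
        cur["next"] = nxt["id"]  # link each chunk to its successor within the file
    return chunks

# Walk the repository and store every chunk keyed by id, standing in for
# the paper's chunk database.
chunk_db: dict[str, dict] = {}
for f in Path("repo/").rglob("*.py"):
    for c in chunk_file(str(f), f.read_text(errors="ignore")):
        chunk_db[c["id"]] = c
```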

Next, a sophisticated **Retriever** system is employed. The paper uses an Ensemble Retriever from LangChain, which combines results from multiple individual retrievers. This hybrid approach leverages both syntactic and semantic similarity: for syntactic similarity, a BM25 retriever is used, which is effective at finding documents based on keywords; for semantic similarity, FAISS (Facebook AI Similarity Search) is used, which is designed for efficient similarity search over dense vectors. The ensemble retriever reranks the combined results, placing four times more weight on semantic similarity, reflecting the intuition that understanding the meaning of code often matters more than keyword matching.
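
Wiring this up with LangChain might look roughly like the following. The `BM25Retriever`, `FAISS`, and `EnsembleRetriever` classes are LangChain's public API, but the `k` values are assumptions, and the 0.2/0.8 weight split is simply one way to express the fourfold emphasis on semantic similarity; this is a sketch, not the authors' exact configuration.

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

texts = [c["text"] for c in chunk_db.values()]      # chunks from preprocessing
metadatas = [{"id": cid} for cid in chunk_db]       # carry chunk ids for later lookup

# Syntactic/keyword retriever.
bm25 = BM25Retriever.from_texts(texts, metadatas=metadatas)
bm25.k = 5  # assumed top-k

# Semantic/dense retriever over embeddings of the same chunks.
emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
faiss = FAISS.from_texts(texts, emb, metadatas=metadatas).as_retriever(
    search_kwargs={"k": 5}
)

# Rerank with four times more weight on the semantic retriever.
retriever = EnsembleRetriever(retrievers=[bm25, faiss], weights=[0.2, 0.8])
```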

The **Context Collection** phase is where ‘relative positioning’ comes into play. When an LLM needs to fill in code between a prefix and a suffix (see the sketch after this list):

  • The prefix is fed to the ensemble retriever to find top-k similar code chunks. Then, the system retrieves the *next* chunks for each of these similar prefix chunks. The idea is that these ‘next’ chunks represent what a user might have already written *after* the retrieved similar code.
  • Similarly, the suffix is fed to the retriever to find top-k similar code chunks. For these, the system retrieves the *previous* chunks. These ‘previous’ chunks represent what a user might have written *before* the retrieved similar code.
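
Relative positioning then reduces to neighbor lookups in the chunk database. The sketch below reuses the hypothetical `chunk_db` and `retriever` from the previous snippets; the top-k value is again an assumption.

```python
def collect_relative_chunks(prefix: str, suffix: str, k: int = 5):
    """Retrieve chunks similar to prefix/suffix plus their next/previous neighbors."""
    # Chunks similar to the prefix, each followed by its 'next' chunk:
    # roughly what a user wrote after similar code elsewhere in the repo.
    prefix_ctx = []
    for doc in retriever.invoke(prefix)[:k]:
        chunk = chunk_db[doc.metadata["id"]]
        prefix_ctx.append(chunk["text"])
        if chunk["next"]:
            prefix_ctx.append(chunk_db[chunk["next"]]["text"])

    # Chunks similar to the suffix, each preceded by its 'previous' chunk:
    # roughly what a user wrote before similar code.
    suffix_ctx = []
    for doc in retriever.invoke(suffix)[:k]:
        chunk = chunk_db[doc.metadata["id"]]
        if chunk["prev"]:
            suffix_ctx.append(chunk_db[chunk["prev"]]["text"])
        suffix_ctx.append(chunk["text"])

    return prefix_ctx, suffix_ctx
```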

The **Final Context** provided to the LLM is a combination of four elements: the completion file itself (considered highly relevant), recently opened files by the user, the top-k similar chunks to the prefix along with their ‘next’ chunks, and the top-k similar chunks to the suffix along with their ‘previous’ chunks.
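
Assembling the final context is then simple concatenation of the four components. The ordering and separator below are assumptions, since the paper names the components but not their exact serialization.

```python
def build_context(completion_file: str, recent_files: list[str],
                  prefix: str, suffix: str) -> str:
    """Combine the four context components into one string for the LLM."""
    prefix_ctx, suffix_ctx = collect_relative_chunks(prefix, suffix)
    parts = (
        [completion_file]   # the file being completed (highly relevant)
        + recent_files      # files the user recently opened
        + prefix_ctx        # prefix-similar chunks and their 'next' neighbors
        + suffix_ctx        # suffix-similar chunks and their 'previous' neighbors
    )
    return "\n\n".join(parts)
```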

Evolution of the Solution

Before arriving at this final solution, the researchers explored several other approaches. They initially tried statically extracting variable and function names and passing them in a structured format, but this didn’t yield performance benefits. They also experimented with ‘prompt injection’ to ask the LLM if it needed additional context, inspired by the RepoFormer concept of selective retrieval, but this also didn’t improve accuracy. A further attempt involved using a locally hosted LLM to make retrieval decisions based on confidence scores, but this proved computationally expensive and time-consuming.

These explorations led them to the simpler, yet highly effective, strategy of chunking code and incorporating ‘next’ and ‘previous’ chunks based on relative positioning, building upon existing syntactic retrieval methods.


Future Directions

The authors acknowledge that context collection remains a challenging problem. They suggest future work could explore dynamic chunking strategies based on abstract syntax trees (ASTs) for more coherent code representation. Additionally, incorporating other repository-level information, such as inter-file and cross-file dependencies or call graphs, could further enhance performance. An ablation study to understand the individual contribution of each context component, and an analysis of factors like chunk size and top-k values, are also proposed. Crucially, memory footprint and time taken for context aggregation are highlighted as important metrics for evaluating any context collection strategy in real-world scenarios.
