
Enhancing Code Completion with Smart Context Retrieval and Code Chunking

TL;DR: This research paper introduces a novel context collection strategy for improving repository-level code completion with large language models (LLMs). The method preprocesses code into smaller chunks, then employs a hybrid retrieval system that combines syntactic (BM25) and semantic (FAISS) similarity. A key innovation is ‘relative positioning’: the system retrieves not only chunks similar to a given prefix and suffix but also their subsequent (‘next’) and preceding (‘previous’) chunks, respectively. This comprehensive context, which also includes the completion file and recently opened files, significantly boosts LLM performance on fill-in-the-middle completion, earning the solution top placements in a code completion competition.

Code completion has become an indispensable feature in modern integrated development environments (IDEs), significantly boosting developer efficiency. However, despite its widespread availability, there’s an ongoing challenge in determining what constitutes truly effective context for large language models (LLMs) to perform optimally in code completion tasks.

A recent research paper, Relative Positioning Based Code Chunking Method For Rich Context Retrieval In Repository Level Code Completion Task With Code Language Model, by Imranur Rahman and Md Rayhanur Rahman, addresses this critical issue. The authors propose an innovative context collection strategy designed to enhance LLM performance in repository-level code completion.

The core of their strategy involves two key ideas: preprocessing the entire code repository into smaller, manageable ‘code chunks’ and then using a retrieval mechanism that considers both syntactic and semantic similarity, along with a novel concept of ‘relative positioning’ of these chunks.

The Problem and Approach

The research was motivated by a competition organized by JetBrains Research, co-located with the Automated Software Engineering (ASE 2025) conference. Participants were tasked with creating an effective context collection strategy to supplement completion points with useful information from across a whole repository. The setup provided the full source code of the repository, recently used files, the completion file, and the prefix and suffix of a code snippet, with the goal of having an LLM ‘fill in the middle’.

The authors’ solution, which secured third place in the Kotlin track and fourth in the Python track of the competition, focuses on intelligently selecting and arranging relevant code snippets.

How the System Works

The process begins with **Preprocessing**. Each source file in the repository is split into ‘line chunks’ with a small overlap between consecutive chunks. Importantly, the system keeps track of ‘previous’ and ‘next’ pointers for each chunk within a file. These chunks are stored in a database for later retrieval. Simultaneously, vector embeddings are computed for each chunk using an embedding model (specifically, `sentence-transformers/all-MiniLM-L6-v2` for its efficiency), and these embeddings are stored in a vector database.
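
To make the preprocessing step concrete, here is a minimal sketch in Python. The chunk size, overlap, repository path, and the plain-dict ‘database’ are illustrative assumptions (the paper does not publish these parameters); the embedding and vector-store side appears in the retriever sketch below.

```python
from pathlib import Path

CHUNK_LINES = 20   # assumed chunk size in lines; the paper does not fix a value
OVERLAP = 2        # assumed overlap between consecutive chunks

def chunk_file(path: str, text: str) -> list[dict]:
    """Split one source file into overlapping line chunks with prev/next pointers."""
    lines = text.splitlines()
    step = CHUNK_LINES - OVERLAP
    chunks = []
    for i, start in enumerate(range(0, max(len(lines), 1), step)):
        chunks.append({
            "id": f"{path}#{i}",
            "text": "\n".join(lines[start:start + CHUNK_LINES]),
            "prev": f"{path}#{i - 1}" if i > 0 else None,
            "next": None,
        })
    for cur, nxt in zip(chunks, chunks[1:]):
        cur["next"] = nxt["id"]  # link each chunk to its successor within the file
    return chunks

# Walk the repository and store every chunk keyed by id, standing in for
# the paper's chunk database.
chunk_db: dict[str, dict] = {}
for f in Path("repo/").rglob("*.py"):
    for c in chunk_file(str(f), f.read_text(errors="ignore")):
        chunk_db[c["id"]] = c
```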

Next, a sophisticated **Retriever** system is employed. The paper uses an Ensemble Retriever from LangChain, which combines results from multiple individual retrievers. This hybrid approach leverages both syntactic and semantic similarity: for syntactic similarity, a BM25 retriever is used, which is effective at finding documents based on keywords; for semantic similarity, FAISS (Facebook AI Similarity Search) is used, which is designed for efficient similarity search over dense vectors. The ensemble retriever reranks the combined results, placing four times more weight on semantic similarity, reflecting the intuition that understanding the meaning of code often matters more than keyword matching.
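
Wiring this up with LangChain might look roughly like the following. The `BM25Retriever`, `FAISS`, and `EnsembleRetriever` classes are LangChain's public API, but the `k` values are assumptions, and the 0.2/0.8 weight split is simply one way to express the fourfold emphasis on semantic similarity; this is a sketch, not the authors' exact configuration.

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

texts = [c["text"] for c in chunk_db.values()]      # chunks from preprocessing
metadatas = [{"id": cid} for cid in chunk_db]       # carry chunk ids for later lookup

# Syntactic/keyword retriever.
bm25 = BM25Retriever.from_texts(texts, metadatas=metadatas)
bm25.k = 5  # assumed top-k

# Semantic/dense retriever over embeddings of the same chunks.
emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
faiss = FAISS.from_texts(texts, emb, metadatas=metadatas).as_retriever(
    search_kwargs={"k": 5}
)

# Rerank with four times more weight on the semantic retriever.
retriever = EnsembleRetriever(retrievers=[bm25, faiss], weights=[0.2, 0.8])
```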

The **Context Collection** phase is where ‘relative positioning’ comes into play. When an LLM needs to fill in code between a prefix and a suffix (see the sketch after this list):

  • The prefix is fed to the ensemble retriever to find top-k similar code chunks. Then, the system retrieves the *next* chunks for each of these similar prefix chunks. The idea is that these ‘next’ chunks represent what a user might have already written *after* the retrieved similar code.
  • Similarly, the suffix is fed to the retriever to find top-k similar code chunks. For these, the system retrieves the *previous* chunks. These ‘previous’ chunks represent what a user might have written *before* the retrieved similar code.
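
Relative positioning then reduces to neighbor lookups in the chunk database. The sketch below reuses the hypothetical `chunk_db` and `retriever` from the previous snippets; the top-k value is again an assumption.

```python
def collect_relative_chunks(prefix: str, suffix: str, k: int = 5):
    """Retrieve chunks similar to prefix/suffix plus their next/previous neighbors."""
    # Chunks similar to the prefix, each followed by its 'next' chunk:
    # roughly what a user wrote after similar code elsewhere in the repo.
    prefix_ctx = []
    for doc in retriever.invoke(prefix)[:k]:
        chunk = chunk_db[doc.metadata["id"]]
        prefix_ctx.append(chunk["text"])
        if chunk["next"]:
            prefix_ctx.append(chunk_db[chunk["next"]]["text"])

    # Chunks similar to the suffix, each preceded by its 'previous' chunk:
    # roughly what a user wrote before similar code.
    suffix_ctx = []
    for doc in retriever.invoke(suffix)[:k]:
        chunk = chunk_db[doc.metadata["id"]]
        if chunk["prev"]:
            suffix_ctx.append(chunk_db[chunk["prev"]]["text"])
        suffix_ctx.append(chunk["text"])

    return prefix_ctx, suffix_ctx
```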

The **Final Context** provided to the LLM is a combination of four elements: the completion file itself (considered highly relevant), recently opened files by the user, the top-k similar chunks to the prefix along with their ‘next’ chunks, and the top-k similar chunks to the suffix along with their ‘previous’ chunks.
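
Assembling the final context is then simple concatenation of the four components. The ordering and separator below are assumptions, since the paper names the components but not their exact serialization.

```python
def build_context(completion_file: str, recent_files: list[str],
                  prefix: str, suffix: str) -> str:
    """Combine the four context components into one string for the LLM."""
    prefix_ctx, suffix_ctx = collect_relative_chunks(prefix, suffix)
    parts = (
        [completion_file]   # the file being completed (highly relevant)
        + recent_files      # files the user recently opened
        + prefix_ctx        # prefix-similar chunks and their 'next' neighbors
        + suffix_ctx        # suffix-similar chunks and their 'previous' neighbors
    )
    return "\n\n".join(parts)
```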

Evolution of the Solution

Before arriving at this final solution, the researchers explored several other approaches. They initially tried statically extracting variable and function names and passing them in a structured format, but this didn’t yield performance benefits. They also experimented with ‘prompt injection’ to ask the LLM if it needed additional context, inspired by the RepoFormer concept of selective retrieval, but this also didn’t improve accuracy. A further attempt involved using a locally hosted LLM to make retrieval decisions based on confidence scores, but this proved computationally expensive and time-consuming.

These explorations led them to the simpler, yet highly effective, strategy of chunking code and incorporating ‘next’ and ‘previous’ chunks based on relative positioning, building upon existing syntactic retrieval methods.


Future Directions

The authors acknowledge that context collection remains a challenging problem. They suggest future work could explore dynamic chunking strategies based on abstract syntax trees (ASTs) for more coherent code representation. Additionally, incorporating other repository-level information, such as inter-file and cross-file dependencies or call graphs, could further enhance performance. An ablation study to understand the individual contribution of each context component, and an analysis of factors like chunk size and top-k values, are also proposed. Crucially, memory footprint and time taken for context aggregation are highlighted as important metrics for evaluating any context collection strategy in real-world scenarios.
