CODEGENLINK: Uncovering the Origins and Licenses of AI-Generated Code

TLDR: CODEGENLINK is a GitHub Copilot extension for Visual Studio Code that helps developers determine the likely origin and license of automatically generated code. It combines LLMs with web search to find similar code snippets online, then performs similarity analysis and identifies associated licenses. The tool offers two modes for retrieving links—either from LLM-generated code in chat or selected code in the IDE—and aims to enhance trustworthiness and prevent copyright or licensing violations for AI-assisted software development.

Large Language Models (LLMs) have become indispensable in various software development tasks, from generating code to assisting with testing and documentation. While the reuse of code from the web often allows developers to infer its origin and trustworthiness, the same cannot always be said for code generated by LLMs. This lack of provenance information raises significant concerns about trustworthiness, potential copyright infringements, and licensing violations.

Addressing these critical issues, researchers from the University of Sannio, Italy, have introduced CODEGENLINK. This innovative tool is a GitHub Copilot extension for Visual Studio Code designed to help developers understand the origins and licensing of automatically generated code. CODEGENLINK aims to suggest links to code that is highly similar to the LLM-generated snippets and, whenever possible, indicate the license of that likely original source.

How CODEGENLINK Works

CODEGENLINK combines LLMs with their web search capabilities. When a developer uses the tool, it first retrieves a set of candidate links from the web. It then runs a similarity analysis between the LLM-generated code and the code found at each retrieved link. This two-step process filters out irrelevant links, so that only those closely resembling the generated code are presented to the user.
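The filtering step can be illustrated with a minimal Python sketch. It is not CODEGENLINK's implementation: the function names, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the text similarity metric are all assumptions made for illustration, and the sketch takes the code at each candidate link as already downloaded.

```python
from difflib import SequenceMatcher


def normalize(code: str) -> str:
    """Collapse runs of whitespace so formatting differences do not dominate the score."""
    return " ".join(code.split())


def similarity(a: str, b: str) -> float:
    """Text similarity in [0, 1]; a stand-in for the metrics the tool actually uses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def filter_links(generated: str, candidates: dict[str, str],
                 threshold: float = 0.8) -> list[tuple[str, float]]:
    """Keep only candidate URLs whose code resembles the generated snippet.

    `candidates` maps a URL to the code found at that URL (hypothetical input shape).
    `threshold` is an assumed cut-off, not a value taken from the article.
    """
    scored = [(url, similarity(generated, code)) for url, code in candidates.items()]
    kept = [(url, score) for url, score in scored if score >= threshold]
    # Most similar candidates first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    generated = "def add(a, b):\n    return a + b"
    candidates = {
        "https://example.com/a": "def add(a, b):\n  return a + b",
        "https://example.com/b": "class Foo:\n    pass",
    }
    for url, score in filter_links(generated, candidates):
        print(f"{score:.2f}  {url}")
```

A character-level ratio like this only captures near-verbatim matches; code that has been renamed or restructured would need the clone detection techniques the article mentions.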

The tool offers two primary operational modes:

  • Direct Code + Link Retrieval (Mode 1): In this mode, a developer uses the Copilot chat to ask the LLM to generate a code snippet. CODEGENLINK automatically takes this output and prompts the LLM to search the web for related links.
  • Link Retrieval (Mode 2): Here, the developer selects an existing code fragment directly within the IDE. CODEGENLINK then retrieves relevant links for this selected code. This mode is versatile and can be used to find the provenance of any code snippet, regardless of whether it was automatically generated.

CODEGENLINK employs both text similarity metrics and clone detection techniques to identify potential code origins. Once relevant links are identified, the tool inspects each one to extract associated license information. For GitHub repositories, it queries the GitHub API or scans for LICENSE files, using Google’s License Classifier for analysis. For sites like Stack Overflow, it applies known site-specific licensing policies (e.g., CC BY-SA 4.0). For other web pages, it scans the HTML content for common license keywords.
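The three license-lookup paths described above can be sketched as a simple dispatch on the link's host. This is an illustrative sketch under stated assumptions, not the tool's code: the function names, the keyword patterns, and the strategy labels are hypothetical, and the GitHub branch only flags that a repository lookup is needed rather than calling the GitHub API or a license classifier.

```python
import re
from urllib.parse import urlparse

# Site-wide policies; the Stack Overflow entry is the example given in the article.
SITE_POLICIES = {"stackoverflow.com": "CC BY-SA 4.0"}

# Hypothetical keyword patterns for the generic HTML scan.
KEYWORD_PATTERNS = {
    "MIT": re.compile(r"\bMIT License\b", re.IGNORECASE),
    "Apache-2.0": re.compile(r"\bApache License,? Version 2\.0\b", re.IGNORECASE),
    "CC BY-SA 4.0": re.compile(r"\bCC BY-SA 4\.0\b", re.IGNORECASE),
}


def host_of(url: str) -> str:
    """Return the lowercase host of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def scan_html(html: str) -> list[str]:
    """Return the names of all license patterns that occur in the page text."""
    return [name for name, pattern in KEYWORD_PATTERNS.items() if pattern.search(html)]


def resolve_license(url: str, html: str = "") -> tuple[str, list[str]]:
    """Pick a lookup strategy for a link and return (strategy, candidate licenses)."""
    host = host_of(url)
    if host == "github.com":
        # A real tool would query the repository or inspect its LICENSE file here.
        return ("github-repository", [])
    if host in SITE_POLICIES:
        return ("site-policy", [SITE_POLICIES[host]])
    return ("html-keywords", scan_html(html))


if __name__ == "__main__":
    print(resolve_license("https://stackoverflow.com/questions/1"))
    print(resolve_license("https://blog.example.com/post",
                          "<p>Released under the MIT License.</p>"))
```

An empty result from the keyword scan means no license was detected on the page, which is not the same as the code being free to reuse.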

The results are then presented to the developer in a clear, aggregated view within the IDE, showing the URL of the likely origin and any identified license information. This helps developers make informed decisions about reusing and redistributing code.

Preliminary Evaluation and Impact

A preliminary evaluation of CODEGENLINK using coding tasks from the CODESEARCHNET and CODEREVAL datasets has shown promising results. The tool effectively filters out unrelated links through its similarity analysis, providing users with links that are highly likely to be the origin of the generated code. While LLMs don’t always provide relevant links initially, CODEGENLINK’s filtering mechanism significantly improves the quality of the suggestions.

Regarding license identification, the tool generally provides accurate suggestions. However, for many links, particularly those pointing to blogs or tutorials, automatically inferring a clear license can be challenging. In such cases, the tool still provides a suggestion, but developers may need to investigate further.

CODEGENLINK represents a significant step forward in addressing the challenges of trustworthiness and compliance in the era of AI-generated code. By giving developers insight into code provenance and licensing, it helps them reuse and redistribute code responsibly within their projects. For more details, see the full research paper.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
