
Automating GitHub README Classification with Large Language Models

TLDR: This research introduces an innovative approach to automatically classify sections of GitHub README files using fine-tuned Large Language Models (LLMs) such as BERT, DistilBERT, and RoBERTa. By leveraging a dataset of 4226 README sections, the study achieved an F1 score of 0.98, significantly outperforming existing state-of-the-art methods. The paper also demonstrates that Parameter-Efficient Fine-Tuning (PEFT) with LoRA offers a cost-effective alternative to full fine-tuning, maintaining strong performance while drastically reducing computational resources. This work highlights the potential of LLMs to improve the organization and utility of GitHub repositories.

GitHub stands as the world’s leading platform for storing, sharing, and managing code, with over 100 million developers and more than 420 million repositories. A crucial component of every GitHub repository is its README file, which ideally provides comprehensive project information to help others use and improve the code. In practice, however, repository owners often neglect these recommendations, limiting their projects’ reach and impact within the developer and research communities.

A survey conducted by GitHub revealed that approximately 93% of respondents were dissatisfied with the quality of documentation and README files. Addressing this, previous research by Prana et al. created a dataset of 4226 GitHub README file sections and developed a classifier using heuristic features, achieving an F1 score of 0.746. While that study demonstrated the practical value of labeling README sections for improving readability, its approach was constrained by the cost and time required to produce human-labeled training data.

The advent of the Transformer architecture in 2017 revolutionized Natural Language Processing (NLP), enabling unprecedented capabilities in text classification, generation, summarization, and translation through Large Language Models (LLMs). These models are pre-trained on vast amounts of unlabeled data and can then be fine-tuned for specific tasks with much smaller labeled datasets.

This new research, titled “LLM-based Content Classification Approach for GitHub Repositories by the README Files”, explores the potential of encoder-only LLMs to automate the classification of GitHub README file sections. The study utilized three prominent encoder-only LLMs: BERT, DistilBERT, and RoBERTa. These models were fine-tuned using the aforementioned gold-standard dataset of 4226 README file sections.

The methodology involved several key steps. Data preprocessing included content abstraction (replacing elements such as code blocks and tables with placeholders), tokenization, stop-word removal, and lemmatization. To address class imbalance in the dataset, oversampling was applied. The data was then split 70/30 into training and test sets, with stratified sampling to preserve the class distribution. Model construction followed two approaches: full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA).
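The preprocessing and splitting steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact pipeline: the regex patterns, placeholder tokens, and split helper are assumptions for the sake of the example.

```python
import re
import random
from collections import defaultdict

def abstract_content(section: str) -> str:
    """Content abstraction: replace structural elements with placeholder tokens.
    (Illustrative patterns only; the paper's exact rules may differ.)"""
    section = re.sub(r"```.*?```", "CODEBLOCK", section, flags=re.DOTALL)  # fenced code blocks
    section = re.sub(r"^\|.*\|$", "TABLE", section, flags=re.MULTILINE)    # markdown table rows
    section = re.sub(r"https?://\S+", "URL", section)                      # hyperlinks
    return section

def stratified_split(sections, labels, test_frac=0.3, seed=42):
    """70/30 split that preserves the per-class label distribution."""
    by_label = defaultdict(list)
    for s, y in zip(sections, labels):
        by_label[y].append(s)
    rng = random.Random(seed)
    train, test = [], []
    for y, items in by_label.items():
        rng.shuffle(items)
        cut = int(len(items) * (1 - test_frac))  # 70% of each class to training
        train += [(s, y) for s in items[:cut]]
        test += [(s, y) for s in items[cut:]]
    return train, test
```

Stratifying per class matters here because the README-section labels are imbalanced; a naive random split could leave a rare section type underrepresented in the test set.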

The results of this study are highly promising. The fine-tuned LLMs demonstrated exceptional performance in categorizing GitHub README file sections, achieving an impressive overall F1 score of 0.98. This significantly surpasses the F1 score of 0.746 achieved by previous state-of-the-art methods that relied on traditional machine learning algorithms like SVM, Random Forest, and Naive Bayes.
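For reference, the F1 score is the harmonic mean of precision and recall, and for multi-class tasks like section labeling it is typically averaged over classes. A minimal sketch of the metric (the toy labels are invented for illustration, not the paper's data):

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_score(y_true, y_pred, c) for c in classes) / len(classes)
```

On this scale, the jump from 0.746 to 0.98 means the fine-tuned models misclassify only a small fraction of the sections that the heuristic-feature classifiers got wrong.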

Furthermore, the research investigated the effectiveness of PEFT techniques, specifically LoRA, as an economical alternative to full fine-tuning. While full fine-tuning generally yielded slightly higher accuracy, PEFT produced remarkably comparable results with a significantly reduced computational burden. For instance, LoRA-based DistilBERT achieved an F1 score of 0.908, showing that substantial resource savings are possible without a drastic compromise in performance. This matters because fully fine-tuning all parameters of a large model is resource-intensive.
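LoRA's savings come from freezing the pre-trained weights and learning only a low-rank update: a d×k weight update is factored into two matrices B (d×r) and A (r×k), so just r·(d+k) parameters are trained instead of d·k. A back-of-the-envelope sketch (the hidden size matches BERT-base, but the rank r=8 is an assumed example, not the paper's reported setting):

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the full d x k weight matrix is updated."""
    return d * k

def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters when the update is factored as B (d x r) @ A (r x k)."""
    return r * (d + k)

# BERT-base hidden size is 768; rank r=8 is a common LoRA choice (assumed here).
d = k = 768
r = 8
full = full_params(d, k)               # 589,824 parameters per projection matrix
lora = lora_trainable_params(d, k, r)  # 12,288 parameters
print(f"LoRA trains {lora / full:.2%} of the full update")  # ~2%
```

Because r is far smaller than d and k, the trained fraction stays tiny even as the base model grows, which is why LoRA scales so well as an alternative to full fine-tuning.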

This study’s contributions are significant. It has not only achieved state-of-the-art results in classifying GitHub README file sections but also provided valuable insights into the effectiveness of PEFT methods compared to full fine-tuning. The findings underscore the immense potential of LLMs in enhancing the software engineering domain, particularly in improving the identification and potential usage of GitHub repositories.

Future directions for this research include increasing the size of the gold-standard dataset, exploring multilingual GitHub README files, and extending the proposed methodology to other code-sharing and version control platforms. Ultimately, these successful models could be integrated into automated tools to help practitioners categorize README content more efficiently. You can find the full research paper here.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
