
Automating GitHub README Classification with Large Language Models

TLDR: This research introduces an innovative approach to automatically classify sections of GitHub README files using fine-tuned Large Language Models (LLMs) such as BERT, DistilBERT, and RoBERTa. By leveraging a dataset of 4226 README sections, the study achieved an F1 score of 0.98, significantly outperforming existing state-of-the-art methods. The paper also demonstrates that Parameter-Efficient Fine-Tuning (PEFT) with LoRA offers a cost-effective alternative to full fine-tuning, maintaining strong performance while drastically reducing computational resources. This work highlights the potential of LLMs to improve the organization and utility of GitHub repositories.

GitHub stands as the world’s leading platform for storing, sharing, and managing code, with over 100 million developers and more than 420 million repositories. A crucial component of every GitHub repository is its README file, which ideally provides comprehensive project information to help others use and improve the code. In practice, however, repository owners often neglect these recommendations, limiting their projects’ reach and impact within the developer and research communities.

A survey conducted by GitHub revealed that approximately 93% of respondents were dissatisfied with the quality of documentation and README files. Addressing this, previous research by Prana et al. created a dataset of 4226 GitHub README file sections and developed a classifier using heuristic features, achieving an F1 score of 0.746. While that study demonstrated the practical value of labeling README sections for improving readability, its approach was constrained by the cost and time required to produce human-labeled training data.

The advent of the Transformer architecture in 2017 revolutionized Natural Language Processing (NLP), enabling unprecedented capabilities in text classification, generation, summarization, and translation through Large Language Models (LLMs). These models are pre-trained on vast amounts of unlabeled data and can then be fine-tuned for specific tasks with much smaller labeled datasets.

This new research, titled “LLM-based Content Classification Approach for GitHub Repositories by the README Files”, explores the potential of encoder-only LLMs to automate the classification of GitHub README file sections. The study utilized three prominent encoder-only LLMs: BERT, DistilBERT, and RoBERTa. These models were fine-tuned using the aforementioned gold-standard dataset of 4226 README file sections.

The methodology involved several key steps. Data preprocessing included content abstraction (replacing elements such as code blocks and tables with placeholders), tokenization, stop-word removal, and lemmatization. To address class imbalance in the dataset, oversampling was applied. The data was then split 70/30 into training and test sets, with stratified sampling to preserve the class distribution. Model construction followed two approaches: full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) using Low-Rank Adaptation (LoRA).
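The preprocessing and splitting steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's exact pipeline: the regex patterns, placeholder tokens, and split helper are assumptions for the sake of the example.

```python
import re
import random
from collections import defaultdict

def abstract_content(section: str) -> str:
    """Content abstraction: replace structural elements with placeholder tokens.
    (Illustrative patterns only; the paper's exact rules may differ.)"""
    section = re.sub(r"```.*?```", "CODEBLOCK", section, flags=re.DOTALL)  # fenced code blocks
    section = re.sub(r"^\|.*\|$", "TABLE", section, flags=re.MULTILINE)    # markdown table rows
    section = re.sub(r"https?://\S+", "URL", section)                      # hyperlinks
    return section

def stratified_split(sections, labels, test_frac=0.3, seed=42):
    """70/30 split that preserves the per-class label distribution."""
    by_label = defaultdict(list)
    for s, y in zip(sections, labels):
        by_label[y].append(s)
    rng = random.Random(seed)
    train, test = [], []
    for y, items in by_label.items():
        rng.shuffle(items)
        cut = int(len(items) * (1 - test_frac))  # 70% of each class to training
        train += [(s, y) for s in items[:cut]]
        test += [(s, y) for s in items[cut:]]
    return train, test
```

Stratifying per class matters here because the README-section labels are imbalanced; a naive random split could leave a rare section type underrepresented in the test set.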

The results of this study are highly promising. The fine-tuned LLMs demonstrated exceptional performance in categorizing GitHub README file sections, achieving an impressive overall F1 score of 0.98. This significantly surpasses the F1 score of 0.746 achieved by previous state-of-the-art methods that relied on traditional machine learning algorithms like SVM, Random Forest, and Naive Bayes.
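For reference, the F1 score is the harmonic mean of precision and recall, and for multi-class tasks like section labeling it is typically averaged over classes. A minimal sketch of the metric (the toy labels are invented for illustration, not the paper's data):

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_score(y_true, y_pred, c) for c in classes) / len(classes)
```

On this scale, the jump from 0.746 to 0.98 means the fine-tuned models misclassify only a small fraction of the sections that the heuristic-feature classifiers got wrong.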

Furthermore, the research investigated the effectiveness of PEFT techniques, specifically LoRA, as an economical alternative to full fine-tuning. While full fine-tuning generally yielded slightly higher accuracy, PEFT produced remarkably comparable results with a significantly reduced computational burden. For instance, LoRA-based DistilBERT achieved an F1 score of 0.908, showing that substantial resource savings are possible without a drastic compromise in performance. This matters because fully fine-tuning all parameters of a large model is resource-intensive.
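LoRA's savings come from freezing the pre-trained weights and learning only a low-rank update: a d×k weight update is factored into two matrices B (d×r) and A (r×k), so just r·(d+k) parameters are trained instead of d·k. A back-of-the-envelope sketch (the hidden size matches BERT-base, but the rank r=8 is an assumed example, not the paper's reported setting):

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the full d x k weight matrix is updated."""
    return d * k

def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Trainable parameters when the update is factored as B (d x r) @ A (r x k)."""
    return r * (d + k)

# BERT-base hidden size is 768; rank r=8 is a common LoRA choice (assumed here).
d = k = 768
r = 8
full = full_params(d, k)               # 589,824 parameters per projection matrix
lora = lora_trainable_params(d, k, r)  # 12,288 parameters
print(f"LoRA trains {lora / full:.2%} of the full update")  # ~2%
```

Because r is far smaller than d and k, the trained fraction stays tiny even as the base model grows, which is why LoRA scales so well as an alternative to full fine-tuning.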

This study’s contributions are significant. It has not only achieved state-of-the-art results in classifying GitHub README file sections but also provided valuable insights into the effectiveness of PEFT methods compared to full fine-tuning. The findings underscore the immense potential of LLMs in enhancing the software engineering domain, particularly in improving the identification and potential usage of GitHub repositories.

Future directions for this research include increasing the size of the gold-standard dataset, exploring multilingual GitHub README files, and extending the proposed methodology to other code-sharing and version control platforms. Ultimately, these successful models could be integrated into automated tools to help practitioners categorize README content more efficiently. You can find the full research paper here.

Meera Iyer (https://blogs.edgentiq.com)
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist at a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She is particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]
