TLDR: This research paper explores the use of Large Language Models (LLMs) for automating the translation of Continuous Integration (CI) configurations, specifically from Travis CI to GitHub Actions. The study quantifies the substantial manual effort involved in current migrations and identifies key issues in LLM-generated translations, including logic inconsistencies, platform discrepancies, environment errors, and syntax errors. It demonstrates that combining guideline-based prompting with iterative refinement significantly enhances LLM performance, achieving a 75.5% build success rate, a notable improvement over basic LLM approaches and existing rule-based tools.
In the fast-paced world of software development, Continuous Integration (CI) and Continuous Delivery/Deployment (CD) are essential practices. They help teams integrate code changes frequently and reliably, ensuring software quality and accelerating development cycles. However, with numerous CI platforms available, organizations often find themselves needing to migrate from one platform to another. A central and often challenging part of this migration is translating CI configurations, which are typically written in YAML format. This process demands a deep understanding of both the source and target platforms’ unique rules and semantics.
A recent research paper, “Exploring and Unleashing the Power of Large Language Models in CI/CD Configuration Translation,” examines how Large Language Models (LLMs) can simplify this complex task. Authored by Jiajun Wu, Chong Wang, Chen Zhang, Wunan Guo, Jianfeng Qu, Yewen Tian, and Yang Liu, the study focuses specifically on migrating configurations from Travis CI, a once-dominant platform, to GitHub Actions, which has largely supplanted it for open-source projects.
The Challenge of Manual Migration
The researchers first quantified the effort involved in manual CI configuration translation. Analyzing 811 migration records, they found that developers typically read about 38 lines of Travis CI configuration and write approximately 58 lines for GitHub Actions. Nearly half of these migrations required multiple attempts and commits to stabilize, which shows that migration is far from straightforward. This manual effort is the motivation for automated solutions.
LLMs Step In: Initial Performance and Common Issues
The study then evaluated the fundamental ability of four representative LLMs—GPT-4o, GPT-4o mini, Qwen-3, and DeepSeek-Coder—to perform these translations. While LLMs showed promise, their initial performance was limited. The researchers identified 1,121 issues across the translated configurations, categorizing them into four main types:
- Logic Inconsistencies (38%): These were the most frequent issues, where the LLM failed to preserve the original workflow’s intended behavior. This could mean missing necessary tasks, adding redundant ones, or executing tasks in the wrong order.
- Platform Discrepancies (32%): Arising from the inherent differences between Travis CI and GitHub Actions, these issues included using unsupported keys, expressions, or architectures, or failing to explicitly define steps that were implicit in the source platform.
- Environment Errors (25%): These typically involved problems with the execution environment, such as referencing obsolete actions or, most commonly, failing to provide required credentials or secrets for external services.
- Syntax Errors (5%): The least common type, these were basic YAML syntax mistakes like incorrect indentation or missing symbols. While less frequent, they can still prevent a workflow from running.
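To make the categories above more concrete, here is a minimal Python sketch of a checker that flags a few of these issues in an already-parsed GitHub Actions workflow. It is an illustration only, not the paper's detection method. The function name `find_issues`, the key list `TRAVIS_ONLY_KEYS`, and the token heuristic are all assumptions made for this example.

```python
from typing import Any

# Top-level keys that are valid in .travis.yml but have no meaning in a
# GitHub Actions workflow (an illustrative subset, not an exhaustive list).
TRAVIS_ONLY_KEYS = {"language", "script", "before_install", "install", "after_success", "dist"}


def find_issues(workflow: dict[str, Any]) -> list[str]:
    """Return human-readable findings for a parsed GitHub Actions workflow."""
    issues = []

    # Platform discrepancy: Travis CI keys copied into the workflow unchanged.
    for key in sorted(workflow.keys() & TRAVIS_ONLY_KEYS):
        issues.append(f"platform discrepancy: unsupported top-level key '{key}'")

    for job_name, job in workflow.get("jobs", {}).items():
        steps = job.get("steps", [])
        # Platform discrepancy: Travis CI clones the repository implicitly,
        # while GitHub Actions needs an explicit checkout step.
        if not any(str(s.get("uses", "")).startswith("actions/checkout") for s in steps):
            issues.append(f"platform discrepancy: job '{job_name}' has no checkout step")
        # Environment error (rough heuristic, assumed for this sketch): a
        # token-like variable that is not read from the secrets context.
        for step in steps:
            for name, value in step.get("env", {}).items():
                if "TOKEN" in name and "secrets." not in str(value):
                    issues.append(
                        f"environment error: '{name}' in job '{job_name}' is not read from secrets"
                    )
    return issues
```

A check like this only catches surface patterns. Logic inconsistencies, the largest category, generally require comparing the translated workflow's behavior against the original.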
Among the LLMs tested, GPT-4o performed best, achieving a Build Success Rate (BSR) of 25.8%, meaning about a quarter of its translations ran successfully without further intervention. This indicated that while LLMs could generate configurations, there was significant room for improvement.
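As a quick illustration of the metric, the sketch below computes a build success rate as the percentage of translated configurations whose build passed. This assumes BSR is a simple pass ratio, as the description above suggests. The function name and the sample outcomes are hypothetical and are not data from the study.

```python
def build_success_rate(build_passed: list[bool]) -> float:
    """Percentage of translated configurations whose build succeeded."""
    if not build_passed:
        raise ValueError("need at least one build result")
    return 100.0 * sum(build_passed) / len(build_passed)


# Made-up outcomes for four translations, not data from the study.
print(build_success_rate([True, False, False, False]))  # 25.0
```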
Enhancing LLM Translation Capabilities
To boost performance, the researchers investigated three enhancement strategies:
- One-shot Prompting: Providing the LLM with a single example of a successful migration. Surprisingly, this strategy did not improve accuracy and sometimes even hindered it.
- Guideline-based Prompting: Guiding the LLM with explicit natural language rules derived from the identified issue taxonomy. This approach raised the BSR to 40.2%, demonstrating the value of structured instructions.
- Iterative Refinement: Using error messages from failed workflow executions as feedback to allow the LLM to progressively correct and refine its generated configuration. This strategy proved highly effective, raising the BSR to 68.6%.
The strongest result came from combining guideline-based prompting with iterative refinement. By first guiding GPT-4o with explicit rules and then letting it refine its output based on build feedback, the combined strategy achieved a BSR of 75.5%. That is nearly three times the 25.8% of the basic GPT-4o baseline and, according to the study, more than four times the rate of GitHub’s official rule-based migration tool, GitHub Actions Importer.
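The combined strategy can be sketched roughly as a loop: prompt with guidelines, run the build, and feed any error log back into the next prompt. The Python below is a minimal sketch under stated assumptions, not the authors' implementation. The callables `ask_llm` and `run_build`, the guideline text, the prompt wording, and the default of three rounds are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical guideline text; the paper's actual guidelines are not reproduced here.
GUIDELINES = (
    "Preserve the order and intent of every task in the source configuration.\n"
    "Add explicit steps for anything Travis CI does implicitly, such as checkout.\n"
    "Read credentials from secrets instead of hard-coding them.\n"
)


def build_prompt(travis_yaml: str, previous: Optional[str], error_log: Optional[str]) -> str:
    """Compose the prompt: guidelines first, then the source, then any build feedback."""
    prompt = (
        f"Follow these guidelines:\n{GUIDELINES}\n"
        f"Translate to GitHub Actions:\n{travis_yaml}\n"
    )
    if previous is not None and error_log is not None:
        prompt += (
            f"\nYour previous attempt:\n{previous}\n"
            f"It failed with:\n{error_log}\nFix it.\n"
        )
    return prompt


def translate_with_refinement(
    travis_yaml: str,
    ask_llm: Callable[[str], str],
    run_build: Callable[[str], Optional[str]],
    max_rounds: int = 3,  # assumed budget, not a value from the paper
) -> tuple[str, bool]:
    """Return (workflow, succeeded).

    run_build must return None when the build succeeds, otherwise an error log.
    """
    previous: Optional[str] = None
    error_log: Optional[str] = None
    workflow = ""
    for _ in range(max_rounds):
        workflow = ask_llm(build_prompt(travis_yaml, previous, error_log))
        error_log = run_build(workflow)
        if error_log is None:
            return workflow, True
        previous = workflow
    return workflow, False
```

Passing the model and the build runner in as plain callables keeps the loop independent of any particular LLM API or CI service, so it can be exercised with stubs before being wired to real systems.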
Looking Ahead
This research underscores the significant potential of LLMs in automating complex software engineering tasks like CI configuration translation. By understanding their limitations and employing strategic prompting and feedback mechanisms, developers can leverage these powerful models to streamline migrations, reduce manual effort, and improve the reliability of CI/CD pipelines. The study’s findings pave the way for more intelligent and efficient software development workflows.


