TLDR: A new benchmark dataset, Multi-IaC-Bench, has been introduced to evaluate how Large Language Models (LLMs) generate and modify Infrastructure as Code (IaC) across AWS CloudFormation, Terraform, and Cloud Development Kit (CDK) formats. The dataset uses synthetic data (initial templates, natural language requests, updated templates) and rigorous validation. Experiments show that while LLMs can achieve high syntactic accuracy, semantic alignment and complex patterns remain challenging, with prompt engineering and retry mechanisms significantly improving performance.
Infrastructure as Code (IaC) is a cornerstone of modern cloud computing, allowing teams to define and manage their cloud resources using machine-readable configuration files. This approach brings the benefits of software development practices, such as version control, testing, and automated deployments, to infrastructure management. However, the cloud landscape is diverse, with different providers and tools utilizing various IaC formats like AWS CloudFormation, Terraform, and the Cloud Development Kit (CDK). This variety often requires cloud architects to be proficient in multiple languages, adding significant complexity to cloud deployments.
Large Language Models (LLMs) show immense promise in automating the creation and maintenance of IaC. Imagine being able to describe the infrastructure you need in natural language, and an AI generates the correct configuration files for you. While this vision is compelling, progress has been somewhat limited due to a lack of comprehensive benchmarks that can evaluate LLMs across these diverse IaC formats and use cases.
To address this critical gap, researchers from Amazon Web Services – Sam Davidson, Li Sun, Bhavana Bhasker, Laurent Callot, and Anoop Deoras – have introduced Multi-IaC-Bench. This novel benchmark dataset is designed specifically for evaluating LLM-based IaC generation and mutation across AWS CloudFormation, Terraform, and CDK formats. The goal is to establish standardized evaluation metrics for AI-assisted infrastructure management and facilitate further research in this crucial domain.
The Multi-IaC-Bench dataset is built around “triplets” of information. Each triplet includes an initial IaC template (which can even be empty for scenarios where a user wants to generate infrastructure from scratch), a natural language request describing a modification or addition, and the corresponding updated template that implements the requested change. These triplets are created through a sophisticated synthetic data generation pipeline. This pipeline starts by sourcing IaC templates from public GitHub repositories. An LLM then reviews these templates to determine realistic infrastructure changes a customer might request and generates a natural language description of that change. Subsequently, the LLM generates an updated IaC template that incorporates the described modification.
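To make the triplet structure concrete, here is a minimal sketch of how one benchmark record could be represented. The class name `IaCTriplet`, its field names, and the sample Terraform snippet are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IaCTriplet:
    """Hypothetical container for one benchmark example (not the real schema)."""

    iac_format: str        # assumed labels: "cloudformation", "terraform", "cdk"
    initial_template: str  # may be empty when generating from scratch
    request: str           # natural language description of the change
    updated_template: str  # template that implements the request

    @property
    def is_from_scratch(self) -> bool:
        # An empty initial template means the task is pure generation,
        # not modification of existing infrastructure.
        return not self.initial_template.strip()


# Illustrative record: a from-scratch Terraform request.
example = IaCTriplet(
    iac_format="terraform",
    initial_template="",
    request="Create an S3 bucket named example-logs.",
    updated_template='resource "aws_s3_bucket" "logs" {\n  bucket = "example-logs"\n}\n',
)
print(example.is_from_scratch)
```

Keeping the initial template as a possibly empty string lets generation and modification tasks share one record type.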
A key aspect of Multi-IaC-Bench is its rigorous validation process. All sourced and generated IaC templates are checked with static analysis tools such as cfn-lint, TFLint, and Checkov to ensure they meet best practices and security standards. Beyond syntactic correctness, an LLM judge verifies the semantic alignment between the natural language request and the generated IaC, ensuring the changes accurately reflect the user’s intent. The judge’s effectiveness was itself validated through human review, which showed strong correlation between the judge’s assessments and human ratings.
Handling CDK data presented a unique challenge due to its repository-based nature, where multiple files can define infrastructure. Instead of asking the LLM to modify CDK code directly, the researchers developed a round-trip approach: convert the initial CDK template to CloudFormation, prompt the LLM to update the CloudFormation, and then convert the modified CloudFormation back to CDK. This method proved more effective than direct editing at mitigating LLMs’ tendency to generate incorrect CDK syntax.
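The CDK round trip described above can be sketched as a small pipeline. This is only an outline under stated assumptions: the function name `mutate_cdk_via_cloudformation` and the three injected callables are hypothetical stand-ins, and the article does not say which tools or prompts perform each conversion.

```python
from typing import Callable


def mutate_cdk_via_cloudformation(
    cdk_source: str,
    request: str,
    cdk_to_cfn: Callable[[str], str],       # hypothetical: CDK -> CloudFormation
    update_cfn: Callable[[str, str], str],  # hypothetical: LLM edits CloudFormation
    cfn_to_cdk: Callable[[str], str],       # hypothetical: CloudFormation -> CDK
) -> str:
    """Apply a change to CDK code by editing its CloudFormation form."""
    cfn_initial = cdk_to_cfn(cdk_source)
    cfn_updated = update_cfn(cfn_initial, request)
    return cfn_to_cdk(cfn_updated)


# Stub converters so the sketch runs without any cloud tooling or LLM.
def fake_cdk_to_cfn(src: str) -> str:
    return src.replace("cdk:", "cfn:")


def fake_update_cfn(template: str, request: str) -> str:
    return template + "\n# change: " + request


def fake_cfn_to_cdk(template: str) -> str:
    return template.replace("cfn:", "cdk:")


result = mutate_cdk_via_cloudformation(
    "cdk: bucket", "enable versioning",
    fake_cdk_to_cfn, fake_update_cfn, fake_cfn_to_cdk,
)
print(result)
```

Passing the converters in as parameters keeps the LLM editing only the CloudFormation form, which is the point of the round trip.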
The research team evaluated several state-of-the-art LLMs on Multi-IaC-Bench, including Llama 3.2 11B Instruct, DeepSeek R1, and Claude 3.5 Sonnet v2. The experiments measured lint and Checkov pass rates (for syntactic correctness and best practices), the number of LLM calls required, LLM judge scores (for semantic alignment), and edit distance. The results showed that modern LLMs can achieve high success rates (often exceeding 95%) in generating syntactically valid IaC across formats, but significant challenges remain in achieving perfect semantic alignment and handling complex infrastructure patterns.
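Of these metrics, edit distance is the easiest to illustrate. The sketch below computes character-level Levenshtein distance between two templates. The article does not specify which variant the authors use (character or token level, normalized or raw), so treat this as one plausible reading rather than the benchmark's exact metric.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


reference = "InstanceType: t3.micro"
generated = "InstanceType: t3.small"
print(levenshtein(reference, generated))
```

A small distance only says the generated template is textually close to the reference; it does not establish that the change is semantically correct, which is why the benchmark also reports judge scores.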
A crucial finding from their ablation studies highlighted the importance of prompt engineering and retry mechanisms. Simple prompts often led to lower performance. However, incorporating best practice guidelines into prompts and, more significantly, implementing a retry loop (where the LLM regenerates the code if an attempt fails validation) dramatically improved compliance metrics across all IaC formats. For instance, in CloudFormation, the full prompt with a retry loop boosted cfn-lint pass rates from 61.93% to 98.52%.
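A retry loop of this kind can be sketched in a few lines. The names `generate_with_retries`, `generate`, and `validate` are hypothetical, as is the default of three attempts. Feeding the validator's error messages back into the next attempt is also an assumption; the article only says the model regenerates when validation fails.

```python
from typing import Callable, List, Tuple


def generate_with_retries(
    generate: Callable[[str, List[str]], str],  # hypothetical LLM call
    validate: Callable[[str], List[str]],       # hypothetical linter wrapper
    request: str,
    max_attempts: int = 3,                      # assumed limit, not from the paper
) -> Tuple[str, int, bool]:
    """Return (template, number_of_llm_calls, passed_validation)."""
    feedback: List[str] = []
    template = ""
    for attempt in range(1, max_attempts + 1):
        template = generate(request, feedback)
        feedback = validate(template)  # an empty list means every check passed
        if not feedback:
            return template, attempt, True
    return template, max_attempts, False


# Stubs standing in for a model and a linter: the first draft fails once.
def fake_generate(request: str, feedback: List[str]) -> str:
    return "valid template" if feedback else "broken template"


def fake_validate(template: str) -> List[str]:
    return [] if template == "valid template" else ["placeholder lint error"]


print(generate_with_retries(fake_generate, fake_validate, "add a bucket"))
```

Returning the attempt count alongside the template makes it easy to report the number of LLM calls as a metric next to the pass rate.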
The Multi-IaC-Bench dataset and the insights gained from these experiments have significant implications for the future of DevOps and cloud infrastructure management. The high success rates achieved by current LLMs suggest that automated IaC generation and modification are becoming increasingly viable. However, iterative refinement and robust validation remain essential. As LLM capabilities continue to advance, these tools are expected to become invaluable for automating infrastructure tasks, though human oversight will likely remain important for the foreseeable future. More details are available in the full research paper.


