TLDR: A new theoretical framework is proposed to standardize the design and reporting of empirical studies on LLM-based code generation. It organizes evaluation around core components like problem sources, quality attributes, and metrics, aiming to improve comparability and reproducibility in a field currently lacking consistent methodologies. The framework was developed through a bottom-up approach, combining author experience with a review of existing literature, and is designed to be extensible for future research.
The rapid advancement of large language models (LLMs) has opened up exciting new possibilities for automated code generation, promising to tackle a wide array of software engineering challenges. However, the way we evaluate these powerful AI tools in empirical studies has been largely inconsistent. Researchers often use different goals, tasks, and metrics, making it difficult to compare findings or reproduce experiments effectively.
To address this fragmentation, Nathalia Nascimento, Everton Guimaraes, and Paulo Alencar have proposed a new theoretical framework designed to standardize how empirical studies on LLM-based code generation are designed and reported. This framework aims to bring much-needed structure and systematicity to the evaluation process, ultimately enhancing the comparability and reproducibility of research in this critical area.
Developing the Framework
The framework was developed through a bottom-up process. It draws heavily on the authors’ own extensive experience conducting experiments with LLMs in software engineering. Additionally, it incorporates insights from a comparative analysis of recent empirical studies. The researchers conducted a structured search across the ACM Digital Library, identifying 75 papers published between 2023 and 2025. After applying specific inclusion and exclusion criteria, 32 papers were retained, with 13 selected for in-depth analysis. This approach helped distill common patterns and recurring elements from existing research, which form the foundation of the framework’s generalizable structure.
Core Components of the Framework
The proposed framework organizes the evaluation of LLM-based code generation around six core components:
- Coding Task: Defines the nature of the programming problem.
- Quality and Metrics Evaluation: Specifies the attributes of code quality to be assessed and the metrics used to measure them.
- Empirical Research: Outlines the research methodology and experimental design.
- Environment: Details the computational resources and hardware constraints.
- LLM Model: Covers aspects like prompt engineering, parameter tuning, and model selection.
- Generated Output: Describes the type of code or behavior produced by the LLM.
Each of these components offers a configurable space, allowing researchers to tailor their experiments to specific goals and contexts. For instance, a study might focus on ‘Correctness’ and ‘Energy Efficiency’ as quality attributes, leading to the selection of metrics like ‘Pass@1’ and ‘Energy Consumption’. Problem sources could range from competitive programming platforms like LeetCode to real-world scenarios found on GitHub.
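To give a feel for what such a configuration could look like in practice, here is a minimal Python sketch of a study description organized around the six components. The class and field names (StudyConfig, CodingTask, and so on) are our own illustrative invention; the paper presents the framework conceptually, not as code.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical encoding of the framework's six components as a study
# configuration; all names and fields are illustrative, not from the paper.

@dataclass
class CodingTask:
    problem_source: str              # e.g. "LeetCode", "GitHub"
    task_category: str               # e.g. "algorithmic", "exception handling"

@dataclass
class QualityEvaluation:
    quality_attributes: List[str]    # e.g. ["Correctness", "Energy Efficiency"]
    metrics: List[str]               # e.g. ["Pass@1", "Energy Consumption"]

@dataclass
class EmpiricalResearch:
    methodology: str                 # e.g. "controlled experiment"
    research_questions: List[str]

@dataclass
class Environment:
    hardware: str                    # e.g. "NVIDIA A100, 40 GB"
    constraints: List[str] = field(default_factory=list)

@dataclass
class LLMModel:
    name: str                        # e.g. "ChatGPT"
    prompting_strategy: str          # e.g. "zero-shot"
    parameters: dict = field(default_factory=dict)   # e.g. {"temperature": 0.0}

@dataclass
class GeneratedOutput:
    artifact_type: str               # e.g. "function-level Python code"

@dataclass
class StudyConfig:
    coding_task: CodingTask
    quality: QualityEvaluation
    research: EmpiricalResearch
    environment: Environment
    model: LLMModel
    output: GeneratedOutput
```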
Putting the Framework into Practice
To demonstrate its applicability, the authors illustrated how an existing study, which compared ChatGPT’s performance and efficiency against human programmers on LeetCode-style tasks, could be mapped onto the framework. This exercise highlighted how the framework can systematically capture various aspects of an experiment, from the chosen LLM (ChatGPT) and quality attributes (correctness, execution time, memory usage) to the controlled experimental setup.
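Continuing the hypothetical sketch above, that mapping might be captured roughly as follows; the attribute values are paraphrased from the article’s description, not taken verbatim from the original study.

```python
# Hypothetical mapping of the ChatGPT-vs-human-programmers study onto the
# StudyConfig sketch above; values are paraphrased, not authoritative.
chatgpt_vs_humans = StudyConfig(
    coding_task=CodingTask(problem_source="LeetCode", task_category="algorithmic"),
    quality=QualityEvaluation(
        quality_attributes=["Correctness", "Execution Time", "Memory Usage"],
        metrics=["Pass@1", "Runtime (ms)", "Peak Memory (MB)"],
    ),
    research=EmpiricalResearch(
        methodology="controlled experiment",
        research_questions=["Does ChatGPT match human performance on LeetCode-style tasks?"],
    ),
    environment=Environment(hardware="LeetCode online judge"),
    model=LLMModel(name="ChatGPT", prompting_strategy="zero-shot"),
    output=GeneratedOutput(artifact_type="function-level solutions"),
)
```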
Furthermore, the framework’s adaptability was tested by mapping two representative studies not used in its initial construction. One study, by Ouyang et al., explored ChatGPT’s non-determinism in code generation, suggesting the need to formalize ‘stability’ as a quality attribute and include variance-based metrics. Another study, by Ren et al., investigated prompt-chaining methods for improving exception handling, indicating extensions for defining specialized task categories and modeling advanced prompting strategies.
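To make the ‘stability’ suggestion concrete, one way a variance-based metric could work is to regenerate solutions for the same prompt several times and measure how much the pass/fail outcome fluctuates. The sketch below is our own illustration of that idea, not a metric defined by Ouyang et al.; the `generate` and `passes_tests` callables stand in for whatever model interface and test harness a study uses.

```python
import statistics
from typing import Callable, List

def pass_rate_variance(
    generate: Callable[[str], str],       # produces a candidate solution for a prompt
    passes_tests: Callable[[str], bool],  # runs the task's test suite on a solution
    prompt: str,
    runs: int = 10,
) -> float:
    """Illustrative variance-based stability metric: variance of per-run
    pass/fail outcomes when the same prompt is submitted repeatedly."""
    outcomes: List[int] = [int(passes_tests(generate(prompt))) for _ in range(runs)]
    return statistics.pvariance(outcomes)  # 0.0 means fully deterministic outcomes
```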
Future Directions
The researchers envision this framework as a living artifact that will continue to evolve. Future plans include a systematic literature review to refine and expand its components, potentially adding dimensions like ‘Reproducibility Factors’ (e.g., seed control, model versioning) and ‘Data Collection Strategies’. They also aim to apply the framework to design new empirical studies, particularly focusing on identified gaps in the literature.
Ambitiously, the authors plan to evolve the framework into an interactive tool that could automatically generate research protocols based on specified application domains and research goals. This tool would recommend research questions, quality attributes, and evaluation metrics, streamlining the design of controlled experiments. They also foresee automating parts of the experimental pipeline, from scraping problems to performing statistical analyses, and extending the framework’s application to other software engineering tasks like unit test generation, bug fixing, and documentation synthesis.
This work represents a significant step towards bringing consistency and rigor to the empirical evaluation of LLM-based code generation, fostering a more standardized and comprehensive approach to research in this rapidly evolving field. You can read the full research paper here: Designing Empirical Studies on LLM-Based Code Generation: Towards a Reference Framework.


