
Assessing Code Quality: A Deep Dive into LLM-Generated Code Smells

TLDR: A new research paper investigates the quality of code generated by Large Language Models (LLMs) by analyzing “code smells” – indicators of poor code quality. The study compares code from Gemini Pro, ChatGPT, Codex, and Falcon against human-written baselines. Findings show LLM-generated code has significantly more code smells (average 63.34% increase), with quality deteriorating on complex and advanced programming topics. While functional correctness often correlates with fewer smells, this isn’t always consistent across all LLMs. The research highlights critical areas for improving LLM code generation quality beyond just functional correctness.

Large Language Models (LLMs) are rapidly changing how software is developed, assisting programmers by generating code. While much attention has been paid to whether this generated code is functionally correct, a new study delves into a crucial, yet often overlooked, aspect: code quality. This research, titled “Investigating The Smells of LLM Generated Code”, by Debalina Ghosh Paul, Hong Zhu, and Ian Bayley from Oxford Brookes University, explores the prevalence of ‘code smells’ in LLM-generated programs compared to professionally written human code.

Understanding Code Smells

In software engineering, a “code smell” isn’t a bug, but rather an indicator of a deeper problem in the code that could lead to issues with maintenance, evolution, or reuse. These are suboptimal choices in implementation or design that make code unnecessarily complex, difficult to understand, or hard to modify. Examples include inconsistent naming, excessive complexity, or poor modularization. Traditionally, detecting code smells has been subjective, relying on human intuition. However, this study employs an objective, automated approach to benchmark LLM performance against a baseline of high-quality human-written code.
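To make the idea concrete, here is a minimal illustrative sketch (in Python rather than the study's Java, and not taken from the paper) showing two common implementation smells and their refactored form:

```python
# Smelly version: the logic works, but two implementation smells make it
# harder to maintain.

def calc(x):           # smell: vague, non-descriptive function name
    return x * 0.0825  # smell: "magic number" -- an unexplained literal


# Refactored version: the constant is named and the function says what it does.

SALES_TAX_RATE = 0.0825  # named constant replaces the magic number

def sales_tax(amount):
    return amount * SALES_TAX_RATE
```

Both versions return the same result; the difference is purely in how easy the code is to understand and modify later, which is exactly what smell detectors measure.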

How the Study Was Conducted

The researchers developed a scenario-based method to evaluate code quality. They used the ScenEval benchmark, a dataset of over 12,000 Java programming tasks, each with a reference solution written by textbook authors or professional programmers. These reference solutions served as the baseline for “good quality” code. The test dataset for this study comprised 1000 randomly sampled tasks. Four state-of-the-art LLMs – Gemini Pro, ChatGPT, Codex, and Falcon – were prompted to generate Java code for these tasks. An automated test system, utilizing tools like PMD, Checkstyle, and DesigniteJava, then analyzed both the LLM-generated code and the human-written reference solutions for various types of code smells, categorizing them into implementation smells (e.g., inconsistent naming, magic numbers, documentation issues) and design smells (e.g., modularity, encapsulation, hierarchy issues).
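The evaluation loop described above can be sketched roughly as follows. This is a hypothetical Python outline, not the authors' actual pipeline: `generate_code` stands in for a call to an LLM, and `detect_smells` stands in for running tools such as PMD, Checkstyle, or DesigniteJava on the source.

```python
def evaluate(tasks, generate_code, detect_smells):
    """For each task, count smells in the LLM's code and in the
    human-written reference solution, so the two can be compared."""
    results = []
    for task in tasks:
        llm_smells = detect_smells(generate_code(task["prompt"]))
        ref_smells = detect_smells(task["reference_solution"])
        results.append({
            "task": task["id"],
            "llm": len(llm_smells),
            "reference": len(ref_smells),
        })
    return results
```

Aggregating these per-task counts across the 1000 sampled tasks is what yields the percentage comparisons reported in the findings below.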

Key Findings: LLMs and Code Quality

The study revealed several significant insights into the quality of LLM-generated code:

Overall Code Smell Prevalence

LLM-generated code consistently exhibited a higher incidence of code smells than the human-written reference solutions: a 63.34% increase on average. Falcon fared best with the smallest increase (42.28%), while Codex showed the largest at 84.97%. Implementation smells rose by 73.35% on average, and design smells by 21.42%.
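The percentage figures above follow the standard relative-increase formula, which can be stated as a one-line sketch (the function name is ours, not the paper's):

```python
def smell_increase(llm_smells: float, human_smells: float) -> float:
    """Percentage increase of smells in LLM code over the human baseline."""
    return (llm_smells - human_smells) / human_smells * 100
```

For example, if a detector reports 163 smells in LLM output against 100 in the reference solutions, the increase is 63%, close to the study's overall average.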

Impact of Programming Topics

The quality of LLM-generated code varied significantly across different programming topics. LLMs performed best on basic coding tasks like “Basic Exercise,” “String,” and “DateTime.” However, their performance deteriorated significantly on more advanced topics such as “Encapsulation,” “Array,” “OOP,” “Inheritance,” and “Searching & Sorting.” For instance, code smells in “Encapsulation” tasks increased by an average of 138.53% across all LLMs, with ChatGPT showing a staggering 165.38% increase in this area. Interestingly, LLMs sometimes improved code quality on specific topics, such as “Regular Expression.”

Influence of Task Complexity

As the complexity of coding tasks increased, so did the prevalence of code smells in both human-written and LLM-generated code. However, LLMs struggled disproportionately more on highly complex tasks, indicating difficulty maintaining code quality under demanding conditions. The study found a strong correlation between task complexity (measured by cyclomatic complexity and lines of code) and the increase in code smells.
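One of the complexity measures mentioned, McCabe cyclomatic complexity, can be approximated as one plus the number of branch points in the code. The rough keyword-counting sketch below is an assumption for illustration only; the study would have used proper tooling rather than regular expressions:

```python
import re

# Branch points in Java-like source: conditional/loop keywords plus the
# short-circuit operators. This crude count ignores strings and comments.
_BRANCHES = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\|")

def approx_cyclomatic_complexity(java_source: str) -> int:
    """Rough estimate: 1 + number of branch points found in the source."""
    return 1 + len(_BRANCHES.findall(java_source))
```

A straight-line snippet like `return x;` scores 1, while a snippet containing an `if` with `&&` inside a `for` loop scores 4, so higher scores track the kind of tasks where the study saw quality deteriorate.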

Specific Types of Code Smells

The study identified that the least prevalent implementation smells in both human and LLM-generated code were “Incompleteness,” “Inconsistent Naming Convention,” and “Redundancy.” Conversely, the most prevalent and problematic implementation smells were “Magic Number,” “Documentation,” and “Improper Alignment and Placement.” While the overall prevalence of smell types in LLM code correlated strongly with human code, the largest *increases* in smells for LLMs occurred in types that were *least prevalent* in human code, such as “Inconsistent Naming Convention” (over 1000% increase) and “Incompleteness” (over 300% increase).

Correctness vs. Code Smells

Generally, code that was functionally correct tended to have fewer code smells, but this was not a universal rule across all LLMs. For example, Falcon's incorrect code sometimes had fewer smells than its correct code, showing that functional correctness does not guarantee high code quality.


Implications for Future Development

This research underscores that while LLMs are powerful tools for code generation, the quality of their output, particularly concerning code smells, is noticeably poorer than human-written code. This suggests a critical area for improvement in LLM development, especially for complex and advanced programming tasks. Future work could involve expanding these experiments to more LLM models and programming languages, and exploring how code smell detection can be integrated into iterative development processes to enhance the quality of LLM-generated code.

Karthik Mehta
Karthik Mehta is a data journalist known for his data-rich, insightful coverage of AI news and developments. Armed with a degree in Data Science from IIT Bombay and years of newsroom experience, Karthik merges storytelling with metrics to surface deeper narratives in AI-related events. His writing cuts through hype, revealing the real-world impact of Generative AI on industries, policy, and society. You can reach him at: [email protected]
