Assessing AI’s Grasp of Software Design: A Study on LLMs and SOLID Principles

TLDR: A study evaluated leading LLMs (GPT-4o Mini, Qwen2.5 Coder, CodeLlama, DeepSeekCoder) on detecting SOLID design principle violations across Python, Java, C#, and Kotlin. Using a new 240-code-snippet dataset and four prompt strategies, researchers found GPT-4o Mini significantly outperformed others. Prompt strategy dramatically impacts accuracy, with no single best approach. Detection accuracy degrades sharply with code complexity and is higher in statically-typed languages (C#, Java) than dynamically-typed (Python). The findings highlight that effective AI-driven design analysis requires matching the right model and prompt to the specific context, emphasizing LLMs’ potential for AI-assisted code analysis and maintainability.

In software development, producing high-quality, maintainable, and extensible code is a constant challenge. The SOLID principles provide a foundation for good design: Single Responsibility (SRP), Open/Closed (OCP), Liskov Substitution (LSP), Interface Segregation (ISP), and Dependency Inversion (DIP). However, violations of these principles are common and can significantly degrade code quality over time.
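To make the idea of a violation concrete, here is a minimal Python sketch of an SRP violation and one possible refactoring. The class names (`ReportViolatesSRP`, `Report`, `FileReportStore`) are hypothetical and are not taken from the study's dataset; they only illustrate the kind of flaw a detector would be asked to spot.

```python
import json


class ReportViolatesSRP:
    """Illustrative SRP violation: one class formats data AND persists it."""

    def __init__(self, rows):
        self.rows = rows

    def to_json(self):
        return json.dumps(self.rows)

    def save(self, path):
        # Second reason to change: storage details live next to formatting.
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(self.to_json())


class Report:
    """After refactoring: only knows how to represent itself."""

    def __init__(self, rows):
        self.rows = rows

    def to_json(self):
        return json.dumps(self.rows)


class FileReportStore:
    """After refactoring: only knows how to persist a report."""

    def save(self, report, path):
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(report.to_json())
```

In the first class, a change to the storage backend and a change to the output format both force edits to the same class, which is the usual informal test for an SRP violation.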

Traditional methods for detecting these design flaws, such as static analysis, often fall short because they struggle with the semantic understanding required to identify violations across all five SOLID principles in various programming languages. With Large Language Models (LLMs) increasingly integrated into developer workflows, a crucial question arises: do these powerful AI models truly understand the principles of good software design, or do they merely generate functional but architecturally flawed code?

A recent research paper titled “Are We SOLID Yet? An Empirical Study on Prompting LLMs to Detect Design Principle Violations” by Fatih Pehlivan, Arçin Ülkü Ergüzen, Sahand Moslemi Yengejeh, Mayasah Lami, and Anil Koyuncu, addresses this critical gap. The study introduces a novel methodology that leverages tailored prompt engineering to evaluate LLMs on their ability to detect SOLID violations across multiple programming languages.

A New Benchmark for LLM Design Awareness

To conduct their evaluation, the researchers constructed a new benchmark dataset comprising 240 manually validated code examples. These examples cover all five SOLID principles across four popular languages: Python, Java, C#, and Kotlin. Each scenario was implemented at three difficulty levels—easy, moderate, and hard—to assess how code complexity impacts detection accuracy.
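The shape of such a benchmark can be sketched as a grid of principle, language, and difficulty. The record type `BenchmarkExample` and its field names are assumptions made for illustration, and the per-cell count below holds only if the 240 examples are spread evenly across the grid, which this article does not state.

```python
from dataclasses import dataclass
from itertools import product

PRINCIPLES = ("SRP", "OCP", "LSP", "ISP", "DIP")
LANGUAGES = ("Python", "Java", "C#", "Kotlin")
DIFFICULTIES = ("easy", "moderate", "hard")


@dataclass(frozen=True)
class BenchmarkExample:
    """Hypothetical record layout for one labeled code snippet."""
    principle: str   # the violated principle (ground-truth label)
    language: str
    difficulty: str
    code: str


# Every (principle, language, difficulty) combination in the grid.
cells = list(product(PRINCIPLES, LANGUAGES, DIFFICULTIES))
total_examples = 240  # figure reported in the article

# Assumption: an even split across cells.
per_cell = total_examples // len(cells)
print(len(cells), per_cell)
```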

The study benchmarked four leading LLMs: CodeLlama:70B, DeepSeekCoder:33B, Qwen2.5 Coder:32B, and GPT-4o Mini. To systematically measure the impact of interaction methods, they tested four distinct prompt strategies inspired by established techniques like zero-shot, few-shot, and chain-of-thought. These strategies included a direct DEFAULT prompt, an EXAMPLE prompt providing hints, a two-step SMELL prompt, and an ENSEMBLE strategy that asks the model to score and justify multiple principles.
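The four strategies can be pictured as prompt builders like the ones below. This is only a sketch: the exact prompt wording used in the paper is not given here, so every string, function name, and the 0-10 scoring scale is an assumption, as is the guess that the SMELL strategy's first step asks for code smells.

```python
PRINCIPLES = ("SRP", "OCP", "LSP", "ISP", "DIP")
_NAMES = ", ".join(PRINCIPLES)


def default_prompt(code: str) -> str:
    """DEFAULT: a direct question with no hints."""
    return f"Which SOLID principle ({_NAMES}) does this code violate?\n\n{code}"


def example_prompt(code: str, hint: str) -> str:
    """EXAMPLE: the direct question preceded by a hint."""
    return f"Hint: {hint}\n\n" + default_prompt(code)


def smell_prompts(code: str):
    """SMELL: two steps. Returns the first prompt and a builder for the second."""
    step_one = f"List the code smells in this code.\n\n{code}"

    def step_two(smells: str) -> str:
        # The model's answer to step one is fed back into step two.
        return (f"Given these smells:\n{smells}\n\n"
                f"Which SOLID principle ({_NAMES}) is violated?\n\n{code}")

    return step_one, step_two


def ensemble_prompt(code: str) -> str:
    """ENSEMBLE: score and justify every principle, then pick one."""
    return (f"For each of {_NAMES}, give a 0-10 violation score and a "
            f"one-sentence justification, then name the most violated one.\n\n{code}")
```

Keeping each strategy as a plain function makes it easy to run the same snippet through all four and compare the answers.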

Key Findings: GPT-4o Mini Leads, Prompts Matter, Complexity Hurts

The results reveal a clear hierarchy among the models. GPT-4o Mini decisively outperformed the others across most principles, particularly SRP, OCP, and ISP. Qwen2.5 Coder:32B came a distant second, showing competence in SRP and OCP but struggling with more nuanced principles like DIP. CodeLlama:70B and DeepSeekCoder:33B generally performed poorly, especially on DIP, where three of the four models were largely ineffective.

Crucially, the research highlights that prompt strategy has a dramatic impact on detection accuracy, though no single strategy is universally best. For instance, the deliberative ENSEMBLE prompt excelled at OCP detection, while a hint-based EXAMPLE prompt proved superior for DIP violations. The baseline DEFAULT strategy was effective for SRP and ISP, suggesting that for principles with clear structural patterns, a direct approach is sufficient. In contrast, the SMELL strategy consistently underperformed, indicating that indirect, two-step reasoning was ineffective for this task.

The study also found that detection accuracy is heavily dependent on programming language characteristics. Statically-typed languages like C# and Java generally yielded higher accuracy, especially for simpler principles, likely due to their explicit type declarations and formal class structures providing clearer signals for LLMs. Python, with its dynamic typing and syntactic flexibility, consistently presented the greatest challenge, resulting in the lowest accuracy for most principles.

A cross-cutting finding was the significant impact of code complexity. Across all models, prompts, and languages, increasing code complexity sharply degraded detection performance. Principles like OCP, LSP, and DIP saw accuracy plummet as samples moved from easy to hard, underscoring the difficulty LLMs face in untangling design violations from general code complexity.
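A breakdown like this amounts to grouping predictions by one attribute and computing accuracy per group. The sketch below shows that calculation; the helper `accuracy_by` and the record keys are hypothetical, and the four toy records are invented purely to exercise the function, not results from the study.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group: fraction of correct predictions} for the given key."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        hits[group] += int(rec["predicted"] == rec["expected"])
    return {group: hits[group] / totals[group] for group in totals}


# Invented toy data, NOT taken from the paper.
toy_records = [
    {"difficulty": "easy", "expected": "SRP", "predicted": "SRP"},
    {"difficulty": "easy", "expected": "OCP", "predicted": "OCP"},
    {"difficulty": "hard", "expected": "DIP", "predicted": "SRP"},
    {"difficulty": "hard", "expected": "LSP", "predicted": "LSP"},
]

by_difficulty = accuracy_by(toy_records, "difficulty")
print(by_difficulty)
```

The same function works for any other attribute stored on the records, such as language or model, by passing a different `key`.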

Why Models Struggle and What’s Next

The researchers identified several reasons for model failures, including principle ambiguity (especially for DIP and LSP, which require abstract reasoning), the increased cognitive load of two-step prompting strategies like SMELL, and frequent non-adherence to requested output formats. The decline in performance with complexity suggests that incidental code complexity can obscure the design-relevant signals LLMs are meant to detect.
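Non-adherence to the requested output format matters in practice because free-form answers have to be mapped back to a label before they can be scored. One lenient way to do that is sketched below; the function `extract_principle` is hypothetical and is not the parsing method used in the study.

```python
import re

# Matches any of the five principle acronyms as a whole word, in any case.
_PATTERN = re.compile(r"\b(SRP|OCP|LSP|ISP|DIP)\b", re.IGNORECASE)


def extract_principle(response: str):
    """Return the single principle named in the response, else None.

    None covers both "no principle mentioned" and "several mentioned",
    since an ambiguous answer cannot be scored as one label.
    """
    found = {match.upper() for match in _PATTERN.findall(response)}
    return found.pop() if len(found) == 1 else None


print(extract_principle("The violated principle is: srp."))
print(extract_principle("Could be OCP or DIP."))
```

Treating ambiguous answers as unparseable is a deliberately strict choice here; a different scoring scheme could instead take the first principle mentioned.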

This study provides foundational insights into LLM capabilities and limitations in understanding software design principles. The authors emphasize that effective, AI-driven design analysis requires a tailored approach that matches the right model and prompt to the specific design context. This work has direct implications for assessing AI coding assistants, as an LLM’s “design awareness” is a crucial proxy for its ability to generate maintainable and extensible code.

Future work aims to move from mere violation detection to automated refactoring, tasking LLMs with generating corrections for identified violations. This would involve rigorous assessment of LLM-generated solutions through expert review and automated test cases. Further research will also expand the benchmark to include more models, real-world industrial code, and a wider range of design patterns.

Meera Iyer
Meera Iyer is an AI news editor who blends journalistic rigor with storytelling elegance. Formerly a content strategist in a leading tech firm, Meera now tracks the pulse of India's Generative AI scene, from policy updates to academic breakthroughs. She's particularly focused on bringing nuanced, balanced perspectives to the fast-evolving world of AI-powered tools and media. You can reach her at: [email protected]