Assessing AI's Grasp of Software Design: A Study on LLMs and SOLID Principles

TLDR: A study evaluated leading LLMs (GPT-4o Mini, Qwen2.5 Coder, CodeLlama, DeepSeekCoder) on detecting SOLID design principle violations across Python, Java, C#, and Kotlin. Using a new 240-code-snippet dataset and four prompt strategies, researchers found GPT-4o Mini significantly outperformed others. Prompt strategy dramatically impacts accuracy, with no single best approach. Detection accuracy degrades sharply with code complexity and is higher in statically-typed languages (C#, Java) than dynamically-typed (Python). The findings highlight that effective AI-driven design analysis requires matching the right model and prompt to the specific context, emphasizing LLMs’ potential for AI-assisted code analysis and maintainability.

In software development, creating high-quality, maintainable, and extensible code is a constant challenge. The SOLID principles, namely Single Responsibility (SRP), Open/Closed (OCP), Liskov Substitution (LSP), Interface Segregation (ISP), and Dependency Inversion (DIP), provide a robust foundation for good design. However, violations of these principles are common and can significantly degrade code quality over time.
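To make the idea of a violation concrete, here is a minimal Python sketch of an SRP violation next to one possible refactoring. The class names (`InvoiceViolating`, `Invoice`, `InvoiceTextFormatter`) are hypothetical and are not taken from the study's dataset.

```python
class InvoiceViolating:
    """Mixes two responsibilities: computing totals and formatting output."""

    def __init__(self, amounts):
        self.amounts = list(amounts)

    def total(self):
        return sum(self.amounts)

    def to_text(self):
        # Presentation logic lives in the same class as the business logic,
        # so a change to either concern forces a change to this class.
        return f"Total: {self.total():.2f}"


class Invoice:
    """Only responsible for the invoice data and its total."""

    def __init__(self, amounts):
        self.amounts = list(amounts)

    def total(self):
        return sum(self.amounts)


class InvoiceTextFormatter:
    """Only responsible for rendering an invoice as text."""

    def render(self, invoice):
        return f"Total: {invoice.total():.2f}"
```

Both versions produce the same text. The difference is purely structural, which is why detecting such violations calls for reasoning about design rather than behavior.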

Traditional methods for detecting these design flaws, such as static analysis, often fall short because they struggle with the semantic understanding required to identify violations across all five SOLID principles in various programming languages. With Large Language Models (LLMs) increasingly integrated into developer workflows, a crucial question arises: do these powerful AI models truly understand the principles of good software design, or do they merely generate functional but architecturally flawed code?

A recent research paper titled “Are We SOLID Yet? An Empirical Study on Prompting LLMs to Detect Design Principle Violations” by Fatih Pehlivan, Arçin Ülkü Ergüzen, Sahand Moslemi Yengejeh, Mayasah Lami, and Anil Koyuncu, addresses this critical gap. The study introduces a novel methodology that leverages tailored prompt engineering to evaluate LLMs on their ability to detect SOLID violations across multiple programming languages.

A New Benchmark for LLM Design Awareness

To conduct their evaluation, the researchers constructed a new benchmark dataset comprising 240 manually validated code examples. These examples cover all five SOLID principles across four popular languages: Python, Java, C#, and Kotlin. Each scenario was implemented at three difficulty levels—easy, moderate, and hard—to assess how code complexity impacts detection accuracy.
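The shape of the benchmark can be sketched as a grid of principle, language, and difficulty level. The snippet below is only an illustration. The even split across cells is an assumption, since the article reports the total of 240 but not how the examples are distributed.

```python
from itertools import product

PRINCIPLES = ["SRP", "OCP", "LSP", "ISP", "DIP"]
LANGUAGES = ["Python", "Java", "C#", "Kotlin"]
LEVELS = ["easy", "moderate", "hard"]
TOTAL_EXAMPLES = 240  # dataset size reported in the article

# Every (principle, language, level) combination is one cell of the grid.
cells = list(product(PRINCIPLES, LANGUAGES, LEVELS))


def examples_per_cell(total=TOTAL_EXAMPLES, n_cells=len(cells)):
    """Examples per cell, ASSUMING an even split (not stated in the article)."""
    if total % n_cells != 0:
        raise ValueError("total does not divide evenly across the grid")
    return total // n_cells
```

Under that assumption the grid has 5 × 4 × 3 = 60 cells, which works out to 4 examples per cell.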

The study benchmarked four leading LLMs: CodeLlama:70B, DeepSeekCoder:33B, Qwen2.5 Coder:32B, and GPT-4o Mini. To systematically measure the impact of interaction methods, they tested four distinct prompt strategies inspired by established techniques like zero-shot, few-shot, and chain-of-thought. These strategies included a direct DEFAULT prompt, an EXAMPLE prompt providing hints, a two-step SMELL prompt, and an ENSEMBLE strategy that asks the model to score and justify multiple principles.
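As a rough illustration of how the four strategies differ, the sketch below builds the prompts for each one. All prompt wording, the `build_prompts` helper, and the 0 to 5 score range are hypothetical; the paper's actual prompts are not reproduced here. The SMELL branch assumes the first step asks about code smells, which the name suggests but the article does not spell out.

```python
SOLID = ["SRP", "OCP", "LSP", "ISP", "DIP"]


def build_prompts(code, strategy, hints=None):
    """Return the prompts to send, in order, for one strategy (hypothetical wording)."""
    principles = ", ".join(SOLID)
    question = f"Which SOLID principle ({principles}) does this code violate?"

    if strategy == "DEFAULT":
        # Direct question, no extra guidance.
        return [f"{question}\n\n{code}"]
    if strategy == "EXAMPLE":
        # Same question, preceded by hints about typical violations.
        hint_text = "\n".join(hints or [])
        return [f"Hints on typical violations:\n{hint_text}\n\n{question}\n\n{code}"]
    if strategy == "SMELL":
        # Two steps: the caller fills {smells} with the model's first answer.
        return [
            f"List the design problems (code smells) in this code.\n\n{code}",
            "Given these smells:\n{smells}\n\n" + question,
        ]
    if strategy == "ENSEMBLE":
        # Score and justify every principle before choosing one.
        return [
            f"For each of {principles}, give a violation score from 0 to 5 "
            f"and a short justification. Then name the most violated principle.\n\n{code}"
        ]
    raise ValueError(f"unknown strategy: {strategy}")
```

Only SMELL returns two prompts, which reflects the extra round trip that distinguishes it from the other three strategies.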

Key Findings: GPT-4o Mini Leads, Prompts Matter, Complexity Hurts

The results reveal a clear hierarchy among the models. GPT-4o Mini decisively outperformed the others, demonstrating superior performance across most principles, particularly SRP, OCP, and ISP. Qwen2.5 Coder:32B came a distant second, showing competence in SRP and OCP but struggling with more nuanced principles like DIP. CodeLlama:70B and DeepSeekCoder:33B generally performed poorly, especially on challenging principles like DIP, where three out of four models were largely ineffective.

Crucially, the research highlights that prompt strategy has a dramatic impact on detection accuracy, though no single strategy is universally best. For instance, the deliberative ENSEMBLE prompt excelled at OCP detection, while a hint-based EXAMPLE prompt proved superior for DIP violations. The baseline DEFAULT strategy was effective for SRP and ISP, suggesting that for principles with clear structural patterns, a direct approach is sufficient. In contrast, the SMELL strategy consistently underperformed, indicating that indirect, two-step reasoning was ineffective for this task.
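One way to act on this finding is a simple lookup from principle to the strategy the article singles out for it. This is a sketch, not the authors' tooling. The `pick_strategy` helper is hypothetical, and the fallback for LSP is an assumption because the article does not name a preferred strategy for that principle.

```python
# Strategies the article highlights per principle.
REPORTED_STRATEGY = {
    "SRP": "DEFAULT",
    "ISP": "DEFAULT",
    "OCP": "ENSEMBLE",
    "DIP": "EXAMPLE",
}


def pick_strategy(principle, fallback="DEFAULT"):
    """Return the highlighted strategy, or a fallback (assumed) when none is reported."""
    return REPORTED_STRATEGY.get(principle.upper(), fallback)
```

SMELL never appears in the table, in line with its consistent underperformance.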

The study also found that detection accuracy is heavily dependent on programming language characteristics. Statically typed languages like C# and Java generally yielded higher accuracy, especially for simpler principles, likely because their explicit type declarations and formal class structures give LLMs clearer signals. Python, with its dynamic typing and syntactic flexibility, consistently presented the greatest challenge, resulting in the lowest accuracy for most principles.

A cross-cutting finding was the significant impact of code complexity. Across all models, prompts, and languages, increasing code complexity sharply degraded detection performance. Principles like OCP, LSP, and DIP saw accuracy plummet as samples moved from easy to hard, underscoring the difficulty LLMs face in untangling design violations from general code complexity.
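A breakdown of this kind can be computed by grouping predictions by difficulty level. The sketch below assumes plain accuracy over `(difficulty, expected, predicted)` records. The record layout and the function name are hypothetical, and the paper may use additional metrics.

```python
from collections import defaultdict


def accuracy_by_difficulty(records):
    """Map each difficulty level to the share of correct predictions.

    records: iterable of (difficulty, expected_principle, predicted_principle).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for difficulty, expected, predicted in records:
        totals[difficulty] += 1
        if expected == predicted:
            hits[difficulty] += 1
    # Only levels that actually occur in the input appear in the result.
    return {level: hits[level] / totals[level] for level in totals}
```

Comparing the values for easy, moderate, and hard samples then shows how steeply a given model and prompt degrade as complexity grows.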


Why Models Struggle and What’s Next

The researchers identified several reasons for model failures, including principle ambiguity (especially for DIP and LSP, which require abstract reasoning), the increased cognitive load of two-step prompting strategies like SMELL, and frequent non-adherence to requested output formats. The decline in performance with complexity suggests that incidental code complexity can obscure the design-relevant signals LLMs are meant to detect.
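Format non-adherence matters in practice because an evaluation harness has to turn free-form model output into a single label. The following is a minimal sketch of a tolerant parser, assuming responses mention principles by their abbreviations. The function name and the rule of rejecting ambiguous answers are illustrative choices, not the study's procedure.

```python
import re

# Matches any of the five abbreviations as a whole word, in any letter case.
_LABEL_RE = re.compile(r"\b(SRP|OCP|LSP|ISP|DIP)\b", re.IGNORECASE)


def extract_principle(response):
    """Return the single principle named in a response, or None.

    None covers both cases of non-adherence: no label at all, or several
    different labels with no way to tell which one the model meant.
    """
    found = {match.group(1).upper() for match in _LABEL_RE.finditer(response)}
    return found.pop() if len(found) == 1 else None
```

Treating ambiguous answers as failures is deliberately strict. A more lenient harness could take the first or last label mentioned, at the cost of crediting lucky guesses.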

This study provides foundational insights into LLM capabilities and limitations in understanding software design principles. The authors emphasize that effective, AI-driven design analysis requires a tailored approach that matches the right model and prompt to the specific design context. This work has direct implications for assessing AI coding assistants, as an LLM’s “design awareness” is a crucial proxy for its ability to generate maintainable and extensible code.

Future work aims to move from mere violation detection to automated refactoring, tasking LLMs with generating corrections for identified violations. This would involve rigorous assessment of LLM-generated solutions through expert review and automated test cases. Further research will also expand the benchmark to include more models, real-world industrial code, and a wider range of design patterns.
