TLDR: A new study introduces RealClassEval, a benchmark of real-world class-level code, revealing that LLMs achieve only 25-34% correctness on these tasks, a significant drop from their 84-89% performance on synthetic benchmarks. This gap stems from LLMs struggling with object-oriented semantics, leading to dominant AttributeError and TypeError in real-world scenarios. Docstrings offer minor benefits, while Retrieval-Augmented Generation (RAG) improves performance by 4-7% specifically when documentation is partial, by providing implementation patterns, though it can introduce dependency conflicts. The research emphasizes the need for better real-world benchmarks and enhanced LLM understanding of complex code structures.
Large Language Models (LLMs) have shown impressive capabilities in generating code, especially for individual functions. Tools like GitHub Copilot and Amazon CodeWhisperer are becoming common aids for developers. However, a new study reveals a significant gap between how well these models perform on simplified, artificial benchmarks and their actual performance when faced with the complexities of real-world software projects, particularly at the class level.
A research paper titled “Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation” by Musfiqur Rahman, SayedHassan Khatoonabadi, and Emad Shihab from Concordia University, Canada, introduces a novel benchmark called RealClassEval. The benchmark is derived from actual open-source repositories and features real-world classes categorized into ‘seen’ and ‘unseen’ partitions. This allows for a more realistic evaluation of LLMs’ ability to generalize under practical conditions. Existing benchmarks like CoderEval and ClassEval often focus on function-level tasks or manually crafted class-level problems, which don’t capture real-world interdependencies and project-specific patterns.
The Stark Performance Disparity
The gap is large. LLMs that achieve 84–89% correctness on established synthetic benchmarks manage only 25–34% correctness on real-world class tasks, a drop of 53–62 percentage points. The study also found negligible differences in performance between familiar (seen) and novel (unseen) codebases in the real-world context. This suggests that the models’ struggles stem less from a failure to memorize specific code than from a more fundamental limitation in handling the semantic and object-oriented complexity of real-world software.
The reason for this disparity lies in the nature of the tests. Synthetic benchmarks often rely on simple equality assertions, testing basic logical correctness. Real-world test suites, however, involve intricate type metadata checks, external system dependencies (like ‘numpy’ or ‘MCPContext’), and complex object hierarchies. LLMs, while mastering syntax, struggle with correctly implementing attribute access patterns, maintaining type consistency across methods, and navigating complex object relationships.
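As a rough illustration of that difference (not taken from the benchmark itself), the sketch below contrasts the two testing styles on a small made-up class. The `Inventory` class and both test functions are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Inventory:
    """Hypothetical class, used only to contrast two testing styles."""

    items: dict = field(default_factory=dict)

    def add(self, name: str, qty: int) -> None:
        self.items[name] = self.items.get(name, 0) + qty

    def total(self) -> int:
        return sum(self.items.values())


def synthetic_style_test() -> bool:
    # Synthetic-benchmark style: one equality check on a returned value.
    inv = Inventory()
    inv.add("bolt", 3)
    inv.add("nut", 2)
    return inv.total() == 5


def real_world_style_test() -> bool:
    # Real-world style: also checks that the expected attribute exists
    # and that attribute and return types are what callers rely on.
    inv = Inventory()
    inv.add("bolt", 3)
    return (
        hasattr(inv, "items")
        and isinstance(inv.items, dict)
        and isinstance(inv.total(), int)
        and inv.items == {"bolt": 3}
    )
```

A generated class could pass the first test while failing the second, for example by storing its data under a different attribute name.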
The Role of Documentation and Retrieval
The research also explored the impact of documentation (docstrings) and Retrieval-Augmented Generation (RAG) on LLM performance.
Comprehensive docstrings, which provide detailed descriptions of code functionality, yielded only modest gains of 1–3% in functional accuracy. While some models showed statistically significant, albeit small, improvements in specific conditions, the overall impact was negligible. This indicates that while docstrings can offer some help, they don’t fundamentally alter the types of errors LLMs make, nor do they provide a silver bullet for improving class-level code generation.
Retrieval-Augmented Generation (RAG), which involves supplying LLMs with relevant code examples from a ‘seen’ dataset, proved most effective when documentation was partial. In these scenarios, RAG improved correctness by 4–7%. This supports an “information gap hypothesis”: RAG’s value lies in compensating for missing context. When specifications lack concrete implementation patterns, retrieved examples can fill these gaps. However, RAG’s benefits were minimal when documentation was either complete (as the model already had sufficient information) or entirely absent (where the lack of structure made it hard for the model to effectively use the retrieved examples).
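To make the retrieval step concrete, here is a minimal sketch of the general idea. It is an assumption-laden stand-in, not the paper’s pipeline: it ranks candidate examples by Jaccard token overlap, and the `retrieve` and `build_prompt` helpers and the prompt layout are hypothetical.

```python
import re


def tokens(text: str) -> set:
    # Lowercased identifier-like words; a crude stand-in for a real embedding.
    return set(re.findall(r"[A-Za-z_]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    # Size of the intersection divided by size of the union.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query: str, corpus: list, k: int = 1) -> list:
    # Return the k corpus entries whose tokens overlap most with the query.
    q = tokens(query)
    ranked = sorted(corpus, key=lambda doc: jaccard(q, tokens(doc)), reverse=True)
    return ranked[:k]


def build_prompt(skeleton: str, examples: list) -> str:
    # Hypothetical prompt layout: retrieved examples first, then the target.
    parts = ["# Retrieved examples", *examples, "# Complete this class", skeleton]
    return "\n\n".join(parts)


seen_corpus = [
    "class TtlCache:\n    def get(self, key):\n        return self.store.get(key)",
    "class HttpClient:\n    def send(self, request):\n        return self.session.send(request)",
]
skeleton = "class LruCache:\n    def get(self, key): ...\n    def put(self, key, value): ..."
prompt = build_prompt(skeleton, retrieve(skeleton, seen_corpus, k=1))
```

In this toy setup the cache example is retrieved for the cache skeleton, which is the kind of implementation pattern a partially documented specification would otherwise lack.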
Understanding Error Patterns
A detailed error analysis identified AttributeError, TypeError, and AssertionError as the dominant failure modes, collectively accounting for 84% of all errors. Notably, SyntaxError was completely absent, indicating that the evaluated LLMs reliably produce syntactically valid Python. The remaining challenge is semantic correctness.
The error profiles differed significantly between synthetic and real-world tasks. Synthetic tests predominantly highlighted assertion issues (71.8% of errors), reflecting their focus on logical correctness. Real-world scenarios, however, emphasized type and attribute mismatches (45-49% AttributeError, 22-24% TypeError), underscoring the models’ struggles with object-oriented semantics.
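The two real-world failure modes are easy to reproduce in miniature. In the sketch below, `GeneratedCache` is a hypothetical model output that is syntactically valid but disagrees with its callers on an attribute name and a method signature. The `classify_failure` helper is likewise an illustration, not part of the study’s tooling.

```python
class GeneratedCache:
    """Hypothetical model output: valid Python, but mismatched with callers."""

    def __init__(self, capacity):
        self.max_size = capacity  # callers expect an attribute named `capacity`
        self.store = {}

    def put(self, key, value):  # callers pass an extra `ttl` keyword
        self.store[key] = value


def classify_failure(action) -> str:
    # Run a zero-argument callable and report the exception type, if any.
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return "passed"


cache = GeneratedCache(8)
attr_result = classify_failure(lambda: cache.capacity)
type_result = classify_failure(lambda: cache.put("k", "v", ttl=30))
ok_result = classify_failure(lambda: cache.put("k", "v"))

print(attr_result, type_result, ok_result)
# prints: AttributeError TypeError passed
```

Neither failure involves faulty logic inside a method. Both come from the class disagreeing with the rest of the project about its interface.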
RAG’s impact on errors revealed an interesting “error substitution” mechanism. While it reduced logical flaws and object access errors (like AttributeError and AssertionError), it sometimes introduced new dependency-related failures, such as ImportError and KeyError. This happens when models blindly copy dependencies or data structures from retrieved examples without verifying their compatibility with the target class.
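One way to guard against copied dependencies is to check a retrieved example’s imports before using it. The sketch below is an assumption about how such a filter might look, not something the study implements: it uses the standard-library `ast` and `importlib.util.find_spec`, and the function names and the sample snippet are hypothetical.

```python
import ast
import importlib.util


def top_level_imports(source: str) -> set:
    # Collect the top-level package names imported anywhere in the source.
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports, which are project-internal.
            names.add(node.module.split(".")[0])
    return names


def missing_dependencies(source: str) -> set:
    # Packages the example imports that cannot be found in this environment.
    return {
        name
        for name in top_level_imports(source)
        if importlib.util.find_spec(name) is None
    }


# Hypothetical retrieved example that depends on a package we do not have.
retrieved_example = (
    "import json\n"
    "from os import path\n"
    "import internal_pkg_that_does_not_exist\n"
)
missing = missing_dependencies(retrieved_example)
```

A retrieval pipeline could drop or down-rank examples for which `missing` is non-empty. This check only covers ImportError-style conflicts, not mismatched data structures such as the KeyError cases.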
Implications for the Future of Code Generation
This study provides crucial insights for both practitioners and researchers. It highlights that current synthetic benchmarks offer a misleadingly optimistic view of LLM capabilities for complex code generation. Organizations deploying LLM-based tools should set realistic expectations, anticipating much lower success rates for real-world class-level tasks, and ensure mandatory human review and testing.
For researchers, the findings point to the need for new benchmarks that accurately reflect real-world complexities, including realistic object structures, dependencies, and testing practices. Future research should focus on enhancing LLMs’ understanding of object-oriented semantics, developing type-aware generation architectures, and creating context-aware retrieval systems that can filter incompatible dependencies. The goal is to move beyond mere syntactic correctness to achieve true semantic understanding and functional accuracy in complex software development.


