TLDR: MultiAIGCD is a new, extensive dataset for detecting AI-generated code. It pairs over 121,000 AI-generated code snippets in Python, Java, and Go, produced by six different large language models across three usage scenarios (generating code from scratch, fixing runtime errors, and correcting incorrect outputs) and three prompting strategies, with roughly 32,000 human-written snippets. The paper also benchmarks existing detection models, showing that while they reliably identify code generated from problem descriptions, their accuracy drops on AI-fixed code and varies significantly across programming languages.
The rapid advancement of large language models (LLMs) has transformed how code is generated, offering significant boosts in productivity for developers and students alike. However, this convenience comes with challenges, including concerns about academic integrity, potential security vulnerabilities, and the degradation of coding skills. To address these issues, the development of robust systems capable of detecting AI-generated code has become crucial.
A new research paper introduces MultiAIGCD, a comprehensive dataset specifically designed for AI-generated code detection. This dataset is a significant step forward in standardizing the evaluation of models that aim to distinguish between human-written and AI-generated code. The researchers, Basak Demirok, Mucahid Kutlu, and Selin Mergen, have meticulously compiled a resource that covers multiple programming languages, various LLMs, different prompting strategies, and diverse usage scenarios.
What Makes MultiAIGCD Unique?
MultiAIGCD stands out due to its extensive coverage. It includes code snippets in three popular programming languages: Python, Java, and Go. The dataset incorporates code generated by six state-of-the-art LLMs: Llama-3.3-70B-Instruct-Turbo, Qwen2.5-Coder-32B-Instruct, GPT-4o, OpenAI o3-mini, Claude 3.5 Sonnet v2, and DeepSeek-V3. This wide array of models ensures that the dataset reflects the diverse outputs of current AI code generators.
Beyond just generating code from problem descriptions, MultiAIGCD explores three critical usage scenarios: generating code from scratch, fixing runtime errors in human-written code, and correcting incorrect outputs. This focus on error correction scenarios is particularly important, as it represents a less explored but highly relevant application of LLMs in coding. Furthermore, the dataset incorporates three distinct prompting strategies (Lazy, Role, and Rephrase & Respond) to generate varied code samples, reflecting different ways users might interact with LLMs.
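To make these strategies concrete, here is a minimal Python sketch of what the three prompt styles might look like. The exact prompt wording used in the paper is not reproduced here; the templates, the PROMPT_TEMPLATES dict, and the build_prompt helper are all illustrative assumptions.

```python
# Illustrative prompt templates for the three strategies named in the paper.
# These are hypothetical sketches, not the authors' actual prompts.

PROMPT_TEMPLATES = {
    # "Lazy": a minimal prompt that just hands over the task.
    "lazy": "Solve the following problem in {language}:\n\n{problem}",

    # "Role": the model is first assigned an expert persona.
    "role": (
        "You are an expert {language} developer. "
        "Write a correct, efficient solution to the following problem:\n\n{problem}"
    ),

    # "Rephrase & Respond": the model restates the problem in its own
    # words before solving it.
    "rephrase_and_respond": (
        "Rephrase and expand the following problem statement, "
        "then solve it in {language}:\n\n{problem}"
    ),
}

def build_prompt(strategy: str, language: str, problem: str) -> str:
    """Fill in one of the hypothetical templates above."""
    return PROMPT_TEMPLATES[strategy].format(language=language, problem=problem)

if __name__ == "__main__":
    print(build_prompt("role", "Python", "Given n, print the n-th Fibonacci number."))
```

Varying the prompt style like this matters for detection research: the same model can produce noticeably different code depending on how it is asked.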
The dataset is substantial, comprising 121,271 AI-generated code snippets and 32,148 human-written code snippets. The human-authored code samples were sourced from IBM’s CodeNet dataset, ensuring they predate the widespread use of modern code-assistant LLMs, thus providing a clear baseline for human coding styles.
Key Findings from Benchmarking
The researchers also benchmarked three state-of-the-art AI-generated code detection models using MultiAIGCD: SVMAda (an SVM over OpenAI’s text-embedding-ada-002 embeddings), SVMT5+ (an SVM over Salesforce’s CodeT5+ embeddings), and CodeBERTa. Their experiments yielded several important observations (a sketch of the embedding-plus-SVM setup follows the list):
- Detection accuracy for code generated from problem definitions was generally high across models.
- However, accuracy significantly decreased when models attempted to identify AI-fixed code samples (i.e., code where LLMs were used to correct runtime errors or incorrect outputs). This suggests that fixing existing code might make AI-generated code harder to distinguish from human code.
- Performance varied considerably across programming languages, highlighting the need for language-specific training.
- While detection accuracy was not significantly impaired in cross-model scenarios (where the detection model was trained on LLMs different from the one generating the test code), there was a substantial decline in performance in cross-language setups. This indicates that models struggle to generalize their detection capabilities to languages not seen during training.
- Qualitative analysis revealed differences in coding styles: human-written code tends to be longer, more creative, and often includes more blank lines and comments. Interestingly, some LLMs like Llama and Claude generated code that more closely resembled human-authored code, making it more challenging for detectors.
- OpenAI’s o3-mini, despite sometimes failing to produce a response, demonstrated high accuracy when it did generate code, suggesting strong reasoning capabilities.
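To ground the SVMAda result, here is a minimal sketch of an embedding-plus-SVM detector, assuming the OpenAI Python client (v1+) and scikit-learn. The toy snippets, labels, and SVM hyperparameters are illustrative assumptions; the paper’s actual training configuration is not reproduced here.

```python
# Minimal sketch of an SVMAda-style detector: embed each snippet with
# OpenAI's text-embedding-ada-002, then fit a binary SVM on the vectors.
# The kernel, regularization, and toy data below are assumptions,
# not the paper's reported configuration.
from openai import OpenAI    # pip install openai  (client >= 1.0)
from sklearn.svm import SVC  # pip install scikit-learn

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(snippets):
    """Embed a batch of code snippets with text-embedding-ada-002."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002", input=snippets
    )
    return [item.embedding for item in resp.data]

# Toy stand-ins for MultiAIGCD records; label 1 = AI-generated, 0 = human.
train_codes = [
    "def add(a,b):\n  #quick helper\n  return a+b",
    "x=int(input())\nprint(x*x)",
    'def add(a: int, b: int) -> int:\n    """Return the sum."""\n    return a + b',
    "def square(n: int) -> int:\n    return n * n",
]
train_labels = [0, 0, 1, 1]

clf = SVC(kernel="rbf", C=1.0)  # assumed hyperparameters
clf.fit(embed(train_codes), train_labels)

candidate = "def mul(a: int, b: int) -> int:\n    return a * b"
label = clf.predict(embed([candidate]))[0]
print("AI-generated" if label == 1 else "human-written")
```

Swapping the embed step for CodeT5+ representations would give an SVMT5+-style variant, and the cross-language findings above suggest such detectors are best trained separately for each programming language.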
This study underscores the critical need for comprehensive benchmark datasets like MultiAIGCD to advance research in AI-generated code detection. The dataset and associated code will be shared to support further exploration in this vital field. You can find more details about this research in the paper itself: MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios.
Future work plans include expanding the dataset to cover even more programming languages, LLMs, and usage scenarios, such as “blended codes” where LLMs generate only a portion of the code. The researchers also aim to conduct user studies to understand how students and software developers realistically use LLMs for code generation, further informing the development of more effective detection methods.


