TLDR: MultiAIGCD is a new, extensive dataset for detecting AI-generated code. It pairs over 121,000 AI-generated code snippets in Python, Java, and Go, produced by six different large language models across three usage scenarios (generating code from scratch, fixing runtime errors, and correcting incorrect outputs) and three prompting strategies, with roughly 32,000 human-written snippets. The paper also benchmarks existing detection models, showing that while they reliably identify code generated from problem descriptions, their accuracy drops on AI-fixed code and varies significantly across programming languages.
The rapid advancement of large language models (LLMs) has transformed how code is generated, offering significant boosts in productivity for developers and students alike. However, this convenience comes with challenges, including concerns about academic integrity, potential security vulnerabilities, and the degradation of coding skills. To address these issues, the development of robust systems capable of detecting AI-generated code has become crucial.
A new research paper introduces MultiAIGCD, a comprehensive dataset specifically designed for AI-generated code detection. This dataset is a significant step forward in standardizing the evaluation of models that aim to distinguish between human-written and AI-generated code. The researchers, Basak Demirok, Mucahid Kutlu, and Selin Mergen, have meticulously compiled a resource that covers multiple programming languages, various LLMs, different prompting strategies, and diverse usage scenarios.
What Makes MultiAIGCD Unique?
MultiAIGCD stands out due to its extensive coverage. It includes code snippets in three popular programming languages: Python, Java, and Go. The dataset incorporates code generated by six state-of-the-art LLMs: Llama-3.3-70B-Instruct-Turbo, Qwen2.5-Coder-32B-Instruct, GPT-4o, OpenAI o3-mini, Claude 3.5 Sonnet v2, and DeepSeek-V3. This wide array of models ensures that the dataset reflects the diverse outputs of current AI code generators.
Beyond just generating code from problem descriptions, MultiAIGCD explores three critical usage scenarios: generating code from scratch, fixing runtime errors in human-written code, and correcting incorrect outputs. This focus on error correction scenarios is particularly important, as it represents a less explored but highly relevant application of LLMs in coding. Furthermore, the dataset incorporates three distinct prompting strategies (Lazy, Role, and Rephrase & Respond) to generate varied code samples, reflecting different ways users might interact with LLMs.
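To make these strategies concrete, here is a minimal Python sketch of what the three prompt styles might look like. The exact prompt wording used in the paper is not reproduced here; the templates, the PROMPT_TEMPLATES dict, and the build_prompt helper are all illustrative assumptions.

```python
# Illustrative prompt templates for the three strategies named in the paper.
# These are hypothetical sketches, not the authors' actual prompts.

PROMPT_TEMPLATES = {
    # "Lazy": a minimal prompt that just hands over the task.
    "lazy": "Solve the following problem in {language}:\n\n{problem}",

    # "Role": the model is first assigned an expert persona.
    "role": (
        "You are an expert {language} developer. "
        "Write a correct, efficient solution to the following problem:\n\n{problem}"
    ),

    # "Rephrase & Respond": the model restates the problem in its own
    # words before solving it.
    "rephrase_and_respond": (
        "Rephrase and expand the following problem statement, "
        "then solve it in {language}:\n\n{problem}"
    ),
}

def build_prompt(strategy: str, language: str, problem: str) -> str:
    """Fill in one of the hypothetical templates above."""
    return PROMPT_TEMPLATES[strategy].format(language=language, problem=problem)

if __name__ == "__main__":
    print(build_prompt("role", "Python", "Given n, print the n-th Fibonacci number."))
```

Varying the prompt style like this matters for detection research: the same model can produce noticeably different code depending on how it is asked.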
The dataset is substantial, comprising 121,271 AI-generated code snippets and 32,148 human-written code snippets. The human-authored code samples were sourced from IBM’s CodeNet dataset, ensuring they predate the widespread use of modern code-assistant LLMs, thus providing a clear baseline for human coding styles.
Key Findings from Benchmarking
The researchers also benchmarked three state-of-the-art AI-generated code detection models using MultiAIGCD: SVMAda (an SVM over OpenAI’s text-embedding-ada-002 embeddings), SVMT5+ (an SVM over Salesforce’s CodeT5+ embeddings), and CodeBERTa. Their experiments yielded several important observations (a sketch of the embedding-plus-SVM setup follows the list):
- Detection accuracy for code generated from problem definitions was generally high across models.
- However, accuracy significantly decreased when models attempted to identify AI-fixed code samples (i.e., code where LLMs were used to correct runtime errors or incorrect outputs). This suggests that fixing existing code might make AI-generated code harder to distinguish from human code.
- Performance varied considerably across programming languages, highlighting the need for language-specific training.
- While detection accuracy was not significantly impaired in cross-model scenarios (where the detection model was trained on LLMs different from the one generating the test code), there was a substantial decline in performance in cross-language setups. This indicates that models struggle to generalize their detection capabilities to languages not seen during training.
- Qualitative analysis revealed differences in coding styles: human-written code tends to be longer, more creative, and often includes more blank lines and comments. Interestingly, some LLMs like Llama and Claude generated code that more closely resembled human-authored code, making it more challenging for detectors.
- OpenAI’s o3-mini, despite sometimes failing to produce a response, demonstrated high accuracy when it did generate code, suggesting strong reasoning capabilities.
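To ground the SVMAda result, here is a minimal sketch of an embedding-plus-SVM detector, assuming the OpenAI Python client (v1+) and scikit-learn. The toy snippets, labels, and SVM hyperparameters are illustrative assumptions; the paper’s actual training configuration is not reproduced here.

```python
# Minimal sketch of an SVMAda-style detector: embed each snippet with
# OpenAI's text-embedding-ada-002, then fit a binary SVM on the vectors.
# The kernel, regularization, and toy data below are assumptions,
# not the paper's reported configuration.
from openai import OpenAI    # pip install openai  (client >= 1.0)
from sklearn.svm import SVC  # pip install scikit-learn

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(snippets):
    """Embed a batch of code snippets with text-embedding-ada-002."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002", input=snippets
    )
    return [item.embedding for item in resp.data]

# Toy stand-ins for MultiAIGCD records; label 1 = AI-generated, 0 = human.
train_codes = [
    "def add(a,b):\n  #quick helper\n  return a+b",
    "x=int(input())\nprint(x*x)",
    'def add(a: int, b: int) -> int:\n    """Return the sum."""\n    return a + b',
    "def square(n: int) -> int:\n    return n * n",
]
train_labels = [0, 0, 1, 1]

clf = SVC(kernel="rbf", C=1.0)  # assumed hyperparameters
clf.fit(embed(train_codes), train_labels)

candidate = "def mul(a: int, b: int) -> int:\n    return a * b"
label = clf.predict(embed([candidate]))[0]
print("AI-generated" if label == 1 else "human-written")
```

Swapping the embed step for CodeT5+ representations would give an SVMT5+-style variant, and the cross-language findings above suggest such detectors are best trained separately for each programming language.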
This study underscores the critical need for comprehensive benchmark datasets like MultiAIGCD to advance research in AI-generated code detection. The dataset and associated code will be shared to support further exploration in this vital field. You can find more details about this research in the paper itself: MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios.
Future work plans include expanding the dataset to cover even more programming languages, LLMs, and usage scenarios, such as “blended codes” where LLMs generate only a portion of the code. The researchers also aim to conduct user studies to understand how students and software developers realistically use LLMs for code generation, further informing the development of more effective detection methods.


