BIGCODEARENA: Elevating Code Generation Evaluation Through Execution

TLDR: BIGCODEARENA is an open platform for evaluating large language models (LLMs) in code generation by allowing human users to interact with and execute the generated code. This execution-based approach provides more reliable human preferences than static code review. The platform has collected over 14,000 conversation sessions across various LLMs, languages, and environments. It also introduces two new benchmarks, BIGCODEREWARD and AUTOCODEARENA, to systematically assess code understanding and generation, revealing that execution feedback significantly improves evaluation accuracy and that proprietary models like GPT-5 currently lead in performance.

Evaluating the quality of code generated by Large Language Models (LLMs) is hard. Reading through code line by line is mentally taxing and often requires specialized expertise. More importantly, empirical studies have shown that humans frequently misjudge code correctness when they do not actually run it. This challenge is at the heart of a new open human evaluation platform called BIGCODEARENA.

BIGCODEARENA, detailed in a recent technical report, addresses this critical need by enabling real-time evaluation of LLM-generated code through execution. Built upon the foundation of platforms like Chatbot Arena, BIGCODEARENA introduces a comprehensive, on-the-fly execution environment. This allows human evaluators to not only see the code but also interact with its execution process and outcomes, providing a far more reliable assessment of quality.

The motivation behind BIGCODEARENA is clear: static code inspection often fails to reveal the true quality and usefulness of generated programs. For instance, two models might produce code that looks syntactically correct, but only by executing them can a user discern which one delivers a functional and high-quality output, such as a visually appealing website frontend.

How BIGCODEARENA Works

The platform adopts a head-to-head evaluation setup. When a user provides a prompt, BIGCODEARENA presents two anonymized responses from different LLMs. Crucially, it also displays the execution results of the extracted code snippets. These results can be interactive applications, web pages, or static outputs like text and images. Users can test the execution, explore program behavior, and even edit the code to assess its correctness and robustness. This shifts the evaluation from subjective inspection to functionality-driven assessment.

The system comprises a lightweight web-based frontend, built with Gradio, and a secure, modular backend powered by E2B. The frontend offers syntax-highlighted code display, editing capabilities, and execution result rendering. The backend handles dependency resolution, installs necessary packages, and executes code in isolated sandboxed environments, returning the results to the user interface. To ensure fair comparisons, both model outputs are displayed simultaneously only after both models have completed generation and execution, preventing bias from response latency.
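The backend flow described above (extract the code, execute it in isolation, and reveal both results together) can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: a local subprocess with a timeout stands in for the E2B sandbox, dependency installation is omitted, and the names `extract_code`, `run_isolated`, and `run_pair` are hypothetical.

```python
import asyncio
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Build the fence marker programmatically so this sketch can itself live in a fenced block.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:python)?\n(.*?)" + TICKS, re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block in a model response, or '' if none."""
    match = FENCE.search(response)
    return match.group(1) if match else ""


def run_isolated(code: str, timeout: float = 10.0) -> dict:
    """Run code in a throwaway directory with a time limit.

    Assumption: a plain subprocess is only a stand-in for a real sandbox
    and offers no security isolation.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "main.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return {"stdout": "", "stderr": "timed out", "returncode": None}
    return {"stdout": proc.stdout, "stderr": proc.stderr, "returncode": proc.returncode}


async def run_pair(response_a: str, response_b: str) -> tuple:
    """Execute both responses concurrently and return only when both are done,
    so neither result is shown earlier than the other."""
    results = await asyncio.gather(
        asyncio.to_thread(run_isolated, extract_code(response_a)),
        asyncio.to_thread(run_isolated, extract_code(response_b)),
    )
    return results[0], results[1]
```

Waiting on both executions before returning mirrors the fairness rule in the text: a faster model gains no advantage from being displayed first.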

Extensive Data Collection

BIGCODEARENA has been deployed for over five months, collecting more than 14,000 crowdsourced conversation sessions. These sessions involve 10 widely used LLMs, span 10 programming languages (including Python, JavaScript, and Java), and utilize 8 different execution environments (such as React, PyGame, and Streamlit). The collected data covers diverse usage scenarios, including Web Design, Game Development, Diagram Creation, Creative Coding, Scientific Computing, and Problem Solving. A subset of over 4,700 high-quality multi-turn conversations with human preferences has been identified for further analysis.

Model Performance and Benchmarks

The platform uses an Elo rating system to rank models based on user preferences. Analysis of the collected data shows that proprietary LLMs like o3-mini and o1-mini consistently lead in performance across various programming topics, languages, and execution environments. Claude-3.5-Sonnet also performs strongly, particularly in language-matched settings. While open-source models are making progress, a performance gap still exists between them and the leading proprietary models.
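To make the ranking step concrete, here is a small sketch of a standard online Elo update driven by pairwise votes. The starting rating of 1000, the K-factor of 32, and the model names are illustrative assumptions; the article does not specify how the platform fits its ratings.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a: str, model_b: str, outcome: float, k: float = 32.0) -> None:
    """Update ratings in place after one vote.

    outcome is 1.0 if the user preferred model_a, 0.0 if model_b, 0.5 for a tie.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - e_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - e_a))


# Hypothetical votes: (model shown as A, model shown as B, outcome for A).
ratings = defaultdict(lambda: 1000.0)  # assumed starting rating
votes = [
    ("model-x", "model-y", 1.0),
    ("model-y", "model-z", 0.5),
    ("model-x", "model-z", 1.0),
]
for a, b, outcome in votes:
    update_elo(ratings, a, b, outcome)

# Print the resulting leaderboard, highest rating first.
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Each update is zero-sum, so the total rating across models stays constant. An upset win over a higher-rated model moves the ratings more than an expected win does.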

To further advance the systematic evaluation of code generation, BIGCODEARENA introduces two new benchmarks:

BIGCODEREWARD: This benchmark measures how well reward models align with human judgments in code evaluation, especially when execution results are provided. It highlights that execution feedback generally improves the accuracy of reward models, with proprietary models achieving the highest scores.
AUTOCODEARENA: Designed to automate crowdsourced evaluation, this benchmark leverages strong LLMs to approximate human preferences by comparing model outputs against a baseline. It uses a local Docker-based execution system for efficiency. Initial results indicate that GPT-5 currently sets a new state-of-the-art in code generation quality, with Claude-Opus-4 and Claude-Sonnet-4 also performing strongly.
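The quantities behind these two benchmarks can be illustrated with a short sketch: agreement between a judge's preference labels and human labels (in the spirit of BIGCODEREWARD) and a win rate against a fixed baseline (in the spirit of AUTOCODEARENA). The function names, label vocabulary, tie handling, and all data below are toy assumptions for illustration, not the benchmarks' actual metrics or results.

```python
def preference_accuracy(judge_labels, human_labels) -> float:
    """Fraction of comparisons where the judge picks the same side as the human."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one comparison")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def win_rate_vs_baseline(verdicts) -> float:
    """Average score of a candidate model against a baseline.

    Assumption: a tie counts as half a win.
    """
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(score[v] for v in verdicts) / len(verdicts)


# Toy data only: human preferences and one judge's labels with and
# without access to execution results.
human = ["A", "B", "tie", "A"]
judge_code_only = ["A", "A", "tie", "B"]
judge_with_execution = ["A", "B", "tie", "B"]

print(preference_accuracy(judge_code_only, human))       # 0.5
print(preference_accuracy(judge_with_execution, human))  # 0.75
print(win_rate_vs_baseline(["win", "win", "tie", "loss"]))  # 0.625
```

Comparing the same judge with and without execution output on identical comparisons is what isolates the effect of execution feedback.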

BIGCODEARENA represents a significant step towards more transparent and reliable evaluation of LLM-generated code. By integrating real-time execution and interactive testing, it provides a robust foundation for understanding and improving the capabilities of these advanced AI systems. The platform and its associated benchmarks are open-source, encouraging community contributions and further research into this rapidly evolving field. Full details are available in the research paper.
