TLDR: LoCoBench is a new, comprehensive benchmark designed by Salesforce AI Research to evaluate long-context Large Language Models (LLMs) in complex software engineering tasks. It features 8,000 scenarios across 10 languages and 36 domains, with context lengths up to 1 million tokens. The benchmark assesses LLMs on 8 key software development tasks using 17 metrics, including new ones for architectural coherence and multi-session memory. Initial evaluations reveal significant performance gaps among leading models, with Gemini-2.5-Pro, GPT-5, and Claude-Sonnet-4 showing distinct strengths, and overall performance degrading with increased context length and task difficulty, particularly in systems programming languages.
Large Language Models (LLMs) are advancing rapidly, and some can now process inputs that extend to millions of tokens. This expanded ‘context window’ opens up new possibilities for complex tasks, especially in software engineering. However, a new research paper highlights a significant gap: existing evaluation methods don’t adequately test these LLMs’ capabilities in realistic, intricate software development scenarios that demand understanding entire codebases and reasoning across multiple files.
To address this, researchers from Salesforce AI Research have introduced LoCoBench, a benchmark designed specifically to evaluate long-context LLMs in complex software engineering. It moves beyond simple code completion and short-context tasks, focusing on the multi-file reasoning required for large-scale software systems.
What LoCoBench Offers
LoCoBench is a comprehensive evaluation framework built through a systematic five-phase pipeline. It generates an unprecedented scale of evaluation scenarios, ensuring a thorough assessment of LLMs:
- Vast Scenarios: It features 8,000 evaluation scenarios, systematically generated across 10 programming languages and 36 diverse domain categories.
- Extreme Context Lengths: Scenarios range from 10,000 to 1 million tokens, a 100-fold variation that allows for precise measurement of how performance changes with increasing context.
- Eight Key Task Categories: LoCoBench evaluates LLMs on critical software development tasks, including Architectural Understanding, Cross-File Refactoring, Feature Implementation, Bug Investigation, Multi-Session Development, Code Comprehension, Integration Testing, and Security Analysis. These tasks require deep understanding and reasoning across multiple files and architectural layers.
- Comprehensive Metrics: The benchmark introduces a robust evaluation framework with 17 metrics across four dimensions: Software Engineering Excellence, Functional Correctness, Code Quality Assessment, and Long-Context Utilization. Notably, it includes six new metrics specifically designed for long-context capabilities, such as the Architectural Coherence Score (ACS), Dependency Traversal Accuracy (DTA), and Multi-Session Memory Retention (MMR).
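To make the shape of a scenario concrete, here is a minimal Python sketch of how one might be represented. It is an illustration only: the `Scenario` class, its field names, and the validation rule are assumptions made for this article, not the benchmark's actual data format. Only the eight task category names and the 10,000 to 1 million token range come from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class TaskCategory(Enum):
    """The eight task categories named in the article."""
    ARCHITECTURAL_UNDERSTANDING = "Architectural Understanding"
    CROSS_FILE_REFACTORING = "Cross-File Refactoring"
    FEATURE_IMPLEMENTATION = "Feature Implementation"
    BUG_INVESTIGATION = "Bug Investigation"
    MULTI_SESSION_DEVELOPMENT = "Multi-Session Development"
    CODE_COMPREHENSION = "Code Comprehension"
    INTEGRATION_TESTING = "Integration Testing"
    SECURITY_ANALYSIS = "Security Analysis"


# Context-length bounds stated in the article.
MIN_CONTEXT_TOKENS = 10_000
MAX_CONTEXT_TOKENS = 1_000_000


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one evaluation scenario (field names are assumed)."""
    scenario_id: str
    language: str
    domain: str
    task: TaskCategory
    context_tokens: int

    def __post_init__(self) -> None:
        # Reject scenarios outside the range the article describes.
        if not MIN_CONTEXT_TOKENS <= self.context_tokens <= MAX_CONTEXT_TOKENS:
            raise ValueError(
                f"context_tokens must be in [{MIN_CONTEXT_TOKENS}, {MAX_CONTEXT_TOKENS}]"
            )


def context_span_ratio() -> float:
    """Ratio between the longest and shortest supported context lengths."""
    return MAX_CONTEXT_TOKENS / MIN_CONTEXT_TOKENS
```

Calling `context_span_ratio()` reproduces the 100-fold variation mentioned above, and constructing a `Scenario` with a token count outside the range raises a `ValueError`.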
The LoCoBench Pipeline
The benchmark’s creation involves a meticulous five-phase process:
- Project Generation: Creating 1,000 diverse project specifications across various languages and domains.
- Codebase Synthesis: Generating realistic codebases with over 50,000 files and 15 million lines of code, ensuring architectural consistency.
- Scenario Creation: Transforming these codebases into 8,000 evaluation scenarios, carefully selecting file subsets to target specific long-context capabilities.
- Validation: Rigorous automated checks for compilation, execution, quality, and bias detection.
- LLM Evaluation: Assessing LLMs using the 17 comprehensive metrics, culminating in a unified LoCoBench Score (LCBS).
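The final step, combining many metrics into a single score, can be sketched as a weighted average over the four dimensions. The Python example below is a rough illustration under stated assumptions: the function name `unified_score`, the equal default weights, and the convention that each metric is already normalized to the range 0 to 1 are all hypothetical. The paper defines its own weighting and scale for LCBS, which this summary does not reproduce.

```python
from statistics import fmean

# The four evaluation dimensions named in the article.
DIMENSIONS = (
    "Software Engineering Excellence",
    "Functional Correctness",
    "Code Quality Assessment",
    "Long-Context Utilization",
)


def unified_score(metrics_by_dimension, weights=None):
    """Combine per-dimension metric values into one weighted score.

    metrics_by_dimension maps each dimension name to a non-empty list of
    metric values, assumed here to be normalized to [0, 1].
    weights maps each dimension name to a non-negative weight; equal
    weights are used by default (an assumption, not the paper's choice).
    """
    if weights is None:
        weights = {dimension: 1.0 for dimension in DIMENSIONS}

    total_weight = sum(weights[dimension] for dimension in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_sum = 0.0
    for dimension in DIMENSIONS:
        values = metrics_by_dimension[dimension]
        if not values:
            raise ValueError(f"no metric values for dimension: {dimension}")
        # Average the metrics within a dimension, then apply its weight.
        weighted_sum += weights[dimension] * fmean(values)

    return weighted_sum / total_weight
```

Averaging within each dimension first keeps a dimension with many metrics from dominating the result simply because it has more entries. The dimension weights then control how much each area contributes.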
Key Findings from Model Evaluations
The researchers evaluated several state-of-the-art long-context models using LoCoBench, revealing significant insights:
- Performance Gaps: The evaluations showed substantial performance differences among models, indicating that long-context understanding in complex software development remains a significant challenge.
- Leading Models: Gemini-2.5-Pro emerged as the overall leader, demonstrating strong capabilities in cross-file refactoring, long-context utilization, integration testing, and multi-session development. GPT-5 showed particular strength in architectural understanding, while Claude-Sonnet-4 excelled in code comprehension.
- Difficulty and Context Length: Performance consistently degraded as task difficulty and context length increased, highlighting the compounding challenges these factors present for LLMs.
- Language and Domain Specificity: Models generally performed better on high-level languages like Python and PHP than on systems programming languages such as C and Rust. Performance also varied significantly across application domains and architectural patterns, suggesting that models may be specialized or that some languages and domains are better represented in their training data than others.
Implications for the Future
LoCoBench provides guidance for both AI model developers and software engineering practitioners. It underscores the need for more focused research on long-context capabilities in software engineering. For practitioners, the benchmark demonstrates that selecting an LLM should involve considering not just overall performance, but also its strengths in specific programming languages, application domains, and architectural patterns, as well as the consistency requirements of the intended use case. The findings suggest that while top models are becoming more capable, there’s still a long way to go before they achieve robust long-context understanding for complex software development. The full details are available in the LoCoBench research paper.


