Devstral-Small: A Lightweight Open-Source Model Excelling in Coding Agent Applications

TLDR: Devstral-Small is a 24-billion-parameter open-source language model developed by Mistral AI and All Hands AI and fine-tuned specifically for coding agent applications. It achieves state-of-the-art performance among open models under 100 billion parameters and outperforms much larger open models, and even some closed ones, on complex software engineering tasks. The model is based on Mistral Small 3, has a 128k-token context, and was trained on supervised trajectories from SWE-Gym environments in a two-stage process followed by policy optimization. Evaluated with the OpenHands scaffold on SWE-bench Verified, it showed superior resolution rates; accompanying analyses identify an optimal iteration limit and show the effectiveness of an iterative evaluation protocol. An updated version, Devstral-Small-2507, further improved performance through enhanced data curation.

Language models are increasingly being adapted for complex coding tasks. While many models excel at generating code snippets, automating multi-step software engineering workflows (debugging, refactoring, or implementing features across multiple files) has largely remained the domain of larger, often closed-source, models. To address this gap, researchers from Mistral AI and All Hands AI have introduced Devstral-Small, an open-source model fine-tuned specifically for coding agent applications. It aims to bring advanced agentic capabilities to a more accessible scale.

Devstral-Small is a lightweight model with 24 billion parameters. Despite its modest size, it achieves state-of-the-art performance among open models under 100 billion parameters while remaining fast and easy to serve. Its development is a significant step toward more effective and autonomous AI software engineers.

The Foundation of Devstral-Small

At its core, Devstral-Small is a dense Transformer model built upon Mistral Small 3. It has 40 layers and uses grouped query attention, a design choice that contributes to its efficiency. The model was pre-trained extensively on a diverse dataset of natural language and code. A long-context extension phase then raised its context size to 128,000 tokens. This extended context is particularly important for coding agents, because it lets them reason over large codebases and project-specific information.
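To see why grouped query attention matters at a 128,000-token context, here is a rough back-of-the-envelope sketch of key/value cache size. The layer count and context length come from the article; the head dimension, the number of query heads, and the number of key/value heads are assumptions made for illustration, as is 16-bit storage.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2: one cached tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


LAYERS = 40        # stated in the article
CONTEXT = 128_000  # stated in the article
HEAD_DIM = 128     # assumption, not stated in the article
QUERY_HEADS = 32   # assumption, not stated in the article
KV_HEADS = 8       # assumption: grouped query attention shares K/V across query heads

gqa = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM)
mha = kv_cache_bytes(CONTEXT, LAYERS, QUERY_HEADS, HEAD_DIM)
print(f"GQA: {gqa / 2**30:.1f} GiB, full multi-head: {mha / 2**30:.1f} GiB")
# GQA: 19.5 GiB, full multi-head: 78.1 GiB
```

Under these assumed head counts, sharing key/value heads cuts the cache for a full-length context by a factor of four, which is the kind of saving that makes a long-context model practical to serve.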

Crafting Agentic Intelligence: Data and Training

The development of Devstral-Small focused on fostering an interaction pattern where the AI agent alternates between ‘chain-of-thought’ reasoning and executing actions within a coding environment. To achieve this, supervised trajectories were generated by running an agent in SWE-Gym environments, utilizing the OpenHands CodeAct scaffold. This process involved executing unit tests to assess the quality of the generated code patches. Additionally, a carefully selected mixture of natural language data was included to ensure the model retained strong general natural language understanding capabilities.
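As a minimal sketch of what such a trajectory might look like as data, the record below alternates reasoning and actions and scores the final patch by its unit-test results. The class names, field names, and the pass-rate score are hypothetical; the article does not describe the actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    thought: str      # chain-of-thought text produced before acting
    action: str       # command or edit sent to the coding environment
    observation: str  # what the environment returned


@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    test_results: dict[str, bool] = field(default_factory=dict)

    def pass_rate(self) -> float:
        """Fraction of unit tests that passed for the final patch."""
        if not self.test_results:
            return 0.0
        return sum(self.test_results.values()) / len(self.test_results)


# Hypothetical example: one step, two unit tests, one of which passes.
traj = Trajectory(
    task_id="example-task",
    steps=[Step("The failing test points at parse().", "cat parser.py", "def parse(text): ...")],
    test_results={"test_parse_empty": True, "test_parse_nested": False},
)
print(traj.pass_rate())  # 0.5
```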

The training process for Devstral-Small was structured in two stages. In the first stage, the model was trained on a larger subset of rollouts that met a baseline quality standard. In the second stage, it was fine-tuned using only the highest-quality trajectories. Further refinement involved additional rounds of rollouts with the fine-tuned model, followed by training with policy optimization, a reinforcement-learning step intended to further improve the model's decision-making and action selection.
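The two-stage selection can be pictured as two nested filters over scored rollouts. This is only an illustrative sketch: the `quality` field and both thresholds are hypothetical, since the article does not say how quality was measured or where the cut-offs were set.

```python
def two_stage_split(rollouts, baseline=0.5, top=1.0):
    """Return (stage1, stage2) training subsets chosen by quality score.

    stage1: every rollout meeting the baseline standard (larger set).
    stage2: only the highest-quality rollouts, a subset of stage1.
    """
    stage1 = [r for r in rollouts if r["quality"] >= baseline]
    stage2 = [r for r in stage1 if r["quality"] >= top]
    return stage1, stage2


# Hypothetical rollouts with made-up quality scores.
rollouts = [
    {"id": "a", "quality": 0.2},
    {"id": "b", "quality": 0.6},
    {"id": "c", "quality": 1.0},
]
stage1, stage2 = two_stage_split(rollouts)
print([r["id"] for r in stage1], [r["id"] for r in stage2])  # ['b', 'c'] ['c']
```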

Benchmarking Performance: Outperforming its Peers

Devstral-Small’s performance was rigorously evaluated using an agentic setup, where the model could access bash execution and file editing tools, mimicking a human software engineer’s workflow. The evaluation utilized the OpenHands scaffold, an open platform designed for developing and comparing AI agents in a secure, sandboxed environment, and the SWE-bench Verified benchmark.
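The agentic setup can be illustrated with a toy loop in which a policy repeatedly picks a tool, the tool runs, and the observation is fed back. This is not the OpenHands implementation: the tool names, the action dictionary format, and the scripted policy standing in for the model are all hypothetical, and the example assumes a POSIX shell. The default cap of 50 turns mirrors the iteration limit discussed in the article.

```python
import subprocess
import tempfile
from pathlib import Path


def run_bash(command: str, cwd: str) -> str:
    """Run a shell command in the working directory; return stdout plus stderr."""
    result = subprocess.run(
        command, shell=True, cwd=cwd, capture_output=True, text=True, timeout=30
    )
    return result.stdout + result.stderr


def write_file(path: str, content: str, cwd: str) -> str:
    """Overwrite a file inside the working directory."""
    Path(cwd, path).write_text(content)
    return f"wrote {path}"


def agent_loop(policy, cwd: str, max_iterations: int = 50):
    """Alternate policy decisions and tool calls until 'finish' or the cap."""
    history = []
    for _ in range(max_iterations):
        action = policy(history)
        if action["tool"] == "finish":
            break
        if action["tool"] == "bash":
            observation = run_bash(action["command"], cwd)
        elif action["tool"] == "write_file":
            observation = write_file(action["path"], action["content"], cwd)
        else:
            observation = f"unknown tool: {action['tool']}"
        history.append((action, observation))
    return history


def scripted_policy(history):
    """Stand-in for the model: replays a fixed plan, one action per turn."""
    plan = [
        {"tool": "write_file", "path": "hello.txt", "content": "hi\n"},
        {"tool": "bash", "command": "cat hello.txt"},
        {"tool": "finish"},
    ]
    return plan[len(history)]


with tempfile.TemporaryDirectory() as workdir:
    trace = agent_loop(scripted_policy, workdir)
print(repr(trace[-1][1]))  # 'hi\n'
```

In a real evaluation the scripted policy would be replaced by calls to the model, and the tools would run inside a sandbox rather than a temporary directory.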

The results are compelling: Devstral-Small significantly outperforms other prominent open-source models, such as Qwen 3 and DeepSeek V3, despite being a fraction of their size; the models it surpasses are approximately 10 to 28 times larger. This highlights the effectiveness of specialized training for software engineering tasks, which differ fundamentally from traditional competitive programming challenges.

Furthermore, Devstral-Small demonstrates competitive performance against even closed models. It exceeds the performance of OpenAI’s recent GPT-4.1-mini by over 20% and also performs strongly against Anthropic’s Claude 3.5 Haiku, showcasing its robust capabilities across various evaluation scaffolds.

Insights from Experimental Analysis

The research paper also delves into several analyses to understand Devstral-Small’s behavior under different conditions:

Maximum Iteration Limits: Experiments revealed that 50 iterations represent an optimal balance between computational efficiency and performance. While performance increased substantially from 30 to 50 iterations, further increasing the limit to 100 iterations yielded no additional gains, suggesting that the model typically resolves problems or encounters fundamental challenges within 50 turns.
Temperature Scaling: Investigating the impact of sampling temperature on performance, the study found that lower temperatures (e.g., T=0.1, T=0.4) tended to perform better as K increases in Pass@K (the probability of success in at least one of K attempts). This is counter-intuitive compared with competitive programming exercises, where higher temperatures often scale better with K, and it suggests that agentic coding tasks have distinct characteristics.
Iterative Evaluation Protocol: The iterative evaluation protocol, which allows up to three independent attempts at temperatures of 0, 0.1, and 0.1, proved highly effective. It consistently improved resolution rates across iterations and significantly reduced instances of empty patches, leading to more stable and reliable performance metrics.
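Two of the ideas above can be sketched in a few lines. `pass_at_k` is the standard unbiased Pass@K estimator computed from n samples of which c succeeded; whether the paper uses exactly this estimator is an assumption. `iterative_eval` retries an instance with the listed temperatures until an attempt produces a non-empty patch; the retry criterion and the `run_attempt` callback are hypothetical, since the article only states that up to three attempts are allowed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples, c of which succeeded."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def iterative_eval(run_attempt, temperatures=(0.0, 0.1, 0.1)):
    """Run up to len(temperatures) attempts; stop at the first non-empty patch.

    Assumption: an empty patch is what triggers a retry.
    Returns (patch, number_of_attempts_used).
    """
    patch = ""
    for attempt, temperature in enumerate(temperatures, start=1):
        patch = run_attempt(temperature)
        if patch.strip():
            return patch, attempt
    return patch, len(temperatures)


print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917

# Hypothetical attempt function: first attempt yields no patch, second succeeds.
outputs = iter(["", "diff --git a/x.py b/x.py"])
patch, attempts = iterative_eval(lambda temperature: next(outputs))
print(attempts)  # 2
```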

The Evolution to Devstral-Small-2507

Following the initial release, an updated version, Devstral-Small-2507, was developed. This iteration benefited from refined data generation and curation processes, including the creation of diverse pseudo-scaffolds and training with prompts in both XML and native function calling formats. These improvements led to a significant performance boost, underscoring the critical role of high-quality data in developing advanced language models.
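To make the difference between the two prompt formats concrete, the sketch below renders the same tool call once as XML-style tags and once as a JSON function call. The tool name, the tag layout, and the JSON shape are hypothetical illustrations, not the formats actually used to train Devstral-Small-2507.

```python
import json
from xml.sax.saxutils import escape


def as_xml(name: str, arguments: dict) -> str:
    """Render a tool call as nested XML-style tags (hypothetical layout)."""
    body = "".join(
        f"<{key}>{escape(str(value))}</{key}>" for key, value in arguments.items()
    )
    return f"<{name}>{body}</{name}>"


def as_function_call(name: str, arguments: dict) -> str:
    """Render the same call as a JSON function-call object (hypothetical shape)."""
    return json.dumps({"name": name, "arguments": arguments})


call_name = "execute_bash"                  # hypothetical tool name
call_args = {"command": "pytest -q tests/"}

print(as_xml(call_name, call_args))
# <execute_bash><command>pytest -q tests/</command></execute_bash>
print(as_function_call(call_name, call_args))
# {"name": "execute_bash", "arguments": {"command": "pytest -q tests/"}}
```

Training on both renderings of the same underlying action is one plausible way a model could become robust to whichever format a given scaffold expects.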

A New Era for Open-Source Coding Agents

Devstral-Small represents a significant advancement in open-source AI for software development. As a high-performance, lightweight, and easily deployable 24-billion-parameter model, it gives developers and researchers access to strong agentic capabilities. Its ability to inspect, edit, enhance, and fix code within codebases, combined with its competitive performance against much larger and closed models, positions Devstral-Small as a leading solution in its weight class. For more details, refer to the original research paper.
