TLDR: Devstral-Small is a 24-billion-parameter open-source language model developed by Mistral AI and All Hands AI and fine-tuned specifically for coding agent applications. It achieves state-of-the-art performance among open models under 100 billion parameters, outperforming much larger open models and some closed models on complex software engineering tasks. The model is based on Mistral Small 3, supports a 128k-token context, and was trained in two stages on supervised trajectories from SWE-Gym environments, followed by policy optimization. Evaluated with the OpenHands scaffold on SWE-bench Verified, it reached higher resolution rates than its peers, and the accompanying analyses identify a practical iteration limit and show that the iterative evaluation protocol is effective. An updated version, Devstral-Small-2507, further improved performance through enhanced data curation.
In the rapidly evolving landscape of artificial intelligence, language models are increasingly being adapted for complex coding tasks. While many models excel at generating code snippets, the challenge of automating multi-step software engineering workflows – such as debugging, refactoring, or implementing features across multiple files – has largely remained the domain of larger, often closed-source, models. Addressing this gap, researchers from Mistral AI and All Hands AI have introduced Devstral-Small, a groundbreaking open-source model specifically fine-tuned for coding agent applications. This model promises to bring advanced agentic capabilities to a more accessible scale.
Devstral-Small stands out as a lightweight model, boasting 24 billion parameters. Despite its modest size, it achieves state-of-the-art performance among open models under 100 billion parameters, making it a fast and easy-to-serve solution. Its development marks a significant step towards enabling more effective and autonomous AI software engineers.
The Foundation of Devstral-Small
At its core, Devstral-Small is a dense Transformer model built upon Mistral Small 3. It features 40 layers and utilizes grouped query attention, a design choice that contributes to its efficiency. The model underwent extensive pre-training on a diverse dataset comprising both natural language and code. A crucial aspect of its design is a long context extension phase, which boosts its context size to an impressive 128,000 tokens. This extended context is particularly vital for coding agents, allowing them to reason over large codebases and complex project-specific information.
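To see why grouped query attention matters at a 128,000-token context, a rough key-value cache estimate helps. The sketch below is a back-of-the-envelope calculation, not a description of the released model: the layer count and context length come from the text above, while the head counts, head dimension, and 2-byte values are assumptions chosen for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


NUM_LAYERS = 40            # stated in the article
CONTEXT_TOKENS = 128_000   # stated in the article
NUM_QUERY_HEADS = 32       # assumption, for illustration only
NUM_KV_HEADS = 8           # assumption, for illustration only
HEAD_DIM = 128             # assumption, for illustration only

# Grouped query attention: several query heads share each key/value head.
gqa_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, CONTEXT_TOKENS)
# Standard multi-head attention: one key/value head per query head.
mha_bytes = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, CONTEXT_TOKENS)

print(f"GQA cache at full context: {gqa_bytes / 2**30:.1f} GiB")
print(f"MHA cache at full context: {mha_bytes / 2**30:.1f} GiB")
print(f"Reduction factor: {mha_bytes / gqa_bytes:.0f}x")
```

Under these assumed dimensions, the cache shrinks in proportion to the ratio of query heads to key-value heads, which is what makes long contexts more practical to serve.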
Crafting Agentic Intelligence: Data and Training
The development of Devstral-Small focused on fostering an interaction pattern where the AI agent alternates between ‘chain-of-thought’ reasoning and executing actions within a coding environment. To achieve this, supervised trajectories were generated by running an agent in SWE-Gym environments, utilizing the OpenHands CodeAct scaffold. This process involved executing unit tests to assess the quality of the generated code patches. Additionally, a carefully selected mixture of natural language data was included to ensure the model retained strong general natural language understanding capabilities.
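The reason-then-act pattern described above can be sketched as a simple rollout loop. This is a toy illustration only: `ToyEnvironment`, `scripted_policy`, and `rollout` are hypothetical names, not part of OpenHands or SWE-Gym, and a real agent would call the language model where the scripted policy sits.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    thought: str       # chain-of-thought text produced before acting
    action: str        # command sent to the environment
    observation: str   # what the environment returned


@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    tests_passed: bool = False


class ToyEnvironment:
    """Hypothetical stand-in for a sandboxed repository with unit tests."""

    def __init__(self):
        self.bug_fixed = False

    def execute(self, action):
        if action == "edit":
            self.bug_fixed = True
            return "file updated"
        if action == "run_tests":
            return "tests passed" if self.bug_fixed else "1 test failed"
        return f"unknown action: {action}"

    def run_unit_tests(self):
        return self.bug_fixed


def scripted_policy(history):
    """Hypothetical policy; a real agent would query the model here."""
    if not history:
        return "Reproduce the failure first.", "run_tests"
    if history[-1].observation == "1 test failed":
        return "The test fails, so patch the code.", "edit"
    if history[-1].action == "edit":
        return "Check that the patch works.", "run_tests"
    return "Tests pass, nothing left to do.", "finish"


def rollout(env, policy, max_iterations=50):
    """Alternate thought and action until the policy finishes or the cap is hit."""
    trajectory = Trajectory()
    for _ in range(max_iterations):
        thought, action = policy(trajectory.steps)
        if action == "finish":
            break
        observation = env.execute(action)
        trajectory.steps.append(Step(thought, action, observation))
    # Unit tests label the finished trajectory, as in the data pipeline above.
    trajectory.tests_passed = env.run_unit_tests()
    return trajectory


trajectory = rollout(ToyEnvironment(), scripted_policy)
for step in trajectory.steps:
    print(f"{step.action:10s} -> {step.observation}")
print("tests passed:", trajectory.tests_passed)
```

Each recorded step keeps the thought alongside the action and observation, so a trajectory that passes its tests can later serve as a supervised training example.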
The training process for Devstral-Small was structured in two stages. Initially, the model was trained on a larger subset of rollouts that met a baseline quality standard. In the second stage, it was fine-tuned using only the highest-quality trajectories. Further refinement involved additional rounds of rollouts with the fine-tuned model, followed by training with policy optimization, a technique that further enhances its decision-making and action capabilities.
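The two-stage data selection can be pictured as two nested filters over scored rollouts. The article does not say how quality is measured or where the cut-offs lie, so the `score` field, the thresholds, and the function name below are all hypothetical.

```python
def split_for_two_stage_training(trajectories, baseline=0.5, top=0.9):
    """Return (stage_one, stage_two) subsets of scored trajectories.

    `score` is a hypothetical quality measure in [0, 1]; both thresholds
    are illustrative and not taken from the paper.
    """
    # Stage one: the larger subset that clears a baseline quality bar.
    stage_one = [t for t in trajectories if t["score"] >= baseline]
    # Stage two: only the highest-quality trajectories from that subset.
    stage_two = [t for t in stage_one if t["score"] >= top]
    return stage_one, stage_two


rollouts = [
    {"id": "r1", "score": 0.95},
    {"id": "r2", "score": 0.70},
    {"id": "r3", "score": 0.30},
    {"id": "r4", "score": 0.90},
]
stage_one, stage_two = split_for_two_stage_training(rollouts)
print("stage one:", [t["id"] for t in stage_one])
print("stage two:", [t["id"] for t in stage_two])
```

Because the second subset is drawn from the first, the model sees broad coverage early and then concentrates on the cleanest examples.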
Benchmarking Performance: Outperforming its Peers
Devstral-Small’s performance was rigorously evaluated using an agentic setup, where the model could access bash execution and file editing tools, mimicking a human software engineer’s workflow. The evaluation utilized the OpenHands scaffold, an open platform designed for developing and comparing AI agents in a secure, sandboxed environment, and the SWE-bench Verified benchmark.
The results are compelling: Devstral-Small significantly outperforms other prominent open-source models, such as Qwen 3 and Deepseek V3, despite being a fraction of their size. For instance, it surpasses models that are approximately 10 to 28 times larger. This highlights the effectiveness of specialized training for software engineering tasks, which differ fundamentally from traditional competitive programming challenges.
Furthermore, Devstral-Small demonstrates competitive performance against even closed models. It exceeds the performance of OpenAI’s recent GPT-4.1-mini by over 20% and also performs strongly against Anthropic’s Claude 3.5 Haiku, showcasing its robust capabilities across various evaluation scaffolds.
Insights from Experimental Analysis
The research paper also delves into several analyses to understand Devstral-Small’s behavior under different conditions:
- Maximum Iteration Limits: Experiments showed that a limit of 50 iterations offers the best balance between computational cost and performance. Performance increased substantially from 30 to 50 iterations, but raising the limit to 100 yielded no additional gains, suggesting that the model typically either resolves the problem or hits a fundamental obstacle within 50 turns.
- Temperature Scaling: Investigating the impact of sampling temperature, the study found that lower temperatures (e.g., T=0.1, T=0.4) tended to perform better at larger values of K in Pass@K (the rate of success in at least one of K attempts). This is counter-intuitive compared with competitive programming exercises, where higher temperatures often scale better with K, and it suggests that agentic coding tasks have distinct characteristics.
- Iterative Evaluation Protocol: The iterative evaluation protocol allows up to three independent attempts, with the temperature raised from 0 on the first attempt to 0.1 on the second and third. It proved highly effective, consistently improving resolution rates across iterations and significantly reducing the number of empty patches, which leads to more stable and reliable performance metrics.
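Two of the ideas in this list are easy to make concrete. The sketch below shows the standard unbiased Pass@K estimator widely used for code benchmarks (the article does not say which estimator the authors used) and a toy version of the iterative protocol with temperatures 0, 0.1, and 0.1. Retrying only the instances whose patch came back empty is an assumption, and `generate_patch`, `fake_generate_patch`, and `iterative_attempts` are hypothetical names.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@K estimate from n samples, c of which succeeded."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one success
    return 1.0 - comb(n - c, k) / comb(n, k)


def iterative_attempts(instance_ids, generate_patch, temperatures=(0.0, 0.1, 0.1)):
    """Run up to len(temperatures) attempts per instance.

    Assumption: an instance is retried only while its patch is empty.
    Returns the final patches and the count of empty patches after each round.
    """
    patches = {}
    empty_after_round = []
    pending = list(instance_ids)
    for temperature in temperatures:
        still_pending = []
        for instance_id in pending:
            patch = generate_patch(instance_id, temperature)
            patches[instance_id] = patch
            if not patch.strip():
                still_pending.append(instance_id)
        pending = still_pending
        empty_after_round.append(len(pending))
    return patches, empty_after_round


def fake_generate_patch(instance_id, temperature):
    """Hypothetical generator: instance 'b' fails greedily, then succeeds."""
    if instance_id == "b" and temperature == 0.0:
        return ""
    return f"diff for {instance_id}"


patches, empty_counts = iterative_attempts(["a", "b"], fake_generate_patch)
print("empty patches after each round:", empty_counts)
print("pass@1 with 3 of 10 successes:", round(pass_at_k(10, 3, 1), 3))
print("pass@5 with 3 of 10 successes:", round(pass_at_k(10, 3, 5), 3))
```

In this toy run the second attempt recovers the instance that produced an empty patch at temperature 0, which mirrors the reduction in empty patches reported above.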
The Evolution to Devstral-Small-2507
Following the initial release, an updated version, Devstral-Small-2507, was developed. This iteration benefited from refined data generation and curation processes, including the creation of diverse pseudo-scaffolds and training with prompts in both XML and native function calling formats. These improvements led to a significant performance boost, underscoring the critical role of high-quality data in developing advanced language models.
A New Era for Open-Source Coding Agents
Devstral-Small represents a significant advancement in open-source AI for software development. As a high-performance, lightweight, and easily deployable 24-billion-parameter model, it is poised to give developers and researchers powerful agentic capabilities. Its ability to inspect, edit, enhance, and fix code within codebases, combined with its competitive performance against much larger and closed models, positions Devstral-Small as a leading solution in its weight class. For more details, refer to the original research paper.


