TLDR: Mistral AI has unveiled Voxtral Mini and Voxtral Small, open-weight multimodal AI models capable of understanding both spoken audio and text. These models achieve state-of-the-art performance in speech recognition and translation while retaining strong text capabilities. With a 32K-token context window, they can process audio files up to 40 minutes long and handle complex multi-turn conversations. Both models are released under the Apache 2.0 license, and Voxtral Small is efficient enough to run locally while outperforming many closed-source alternatives.
Mistral AI has introduced Voxtral Mini and Voxtral Small, two multimodal audio chat models designed to understand both spoken audio and text documents. The models mark a significant step forward for open-weight audio AI, achieving state-of-the-art performance across various audio benchmarks while maintaining robust text processing capabilities.
Voxtral Small, in particular, stands out by outperforming several closed-source models while remaining compact enough to run efficiently on local devices. A key feature of both Voxtral models is their 32K-token context window, which allows them to handle audio files up to 40 minutes in length and engage in extensive multi-turn conversations, a prerequisite for processing long spoken interactions and complex dialogues.
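To see how a 40-minute recording fits in that window, it helps to run the token arithmetic. The back-of-the-envelope check below assumes roughly 12.5 audio tokens per second after the adapter's downsampling (an assumed rate, consistent with the downsampling stage described next):

```python
# Back-of-the-envelope: does 40 minutes of audio fit in a 32K context?
# Assumes ~12.5 audio tokens per second after downsampling (assumed rate).
AUDIO_TOKENS_PER_SECOND = 12.5
CONTEXT_WINDOW = 32_000

audio_tokens = 40 * 60 * AUDIO_TOKENS_PER_SECOND
print(f"audio tokens: {audio_tokens:.0f}")                           # 30000
print(f"tokens left for text: {CONTEXT_WINDOW - audio_tokens:.0f}")  # 2000
```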
The Voxtral release also contributes three new benchmarks designed to evaluate speech understanding models on knowledge and trivia, filling a gap in an evaluation ecosystem that has focused primarily on transcription and translation quality. Both Voxtral models are openly released under the Apache 2.0 license, promoting accessibility and further research in the field.
At its core, Voxtral is built upon the Transformer architecture, comprising three main components: an audio encoder, an adapter layer, and a language decoder. The audio encoder, based on Whisper large-v3, processes speech inputs by attending to 30-second chunks of audio independently. The adapter layer then efficiently downsamples these audio embeddings, reducing the sequence length and enabling the model to handle longer audio inputs without excessive computational overhead. Finally, the language decoder is responsible for reasoning and generating text outputs based on the combined audio and text inputs.
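The adapter is the piece that makes long audio tractable, and it is compact enough to sketch. The PyTorch snippet below illustrates frame-stacking downsampling; the 50 Hz frame rate and 1280-dimensional states match Whisper large-v3, while the 4x downsampling factor and decoder width are illustrative assumptions, not Mistral's exact configuration:

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Stacks consecutive encoder frames and projects them to the decoder
    width, shortening the sequence the language decoder must attend over."""
    def __init__(self, enc_dim: int, dec_dim: int, downsample: int = 4):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Linear(enc_dim * downsample, dec_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, t, d = frames.shape
        t = (t // self.downsample) * self.downsample      # drop any ragged tail
        stacked = frames[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(stacked)

# 30 s of audio at a 50 Hz encoder frame rate -> 1500 frames of width 1280.
encoder_frames = torch.randn(1, 1500, 1280)               # (batch, time, enc_dim)
adapter = AudioAdapter(enc_dim=1280, dec_dim=4096, downsample=4)
audio_embeddings = adapter(encoder_frames)
print(audio_embeddings.shape)                             # torch.Size([1, 375, 4096])
```

The 4x reduction is what keeps long recordings affordable: 30 seconds of speech shrinks from 1,500 encoder frames to 375 decoder-side embeddings, which are then interleaved with text token embeddings for the decoder.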
Two distinct variants of Voxtral are available: Voxtral Mini, which is built on the Ministral 3B model, and Voxtral Small, which leverages the more powerful Mistral Small 3.1 24B backbone. These variants offer different scales of performance and memory footprint, catering to a range of applications from edge devices to more demanding tasks.
The training of Voxtral involves a three-phase process: pretraining, supervised finetuning, and preference alignment. During pretraining, the models learn to align speech and text through two patterns: audio-to-text repetition (for transcription) and cross-modal continuation (for deeper understanding and dialogue). Supervised finetuning then instills instruction-following behavior using a mix of real and synthetic data, including scenarios where audio provides context for a text query and tasks involving audio-only inputs. Finally, preference alignment, using Direct Preference Optimization (DPO) and its online variant, refines the model’s response quality and helpfulness.
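Of the three phases, preference alignment is the easiest to make concrete. The snippet below sketches the standard DPO objective, which pushes the policy to favor a preferred response over a rejected one more strongly than a frozen reference model does; this is the generic formulation, not Voxtral's actual training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO: reward each response by how much the policy's log-prob
    exceeds the reference model's, then apply a logistic loss to the margin."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Dummy sequence-level log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -9.4]))
print(loss.item())
```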
Evaluations show that Voxtral Small achieves state-of-the-art results in speech transcription and translation, surpassing both open and closed-source models on various benchmarks. In speech question-answering and summarization, it performs comparably to leading closed models like GPT-4o mini and Gemini 2.5 Flash. Importantly, Voxtral Small also maintains strong performance on text-only benchmarks, making it a versatile solution for both audio and text-based tasks.
The research paper also delves into several analyses, including the impact of audio padding, the optimal downsampling factor for the adapter layer, and the importance of balancing the pretraining patterns; the padding question is illustrated in the sketch below. These insights highlight the meticulous engineering behind Voxtral’s robust performance. For more technical details, refer to the full research paper.
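The padding analysis stems from the encoder's fixed 30-second window: shorter inputs must be zero-padded and longer ones split into independently encoded chunks. A rough sketch of that preprocessing step, with assumed parameters:

```python
import numpy as np

def chunk_and_pad(waveform: np.ndarray, sample_rate: int = 16_000,
                  chunk_seconds: int = 30) -> list[np.ndarray]:
    """Split audio into fixed 30 s windows, zero-padding the final chunk,
    mirroring how a Whisper-style encoder expects fixed-length inputs."""
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(waveform), chunk_len):
        chunk = waveform[start:start + chunk_len]
        if len(chunk) < chunk_len:                       # pad the ragged tail
            chunk = np.pad(chunk, (0, chunk_len - len(chunk)))
        chunks.append(chunk)
    return chunks

# 70 s of dummy audio -> two full 30 s chunks plus one padded 10 s chunk.
chunks = chunk_and_pad(np.zeros(70 * 16_000))
print(len(chunks), [len(c) for c in chunks])    # 3 [480000, 480000, 480000]
```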
Also Read:
- Mistral Unveils Voxtral: An Open-Source Challenge to Proprietary Speech AI Models
- A Unified Approach to Boosting Keyword Recognition in Speech-to-Text Systems
Voxtral Mini and Voxtral Small represent a significant contribution to the field of multimodal AI, offering powerful, open-weight models that excel in understanding and processing both spoken and written language. Their strong instruction following and multilingual capabilities make them highly adaptable for a wide array of complex multimodal applications.