BALDWHISPER: Making AI Speech Models Faster and Smaller for Languages with Limited Data

TLDR: The research paper “BALDWHISPER: FASTER WHISPER WITH HEAD SHEARING AND LAYER MERGING” introduces a novel method to compress the Whisper ASR model for low-resource languages, specifically Bambara. It addresses the challenge of pruning large pre-trained transformers without requiring massive retraining data. The approach involves two main stages: merging consecutive decoder layers to limit performance loss and compressing the embedding matrix using activation-aware low-rank decomposition and feature distillation. This results in a model that is 48% smaller and 2.15 times faster on edge devices (like a MacBook Air M1) while retaining over 90% of the original performance, all achieved with only 32 hours of speech-to-text data.

Large pre-trained models like Whisper, while powerful, often present a significant challenge for deployment on edge devices, especially in regions with limited data for specific languages. Traditional methods for making these models smaller, known as pruning, typically demand vast amounts of retraining data to maintain performance. For instance, Distil-Whisper prunes the Whisper model but requires retraining on 21,000 hours of speech, a scale of data unavailable for most low-resource languages.

This challenge is particularly acute for low-resource languages, where collecting such extensive datasets is impractical. The research paper, titled “BALDWHISPER: FASTER WHISPER WITH HEAD SHEARING AND LAYER MERGING,” addresses this critical issue by proposing a novel approach to compress the Whisper Automatic Speech Recognition (ASR) model for data-scarce settings. The authors, Yaya Sy, Christophe Cerisara, and Irina Illina from LORIA, CNRS, Nancy, France, focused their work on Bambara, a low-resource language spoken primarily in Mali, using only 32 hours of speech-to-text data.

The core of their innovation lies in a two-stage pruning recipe that deviates from conventional methods. Instead of simply removing parts of the model, which can lead to significant performance drops, they introduce two key techniques: layer merging and activation-aware embedding decomposition.

Layer Merging for Performance Preservation

The first stage of the BALDWHISPER approach involves merging consecutive layers of the Whisper decoder. In standard pruning, layers are often removed entirely. However, this can severely impact the model’s ability to perform its task. The researchers observed that adjacent layers in the decoder often produce similar activations, suggesting they can be combined without substantial loss of information. They merge pairs of layers using a weighted average, effectively reducing the number of layers in the decoder. For example, Whisper-base, which has 6 decoder layers, is compressed to just 3 layers. This merged model then undergoes retraining with a combination of Cross-Entropy loss and Knowledge Distillation, where the original, uncompressed model acts as a ‘teacher’ to guide the student model’s learning.
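To make the recipe concrete, here is a minimal PyTorch sketch of pairwise decoder-layer merging on Whisper-base, assuming the Hugging Face transformers implementation. The equal 0.5/0.5 merge weighting and the distillation hyperparameters (temperature T, mixing weight lam) are illustrative assumptions, not the paper’s exact settings.

```python
import torch
import torch.nn.functional as F
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
decoder = model.model.decoder  # Whisper-base: 6 decoder layers

@torch.no_grad()
def merge_pair(layer_a, layer_b, alpha=0.5):
    """Fold layer_b into layer_a as a weighted average of their parameters.
    alpha=0.5 is an illustrative choice, not the paper's exact weighting."""
    for p_a, p_b in zip(layer_a.parameters(), layer_b.parameters()):
        p_a.mul_(alpha).add_(p_b, alpha=1.0 - alpha)
    return layer_a

# Merge the 6 decoder layers down to 3, mirroring the Whisper-base setup.
decoder.layers = torch.nn.ModuleList(
    merge_pair(decoder.layers[i], decoder.layers[i + 1])
    for i in range(0, len(decoder.layers), 2)
)
model.config.decoder_layers = len(decoder.layers)

def retraining_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Cross-entropy on labels plus KL distillation from the frozen teacher.
    T and lam are hypothetical hyperparameters."""
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1).view(-1, vocab),
        F.softmax(teacher_logits / T, dim=-1).view(-1, vocab),
        reduction="batchmean",
    ) * T * T
    return lam * ce + (1.0 - lam) * kd
```

In practice, the merged three-layer decoder would then be fine-tuned on the 32 hours of Bambara data with a combined loss of this kind, with the frozen original model supplying the teacher logits.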

Activation-Aware Embedding Decomposition for Further Compression

The second stage tackles another significant component of model size: the input/output embedding matrix. In multilingual models like Whisper, this matrix can account for over 50% of the decoder parameters due to its large vocabulary. When specializing the model for a single language, many features in this matrix become redundant. Instead of vocabulary pruning, which is risky when out-of-vocabulary words or code-switching appear (common for Bambara speakers, who often mix in French or English), the authors propose a safer alternative: compressing the embeddings with activation-aware low-rank decomposition, built on Singular Value Decomposition (SVD) and combined with feature distillation. This cuts the embedding parameters roughly fourfold when the rank is reduced from 384 to 96.
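The factorization step can be illustrated with plain truncated SVD, as in the sketch below. The paper’s activation-aware weighting and feature-distillation fine-tuning are omitted here, and the dimensions follow the article’s example (full rank 384, a vocabulary of roughly 51.9k tokens) with random stand-in weights.

```python
import torch

def factorize_embedding(weight: torch.Tensor, rank: int):
    """Factor a (vocab, d_model) matrix into (vocab, rank) @ (rank, d_model)
    via truncated SVD, keeping only the top-`rank` singular directions."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (vocab, rank)
    B = Vh[:rank, :]             # (rank, d_model)
    return A, B

# Random stand-in for the real embedding matrix (vocab ~51.9k, d_model 384).
W = torch.randn(51865, 384)
A, B = factorize_embedding(W, rank=96)
# Parameters: 51865*384 ≈ 19.9M before; 51865*96 + 96*384 ≈ 5.0M after (~4x).
print(A.shape, B.shape, (A @ B - W).abs().mean())
```

The two small factors then stand in for the original matrix wherever it is used, which is where the memory savings come from; the feature-distillation step described in the paper would further tune them so the factored embeddings reproduce the original features.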

Impressive Results on Edge Devices

The BALDWHISPER approach yielded remarkable results. When applied to Whisper-base, the final compressed model was 48% smaller and 2.15 times faster on a MacBook Air M1, all while preserving over 90% of the original model’s performance. This was achieved using only 32 hours of Bambara ASR training data, demonstrating its effectiveness in low-resource settings. The layer merging alone made the model 1.54 times faster, and the additional embedding decomposition further boosted the speedup. The resulting model, with 38 million parameters, is comparable in size to Whisper-tiny but significantly faster, achieving 142.82 tokens per second compared to Whisper-tiny’s 116.24 tokens per second.
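For readers who want to run a similar measurement themselves, below is a hypothetical throughput micro-benchmark using the Hugging Face transformers Whisper API. The silent input and the crude token count are simplifications, so the numbers it prints are not comparable to the paper’s reported figures.

```python
import time
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base").eval()

# 30 seconds of silence as a stand-in for a real Bambara utterance.
audio = torch.zeros(16000 * 30).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    ids = model.generate(inputs.input_features, max_new_tokens=128)
    elapsed = time.perf_counter() - start

# Crude tokens-per-second estimate over the decoded sequence.
print(f"{ids.shape[-1] / elapsed:.2f} tokens/s")
```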

This research offers a promising pathway for deploying high-performing ASR models on local edge devices for languages that lack extensive training data. By intelligently merging layers and compressing embeddings, BALDWHISPER provides a blueprint for making advanced AI more accessible and efficient globally. For more details, you can read the full research paper here.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
