BALDWHISPER: Making AI Speech Models Faster and Smaller for Languages with Limited Data

TLDR: The research paper “BALDWHISPER: FASTER WHISPER WITH HEAD SHEARING AND LAYER MERGING” introduces a novel method to compress the Whisper ASR model for low-resource languages, specifically Bambara. It addresses the challenge of pruning large pre-trained transformers without requiring massive retraining data. The approach involves two main stages: merging consecutive decoder layers to limit performance loss and compressing the embedding matrix using activation-aware low-rank decomposition and feature distillation. This results in a model that is 48% smaller and 2.15 times faster on edge devices (like a MacBook Air M1) while retaining over 90% of the original performance, all achieved with only 32 hours of speech-to-text data.

Large pre-trained models like Whisper, while powerful, often present a significant challenge for deployment on edge devices, especially in regions with limited data for specific languages. Traditional methods for making these models smaller, known as pruning, typically demand vast amounts of retraining data to maintain performance. For instance, Distil-Whisper prunes the Whisper model but requires retraining on 21,000 hours of speech, a scale of data unavailable for most low-resource languages.

This challenge is particularly acute for low-resource languages, where collecting such extensive datasets is impractical. The research paper, titled “BALDWHISPER: FASTER WHISPER WITH HEAD SHEARING AND LAYER MERGING,” addresses this critical issue by proposing a novel approach to compress the Whisper Automatic Speech Recognition (ASR) model for data-scarce settings. The authors, Yaya Sy, Christophe Cerisara, and Irina Illina from LORIA, CNRS, Nancy, France, focused their work on Bambara, a low-resource language spoken primarily in Mali, using only 32 hours of speech-to-text data.

The core of their innovation lies in a two-stage pruning recipe that deviates from conventional methods. Instead of simply removing parts of the model, which can lead to significant performance drops, they introduce two key techniques: layer merging and activation-aware embedding decomposition.

Layer Merging for Performance Preservation

The first stage of the BALDWHISPER approach involves merging consecutive layers of the Whisper decoder. In standard pruning, layers are often removed entirely. However, this can severely impact the model’s ability to perform its task. The researchers observed that adjacent layers in the decoder often produce similar activations, suggesting they can be combined without substantial loss of information. They merge pairs of layers using a weighted average, effectively reducing the number of layers in the decoder. For example, Whisper-base, which has 6 decoder layers, is compressed to just 3 layers. This merged model then undergoes retraining with a combination of Cross-Entropy loss and Knowledge Distillation, where the original, uncompressed model acts as a ‘teacher’ to guide the student model’s learning.
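To make the recipe concrete, here is a minimal PyTorch sketch of pairwise decoder-layer merging on Whisper-base, assuming the Hugging Face transformers implementation. The equal 0.5/0.5 merge weighting and the distillation hyperparameters (temperature T, mixing weight lam) are illustrative assumptions, not the paper’s exact settings.

```python
import torch
import torch.nn.functional as F
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
decoder = model.model.decoder  # Whisper-base: 6 decoder layers

@torch.no_grad()
def merge_pair(layer_a, layer_b, alpha=0.5):
    """Fold layer_b into layer_a as a weighted average of their parameters.
    alpha=0.5 is an illustrative choice, not the paper's exact weighting."""
    for p_a, p_b in zip(layer_a.parameters(), layer_b.parameters()):
        p_a.mul_(alpha).add_(p_b, alpha=1.0 - alpha)
    return layer_a

# Merge the 6 decoder layers down to 3, mirroring the Whisper-base setup.
decoder.layers = torch.nn.ModuleList(
    merge_pair(decoder.layers[i], decoder.layers[i + 1])
    for i in range(0, len(decoder.layers), 2)
)
model.config.decoder_layers = len(decoder.layers)

def retraining_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Cross-entropy on labels plus KL distillation from the frozen teacher.
    T and lam are hypothetical hyperparameters."""
    vocab = student_logits.size(-1)
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1))
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1).view(-1, vocab),
        F.softmax(teacher_logits / T, dim=-1).view(-1, vocab),
        reduction="batchmean",
    ) * T * T
    return lam * ce + (1.0 - lam) * kd
```

In practice, the merged three-layer decoder would then be fine-tuned on the 32 hours of Bambara data with a combined loss of this kind, with the frozen original model supplying the teacher logits.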

Activation-Aware Embedding Decomposition for Further Compression

The second stage tackles another significant component of model size: the input/output embedding matrix. In multilingual models like Whisper, this matrix can account for over 50% of the decoder parameters due to its large vocabulary. When specializing the model for a single language, many features in this matrix become redundant. Instead of vocabulary pruning, which is risky when out-of-vocabulary words or code-switching appear (common for Bambara speakers, who often mix in French or English), the authors propose a safer alternative: compressing the embeddings with activation-aware low-rank decomposition, built on Singular Value Decomposition (SVD) and combined with feature distillation. This cuts the embedding parameters roughly fourfold when the rank is reduced from 384 to 96.
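The factorization step can be illustrated with plain truncated SVD, as in the sketch below. The paper’s activation-aware weighting and feature-distillation fine-tuning are omitted here, and the dimensions follow the article’s example (full rank 384, a vocabulary of roughly 51.9k tokens) with random stand-in weights.

```python
import torch

def factorize_embedding(weight: torch.Tensor, rank: int):
    """Factor a (vocab, d_model) matrix into (vocab, rank) @ (rank, d_model)
    via truncated SVD, keeping only the top-`rank` singular directions."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (vocab, rank)
    B = Vh[:rank, :]             # (rank, d_model)
    return A, B

# Random stand-in for the real embedding matrix (vocab ~51.9k, d_model 384).
W = torch.randn(51865, 384)
A, B = factorize_embedding(W, rank=96)
# Parameters: 51865*384 ≈ 19.9M before; 51865*96 + 96*384 ≈ 5.0M after (~4x).
print(A.shape, B.shape, (A @ B - W).abs().mean())
```

The two small factors then stand in for the original matrix wherever it is used, which is where the memory savings come from; the feature-distillation step described in the paper would further tune them so the factored embeddings reproduce the original features.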

Impressive Results on Edge Devices

The BALDWHISPER approach yielded remarkable results. When applied to Whisper-base, the final compressed model was 48% smaller and 2.15 times faster on a MacBook Air M1, all while preserving over 90% of the original model’s performance. This was achieved using only 32 hours of Bambara ASR training data, demonstrating its effectiveness in low-resource settings. The layer merging alone made the model 1.54 times faster, and the additional embedding decomposition further boosted the speedup. The resulting model, with 38 million parameters, is comparable in size to Whisper-tiny but significantly faster, achieving 142.82 tokens per second compared to Whisper-tiny’s 116.24 tokens per second.
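For readers who want to run a similar measurement themselves, below is a hypothetical throughput micro-benchmark using the Hugging Face transformers Whisper API. The silent input and the crude token count are simplifications, so the numbers it prints are not comparable to the paper’s reported figures.

```python
import time
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-base")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base").eval()

# 30 seconds of silence as a stand-in for a real Bambara utterance.
audio = torch.zeros(16000 * 30).numpy()
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    start = time.perf_counter()
    ids = model.generate(inputs.input_features, max_new_tokens=128)
    elapsed = time.perf_counter() - start

# Crude tokens-per-second estimate over the decoded sequence.
print(f"{ids.shape[-1] / elapsed:.2f} tokens/s")
```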

This research offers a promising pathway for deploying high-performing ASR models on local edge devices for languages that lack extensive training data. By intelligently merging layers and compressing embeddings, BALDWHISPER provides a blueprint for making advanced AI more accessible and efficient globally. For more details, you can read the full research paper here.

Nikhil Patel (https://blogs.edgentiq.com)
Nikhil Patel is a tech analyst and AI news reporter who brings a practitioner's perspective to every article. With prior experience working at an AI startup, he decodes the business mechanics behind product innovations, funding trends, and partnerships in the GenAI space. Nikhil's insights are sharp, forward-looking, and trusted by insiders and newcomers alike. You can reach him at: [email protected]
