TLDR: This research introduces a novel weak-supervision framework to enhance Automatic Speech Recognition (ASR) models for industry-level Customer Relationship Management (CRM) systems. Facing challenges like limited labeled data and complex industry-specific speech, the framework uses Large Language Models (LLMs) and Text-to-Speech (TTS) models to generate large, high-quality synthetic datasets. This synthetic data, combined with a filtering process, is then used to fine-tune pre-trained ASR models. The paper also proposes a new evaluation metric, Integrated Error Rate (IER), for multilingual speech. Experimental results show significant performance improvements, particularly with the Whisper-large-v2 model, demonstrating the approach’s effectiveness in real-world industrial applications.
Customer Relationship Management (CRM) systems are vital tools for businesses, helping them manage customer interactions and data across various communication channels. These systems centralize customer information, leading to improved communication, personalized services, and enhanced customer satisfaction. A growing trend in CRM is the integration of voice technology, which significantly boosts user experience and operational efficiency. For instance, mobile CRM applications can use voice-to-text processing for navigation, record searching, and note-taking, making tasks quicker and more efficient.
However, integrating voice technologies, particularly Automatic Speech Recognition (ASR), into CRM systems comes with its own set of challenges. Ensuring compatibility with existing infrastructure and handling the technical complexities of accurate voice-to-text processing are crucial. ASR systems must precisely manage diverse accents, dialects, languages, and speech patterns. Furthermore, effectively managing errors from voice recognition, which can arise from background noise or unclear speech, without negatively impacting the user experience, remains a significant hurdle.
While ASR models have advanced considerably, they often struggle to meet the specific demands of particular industries. This limitation hinders their effectiveness in CRM pipelines, where precise customer insights are essential. Fine-tuning ASR models for these industry-specific needs is critical but complicated by the difficulty of acquiring large volumes of accurately labeled data in real-world scenarios. Voice data recorded by sales representatives is typically unlabeled, contains complex regional accents, and includes numerous proprietary brand names and colloquial terms. Such low-quality data is not directly usable or easily cleaned for effective model fine-tuning, posing a major obstacle to improving ASR accuracy in CRM applications.
To address these challenges, the researchers propose a weak-supervision framework for fine-tuning industry-specific ASR models. The core idea is to combine a small, high-quality labeled dataset with Large Language Models (LLMs) and Text-to-Speech (TTS) models to generate large, high-quality synthetic datasets at minimal human and financial cost. These synthetic datasets can then be used directly to fine-tune ASR models, which the paper reports significantly improves their performance in industrial applications.
The framework operates in two main parts: data expansion and data filtering. For data expansion, an LLM (specifically DeepSeek V2) is used to generate synthetic text labels by imitating expressions from original data and incorporating industry-specific keywords. These keywords are obtained by crawling and cleaning relevant industry data from social media. Once the synthetic text labels are generated, an advanced TTS model (ChatTTS) synthesizes speech, simulating various complex real-world scenarios like dialects and accents. This process yields a large amount of synthetic speech data with corresponding labels.
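To make the text-generation half of the data expansion step more concrete, here is a minimal sketch of how a prompt for the LLM might be assembled from seed sentences and crawled keywords. The function name `build_expansion_prompt`, its parameters, and the prompt wording are all hypothetical; the article does not describe the actual prompt sent to DeepSeek V2, and the call to the LLM and to ChatTTS is deliberately left out.

```python
from typing import Sequence


def build_expansion_prompt(
    seed_sentences: Sequence[str],
    keywords: Sequence[str],
    n_variants: int = 5,
) -> str:
    """Assemble an illustrative prompt asking an LLM for new transcripts.

    Hypothetical helper: the wording below is an assumption, not the
    prompt used in the paper.
    """
    if not seed_sentences:
        raise ValueError("at least one seed sentence is required")
    if n_variants < 1:
        raise ValueError("n_variants must be positive")

    # Seed sentences come from the small labeled dataset.
    examples = "\n".join(f"- {s}" for s in seed_sentences)
    # Keywords come from the crawled and cleaned industry data.
    terms = ", ".join(keywords)

    return (
        "You are helping to create training text for a speech recognizer.\n"
        f"Write {n_variants} new sentences that imitate the style of the "
        "examples below.\n"
        f"Each sentence must use at least one of these terms: {terms}\n"
        "Examples:\n"
        f"{examples}\n"
    )


prompt = build_expansion_prompt(
    seed_sentences=["The customer asked about the new handbag."],
    keywords=["Gucci", "Louis Vuitton"],
    n_variants=3,
)
print(prompt)
```

Each sentence returned by the LLM would then serve as a synthetic text label and as the input to the TTS model.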
Following data expansion, a data filtering step ensures the quality of the synthetic data. A pre-trained ASR model (such as Whisper-large-v2) transcribes every synthetic audio clip. Each inferred transcript is then compared with the text label the clip was generated from, using the Character Error Rate (CER) metric. Samples that do not meet a predefined quality threshold are excluded, so that only high-quality synthetic data is used to fine-tune the ASR model. For the fine-tuning stage, LoRA (Low-Rank Adaptation) is employed to reduce computational overhead while maintaining performance.
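The filtering step above can be sketched in a few lines of Python. CER is computed here in the standard way, as the character-level Levenshtein distance divided by the length of the reference text. The `max_cer` value of 0.1 and the `transcribe` callable are assumptions made for illustration: the article does not state the threshold used, and in practice `transcribe` would wrap a pre-trained ASR model rather than the lookup table used below.

```python
from typing import Callable, Iterable, List, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between ref and hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character into ref
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def filter_synthetic(
    samples: Iterable[Tuple[str, str]],
    transcribe: Callable[[str], str],
    max_cer: float = 0.1,  # assumed threshold, not taken from the paper
) -> List[Tuple[str, str]]:
    """Keep (audio, label) pairs whose ASR transcript is close to the label."""
    kept = []
    for audio, label in samples:
        if cer(label, transcribe(audio)) <= max_cer:
            kept.append((audio, label))
    return kept


# Stand-in for a real ASR model: maps hypothetical file names to transcripts.
fake_asr = {"a.wav": "你好世界", "b.wav": "欢迎观看"}
samples = [("a.wav", "你好世界"), ("b.wav", "欢迎光临")]
kept = filter_synthetic(samples, fake_asr.__getitem__)
print(kept)
```

Because the label is the text that the TTS model was asked to speak, a high CER suggests that the synthesized audio is unintelligible or does not match its label, which is why such samples are dropped.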
The enhanced CRM pipeline, benefiting from this ASR technology, is designed for businesses with complex customer interactions, such as luxury goods retail, financial services, healthcare, and telecommunication. The system prioritizes obtaining accurate customer insights from sales personnel. The process involves: voice input capture by retail staff via a mobile app, speech-to-text conversion using ASR, data extraction and classification of customer portrait labels using Natural Language Processing (NLP), integration of organized customer data with product data, decision support for personalized pricing and recommendations, and continuous iterative learning and improvement. The research specifically focuses on optimizing the speech-to-text conversion step.
To evaluate the performance of ASR models in complex industrial scenarios, especially those involving multilingual mixed speech (e.g., Chinese with brand names), the researchers proposed a new metric called the Integrated Error Rate (IER). Unlike traditional Word Error Rate (WER) or Character Error Rate (CER), IER combines the advantages of both and incorporates keyword recognition accuracy. This provides a more objective and comprehensive evaluation for hybrid speech recognition tasks, where brand names might retain their original pronunciation without translation.
Experimental results demonstrated the effectiveness of the proposed framework. Three versions of the Whisper model (medium, large-v2, large-v3) were fine-tuned on several datasets: small real labeled datasets (GUCCI100, LV100) and large synthetic datasets generated by the framework (GUCCIChatTTS, LVChatTTS, and a merged version). The fine-tuned ASR models showed substantial performance improvements. Notably, the Whisper-large-v2 model, when fine-tuned on the merged synthetic dataset, achieved the lowest CER. The newer Whisper-large-v3 model did not perform as well as expected and sometimes repeated characters or phrases abnormally, possibly because of biases learned from the synthetic data or inherent instability. The IER metric also confirmed Whisper-large-v2 as the top performer.
This research presents a significant step towards making ASR models more effective and accurate for industry-level CRM systems, especially in environments with limited labeled data and complex speech patterns. The proposed weak supervision framework offers a cost-effective way to generate high-quality training data, paving the way for more personalized and efficient customer service. For more detailed information, you can refer to the full research paper: Weak Supervision Techniques towards Enhanced ASR Models in Industry-level CRM Systems.


