TLDR: LIR-ASR is a framework that uses Large Language Models (LLMs) and a “Listening-Imagining-Refining” strategy, inspired by human auditory perception, to correct Automatic Speech Recognition (ASR) errors. It uses heuristic optimization guided by a finite state machine, plus rule-based constraints, to generate and refine phonetic variants in context while avoiding common correction pitfalls and preserving the original meaning. In experiments on English and Chinese ASR outputs, LIR-ASR reduces Character Error Rate (CER) and Word Error Rate (WER) by up to 1.5 percentage points on average.
Automatic Speech Recognition (ASR) systems have become ubiquitous, but they still make frequent errors that affect the applications built on them. These inaccuracies often stem from environmental noise, overlapping speech, unusual words, and diverse speaker accents. While large audio models have significantly improved ASR robustness, the challenge of correcting these persistent errors remains.
A new research paper titled “Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs” introduces LIR-ASR, a novel framework designed to tackle these ASR inaccuracies. Developed by Yutong Liu, Ziyue Zhang, Yongbin Yu, Xiangxiang Wang, Yuqing Cai, and Nyima Tashi, LIR-ASR draws inspiration from how humans process and correct misheard information.
The Human-Inspired Approach: Listening, Imagining, Refining
The core of LIR-ASR is its “Listening-Imagining-Refining” (LIR) strategy, which mimics human auditory perception. When we suspect we’ve misheard something, we instinctively consider phonetically similar alternatives and then evaluate them within the broader context to find the most accurate interpretation. LIR-ASR translates this into three phases:
- Listening: The system first interprets the initial, potentially erroneous ASR output.
- Imagining: It then generates plausible phonetic variants for words that are uncertain or likely incorrect. This phase incorporates a heuristic optimization with controlled randomness to explore a wider range of possible corrections.
- Refining: Finally, these variants are evaluated within the sentence’s context to identify the most accurate transcription. Rule-based constraints are applied here to ensure semantic consistency and prevent the system from introducing new, linguistically plausible but incorrect substitutions.
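The Imagining phase can be pictured with a minimal Python sketch. It assumes a hand-written sound-alike table (`SOUND_ALIKES`) and a seeded random generator as the source of controlled randomness. Both are illustrative stand-ins rather than the paper's actual variant generator, and the function name `imagine` is hypothetical.

```python
import random

# Hypothetical sound-alike table; a real system would derive variants
# from a phonetic representation rather than a hand-written dictionary.
SOUND_ALIKES = {
    "knight": ["night", "nite"],
    "sea": ["see", "c"],
    "there": ["their", "they're"],
}


def imagine(transcript, k=3, seed=0):
    """Return up to k variants, each replacing one word with a sound-alike."""
    rng = random.Random(seed)  # fixed seed keeps the randomness reproducible
    words = transcript.split()
    # Positions whose word has at least one known sound-alike.
    slots = [i for i, w in enumerate(words) if w.lower() in SOUND_ALIKES]
    if not slots:
        return []
    variants = set()
    for _ in range(k * 4):  # bounded number of sampling attempts
        i = rng.choice(slots)
        alt = rng.choice(SOUND_ALIKES[words[i].lower()])
        variants.add(" ".join(words[:i] + [alt] + words[i + 1:]))
        if len(variants) == k:
            break
    return sorted(variants)


variants = imagine("I will sea you every knight")
print(variants)
```

Each variant changes a single word, so the Refining phase can judge every substitution against an otherwise unchanged sentence.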
Overcoming Common Correction Challenges
Traditional ASR correction methods often struggle with three problems: errors that appear contextually plausible, mistakes that depend on one another, and corrections that are semantically inconsistent with the original. LIR-ASR addresses these by:
- Employing an iterative heuristic optimization strategy guided by a finite state machine (FSM). This FSM dynamically controls the search for corrections, preventing the process from getting stuck in suboptimal solutions and allowing interdependent errors to be resolved progressively.
- Integrating rule-based constraints that filter out unreliable candidates. These rules ensure phonetic similarity to the original words and maintain consistency in length and structure, guiding the Large Language Model (LLM) to produce more faithful corrections.
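A rule filter of this kind might look like the sketch below. It is only an approximation under stated assumptions: character-level similarity from Python's `difflib` stands in for true phonetic similarity, and the thresholds `min_similarity` and `max_len_change` are hypothetical values, not ones reported by the authors.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1]; a crude proxy for phonetic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def passes_rules(original, candidate, min_similarity=0.7, max_len_change=0.2):
    """Accept a candidate only if it stays close to the original transcript."""
    orig_words = original.split()
    cand_words = candidate.split()
    if not orig_words:
        return False
    # Length rule: the word count may change only by a small fraction.
    len_change = abs(len(cand_words) - len(orig_words)) / len(orig_words)
    if len_change > max_len_change:
        return False
    # Similarity rule: the candidate must still resemble the original text.
    return similarity(original, candidate) >= min_similarity


def filter_candidates(original, candidates, **thresholds):
    """Keep only the candidates that satisfy every rule."""
    return [c for c in candidates if passes_rules(original, c, **thresholds)]


asr_output = "I scream for ice cream every knight"
candidates = [
    "I scream for ice cream every night",  # small, sound-alike change
    "We all love dessert",  # fluent, but a rewrite rather than a correction
    "I scream for ice cream every single night of the week",  # adds content
]
kept = filter_candidates(asr_output, candidates)
print(kept)
```

The second and third candidates are fluent sentences, which is exactly why an unconstrained LLM might propose them. The length rule rejects both because they no longer track the words that were actually recognized.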
How LIR-ASR Works
The framework consists of two main architectural components: a Finite State Machine (FSM) and a heuristic optimization module. The FSM manages the search strategy, switching among the ‘No Search’, ‘Search’, and ‘Search++’ states based on whether improvements are found. This ensures a balance between exploring new possibilities and refining existing ones.
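The state names come from the paper, but this summary does not spell out the transition rules. The sketch below therefore assumes one plausible policy: escalate the search one level whenever an iteration brings no improvement, fall back to ‘No Search’ as soon as an improvement appears, and stop when even ‘Search++’ finds nothing. The function names `next_state` and `trace` are hypothetical.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    NO_SEARCH = "No Search"
    SEARCH = "Search"
    SEARCH_PLUS = "Search++"


def next_state(state: State, improved: bool) -> Optional[State]:
    """Assumed transition policy; returns None when the process should stop."""
    if improved:
        return State.NO_SEARCH  # keep refining without extra exploration
    if state is State.NO_SEARCH:
        return State.SEARCH  # no gain: start exploring neighbors
    if state is State.SEARCH:
        return State.SEARCH_PLUS  # still no gain: widen the search
    return None  # the widest search found nothing, so stop


def trace(outcomes, start=State.NO_SEARCH):
    """Replay a list of improved / not-improved outcomes through the FSM."""
    state, visited = start, [start]
    for improved in outcomes:
        state = next_state(state, improved)
        if state is None:
            break
        visited.append(state)
    return [s.value for s in visited]


print(trace([False, False, True, False]))
```

Under this assumed policy the more expensive search levels are used only when cheaper ones stall, which is one way to realize the balance between exploring and refining described above.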
The heuristic optimization process involves several steps:
- Neighbor generation: creating candidate transcripts through phonetic conversions and similar-sounding substitutions.
- Correction: LLMs correct each candidate.
- Candidate fusion: combining the best aspects of multiple corrected candidates.
- Rule constraints: filtering out inconsistent options.
- Scoring: LLMs assign a score and reasoning to each candidate.
The process repeats, with transcript quality improving monotonically until a stable solution is reached.
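The steps above can be sketched as a simple accept-only-if-better loop. This is a toy illustration, not the authors' implementation: the LLM correction and scoring calls are replaced by plain Python callables, candidate fusion is omitted, and every name (`refine`, `toy_neighbors`, `toy_score`, `same_length`) is hypothetical.

```python
# Toy stand-ins for the phonetic neighbor generator and the LLM scorer.
HOMOPHONES = {"knight": ["night"], "sea": ["see"]}
OUT_OF_CONTEXT = {"knight", "sea"}  # words the toy scorer treats as errors


def toy_neighbors(text):
    """Generate candidates by swapping one word for a sound-alike."""
    words = text.split()
    out = []
    for i, w in enumerate(words):
        for alt in HOMOPHONES.get(w, []):
            out.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return out


def toy_score(text):
    """Higher is better: penalize each word flagged as out of context."""
    return -sum(w in OUT_OF_CONTEXT for w in text.split())


def same_length(original, candidate):
    """Rule constraint: the word count must not change."""
    return len(original.split()) == len(candidate.split())


def refine(transcript, neighbors, correct, accept, score, max_iters=10):
    """Iterate until no candidate scores strictly higher than the current best."""
    best, best_score = transcript, score(transcript)
    for _ in range(max_iters):
        candidates = [correct(n) for n in neighbors(best)]
        candidates = [c for c in candidates if accept(best, c)]
        if not candidates:
            break
        top = max(candidates, key=score)
        if score(top) <= best_score:
            break  # stable solution: nothing improves on the current best
        best, best_score = top, score(top)
    return best


result = refine(
    "I will sea you every knight",
    neighbors=toy_neighbors,
    correct=lambda text: text,  # identity stands in for the LLM correction step
    accept=same_length,
    score=toy_score,
)
print(result)
```

Because a candidate replaces the current transcript only when it scores strictly higher, the score in this sketch can never decrease between iterations. The two errors in the toy input are fixed one per iteration, which mirrors the progressive resolution of interdependent errors described earlier.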
Experimental Results and Impact
Experiments were conducted on the FLEURS dataset, using Whisper-medium and Whisper-large-v3 as base ASR recognizers, and Qwen3-235B and DeepSeek-V3.1 as the LLMs for correction. LIR-ASR demonstrated significant improvements in accuracy, achieving average reductions in Character Error Rate (CER) and Word Error Rate (WER) of up to 1.5 percentage points compared to baseline methods.
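For readers unfamiliar with the metrics, WER and CER are both edit distances normalized by reference length: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words (WER) or characters (CER). A reduction of 1.5 percentage points is an absolute drop, for example from a hypothetical 10.0% to 8.5%. The sketch below shows the standard computation; the example sentences are made up for illustration and are not drawn from FLEURS.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


word_error = wer("the cat sat on the mat", "the cat sat on a mat")
char_error = cer("kitten", "sitting")
print(f"WER = {word_error:.3f}, CER = {char_error:.3f}")
```

CER is the usual choice for Chinese, where word boundaries are not marked by spaces, while WER is standard for English.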
Notably, LIR-ASR combined with DeepSeek-V3.1 yielded the largest gains. In some cases, LIR-ASR on Whisper-medium even surpassed the uncorrected Whisper-large-v3 baseline, highlighting its effectiveness. An ablation study confirmed that each component matters, in particular the rule-based constraints, multi-candidate generation, the FSM, and neighbor searching.
The convergence analysis showed that LIR-ASR progressively corrects errors over iterations, with improvements plateauing as it reaches a stable, accurate solution. This indicates reliable, stable correction behavior across both languages (English and Chinese) and both ASR backbones.
Conclusion
LIR-ASR represents a significant step forward in ASR error correction. By integrating human-inspired auditory processing with advanced LLMs, heuristic optimization, and robust rule-based constraints, it effectively handles complex recognition errors and maintains semantic fidelity. The researchers plan to extend LIR-ASR to low-resource languages such as Tibetan to further test its adaptability.


