LIR-ASR: Enhancing Speech Recognition Accuracy with a Human-Inspired LLM Framework

TLDR: LIR-ASR is a framework that uses Large Language Models (LLMs) and a “Listening-Imagining-Refining” strategy, inspired by human auditory perception, to correct Automatic Speech Recognition (ASR) errors. It combines heuristic optimization, controlled by a finite state machine, with rule-based constraints to generate phonetic variants and refine them in context, which keeps corrections from drifting away from what was actually said. In experiments on English and Chinese ASR outputs, LIR-ASR reduces Character Error Rate (CER) and Word Error Rate (WER) by up to 1.5 percentage points on average.

Automatic Speech Recognition (ASR) systems have become ubiquitous, but they still frequently make errors that can impact various applications. These inaccuracies often stem from environmental noise, overlapping speech, unusual words, and diverse speaker accents. While large audio models have significantly improved ASR robustness, the challenge of correcting these persistent errors remains.

A new research paper titled “Listening, Imagining & Refining: A Heuristic Optimized ASR Correction Framework with LLMs” introduces LIR-ASR, a novel framework designed to tackle these ASR inaccuracies. Developed by Yutong Liu, Ziyue Zhang, Yongbin Yu, Xiangxiang Wang, Yuqing Cai, and Nyima Tashi, LIR-ASR draws inspiration from how humans process and correct misheard information.

The Human-Inspired Approach: Listening, Imagining, Refining

The core of LIR-ASR is its “Listening-Imagining-Refining” (LIR) strategy, which mimics human auditory perception. When we suspect we’ve misheard something, we instinctively consider phonetically similar alternatives and then evaluate them within the broader context to find the most accurate interpretation. LIR-ASR translates this into three phases:

Listening: The system first interprets the initial, potentially erroneous ASR output.
Imagining: It then generates plausible phonetic variants for words that are uncertain or likely incorrect. This phase incorporates a heuristic optimization with controlled randomness to explore a wider range of possible corrections.
Refining: Finally, these variants are evaluated within the sentence’s context to identify the most accurate transcription. Rule-based constraints are applied here to ensure semantic consistency and prevent the system from introducing new, linguistically plausible but incorrect substitutions.

Overcoming Common Correction Challenges

Traditional ASR correction methods often struggle with errors that appear contextually plausible, interdependent mistakes, and the generation of semantically inconsistent corrections. LIR-ASR addresses these by:

Employing an iterative heuristic optimization strategy guided by a finite state machine (FSM). This FSM dynamically controls the search for corrections, preventing the process from getting stuck in suboptimal solutions and allowing interdependent errors to be resolved progressively.
Integrating rule-based constraints that filter out unreliable candidates. These rules ensure phonetic similarity to the original words and maintain consistency in length and structure, guiding the Large Language Model (LLM) to produce more faithful corrections.
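
To make the idea of rule-based filtering concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names, both thresholds, and the use of character overlap (via the standard library's difflib) as a stand-in for phonetic similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def passes_rules(original: str, candidate: str,
                 max_len_ratio: float = 0.2,
                 min_similarity: float = 0.6) -> bool:
    """Return True if the candidate stays close to the original transcript.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    orig_words = original.lower().split()
    cand_words = candidate.lower().split()
    if not orig_words or not cand_words:
        return False

    # Length rule: the word count may change only by a small fraction.
    if abs(len(cand_words) - len(orig_words)) > max_len_ratio * len(orig_words):
        return False

    # Similarity rule: character overlap is a crude proxy for phonetic
    # similarity (a real system would more likely compare phonemes or pinyin).
    similarity = SequenceMatcher(None, original.lower(), candidate.lower()).ratio()
    return similarity >= min_similarity


def filter_candidates(original: str, candidates: list[str]) -> list[str]:
    """Keep only the candidates that satisfy every rule."""
    return [c for c in candidates if passes_rules(original, c)]
```

With the original "the whether is nice today", this filter keeps "the weather is nice today" but rejects "it is sunny", which is fluent yet too far from what the recognizer heard.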

How LIR-ASR Works

The framework consists of two main architectural components: a Finite State Machine (FSM) and a heuristic optimization module. The FSM manages the search strategy, switching among the ‘No Search’, ‘Search’, and ‘Search++’ states depending on whether the previous iteration found an improvement. This balances exploring new possibilities against refining existing ones.
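
The three states can be sketched as a small state machine. The state names come from the description above, but the transition rule is an assumption: the article says only that transitions depend on whether improvements are found, so this sketch falls back to the cheapest state after an improvement and widens the search otherwise.

```python
from enum import Enum


class SearchState(Enum):
    NO_SEARCH = "No Search"
    SEARCH = "Search"
    SEARCH_PLUS = "Search++"


# Hypothetical escalation order; the article names the states
# but does not spell out the transitions between them.
_ESCALATE = {
    SearchState.NO_SEARCH: SearchState.SEARCH,
    SearchState.SEARCH: SearchState.SEARCH_PLUS,
    SearchState.SEARCH_PLUS: SearchState.SEARCH_PLUS,
}


def next_state(state: SearchState, improved: bool) -> SearchState:
    """Assumed rule: return to 'No Search' after an improvement,
    otherwise widen the search by one level."""
    return SearchState.NO_SEARCH if improved else _ESCALATE[state]


def trace(improvements: list[bool],
          start: SearchState = SearchState.NO_SEARCH) -> list[str]:
    """Replay a sequence of per-iteration outcomes and list the visited states."""
    states = [start]
    for improved in improvements:
        states.append(next_state(states[-1], improved))
    return [s.value for s in states]
```

Under this assumed rule, `trace([False, False, True, False])` visits No Search, Search, Search++, No Search, and Search in that order: two failed iterations escalate the search, and one success resets it.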

The heuristic optimization process involves several steps: neighbor generation (creating candidate transcripts through phonetic conversions and similar-sounding substitutions), correction (LLMs correct each candidate), candidate fusion (combining the best aspects of multiple corrected candidates), rule constraints (filtering out inconsistent options), and scoring (LLMs assign a score and reasoning to each candidate). According to the paper, this iterative process improves transcript quality monotonically until it reaches a stable solution.
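
The loop described above can be sketched with toy stand-ins. Everything here is hypothetical: the homophone table replaces real phonetic conversion, the `correct` and `score` callables replace LLM calls, the fusion rule (merge every word-level edit) is one simple choice among many, and the FSM is left out for brevity. Accepting a candidate only when its score strictly improves is one straightforward way to obtain the monotonic behavior the article describes.

```python
from collections import Counter
from typing import Callable

# Toy homophone table (assumption); a real system would use a phonetic lexicon.
HOMOPHONES = {"whether": ["weather"], "weather": ["whether"],
              "their": ["there"], "there": ["their"]}
# Toy scoring data (assumption); the paper uses an LLM to score candidates.
PLAUSIBLE_BIGRAMS = {("the", "weather"), ("over", "there")}


def generate_neighbors(transcript: str) -> list[str]:
    """Neighbor generation: swap one word for a similar-sounding one."""
    words = transcript.split()
    return [" ".join(words[:i] + [alt] + words[i + 1:])
            for i, word in enumerate(words)
            for alt in HOMOPHONES.get(word, [])]


def fuse(current: str, candidates: list[str]) -> str:
    """Candidate fusion: merge the word-level edits proposed by the candidates."""
    base = current.split()
    rows = [c.split() for c in candidates if len(c.split()) == len(base)]
    fused = list(base)
    for i, word in enumerate(base):
        edits = Counter(row[i] for row in rows if row[i] != word)
        if edits:
            fused[i] = edits.most_common(1)[0][0]
    return " ".join(fused)


def toy_score(transcript: str) -> int:
    """Stand-in scorer: count adjacent word pairs that look plausible."""
    words = transcript.split()
    return sum((a, b) in PLAUSIBLE_BIGRAMS for a, b in zip(words, words[1:]))


def refine(transcript: str,
           correct: Callable[[str], str],
           score: Callable[[str], float],
           max_iters: int = 5) -> str:
    best, best_score = transcript, score(transcript)
    for _ in range(max_iters):
        candidates = [correct(n) for n in generate_neighbors(best)]
        if not candidates:
            break
        pool = candidates + [fuse(best, candidates)]
        # Rule constraint (simplified): keep the word count unchanged.
        pool = [c for c in pool if len(c.split()) == len(best.split())]
        if not pool:
            break
        top = max(pool, key=score)
        if score(top) <= best_score:
            break  # no strict improvement: treat the current transcript as stable
        best, best_score = top, score(top)
    return best
```

Calling `refine("the whether over their is nice", correct=lambda s: s, score=toy_score)` returns "the weather over there is nice". Each single-word neighbor fixes only one of the two errors, and the fusion step is what combines both fixes into one candidate.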

Experimental Results and Impact

Experiments were conducted on the FLEURS dataset, using Whisper-medium and Whisper-large-v3 as base ASR recognizers, and Qwen3-235B and DeepSeek-V3.1 as the LLMs for correction. LIR-ASR demonstrated significant improvements in accuracy, achieving average reductions in Character Error Rate (CER) and Word Error Rate (WER) of up to 1.5 percentage points compared to baseline methods.
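
For readers unfamiliar with the two metrics, the sketch below shows the standard way WER and CER are computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, counted in words for WER and in characters for CER. The function names are illustrative, and the paper's exact text normalization is not described here, so none is applied.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For the reference "the weather is nice today" and the hypothesis "the whether is nice today", one of five words is wrong, so WER is 0.2 (20%), while two character edits over 25 reference characters give a CER of 0.08 (8%).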

Notably, LIR-ASR combined with DeepSeek-V3.1 yielded the most substantial gains. In some cases, LIR-ASR applied to Whisper-medium output even surpassed the uncorrected Whisper-large-v3 baseline. An ablation study confirmed that each component contributes to these gains: rule-based constraints, multi-candidate generation, the FSM, and neighbor searching.

The convergence analysis showed that LIR-ASR progressively corrects errors over iterations, with improvements plateauing as it reaches a stable solution. This indicates reliable, stable correction behavior across both languages (English and Chinese) and both ASR backbones.

Conclusion

LIR-ASR represents a significant step forward in ASR error correction. By integrating human-inspired auditory processing with advanced LLMs, heuristic optimization, and robust rule-based constraints, it effectively handles complex recognition errors and maintains semantic fidelity. The researchers plan to extend LIR-ASR to low-resource languages like Tibetan, further validating its adaptability and potential for widespread impact.
