TLDR: A new research paper reveals that while Large Language Models (LLMs) are increasingly used for accessibility guidance, they often fail to provide comprehensive and balanced support across all disability types. The study, which audited 17 LLMs, found that Vision, Hearing, and Mobility impairments are frequently addressed, but categories like Speech, Genetic/Developmental, Sensory-Cognitive, and Mental Health remain significantly underserved in terms of both coverage and depth of information. The research introduces a framework to systematically evaluate these gaps and demonstrates that taxonomy-aware prompting can improve LLM inclusivity.
Large Language Models (LLMs) are rapidly changing how we access information and interact with technology. They hold immense potential for the more than 1.3 billion people worldwide living with disabilities, for whom accessible design is a necessity rather than an option. As LLMs become more integrated into daily life through assistants and content tools, they increasingly shape the accessibility guidance people receive.
However, a crucial question arises: do these powerful AI systems provide comprehensive and balanced support across the full spectrum of disability needs? A recent research paper, “Who Gets Left Behind? Auditing Disability Inclusivity in Large Language Models”, by Deepika Dash, Yeshil Bangera, Mithil Bangera, Gouthami Vadithya, and Srikant Panda, delves into this very question, revealing significant inclusivity gaps in current LLM offerings.
The Challenge: Uneven Support for Disability Groups
While existing research has touched upon NLP fairness concerning demographics like gender and race, or focused on detecting toxic language, a systematic audit of LLMs’ inclusivity across various disability types has been largely missing. Previous studies often centered on narrow contexts or single disability categories, failing to capture the broader picture of balanced support.
The authors argue that a truly inclusive system should offer recommendations for a wide range of needs, including vision, hearing, speech, mobility, neurological, genetic/developmental, learning, sensory-cognitive, and mental health. For instance, a response to a question like “How can hospitals be made more accessible?” that overlooks several of these categories risks providing inequitable guidance.
A New Framework for Auditing Inclusivity
To address this gap, the researchers developed a taxonomy-aligned benchmark of human-validated, general-purpose accessibility questions. This framework systematically audits inclusivity across nine distinct disability categories, aligning with established accessibility frameworks like the WHO International Classification of Functioning, Disability and Health (ICF).
The benchmark evaluates LLMs along three key dimensions:
- Question-Level Coverage Score (QLCS): This measures the breadth of an LLM’s answer to a single question, indicating how many of the nine disability categories are addressed within that response.
- Disability-Level Coverage Score (DLCS): This assesses the balance across the nine disability categories, showing how frequently a specific disability type is covered across all evaluation questions.
- Depth: This dimension evaluates the specificity and quality of support provided. A score from 0 (not mentioned) to 3 (multiple details, examples, or nuanced explanation) is assigned to each category mention.
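To make the three dimensions concrete, here is a minimal Python sketch of how such scores could be computed from annotated responses. It is one plausible reading of the descriptions above, not the paper's exact formulas: the function names, the category keys, and the sample annotations are all hypothetical, and the sketch assumes each response has already been labeled with a 0 to 3 depth score per category.

```python
from statistics import mean

# The nine categories named in the article; the short keys are our own labels.
CATEGORIES = [
    "vision", "hearing", "speech", "mobility", "neurological",
    "genetic_developmental", "learning", "sensory_cognitive", "mental_health",
]


def qlcs(depths):
    """Breadth for one response: share of categories with a depth above 0."""
    covered = sum(1 for c in CATEGORIES if depths.get(c, 0) > 0)
    return covered / len(CATEGORIES)


def dlcs(all_depths, category):
    """Balance for one category: share of responses that mention it at all."""
    hits = sum(1 for d in all_depths if d.get(category, 0) > 0)
    return hits / len(all_depths)


def mean_depth(all_depths, category):
    """Average 0-3 depth score for one category across all responses."""
    return mean([d.get(category, 0) for d in all_depths])


# Hypothetical annotations for three responses (category -> depth score).
# Categories missing from a dict count as depth 0 (not mentioned).
responses = [
    {"vision": 3, "hearing": 2, "mobility": 3},
    {"vision": 2, "mobility": 1, "learning": 1},
    {"vision": 1, "hearing": 1, "mobility": 2, "mental_health": 1},
]

for category in CATEGORIES:
    print(category, round(dlcs(responses, category), 2), mean_depth(responses, category))
```

In this toy data, Vision and Mobility appear in every response while Speech never appears, which mirrors the kind of imbalance the audit is designed to surface.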
Key Findings: Who Gets Left Behind?
Applying this framework to 17 proprietary and open-weight LLMs revealed persistent inclusivity gaps:
- Uneven Coverage: Vision, Hearing, and Mobility impairments are frequently addressed by LLMs, often showing high coverage rates.
- Underserved Categories: Speech, Genetic/Developmental, Sensory-Cognitive, and Mental Health categories consistently remain underserved. Responses related to these areas are often sparse or entirely absent.
- Lack of Depth: Even when multiple disabilities are mentioned, the depth of explanation is often concentrated in a few categories (primarily Vision and Mobility), while others receive minimal to no detailed guidance. Categories like Neurological, Genetic/Developmental, and Mental Health are almost never addressed with significant depth.
For example, the study found that even the best-performing models covered only about half of the relevant disability categories per question. Smaller models, in particular, showed weaker coverage and negligible depth outside of Vision and Mobility.
Mitigation Strategies and Future Directions
The research also explored mitigation strategies, specifically the effect of prompt design. By introducing “accessibility awareness prompting” – essentially instructing the LLM to act as an accessibility expert and cover all major disability categories – the researchers observed consistent gains in Question-Level Coverage Scores across mid-sized models. This structured prompting helped narrow gaps in previously underrepresented categories while maintaining strong performance in well-covered domains.
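As a rough illustration of this kind of prompt, the Python sketch below wraps a user question in an instruction that names every category. The wording is our own assumption and is not the prompt used in the paper; `build_aware_prompt` is a hypothetical helper, and the category names are taken from the list given earlier in this article.

```python
# The nine categories named in the article.
CATEGORIES = [
    "Vision", "Hearing", "Speech", "Mobility", "Neurological",
    "Genetic/Developmental", "Learning", "Sensory-Cognitive", "Mental Health",
]


def build_aware_prompt(question: str) -> str:
    """Wrap a question in an instruction that lists every category.

    Hypothetical wording, for illustration only.
    """
    category_list = "\n".join(f"- {name}" for name in CATEGORIES)
    return (
        "You are an accessibility expert. When answering, give concrete "
        "recommendations for each of these disability categories:\n"
        f"{category_list}\n\n"
        f"Question: {question}"
    )


prompt = build_aware_prompt("How can hospitals be made more accessible?")
print(prompt)
```

The resulting string can be sent to any chat model in place of the bare question. Listing the categories explicitly, rather than asking for "all disabilities", gives the model a checklist to work through.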
The findings underscore the critical need for taxonomy-aware training and evaluation practices for LLMs. By understanding who gets left behind in current AI-based accessibility support, developers can focus on building more responsible and inclusive language technologies that truly align with global accessibility standards and equitably serve all users.
The authors acknowledge limitations, including the dataset’s scale and focus on English-language, general-purpose questions. Future work aims to explore multilingual benchmarks, conversational accessibility scenarios, and human-in-the-loop validation to further strengthen reliability and factual accuracy.


