TLDR: A new study reveals significant demographic and geographic biases in open-source Large Language Models (LLMs) used for academic recommendations. Analyzing LLaMA-3.1-8B, Gemma-7B, and Mistral-7B with 360 simulated user profiles, researchers found LLMs disproportionately favor institutions in the Global North, reinforce gender stereotypes, and filter opportunities based on economic status. The paper introduces a novel evaluation framework, including Demographic Representation Score (DRS) and Geographic Representation Score (GRS), to quantify these biases. It concludes that simple user-side prompt engineering is insufficient to overcome these systemic issues, highlighting an urgent need for bias mitigation in educational AI to ensure equitable access to higher education globally.
Large Language Models (LLMs) are increasingly becoming a part of our daily lives, even extending their reach into critical areas like education planning. They promise personalized academic advice, but a recent study from the Indian Institute of Technology Madras raises a crucial question: are these AI systems perpetuating societal biases in their university and program recommendations?
The research paper, titled “Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations,” empirically examines geographic, demographic, and economic biases present in three popular open-source LLMs: LLaMA-3.1-8B, Gemma-7B, and Mistral-7B. Authored by Krithi Shailya, Akhilesh Kumar Mishra, Gokul S Krishnan, and Balaraman Ravindran, the study highlights a pressing concern for equitable access to higher education globally.
The team conducted an extensive analysis using 360 simulated user profiles, varying by gender, nationality, and economic status, generating over 25,000 recommendations. The findings reveal strong and concerning biases. Institutions in the Global North, particularly in the United States and the United Kingdom, are disproportionately favored, accounting for 52–80% of all recommendations. This creates a significant Western-centric bias, effectively making vast regions of the world, including major education hubs like India and Brazil, almost invisible in the LLMs’ recommendations.
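To make the setup concrete, here is a minimal sketch of how a grid of simulated profiles might be enumerated and how a country-share figure like the one above could be computed. The attribute values are placeholders: the gender and economic categories are the ones this article mentions, the nationality list is invented for illustration, and the function names `build_profiles` and `country_share` are hypothetical rather than taken from the paper.

```python
from itertools import product

# Illustrative attribute values only. The article does not list the study's
# exact categories, so these lists are assumptions.
GENDERS = ["female", "male", "transgender"]
ECONOMIC_CLASSES = ["low", "high"]
NATIONALITIES = ["India", "Brazil", "Nigeria", "Germany"]


def build_profiles(genders, classes, nationalities):
    """Return one profile dict per combination of the three attributes."""
    return [
        {"gender": g, "economic_class": c, "nationality": n}
        for g, c, n in product(genders, classes, nationalities)
    ]


def country_share(recommended_countries, favored):
    """Fraction of recommendations whose country is in the favored set."""
    if not recommended_countries:
        return 0.0
    hits = sum(1 for country in recommended_countries if country in favored)
    return hits / len(recommended_countries)


profiles = build_profiles(GENDERS, ECONOMIC_CLASSES, NATIONALITIES)
```

Each profile would then be turned into a prompt, and the countries of the returned universities fed to `country_share` to measure how concentrated the recommendations are.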
Beyond geography, the study uncovered pervasive gender stereotypes. Female profiles were frequently steered toward social sciences and development studies, while male profiles received suggestions predominantly in engineering and computer science. Transgender profiles faced an even starker bias: they were often recommended programs like gender studies and social work, even when their stated interest was in fields like computer science. This suggests that the models’ stereotypical associations override a user’s stated interests and skills, undermining the core purpose of a recommendation system.
Economic status also played a significant role, with recommendations correlating directly with institutional prestige. High-class profiles were often directed to universities with high reputation scores but low accessibility, while low-class profiles saw a substantial drop in the reputation of recommended institutions. The researchers term this “digital gatekeeping,” where LLMs preemptively filter out top-tier opportunities for lower-income backgrounds, despite the existence of numerous scholarship opportunities.
To quantify these complex issues, the researchers propose a novel, multi-dimensional evaluation framework. This framework goes beyond simple accuracy, measuring fairness through two key metrics:
Demographic Representation Score (DRS)
The DRS assesses how well recommendations align with a student’s background. It comprises three sub-metrics:
- Socio-Economic Accessibility: Measures the fit between a student and a university based on geographic distance and economic class, reflecting the decay of educational opportunity over socio-economic distance.
- Reputation Alignment: Quantifies institutional prestige using global ranking systems like QS World University Rankings.
- Academic Alignment: Measures the curricular fit between a student’s interests and the university’s program offerings, using a subject-tag taxonomy.
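The three sub-metrics above can be sketched roughly as follows. This is not the paper's formula: the exponential decay form, the linear rank scaling, the Jaccard overlap, and the equal-weight average are all assumptions made for illustration, as are the parameter names `distance_scale_km` and `worst_rank`.

```python
import math


def accessibility(distance_km, class_gap, distance_scale_km=5000.0):
    """Exponential decay over a combined socio-economic distance (assumed form)."""
    return math.exp(-(distance_km / distance_scale_km + class_gap))


def reputation(rank, worst_rank=1000):
    """Map a ranking position to [0, 1]; rank 1 scores 1.0 (assumed scaling)."""
    rank = min(max(rank, 1), worst_rank)
    return 1.0 - (rank - 1) / (worst_rank - 1)


def academic_alignment(student_tags, program_tags):
    """Jaccard overlap between student interest tags and program tags."""
    s, p = set(student_tags), set(program_tags)
    if not s or not p:
        return 0.0
    return len(s & p) / len(s | p)


def drs(distance_km, class_gap, rank, student_tags, program_tags):
    """Unweighted mean of the three sub-scores (the weighting is an assumption)."""
    parts = (
        accessibility(distance_km, class_gap),
        reputation(rank),
        academic_alignment(student_tags, program_tags),
    )
    return sum(parts) / len(parts)
```

Under these assumptions, a nearby, top-ranked university whose programs match the student's tags scores close to 1, while a distant or poorly matched one scores lower.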
Geographic Representation Score (GRS)
The GRS evaluates the overall set-level representation and quality of recommended universities within the global higher education landscape. Its components include:
- Normalized Representation: A ratio that measures the proportion of a country’s universities recommended by a model, adjusted for the relative size of its higher education sector. This helps prevent the dominance of countries with large academic systems.
- Reputational Coverage: Ensures that the representation of a country is not achieved by recommending only low-quality institutions, rewarding models that suggest reputable universities within a given nation.
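A rough sketch of how the two components might combine for a single country is shown below. The exact definitions are the paper's; here the normalization is assumed to be a ratio of shares, reputational coverage is assumed to be a mean reputation in [0, 1], and the final score is assumed to be their product. All function and parameter names are hypothetical.

```python
def normalized_representation(rec_count, total_recs, country_unis, world_unis):
    """Country's share of recommendations divided by its share of universities."""
    if total_recs == 0 or country_unis == 0 or world_unis == 0:
        return 0.0
    rec_share = rec_count / total_recs
    sector_share = country_unis / world_unis
    return rec_share / sector_share


def reputational_coverage(reputation_scores):
    """Mean reputation (each in [0, 1]) of the country's recommended universities."""
    if not reputation_scores:
        return 0.0
    return sum(reputation_scores) / len(reputation_scores)


def grs(rec_reputations, total_recs, country_unis, world_unis):
    """Product of the two components (the combination rule is an assumption)."""
    nr = normalized_representation(
        len(rec_reputations), total_recs, country_unis, world_unis
    )
    return nr * reputational_coverage(rec_reputations)
```

With this formulation, a country that never appears in a model's output gets a score of exactly zero regardless of how large its higher education sector is.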
The study also explored whether simple user-side prompt engineering, such as adding a “regionally-accessible” constraint to the prompt, could mitigate these systemic biases. The results were mixed and often unpredictable. While some previously underrepresented nations gained visibility, the overall Demographic Representation Score often decreased due to a significant drop in university reputation. Crucially, major developing nations like India and Brazil still received a GRS of zero across all models, even with regional constraints. This indicates that user-side prompts alone are insufficient to overcome deep-seated knowledge gaps and biases within these models.
Among the tested models, LLaMA-3.1-8B achieved the highest diversity, recommending 481 unique universities across 58 countries, making it the most globally representative. However, systemic disparities persisted even in this model, and Gemma-7B performed the worst in terms of global representation.
The findings underscore the urgent need for bias consideration in educational LLMs to ensure equitable global access to higher education. The proposed framework offers a replicable method for practitioners to understand and address what a model lacks, guiding future research in bias mitigation strategies like fairness-aware losses or enriched non-Western training corpora. The principles of this framework are also adaptable to other high-stakes recommendation domains, such as job matching services or healthcare provider selection, where balancing user constraints, domain expertise, and population-level diversity is equally critical.
For a deeper dive into the methodology and detailed results, see the full research paper.


