
Assessing AI’s Role as Student Simulators in Education

TLDR: A study investigates whether Large Language Models (LLMs) can accurately simulate the abilities of real students in mathematics and reading comprehension. Applying Item Response Theory (IRT) to NAEP data from grades 4, 8, and 12, the researchers found that, without specific guidance, LLMs often outperform the average student. Grade-enforced prompts can shift performance, but reliable alignment with average student ability is highly model- and prompt-specific, and no single model-prompt pair works consistently across subjects and grades. The paper concludes with guidelines for selecting viable LLM ‘proxy students’ and highlights the need for new training and evaluation strategies.

Large Language Models (LLMs) are rapidly transforming various fields, and education is no exception. These powerful AI systems are increasingly being explored as ‘proxy students’ for developing Intelligent Tutoring Systems (ITSs) and for piloting new test questions. However, a crucial question remains: how accurately can these AI proxies truly mimic the behavior and characteristics of real students?

The Challenge of Traditional Evaluation

Traditionally, evaluating educational tools like tutors or assessments requires testing them on diverse student populations. This process is incredibly resource-intensive, especially in areas with limited teachers and infrastructure. Current methods, such as teacher-led evaluations or static logs, often struggle to scale or capture the dynamic interactions needed for new educational materials. This has led to a growing interest in using LLMs as a scalable alternative for rigorous and equitable evaluation.

A Novel Approach: Item Response Theory (IRT)

To assess how reliably LLMs can stand in for students, a recent study employed Item Response Theory (IRT), a well-established framework in educational measurement. IRT places both the ability of test-takers (in this case, LLMs and real students) and the difficulty of individual test items on a shared scale, providing a quantitative way to compare LLM performance directly with human student performance.
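
For readers unfamiliar with IRT, here is a minimal sketch of the two-parameter logistic (2PL) model; the article does not specify which IRT variant or parameter values the study used, so the numbers below are purely illustrative.

```python
import numpy as np

def p_correct(theta: float, a: float, b: float) -> float:
    """2PL IRT: probability that a test-taker with ability theta answers
    an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Illustrative item: moderate difficulty (b = 0.5), typical discrimination (a = 1.2).
for theta in (-1.0, 0.0, 1.5):
    print(f"ability {theta:+.1f} -> P(correct) = {p_correct(theta, 1.2, 0.5):.2f}")
```

Because LLMs and students answer the same items, their estimated abilities land on the same scale and can be compared directly.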

The researchers compiled a unique dataset of 489 multiple-choice questions from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension for grades 4, 8, and 12. This dataset included anonymized aggregate response patterns from real students, allowing for a direct comparison with LLM responses.

Exploring LLM Performance Under Different Conditions

The study evaluated 11 diverse LLMs under two main conditions:

  • Unenforced Prompting: LLMs were given questions without any specific instructions to mimic human behavior.
  • Grade-Level Mimicking: LLMs were explicitly instructed to act as an average student in a specific grade (4, 8, or 12) using various prompting strategies, ranging from minimal guidance to more detailed ‘Chain of Thought’ (CoT) prompts; illustrative templates appear after this list.
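
The article does not quote the study’s exact prompts, so the templates below are an assumption meant only to illustrate the difference between the two conditions.

```python
# Hypothetical prompt templates; the wording is not taken from the paper.
UNENFORCED = (
    "Answer the following multiple-choice question. "
    "Reply with the letter of your chosen option.\n\n{question}"
)

GRADE_MIMIC = (
    "You are an average grade {grade} student. Answer the following "
    "multiple-choice question exactly as that student would, including "
    "any mistakes a typical student at that level might make. "
    "Reply with the letter of your chosen option.\n\n{question}"
)

def build_prompt(question: str, grade: int | None = None) -> str:
    """Unenforced prompt when no grade is given, grade-mimicking prompt otherwise."""
    if grade is None:
        return UNENFORCED.format(question=question)
    return GRADE_MIMIC.format(grade=grade, question=question)
```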

Key Findings: Do LLMs Align with Real Students?

The results revealed several interesting insights:

Without Guidance: Strong, general-purpose LLMs consistently outperformed the average student at every grade level in both mathematics and reading. They often scored much higher than the 50th percentile (representing the average student). Weaker or domain-mismatched models sometimes aligned incidentally, but this was not a reliable pattern.

With Grade-Enforced Prompts: While grade-specific prompting did change LLM performance, whether they aligned with the average grade-level student was highly dependent on the specific model and the prompt used. No single model-prompt combination consistently achieved average student performance across all subjects and grades. Some models improved their alignment, moving closer to the 50th percentile, while others overshot the target or showed little change.

The study categorized the alignment outcomes into four types: models that were aligned both with and without prompts, models misaligned in both cases, models that became aligned with prompting, and models that became misaligned with prompting. This highlights the complexity of achieving reliable student simulation.
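
One way to picture these four outcomes is as a simple lookup over the two alignment checks; the labels below paraphrase the article’s description rather than the paper’s own terminology.

```python
def alignment_category(aligned_unprompted: bool, aligned_prompted: bool) -> str:
    """Classify a model into the four outcome types described in the study."""
    if aligned_unprompted and aligned_prompted:
        return "aligned with and without grade-enforced prompting"
    if not aligned_unprompted and not aligned_prompted:
        return "misaligned in both conditions"
    if aligned_prompted:
        return "became aligned with prompting"
    return "became misaligned with prompting"
```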

Guidelines for Selecting AI Proxy Students

Based on their findings, the researchers provided guidelines for selecting viable LLM ‘proxy students’ (a rough code sketch of the first two checks follows the list):

  • Grade Alignment: The LLM’s estimated ability should fall within the normative range of the target grade.
  • Developmental Ordering: The LLM’s ability should increase monotonically with grade, mirroring real student development.
  • Prompt Stability: It’s crucial to verify that grade-enforcing prompts consistently improve or maintain accuracy across all grades, or to use unenforced prompts if the model is already well-aligned.
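
As a rough illustration, the first two guidelines reduce to simple checks on a model’s grade-wise ability estimates; the ability values and normative ranges below are hypothetical placeholders, not figures from the study.

```python
def check_proxy_student(theta_by_grade: dict[int, float],
                        normative_range: dict[int, tuple[float, float]]) -> dict[str, bool]:
    """Apply the grade-alignment and developmental-ordering checks.

    theta_by_grade: the LLM's estimated IRT ability under each target grade's prompt.
    normative_range: (low, high) ability range of real students at each grade.
    Prompt stability would additionally require comparing grade-enforced and
    unenforced prompts across grades, which is omitted here.
    """
    grades = sorted(theta_by_grade)
    abilities = [theta_by_grade[g] for g in grades]
    return {
        "grade_alignment": all(
            normative_range[g][0] <= theta_by_grade[g] <= normative_range[g][1]
            for g in grades
        ),
        "developmental_ordering": all(
            earlier < later for earlier, later in zip(abilities, abilities[1:])
        ),
    }

# Hypothetical example values:
print(check_proxy_student(
    theta_by_grade={4: -0.6, 8: 0.0, 12: 0.5},
    normative_range={4: (-1.0, -0.2), 8: (-0.4, 0.4), 12: (0.2, 1.0)},
))
```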


Conclusion

In conclusion, while LLMs offer promising avenues for educational technology, this research underscores that they are not yet universally reliable simulators of real student abilities. Achieving faithful grade-level emulation requires careful selection of models, tailored prompting strategies, and potentially dedicated fine-tuning with explicit alignment objectives. The study emphasizes the need for continued research and more robust evaluation datasets to fully realize the potential of LLMs as effective proxy students in education. You can read the full research paper here.

Ananya Rao
https://blogs.edgentiq.com
Ananya Rao is a tech journalist with a passion for dissecting the fast-moving world of Generative AI. With a background in computer science and a sharp editorial eye, she connects the dots between policy, innovation, and business. Ananya excels in real-time reporting and specializes in uncovering how startups and enterprises in India are navigating the GenAI boom. She brings urgency and clarity to every breaking news piece she writes. You can reach her at: [email protected]
