Unpacking AI's Performance in University STEM Exams: A Look at Visual Challenges

TLDR: A research paper introduces a dataset of 201 university-level STEM questions with images to compare AI and student performance. It finds that while AI performs well overall, it struggles significantly with questions involving crucial visual components, complex problems, and multiple-choice questions with multiple answers, areas where human students often maintain consistent performance. The study offers insights for educators on designing assessments that challenge AI without increasing student burden.

Generative AI systems have made remarkable strides, particularly with their ability to process multimodal inputs, allowing them to reason beyond simple text-based tasks. This advancement holds significant implications for education, especially in assessment design and question answering, presenting both exciting opportunities and notable challenges.

A recent research paper titled “Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison” delves into these effects by introducing a high-quality dataset of 201 university-level STEM questions. These questions, manually annotated with features like image type, role, problem complexity, and question format, provide a robust framework for comparing generative AI performance against that of human students. The study evaluated four model families using five different prompting strategies, comparing their results to the average of 546 student responses per question.

The findings reveal that while the best AI model achieved an average of 58.5% correct answers, human participants consistently outperformed AI, particularly on questions that involved visual components. Interestingly, human performance remained stable across various question features but varied by subject. In contrast, AI performance was susceptible to both subject matter and specific question features.

The researchers observed that AI models, despite their overall advancements, struggled significantly with questions requiring crucial images, that is, images containing essential information not present in the text. Students, however, maintained consistent accuracy regardless of whether an image was crucial or merely supplemental. When looking at specific image types, AI found diagrams and line plots particularly challenging, whereas students struggled most with algorithm-based questions.

Regarding question features, AI models performed slightly better than students on “compound” questions (multiple sub-questions linked by a common topic) but struggled considerably with “multiple choice questions multiple answers” (MCQMA), often failing to identify all correct choices. Furthermore, AI performance declined when questions involved more than two concepts, a factor that did not significantly impact student performance. The study also highlighted that AI excels in subjects like Astronomy, Computer Science, and Microfabrication, likely due to the structured nature of these questions, but falters in Quantum Physics, Chemistry, Neuroscience, and Electromagnetism, where complex, content-rich images pose greater hurdles.
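The paper graded MCQMA questions strictly, with no partial credit, so a response that misses even one correct option scores zero. The Python sketch below contrasts that all-or-nothing rule with a hypothetical partial-credit alternative. The function names, the example answer key, and the Jaccard-style partial score are illustrative assumptions, not taken from the paper.

```python
def strict_score(selected: set[str], correct: set[str]) -> float:
    """All-or-nothing: full credit only if the selection exactly matches the key."""
    return 1.0 if selected == correct else 0.0


def partial_score(selected: set[str], correct: set[str]) -> float:
    """Hypothetical partial credit (an assumption, not the paper's method):
    size of the overlap divided by size of the union (Jaccard index)."""
    union = selected | correct
    if not union:
        return 1.0  # nothing to select and nothing selected
    return len(selected & correct) / len(union)


# Made-up example: the key has three correct options, the response finds two.
key = {"A", "C", "D"}
response = {"A", "C"}

print("strict: ", strict_score(response, key))
print("partial:", round(partial_score(response, key), 3))
```

Under the strict rule, the response above earns nothing despite identifying two of the three correct options, which is one reason a model that often fails to identify all correct choices can look especially weak on this format.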

The error analysis provided deeper insights. Humans demonstrated a superior ability to integrate common sense, domain-specific intuition, and experiential learning, especially in physics-based reasoning and real-world conventions. They could interpret implicit relationships and complex diagrams more effectively. Conversely, AI models excelled in problems demanding structured reasoning, precise pattern recognition, and large-scale knowledge retrieval, such as algorithmic network problems or interpreting simple electrical schematics. These are tasks that follow well-defined rules, where AI’s ability to process extended contexts effortlessly gives it an advantage over humans, who might experience cognitive load with multi-step reasoning.

In conclusion, the research suggests that questions designed with crucial images and multiple concepts, while remaining concise, can effectively challenge current AI systems without increasing the cognitive burden for students. This offers actionable insights for educators aiming to enhance academic integrity in an era of rapidly advancing AI. The paper also acknowledges limitations, such as the dataset size and the strict grading method for MCQMA questions, which did not account for partial credit. For more details, refer to the full research paper.
