TLDR: SID is a new benchmark for evaluating large language models (LLMs) in guiding interdisciplinary STEM education through Socratic dialogues. It features a large dataset of 10,000+ dialogue turns, a detailed annotation system, and new evaluation metrics. Experiments show that current LLMs, even advanced ones like GPT-4o, struggle with dynamic pedagogical adaptation, deep interdisciplinary integration, and effectively fostering knowledge transfer, highlighting the need for more pedagogically-aware AI tutors.
Modern education aims to equip students with the ability to integrate and transfer knowledge across different subjects, especially in complex problem-solving scenarios within STEM (Science, Technology, Engineering, and Mathematics). Interdisciplinary STEM education is a crucial pathway to achieve this, but it often requires expert guidance that is difficult to provide at scale. While large language models (LLMs) show promise in this area, their actual capability for guided instruction has been unclear due to a lack of effective evaluation tools.
To address this critical gap, researchers have introduced SID, the first benchmark specifically designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. This benchmark is a significant step towards understanding how well AI can truly act as a pedagogical guide.
What is SID and Why is it Important?
SID stands for Socratic Interdisciplinary Dialogues. It’s a novel benchmark grounded in established educational theories like Constructivism and the Zone of Proximal Development (ZPD). Constructivism emphasizes that learning is an active process where students build their own understanding, while ZPD theory suggests that effective instruction happens at the edge of a student’s independent abilities, using scaffolding to help them advance.
The Socratic method, with its structured questioning approach, provides an ideal framework for this evaluation. It guides students to clarify ideas, challenge assumptions, and make connections, fostering both systematic cognitive structures and the ability to transfer knowledge to new contexts. SID aims to measure if LLMs can truly implement this complex Socratic guidance, moving beyond simple Q&A to support deep knowledge application.
Building the Benchmark: Dataset and Annotation
The SID benchmark includes a large-scale dataset with over 10,000 dialogue turns across 48 complex STEM projects. These dialogues are generated by simulating interactions between a “Socratic Teacher” agent and various “Student” agent personas, designed to mimic common learning challenges. Crucially, all automatically generated dialogues undergo rigorous human expert review to ensure instructional coherence, strategic validity, and factual accuracy.
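The teacher-student simulation described above can be pictured as a simple alternating loop. The following is only a minimal sketch under stated assumptions: the function `simulate_dialogue`, the stub agents, and the turn format are hypothetical illustrations, not the authors' generation pipeline, and real agents would call an LLM where the stubs return fixed strings.

```python
from typing import Callable, Dict, List

Turn = Dict[str, str]  # hypothetical turn format: {"role": ..., "text": ...}


def simulate_dialogue(
    teacher: Callable[[List[Turn]], str],
    student: Callable[[List[Turn]], str],
    opening: str,
    rounds: int,
) -> List[Turn]:
    """Alternate teacher and student turns, starting from a student opening."""
    history: List[Turn] = [{"role": "student", "text": opening}]
    for _ in range(rounds):
        # Each agent sees the full history so far before replying.
        history.append({"role": "teacher", "text": teacher(history)})
        history.append({"role": "student", "text": student(history)})
    return history


# Stub agents standing in for LLM-backed personas (assumption for illustration).
def stub_teacher(history: List[Turn]) -> str:
    return f"What makes you think that? (after {len(history)} turns)"


def stub_student(history: List[Turn]) -> str:
    return "I'm not sure, maybe because of friction?"


transcript = simulate_dialogue(
    stub_teacher, stub_student, "Why does the bridge sway?", rounds=2
)
```

In a setup like this, swapping in a different `student` callable would correspond to a different student persona, and the resulting transcript would then go to human reviewers.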
To make these complex interactions measurable, SID employs a novel nine-field multi-dimensional annotation schema. This schema captures deep pedagogical features, including the teacher’s intent (e.g., guiding reasoning, triggering transfer), teaching strategy (e.g., follow-up questions, analogies), the disciplines involved, whether knowledge transfer between disciplines occurs, the student’s inferred cognitive state (e.g., confusion, clear understanding), and the cognitive level of the interaction based on Bloom’s Taxonomy.
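To make the schema more concrete, one annotated turn could be represented roughly as follows. This is a sketch under assumptions: it covers only the six fields named above rather than the full nine-field schema, and the class name, field names, and label strings are hypothetical. The Bloom levels follow the standard revised taxonomy.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class BloomLevel(IntEnum):
    """Revised Bloom's Taxonomy levels, ordered from lowest to highest."""
    REMEMBER = 1
    UNDERSTAND = 2
    APPLY = 3
    ANALYZE = 4
    EVALUATE = 5
    CREATE = 6


@dataclass
class TurnAnnotation:
    """Hypothetical record for the six annotation fields named in the text."""
    teacher_intent: str        # e.g. "guide_reasoning", "trigger_transfer"
    teaching_strategy: str     # e.g. "follow_up_question", "analogy"
    disciplines: List[str]     # disciplines involved in the turn
    knowledge_transfer: bool   # whether cross-discipline transfer occurs
    student_state: str         # e.g. "confusion", "clear_understanding"
    bloom_level: BloomLevel    # cognitive level of the interaction


example = TurnAnnotation(
    teacher_intent="trigger_transfer",
    teaching_strategy="analogy",
    disciplines=["physics", "biology"],
    knowledge_transfer=True,
    student_state="confusion",
    bloom_level=BloomLevel.APPLY,
)
```

Using an ordered enum for the Bloom level is a design choice that makes it easy to compare levels across turns.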
Evaluating LLMs: Objective and Subjective Measures
SID uses a two-tier evaluation framework for a comprehensive assessment of LLM performance. This combines quantifiable pedagogical behaviors with holistic dialogue quality:
- Objective Behavioral Indicators: Seven automatically computable metrics, including Strategy Density (frequency of strategies), Strategy Variety (number of unique strategies), Interdisciplinary Knowledge Transfer (IKT), Bloom Progression (BP, the increase in the student's cognitive level), and Cognitive Correction Count (3C, the number of corrected misconceptions). The evaluation is student-centric, balancing the teacher's process against student learning outcomes.
- Subjective Quality Rubrics: Five rubric-based indicators, such as Interdisciplinary Scientific Reasoning Grading (X-SRG) and Multi-turn Reasoning Coherence (M-RCC), are used for automated evaluation via an “LLM-as-a-Judge” approach (using DeepSeek-V3). These rubrics capture higher-order capabilities that are difficult to quantify automatically.
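To illustrate how objective indicators of this kind could be computed from annotated turns, here is a minimal sketch. The formulas are assumptions made for illustration (the summary above gives only short glosses, not the paper's exact definitions), and the function names and dictionary keys are hypothetical.

```python
def strategy_density(turns):
    """Share of teacher turns that carry a strategy label (assumed definition)."""
    if not turns:
        return 0.0
    return sum(1 for t in turns if t.get("strategy")) / len(turns)


def strategy_variety(turns):
    """Number of distinct strategy labels used in the dialogue."""
    return len({t["strategy"] for t in turns if t.get("strategy")})


def transfer_count(turns):
    """Count of turns flagged as interdisciplinary transfer (assumed IKT proxy)."""
    return sum(1 for t in turns if t.get("transfer"))


def bloom_progression(turns):
    """Last minus first annotated Bloom level (assumed definition of BP)."""
    levels = [t["bloom"] for t in turns if t.get("bloom") is not None]
    return levels[-1] - levels[0] if len(levels) >= 2 else 0


# Toy annotated dialogue; labels and values are made up for illustration.
turns = [
    {"strategy": "follow_up_question", "transfer": False, "bloom": 2},
    {"strategy": "analogy", "transfer": True, "bloom": 3},
    {"strategy": None, "transfer": False, "bloom": 3},
    {"strategy": "follow_up_question", "transfer": True, "bloom": 4},
]

density = strategy_density(turns)
variety = strategy_variety(turns)
transfers = transfer_count(turns)
progression = bloom_progression(turns)
```

On this toy dialogue, three of four turns carry a strategy, two distinct strategies appear, two turns involve transfer, and the Bloom level rises by two steps.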
Key Findings and the Gap with Human Teachers
Experiments with state-of-the-art LLMs, including general-purpose models like GPT-4o and education-oriented models like InnoSpark, reveal significant challenges. While some models might score highly on subjective fluency and structural completeness, they often fall short on objective pedagogical effectiveness, particularly in interdisciplinary knowledge integration and deep transfer.
For instance, models might generate coherent dialogues but struggle to proactively guide students to make interdisciplinary connections (low IKT scores) or effectively promote a sustained increase in students’ cognitive levels (low Bloom Progression). Case studies show that LLMs often exhibit rigid guiding strategies, either ignoring student errors or correcting them in a non-Socratic, direct manner. They also tend to be passive in fostering interdisciplinary connections, often relying on the student to initiate such links.
In contrast, expert human teachers demonstrate dynamic adaptability, diagnosing and leveraging student misconceptions as teaching opportunities, and proactively connecting topics across disciplines. They can flexibly switch strategies and ultimately guide students to construct knowledge independently, enabling true knowledge integration and transfer.
Conclusion
The SID benchmark highlights that despite advancements, current LLMs struggle with dynamic pedagogical adaptation, deep interdisciplinary integration, and effective scaffolding of students’ knowledge transfer. This work, detailed further in the research paper available at arxiv.org/pdf/2508.04563, serves as a foundational tool to drive progress in developing AI tutors that can genuinely foster students’ ability to integrate and transfer knowledge, moving beyond mere fluency to true pedagogical effectiveness.


