TLDR: ParaStudent is a new framework that teaches Large Language Models (LLMs) to generate realistic, imperfect, and iterative code, mimicking how human students learn. By fine-tuning LLMs on real student submissions and evaluating the generated code along semantic, functional, and stylistic dimensions, the researchers found that fine-tuning is crucial for capturing authentic learning dynamics; simple prompting, by contrast, tends to produce overly polished code. This work has implications for generating realistic educational data and for developing more effective AI tutor agents.
Large Language Models, or LLMs, have shown impressive capabilities in generating code. However, a key question remains: can these advanced AI models truly mimic the way human students learn to code, including their struggles, iterative improvements, and unique stylistic quirks? A new research paper introduces ParaStudent, a framework designed to explore and achieve just that.
ParaStudent is a systematic study focused on enabling LLMs to generate “student-like” code within the context of an introductory programming course. Unlike professional-grade code, student code is often imperfect, undergoes multiple revisions, and exhibits diverse styles. The researchers utilized a comprehensive dataset of timestamped student code submissions from multiple semesters at the University of California, Berkeley, to train and evaluate their models.
Understanding “Student-Like” Code
The core idea behind ParaStudent is to capture the distinct characteristics of novice programmer code. This includes functional errors, unpolished and verbose styles, non-standard structures, and the incremental revisions students make as they learn. To evaluate how well AI models replicate these traits, ParaStudent employs a multi-dimensional evaluation system that looks beyond just correctness. It assesses code based on its semantics (meaning), functionality (whether it runs and passes tests, including error types), and style (verbosity, code structure, and adherence to style guidelines like PEP 8).
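To make the evaluation idea concrete, here is a minimal sketch of what such a harness could look like in Python. Everything below is illustrative rather than the paper's implementation: the `evaluate_submission` function, the test-runner command, and the line-count proxy for verbosity are assumptions; only `pycodestyle`, a standard PEP 8 checker, is a real tool.

```python
# Hypothetical evaluation harness: scores one submission on functionality,
# style, and verbosity. Names and details are illustrative, not the paper's
# actual implementation.
import subprocess
import tempfile

import pycodestyle  # real PEP 8 checker: pip install pycodestyle


def evaluate_submission(code: str, test_file: str) -> dict:
    """Score a student submission along functional and stylistic axes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name

    # Functionality: run the course's test suite against the submission
    # (assumed invocation; a real autograder would differ).
    tests = subprocess.run(
        ["python", test_file, path],
        capture_output=True, text=True, timeout=30,
    )

    # Style: count PEP 8 violations with pycodestyle.
    report = pycodestyle.StyleGuide(quiet=True).check_files([path])

    # Semantics (e.g., embedding similarity to real student code) would
    # need a separate model and is omitted from this sketch.
    return {
        "passes_tests": tests.returncode == 0,
        "error_output": tests.stderr,               # error types, per the paper
        "pep8_violations": report.total_errors,
        "verbosity_lines": len(code.splitlines()),  # crude verbosity proxy
    }
```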
The ParaStudent Approach: Fine-tuning vs. Prompting
The study compared two main strategies for generating student code: fine-tuning and prompting. Fine-tuning involved adapting a powerful coding LLM, Qwen-2.5 Coder 7B, specifically on the real student submission data. This fine-tuned model, dubbed “qwen-student,” was then compared against the instruction-tuned version of the same base model (“qwen-inst”) and a leading proprietary model, GPT-4.1, both used with simple prompting.
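While the paper's exact training recipe isn't reproduced here, a parameter-efficient fine-tuning run of this kind might look roughly like the sketch below, which assumes Hugging Face `transformers` and `peft` with LoRA adapters; the hyperparameters and the one-line toy dataset are placeholders.

```python
# Hedged sketch: one plausible way to adapt Qwen-2.5 Coder 7B on student
# submissions using LoRA adapters. Hyperparameters, data formatting, and the
# toy dataset below are assumptions, not the paper's actual recipe.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "Qwen/Qwen2.5-Coder-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Low-rank adapters keep fine-tuning a 7B model tractable on modest hardware.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
))

# Toy stand-in for the real dataset of timestamped student submissions.
texts = ["def square(x):\n    return x * 2  # early, buggy attempt"]
student_dataset = Dataset.from_dict(dict(tokenizer(texts)))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen-student", num_train_epochs=2,
                           per_device_train_batch_size=1, learning_rate=2e-4),
    train_dataset=student_dataset,
    # Causal-LM collator copies input_ids to labels for next-token training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```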
Experiments were conducted at two temporal resolutions: low-resolution, which looked at code snapshots from the beginning, middle, and end of a student’s problem-solving process, and high-resolution, which modeled the step-by-step generation of code submissions over time. The researchers also investigated the impact of providing student-specific context, such as prior submissions on different problems, to help the models learn individual student patterns.
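One plausible way to inject that student-specific context, shown purely as an illustration (the paper's actual prompt template may differ), is to fold a student's prior attempts into the prompt for the next submission:

```python
# Hypothetical prompt builder for the high-resolution setting: given a
# student's prior attempts, ask the model for their *next* submission.
# The template wording is an assumption, not the paper's exact prompt.
def build_next_submission_prompt(problem: str, prior_attempts: list[str]) -> str:
    history = "\n\n".join(
        f"# Attempt {i + 1}\n{code}" for i, code in enumerate(prior_attempts)
    )
    return (
        "You are simulating a student in an introductory Python course.\n\n"
        f"Problem:\n{problem}\n\n"
        f"The student's previous attempts, in order:\n{history}\n\n"
        "Write the student's next submission. Preserve their style, make an "
        "incremental edit, and do not jump straight to a perfect solution."
    )
```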
Key Findings: Fine-tuning is Crucial for Realism
The results of the ParaStudent study highlight several important conclusions. Firstly, fine-tuning proved essential for generating realistic student behavior. The “qwen-student” model consistently outperformed prompt-based models in capturing diverse error patterns, realistic stylistic variations, and the incremental edits typical of human learners. Prompt-based models, in contrast, tended to produce overly correct and polished code that didn’t reflect the learning process.
Secondly, the study emphasized the importance of multi-dimensional evaluation. Relying solely on functional correctness is insufficient to determine if code is truly “student-like.” By evaluating across semantics, functionality, and style, ParaStudent provides a more holistic view. The granularity of the data also mattered; fine-tuned models were better at simulating student trajectories even in the more variable middle stages of problem-solving.
Finally, the research demonstrated that even smaller, open-source models, when appropriately fine-tuned, can effectively simulate realistic student code. This opens up new possibilities for educational applications.
Implications and Future Directions
The ParaStudent framework has significant implications for the future of AI in education. It can enable the generation of realistic student data, which is invaluable for benchmarking educational models, especially when real student data is scarce. It also paves the way for training more sophisticated AI tutor agents that can understand and reason about intermediate student attempts, rather than just focusing on final correct answers.
While promising, the researchers acknowledge limitations, such as the study being confined to a single introductory programming course and the use of a specific LLM for fine-tuning. Future work will explore generalization to other courses, languages, and difficulty levels, as well as different fine-tuning techniques. The paper also discusses potential risks, including the misuse of such models for academic dishonesty and the importance of privacy safeguards if these systems are deployed in real educational settings.
For more detailed information, see the full research paper.