Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing generating correct answers rather than diagnosing student errors.
To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. We systematically evaluate 16 leading MLLMs, revealing substantial performance gaps relative to human experts (83.9% average): the best model, o4-mini, achieves only 57.2%.
Given a math problem, its correct answer, reference solution, the student's incorrect answer, and an image of the student's handwritten scratchwork, the model must generate a free-form explanation of the specific cause of the student's error.
Evaluation: LLM-as-a-Judge (using o3-mini, 88.6% agreement with human judges).
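The agreement figure above is the fraction of samples on which the LLM judge's verdict matches the human judges'. A minimal sketch of that computation (the verdict lists below are hypothetical, not from the dataset):

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of samples where the LLM judge agrees with human judges.

    Both arguments are equal-length lists of booleans
    (True = explanation judged correct).
    """
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

# Hypothetical verdicts for four samples: the judge and humans
# agree on three of them, so the agreement rate is 0.75.
rate = agreement_rate([True, True, False, True],
                      [True, False, False, True])
print(rate)  # 0.75
```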
Using the same inputs, the model must classify the error into one of 7 predefined categories:
| # | Category (Chinese) | Category (English) |
|---|---|---|
| 1 | 计算错误 | Calculation Error |
| 2 | 题目理解错误 | Question Comprehension Error |
| 3 | 知识点错误 | Knowledge Gap Error |
| 4 | 答题技巧错误 | Problem-Solving Strategy Error |
| 5 | 手写誊抄错误 | Handwriting Transcription Error |
| 6 | 逻辑推理错误 | Logical Reasoning Error |
| 7 | 注意力与细节错误 | Attention & Detail Error |
Evaluation: Weighted-average accuracy.
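Weighted-average accuracy here can be read as per-sample (micro-averaged) accuracy across the two grade-level subsets; a minimal sketch, assuming the weights are the subset sample counts (1,479 primary, 241 middle) and using made-up per-subset results:

```python
def weighted_accuracy(results):
    """Micro-averaged accuracy across subsets.

    results: list of (num_correct, num_samples) tuples, one per subset.
    Each subset is weighted by its sample count.
    """
    total_correct = sum(correct for correct, _ in results)
    total_samples = sum(n for _, n in results)
    return total_correct / total_samples

# Hypothetical results: 600/1479 correct on primary, 120/241 on middle.
acc = weighted_accuracy([(600, 1479), (120, 241)])
print(f"{acc:.4f}")
```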
The dataset is hosted on HuggingFace: songdj/ScratchMath. It consists of 1,720 authentic handwritten scratchwork samples annotated through rigorous human-machine collaboration involving multiple stages of expert labeling, review, and verification.
| Subset | Samples | Grade Level |
|---|---|---|
| primary | 1,479 | Grades 1–6 |
| middle | 241 | Grades 7–9 |
Each sample contains: question_id, question, answer, solution, student_answer, student_scratchwork (image), error_category, error_explanation.
```python
from datasets import load_dataset

# Primary-school subset (Grades 1–6, 1,479 samples)
ds_primary = load_dataset("songdj/ScratchMath", "primary", split="train")
# Middle-school subset (Grades 7–9, 241 samples)
ds_middle = load_dataset("songdj/ScratchMath", "middle", split="train")
```
Performance of state-of-the-art MLLMs on ScratchMath. Human expert performance averages 83.9% across all metrics.
| Model | #Params | ECE Primary | ECE Middle | ECC Primary | ECC Middle | Average |
|---|---|---|---|---|---|---|
| Human Expert | — | 93.2 | 89.0 | 80.1 | 73.4 | 83.9 |
| o4-mini* | — | 71.8 | 69.7 | 40.1 | 47.3 | 57.2 |
| Gemini 2.0 Flash Thinking* | — | 65.9 | 61.0 | 43.9 | 47.3 | 54.5 |
| Gemini 2.0 Flash | — | 52.2 | 46.9 | 38.6 | 49.0 | 46.7 |
| QVQ* | 72B | 57.5 | 56.8 | 12.7 | 17.0 | 36.0 |
| Qwen2.5-VL | 72B | 40.0 | 34.0 | 32.5 | 49.4 | 39.0 |
| Gemma-3 | 27B | 38.9 | 26.1 | 32.2 | 46.1 | 35.8 |
| Skywork-R1V* | 38B | 37.5 | 33.6 | 27.7 | 43.2 | 35.5 |
| GPT-4o | — | 47.7 | 44.8 | 26.1 | 22.0 | 35.2 |
| InternVL2.5 | 78B | 27.1 | 24.5 | 30.7 | 44.8 | 31.8 |
* denotes reasoning models. Full results are available in the paper.
Representative failure cases from the best-performing model (o4-mini), illustrating key challenges in multimodal error diagnosis:
Problem: Solve for x: 4x − 3(20 − x) = 6x − 7(9 − x)
Correct: x = 1/2 | Student: x = −1/2
The student miscalculated −63 + 60 as +3 instead of −3, leading to −6x = 3 and thus x = −1/2. The model failed to visually recognize the sign error in the scratchwork.
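The algebra behind this case can be checked directly. Expanding both sides gives 7x − 60 = 13x − 63, so −6x = −3 and x = 1/2; the student's sign slip (−63 + 60 taken as +3) instead yields −6x = 3:

```python
from fractions import Fraction

# 4x − 3(20 − x) = 6x − 7(9 − x)
# Left:  4x − 60 + 3x = 7x − 60
# Right: 6x − 63 + 7x = 13x − 63
# Correct: 7x − 60 = 13x − 63  =>  −6x = −3  =>  x = 1/2
x_correct = Fraction(-3, -6)

# Student's slip: −63 + 60 computed as +3  =>  −6x = 3  =>  x = −1/2
x_student = Fraction(3, -6)

print(x_correct, x_student)  # 1/2 -1/2
```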
Problem: If the two roots of a(x+m)^2 + b = 0 are −1 and 4, find the solutions to a(x+m−3)^2 + b = 0.
Correct: x = 2 or 7 | Student: x = 2 and 7
The student mistakenly treated the shifted equation as a(x−3)2+b=0, ignoring the +m term. Additionally, the student wrote "and" instead of "or" for the solution format.
Problem: A brick has dimensions 20×11×6 cm and density 1.5 g/cm³. Find the weight in kilograms.
Correct: 1.98 kg | Student: 1980
The student correctly computed the volume and mass (1980 g) but failed to convert grams to kilograms. The model misidentified this as a calculation error rather than a unit conversion oversight.
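The arithmetic of this case makes the missed step concrete, the volume and mass are correct and only the final g-to-kg conversion is absent:

```python
# Brick: 20 × 11 × 6 cm, density 1.5 g/cm³
volume_cm3 = 20 * 11 * 6        # 1320 cm³
mass_g = volume_cm3 * 1.5       # 1980 g  (where the student stopped)
mass_kg = mass_g / 1000         # 1.98 kg (the missing unit conversion)
print(mass_kg)
```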
If you find this work useful, please cite:
@inproceedings{song2026scratchmath,
  title={Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math},
  author={Song, Dingjie and Xu, Tianlong and Zhang, Yi-Fan and Li, Hang and Yan, Zhiling and Fan, Xing and Li, Haoyang and Sun, Lichao and Wen, Qingsong},
  booktitle={Proceedings of the 27th International Conference on Artificial Intelligence in Education (AIED)},
  year={2026}
}