AIED 2026

Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

Dingjie Song1, Tianlong Xu2, Yi-Fan Zhang4, Hang Li5, Zhiling Yan1, Xing Fan3, Haoyang Li3, Lichao Sun1, Qingsong Wen2*
1Lehigh University   2Squirrel Ai Learning (USA)   3Squirrel Ai Learning (China)   4Chinese Academy of Sciences   5Michigan State University
* Corresponding author
ScratchMath benchmark overview showing the framework for error cause explanation and classification on student handwritten scratchwork

Abstract

Assessing student handwritten scratchwork is crucial for personalized educational feedback, but it presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP focuses primarily on textual responses and neglects the complexity and multimodality of authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing the production of correct answers over the diagnosis of student errors.

To bridge these gaps, we introduce ScratchMath, a benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students and supports two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. We systematically evaluate 16 leading MLLMs and find a substantial gap relative to human experts: the best model (o4-mini) averages only 57.2%, versus 83.9% for humans.

1,720 Student Samples
2 Evaluation Tasks
7 Error Categories
16 Models Evaluated
83.9% Human Performance
57.2% Best Model (o4-mini)

Tasks

Figure 1. Overview of this work. (Top) An example illustrating the two proposed tasks. (Bottom) Summary of the three research questions addressed.

Error Cause Explanation (ECE)

Given a math problem, its correct answer, reference solution, the student's incorrect answer, and an image of the student's handwritten scratchwork, the model must generate a free-form explanation of the specific cause of the student's error.

Evaluation: LLM-as-a-Judge (using o3-mini, 88.6% agreement with human judges).
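As a rough sketch of how such LLM-as-a-Judge scoring can be wired up (the template wording and helper names below are illustrative assumptions, not the paper's exact protocol; the actual call to the o3-mini judge is omitted):

```python
# Illustrative LLM-as-a-Judge scaffolding for ECE.
# The prompt wording is an assumption, not the benchmark's exact template.
JUDGE_TEMPLATE = """You are grading an error-cause explanation.
Problem: {question}
Correct answer: {answer}
Student answer: {student_answer}
Reference error cause: {reference}
Model explanation: {prediction}
Does the model explanation identify the same error cause as the
reference? Reply with exactly CORRECT or INCORRECT."""

def build_judge_prompt(sample: dict, prediction: str) -> str:
    """Fill the judging template with one ScratchMath sample."""
    return JUDGE_TEMPLATE.format(
        question=sample["question"],
        answer=sample["answer"],
        student_answer=sample["student_answer"],
        reference=sample["error_explanation"],
        prediction=prediction,
    )

def parse_verdict(judge_reply: str) -> bool:
    """Map the judge's reply to a binary ECE score.

    startswith avoids the substring trap: "INCORRECT" contains "CORRECT".
    """
    return judge_reply.strip().upper().startswith("CORRECT")
```

In practice the filled prompt would be sent to the judge model and `parse_verdict` applied to its reply; the per-sample binary scores are then averaged.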

Error Cause Classification (ECC)

Using the same inputs, the model must classify the error into one of 7 predefined categories:

# | Category (Chinese) | Category (English)
1 | 计算错误 | Calculation Error
2 | 题目理解错误 | Question Comprehension Error
3 | 知识点错误 | Knowledge Gap Error
4 | 答题技巧错误 | Problem-Solving Strategy Error
5 | 手写誊抄错误 | Handwriting Transcription Error
6 | 逻辑推理错误 | Logical Reasoning Error
7 | 注意力与细节错误 | Attention & Detail Error

Evaluation: Weighted-average accuracy.
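A minimal sketch of how weighted-average accuracy could be computed, assuming the weight of each category is its share of the samples (an assumption about the metric; with these weights it reduces to overall micro-accuracy):

```python
from collections import Counter

def weighted_accuracy(gold: list[str], pred: list[str]) -> float:
    """Per-category accuracy, weighted by category frequency.

    Assumption: weights are each category's share of all samples.
    With these weights the sum equals overall (micro) accuracy.
    """
    counts = Counter(gold)                      # samples per category
    total = len(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return sum(
        (n / total) * (correct[cat] / n)        # weight * category accuracy
        for cat, n in counts.items()
    )
```

Other weighting schemes (e.g. uniform over categories, i.e. macro accuracy) would only require changing the `n / total` factor.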

Dataset

The dataset is hosted on HuggingFace: songdj/ScratchMath. It consists of 1,720 authentic handwritten scratchwork samples annotated through rigorous human-machine collaboration involving multiple stages of expert labeling, review, and verification.

Subset | Samples | Grade Level
primary | 1,479 | Grades 1–6
middle | 241 | Grades 7–9

Each sample contains: question_id, question, answer, solution, student_answer, student_scratchwork (image), error_category, error_explanation.

# Load the two ScratchMath subsets from the Hugging Face Hub
from datasets import load_dataset

ds_primary = load_dataset("songdj/ScratchMath", "primary", split="train")  # Grades 1-6
ds_middle  = load_dataset("songdj/ScratchMath", "middle", split="train")   # Grades 7-9
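For an offline sketch of the record schema above (all field values here are invented; real samples come from `load_dataset` as shown), a typical split into model inputs versus gold labels looks like:

```python
# Mock record mirroring the ScratchMath schema (values are invented).
sample = {
    "question_id": "primary_0001",
    "question": "Compute 37 + 48.",
    "answer": "85",
    "solution": "37 + 48 = 85",
    "student_answer": "75",
    "student_scratchwork": None,  # a PIL image in the real dataset
    "error_category": "Calculation Error",
    "error_explanation": "The student forgot to carry the ten.",
}

# For evaluation, the gold label fields are withheld from the model.
model_inputs = {k: v for k, v in sample.items()
                if k not in ("error_category", "error_explanation")}
print(sorted(model_inputs))
```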
Figure 2. Dataset construction pipeline: data collection, MLLM-assisted annotation, expert review, and quality verification.
Figure 3. Error type distribution comparison between primary and middle school subsets.

Leaderboard

Performance of state-of-the-art MLLMs on ScratchMath. Human expert performance averages 83.9% across all metrics.

Model | #Params | ECE Primary | ECE Middle | ECC Primary | ECC Middle | Average
Human Expert | – | 93.2 | 89.0 | 80.1 | 73.4 | 83.9
o4-mini* | – | 71.8 | 69.7 | 40.1 | 47.3 | 57.2
Gemini 2.0 Flash Thinking* | – | 65.9 | 61.0 | 43.9 | 47.3 | 54.5
Gemini 2.0 Flash | – | 52.2 | 46.9 | 38.6 | 49.0 | 46.7
Qwen2.5-VL | 72B | 40.0 | 34.0 | 32.5 | 49.4 | 39.0
QVQ* | 72B | 57.5 | 56.8 | 12.7 | 17.0 | 36.0
Gemma-3 | 27B | 38.9 | 26.1 | 32.2 | 46.1 | 35.8
Skywork-R1V* | 38B | 37.5 | 33.6 | 27.7 | 43.2 | 35.5
GPT-4o | – | 47.7 | 44.8 | 26.1 | 22.0 | 35.2
InternVL2.5 | 78B | 27.1 | 24.5 | 30.7 | 44.8 | 31.8

* denotes reasoning models. Full results are available in the paper.

Case Studies

Representative failure cases from the best-performing model (o4-mini), illustrating key challenges in multimodal error diagnosis:

Visual Recognition Failure

Problem: Solve for x: 4x − 3(20 − x) = 6x − 7(9 − x)

Correct: x = 1/2   Student: x = −1/2

The student miscalculated −63 + 60 as +3 instead of −3, leading to −6x = 3 and thus x = −1/2. The model failed to visually recognize the sign error in the scratchwork.
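The arithmetic in this case can be checked mechanically. The snippet below (an illustration, not benchmark code) solves the expanded linear equation exactly and reproduces the student's sign-flipped result:

```python
from fractions import Fraction

# 4x - 3(20 - x) = 6x - 7(9 - x)  expands to  7x - 60 = 13x - 63
lhs_slope, lhs_const = 7, -60
rhs_slope, rhs_const = 13, -63

# Correct solution: x = (rhs_const - lhs_const) / (lhs_slope - rhs_slope)
x_correct = Fraction(rhs_const - lhs_const, lhs_slope - rhs_slope)

# Student's slip: reading -63 + 60 as +3 instead of -3 flips the sign
x_student = Fraction(-(rhs_const - lhs_const), lhs_slope - rhs_slope)

print(x_correct, x_student)  # 1/2 -1/2
```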

Formatting Misinterpretation

Problem: If the two roots of a(x+m)² + b = 0 are −1 and 4, find the solutions to a(x+m−3)² + b = 0.

Correct: x = 2 or 7   Student: x = 2 and 7

The student mistakenly treated the shifted equation as a(x−3)² + b = 0, ignoring the +m term. Additionally, the student wrote "and" instead of "or" for the solution format.

Misaligned Misinterpretation

Problem: A brick has dimensions 20 × 11 × 6 cm and density 1.5 g/cm³. Find its weight in kilograms.

Correct: 1.98 kg   Student: 1980

The student correctly computed the volume and mass (1980 g) but failed to convert grams to kilograms. The model misidentified this as a calculation error rather than a unit conversion oversight.
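The full computation, as a quick illustrative check (not benchmark code), makes the missing step explicit:

```python
# Brick: 20 cm x 11 cm x 6 cm, density 1.5 g/cm^3
volume_cm3 = 20 * 11 * 6      # 1320 cm^3
mass_g = volume_cm3 * 1.5     # 1980 g  <- where the student stopped
mass_kg = mass_g / 1000       # 1.98 kg <- the missing gram-to-kilogram conversion
print(volume_cm3, mass_g, mass_kg)
```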

Citation

If you find this work useful, please cite:

@inproceedings{song2026scratchmath,
  title={Can MLLMs Read Students' Minds? Unpacking Multimodal
         Error Analysis in Handwritten Math},
  author={Song, Dingjie and Xu, Tianlong and Zhang, Yi-Fan
          and Li, Hang and Yan, Zhiling and Fan, Xing
          and Li, Haoyang and Sun, Lichao and Wen, Qingsong},
  booktitle={Proceedings of the 27th International Conference
             on Artificial Intelligence in Education (AIED)},
  year={2026}
}