I Let AI Mark My Students' Homework for a Month. Here's the Honest Result.

Wong Sir tried AI marking with his students for a month. What it got right, what it missed, what it revealed about his own habits — and an honest verdict.

Wong Sir, Chief Editor & Maths
#AI #marking #homework #teachers #feedback #maths

I should be transparent about something before I start: I built Tutor Wong. An article from me about AI marking is not exactly neutral. But I think that makes the honest account more interesting, not less — because I went into this experiment with strong prior beliefs about what AI marking could do, and I was still surprised by what I found.

The experiment ran for four weeks with two classes: a P5 maths group and an S1 maths group. Both were classes I teach directly. I assigned homework as normal and ran the submissions through AI marking in parallel with my own marking, then compared results before returning feedback to students. I did not tell students their work was being double-marked. I recorded disagreements and my reasons for overriding or accepting the AI assessment.

Here's what I found.

What AI marking got right

Arithmetic errors. The AI was better than me at catching careless arithmetic mistakes, full stop. This surprised me because I thought of myself as a careful marker. But I mark 30-odd scripts in a sitting, often late in the evening, after a day of teaching. I miss things. On routine arithmetic, the AI did not. On a set of P5 long multiplication and division problems, I had missed three carrying errors that the AI flagged. They were genuine errors. I would have given those students credit they hadn't earned.

Missing steps. For any problem type where the expected working is well-defined (standard algorithm, show-all-steps instructions), the AI was reliable at identifying missing steps. "Student calculated the final answer correctly but did not show the intermediate conversion" is exactly the kind of structural feedback that takes time to write for every script and that teachers often shorthand into a tick with a note at the bottom. The AI caught and flagged it in every instance.

Consistency across a set. The AI applied the marking scheme identically to script 1 and script 30. I do not. By script 25, my standards shift in ways I'm aware of but can't fully control. I become slightly more lenient when I've seen the same error repeatedly and am tired of marking it, or slightly more strict when I've hit a run of particularly careless work. The AI's consistency is, in this respect, fairer than mine.

Pattern identification across submissions. This is the capability I find genuinely most valuable. After marking a class set, the AI could tell me that seven students had made a specific type of fraction comparison error, that the error was concentrated in questions requiring unlike denominators, and that four of those seven students had made the same error the previous week. A human marker can see patterns within a single script. Seeing them across a whole class set, automatically, is different.
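To make the idea concrete, here is a minimal sketch of that kind of cross-submission summary. It assumes each flagged error has already been stored as a (week, student, question tag, error type) record. The record layout, the labels, and the sample data are all invented for illustration and are not how Tutor Wong stores anything.

```python
from collections import defaultdict

# Hypothetical marking records: (week, student, question_tag, error_type).
# The students, tags, and error labels are made up for this example.
records = [
    (1, "A", "unlike_denominators", "fraction_comparison"),
    (1, "B", "unlike_denominators", "fraction_comparison"),
    (2, "A", "unlike_denominators", "fraction_comparison"),
    (2, "B", "unlike_denominators", "fraction_comparison"),
    (2, "C", "unlike_denominators", "fraction_comparison"),
    (2, "D", "long_division", "carrying"),
]


def students_by_error(records, week):
    """Map each error type to the set of students who made it in `week`."""
    out = defaultdict(set)
    for wk, student, _tag, error in records:
        if wk == week:
            out[error].add(student)
    return dict(out)


def repeated_from_last_week(records, error, week):
    """Students who made `error` in `week` and also in the week before."""
    this_week = students_by_error(records, week).get(error, set())
    last_week = students_by_error(records, week - 1).get(error, set())
    return this_week & last_week


summary = students_by_error(records, week=2)
repeats = repeated_from_last_week(records, "fraction_comparison", week=2)
print(len(summary["fraction_comparison"]), sorted(repeats))  # 3 ['A', 'B']
```

The grouping itself is simple. What a tired human marker rarely does is keep the records in a form where the grouping can be run at all.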

What AI marking missed

Invented methods. This was the most significant failure category. I had a student who solved a geometry area problem using an approach I hadn't taught: his own informal decomposition method, which was mathematically valid but not the standard method. The AI marked it wrong because it didn't match the expected method. The student had demonstrated genuine mathematical thinking and was penalised for it.

This happened three times across the experiment, in different forms. A student who calculated a percentage correctly via an unconventional route. A student who solved a simultaneous equations problem with a substitution sequence that was unusual but valid. In each case, the AI had been calibrated on standard methods and couldn't recognise valid departures from them. I overrode these assessments every time. An AI that marks your child down for thinking creatively is not a useful educational tool.

Partial understanding that deserved credit. This is harder to describe precisely, but experienced teachers will recognise it. There is a class of response where the student has clearly grasped the core concept but made a procedural error in applying it. The correct pedagogical response is to give partial credit and feedback that acknowledges the understanding while identifying the procedural gap. The AI tended to give binary credit: right method, right answer = full marks; wrong answer, regardless of method quality = low marks. This undervalued a real kind of progress.
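The difference between the two marking styles can be sketched in a few lines. This is a simplified illustration, not the AI's actual scoring logic: the function names and the method/answer split are assumptions, and "low marks" is modelled here as zero.

```python
def binary_mark(method_ok: bool, answer_ok: bool, max_marks: int) -> int:
    """All-or-nothing: full marks only when method and answer are both right."""
    return max_marks if (method_ok and answer_ok) else 0


def partial_mark(method_ok: bool, answer_ok: bool,
                 max_marks: int, method_marks: int) -> int:
    """Award `method_marks` for a sound method and the remainder for the answer."""
    if not 0 <= method_marks <= max_marks:
        raise ValueError("method_marks must be between 0 and max_marks")
    earned = method_marks if method_ok else 0
    if answer_ok:
        earned += max_marks - method_marks
    return earned


# A student who grasped the concept but slipped in the procedure,
# on a 4-mark question where 2 marks are (hypothetically) for method.
print(binary_mark(method_ok=True, answer_ok=False, max_marks=4))  # 0
print(partial_mark(method_ok=True, answer_ok=False,
                   max_marks=4, method_marks=2))                  # 2
```

Under the binary rule the student's grasp of the concept earns nothing. Under the split rule it shows up in the mark, and the feedback can then focus on the procedural gap.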

Context-specific leniency. A student who normally performs at a high level and makes an uncharacteristic slip deserves different feedback than a student who makes the same slip consistently. Human teachers calibrate this. AI doesn't — it doesn't know the student's history, doesn't know that this is a student whose family was dealing with something difficult last week, doesn't know the difference between "this child understands and was careless once" and "this child does not understand." These are different situations. The feedback should be different.

The mark that's technically wrong but right for the student. I gave a struggling P5 student a mark she hadn't quite earned on a problem where she'd made significant conceptual progress from her previous attempt. The delta mattered more than the absolute mark. I wanted the feedback to reinforce that progress. The AI gave her the technically correct lower mark. I overrode it. There's a reason teachers are humans.

What it revealed about my own marking habits

This is the part I wasn't expecting.

Going through the disagreements systematically — every case where my mark differed from the AI's — I identified patterns in my own marking that I wasn't conscious of. I give more generous partial credit to students whose working is presented neatly. I mark more strictly at the start of a set than at the end. I give more benefit of the doubt to students I perceive as high-effort, regardless of the specific answer in front of me.
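For anyone who wants to run the same check on their own marking, a rough sketch follows. It assumes a double-marking log with the script's position in the pile, a neatness flag, and both marks. The field names and the numbers are invented for illustration and are not results from the experiment.

```python
from statistics import mean

# Hypothetical double-marking log. All values are made up for this example.
log = [
    {"position": 2, "neat": True, "teacher": 3, "ai": 4},
    {"position": 5, "neat": False, "teacher": 2, "ai": 3},
    {"position": 20, "neat": True, "teacher": 4, "ai": 3},
    {"position": 28, "neat": False, "teacher": 3, "ai": 3},
]


def mean_gap(rows, key):
    """Average (teacher - AI) mark difference, grouped by key(row).

    A negative gap means the teacher was stricter than the AI.
    """
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row["teacher"] - row["ai"])
    return {group: mean(gaps) for group, gaps in groups.items()}


# Split a 30-script pile into halves by marking order.
by_half = mean_gap(log, lambda r: "early" if r["position"] <= 15 else "late")
by_neatness = mean_gap(log, lambda r: r["neat"])
print(by_half)
print(by_neatness)
```

With a real log, a gap that differs clearly between groups is a prompt to look at the individual scripts, not proof of bias on its own. Some of those differences will be legitimate contextual judgments.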

None of these biases are entirely indefensible. They're the kind of human adjustments that experienced teachers make. But seeing them made explicit — seeing the record of where I diverged from a consistent standard and why — was genuinely uncomfortable. It prompted me to be more deliberate about when I'm making a contextual judgment (which is legitimate) versus when I'm just being inconsistent (which is not).

The verdict

AI marking is a useful tool that should be supervised by a teacher who knows the students. Used without supervision, it will penalise creative thinking, undervalue partial understanding, and miss the contextual dimensions of assessment that good teaching requires. Used as a first pass by a teacher who reviews the outputs and applies judgment, it catches errors that human markers miss and provides consistent feedback at a speed and scale that isn't otherwise achievable.

The right model is not AI instead of the teacher. It's AI plus the teacher — with the teacher spending less time on routine error-checking and more time on the cases that require human judgment.

After a month, I'm still using it. I'm also still marking everything myself.

Tutor Wong's feedback is reviewed against each student's working — because catching the right answer via the wrong method is as important as catching the wrong answer.

Wong Sir
Chief Editor & Maths

Former Hong Kong primary maths teacher with 15 years in the classroom. Built Tutor Wong after seeing the same homework mistakes thousands of times. Believes every error is a learning opportunity — if you know where to look.

Disclaimer: The opinions expressed in this article are those of the author alone and do not represent the views or positions of 補習天王 (Tutor Wong), its founders, staff, or team. This article is provided for informational purposes only and does not constitute professional advice.