Gautam Thapar
8 minutes
Jun 25, 2024

Introduction
Every year, teachers in the U.S. spend over 400 million hours grading student work. As this workload grows, many educators are understandably turning to AI tools like ChatGPT and EnlightenAI to streamline the process. However, given the significant impact that grading and feedback have on student learning, it's crucial to choose a tool you can trust. Recent research has shown that ChatGPT, particularly the earlier, cheaper models that many AI grading tools are built on, is not yet reliable enough for grading.
At EnlightenAI, we understand these concerns, which is why we set out to rigorously test the accuracy and reliability of our AI grading assistant against both seasoned human graders and other AI technologies. We'll be releasing a white paper in the next few weeks sharing more detail on the findings we preview here.
The study setup
In our study, we selected 437 student work samples that had previously been evaluated by educators at DREAM Charter Schools in New York. To explore what grading might have looked like had DREAM used EnlightenAI from the start, we simulated that scenario: we input the context of each assignment into EnlightenAI, graded five papers ourselves, and then had it generate scores and feedback for the remaining 432. The entire grading process took less than an hour and used the exact same technology we offer to our users for free.
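For readers curious about the mechanics, here is a minimal sketch of the comparison workflow. Everything in it is hypothetical scaffolding: score_essay stands in for whatever grading tool is being tested, and the actual study used EnlightenAI's product itself rather than a script like this.

```python
# Hypothetical sketch of the study design: compare an automated grader's
# scores against existing human scores, after calibrating on a few papers.
import random

def evaluate_grader(samples, score_essay, n_calibration=5, seed=0):
    """Return a list of signed errors (ai_score - human_score).

    `samples` is a list of dicts with keys "text" (the student work)
    and "human_score" (the score a human educator assigned).
    `score_essay` is a stand-in for the grading tool under test.
    """
    pool = samples[:]
    random.Random(seed).shuffle(pool)

    # A handful of human-graded papers serve as calibration examples,
    # mirroring the five papers graded by hand before the tool scored
    # the remaining 432.
    calibration, held_out = pool[:n_calibration], pool[n_calibration:]

    errors = []
    for sample in held_out:
        ai_score = score_essay(sample["text"], examples=calibration)
        errors.append(ai_score - sample["human_score"])
    return errors
```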
The result: EnlightenAI met or exceeded the accuracy benchmarks for well-trained human scorers
Yes, you read that right. Researchers have taken a team of human scorers, put them through a 3-hour calibration training on scoring essays using a holistic rubric, and then measured their consistency with one another.
EnlightenAI met or beat these human benchmarks while exceeding ChatGPT's performance by a wide margin. For the first time ever, teachers and school leaders have access to a personalized scoring and feedback tool that competes with well-calibrated human graders as well as with state-of-the-art automated essay scoring tools.
How often did EnlightenAI give the exact same score as DREAM graders?
How effective was EnlightenAI at producing a ‘perfect match’ with its human counterparts? In assessments using a 5-point New York State constructed-response rubric (scoring range 0 to 4), EnlightenAI matched the exact score assigned by DREAM educators in 53% of cases. For comparison, in studies using a 6-point rubric (scoring range 1 to 6), well-trained human graders, after receiving 3 hours of training and undergoing continuous monitoring, agreed on the exact same score 51% of the time. ChatGPT performed worse than both, matching human scorers between 20% and 42% of the time on the same 6-point rubric. You can view the distribution of errors by EnlightenAI below. An error of 0 equates to a perfect match, while errors of +1 or -1 mean that EnlightenAI missed the mark by one point in either direction.
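For the technically inclined, the exact-match rate and the error distribution both reduce to simple arithmetic over paired scores. Here is a minimal sketch; the error values in it are made up for illustration and are not the study's data.

```python
from collections import Counter

def agreement_stats(errors):
    """Summarize signed errors (ai_score - human_score), one per paper."""
    distribution = Counter(errors)            # e.g. {0: ..., +1: ..., -1: ...}
    exact_match = distribution[0] / len(errors)  # error of 0 = perfect match
    return exact_match, distribution

# Illustrative only -- made-up errors, not the study data.
errors = [0, 0, 1, -1, 0, 2, 0, -1]
exact, dist = agreement_stats(errors)
print(f"exact match: {exact:.0%}")   # exact match: 50%
print(sorted(dist.items()))          # [(-1, 2), (0, 4), (1, 1), (2, 1)]
```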
How often did EnlightenAI assign scores within 1 point of DREAM graders?
Grading is an imperfect science, and on many rubrics score differences of a single point are largely subjective. A highly reliable and consistent grading system not only produces exact matches with calibrated human scorers, but also minimizes the size of the error when it fails to produce an exact match. On this measure, EnlightenAI really shines.
EnlightenAI assigned scores within one point of the human scores in 98% of assessments, and across all 437 papers it was never off by more than 2 points. Notably, this is an area where human scoring consistency lags behind: well-trained human graders assign scores within 1 point of each other just 74% of the time. Interestingly, ChatGPT outperforms humans on this measure, assigning scores within one point of human scorers 76% to 89% of the time, though its performance varies widely depending on the sample of student work and the task assigned. On these measures, EnlightenAI outperformed both humans and ChatGPT.
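This within-1-point figure, often called adjacent agreement, is computed the same way as the exact-match rate, just with a wider tolerance. A minimal sketch, again using made-up errors rather than the study's data:

```python
def adjacent_agreement(errors, tolerance=1):
    """Fraction of papers where |ai_score - human_score| <= tolerance."""
    return sum(abs(e) <= tolerance for e in errors) / len(errors)

# Illustrative only -- made-up signed errors, not the study data.
errors = [0, 0, 1, -1, 0, 2, 0, -1]
print(f"within 1 point: {adjacent_agreement(errors):.0%}")  # within 1 point: 88%
```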