Comparative Analysis of ChatGPT Versus Human Raters as an Automated Essay Scorer

Jeffrey Aaron Baldwin (Gwangju Institute of Science and Technology, Korea)
Natasha Powell (Pohang University of Science and Technology, Korea)

Abstract

As ChatGPT continues to reshape student engagement and instructional design, it is crucial to examine its practical implications. This study evaluates the effectiveness of ChatGPT-3 and ChatGPT-4 as potential automated essay scoring (AES) systems for English language teaching (ELT) practitioners. We evaluated 50 authentic student essays using three human raters and both ChatGPT-3 and ChatGPT-4, each performing three rounds of grading. We then conducted inter-rater reliability tests to determine whether the AI assessments could consistently replicate the evaluations of the human raters. The findings reveal that although AI-generated ratings occasionally aligned with those of individual human graders, the human raters were more consistent with one another. The AI systems, in contrast, struggled to provide consistently reliable grades within the same range as the human raters.
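For readers unfamiliar with inter-rater reliability testing, the following is a minimal sketch of one way such a comparison could be set up. It is not the authors' procedure: the abstract does not name the statistic used, so mean pairwise Pearson correlation is an assumption here, the scores are invented for illustration, and the names `pearson`, `mean_pairwise_agreement`, `human`, and `ai_rounds` are hypothetical. It also assumes the three rounds are repeated gradings by a single AI system.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_agreement(ratings):
    """Average Pearson r over every pair of score lists in `ratings`."""
    pairs = list(combinations(ratings.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical scores for five essays (NOT data from the study).
human = {
    "human_1": [70, 82, 65, 90, 75],
    "human_2": [72, 80, 68, 88, 74],
    "human_3": [69, 85, 63, 91, 77],
}
# Assumed: three repeated grading rounds by one AI system.
ai_rounds = {
    "round_1": [80, 78, 75, 82, 79],
    "round_2": [74, 85, 80, 79, 83],
    "round_3": [82, 76, 72, 88, 75],
}

print(f"human-human agreement: {mean_pairwise_agreement(human):.2f}")
print(f"AI round-to-round agreement: {mean_pairwise_agreement(ai_rounds):.2f}")
```

With ordinal rubric scores, a chance-corrected statistic such as weighted kappa or an intraclass correlation may be a better fit than Pearson's r, which measures linear association rather than absolute agreement.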

Research Paper (In person; 25 minutes)

Technology / Online Learning / AI / CALL / MALL

Primarily of interest to teachers of university students

About the Presenters

Jeffrey Baldwin is an instructor at Gwangju Institute of Science and Technology. He has over ten years of classroom experience as an English language instructor specializing in EAP. His research interests include English language for STEM courses and the integration of technology into language classrooms.

Natasha Powell teaches at Pohang University of Science and Technology. She has a background in engineering and design and an enduring interest in the design process. A language and technical writing instructor for over a decade, she has been combining these interests to find ways to enhance others’ education through design thinking.
