Jeffrey Aaron Baldwin (Gwangju Institute of Science and Technology, Korea)
Natasha Powell (Pohang University of Science and Technology, Korea)
Abstract
As ChatGPT continues to reshape student engagement and instructional design, it is crucial to examine its practical implications. In this study, we evaluated the effectiveness of ChatGPT-3 and ChatGPT-4 as potential automated essay scoring (AES) systems for English language teaching (ELT) practitioners. Three human raters and both ChatGPT models scored 50 authentic student writing samples, with each rater and model completing three rounds of grading. We then conducted inter-rater reliability tests to determine whether the AI assessments could consistently replicate the evaluations of the human raters. The findings reveal that although the AI-generated ratings occasionally aligned with those of individual human graders, the human evaluators' ratings were considerably more consistent with one another. In contrast, the AI systems struggled to provide consistently reliable grades within the range established by the human raters.
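The sketch below illustrates one way such an inter-rater reliability comparison could be run. The scores and rater labels are made-up placeholders rather than the study's data, and quadratic-weighted Cohen's kappa is only one common choice of agreement statistic for essay scoring; the abstract does not specify which test was used.

# Pairwise inter-rater agreement on essay scores: a minimal sketch.
# All scores and rater names below are hypothetical illustrations.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

# Hypothetical integer scores (e.g., a 1-6 band) for five essays.
ratings = {
    "human_1":   [4, 5, 3, 2, 5],
    "human_2":   [4, 4, 3, 2, 5],
    "human_3":   [5, 4, 3, 1, 5],
    "gpt4_run1": [3, 5, 4, 2, 4],
}

# Quadratic-weighted kappa for each pair of raters:
# 1.0 is perfect agreement, 0.0 is chance-level agreement.
for (name_a, scores_a), (name_b, scores_b) in combinations(ratings.items(), 2):
    kappa = cohen_kappa_score(scores_a, scores_b, weights="quadratic")
    print(f"{name_a} vs {name_b}: kappa = {kappa:.2f}")

In a setup like this, consistently higher kappa values among the human-rater pairs than between any human and the model runs would mirror the pattern the study reports.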
Research Paper (In person; 25 minutes)
Technology / Online Learning / AI / CALL / MALL
Primarily of interest to teachers of university students
About the Presenters
Jeffrey Baldwin is an instructor at the Gwangju Institute of Science and Technology. He has more than ten years of classroom experience as an English language instructor specializing in EAP. His research interests include English language instruction for STEM courses and the integration of technology into language classrooms.
Natasha Powell teaches at Pohang University of Science and Technology. She has an engineering and design background and a longstanding interest in the design process. As a language and technical writing instructor for more than a decade, she has been merging these interests to find ways to enhance others’ education through design thinking.