AI in radiography education: Evaluating multiple-choice questions difficulty and discrimination


Emekli E., Karahan B. N.

JOURNAL OF MEDICAL IMAGING AND RADIATION SCIENCES, vol. 56, no. 4, art. 101896, 2025 (ESCI)

Abstract

Background

High-quality multiple-choice questions (MCQs) are essential for effective student assessment in health education. However, the manual creation of MCQs is labour-intensive, requiring significant time and expertise. With the increasing demand for large and continuously updated question banks, artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT, has emerged as a potential tool for automating question generation. While AI-assisted question generation has shown promise, its ability to match human-authored MCQs in terms of difficulty and discrimination indices remains unclear. This study aims to compare the effectiveness of AI-generated and faculty-authored MCQs in radiography education, addressing a critical gap in evaluating AI's role in assessment processes. The findings will be beneficial for educators and curriculum designers exploring AI integration into health education.

Methods

This study was conducted in Turkey during the 2024–2025 academic year. Participants included 56 students enrolled in the first year of the Medical Imaging Programme. Two separate 30-question MCQ exams were developed—one generated by ChatGPT-4o and the other by a faculty member. The questions were derived from radiographic anatomy and positioning content, covering topics such as cranial, vertebral, pelvic, and lower extremity radiographs. Each exam contained six questions per topic, categorised into easy, medium, and difficult levels. A quantitative research design was employed. Students took both exams on separate days, without knowing the source of the questions. Difficulty and discrimination indices were calculated for each question, and student feedback was collected using a 5-point Likert scale to evaluate their perceptions of the exams.
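The abstract does not specify the exact psychometric procedure, so the following is a minimal sketch of the classical item-analysis approach that such difficulty and discrimination indices usually imply, assuming the conventional upper/lower 27 % group method. The function name `item_indices` and the response-matrix layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

def item_indices(responses: np.ndarray, group_fraction: float = 0.27):
    """Classical item difficulty (p) and discrimination (D) indices.

    responses: 2-D array of shape (n_students, n_items), 1 = correct, 0 = incorrect.
    group_fraction: share of examinees placed in the upper and lower groups
                    (0.27 is the conventional choice; the study does not report
                    which variant it used).
    """
    n_students, _ = responses.shape
    total_scores = responses.sum(axis=1)

    # Difficulty index: proportion of all students answering each item correctly.
    difficulty = responses.mean(axis=0)

    # Rank students by total score and take the top and bottom groups.
    n_group = max(1, int(round(group_fraction * n_students)))
    order = np.argsort(total_scores)
    lower = responses[order[:n_group]]
    upper = responses[order[-n_group:]]

    # Discrimination index: difference in proportion correct between groups.
    discrimination = upper.mean(axis=0) - lower.mean(axis=0)
    return difficulty, discrimination
```

Under this convention, an item with a discrimination index of roughly 0.30 or above is usually treated as acceptable, which is the sense in which the Results below report "acceptable" discrimination.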

Results

A total of 56 out of 80 eligible students participated, yielding a response rate of 70 %. The mean number of correct answers was similar for the ChatGPT exam (14.91 ± 4.25) and the human expert exam (15.82 ± 4.73; p = 0.089). Scores on the two exams showed a moderate positive correlation (r = 0.628, p < 0.001). The ChatGPT exam had an average difficulty index of 0.50 versus 0.53 for the human expert exam. Discrimination indices were acceptable for 73.33 % of ChatGPT questions and 86.67 % of human expert questions.
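The abstract does not name the tests behind the reported p-values, so the sketch below only illustrates one plausible way to reproduce this kind of comparison from per-student totals: a paired t-test for the mean difference and a Pearson correlation between the two exams. The array names and the choice of a paired t-test are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def compare_exams(chatgpt_scores: np.ndarray, human_scores: np.ndarray) -> dict:
    """Paired comparison and correlation of per-student exam totals.

    chatgpt_scores, human_scores: one total score per student on each 30-item
    exam (hypothetical inputs; the raw data are not given in the abstract).
    """
    # Paired test of the mean difference between the two exams.
    t_stat, p_paired = stats.ttest_rel(chatgpt_scores, human_scores)

    # Pearson correlation between the two sets of scores.
    r, p_corr = stats.pearsonr(chatgpt_scores, human_scores)

    return {"t": t_stat, "p_paired": p_paired, "r": r, "p_corr": p_corr}
```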

Conclusion

LLMs like ChatGPT can generate MCQs of comparable quality to human expert questions, though slight limitations in discrimination and difficulty alignment remain. These models hold promise for supplementing assessment processes in health education.