Artificial intelligence in radiology examinations: a psychometric comparison of question generation methods


Emekli E., Karahan B. N.

DIAGNOSTIC AND INTERVENTIONAL RADIOLOGY, p. 1, 2025 (SCI-Expanded, Scopus, TRDizin)

Abstract

PURPOSE

This study aimed to evaluate the usability of two artificial intelligence (AI)-based question generation methods, Chat Generative Pre-trained Transformer (ChatGPT)-4o (a non-template-based large language model) and a template-based automatic item generation (AIG) method, in the context of radiology education. The primary objective was to compare the psychometric properties, perceived quality, and educational applicability of the generated multiple-choice questions (MCQs) with those written by a faculty member.

METHODS

Fifth-year medical students who participated in the radiology clerkship at Eskişehir Osmangazi University were invited to take a voluntary 15-question examination covering musculoskeletal and rheumatologic imaging. The examination included five MCQs from each of three sources: a radiologist educator, ChatGPT-4o, and the template-based AIG method. Student responses were analyzed in terms of item difficulty and discrimination indices. Following the examination, students rated each question on a Likert scale for clarity, difficulty, plausibility of distractors, and alignment with learning goals. Correlations between students' examination performance and their theoretical and practical radiology grades were analyzed using Pearson's correlation coefficient.
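
As an illustration of the item analysis described above, the sketch below computes classical item statistics for a 0/1-scored response matrix. The abstract does not state which formulas were used, so the proportion-correct difficulty index, the upper-lower 27% discrimination index, and SciPy's Pearson correlation are assumptions chosen for illustration; the simulated response matrix and practical grades are hypothetical.

```python
# Minimal sketch of classical item analysis, assuming a 0/1-scored
# response matrix of shape (students x items). Hypothetical data only.
import numpy as np
from scipy import stats


def item_difficulty(responses: np.ndarray) -> np.ndarray:
    """Difficulty index p: proportion of students answering each item correctly."""
    return responses.mean(axis=0)


def item_discrimination(responses: np.ndarray, frac: float = 0.27) -> np.ndarray:
    """Discrimination index D: p(upper group) - p(lower group), where the
    groups are the top and bottom `frac` of students by total score."""
    total = responses.sum(axis=1)
    order = np.argsort(total)
    n = max(1, int(round(frac * responses.shape[0])))
    lower, upper = responses[order[:n]], responses[order[-n:]]
    return upper.mean(axis=0) - lower.mean(axis=0)


# Simulated example: 115 students x 15 items (values are not study data)
rng = np.random.default_rng(0)
responses = (rng.random((115, 15)) < 0.55).astype(int)

p = item_difficulty(responses)       # difficulty index per item
d = item_discrimination(responses)   # >= 0.2 acceptable, >= 0.4 very good

# Pearson correlation between exam scores and (hypothetical) practical grades
exam_scores = responses.sum(axis=1)
practical_grades = rng.normal(70, 10, size=115)
r, p_value = stats.pearsonr(exam_scores, practical_grades)
```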

RESULTS

A total of 115 students participated. Faculty-written questions had the highest mean correct response rate (2.91 ± 1.34), followed by the template-based AIG (2.32 ± 1.66) and ChatGPT-4o (2.30 ± 1.14) questions (P < 0.001). The mean difficulty index was 0.58 for the faculty-written questions and 0.46 for both the template-based AIG and ChatGPT-4o questions. Discrimination indices were acceptable (≥0.2) or very good (≥0.4) for the template-based AIG questions; among the ChatGPT-generated questions, four were acceptable and three were very good. Student evaluations of the individual questions and of the overall examination were favorable, particularly regarding question clarity and content alignment. Examination scores correlated weakly with practical examination performance (P = 0.041) but not with theoretical grades (P = 0.652).
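
As a rough consistency check, and assuming the reported difficulty index is simply the mean proportion of correct responses across each source's five items, the reported means line up with the difficulty indices:

$$
\frac{2.91}{5} \approx 0.58, \qquad \frac{2.32}{5} \approx 0.46, \qquad \frac{2.30}{5} = 0.46 .
$$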

CONCLUSION

Both the ChatGPT-4o and template-based AIG methods produced MCQs with acceptable psychometric properties. While faculty-written questions were the most effective overall, AI-generated questions, especially those produced by the template-based AIG method, showed strong potential for use in radiology education. However, the small number of items per method and the single-institution context limit the robustness and generalizability of the findings. These results should be regarded as exploratory, and further validation in larger, multicenter studies is required.

CLINICAL SIGNIFICANCE

AI-based question generation may support educators by enhancing efficiency and consistency in assessment item creation. These methods may complement traditional approaches to help scale up high-quality MCQ development in medical education, particularly in resource-limited settings; however, given the preliminary nature of the current findings, they should be applied with caution and under expert oversight until further evidence is available.