Validity of AI-generated multiple-choice questions in medical education: a systematic review

Kıyak, Yavuz; Kaya, Abdullah; Emekli, Emre

doi:10.1093/postmj/qgag057

Kıyak Y. S., Kaya A. B., Emekli E.

POSTGRADUATE MEDICAL JOURNAL, vol. 2026, pp. 1-9, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 2026
Publication Date: 2026
DOI: 10.1093/postmj/qgag057
Journal Name: POSTGRADUATE MEDICAL JOURNAL
Journal Indexes: Scopus, Science Citation Index Expanded (SCI-EXPANDED), EMBASE
Page Numbers: pp. 1-9
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Eskişehir Osmangazi University: Yes

Abstract

Large language models (LLMs) are increasingly used to generate multiple-choice questions (MCQs) in medical education. We conducted a systematic review following PRISMA 2020, searching PubMed, Web of Science, Scopus, and ERIC through 15 February 2026. Two reviewers independently screened 1352 records, extracted data, and assessed methodological quality using the Joanna Briggs Institute checklist. Findings were synthesized according to Messick’s five sources of validity evidence. Seventy-one studies from 24 countries were included. Most relied on expert review rather than learner-based testing. All studies reported content evidence, whereas fewer addressed relations to other variables (40/71), response process (35/71), internal structure (31/71), or consequences (25/71). Error rates ranged from <1% to 45%. Median item difficulty was 0.67 and discrimination 0.28, with reliability ranging from 0.51 to 0.81. Studies reported substantial efficiency gains, including up to 31-fold time savings. LLMs appear to be useful drafting tools, but current evidence does not yet support unsupervised use in summative assessment.
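
For readers unfamiliar with the psychometric indices quoted in the abstract, the following is a minimal sketch of how item difficulty, item discrimination, and a reliability coefficient are commonly computed from scored learner responses. It is not taken from the review: the response matrix is invented toy data, the function names are hypothetical, and the choice of point-biserial (item-rest) correlation for discrimination and KR-20 for reliability is an assumption, since the abstract does not say which formulas the included studies used.

```python
from statistics import mean, pstdev

# Toy data (invented): rows are examinees, columns are MCQ items, 1 = correct.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

def item_difficulty(responses, item):
    """Proportion of examinees who answered the item correctly."""
    return mean(row[item] for row in responses)

def item_discrimination(responses, item):
    """Point-biserial correlation between the item and the rest-of-test score."""
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        return 0.0  # correlation is undefined without variance; report 0 here
    m_item, m_rest = mean(item_scores), mean(rest_scores)
    cov = mean((i - m_item) * (r - m_rest)
               for i, r in zip(item_scores, rest_scores))
    return cov / (sd_item * sd_rest)

def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomously scored items."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    var_total = pstdev(totals) ** 2
    if var_total == 0:
        return 0.0
    sum_pq = 0.0
    for j in range(k):
        p = item_difficulty(responses, j)
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

if __name__ == "__main__":
    for j in range(len(responses[0])):
        print(j, item_difficulty(responses, j), item_discrimination(responses, j))
    print("KR-20:", kr20(responses))
```

Under this reading, a difficulty of 0.67 means roughly two thirds of examinees answered a typical item correctly, and a discrimination of 0.28 means item scores correlate modestly with overall performance.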