Validity of AI-generated multiple-choice questions in medical education: a systematic review


Creative Commons License

Kıyak Y. S., Kaya A. B., Emekli E.

POSTGRADUATE MEDICAL JOURNAL, vol. 2026, pp. 1-9, 2026 (SCI-Expanded, Scopus)

Abstract

Large language models (LLMs) are increasingly used to generate multiple-choice questions (MCQs) in medical education. We conducted a systematic review following PRISMA 2020, searching PubMed, Web of Science, Scopus, and ERIC through 15 February 2026. Two reviewers independently screened 1352 records, extracted data, and assessed methodological quality using the Joanna Briggs Institute checklist. Findings were synthesized according to Messick’s five sources of validity evidence. Seventy-one studies from 24 countries were included. Most relied on expert review rather than learner-based testing. All studies reported content evidence, whereas fewer addressed relations to other variables (40/71), response process (35/71), internal structure (31/71), or consequences (25/71). Error rates ranged from <1% to 45%. Median item difficulty was 0.67 and median discrimination was 0.28, with reliability ranging from 0.51 to 0.81. Studies reported substantial efficiency gains, including up to 31-fold time savings. LLMs appear to be useful drafting tools, but current evidence does not yet support unsupervised use in summative assessment.