Applications and Outcomes of Large‑Language‑Model‑Generated Feedback in Undergraduate Medical Education: A Scoping Review


Kıyak Y. S., İş-Kara T., Emekli E.

Medical Science Educator, vol.2026, pp.1-10, 2026 (ESCI, Scopus)

  • Publication Type: Article / Review
  • Volume: 2026
  • Publication Date: 2026
  • DOI: 10.1007/s40670-025-02621-3
  • Journal Name: Medical Science Educator
  • Journal Indexes: Scopus, Emerging Sources Citation Index (ESCI)
  • Page Numbers: pp.1-10
  • Affiliated with Eskişehir Osmangazi University: Yes

Abstract

Large language models (LLMs) are increasingly integrated into undergraduate medical education, particularly for generating learner feedback. While early LLM studies show promise, their educational impact and usage patterns remain unclear. The objective of this study is to systematically map how LLMs are being used to generate feedback for undergraduate medical students and to examine reported educational outcomes. A scoping review was conducted following Arksey and O’Malley’s framework and reported using PRISMA-ScR. We searched PubMed and Web of Science and identified 4325 records. After screening/review, 42 studies were included. Data were charted using a structured form and outcomes were classified using Kirkpatrick Levels. The 42 included studies originated mostly from Global North countries, with nearly all using OpenAI’s GPT models. Feedback was delivered in two main contexts: simulated clinical encounters and text-based assessment tasks. Only 8 studies (19%) used randomized controlled trial designs. Educational outcomes were: 22 studies (52%) included no student data (Level 0); 10 reported student reaction (Level 1); 10 assessed learning gains (Level 2); none addressed behavior change or patient-level effects (Levels 3–4). LLM-generated feedback often matched expert feedback in short-term effectiveness but showed variable accuracy. LLM-generated feedback is being explored across a range of educational settings, showing early signs of feasibility and perceived utility. However, the evidence base is limited in rigor and generalizability. Future research should assess behavioral and patient-level outcomes.