The Role of Large Language Models in Pediatric Emergency Medicine: Accuracy and Decision-Support Potential of ChatGPT



KUAS Ç., Günsoy E., ÇANAKÇI M. E., Çetin M., ERCAN V., ÖZAKIN E., et al.

Anatolian Journal of Emergency Medicine, vol. 9, no. 1, pp. 39-43, 2026 (Scopus, TR Dizin)

  • Publication Type: Article / Full Article
  • Volume: 9, Issue: 1
  • Publication Date: 2026
  • DOI: 10.54996/anatolianjem.1776853
  • Journal Name: Anatolian Journal of Emergency Medicine
  • Indexed In: Scopus, TR DİZİN (ULAKBİM)
  • Pages: pp. 39-43
  • Keywords: artificial intelligence, ChatGPT, decision support, large language models, pediatric emergency medicine
  • Affiliated with Eskişehir Osmangazi University: Yes

Abstract

Aim: To assess the accuracy and decision-support potential of ChatGPT in pediatric emergency practice by comparing its performance with human responses to structured multiple-choice questions.

Material and Methods: This cross-sectional study used 100 randomly selected questions from Pediatric Emergency Medicine: Just the Facts, Second Edition. The GPT-4o model was tested without prior prompts, and its answers were compared with the reference solutions and human accuracy rates reported in the source. Accuracy rates were calculated and compared using z-tests. Correlation between ChatGPT and human performance was analyzed with Spearman's test.

Results: ChatGPT answered 85 of 100 questions correctly, achieving an accuracy of 85% (95% CI: 78.0–92.0), which was significantly higher than the mean human accuracy of 54% (95% CI: 50.8–57.4) (p < 0.001). Topic-based analysis showed that ChatGPT's accuracy ranged from 75% to 100%, while human accuracy ranged from 30% to 65%, with higher variance. Among the 15 questions answered incorrectly by ChatGPT, 60% were case-based; the average correct human response rate for these was 35 ± 17%. A moderate positive correlation was observed between human and ChatGPT performance (ρ = 0.40, p < 0.001).

Conclusion: ChatGPT demonstrated high accuracy on structured pediatric emergency questions, suggesting potential as a supportive tool in decision-making and education. While its strengths lie in knowledge-based tasks, limitations remain in complex case-based reasoning. These findings indicate that LLMs could complement, but not replace, human expertise. Prospective studies are warranted to evaluate real-world integration in pediatric emergency care.
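The statistical comparison described above can be sketched with a pooled two-proportion z-test and a Wald confidence interval. This is a minimal illustration, not the authors' analysis code: the abstract does not report the human sample size, so the human denominator of 100 below is an assumption made only so the arithmetic runs; the 95% CI computed for ChatGPT (85/100) does reproduce the 78.0–92.0 interval quoted in the abstract.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF (via the error function)
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, pval

def wald_ci(x, n, z_crit=1.96):
    """Wald 95% confidence interval for a single proportion."""
    p = x / n
    half = z_crit * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# ChatGPT: 85/100 correct. Human accuracy 54%; n2 = 100 is an assumed
# denominator (the true human sample size is not given in the abstract).
z, p = two_proportion_z(85, 100, 54, 100)
lo, hi = wald_ci(85, 100)
print(f"z = {z:.2f}, p = {p:.1e}")
print(f"ChatGPT 95% CI: {100*lo:.1f}-{100*hi:.1f}%")  # matches 78.0-92.0 in the abstract
```

Under these assumptions the difference is highly significant (p < 0.001), consistent with the result reported in the abstract; the topic-level association would additionally require per-question human accuracy data to compute Spearman's ρ.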