The Role of Large Language Models in Pediatric Emergency Medicine: Accuracy and Decision-Support Potential of ChatGPT



KUAS Ç., Günsoy E., ÇANAKÇI M. E., Çetin M., ERCAN V., ÖZAKIN E., et al.

Anatolian Journal of Emergency Medicine, vol. 9, no. 1, pp. 39-43, 2026 (Scopus, TR Dizin)

  • Publication Type: Article / Full Article
  • Volume: 9, Issue: 1
  • Publication Date: 2026
  • DOI: 10.54996/anatolianjem.1776853
  • Journal Name: Anatolian Journal of Emergency Medicine
  • Indexed In: Scopus, TR DİZİN (ULAKBİM)
  • Pages: pp. 39-43
  • Keywords: artificial intelligence, ChatGPT, decision support, large language models, pediatric emergency medicine
  • Affiliated with Eskişehir Osmangazi University: Yes

Abstract

Aim: To assess the accuracy and decision-support potential of ChatGPT in pediatric emergency practice by comparing its performance with human responses to structured multiple-choice questions.

Material and Methods: This cross-sectional study used 100 randomly selected questions from Pediatric Emergency Medicine: Just the Facts, Second Edition. The GPT-4o model was tested without prior prompts, and its answers were compared with the reference solutions and human accuracy rates reported in the source. Accuracy rates were calculated and compared using z-tests. Correlation between ChatGPT and human performance was analyzed with Spearman's test.

Results: ChatGPT answered 85 of 100 questions correctly, achieving an accuracy of 85% (95% CI: 78.0–92.0), which was significantly higher than the mean human accuracy of 54% (95% CI: 50.8–57.4) (p < 0.001). Topic-based analysis showed that ChatGPT's accuracy ranged from 75% to 100%, while human accuracy ranged from 30% to 65%, with higher variance. Among the 15 questions answered incorrectly by ChatGPT, 60% were case-based; the average correct human response rate for these was 35 ± 17%. A moderate positive correlation was observed between human and ChatGPT performance (ρ = 0.40, p < 0.001).

Conclusion: ChatGPT demonstrated high accuracy on structured pediatric emergency questions, suggesting potential as a supportive tool in decision-making and education. While its strengths lie in knowledge-based tasks, limitations remain in complex case-based reasoning. These findings indicate that LLMs could complement, but not replace, human expertise. Prospective studies are warranted to evaluate real-world integration in pediatric emergency care.
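The statistical comparison described above can be sketched with a pooled two-proportion z-test and a Wald confidence interval. This is a minimal illustration, not the authors' analysis code: the abstract does not report the human sample size, so the human denominator of 100 below is an assumption made only so the arithmetic runs; the 95% CI computed for ChatGPT (85/100) does reproduce the 78.0–92.0 interval quoted in the abstract.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                      # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # two-sided p-value from the standard normal CDF (via the error function)
    pval = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, pval

def wald_ci(x, n, z_crit=1.96):
    """Wald 95% confidence interval for a single proportion."""
    p = x / n
    half = z_crit * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

# ChatGPT: 85/100 correct. Human accuracy 54%; n2 = 100 is an assumed
# denominator (the true human sample size is not given in the abstract).
z, p = two_proportion_z(85, 100, 54, 100)
lo, hi = wald_ci(85, 100)
print(f"z = {z:.2f}, p = {p:.1e}")
print(f"ChatGPT 95% CI: {100*lo:.1f}-{100*hi:.1f}%")  # matches 78.0-92.0 in the abstract
```

Under these assumptions the difference is highly significant (p < 0.001), consistent with the result reported in the abstract; the topic-level association would additionally require per-question human accuracy data to compute Spearman's ρ.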