Can large language models provide accurate and quality information to parents regarding chronic kidney diseases?

Naz, Rüya; Akacı, Okan; Erdoğan, Hakan; AÇIKGÖZ, AYFER

doi:10.1111/jep.14084

Journal of Evaluation in Clinical Practice, vol.30, no.8, pp.1556-1564, 2024 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 30 Issue: 8
Publication Date: 2024
DOI: 10.1111/jep.14084
Journal Name: Journal of Evaluation in Clinical Practice
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, CAB Abstracts, CINAHL, MEDLINE, Psycinfo
Pages: pp.1556-1564
Keywords: artificial intelligence, ChatGPT, child, chronic kidney disease, Copilot, Gemini
Eskişehir Osmangazi University Affiliated: Yes

Abstract

Rationale: Artificial intelligence (AI) large language models (LLMs) are tools capable of generating human-like text responses to user queries across topics. The use of these models in various medical contexts is currently being studied. However, their performance and content quality have not been evaluated in specific medical fields.

Aims and objectives: This study aimed to compare the performance of the AI LLMs ChatGPT, Gemini and Copilot in providing information to parents about chronic kidney diseases (CKD) and to compare the accuracy and quality of that information with that of a reference source.

Methods: In this study, 40 frequently asked questions about CKD were identified. The accuracy and quality of the answers were evaluated with reference to the Kidney Disease: Improving Global Outcomes guidelines. The accuracy of the responses generated by the LLMs was assessed using F1, precision and recall scores. The quality of the responses was evaluated using a five-point global quality score (GQS).

Results: ChatGPT and Gemini achieved high F1 scores of 0.89 and 1, respectively, in the diagnosis and lifestyle categories, demonstrating significant success in generating accurate responses. Furthermore, ChatGPT and Gemini were successful in generating accurate responses with high precision values in the diagnosis and lifestyle categories. In terms of recall values, all LLMs exhibited strong performance in the diagnosis, treatment and lifestyle categories. Average GQS values for the responses generated were 3.46 ± 0.55, 1.93 ± 0.63 and 2.02 ± 0.69 for Gemini, ChatGPT 3.5 and Copilot, respectively. In all categories, Gemini performed better than ChatGPT and Copilot.

Conclusion: Although LLMs provide parents with high-accuracy information about CKD, their use is limited compared with that of a reference source. The limitations in the performance of LLMs can lead to misinformation and potential misinterpretations. Therefore, patients and parents should exercise caution when using these models.
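
The abstract reports accuracy as precision, recall and F1 scores but does not say how individual answers were scored. The sketch below only illustrates how the three metrics relate to one another. It assumes that each answer is broken into statements counted as true positives (agree with the reference), false positives (not supported by the reference) or false negatives (reference points the answer omitted). The `Counts` class and the example numbers are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical tally for one model in one question category."""
    tp: int  # statements that agree with the reference source
    fp: int  # statements not supported by the reference source
    fn: int  # reference points missing from the answer


def precision(c: Counts) -> float:
    # Share of the model's statements that are correct.
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Counts) -> float:
    # Share of the reference points that the model covered.
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1(c: Counts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only; not data from the paper.
example = Counts(tp=6, fp=2, fn=1)
print(f"precision={precision(example):.2f} "
      f"recall={recall(example):.2f} "
      f"F1={f1(example):.2f}")
```

Under this reading, an F1 of 1 means no false positives and no false negatives were counted in a category, while a lower value can come from unsupported statements, omitted reference points, or both.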