Karahan B. N., Emekli E., Altın M. A.
EUROPEAN JOURNAL OF THERAPEUTICS, vol. 31, no. 1, pp. 28-34, 2025 (ESCI, TRDizin)
Abstract
Objectives: The aim of this study is to compare the ability of artificial intelligence-based chatbots, ChatGPT-4o and Claude 3.5, to interpret mammography images. The study focuses on evaluating their accuracy and consistency in BI-RADS classification and breast parenchymal type assessment. It also aims to explore the potential of these technologies to reduce radiologists’ workload and identify their limitations in medical image analysis.
Methods: A total of 53 mammography images obtained between January and July 2024 were analyzed, focusing on BI-RADS classification and breast parenchymal type assessment. The same anonymized mammography images were provided to both chatbots under identical prompts.
Results: BI-RADS classification accuracy ranged from 18.87% to 26.42% for ChatGPT-4o and was 18.7% for Claude 3.5. When BI-RADS categories were grouped into a benign group (BI-RADS 1, 2) and a malignant group (BI-RADS 4, 5), the combined accuracy for ChatGPT-4o was 57.5% in the initial evaluation and 55% in the second evaluation, compared to 47.5% for Claude 3.5. Breast parenchymal type accuracy was 30.19% and 22.64% for ChatGPT-4o and 26.42% for Claude 3.5.
Conclusions: The findings indicate that chatbots demonstrate limited accuracy and reliability in interpreting mammography images. These results highlight the need for further optimization, larger datasets, and advanced training processes to improve their performance in medical image analysis.
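The accuracy figures in the Results can be read as simple proportions of correct labels. The following is a minimal sketch of how such figures might be computed, not the authors' actual procedure. The function names, the toy labels, and the choice to count an out-of-group prediction as incorrect are all assumptions made for illustration; the abstract does not state how such cases were handled.

```python
from typing import Optional, Sequence

# Grouping described in the abstract: BI-RADS 1,2 = benign; BI-RADS 4,5 = malignant.
BENIGN = {1, 2}
MALIGNANT = {4, 5}


def accuracy(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Percentage of cases where the predicted category equals the reference."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)


def group(category: int) -> Optional[str]:
    """Map a BI-RADS category to 'benign' or 'malignant'; None if in neither group."""
    if category in BENIGN:
        return "benign"
    if category in MALIGNANT:
        return "malignant"
    return None


def grouped_accuracy(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Accuracy after grouping, restricted to cases whose reference is in a group.

    Assumption (hypothetical): a prediction outside both groups counts as wrong.
    """
    pairs = [
        (group(p), group(r))
        for p, r in zip(predicted, reference)
        if group(r) is not None
    ]
    if not pairs:
        raise ValueError("no reference labels fall in the benign/malignant groups")
    correct = sum(gp == gr for gp, gr in pairs)
    return 100.0 * correct / len(pairs)


# Toy labels, invented for illustration only (not study data).
reference = [1, 2, 4, 5, 3]
predicted = [2, 1, 5, 3, 3]
exact = accuracy(predicted, reference)
grouped = grouped_accuracy(predicted, reference)
```

With the toy labels, only one of five exact categories matches, but three of the four groupable cases land in the correct group, which illustrates why grouped accuracy can be much higher than exact BI-RADS accuracy. As a purely arithmetic note, 10 correct out of 53 rounds to 18.87%; the abstract itself reports percentages only, not counts.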