REVISTA ESPAÑOLA DE EDUCACIÓN MÉDICA, vol. 2025, no. 1, pp. 1-8, 2024 (Peer-Reviewed Journal)
We aimed to determine the quality of AI-generated (ChatGPT-4 and Claude 3) Script Concordance Test (SCT) items through an expert panel. We generated SCT items on abdominal radiology using a complex prompt in large language model (LLM) chatbots (ChatGPT-4 and Claude 3 (Sonnet), in April 2024) and evaluated the items' quality through an expert panel of 16 radiologists. The expert panel, which was blinded to the origin of the items and received them without modification, independently answered each item and assessed it against 12 quality indicators. Data analysis included descriptive statistics, bar charts comparing responses against accepted forms, and a heatmap showing performance on the quality indicators. The SCT items generated by the chatbots assess clinical reasoning rather than only factual recall (ChatGPT: 92.50%, Claude: 85.00%). The heatmap indicated that the items were generally acceptable, with most responses favorable across the quality indicators (ChatGPT: 71.77%, Claude: 64.23%). Comparison of the bar charts with acceptable and unacceptable forms revealed that 73.33% and 53.33% of the questions in the items can be considered acceptable for ChatGPT and Claude, respectively. The use of LLMs to generate SCT items can help medical educators by reducing the required time and effort. Although the prompt provides a good starting point, it remains crucial to review and revise AI-generated SCT items before educational use. The prompt and the custom GPT, "Script Concordance Test Generator", available at https://chatgpt.com/g/g-RlzW5xdc1-script-concordance-test-generator, can streamline SCT item development.
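For readers who want to script item generation rather than use the chatbot interface or the linked custom GPT, the sketch below shows one possible way to request an SCT item from an LLM chat API. It is a minimal illustration, not the study's pipeline: the prompt text is a simplified placeholder (the authors' complex prompt is available via the custom GPT above), and the model identifier and client setup are assumptions that would need to be adapted to the tools actually available.

```python
# Minimal sketch (not the authors' actual prompt or workflow): asking an LLM
# chat API to draft one Script Concordance Test (SCT) item on abdominal
# radiology. The prompt and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write one Script Concordance Test item on abdominal radiology. "
    "Include: a short clinical vignette, a plausible diagnostic hypothesis, "
    "a new piece of imaging information, and a 5-point Likert response scale "
    "(-2 = much less likely ... +2 = much more likely)."
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; substitute the model you have access to
    messages=[{"role": "user", "content": PROMPT}],
)

# The generated item should still be reviewed and revised by content experts
# before any educational use, as the study emphasizes.
print(response.choices[0].message.content)
```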