AI is good at passing professional medical exams, but when it talks to “patients” in the form of a chatbot, the story changes. What is it missing?
Researchers carried out a test to investigate AI’s ability to perform medical diagnoses. The “patients” were 2,000 medical cases drawn primarily from US medical board professional examinations.
On the professional exams themselves, the models did well. The problem was everything else, according to the study published this Thursday in Nature.
“Although large language models show impressive results on multiple-choice tests, their accuracy decreases significantly in dynamic conversations,” says Pranav Rajpurkar of Harvard University. “Models particularly struggle with open-ended diagnostic reasoning.”
Four leading large language models (OpenAI’s GPT-3.5 and GPT-4, Meta’s Llama-2-7b, and Mistral AI’s Mistral-v2-7b) performed considerably worse on conversation-based benchmarking than when they made diagnoses based on written case summaries.
When given multiple-choice options, GPT-4 correctly identified the disease in 82% of cases, but without them its accuracy fell to 49%.
When conversations between the patient and the chatbot were simulated, accuracy dropped even further, to 26%.
GPT-4 was the best-performing AI model in the study, with GPT-3.5 often coming in second place, the Mistral AI model sometimes coming in second or third, and Meta’s Llama model generally getting the lowest score.
Rajpurkar points out that real-world medical practice is “messier” than simulations, and the technology does not yet seem ready for real life, where there are “complex social and systemic factors”.
“The strong performance on our benchmark suggests that AI can be a powerful tool to support clinical work, but not necessarily a substitute for the holistic evaluation of experienced doctors,” the researcher concludes.
In 2021, the opening of a poetry course at the University of Porto for Medicine students was questioned.
The curricular unit’s description states that “it is not the objective of the course for students to learn how to write a poem”, and lists the capacity for “interpretation” and “interactivity” as the discipline’s main objectives.
Should we, after all, give AI some poetry lessons so that it can better interpret its patients’ messages, or are there fields in which humans are truly irreplaceable?
Carolina Bastos Pereira, ZAP //