A recent study by the MESH Incubator team at Massachusetts General Hospital examined the clinical reasoning capabilities of generative artificial intelligence (AI). It found that although AI is making rapid inroads into medicine, current models still show significant gaps in the chain of reasoning required for real-world clinical diagnosis. The findings, published in the journal "JAMA Network Open," indicate that today's mainstream models are not yet capable of independently performing clinical diagnostic tasks.
The study tested 21 large language models, including ChatGPT, DeepSeek, Claude, Gemini, and Grok, on 29 clinical cases with known diagnoses across multiple rounds. The experiment released patient symptoms, laboratory data, and imaging results in stages, closely simulating the dynamic way a physician works up a diagnosis. With complete information, every model gave the correct final diagnosis more than 90% of the time. In differential diagnosis, however, the core of clinical reasoning, more than 80% of the models performed poorly, failing to systematically weigh and rule out multiple candidate diseases.
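To make the staged-disclosure protocol concrete, the sketch below shows one plausible way such an evaluation loop could be structured in Python. Every name here (Case, query_model, the stage ordering) is an illustrative assumption, not the study's actual code or data format.

```python
# Hypothetical sketch of a staged-disclosure evaluation loop.
# Names and structure are assumptions for illustration only;
# the study's actual protocol and tooling are not published here.
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    stages: list[str]      # ordered disclosures: symptoms, labs, imaging
    final_diagnosis: str

def query_model(model: str, prompt: str) -> list[str]:
    """Placeholder: returns the model's ranked differential diagnosis."""
    raise NotImplementedError  # wire up to an actual model API

def evaluate_case(model: str, case: Case) -> dict[int, bool]:
    """Reveal information stage by stage, as in a real workup, and
    record whether the correct diagnosis appears in the model's
    differential at each stage."""
    results = {}
    context = ""
    for i, stage in enumerate(case.stages):
        context += "\n" + stage  # cumulative disclosure, never retracted
        differential = query_model(
            model,
            f"Patient information so far:{context}\n"
            "List your ranked differential diagnosis.",
        )
        results[i] = any(
            case.final_diagnosis.lower() in d.lower() for d in differential
        )
    return results
```

A loop like this separates the two behaviors the study distinguishes: accuracy at the final stage (all information revealed) versus the quality of the differential at earlier, information-poor stages.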
To quantify this gap, the research team introduced the PrIME-LLM composite evaluation index, covering the full workflow from initial diagnosis through test ordering to treatment planning. Composite scores across the models ranged from 64% to 78%, suggesting that AI is better at "revealing answers" when information is complete than at open-ended logical reasoning when information is incomplete.
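The sketch below illustrates, under stated assumptions, how a stage-weighted composite score in the spirit of the PrIME-LLM index might be computed. The stage names and equal weighting are assumptions; the published index may define and weight its components differently.

```python
# Hypothetical composite score over diagnostic-workflow stages.
# Stage names and equal weights are illustrative assumptions.
STAGES = ("initial_diagnosis", "test_selection", "treatment_plan")

def composite_score(stage_scores: dict[str, float],
                    weights: dict[str, float] | None = None) -> float:
    """Combine per-stage accuracies (each in [0, 1]) into a single
    percentage, the kind of figure behind the 64%-78% range reported."""
    weights = weights or {s: 1 / len(STAGES) for s in STAGES}
    return 100 * sum(stage_scores[s] * weights[s] for s in STAGES)

# Example: strong final-diagnosis accuracy but weak downstream
# reasoning pulls the composite well below the headline 90%+.
print(composite_score({"initial_diagnosis": 0.92,
                       "test_selection": 0.70,
                       "treatment_plan": 0.55}))  # ~72.3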
Although the newest generation of models handles complex data markedly better than earlier versions, the research team stressed that large language models remain auxiliary tools, and deploying them in clinical practice without professional supervision still carries risk. The finding sets a sober benchmark for AI in healthcare: moving from simple "result fitting" to genuine "logical reasoning" will be the critical threshold for medical large models to reach professional-grade application.
