Is AI better than humans at performing meta-analyses?
A research team from the UNIGE Faculty of Medicine and the HUG has developed a method combining several artificial intelligence systems to automate the sorting of scientific articles, a crucial part of systematic medical reviews. This innovative approach achieves an accuracy of over 97%, surpassing human accuracy. These results can be found in the journal Research Synthesis Methods.
Meta-analyses are essential in research because they gather and combine the results of several studies on the same subject, providing a more reliable and comprehensive view than a single isolated study. By increasing the size of the data analysed, they strengthen the statistical power and accuracy of the conclusions. The results of such studies form the basis of the standards and regulations that apply in public health and clinical practice. However, the work of identifying and selecting studies that address a specific research question can take up to several years.
The challenges of automation in medical research
"Although large language models (LLMs) seem promising for automating certain tasks, their tendency to produce erroneous information or 'hallucinate' can compromise the reliability of the results," explains Denis Mongin, a researcher specialising in data science at the Department of Medicine of the Faculty of Medicine at UNIGE, who directed this work. "That's why we decided to compare the results of several LLMs, based on the principle that the accuracy of the results increases when the responses from several LLMs are similar."
To test the AIs, the research team developed a system in which a response is accepted only if several models give the same result. They then tested this method on 1,020 abstracts of rheumatology articles, using different combinations of relatively small, openly accessible models: Llama 3 from Meta, Granite from IBM, Qwen from Alibaba, Ministral from Mistral, Yi from 01.ai, Gemma 2 from Google, DeepSeek from DeepSeek AI, Phi 3 from Microsoft, and Aya Expanse from Cohere for AI.
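The agreement rule can be illustrated with a minimal sketch. This is not the authors' implementation: the vote labels, the `min_agreement` threshold, and the fallback to human review are illustrative assumptions about how a "response accepted only if several models agree" rule might work.

```python
from collections import Counter

def consensus(votes, min_agreement):
    """Return the majority screening decision if at least
    `min_agreement` models agree on it; otherwise flag the
    abstract for human review. `votes` is one decision per LLM."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else "human_review"

# Hypothetical decisions from five LLMs screening one abstract:
votes = ["include", "include", "exclude", "include", "include"]
print(consensus(votes, min_agreement=4))  # four of five agree: "include"

# Without sufficient agreement, the abstract is left to humans:
print(consensus(["include", "exclude", "include", "exclude"], 3))
```

Requiring several independent models to converge on the same answer is what mitigates individual-model hallucinations: an erroneous response from one LLM is unlikely to be reproduced by the others.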
Results that exceed expectations
The same abstracts had previously been evaluated by two people separately, then by a third in case of disagreement, in accordance with meta-analysis best practices. "Our system achieved an accuracy of over 97%, surpassing the human benchmark," enthuses Delphine Courvoisier, professor at the UNIGE Faculty of Medicine and epidemiologist at the HUG Healthcare Quality Department. "And it identified some errors in the initial human evaluations."
This breakthrough could radically transform the systematic review process. By automating the initial sorting of articles without any loss of quality, it would allow researchers to focus their expertise on complex analyses and ambiguous cases. The time required to complete systematic reviews, currently several months or even years, could be significantly reduced, enabling much faster publication of data that can guide decisions in public health and clinical practice.