Using machine learning as a study design filter for systematic reviews of RCTs




Poster session 2 Thursday: Evidence synthesis - methods / improving conduct and reporting


Thursday 14 September 2017 - 12:30 to 14:00


All authors in correct order:

Marshall I1, Noel-Storr A2, Kuiper J3, Thomas J4, Wallace BC5
1 King's College London, United Kingdom
2 University of Oxford, United Kingdom
3 Doctor Evidence, Netherlands
4 Institute of Education, UCL, United Kingdom
5 Northeastern University, United States
Presenting author and contact person

Presenting author:

Iain Marshall

Abstract text
Background: Machine learning (ML) algorithms have proven highly accurate for identifying randomised controlled trials (RCTs), but string-based study-design filters remain the predominant approach used in practice for systematic reviews and guidelines.

Objectives: We compared the performance of ML models for identifying RCTs against a range of traditional database study-design filters, including the Cochrane Highly Sensitive Search Strategy (HSSS) and the PubMed publication type tag.
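For contrast with the ML models, a string-based study-design filter amounts to a fixed set of term matches applied to each record. The sketch below is a deliberately simplified illustration: the term list and the function name are hypothetical, and it does not reproduce the Cochrane HSSS or any published filter.

```python
import re

# Hypothetical term list for illustration only; real filters such as the
# Cochrane HSSS combine field-specific terms and are not reproduced here.
FILTER_TERMS = ("randomized", "randomised", "placebo", "randomly", "trial")

# Match any term as a whole word, ignoring case.
_PATTERN = re.compile(r"\b(" + "|".join(FILTER_TERMS) + r")\b", re.IGNORECASE)


def string_filter(title: str, abstract: str) -> bool:
    """Return True if any filter term occurs in the title or abstract."""
    return bool(_PATTERN.search(f"{title} {abstract}"))
```

A filter of this kind yields a single yes/no decision per record, whereas a scoring model can be thresholded at any desired sensitivity level.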

Methods: We evaluated Support Vector Machines (SVMs), Convolutional Neural Networks (CNNs), and ensemble approaches. We trained these models on titles and abstracts labelled as part of the Cochrane Crowd project. We evaluated the models on the Clinical Hedges dataset, which comprises 49 028 articles manually labelled on the basis of their full texts.
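As an illustration of this kind of evaluation, the sketch below computes sensitivity, specificity, and precision for a binary RCT classifier at a chosen score threshold, and the area under the ROC curve using the rank-based (Mann-Whitney) formulation. The function names, the threshold, and the toy scores and labels are assumptions made for illustration; they are not taken from the study's software or data.

```python
from typing import List, Tuple

# Toy data (hypothetical): model scores and reference labels (1 = RCT).
TOY_SCORES = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
TOY_LABELS = [1, 1, 0, 1, 0, 0]


def confusion_counts(scores: List[float], labels: List[int],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) when scores >= threshold are called RCTs."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_rct = score >= threshold
        if predicted_rct and label == 1:
            tp += 1
        elif predicted_rct and label == 0:
            fp += 1
        elif not predicted_rct and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sens_spec_prec(scores: List[float], labels: List[int],
                   threshold: float) -> Tuple[float, float, float]:
    """Sensitivity, specificity, and precision at one threshold.

    Assumes both classes are present and at least one record is retrieved.
    """
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """AUC as the probability that a random RCT outscores a random non-RCT
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Sweeping the threshold over all observed scores traces out the full ROC curve, which is what allows a scoring model to be compared with a fixed filter at matched sensitivity.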

Results: ML discriminates between RCTs and non-RCTs better than widely used traditional database search filters at all sensitivity levels (see Figure); our best-performing model achieved the best published results to date for ML in this task (area under the receiver operating characteristic curve 0.987, 95% CI 0.984 to 0.989). The best-performing model (a hybrid SVM model incorporating information from the PT tag) improved specificity compared with the Cochrane HSSS search filter at identical sensitivity (difference in specificity +10.8%, 95% CI 10.5% to 11.2%), which corresponds to a precision of 21.0% versus 12.5%, and a number “needed to screen” of 4.8 versus 8.0. We have made software implementing these ML approaches freely available under the GPL v3.0 license (at
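The precision and number-needed-to-screen figures above are linked by a simple reciprocal: if a fraction p of retrieved records are true RCTs, then on average 1/p records must be screened to find one RCT. A minimal sketch of that arithmetic, using the precision values reported in the Results (the function name is hypothetical):

```python
def number_needed_to_screen(precision: float) -> float:
    """Average number of retrieved records screened per true RCT found."""
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in the interval (0, 1]")
    return 1.0 / precision


# Precision values as reported in the Results.
ml_nns = number_needed_to_screen(0.210)    # hybrid SVM model
hsss_nns = number_needed_to_screen(0.125)  # Cochrane HSSS filter

print(round(ml_nns, 1), round(hsss_nns, 1))  # prints "4.8 8.0"
```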

Conclusions: ML performs better than traditional database filters, with improved specificity at all sensitivity levels. We recommend that users of the medical literature move toward using ML as the method for study-design filtering.