Background: Systematic reviews (SRs) are widely used to support the development of clinical guidelines and other documents driving decisions in healthcare. Suboptimal SRs can be harmful, and a reliable assessment of their validity is essential. A widely used tool is the AMSTAR checklist, while the ROBIS tool was recently launched specifically to assess the risk of bias of SRs.
Objectives: To evaluate the inter-rater reliability (IRR) of AMSTAR and ROBIS for individual domains and overall judgment, the concurrent validity, and the time required to apply the tools.
Methods: Five raters with different levels of expertise assessed 31 SRs on pharmacological thromboprophylaxis using AMSTAR and ROBIS. For each question, domain and overall risk of bias, we calculated Fleiss' κ for multiple-rater IRR (for AMSTAR, low risk of bias: eight or more yes-answers; high risk of bias: three or fewer yes-answers). We assessed the concurrent validity of the two tools by comparing domains addressing similar items (Table). We recorded the time to complete each tool as the mean time spent by each reviewer on each review. We classified agreement as: poor (≤0.00), slight (0.01-0.20), fair (0.21-0.40), moderate (0.41-0.60), substantial (0.61-0.80), almost perfect (0.81-1.00).
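The two computations described above can be sketched in a few lines. This is an illustrative implementation, not the authors' actual analysis code: `fleiss_kappa` computes Fleiss' κ from a subjects-by-categories count matrix, and `classify_agreement` applies the agreement bands listed in the Methods.

```python
# Illustrative sketch (assumption: not the study's analysis code) of
# Fleiss' kappa for multiple raters and the agreement bands from the Methods.

def fleiss_kappa(counts):
    """counts: one row per subject; each row holds the number of raters
    assigning that subject to each category (all row sums are equal)."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_total = n_subjects * n_raters

    # Per-subject observed agreement P_i, then mean observed agreement
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement P_e from marginal category proportions
    n_categories = len(counts[0])
    p_j = [sum(row[j] for row in counts) / n_total for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

def classify_agreement(kappa):
    """Map a kappa value to the agreement labels used in the Methods."""
    if kappa <= 0.00:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial"),
                         (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
```

For example, perfect agreement among three raters on two subjects (`[[3, 0], [0, 3]]`) yields κ = 1.00, and κ = 0.65, as reported for overall risk of bias, falls in the "substantial" band.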
Results: The kappa for agreement on individual domains ranged from 0.28 to 1.00 for AMSTAR and from 0.49 to 0.61 for ROBIS; the kappa for overall risk of bias was 0.65 for both tools (Figure). We found a fair correlation between AMSTAR and ROBIS in the overall judgment (ρ=0.38), mainly because of discordant classifications of SRs at intermediate risk of bias. The mean time to complete ROBIS was about twice that of AMSTAR (mean±standard deviation: 12.6±4.6 vs. 5.8±31.9; mean difference: 6.7±3.2). Concurrent validity on single domains will be presented.
Conclusions: We found similar, substantial IRR for both tools in the judgment of overall risk of bias. ROBIS required more time to complete. The low correlation between AMSTAR and ROBIS may reflect differences in raters' judgments or genuine differences in what the tools aim to measure (methodological quality vs. risk of bias and appropriateness).