Assessing heterogeneity in replication research

Abstract: Recent empirical research has questioned the replicability of published findings in psychology. However, analyses of these studies have proceeded with conflicting definitions of what it means for a finding to replicate. In this paper, we use a meta-analytic approach to highlight different ways to define “replication failure,” and argue that analyses can focus on exploring variation among replication studies or on assessing whether their results contradict the findings of the original study. We then apply this framework to experiments that have been subject to systematic replications in psychology. Among these experiments, we find that fewer studies conclusively failed to replicate than previous reporting would suggest. This finding must be interpreted with an important caveat, however: even the most powerful tests for replication failure tend to have low power for these data, so for the majority of experiments these analyses are inconclusive. Further, while common interpretations of the replication crisis in psychology involve underpowered initial studies overestimating an effect, in half of the findings in these data the pattern is reversed: the original study understates the magnitude of the effect relative to the replications and would have been well powered to detect the effects estimated by the replication studies. We conclude by suggesting that efforts to assess replication would benefit from further methodological work on designing replication studies to ensure that analyses are sufficiently sensitive.
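The abstract does not spell out the tests it refers to. As a point of reference only, a standard meta-analytic formalization of “replication failure” (assumed here, not taken from the paper itself) treats the replication estimates as draws from a random-effects model and tests the heterogeneity parameter with Cochran's Q:

$$
\hat\theta_i \sim N(\theta_i, v_i), \qquad \theta_i \sim N(\mu, \tau^2), \qquad i = 1, \dots, k,
$$

where $\hat\theta_i$ is the effect estimate from study $i$ with known sampling variance $v_i$. Exact replication corresponds to $\tau^2 = 0$, i.e., all studies estimate a common effect $\mu$, and this hypothesis can be tested with

$$
Q = \sum_{i=1}^{k} w_i \left(\hat\theta_i - \bar\theta_w\right)^2, \qquad w_i = \frac{1}{v_i}, \qquad \bar\theta_w = \frac{\sum_{i} w_i \hat\theta_i}{\sum_{i} w_i},
$$

which is approximately $\chi^2_{k-1}$ when $\tau^2 = 0$. With the small number of replication studies typical of these designs, $Q$ has low power against modest heterogeneity, which is consistent with the abstract's caveat that many of the analyses are inconclusive.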