Power Failure
Recent meta-science studies find that psychological research is typically four times as powerful as medical research, and that its median power is twice that of economics.1, 2, 3 Yet only 8% of psychological studies are adequately powered.
Statistical power is the probability that a study of a given precision (or sample size) will find a statistically significant effect. For a half century following Cohen, adequate power (80%) has been deemed a prerequisite of reliable research (see, for example, the APA Publication Manual). With statistical power so low, how is it possible that the majority of published findings are statistically significant?4 Something does not add up.
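To make the 80% benchmark concrete, the following minimal sketch (purely illustrative; not code from the studies cited above) approximates the power of a two-sided, two-sample test of a standardized mean difference d using the normal approximation:

```python
# A minimal sketch: approximate power of a two-sided, two-sample test of a
# standardized mean difference d with n subjects per group, using the normal
# approximation power ~= Phi(d*sqrt(n/2) - z_crit).
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)           # two-sided critical value (1.96 for alpha = .05)
    ncp = abs(d) * (n_per_group / 2) ** 0.5    # approximate noncentrality of the test statistic
    return norm.cdf(ncp - z_crit)              # chance of a statistically significant result

# Cohen's 'small' effect (d = 0.2) with 50 subjects per group is badly underpowered:
print(round(approx_power(0.2, 50), 2))    # ~0.17, far short of the 80% benchmark
print(round(approx_power(0.2, 394), 2))   # ~0.80: roughly 394 per group are needed
```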
The Incredible Shrinking Effect
When 100 highly regarded psychological experiments were replicated by the Open Science Collaboration, the average effect size shrank by half.5 It shrank by half yet again when 21 experiments published in Nature and Science were replicated.6 Size matters. In economics, a simple weighted average of adequately powered results is typically one-half the size of the average reported economic effect, and one-third of all estimates are exaggerated by a factor of four.3 However, low power and research inflation are the least of the social sciences' replication problems.
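As a rough sketch of how such a weighted average of adequately powered estimates can be computed (in the spirit of the WAAP estimator discussed in reference 3, where the 2.8 cutoff corresponds to requiring at least 80% power against the overall weighted average; the data below are invented for illustration):

```python
# A sketch of the 'weighted average of adequately powered' (WAAP) idea.
# The data here are invented for illustration only.
import numpy as np

def waap(estimates, ses):
    estimates, ses = np.asarray(estimates, float), np.asarray(ses, float)
    w = 1.0 / ses ** 2
    wls_all = np.sum(w * estimates) / np.sum(w)   # inverse-variance average of all estimates
    powered = ses < abs(wls_all) / 2.8            # studies adequately powered to detect it
    if not powered.any():
        return wls_all, 0                         # fall back if no study is adequately powered
    wp, ep = w[powered], estimates[powered]
    return np.sum(wp * ep) / np.sum(wp), int(powered.sum())

est = [0.45, 0.60, 0.15, 0.22, 0.55, 0.10]   # hypothetical reported effects
se  = [0.20, 0.25, 0.05, 0.08, 0.30, 0.04]   # and their standard errors
waap_est, k_powered = waap(est, se)
print(f"simple mean = {np.mean(est):.2f}, WAAP = {waap_est:.2f} from {k_powered} powered studies")
```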
On the Unreliability of Science—Heterogeneity
The recent article "What meta-analyses reveal about the replicability of psychological research" demonstrates that high heterogeneity is the more stubborn barrier to successful replication in psychology.1 Even if a replication study were huge, involving millions of experimental subjects and thereby having essentially 100% power, typical heterogeneity (74%) makes close replication unlikely. Even then, the probability that the replicated experiment will roughly reproduce some previous study's 'small' effect (that is, one between 0.2 and 0.5 standardized mean differences, SMD) is still less than 50%.1 Heterogeneity is the variation among 'true' effects; in other words, it measures the differences in experimental results that are not attributable to sampling error. Supporters of the status quo are likely to point out that the high heterogeneity this survey uncovers includes 'conceptual' as well as 'direct' replications. True enough, but large-scale replication efforts that closely control experimental and methodological factors (e.g. the Registered Replication Reports and the Many Labs projects) still report sufficient heterogeneity to make close replication unlikely.1,7
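A back-of-the-envelope illustration of this point, using assumed values (a typical sampling SE of 0.21 SMD and a mean true effect of 0.35 SMD; these numbers are illustrative, not the survey's data, and the 74% figure is read as the share of variation due to heterogeneity, I-squared):

```python
# Back-of-the-envelope only: typical SE (0.21 SMD) and mean true effect (0.35 SMD)
# are assumed for illustration; 74% is read as I-squared. The between-study SD is
# then tau = SE*sqrt(I2/(1-I2)). Even an infinitely precise replication draws its
# *true* effect from N(mu, tau^2), so the chance that it lands in the 'small' band
# (0.2 to 0.5 SMD) is well below 50%.
from scipy.stats import norm

I2, typical_se, mu = 0.74, 0.21, 0.35
tau = typical_se * (I2 / (1 - I2)) ** 0.5                  # between-study standard deviation

p_small = norm.cdf(0.5, mu, tau) - norm.cdf(0.2, mu, tau)  # P(0.2 < replicated true effect < 0.5)
print(f"tau = {tau:.2f}; probability of a 'small' replicated effect = {p_small:.2f}")  # ~0.33
```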
This is not to argue that large-scale, careful replications should not be undertaken. Indeed, they should be, because they often provide the best scientific evidence available to the social and medical sciences. Unfortunately, such large-scale, multi-lab replication projects are feasible in only a relatively small number of research areas where studies can be conducted cheaply and quickly.
Enter Meta-Analysis
For some decades, meta-analyses that collect and analyze all relevant research evidence have been regarded as the best summaries of research evidence and the very foundation of evidence-based practice (think of the Cochrane and Campbell Collaborations). As reported in a recent Science article, however, meta-analysis has also been dragged into the credibility crisis and can no longer be relied upon to settle all disputes. After all, that's a pretty high bar! Unfortunately, conventional meta-analysis is easily overwhelmed by high heterogeneity when it is accompanied by some degree of selective reporting for statistical significance. Even when the investigated social science phenomenon does not truly exist, conventional meta-analysis is virtually guaranteed to report a false positive.8 And no single publication-bias correction method is entirely satisfactory.8,9
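A small simulation, constructed only for illustration (it is not the analysis of reference 8), shows how this can happen: the mean true effect is exactly zero, true effects are heterogeneous, and only positive, statistically significant estimates are reliably reported.

```python
# Illustrative simulation: true mean effect is zero, true effects are heterogeneous,
# and insignificant (or wrong-signed) results are mostly unreported. A conventional
# random-effects meta-analysis of the reported studies then 'finds' an effect far
# more often than the nominal 5% of the time.
import numpy as np

rng = np.random.default_rng(1)

def one_meta(k=30, tau=0.3, se=0.2, p_report_insig=0.2):
    reported = []
    while len(reported) < k:
        theta = rng.normal(0.0, tau)                  # heterogeneous true effects, mean zero
        est = rng.normal(theta, se)                   # a study's estimate
        if est / se > 1.96 or rng.random() < p_report_insig:
            reported.append(est)                      # other results are mostly left unreported
    ests = np.array(reported)
    w = 1.0 / (se ** 2 + tau ** 2)                    # random-effects weight (tau treated as known, for simplicity)
    pooled = ests.mean()                              # equal weights, so the pooled estimate is the average
    pooled_se = (1.0 / (k * w)) ** 0.5
    return abs(pooled / pooled_se) > 1.96             # does the meta-analysis declare an effect?

false_positive_rate = np.mean([one_meta() for _ in range(1000)])
print(f"false-positive rate: {false_positive_rate:.0%}")  # far above the nominal 5%
```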
The Way Forward
With crisis comes opportunity. In a recent authoritative survey of the credibility of economics research, Christensen and Miguel (2018) emphasize transparency and replication as the way forward.10 We believe that the current discussion of 'crisis' can be transformed into a credibility revolution if a consensus forms around a few feasible steps that harden and clarify our research practices. For the sake of brevity, permit us to sketch these steps:
1. Carefully distinguish between exploratory and confirmatory research studies.
Both types of investigation are quite valuable. The central problem of the decades-long statistical significance controversy is that exploratory research is presented in terms of statistical hypothesis testing as if it were confirmatory. Yet early research that identifies where, how, and under which conditions some new phenomenon is expressed is essential; if only it could be presented and published for what it is, without the pretense of hypothesis testing. After some years of exploration, a meta-analysis could be used to assess whether the phenomenon in question merits further confirmatory study. If so, a confirmatory research stage should be undertaken in which adequately powered, pre-registered studies that employ classical hypothesis testing are highly valued and encouraged. At this stage, transparency would be especially helpful.
2. Support large-scale, pre-registered replications of mature areas of research.
Large-scale, pre-registered replications are especially valuable during the confirmatory stage of social science research. These efforts have already begun and need to be further encouraged and supported, through greater funding and through the prestigious publication of multi-authored reports in our best scholarly journals.
3. Emphasize practical significance over statistical significance.
Much of the debate across the social sciences would disappear if researchers agreed upon how large an effect needs to be in order to be worthy of scientific or practical notice, i.e. 'practical significance.' The problem is that the combination of high heterogeneity and some selective reporting of statistically significant findings (because the current paradigm values them) makes it impossible for social science research, no matter how rigorous and well conducted, to distinguish some quite small effect from nothing. Identifying 'very small' effects reliably is simply beyond social science. However, meta-analysis can often reliably distinguish a 'practically significant' effect (say, 0.1 SMD or a 0.1 elasticity) from a zero effect, even under the severe challenges of high heterogeneity and notable selective reporting bias (see the rough sketch below).
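The rough sketch below uses purely illustrative numbers and, for simplicity, ignores selective-reporting corrections; it only shows how pooled precision grows with the number of studies, so that a 0.1 SMD effect becomes detectable while a 'very small' effect does not.

```python
# Illustrative numbers only: with typical per-study SE and between-study SD tau,
# the approximate random-effects pooled SE is sqrt((se^2 + tau^2)/k), so pooled
# precision grows with the number of studies k.
from scipy.stats import norm

def meta_power(effect, k, se=0.2, tau=0.35, alpha=0.05):
    pooled_se = ((se ** 2 + tau ** 2) / k) ** 0.5        # approximate random-effects pooled SE
    return norm.cdf(abs(effect) / pooled_se - norm.ppf(1 - alpha / 2))

print(round(meta_power(0.10, k=150), 2))   # ~0.86: 0.1 SMD is detectable with many studies
print(round(meta_power(0.02, k=150), 2))   # ~0.09: a 'very small' effect is effectively invisible
```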
With a few modest, but real, changes, genuine
scientific progress can be made.
Researchers of the World, unite.
—T.D. Stanley and Chris Doucouliagos
References:
1. Stanley, T.D., Carter, E.C., and Doucouliagos, H. (2018). What meta-analyses reveal about the replicability of psychological research. Psychological Bulletin. http://psycnet.apa.org/doi/10.1037/bul0000169
2. Lamberink, H.J., et al. (2018). Statistical power of clinical trials increased while effect size remained stable: an empirical analysis of 136,212 clinical trials between 1975 and 2014. Journal of Clinical Epidemiology, 102: 123–128.
3. Ioannidis, J.P.A., Stanley, T.D., and Doucouliagos, H. (2017). The power of bias in economics research. The Economic Journal, 127: F236–F265. doi:10.1111/ecoj.12461
4. Brodeur, A., Le, M., Sangnier, M., and Zylberberg, Y. (2016). Star Wars: The empirics strike back. American Economic Journal: Applied Economics, 8: 1–32.
5. Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251): aac4716. doi:10.1126/science.aac4716
6. Camerer, C.F., et al. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour. https://www.nature.com/articles/s41562-018-0399-z
7. McShane, B.B., et al. (2018). Large scale replication projects in contemporary psychological research. The American Statistician, forthcoming.
8. Stanley, T.D. (2017). Limitations of PET-PEESE and other meta-analysis methods. Social Psychological and Personality Science, 8: 581–591.
9. McShane, B.B., Böckenholt, U., and Hansen, K.T. (2016). Adjusting for publication bias in meta-analysis: An evaluation of selection methods and some cautionary notes. Perspectives on Psychological Science, 11: 730–749.
10. Christensen, G. and Miguel, E. (2018). Transparency, reproducibility, and the credibility of economics research. Journal of Economic Literature, 56: 920–980.
Past Discussions on Common Pitfalls in Conducting Meta-Regression Analysis in Economics
- using t-values as effect sizes
- reducing economic effects or tests to categories of statistical significance for the purpose of probit (or logit) meta-regression analysis (MRA)
There is a consensus among MAER-Net members that these are 'pitfalls' in the sense that they are often misinterpreted and/or poorly modelled. MAER-Net does not wish to 'prohibit' the use of logit/probit or t-values in meta-analysis. We merely caution those who choose to use them to exercise greater care in interpreting the results of their MRAs.
Why issue this caution? A full justification is beyond the scope of any internet post; however, a brief sketch might look something like the following.
Probit/Logit MRAs:
- reducing any statistical effect or test to crude categories such as 'statistically significant and positive,' 'statistically insignificant,' and 'statistically significant and negative' (or similar ones) necessarily loses much of the information needed to reliably identify the main drivers of reported research findings. This loss of information is often fatal and almost always unnecessary.
- doing so inextricably conflates selective reporting bias with evidence of a genuine economic effect. It is not possible to separate whether a statistically significant result is due to the researchers' desire to find such an effect or to some underlying genuine economic phenomenon. Logit/probit MRAs are just as likely to be identifying factors related to bad science as factors related to the economic phenomenon under investigation. Yet this is not how logit/probit MRAs are interpreted; rather, they are claimed to identify structure in the underlying economic phenomenon.
- better statistical methods are almost always available whenever the research being systematically reviewed is the result of a statistical test or estimate.
- conducting these logit/probit MRAs is little more than sophisticated 'vote counting,' which is considered bad practice in the broader community of meta-analysts. For example, Hedges and Olkin (1985) prove that vote counts are more likely to come to the wrong conclusion as more research accumulates, just the opposite of the desirable statistical property of consistency (a simple simulation follows this list).
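The following quick simulation (an illustration, not Hedges and Olkin's derivation) shows the inconsistency of vote counting: with a genuine but modest effect and each study having only 35% power, a majority-vote rule concludes 'no effect' ever more confidently as the number of studies grows.

```python
# Illustration of the inconsistency of vote counting: a real effect exists, but each
# study has only 35% power, so fewer than half of studies are significant and the
# 'majority vote' goes the wrong way more often as the literature grows.
import numpy as np

rng = np.random.default_rng(0)
power = 0.35                                       # each study's chance of a significant result

for k in (10, 100, 1000):
    votes = rng.random((20_000, k)) < power        # simulate many literatures of k studies each
    wrong = np.mean(votes.sum(axis=1) < k / 2)     # literatures where the 'vote' says no effect
    print(f"k = {k:4d}: vote counting reaches the wrong conclusion {wrong:.0%} of the time")
```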
t-values:
- When t-values are used as the dependent variable, all moderator variables need to be divided by SE (equivalently, multiplied by precision, 1/SE). If they are not, their MRA coefficients reflect differential publication bias, not some genuine economic effect.
- t-values cannot be considered an 'effect size.' Doing so inevitably runs into any number of paradoxes and problems of interpretation. As long as the underlying economic effect is anything other than zero, t-values must increase proportionally with sqrt(n) and with precision (1/SE), as the brief sketch below illustrates. So which value of precision or sqrt(n) should the meta-analyst choose? The perfect study has precision and sqrt(n) approaching infinity; but then the t-value will also approach infinity, even when the effect is tiny. Nor is the average t-value a meaningful summary of a research literature. For example, suppose the average t-value of the price elasticity of prescription drugs is -2 (or -1, -3, or any number). Can we infer that demand for prescription drugs is highly sensitive (or insensitive) to prices? Depending on the typical sample size, any of these average t-values is consistent with either an elastic or an inelastic demand for prescription drugs. Worse still, any average absolute t-value a little larger or smaller than 2 is compatible with a perfectly inelastic demand for prescription drugs combined with some degree of selection for a statistically significant price effect. Nothing about this important economic phenomenon can be inferred from the typical, or the ideal, t-value.
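The brief sketch below, using an invented example, illustrates why a t-value is not an effect size: a fixed, economically negligible elasticity produces an arbitrarily large t-statistic as the sample grows.

```python
# An invented example: hold a tiny price elasticity fixed and let the sample grow;
# the t-statistic increases with sqrt(n) without bound.
import numpy as np

elasticity = -0.05              # a trivially small price elasticity
sd_per_obs = 1.0                # assumed residual SD, so SE = sd_per_obs / sqrt(n)

for n in (100, 10_000, 1_000_000):
    se = sd_per_obs / np.sqrt(n)
    t = elasticity / se
    print(f"n = {n:>9,}: t = {t:7.1f}  (the elasticity is still only {elasticity})")
```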