P-hacking and scientific reproducibility


The reproducibility crisis in science

Recent public reports have underscored a crisis of reproducibility in numerous fields of science. Here are just a few of the recent cases that have attracted widespread publicity:

  • In 2012, Amgen researchers reported that they were able to reproduce fewer than 10 of 53 cancer studies.
  • In 2013, in the wake of numerous recent instances of highly touted pharmaceutical products failing or disappointing when fielded, researchers in the field began promoting the All Trials movement, which would require participating firms and researchers to post the results of all trials, successful or not.
  • In March 2014, physicists announced with great fanfare that they had detected evidence of primordial gravitational waves from the “inflation” epoch shortly after the big bang. However, other researchers subsequently questioned this conclusion, arguing that the twisting patterns in the data could be explained more easily by dust in the Milky Way.
  • Also in 2014, backtest overfitting emerged as a major problem in computational finance and is now thought to be a principal reason why investment funds and strategies that look good on paper often fail in practice.
  • In 2015, in a study by the Reproducibility Project, only 39 of 100 psychology studies could be replicated, even after taking extensive steps such as consulting with the original authors.
  • Also in 2015, a study by the U.S. Federal Reserve was able to reproduce only 29 of 67 economics studies.
  • In an updated 2018 study by the Reproducibility Project, only 14 out of 28 classic and contemporary psychology experimental studies were successfully replicated.
  • In 2018, the Reproducibility Project was able to replicate only five of ten key studies in cancer research, with three inconclusive and two negative; eight more studies are in the works but incomplete.


The p-test, which was introduced by the British statistician Ronald Fisher in the 1920s, assesses whether the results of an experiment are more extreme than what one would expect given the null hypothesis. The smaller this p-value is, argued Fisher, the greater the likelihood that the null hypothesis is false. However, even Fisher never intended for the p-test to be a single figure of merit; rather it was intended to be part of a continuous, nonnumerical process that combined experimental data with other information to reach a scientific conclusion.

Indeed, the p-test, used alone, has significant drawbacks. To begin with, the typically used level of p = 0.05 is not a particularly compelling result. In any event, it is highly questionable to reject a result if its p-value is 0.051, yet accept it as significant if its p-value is 0.049.

The prevalence of the classic p = 0.05 value has led to the egregious practice that Uri Simonsohn of the University of Pennsylvania has termed p-hacking: proposing numerous varied hypotheses until a researcher finds one that meets the 0.05 level. Note that this is a classic multiple testing fallacy of statistics: perform enough tests and one is bound to pass any specific level of statistical significance. Such suspicions are justified given the results of a study by Jelte Wicherts of the University of Amsterdam, who found that researchers whose results were close to the p = 0.05 level of significance were less willing to share their original data than were others who had stronger significance levels (see also this summary from Psychology Today).
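The multiple testing effect can be made concrete with a short calculation. The following Python sketch is illustrative only: it assumes that the tests are independent, that every null hypothesis is actually true, and that each p-value is therefore uniformly distributed between 0 and 1. The function names are hypothetical and do not come from any of the studies cited here.

```python
import random


def chance_of_false_hit(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent tests of a
    true null hypothesis yields p < alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_hit(num_tests: int, alpha: float = 0.05,
                       trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability.

    Assumption: under a true null hypothesis a p-value is uniform on [0, 1],
    so each test is modelled as one uniform random draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A "hit" is a trial in which at least one test looks significant.
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20, 60):
        exact = chance_of_false_hit(k)
        estimate = simulate_false_hit(k)
        print(f"{k:3d} tests: exact {exact:.3f}, simulated {estimate:.3f}")
```

Under these assumptions, a researcher who tries 20 unrelated hypotheses has roughly a 64% chance of finding at least one that passes the 0.05 level, even though no real effect exists.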

Along this line, it is clear that a sole focus on p-values can muddle scientific thinking, confusing significance with size of the effect. For example, a 2013 study of more than 19,000 married persons found that those who had met their spouses online were less likely to divorce (p < 0.002) and more likely to have higher marital satisfaction (p < 0.001) than those who met in other ways. Impressive? Yes, but the divorce rate for online couples was 5.96%, only slightly down from 7.67% for the larger population, and the marital satisfaction score for these couples was 5.64 out of 7, only slightly better than 5.48 for the larger population (see also this Nature article).
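To see how significance and effect size come apart, consider the sketch below. It applies a textbook two-proportion z-test (with a normal approximation) to the two divorce rates quoted above, using hypothetical equal group sizes. It does not attempt to reproduce the analysis of the 2013 study, whose group sizes and statistical adjustments are not given here; the function name and sample sizes are assumptions made for illustration.

```python
import math


def two_proportion_p_value(rate_a: float, rate_b: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-proportion z-test with equal group sizes.

    Assumption: a normal approximation with a pooled variance estimate.
    """
    pooled = (rate_a + rate_b) / 2.0  # pooled rate for equal-sized groups
    std_err = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n_per_group)
    z = (rate_a - rate_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    online, other = 0.0596, 0.0767  # divorce rates quoted in the text
    print(f"absolute difference: {100 * (other - online):.2f} percentage points")
    for n in (500, 5_000, 50_000):  # hypothetical group sizes
        p = two_proportion_p_value(online, other, n)
        print(f"n = {n:6d} per group: p = {p:.2g}")
```

The difference between the two rates is the same in every row, about 1.7 percentage points; only the sample size changes. With a few hundred persons per group the difference is not significant at the 0.05 level, while with thousands per group it becomes highly significant. The p-value thus reflects the amount of data as much as the size of the effect.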

Perhaps a more important consideration is that p-values, even if reckoned properly, can easily mislead. Consider the following example, which is taken from a paper by David Colquhoun: Imagine that we wish to screen persons for potential dementia. Let’s assume that 1% of the population has dementia, and that we have a test for dementia that is 95% accurate (i.e., it is accurate with p = 0.05), in the sense that 95% of persons without the condition will be correctly diagnosed, and assume also that the test is 80% accurate for those who do have the condition. Now if we screen 10,000 persons, 100 presumably will have the condition and 9900 will not. Of the 100 who have the condition, 80% or 80 will be detected and 20 will be missed. Of the 9900 who do not, 95% or 9405 will be cleared, but 5% or 495 will incorrectly test positive. So out of the original population of 10,000, 575 will test positive, but 495 of these 575, or 86%, are false positives.

Needless to say, a result in which 86% of the positive tests are false (that is, a false discovery rate of 86%) is disastrously high. Yet this is entirely typical of many instances in scientific research where naive usage of p-values leads to surprisingly misleading results.
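The arithmetic of the screening example can be checked with a few lines of Python. This is a minimal sketch that uses only the figures stated above (a population of 10,000, 1% prevalence, 80% detection among those with the condition, and 95% correct clearance among those without it); the function and key names are hypothetical.

```python
def screening_outcomes(population: int, prevalence: float,
                       sensitivity: float, specificity: float) -> dict:
    """Expected counts for a screening test applied to a whole population.

    sensitivity: fraction of persons with the condition who test positive.
    specificity: fraction of persons without the condition who test negative.
    """
    with_condition = population * prevalence
    without_condition = population - with_condition

    true_positives = with_condition * sensitivity
    false_negatives = with_condition - true_positives
    true_negatives = without_condition * specificity
    false_positives = without_condition - true_negatives

    all_positives = true_positives + false_positives
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "true_negatives": true_negatives,
        "false_positives": false_positives,
        "all_positives": all_positives,
        # Fraction of positive tests that are wrong.
        "false_share_of_positives": false_positives / all_positives,
    }


if __name__ == "__main__":
    result = screening_outcomes(10_000, prevalence=0.01,
                                sensitivity=0.80, specificity=0.95)
    for name, value in result.items():
        print(f"{name}: {value:.2f}")
```

Changing the prevalence argument shows why the outcome is so poor: when the condition is rare, even a small error rate among the many persons without the condition swamps the true detections.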

The American Statistical Association takes aim at p-values

In light of such problems, the American Statistical Association (ASA) has issued a Statement on statistical significance and p-values. The ASA did not recommend that p-values be banned outright, but it strongly encouraged that the p-test be used in conjunction with other methods and not solely relied on as a measure of statistical significance, and certainly not viewed as a probability value. The ASA’s key points are the following:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

The ASA statement concludes,

“Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”

Jeff Leek’s recommendations

Along this line, as Jeff Leek wrote on the Simply Statistics blog, the problem is not that p-values are fundamentally unusable; rather, it is that they are all too often used by persons who are not highly skilled in statistics. Indeed, any given statistic can be misused in the wrong hands. Thus the only long-term solution is “to require training in both statistics and data analysis for anyone who uses data but particularly journal editors, reviewers, and scientists in molecular biology, medicine, physics, economics, and astronomy.”

Leek proposed the following alternatives (these items and the listed comments are condensed from his article):

  1. Statistical methods should only be chosen and applied by qualified data analytic experts. Comment: This is the best solution, but it may not be practical given the shortage of such experts.
  2. Where possible, report the full prior, likelihood and posterior details, together with results of a sensitivity analysis. Comment: In cases where this can be done it provides much more information about the model and uncertainty; but it does require advanced statistical expertise.
  3. Replace p-values with a direct Bayesian approach, reporting credible intervals and Bayes estimators. Comment: In cases where the model can be properly fit, this provides scientific measures such as confidence intervals; however it requires advanced expertise, and the results are still sample size dependent.
  4. Replace p-values with likelihood ratios. Comment: In cases where this can be done it would reduce the confusion with the null hypothesis; however likelihood ratios can be exactly computed only with relatively simple models and designs.
  5. Replace p-values with confidence intervals. Comment: Confidence intervals are sample size dependent and can be misleading for large samples.
  6. Replace p-values with Bayes factors. Comment: These require advanced expertise, depend on sample size and may still lead to unwanted false positives.

No royal road

In summary, while p-values are often misused and are potentially misleading, they are entrenched in the literature and are unlikely to disappear anytime soon. What’s more, even the more sophisticated alternatives are prone to misuse.

Thus the only real long-term solution is for all scientific researchers and others who perform research work to be rigorously trained in modern statistics and how best to use these tools. Special attention should be paid to showing how statistical tests can mislead when used naively. Note that this education needs to be done not only for students and others entering the research work force, but also for those who are already practitioners in the field. This will not be easy but must be done.

Such considerations bring to mind a historical anecdote about the great Greek mathematician Euclid. According to an ancient account, when Pharaoh Ptolemy I of Egypt grew frustrated at the degree of effort required to master geometry, he asked his tutor Euclid whether there was some easier path. Euclid is said to have replied, “There is no royal road to geometry.”

The same is true for data analysis: there is no “royal road” to reliable, reproducible, statistically rigorous research. Many fields of modern science are in the midst of a data revolution. To list a few: astronomy (digital telescope data), astrophysics (cosmic microwave data), biology (genome sequence data), business (retail purchase and computer click data), chemistry (molecular simulation data), economics (econometric data), engineering (design and simulation data), environmental science (earth-orbiting satellite data), finance (market transaction data) and physics (accelerator data).

Those researchers who learn how to deal effectively with this data, producing statistically robust results, will lead the future. Those who do not will be left behind.
