April 7, 2016 – Boulder, CO, US
The American Statistical Association’s (ASA) recently released draft statement on statistical significance and p-values (http://amstat.tandfonline.com/doi/pdf/10.1080/00031305.2016.1154108) is an excellent touchstone of caution about an all-too-real problem: statistical significance and p-values, usually derived from null hypothesis testing (NHT), can be terribly misleading when interpreted as measures of evidence. The less an analytical problem resembles a controlled laboratory experiment, the less useful and more misleading p-values become. The problem is exacerbated by poor analytical techniques and by both intentional and unintentional selectivity in analysis and reporting. “P-hacking” is the practice of intentionally manipulating the data and analysis until you hit on a “significant” p-value. The term “garden of forking paths” refers to the myriad decisions made in data processing, measurement, and analysis that unintentionally and indirectly influence results (http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf).
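To see why forking paths matter, consider a minimal sketch (the numbers are hypothetical, not from any study in the ASA statement). Under a true null hypothesis, a p-value is uniformly distributed on [0, 1]; if an analyst explores, say, 20 plausible analyses of pure noise and reports the smallest p-value, a “significant” result appears far more often than the nominal 5%:

```python
import random

random.seed(1)

def min_p_of_k_null_tests(k):
    """Under a true null, each p-value is uniform on [0, 1].
    An analyst who tries k analytical paths reports the smallest one."""
    return min(random.random() for _ in range(k))

trials = 100_000
k = 20  # hypothetical number of forking-path analyses per "study"
false_positives = sum(min_p_of_k_null_tests(k) < 0.05 for _ in range(trials))

# With 20 looks at pure noise, roughly 1 - 0.95**20 (about 64%) of
# "studies" find at least one p < 0.05, despite no real effect at all.
print(false_positives / trials)
```

The point is not that every researcher runs 20 explicit tests; even honest, one-analysis-at-a-time exploration of data can walk these paths implicitly.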
The problem is that researchers, reviewers, publishers, and journalists are far too willing to accept “p < 5%” results as evidence for some claim. This leads to erroneous public policy as well as misleading reports in prestigious journals and in the popular press. In their review of cancer-risk studies of common foods, Schoenfeld and Ioannidis (2013) summarize the very problem the ASA hopes to address: “Associations with cancer risk or benefits have been claimed for most food ingredients. Many single studies highlight implausibly large effects, even though evidence is weak.” Some substances do cause cancer, but robust identification of cancer risks is often beyond the capabilities of many studies.
Given the complexity of human biology, psychology, and society, is it really likely that moderate consumption of a single ingredient, hearing a certain word, or standing in a particular posture for 3 minutes can substantively change our lives? Any effects these things might have on us will almost surely be small and tangled up with thousands of other influences that vary from day to day and moment to moment. Sophisticated problems demand not only sophisticated analytical methods but, more importantly, a cumulative body of scientific evidence supporting an effect (rather than acceptance of a “significant” result from one study in isolation).
Measuring small effects within noisy systems turns p-values on their heads. Referring to one particular study that presented p-values as evidence for questionable results, social scientist and statistician Andrew Gelman likened the situation to “…trying to use a bathroom scale to weigh a feather—and the feather is … in the pouch of a kangaroo that is vigorously jumping up and down”. Gelman and Carlin (2014) show that in these situations the inadequacies of p-values and NHT all but ensure that reported effects are exaggerated just by chance.
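The exaggeration mechanism can be sketched with a toy simulation (all parameters are hypothetical, in the spirit of Gelman and Carlin's Type M error argument, not their actual analysis). Suppose the true effect is a tiny 0.1 against measurement noise with standard deviation 1, studied with a modest sample. Conditioning on statistical significance, the surviving estimates overshoot the truth severalfold:

```python
import math
import random

random.seed(1)

true_effect = 0.1   # a "feather" of an effect (hypothetical value)
sd, n = 1.0, 50     # noisy measurements, modest sample size
se = sd / math.sqrt(n)  # standard error of the sample mean

significant_estimates = []
for _ in range(50_000):
    # Each simulated study estimates the effect from n noisy observations;
    # the estimate is the sample mean, distributed N(true_effect, se).
    estimate = random.gauss(true_effect, se)
    if abs(estimate) / se > 1.96:  # the study reaches p < 0.05
        significant_estimates.append(abs(estimate))

# Average "significant" estimate relative to the truth: the studies that
# clear the significance bar report an effect several times too large.
exaggeration = (sum(significant_estimates) / len(significant_estimates)) / true_effect
print(round(exaggeration, 1))
```

Only the studies lucky enough to draw an unusually large noise term cross the 1.96 threshold, so selecting on significance is selecting on exaggeration, exactly the bathroom-scale-and-kangaroo problem.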
Ultimately, the problem addressed by the ASA is not simply about p-values, but rather about inappropriate application of some statistical methods, erroneous interpretation of their results, and over-reliance on a single statistical measure to accept or reject the importance of a scientific study. Many alternatives to NHT exist, but none is a silver bullet. As the ASA makes clear in their new statement: “The validity of scientific conclusions … depends on more than the statistical methods themselves. Appropriately chosen techniques, properly conducted analyses and correct interpretation of statistical results also play a key role in ensuring that conclusions are sound and that uncertainty surrounding them is represented properly.”