Scientific discoveries are groundbreaking: they can cure disease, improve our standard of living, and help us better understand the world we live in. We are all probably familiar with the general idea of the scientific method that is used to validate theories: ask a question, develop a hypothesis, test the hypothesis, and draw conclusions based on the experiment. The truth is, it is very easy, perhaps even justifiable, for the general public to take this process for granted. People have a vested interest in findings that can potentially make big impacts on their own lives, but why should they care about the technical aspects behind the big, shiny discovery? The researchers are the professionals, after all, so why not just let them handle it? Perhaps such an attitude stems from an assumption of perfect information, an assumption that, however plainly false, may nonetheless foster a culture of deferring to the experts. Turning a researcher’s hunch into a scientifically supported hypothesis and ultimately a relevant piece of public knowledge is a very arduous and imperfect process. If findings are blindly accepted without a general understanding of how the information came to be discovered, mistaken information can be accepted just as blindly, and the quality of scientific knowledge in the mainstream suffers.
It is in this context that I will examine the role statistics plays in the scientific method, as a key cog in determining the strength and validity of a scientific claim. Statistical methods manage to be vital, ubiquitous, misused, and misunderstood all at the same time, so a very important piece of this discussion involves the dangers caused by an incorrect or incomplete understanding of statistical processes. The concept of statistical significance is especially relevant, due to the common usage of the phrase and its imprecise nomenclature.
But to fully comprehend the more general issues surrounding statistics in science, it is important to first have a clear understanding of the basic application of null hypothesis significance testing, which is the most utilized method of statistical testing in areas such as biology (McDonald, 2014) and psychology (Cohen, 2010). McDonald (2014) explains that the general idea is to analyze the data collected from an experiment as it relates to the prediction, or null hypothesis, which was made in the hypothesis stage of the scientific method. In some types of testing, the researcher is hoping to show that the data gives credence to the null hypothesis (Cohen, 2010), but for the most prevalent form the logic is reversed and less straightforward.
Say, for example, that I have a standard six-sided die I suspect may be rigged. I want to conduct a study and run a test in the hopes of supporting my claim. So, I roll the die one hundred times and note how many times each number was rolled. A number summarizing this data is called the test statistic (Cohen, 2010). Now, if I wish to use this test statistic to support my claim, I must compare it to some other set of data in order to gain statistical insight into whether the die is rigged or not (Cohen, 2010). It is difficult to specify an expected pattern for a loaded die, because “loaded” does not imply any particular behavior; there are infinitely many ways a rigged die could behave. Instead, I can compare my test statistic to the behavior of a fair die, since the probabilities on a fair die are exact (1/6 for each side), and thus its potential behavior can be calculated precisely. Hence the prediction, or null hypothesis, that I am testing against in this case is that my die is fair. Cohen (2010) called this type of null hypothesis a nil hypothesis, because it asserts that the alleged rigging actually has no effect on the behavior of the die. Obviously, this is trivial and not what I was excited to try to show, so what is the point? The key is that the specific nature of my nil hypothesis allows me to calculate a null hypothesis distribution (NHD), which indicates the probability of getting each possible test statistic under the assumption that my null hypothesis is true (Cohen, 2010). I now have something that I can compare my data to.
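To make this concrete, here is a brief Python sketch, my own illustration rather than anything drawn from Cohen or McDonald, of how such a null hypothesis distribution could be approximated by simulation. The chi-square goodness-of-fit statistic is assumed as the test statistic, since the text above does not commit to a particular one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rolls = 100
expected = np.full(6, n_rolls / 6)   # a fair die: 1/6 chance for each face

def chi_square_stat(counts):
    # Sum of (observed - expected)^2 / expected over the six faces.
    return np.sum((counts - expected) ** 2 / expected)

# Roll a *fair* die 100 times, over and over, to see how the test statistic
# behaves when the null hypothesis is true.
null_distribution = np.array([
    chi_square_stat(np.bincount(rng.integers(1, 7, size=n_rolls), minlength=7)[1:])
    for _ in range(10_000)
])
# null_distribution now approximates the NHD: the spread of test statistics
# a fair die would produce across many hypothetical experiments.
```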
From there, as Cohen (2010) explained, “the next step is to calculate the probability of obtaining one’s actual test statistic from the NHD, or one that is even further from the mean of the NHD” (para. 1). In my example, suppose my rolls turned up one particular number 20 percent more often than the others. Since the null hypothesis of a fair die expects that on average each number will turn up with equal frequency, I would look at the null hypothesis distribution for the probability of obtaining one number 20 percent more often than the others, or even more frequently. In other words, what are the chances that a fair die would deviate from its expected behavior by at least as much as the die I am testing did? This probability is denoted as a p value (Cohen, 2010). Clearly, a small p value would indicate that the test statistic from the experiment data would be a huge outlier if it were governed by the null hypothesis; hence it would lead the researcher to believe that the null hypothesis may be false (McDonald, 2014). If I decide that my null hypothesis is false, I can reject it and adopt an alternative hypothesis (which I would have stated before the experiment), which is more like my real hunch about the data (Cohen, 2010). Since my null hypothesis for the die rolling was nil, suggesting no difference in behavior between my die and a fair die, my alternative hypothesis will predict that there is indeed some fundamental difference (McDonald, 2014). It is now hopefully evident how the use of a nil hypothesis allows researchers to compare their test data to a clearly specified distribution, with the hope of rejecting the hypothesis to make way for the (hopefully noteworthy) alternative.
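Continuing that sketch, the counts below are hypothetical, chosen so that one face comes up roughly 20 percent more often than a fair die would expect over one hundred rolls, and the p value is computed with a standard chi-square goodness-of-fit test (again, an assumed choice).

```python
import numpy as np
from scipy import stats

# Hypothetical counts from 100 rolls: face 1 appeared 20 times, about
# 20 percent more often than the 16.7 a fair die would expect.
observed = np.array([20, 16, 16, 16, 16, 16])
expected = np.full(6, observed.sum() / 6)

# scipy's chi-square goodness-of-fit test returns the test statistic and
# the p value: the probability, if the die were fair, of a deviation at
# least this large.
chi2, p_value = stats.chisquare(observed, expected)
print(f"test statistic = {chi2:.2f}, p value = {p_value:.3f}")
```

With these made-up counts the p value lands well above the conventional cutoff discussed next, so the data would not justify rejecting the null hypothesis; one hundred rolls is simply not much evidence either way.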
A major question remains: how small a p value does it take to justifiably reject the null hypothesis? McDonald (2014) explained that a cutoff of 0.05 is typically used as the significance level below which the null hypothesis can be rejected. He noted the arbitrary nature of the threshold, which appears to be a purely practical figure: too high a cutoff will leave the null hypothesis more vulnerable to wrongful rejection when it is true, while too low a cutoff will risk missing a notable result because the null hypothesis becomes more difficult to reject. It should be clear that, whatever specific statistical tests are used and however valuable the results are in context, the process is more murky estimation than exact science.
Siegfried (2010) argued that it is this murkiness, along with the confusing usage of the word “significance” itself, that creates problems in the comprehension of statistics. He stated, “If the chance of a fluke is less than 5 percent, two possible conclusions remain: there is a real effect, or the result is an improbable fluke” (Statistical Insignificance section, para. 7). Null hypothesis testing speaks only to the chances of obtaining the experimental data if the null hypothesis is true (Siegfried, 2010). He pointed out that, nevertheless, test results are often erroneously interpreted as providing odds for the likelihood of the null hypothesis being true. He similarly dispelled the notion that a statistically significant result automatically implies some large magnitude of difference in the behavior of the tested subject. Referring back to my dice example, perhaps I rolled my die a million times, and one number came up one percent more often than expected. I may well get a very low p value, because as the sample size grows, even small relative deviations become less and less plausible if the null hypothesis is true (McDonald, 2014). Thus, I might achieve statistical significance at the 0.05 level, reject the null hypothesis, and excitedly proclaim my discovery that this die is rigged. And yet, with the loaded side of the die turning up just one extra time about every 600 rolls, the effect is very unlikely to have any practical significance.
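A quick calculation makes this tangible. The sketch below is my own illustration with constructed counts rather than real data, using the same assumed chi-square test on a million rolls in which one face appears about one percent more often than a fair die predicts.

```python
import numpy as np
from scipy import stats

n_rolls = 1_000_000
expected = np.full(6, n_rolls / 6)

# One face turns up about 1 percent more often than expected; the excess is
# taken evenly from the other five faces so the totals still match.
excess = expected[0] * 0.01
observed = expected.copy()
observed[0] += excess
observed[1:] -= excess / 5

chi2, p_value = stats.chisquare(observed, expected)
print(f"chi-square = {chi2:.1f}, p value = {p_value:.4f}")
# Roughly chi-square = 20 and p around 0.001: "statistically significant,"
# even though the favored face shows up only about once extra per 600 rolls.
```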
On the flip side, Reinhart (2015) provided an example of how a study with too small a sample size can, even when the null hypothesis is false, fail to achieve a significant result, thus wasting time and energy. He noted an analysis of published medical trials that concluded that “more than four-fifths of randomized controlled trials that reported negative results didn’t collect enough data to detect even a 25% difference in primary outcome between treatment groups” (Reinhart, 2015, p. 19). Though it is intuitive that the challenges of fielding a large enough sample size might make it more difficult to show an effect, such a large number is astonishing. It shows a clear neglect of the concept of power, which is the chance that a study, given its sample size, will detect an effect of a given size when one truly exists (Reinhart, 2015). An even more concerning example Reinhart (2015) discussed involves studies of adverse drug effects when sample sizes are small. An absence of statistical significance due to lack of statistical power prompts the technically correct conclusion that there is no statistically significant adverse effect. However, along the lines of the misconceptions I have discussed already, this can turn into the blatantly wrong conclusion that there are definitely no notable adverse effects. The assumption of a true null hypothesis, simply because the study was too weak to have any chance of rejecting it, can clearly have some dangerous consequences: imagine if a dangerous drug gets approved simply because no study is powerful enough to discover the harmful effects! (Reinhart, 2015)
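Power lends itself to the same kind of simulation sketch. The example below is hypothetical; the effect size, group sizes, and outcome distribution are my own assumptions rather than figures from Reinhart. Two groups with a genuine 25 percent difference in mean outcome are repeatedly compared with a t test, and power is estimated as the fraction of simulated studies that reach p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(n_per_group, n_sims=5_000):
    """Fraction of simulated studies that reach p < 0.05 with a two-sample t test."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(loc=1.00, scale=0.5, size=n_per_group)
        treated = rng.normal(loc=1.25, scale=0.5, size=n_per_group)  # a true 25% difference
        _, p = stats.ttest_ind(control, treated)
        hits += p < 0.05
    return hits / n_sims

for n in (10, 30, 100):
    print(f"{n:>3} subjects per group -> estimated power ~ {estimated_power(n):.2f}")
```

Under these assumptions the smallest studies detect the real effect only a small fraction of the time, which is precisely the weakness Reinhart describes in underpowered trials.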
Thus far, the examples I have mentioned illustrate the issues that can develop from a misunderstanding of the concepts of statistical testing and its related vocabulary. But even among those who understand how the statistics work, problems are allowed to persist because of how statistical testing is sometimes applied and regarded. Gelman (2013) outlined what he perceives to be a major problem with scientific research:
[A]s long as studies are conducted as fishing expeditions, with a willingness to look hard for patterns and report any comparisons that happen to be statistically significant, we will see lots of dramatic claims based on data patterns that don’t represent anything real in the general population. (p. 1)
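Gelman’s point can be demonstrated with a toy simulation; the numbers below are arbitrary assumptions of mine, not anything from his article. Compare many pairs of groups drawn from the same distribution, so that no real effect exists anywhere, and a handful of comparisons will still clear the 0.05 bar by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_comparisons = 100
spurious = 0
for _ in range(n_comparisons):
    # Both groups come from the very same distribution, so any apparent
    # "effect" is pure noise.
    group_a = rng.normal(size=30)
    group_b = rng.normal(size=30)
    _, p = stats.ttest_ind(group_a, group_b)
    spurious += p < 0.05

print(f"{spurious} of {n_comparisons} null comparisons reached p < 0.05")
# Around five such "discoveries" are expected by chance alone; report only
# those, and the record looks far more exciting than reality.
```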
Gelman detailed several troublesome studies in which finding any kind of publishable conclusion seemed to be of more importance than following the scientific method through to diligently test an initial hypothesis. In one, researchers passed off arm circumference measurements of college students as representing adult male upper body strength, to find a correlation relating to economic self-interest. In another, researchers found a statistically significant conclusion that women wear more red clothing when their fertility levels are highest, but used “a self-selected sample of 100 women from the Internet, and 24 undergraduates at the University of British Columbia” (Gelman, 2013, p. 1) as the entire sample used to make a generalization about all women.
I present these examples in such stripped-down form to make the case that it does not require much beyond common sense to intuitively understand when the manipulation of statistics has gone awry. This raises the question: why are fairly egregious examples of this sort still prevalent (and published) if it is possible to weed many of them out with a rather basic intuitive inspection? “The system of scientific publication is set up to encourage publication of spurious findings,” Gelman (2013, p. 2) concisely concluded, and that would seem to explain the phenomenon. An inherent conflict of interest exists if public consumption of a catchy, news-ready finding is enough to overcome significant flaws in scientific methodology and pave the road to the publication of dubious information. In the wrong circumstances, this conflict can threaten the progress of science and discovery.
Few examples may be as infamous or as damaging as Andrew Wakefield’s purported link between autism and the MMR (measles, mumps, and rubella) vaccine, published in a 1998 paper (Ziv, 2015). Wakefield’s argument hinged on observations of 8 children who developed gastrointestinal problems and showed signs of autism shortly after being vaccinated (Gerber & Offit, 2009). Gerber and Offit systematically debunked Wakefield’s argument, which was flawed by a tiny sample size, a lack of statistical structure, and even a disconnect between the hypothesis and the observed symptoms. They examined 20 studies with much more statistical integrity, all of which failed to replicate any of Wakefield’s conclusions. Furthermore, around the time of his paper, Wakefield was trying to obtain a patent for a different type of preventive treatment aimed at measles and intestinal problems (Ziv, 2015). Once again it is painfully clear which side the science is on, and that motivations other than authentic discovery may have been in play. But that has not prevented the anti-vaccine movement from developing and sustaining a controversy with no scientific backing, and bringing a rise in measles rates with it (Ziv, 2015). Poorly conducted science has no ceiling for negative impact.
The cases I have used to illustrate these problems all involve statistics and significance, from different angles and to varying degrees. Insufficient education and a lack of full disclosure seem to be common threads that leave statistics vulnerable to butchering. First, I believe that the misunderstanding of the term statistical significance, as well as the failure to understand what a null hypothesis actually can and cannot prove, spawns the idea that statistical hypothesis tests are exact, perfect, and completely objective. The science can be sanitized in an effort to simplify, and I believe this drives the public to develop apathy towards the empirical component of a scientific study. In turn, a sensationalist, results-oriented culture continues to be encouraged by taking the process of the scientific method for granted. Researchers, for their part, are trusted to partake in the scientific method; those who are driven to search anywhere and everywhere for whatever pattern can be found, only to use a statistical test as a short-order cook to fry up a “significant” result for publication, are abusing the power vested in them as members of a minority that at least theoretically understands how the process should work.
As in many areas, an attempt at repairing this complicated issue should begin with a better connection between the experts and the general public. The improved link can be forged at both ends. Publications of studies with potential public interest could include a layman’s version, still explaining in some detail the general idea of the empirical process that was followed, while avoiding cumbersome, high-level language that turns away those outside the field. On the other side, science educators could make an effort to avoid sheltering students from the messy issues associated with research, and teach not just the facts, but how to reason scientifically. Similarly, in teaching statistics, applications could be much more heavily emphasized; instead of just learning how to conduct various types of tests, actually attempting to run them in a practical environment could be a valuable experience serving to ingrain the imperfection of science.
Finally, the false dichotomy associated with a statistical test either succeeding or failing to achieve significance could be rectified by a change in nomenclature and a change in the handling of results that do not reject the null hypothesis. Calling it statistical support, perhaps, might more accurately depict the nature of information provided by achieving a small enough p value to reject the null, without completely dismissing unsupported results the way calling them insignificant does. Most importantly, the attention paid to studies with these unsupported results should be increased. Every well-conducted scientific study has something very real to say about the world we live in, whether by what it shows or does not show. If publication standards were changed to allow more room for unsupported, “statistically insignificant” studies to tell their tales, perhaps motivations would change in favor of conducting more sound research and away from desperate searching for a publishable statistical trend. Perhaps more would be learned if a more representative distribution of information were available, not biased only towards the studies backed by the statistics. Certainly it would be more valuable than sifting through far too much misinformation. Exciting scientific phenomena are still everywhere, waiting to be discovered; perhaps we should only look for those that actually do exist.
References
Cohen, B. H. (2010, January 30). Null Hypothesis Significance Testing. Corsini Encyclopedia of Psychology, 1-2. doi: 10.1002/9780470479216.corpsy0612
Gelman, A. (2013, July 24). Too Good to Be True. Slate. Retrieved from http://www.slate.com/articles/health_and_science/science/2013/07/statistics_and_psychology_multiple_comparisons_give_spurious_results.2.html
Gerber, J. S., & Offit, P. A. (2009). Vaccines and Autism: A Tale of Shifting Hypotheses. Clinical Infectious Diseases, 48(4), 456-461. doi: 10.1086/596476
McDonald, J. H. (2014). Handbook of Biological Statistics (3rd ed.). Baltimore, Maryland: Sparky House Publishing.
Reinhart, A. (2015). Statistics Done Wrong: The Woefully Complete Guide. San Francisco, California: No Starch Press.
Siegfried, T. (2010, March 27). Odds Are, It’s Wrong. Science News, 177, 26-29. Retrieved from https://www.sciencenews.org/article/odds-are-its-wrong
Ziv, S. (2015, February 10). Andrew Wakefield, Father of the Anti-Vaccine Movement, Responds to the Current Measles Outbreak for the First Time. Newsweek. Retrieved from http://www.newsweek.com/2015/02/20/andrew-wakefield-father-anti-vaccine-movement-sticks-his-story-305836.html