A study in Nature Reviews Neuroscience reported that the reliability of neuroscience studies is undermined by low statistical power resulting from small sample sizes. To grasp what low statistical power means and how it can result from a small sample size, we must first recognize one type of error that can arise when testing a statistical hypothesis, namely Type II error. This occurs when we fail to reject the null hypothesis even though it is actually false. Type II error occurs at a rate denoted β, and β determines statistical power through the equation Power = 1 – β. Statistical power is therefore the probability of rejecting the null hypothesis when it is in fact false. Simply put, power is the probability of detecting a true effect, whether it is a correlation or a difference in means. Low statistical power can result from a small sample size because, for a given effect size and significance threshold, β depends on sample size. With a small sample, β is large, meaning that we are more likely to fail to reject the null hypothesis when it is false. In summary, a small sample size decreases statistical power, making the discovery of a genuine effect less probable.
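The relationship between sample size, β and power can be sketched numerically. The example below is a minimal illustration, not the method used by Button et al.: it assumes a two-sided comparison of two equal-sized groups, uses a normal approximation to the t-test, and the function name `approx_power` and the effect size d = 0.5 are choices made here for illustration only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Assumption: normal approximation to the t-test, so the result is
    slightly optimistic when the groups are very small.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic for standardized effect d
    shift = abs(d) * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Illustrative effect size (hypothetical, not taken from the paper)
    for n in (10, 20, 50, 100):
        power = approx_power(d=0.5, n_per_group=n)
        print(f"n per group = {n:3d}  power = {power:.2f}  beta = {1 - power:.2f}")
```

With the effect size held fixed, the printed power rises and β falls as the number of subjects per group grows, which is the dependence described above.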
In addition to reducing the chance of detecting a true difference, Button et al. describe two further consequences of low statistical power that apply regardless of how well planned the experiments are. The first is a low positive predictive value (PPV), meaning that there is a low probability that a discovered effect reaching statistical significance (for example, p < 0.05) is indeed a true effect. The second is effect inflation: a low-powered study reaches significance only when its estimate of the effect happens to be large, so small true effects are either missed or, when detected, overestimated. A discovered effect is therefore likely to be an inflated estimate of the true effect.
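To make the PPV point concrete, the sketch below uses the relationship PPV = ((1 – β) × R) / ((1 – β) × R + α), where R is the pre-study odds that a probed effect is real and α is the significance threshold. The value R = 0.25 (one true effect for every four null effects) is an illustrative assumption, and the function name is hypothetical.

```python
def positive_predictive_value(power, pre_study_odds, alpha=0.05):
    """PPV = (power * R) / (power * R + alpha), with power = 1 - beta."""
    true_positives = power * pre_study_odds  # expected true discoveries
    return true_positives / (true_positives + alpha)


if __name__ == "__main__":
    # Assumed pre-study odds R = 0.25; chosen for illustration only
    for power in (0.2, 0.5, 0.8):
        ppv = positive_predictive_value(power, pre_study_odds=0.25)
        print(f"power = {power:.1f}  PPV = {ppv:.2f}")
```

Under these assumed odds, a field running at 20% power would see only about half of its significant findings reflect true effects, compared with about 80% at 80% power.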
The goal of the authors’ study was to estimate the average statistical power of neuroscience studies. To calculate the power of individual studies, the authors used an estimate of the effect size obtained from meta-analyses. They found that the median statistical power was 21% (data from 49 meta-analyses and 730 primary studies published in 2011), meaning that, on average, only about 21 out of 100 true effects would be detected. Moreover, almost one half of the included studies had an average power of less than 20%. With regard to neuroscience subfields, the authors found that the median statistical power was only 8% for neuroimaging studies (data from 41 meta-analyses and 461 primary studies from 2006 through 2009), 18% for water maze studies and 31% for radial maze studies. However, the power estimates for water and radial maze studies are based on only two meta-analyses, so strong conclusions cannot be drawn. Regardless, the average sample size of the water maze studies was 22, and at that sample size a power of 80% is reached only for a large effect (Cohen’s d = 1.26). Large effect sizes are not uncommon; however, the authors suggest that large effects reported by small-sample studies are likely to be exaggerated estimates of the true effect size. Using a more plausible effect size as determined by the meta-analyses (Cohen’s d = 0.49), the authors suggest that 134 animals would be required to achieve a power of 80%. This is a large number of animals, and investigators in animal research are required to reduce the number of animals used, and to replace them where possible. In addition, investigators must consider the time and money required to run studies with large samples. Together, these are probable reasons why the examined studies have low statistical power.
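The water maze figures can be checked roughly with a short calculation. The sketch below is not the authors' procedure: it assumes a two-sided test at α = 0.05 comparing two equal-sized groups (the text above does not state the design), and it uses a normal approximation rather than the exact t-distribution, so its answers run slightly lower than the reported ones. Both function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def required_n_per_group(d, power=0.80, alpha=0.05):
    """Approximate subjects needed per group to detect effect size d."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


def detectable_d(n_per_group, power=0.80, alpha=0.05):
    """Approximate smallest effect size detectable at the given power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_group)


if __name__ == "__main__":
    n = required_n_per_group(d=0.49)
    print(f"d = 0.49 needs about {n} per group ({2 * n} in total)")
    # 22 animals assumed to be split evenly into two groups of 11
    print(f"22 animals can detect d of about {detectable_d(11):.2f}")
```

Under these assumptions the approximation gives 66 animals per group (132 in total) and a detectable effect of roughly d = 1.19, close to the 134 animals and d = 1.26 quoted above; the small gap is what one would expect from an exact t-based calculation, which is more conservative at small sample sizes.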
The authors argue that low-powered studies in neuroscience are “wasteful” with regard to both the use of animals (and humans) and the reliability of experimental results. However, low-powered studies are not specific to neuroscience. They have also been problematic in behavioral ecology, animal behavior, drug development, genetics and psychology.
What does this mean for neuroscientists? Firstly, it is necessary to be mindful of how a small sample size impacts statistical power. Secondly, sample size should be estimated a priori with attention to the anticipated effect size. This can be done using G*Power (link below). Becoming more aware of both how sample size influences statistical power and how to estimate the sample size required to achieve high power will lead to more reliable results in the future of neuroscience.
- Button KS, Ioannidis JP, Mokrysz C, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 2013;14:365-76.
- G*Power program