A quick thought on statistical power

Whether you’re an allied health professional, strength coach, or personal trainer, there’s no hiding from the growing demand for evidence-informed practice. In an age where your clients arrive having already spent hours online trying to tackle their problem, they’re likely to show up on your doorstep with more questions, and likely more expectations, than someone would have had in the past.

A typical professional in any of these fields is, in my opinion, required to read the relevant research in their field on a near-daily basis. Reading isn’t enough; they have to be able to sift through the data to create evidence-based philosophies that guide their prescriptions, be it pills or exercise, and turn these data into practical principles to help their patients and clients. This is a multifaceted process, and recently there’s been a greater reliance on statistical analysis to help sift through the endless barrage of research clogging up the internet.

Don’t get me wrong, I love to see that this is happening, but while the scientific process is complicated enough as it is, the interpretation of statistical analyses isn’t any easier. Lately there has been greater consideration of statistical power in the interpretation of various studies across the health and fitness industry. This is great, as it’s a pivotal component of a thorough, critical evaluation of a study, but unfortunately, there’s a learning curve to the adoption of any new technique.

When you assume, you make an….

Statistical power represents the probability that the null hypothesis will be rejected when it is false; in simpler terms, the probability that a difference of a given magnitude between groups will be detected as statistically significant when it truly exists. The term beta (β) describes the probability of a type II error, concluding there is no difference when one exists, so power (1-β) is the probability of correctly concluding there is a difference when one exists. Low statistical power therefore increases the chance of a type II error. Any study that reports a between-group difference large enough to be of interest, yet fails to reach statistical significance, requires consideration of whether it was adequately powered to detect the observed difference from a statistical point of view.
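
If it helps to see this numerically, here’s a minimal simulation sketch in Python. The group means and SD are hypothetical (loosely modelled on the kind of numbers discussed later in this post, not taken from any particular study): power is simply the fraction of repeated experiments in which a real difference comes up significant, and one minus that fraction is the type II error rate.

```python
# A rough Monte Carlo illustration of statistical power (hypothetical numbers).
# Power = proportion of repeated experiments in which a true difference
# reaches statistical significance; beta (type II error) = 1 - power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_per_group = 12          # observations per group
true_means = (7.0, 3.5)   # a real underlying difference exists
sd = 5.0                  # common standard deviation
alpha = 0.05
n_sims = 10_000

significant = 0
for _ in range(n_sims):
    a = rng.normal(true_means[0], sd, n_per_group)
    b = rng.normal(true_means[1], sd, n_per_group)
    _, p = stats.ttest_ind(a, b)
    significant += p < alpha

power = significant / n_sims
print(f"Estimated power: {power:.2f}, type II error rate: {1 - power:.2f}")
```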

Any analysis is formulated on assumptions, and when we do a post-hoc (after the fact) power analysis to aid our interpretation of a scientific finding, we need some concept of what an important difference might be. The p-values provided by a study do not answer this for us. P-values provide no judgement as to whether any differences are meaningful from a practical perspective, just the probability of the observed statistics occurring due to chance, judged against a threshold value (α=0.05).

This takes many forms in the literature. Clinicians have the concept of a minimal clinically important difference (MCID) for some outcomes; in the strength and conditioning and fitness world, however, this concept has not been given much consideration. Ultimately, then, this is a difficult concept for many professionals to define, and while it is easy to conclude that any improvement is beneficial, we have to consider the fact that statistical power is not an all-or-none property (though our interpretation usually is), as well as the precision of the measurement techniques to detect increasingly smaller changes.

In the high-performance sports world, where margins are often tight, there’s a tendency to view even the most minuscule difference between two groups as potentially important. Only 0.12 seconds separated Usain Bolt (1st, 9.63 sec) from Yohan Blake (2nd, 9.75 sec) in the 100m final at the 2012 London Olympics. Would a 0.5% difference in growth with different training practices in the month leading up to the 2014 Olympia have meant that Kai Greene edged out Phil Heath? The often small margins in the competitive environment are used as justification for claiming a lack of statistical power when research shows very small differences between groups that fail to reach statistical significance.

At the end of the day, when I hear someone say that a study is underpowered, what I’m really hearing is that they think there was a meaningful, or important, difference between the groups that the study was unable to detect (type II error, false negative) because it did not collect a sufficient number of observations. Most people don’t worry about whether a study was adequately powered when it has near-identical means. So while a p-value only denotes the probability that an observation would have occurred by chance, our interpretation of these results usually includes some consideration of the magnitude of the difference in terms of what would be practically relevant.

Hypertrophy irrespective of training intensity as an example

Let’s use the example of the now not-so-recent study by Mitchell et al (1) that I’ve discussed elsewhere. This study offers a clear example of two experimental groups with very similar means, alongside another comparison with a potentially larger relative difference that lacks statistical significance.

Without too much detail (discussed previously), the authors compared three training conditions, three sets at 30%-1RM, one set at 80%-1RM, and three sets at 80%-1RM, all to concentric failure, in a 10-week training program. The change in quadriceps volume (%) from the study is shown per group in the table below. These numbers are estimated from a graph; their exact accuracy is not of much importance here, rather the process we are using to draw our conclusions is the point. There were twelve “legs” (training was unilateral, each participant completing two conditions, one per leg) per group, and the observed differences in growth were not statistically significant.

[table caption="Estimated quadriceps volume change from fig #1 of Mitchell et al (1)" width="100%" colwidth="25%" colalign="left|center|center|center"]
,3 sets 30%-1RM,1 set 80%-1RM,3 sets 80%-1RM
Quad Volume,7±1.5%,3.5±0.5%,7±1.2%
[/table]

Ultimately, the authors concluded that the three training conditions were equivalent from a hypertrophic standpoint. Many saw the relatively large difference between the two three-set conditions and the one-set condition (roughly twice the growth) and concluded the study was universally underpowered. Both are reasonable positions to take at first glance at the data, but we can also use a numerical method to better inform our decisions.

Now there are multiple ways we could approach the problem, but when I do a post-hoc analysis on a study it’s usually a rough process to give me a general idea. I like to use the free software G*Power. In that program, selecting a two-factor (training condition, time) repeated measures ANOVA as in Mitchell et al (1) and entering the following values (pooled SD 4.96, effect size 0.33, alpha 0.05, power 0.8, correlation between measures 0.96) yields a total sample size of 90 “legs”, or 30 per group. This number is well in excess of what was observed in the study. We can also compute the actual power of the observed results, which was 0.39, short of the usual standard of at least 0.8 in the scientific community. Therefore β, the type II error rate, is 0.61 (61%), suggesting a high risk of a false negative, that is, concluding no difference exists when one was present.
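
If you’d rather not leave a Python console, you can approximate the same calculation there. The sketch below treats the comparison as a simple one-way ANOVA on the change scores, so it ignores the repeated-measures correlation that G*Power accounts for; the effect size of 0.33 is Cohen’s f computed from the three estimated group means and the pooled SD quoted above. Treat it as a rough cross-check that should land near the same ballpark figures, not a replication of the G*Power output.

```python
# Rough cross-check of the G*Power numbers using a one-way ANOVA approximation.
# Effect size here is Cohen's f derived from the estimated group means
# (7, 3.5, 7 %) and the pooled SD (4.96) quoted above.
import numpy as np
from statsmodels.stats.power import FTestAnovaPower

means = np.array([7.0, 3.5, 7.0])          # estimated % change per condition
pooled_sd = 4.96
cohens_f = means.std(ddof=0) / pooled_sd   # ~0.33

analysis = FTestAnovaPower()

# Total N needed for 80% power at alpha = 0.05 with three groups
n_total = analysis.solve_power(effect_size=cohens_f, alpha=0.05,
                               power=0.8, k_groups=3)
print(f"Required total sample size: {n_total:.0f}")  # close to the ~90 'legs' above

# Post-hoc power for the observed 12 legs per group (36 total)
observed_power = analysis.power(effect_size=cohens_f, nobs=36,
                                alpha=0.05, k_groups=3)
print(f"Observed power: {observed_power:.2f}")  # roughly 0.4
```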

Score one for the underpowered camp.

Power on a per-comparison basis

Above we’ve calculated sample size and power based on the observed experimental design, but what if we were planning a future study?

To simplify this point with an example, we’ll plan two studies, one investigating the effects of training intensity on muscle growth (80%-1RM vs 30%-1RM) and another looking at different volumes of sets (80%-1RM for 1 set vs 3 sets). In either case, we can use the data from Mitchell et al (1) in the “planning” phase of our studies to perform power calculations to set our sample sizes.

In our training intensity study, we’ll use a between-groups design, whereby participants are randomized to each training condition, with pre- and post-training observations for each. Performing a power calculation for this design (repeated measures ANOVA), informed by the data above from Mitchell et al (1) and comparable statistical parameters as above, we would need an infinite sample size given the equivalence of the means (7±1.5% vs 7±1.2%). Research budgets are tight right now, so I’m not sure this “study” will be happening any time soon. I’ll also note that there is actually a very small difference between the two intensity conditions in the original data, so the required number is lower than infinite, but the study is still unlikely to happen anytime soon, especially given that recent literature also seems to suggest that hypertrophy is comparable between intensities when sets are taken to concentric failure or volume is equated (2,3).

In our volume example (80%-1RM, 3 sets vs 1 set) we’d need roughly 33 per group, a much more manageable number. Coupled with the two-fold relative difference in growth between the one- and three-set conditions, and existing literature favouring multi-set training (4), it would be reasonable to suggest that this comparison would be worth testing again with an adequate sample size. I’m not convinced we need to spend funds on another single- vs multi-set study, but ultimately the volume comparison in Mitchell et al (1) was underpowered, and its null result is unlikely to change our stance regarding the relationship between the number of sets and hypertrophy. A rough calculation showing how these per-comparison numbers fall out is sketched below.
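
The sketch treats each planned study as a simple two-group comparison of change scores, using the estimated means and the pooled SD of 4.96 from above; a repeated-measures design like the one entered into G*Power would shift the numbers somewhat, so this is an approximation rather than the definitive calculation.

```python
# Per-comparison sample size planning, approximated with a two-sample t-test
# on change scores. Means and pooled SD are the estimates quoted above.
from statsmodels.stats.power import TTestIndPower

pooled_sd = 4.96
analysis = TTestIndPower()

# Volume study: 3 sets vs 1 set at 80%-1RM (7% vs 3.5% growth)
d_volume = (7.0 - 3.5) / pooled_sd            # Cohen's d ~ 0.7
n_volume = analysis.solve_power(effect_size=d_volume, alpha=0.05,
                                power=0.8, ratio=1.0)
print(f"Volume study: ~{n_volume:.0f} participants per group")  # ~33

# Intensity study: 3 sets at 30%-1RM vs 3 sets at 80%-1RM (7% vs 7% growth)
# Cohen's d is essentially zero, so the required sample size diverges
# toward infinity; there is no point solving for it numerically.
d_intensity = (7.0 - 7.0) / pooled_sd
print(f"Intensity study effect size: {d_intensity:.2f} (sample size unbounded)")
```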

Conclusion

This is by no means the only way to consider post-hoc power analyses; it’s just the process I use when going through the results of a study. Just as we like to demand evidence to support our philosophies and practices, it’s important when making a claim regarding the results of a study to have evidence to support our position. In the case of statistical power, there are a few key points we should consider:

  1. While statistical significance considers the probability of a finding relative to chance, when we claim a study is underpowered, there’s likely some consideration of the magnitude of the difference in terms of practical relevance. Is your opinion informed by the existing literature as to what an important or “real” difference might be? If there isn’t a relevant or meaningful difference in the study, why would you be so concerned that it was underpowered?
  2. We can use a numerical process to guide our decision as to whether a study is adequately powered or not and free software such as G*Power is available to help us. Most studies report all the requisite numbers that you’ll need to approach this and more often than not it takes less than a minute of work.
  3. We should be cautious when writing a study off as universally underpowered and should always take the time to consider all the comparisons of interest in the dataset. Don’t just run a power calculation and call it a day; consider the actual values in the dataset and look for trends/patterns. Within a single study, power can vary across outcomes, and it’s important to remember that sometimes, the means just aren’t different.
  4. As we’ve seen above, our conclusions can vary depending on how we approach our post-hoc power calculations, so don’t be afraid to run multiple analyses.

Recommended Resources

  • G*Power: This free software allows you to perform many calculations for most of the common study designs you’d encounter in the exercise physiology literature. Manuals are available online, and at the very least, the price is right.
  • Intuitive Biostatistics by Harvey Motulsky (Amazon Affiliate Link): An excellent introduction to statistical concepts, of which chapters 18-21 are particularly relevant to this post.
  • Is This Change Real? by Daniel Riddle and Paul Stratford (Amazon Affiliate Link): Oriented to physiotherapists, but the concepts discussed in the book are relevant to anyone who deals with change. Whether you’re a doctor, personal trainer, strength coach, or physiotherapist, there is info in this book that will aid your interpretation of research findings and help you assess whether your patients and/or clients are actually changing.
  • Noninferiority and Equivalence Designs: Issues and Implications for Mental Health Research: While this article is concerned with the field of mental health, it does an excellent job describing the considerations required when using equivalence (or non-inferiority) designs. As we’ve seen in the post above, defining the difference, power, and sample size are key concepts.

Comments

    Douglas Garfield says:

    It’s always a good idea to point out the difference between statistical and practical significance. But also the many sources of random and systematic error that riddle a good portion of exercise-science research. As you know, far too many physiologists set out to prove points, not uncover the truth. Seek and ye shall find exactly what you’re looking for whether it’s there or not.


Dan Ogborn