The Price of Precision

Back in May, Uri Simonsohn posted an article to his blog about studying effect sizes in the lab. His general conclusion was that the sample sizes needed for a sufficiently precise estimate of effect size are not feasible for the majority of lab studies. Although I was not at the recent SESP conference, I have been told he discussed this (and more!) there.[1] Felix Schönbrodt further discussed Simonsohn’s point, noting that reporting effect size estimates and confidence intervals is still important even if they are wildly imprecise, because they can still be combined in a meta-analysis to achieve more precision. I think both posts are insightful, and I recommend reading them. However, both use particular examples with a given level of precision or sample size to illustrate their points. I wanted to go a bit more in-depth on how the precision level and effect size change the sample size needed, using a tool in R that Schönbrodt pointed out.

Graphing Required Sample Sizes

What I wanted to know was this: Across a range of levels of precision (i.e., widths of confidence intervals) and a range of effect sizes, how does the distribution of required sample sizes look? To find out, I used the MBESS package in R, which includes a function for calculating sample size using the accuracy in parameter estimation (AIPE) framework.[2] I then calculated these estimates for a range of effect sizes (Cohen’s d of .1 to 1.0) and levels of precision (CI widths of .1 to .3 pooled SDs).[3] The results are plotted in the graph below:

Sample size for given CI precision and effect size

As you can see, the required sample size increases dramatically as the precision increases. Effect size plays some role, but not much. At the high end of the scale, for very precise estimates (the kind Simonsohn suggested in his article as necessary to properly study effect size), the sample size needed is over 3000 (3078 for d = .1, and 3458 for d = 1.0). Even the low end of the scale, with CI widths between .2 and .3 pooled SDs, still requires at least 300 participants. I’ve zoomed in on this part of the graph below:

Sample size for given CI precision (.2 to .3) and effect size
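
For readers without MBESS, the numbers above can be roughly cross-checked in base R. The sketch below is not the AIPE routine used for the graphs. It assumes the usual large-sample variance of Cohen’s d for two equal groups, Var(d) ≈ 2/n + d²/(4n), where n is the sample size per group, and solves for the n that gives a 95% CI of a chosen full width. The function name n_per_group_approx and the grid spacing are made up for illustration.

```r
# Large-sample approximation (an assumption, not the MBESS/AIPE method):
# Var(d) ~ 2/n + d^2/(4n) for two equal groups of size n, so the full
# CI width is w = 2 * z * sqrt((2 + d^2/4) / n). Solve for n and round up.
n_per_group_approx <- function(d, width, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  ceiling(4 * z^2 * (2 + d^2 / 4) / width^2)
}

# A grid similar to the one described in the text (spacing is hypothetical)
grid <- expand.grid(d = seq(0.1, 1.0, by = 0.1),
                    width = seq(0.10, 0.30, by = 0.05))
grid$n <- n_per_group_approx(grid$d, grid$width)

n_per_group_approx(0.1, 0.1)  # 3078
n_per_group_approx(1.0, 0.1)  # 3458
head(grid)
```

Under this approximation the two values quoted above come out when n is read as the size of each group, so the total would be twice as large. Whether the plotted values are per group or total is worth confirming against the MBESS documentation.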

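It is interesting to note that the required sample size increases very steeply as precision increases. The relationship is linear on a log-log scale, which means it follows a power law in the CI width (strictly speaking, not an exponential). I ran a regression of log sample size on effect size and log width to confirm this: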

summary(lm(log(sample) ~ effects + log(width), data.long))

This model has an R² of .9999 (I’m assuming the rest is probably just error due to rounding). You can also see this graphically. I log-transformed both axes and plotted a regression line for each effect size. The scale units don’t show up too well, but the points line up perfectly:

Sample size graph, log-transformed
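
The same kind of regression can be tried without the original data. The sketch below fills a data frame with sample sizes from a large-sample approximation (Var(d) ≈ 2/n + d²/(4n), with n per group), which is an assumption standing in for the MBESS output. The name data.long is reused only to mirror the call above, and the grid spacing is hypothetical.

```r
# Stand-in data: large-sample approximation, not the MBESS/AIPE values
z <- qnorm(0.975)
data.long <- expand.grid(effects = seq(0.1, 1.0, by = 0.1),
                         width = seq(0.10, 0.30, by = 0.01))
data.long$sample <- ceiling(4 * z^2 * (2 + data.long$effects^2 / 4) /
                              data.long$width^2)

# Log-log regression of sample size on effect size and CI width
fit <- lm(log(sample) ~ effects + log(width), data = data.long)
coef(fit)               # slope on log(width) should be close to -2
summary(fit)$r.squared  # close to 1
```

A slope near -2 on log(width) says that sample size scales with roughly the inverse square of the CI width, so halving the width about quadruples the required sample.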

This means you can estimate the sample size for a given effect size and level of precision with the following formula:

sample_size = exp(3.403 + .130*d - 1.999*log(precision))
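
As a quick sketch, the formula can be wrapped in an R function. The coefficients are copied from the line above. The function name sample_size_from_precision is my own, and since this is a regression fit, its output will differ somewhat from the exact values MBESS returns.

```r
# Regression-based approximation from the text; coefficients as given there
sample_size_from_precision <- function(d, precision) {
  exp(3.403 + 0.130 * d - 1.999 * log(precision))
}

# Very precise estimate of a small effect
sample_size_from_precision(d = 0.1, precision = 0.1)  # about 3038

# The function is vectorized, so several widths can be checked at once
sample_size_from_precision(d = 0.5, precision = c(0.1, 0.2, 0.3))
```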

Effect Sizes and Precision

So let’s take this in another direction. Given an average effect size in psychology of about d = .2, let’s say we run a study with 100 participants (50 per cell). This is well above what many lab studies use. What level of precision can we get from this? Let’s rearrange the formula:

precision = exp((-log(sample_size) + 3.403 + .130*d) / 1.999)

Plugging the numbers into this formula gives us a level of precision of .56. In other words, the confidence interval around our point estimate of .2 would be 95% CI [-.08, .48]. We wouldn’t be able to reject the null, for one thing, but on the upper end we could actually have a medium effect size (~.5), and on the low end we could be looking at an effect that actually goes in the opposite direction!
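
The arithmetic can be checked with a few lines of R. This sketch takes sample_size = 100 as the text does and, as a simplifying assumption, centers the interval on the point estimate. The function name precision_from_sample_size is made up for illustration.

```r
# Rearranged regression formula from the text
precision_from_sample_size <- function(sample_size, d) {
  exp((-log(sample_size) + 3.403 + 0.130 * d) / 1.999)
}

d <- 0.2
width <- precision_from_sample_size(sample_size = 100, d = d)

# Assumes a CI symmetric around d (a simplification)
ci <- c(lower = d - width / 2, upper = d + width / 2)

round(width, 2)  # 0.56
round(ci, 2)     # lower -0.08, upper 0.48
```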

Of course, this is what one might expect from such a low-powered study (the power for this study is only .17, as calculated here). But perhaps counter-intuitively, one thing we learn from the graphs above is that when the effect size increases, we actually need more participants to reach the same level of precision. For the same study with 100 participants but an effect size of .5, the level of precision is .57; and with an effect size of .8, the precision is .58. These are not large differences, but it is still important to remember that when it comes to precision, large effect sizes will not save you. Yes, with a large effect size, your power to detect an effect increases (i.e., the CI moves up and is less likely to include 0), but your precision actually decreases slightly. This influence of effect sizes increases as you aim for higher precision (as seen in the first graph).
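
Both claims in this paragraph can be reproduced in R. The sketch below evaluates the rearranged formula at three effect sizes and computes power with power.t.test from base R’s stats package, assuming a two-sided, two-sample t-test at α = .05 with 50 participants per cell. The helper name precision_from_sample_size is my own.

```r
# Rearranged regression formula from the text
precision_from_sample_size <- function(sample_size, d) {
  exp((-log(sample_size) + 3.403 + 0.130 * d) / 1.999)
}

# CI width at sample_size = 100 for small, medium, and large effects
effect_sizes <- c(0.2, 0.5, 0.8)
widths <- precision_from_sample_size(sample_size = 100, d = effect_sizes)
round(widths, 2)  # 0.56 0.57 0.58

# Power for d = .2 with 50 per cell (two-sample, two-sided t-test)
pw <- power.t.test(n = 50, delta = 0.2, sd = 1, sig.level = 0.05)$power
round(pw, 2)  # about 0.17
```

The widths grow with d while power also grows with d, which is the point of the paragraph: detecting an effect and estimating it precisely are different goals.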

Wrapping Up

In summary, it’s important to remember that the required sample size increases steeply, roughly with the inverse square of the CI width, as you aim for more precise estimates of effect size. Making precise estimates may be possible when using large Mechanical Turk samples, archival data, so-called “Big Data”, or repeated-measures designs, but is likely not feasible for most lab studies. However, as Schönbrodt points out, such estimates are still useful for meta-analyses, and thus it should still be standard practice to report them.

But as Simonsohn reminds us, the statistics we use should inform the questions we are trying to answer.[4] If we are not making quantitative predictions about the size of an effect, then estimates of effect size and confidence intervals may not answer the right question. But when your question is how large an effect is, be prepared to need large samples to offer an adequate answer.


  1. Simonsohn has made his slides available here. He has also written more about confidence intervals on his blog.
  2. See Kelley and Maxwell (2003); Kelley and Rausch (2006); and Maxwell, Kelley, and Rausch (2008) for more details.
  3. The R code is available here.
  4. Of course, when thinking about answering the questions we are asking, one might consider whether we are more interested in determining whether the data is likely given our hypothesis, or whether our hypothesis is likely given our data. If we are more interested in the latter (which is likely), then neither p-values nor NHST-based confidence intervals help us to answer that question. Bayesian statistics would seem to be the better approach in that case.
