As one of the resident “stats people” (as opposed to “regular people”) in my department, one of the most common questions I get asked is about power analyses. This is partly because more psychology journals are requiring explicit discussions about statistical power.^{1} The problem is that if you do studies that require analyses anything more complicated than a *t*-test or correlations, things can get a little bit hairy. You can do calculations for analyses like regression, but the effect sizes used for these calculations (like Cohen’s *f*^{2}) are often uncommon, to the point where you might not even be clear if you’re using the correct test or calculating the effect size correctly. And then, as you move up in complexity to analyses like multilevel models, SEM, etc., there isn’t even an analytic solution.^{2} That’s where power simulations come in.

Power is the long-run frequency of finding a significant result when there is a true effect. As such, the idea of a power simulation is to simulate running your study lots of times, and then count how many times the result was significant. From there, you can also vary different parameters and assumptions and see how that changes the power. It’s an incredibly flexible tool, but the downside is that its flexibility can make it challenging to set up. So this post is intended to be a primer on how to set up a power simulation using R. I use R because it’s awesome,^{3} but you can do similar things in SAS, Python, or probably many other languages. The syntax will be different, but the process is the same.

In this primer, I will take you through the process of testing power analyses for an interaction in regression. I chose this because it’s the analysis people have asked me about most often. As we go along, I will try to summarize what the code is doing, but this is not intended as a tutorial on R. If you don’t really know much about R and are unsure what something is doing, try Googling the name of the function along with “r” or “rstats”. If you are familiar with R, you should be used to reading documentation by now—but I do recommend really making sure you understand what the code is doing rather than just copying and pasting. I’m showing one example, but power simulations are a general process, and you’ll have a much easier time adapting it to your needs if you understand what it’s doing. With all that said…let’s get started!

## Running Study 1

I think it’s really helpful to think about these simulations in terms of studies. We are essentially running the same study over and over, so we first need to design our study. That means thinking about all that we need to run a single study. We need to (1) “collect” data, (2) run our statistical analyses on that data, and then (3) grab the results, particularly the *p*-value(s).

### Collecting the Data

When it comes to collecting data, we need to think carefully about the distributions from which to draw our observations. In a lot of cases, that will be a normal distribution. But if you are taking reaction time data, for example, it might be skewed.^{4} Or if you are taking count data, it might be Poisson. You’ll also need to think about the shape of that distribution—for normal data, the mean and SD will need to be set. To the extent possible, I recommend choosing these distributions and values based on actual data from other studies.

For our example, we’re going to create one variable that is continuous and normally distributed, and another variable that is a manipulation with two conditions.

1 2 3 4 |
n <- 100 x1 <- rnorm(n, mean=0, sd=1) x2 <- sample(0:1, n, replace=TRUE) |

In the above code we first set a sample size, which will come in useful later. Then `x1`

draws *n* values from a normal distribution (the “r” in `rnorm()`

stands for “random”), with a mean of 0 and SD of 1. `x2`

draws integers from 0 to 1, *n* times (i.e., gives us *n* numbers of either 0 or 1).

That gets us our predictors. We also need to create a DV. This DV needs to depend on the values of the predictors. So this is where our effect size estimates come in. You can think of this as essentially reverse-engineering a regression analysis. Instead of using our predictors to discover the regression equation that relates *x* to *y*, we’re going to create the regression equation ourselves and then use that to set the values of *y*.

1 2 3 4 5 |
b0 <- 0 b1 <- .3 b2 <- .2 b3 <- .3 y <- b0 + (b1 * x1) + (b2 * x2) + (b3 * x1 * x2) + rnorm(n, mean=0, sd=1) |

The `b0`

to `b3`

variables set the individual coefficients (effect sizes)—first the intercept, then the two main effects and the interaction. Then the final line sets up the regression equation: `y`

is a linear combination of our variables multiplied by our effect sizes. The last bit with `rnorm()`

is our error term—the unexplained portion of *y*. You should also use empirical data to set this value (primarily the SD), if possible.

As a reminder, if you are planning on transforming your data in any way, this would now be the time to do that. You want to simulate the actual study you will run as closely as possible, including the ways you might transform the variables.

### Running the Analyses

Now that we have “collected” our data, we can run the regression analysis. This is pretty straight-forward:

1 2 |
model <- lm(y ~ x1 * x2) summary(model) |

When you look at the output for this, you will likely notice that the coefficients you see are not the same as the coefficients you set for your effect sizes. That is due to the random error we included when we set `y`

. You may find that the interaction is significant…you may not. If you want to, try re-running all the code a few times and see how it changes.

### Saving the Results

The primary thing we want to do is save the *p*-values, because that is what we need to count up in order to determine the power. But there are lots of other things you can test (see below), so I usually recommend erring on the side of collecting more data rather than less. It’s a lot easier to filter out data later, rather than having to re-run all your simulations. Trust me, I have extensive experience on that subject.

Depending on the analyses you are running, the output can be quite simple or very complex. Here’s a tip: `str()`

will be your best friend. It shows you the structure of an object, so you can use it to figure out exactly how to get at the specific pieces you want to pull out from an output. Try it with `str(summary(model))`

and see!

1 2 3 4 5 6 7 8 9 |
output <- summary(model)$coefficients coefs <- output[, 1] ps <- output[, 4] rsq <- summary(model)$r.squared results <- c(coefs, ps, rsq) names(results) <- c('b0_coef', 'b1_coef', 'b2_coef', 'b3_coef', 'b0_p', 'b1_p', 'b2_p', 'b3_p', 'rsq') |

Most of the information we want is stored in `summary(model)$coefficients`

. This is the table of coefficients that you see when you print the summary output. The first column is our actual coefficients, and the fourth column holds the *p*-values. For our purposes, we’ll take both columns. We’ll also pull the *R*^{2} value of the overall model. We combine all of those together into a variable called `results`

, and then rename the headings to something a little more intuitive. And there we go—one study from start to finish!

Note that while the *p*-values are the primary target here, you may also want to do other things like test the direction of the effect—how frequently is it significant in the hypothesized direction?—or the precision—how much does the effect size bounce around? If you have multiple tests within your study, you might want to ask questions like, “How many times is *at least* one of these effects significant? How many times are *all* of them significant?” The R functions `any()`

and `all()`

are quite helpful for this.

## Packing It All Up

In order to make this process much simpler, it is extremely useful to package the code we created above into a function. This lets us run the function multiple times, and more importantly it allows us to really easily change the values as we go. We’ve made lots of assumptions here already, about the distributions, the means and SDs, the effect sizes, etc. We may want to test how those assumptions affect the power by changing them. We’ll also want to vary the sample size and see how that influences the power. So let’s wrap all this up into a function:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
regression_sim <- function(simNum, n, b0, b1, b2, b3, x1_mean=0, x1_sd=1, err_mean=0, err_sd=1) { x1 <- rnorm(n, mean=x1_mean, sd=x1_sd) x2 <- sample(0:1, n, replace=TRUE) y <- b0 + (b1 * x1) + (b2 * x2) + (b3 * x1 * x2) + rnorm(n, mean=err_mean, sd=err_sd) model <- lm(y ~ x1 * x2) summary(model) output <- summary(model)$coefficients coefs <- output[, 1] ps <- output[, 4] rsq <- summary(model)$r.squared results <- c(coefs, ps, rsq) names(results) <- c('b0_coef', 'b1_coef', 'b2_coef', 'b3_coef', 'b0_p', 'b1_p', 'b2_p', 'b3_p', 'rsq') return(results) } |

We first start off by naming the function—the name is arbitrary, but pick a name that is descriptive—and setting the parameters to the function. Notice that I’ve put all our assumptions into the parameters, so we can easily change the sample size, the effect sizes, etc. I’ve also included some default values for the means and SDs. Beyond that, the code is essentially the same, except for replacing the hard-coded means, SDs, and effect sizes. I’ve also added `return(results)`

to the end so the function will actually give us what we want.

You may notice that I’ve also added `simNum`

as the first parameter. That’s necessary for the method we’re going to use to run the function below, and **must be the first parameter of your function**. But we don’t really need to use the variable.

To check that everything works correctly, run the function code so it’s stored in R memory, and then run something like the following:

1 |
regression_sim(1, n=100, b0=0, b1=.3, b2=.2, b3=.3) |

## Repetition, Repetition, Repetition

Now that we have Study 1 completed, we just have to set things to repeat. The easiest way to do that is to use the `ldply()`

function in the plyr package.

1 2 3 4 5 |
if(!require('plyr')) install.packages('plyr') # installs package if not already installed library('plyr') num_sims <- 10 sims <- ldply(1:num_sims, regression_sim, n=100, b0=0, b1=.3, b2=.2, b3=.3) |

This runs 10 simulations—the first parameter to `ldply()`

is a sequence (vector) that it will run through one by one. For each element in the sequence, it will run our custom function. It will pass the value of the sequence to `simNum`

in our function—so the first time it will be 1, then 2, then 3, etc. We also give `ldply()`

the name of the function we want to run.^{5} After that, we set all the parameters we need to set. The `ldply()`

function will pass those along to `regression_sim()`

. That’s it! You can now run as many simulations as you want. The output will be a data frame, with one row per simulation.

Now that we have the output from our simulations, we can calculate the proportion of *p*-values that are significant. Here we’re primarily interested in the *p*-value for the interaction, which is stored as `b3_p`

.

1 |
power <- sum(sims$b3_p < .05) / nrow(sims) |

We use a logical expression to ask whether `b3_p`

is less than .05, which gives us a vector of TRUEs and FALSEs. Then we sum that up (TRUE gets counted as 1, FALSE as 0). Finally, we divide by the total number of simulations, and that gives us the proportion, which equals our power estimate! Try running 1000 simulations and see what our power is. I got about .29.

## Varying Parameters

The next step is to work on a setup that will allow us to test the power across a range of parameter values. The key parameter to adjust will typically be your sample size, as that’s the primary question when doing most power analyses. But you should also explore how adjusting other parameters affects your power. You should test a range of plausible effect sizes. For repeated-measures and longitudinal data, you should test plausible covariances across time points. All this can add up to a significant time investment, unfortunately, but is critical to identifying reasonable power estimates.

Let’s start by testing a range of sample sizes.

1 2 3 4 5 6 7 |
sample_sizes <- c(50, 100, 200, 300, 500) results <- NULL for (val in sample_sizes) { sims <- ldply(1:1000, regression_sim, n=val, b0=0, b1=.3, b2=.2, b3=.3) sims$n <- val # add the sample size in as a separate column to our results results <- rbind(results, sims) } |

First, we create a variable that just identifies which sample sizes we’re going to test. We also create a `results`

variable that will hold our output. The next block of code creates a `for`

loop, which cycles through each value of `sample_sizes`

in turn and sets `val`

to be that value. You then notice that we pass that value to our simulation function as the sample size.

I find it useful to combine all the results into a single data frame. For each sample size, we’re running 1000 simulations, so we end up with a giant data frame with 5000 results. To keep track of which simulations are which, we just add an extra column called `n`

that inserts the current value we’re testing. Finally, we just add the current simulation output to our `results`

variable, and then the loop begins again.

After running this, I recommend using the dplyr package to examine your results, but you can use whichever method you prefer. I’m going to use a combination of dplyr and ggplot2 below, however.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
if(!require('dplyr')) install.packages('dplyr') # installs package if not already installed if(!require('ggplot2')) install.packages('ggplot2') library('dplyr') library('ggplot2') power_ests <- results %>% group_by(n) %>% summarize(power=sum(b3_p < .05) / n()) ggplot(power_ests, aes(x=n, y=power)) + geom_point() + geom_line() + ylim(c(0, 1)) + theme_minimal() |

As you might expect, our power increases as the sample size increases. We hit 80% power somewhere just shy of 400 participants—maybe you want to try running the code again to test that value specifically.

The above code varied one parameter, sample size. If you want to vary more than one, one way to do this is with a “grid search”. The idea is that you create a table with rows for each set of simulations you want to run, like so:

n |
b3 |

50 | .2 |

50 | .5 |

50 | .8 |

100 | .2 |

100 | .5 |

100 | .8 |

200 | .2 |

200 | .5 |

200 | .8 |

The thing to keep in mind is that as you test more values and more parameters, this grid can grow quite quickly. So always be sure you know how big your grid is before running the code! You might want to split it up into more manageable chunks. We’ll also explore some techniques to speed up the process in the section below.

R has a nice function `expand.grid()`

to create your grid.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
grid <- expand.grid(n=c(50, 100, 200), b3=c(.2, .5, .8)) results <- NULL for (row in 1:nrow(grid)) { sims <- ldply(1:1000, regression_sim, n=grid[row, 'n'], b0=0, b1=.3, b2=.2, b3=grid[row, 'b3']) sims$n <- grid[row, 'n'] sims$b3 <- grid[row, 'b3'] results <- rbind(results, sims) } power_ests2 <- results %>% group_by(n, b3) %>% summarize(power=sum(b3_p < .05) / n()) power_ests2$b3 <- factor(power_ests2$b3) ggplot(power_ests2, aes(x=n, y=power, group=b3, color=b3)) + geom_point() + geom_line() + ylim(c(0, 1)) + theme_minimal() |

Using the `expand.grid()`

function, we get a nice data frame with 9 rows. The rest of the code is quite similar, but instead we iterate over each row of the grid. Now we get a nice estimate of the power across multiple sample sizes, and also multiple estimates of our effect size of interest.

## Turbo Boost (Optional)

If you have a reasonably fast processor, the simulations here hopefully shouldn’t take too long. However, if you are running a grid search across many different parameter values, or you’re running statistical models that take some time to optimize or converge, you may find that these simulations take a long time to complete. Luckily, there are a couple things we can do.

The first is simply to keep track of the time. If you know how long an initial set of simulations took, you can have at least a general sense of how long your full set of simulations will take. Keeping track of this is as easy as wrapping the `system.time()`

function around your `for`

loop like so:

1 2 3 |
system.time(for (row in 1:nrow(grid)) { # ... }) |

At the end, you’ll get output that looks like this:

1 2 |
user system elapsed 11.05 0.00 11.05 |

The “elapsed” time is the time that you want to look at, and shows you the amount of time in seconds that the code took to execute.^{6}

Of course, that doesn’t actually make things *faster*. But if you have multiple processor cores, you can use them to run your simulations in *parallel*. Essentially, instead of running each simulation one after another, you can split the simulations across your processor cores so they run at the same time. Most modern computers on the market today are at least dual-core, if not quad-core. ((To figure out how many processors you have, Google “find number of processors [insert name of your operating system]”.))

The nice thing is that the *plyr* package includes support for parallelization out of the box. The annoying thing is that different operating systems require different approaches; Windows in particular can be tricky. However, I think the following approach should work on most operating systems:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
if(!require('doSNOW')) install.packages('doSNOW') # installs package if not already installed if(!require('foreach')) install.packages('foreach') # installs package if not already installed library('doSNOW') ncpus <- 3 # change this to the number of cores you want to use cl <- makeCluster(ncpus, type='SOCK') registerDoSNOW(cl) grid <- expand.grid(n=c(50, 100, 200), b3=c(.2, .5, .8)) results <- NULL system.time(for (row in 1:nrow(grid)) { sims <- ldply(1:1000, regression_sim, n=grid[row, 'n'], b0=0, b1=.3, b2=.2, b3=grid[row, 'b3'], .parallel=TRUE) sims$n <- grid[row, 'n'] sims$b3 <- grid[row, 'b3'] results <- rbind(results, sims) }) stopCluster(cl) |

You’ll need to install the *doSNOW* and *foreach* packages (as *plyr* uses *foreach* for its parallelization). Choose the number of cores you want to use and set this in the `ncpus`

variable. I usually choose 3, because I have a quad-core processor, and having 1 core free lets me still browse the Internet with little interference while waiting for code to run. (It’s important to have your priorities straight.) Beyond that, most of the code is the same, other than adding the `.parallel=TRUE`

bit to the `ldply()`

function. Try running the code with and without parallelization and see the difference in time for yourself!

If this code doesn’t work for you, then…you’re just going to have to do a bit of searching on your own for an approach that works. I’m sorry, but there are numerous parallelization packages, and some of them work better than others on different systems, so it might take a bit of trial and error. I don’t have every operating system in the world to test this to nail this part down for everyone.

## UPDATE (2017-10-24)

Since originally writing this post, I have finally completed an R package I have been working on for some time, which makes this process much simpler. The *paramtest* package is available on CRAN and handles the iteration over simulations, the varying of parameters, and the parallelization parts of this process. The rest is left intentionally general-purpose, so you still need to create a custom function to do what you want. But the rest is taken care of. Let’s take another look at the last example of how to run our `regression_sim()`

function, but using the *paramtest* package.

1 2 3 4 5 6 7 |
if(!require('paramtest')) install.packages('paramtest') # installs package if not already installed library('paramtest') paramtest_results <- grid_search(regression_sim, params=list(n=c(50, 100, 200), b3=c(.2, .5, .8)), n.iter=1000, output='data.frame', parallel='snow', ncpus=3, b0=0, b1=.3, b2=.2) |

An easy, one-line solution! But let’s break it down. The first thing to note is that we use the `grid_search()`

function, which runs simulations for each combination of each parameter, exactly like we did above. The *paramtest* package also has a `random_search()`

function that randomly selects a value from a range for each set of simulations, which can be useful in cases where you are seeking to optimize parameters of a model.

The next thing we do is give the `grid_search()`

function our custom `regression_sim()`

function. We then give it a list of parameters to vary: We want to test three sample sizes at each of three different effect sizes. So we will end up with 9 different simulations. We tell it how many iterations we want to run for each simulation and tell it we want a data frame as output (the default is a list, which is useful if your function returns complex output like fitted models that don’t fit neatly into rows and columns). The next two arguments tell it to parallelize the process. Because I’m on a Windows machine, I chose the `'snow'`

option, but if you are on Mac or Linux then `'multicore'`

might be a better option.^{7} I also tell it that I want to use 3 CPU cores. The `grid_search()`

function will handle creating the cluster so we don’t have to do so manually as in the example above. Finally, we pass three arguments, `b0`

, `b1`

, and `b2`

, to set the other effect sizes that we *aren’t* varying in these tests. Anything you want to pass to your custom function you can just throw at the end and `grid_search()`

will pass it along. If we wanted to vary some of these other effect sizes as well, we would instead include them in the list passed along in the `params`

argument.

And that’s it! The result of this code will be very similar to the code we ran previously. The difference is that *paramtest* will provide a few other things in the output by default. You can see the data frame of results by using `results(paramtest_results)`

. This will look exactly like the data frame we created ourselves above. You can also see a summary of all the parameter values that were tested by using `summary(paramtest_results)`

. The output also includes the total time that the simulations took, so you can see that by running `timing(paramtest_results)`

.

If you want some more examples of power simulations for other types of designs, the *paramtest* package includes a vignette with a bunch more. It doesn’t go as in-depth into the logic behind them, but can serve as a useful template for creating one that works for your study. You can access the vignette here or running `vignette('Simulating-Power')`

in your R session.

Okay, that’s the end of the update. I’ll leave you with the original conclusion to the blog post:

## The Power Is Within You

Hopefully this gives you a useful framework to create your own power simulations. I wish I could cover many more types of study designs, but there are so many variations and possibilities I couldn’t possibly do it. However, the nice thing about this approach is that the only real editing you need to do comes within the custom function we created. That’s where you handle all the details unique to *your* study, like the data generating process, setting the parameters you need, running the particular statistical analysis you need, and pulling out the relevant results. This can be a complicated process, but it is extremely powerful and flexible. Just remember: Create a function that deals with the whole process of *one study*, and then the rest is repetition. Hopefully this provides you with the tools to do your own analyses, so go wild! You now have the ability to test statistical power in complicated designs for which there is no analytical solution. And with great power comes great responsibility. And hopefully great study results too.

#### Notes:

- Power analyses are so hot right now. [↩]
- For cases where there is an analytic solution, I highly recommend the pwr package. For non-R solutions, the stand-alone program G*Power is also a great option. [↩]
- No citation needed. [↩]
- For simulating skewed data, you can try the
`rbeta()`

function in base R, or use the sn package that has functions for skewed normal and*t*-distributions. [↩] - Technically, we are passing the
*actual function*to`ldply()`

rather than just the*name*of the function. That’s why the name isn’t in quotation marks—we’re passing the actual function object that is stored in memory. [↩] - Another option you can use, if you’re feeling adventurous, is to add a progress bar. R has a built-in
`txtProgressBar()`

function you can use, as shown here. You can also use the`.progress`

parameter in the`ldply()`

function to achieve a similar result. [↩] - See the documentation from the
*parallel*package for more details. [↩]

## 5 responses to “Running Power Simulations with the Power of R!”

## Mario Reutter

Dear Jeff,

thanks a lot for this interesting blog post about power simulations! I am confident that it significantly 😉 improved my skills and workflow in R, especially considering the ldply function.

Since I like clean, elegant, and generalizable code, I have two suggestions about which I’d like to hear your opinion:

1. Importing / Installing packages: instead of having a line that needs to be commented in or out, I like to do this:

if(!require(tidyverse)) install.packages(“tidyverse”)

2. Why do you check if results is null before calling rbind? rbind(NULL, df) is equal to df [while rbind(NA, df) creates a first row of NAs!]. Thus, you don’t need the if statement, collapsing 4 lines into 1.

Thanks for your thoughts!

Mario

## Jeff

Hi Mario,

Thanks, these are good suggestions!

1. I have used this approach in the past, and to be honest, I don’t know why I didn’t here. I sort of added the comments about install.packages() as an afterthought, an “oh yeah, people might not have these packages installed already, so I should mention that.” But you’re right, it’s easier to just use require() to check. I’ll update the code with that.

2. You know, I had no idea that rbind(NULL, df) was equal to df. I think I’ve run into the case you mention with the row of NAs and must have misremembered it as being NULL that did that. But I just tested it out and you’re absolutely right. It certainly is more elegant not needing that extra if statement. I’ll update that too 🙂

Thanks for your suggestions! It’s always nice getting other people’s perspectives on ways to improve one’s code 🙂

Jeff

## Andy

This is great and very helpful, but I’m still left with the question of how to estimate the regression parameters and especially how to estimate the error term. We tend to conduct studies to test questions for which the answer isn’t already known! If we knew, we wouldn’t need the study. So in such cases, how can one possibly estimate all regression parameters and error a priori?

## Jeff

For sure, and this is definitely an aspect of power analysis that is always tricky. The same issue exists in simpler problems where there is an analytic solution—you still need to provide an estimate of the effect size, at minimum. The regression parameters are your effect sizes in this case, so you can use the same strategies with power simulations as you do with analytic power analyses.

Some of this information you may be able to pull from previous studies. Maybe a study has examined the main effect, but not the particular interaction you are examining. The error term is just the variance left over after accounting for the other effects, so if you or others have studied the DV on a similar population, you can use the variance from the measure in that study as a basis for the error term in your simulation.

So maybe there is previous data to work with. Maybe there isn’t. In either case, however, I would always highly recommend running multiple analyses across a variety of plausible estimates, as a sort of sensitivity analysis, to see how the power changes as these parameters change. This is akin to testing power at “small”, “medium”, and “large” effect sizes, but is pretty general-purpose and can be used to test how the power is affected by the error variance, for example. Varying parameters is always part of a good power analysis. Even if you had, say, a large meta-analysis to pull an effect size estimate from, it would still be a good idea to, say, test the power at the high and low bounds of the confidence interval.

The most common reason for people to use power analyses is to work out a reasonable sample size. So running a variety of analyses allows you to get a better picture of what is reasonable. Of course, sometimes your best estimate of an effect size is an entirely blind guess, but you can still probably guess that an effect is likely closer to d = .2 than d = 1.2. So that information is useful. But using a variety of guesses is always going to be better than a single guess. That information can be used to choose a conservative estimate (i.e., assume the effect size is small and shoot for that sample size), or to decide on how best to allocate resources (i.e., we only have resources to get good power for a medium effect size, so we’ll do that and take a gamble if it happens to be smaller). Either way, at least the decision is a bit more informed than going for a cell size of n = 20 or whatever.

## David Colquhoun

One problem is that power uses the term “statistically significant”. One thing (almost the only thing) that all statisticians agree on is that this term should never be used. A letter urging that the term should be dropped, supported by 650+ people, will appear in Nature soon. It is a comment on the The American Statistician’s special edition on inference that will appear at the end of March (my contribution is at https://arxiv.org/ftp/arxiv/papers/1802/1802.04888.pdf ).