Anyone who analyzes data knows (or should know!) the importance of not violating the assumptions of the tests one runs. And for common tests like t-tests, correlation, ANOVA, and regression, one of the assumptions is that the variables are normally distributed.[1] One approach that some people take, then, is to run a test for normality on the data, such as the Kolmogorov-Smirnov (K-S) test or the Shapiro-Wilk (S-W) test. If the test indicates a deviation from normality, they might try a transformation, or use a more robust statistical test to analyze their data. I’m here to say that this approach will make life unnecessarily hard for you. Here’s the summary of this article right up front: If you want to see if normality assumptions are violated, don’t use a normality test.
One of the useful properties of t-tests, ANOVAs, and the like is that they are fairly robust to deviations from normality. However, normality tests like the K-S and S-W test are not so kind. They are very sensitive to any deviations from the normal distribution. This is fine, if your question is, “Is this variable normally distributed?” But if you just want to ask, “Is this variable non-normal enough that it will be a problem for the ANOVA I want to run?” then you are unnecessarily holding yourself to a very high standard.
Here are some simulations I ran. The full R code is available here.[2] I generated 1000 samples from a normal distribution at each of four sample sizes: n = 10, n = 100, n = 1000, and n = 5000. As we know, the larger a sample drawn from a normal distribution, the more closely it should resemble that distribution. I ran the K-S and S-W tests on all of these samples. Then I also modified each of the 4000 samples slightly, increasing the skewness to create slight deviations from normality. When I say slight, I mean slight. On average, the skew was only about .07 (see Table 1). Eyeballing a histogram, you might not even notice it. Then I ran the same tests on these slightly non-normal samples as well.
|Sample||Mean||SD||Skew||Kurtosis|
|n = 10, normal||.006||.964||-.018||-1.001|
|n = 100, normal||.004||.997||.010||-.110|
|n = 1000, normal||-.0001||1.001||.006||-.014|
|n = 5000, normal||-.0002||1.000||.001||-.003|
|n = 10, skewed||.806||1.232||.061||-1.058|
|n = 100, skewed||.804||1.249||.072||-.236|
|n = 1000, skewed||.800||1.250||.077||-.156|
|n = 5000, skewed||.800||1.249||.074||-.150|
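Here is a minimal sketch of how descriptive statistics like those in Table 1 could be computed for the normal samples. It is written in Python using only the standard library, not the R used for the original simulations. The function names (`describe`, `average_descriptives`) and the seed are illustrative assumptions, and the skewness and kurtosis formulas are the simple moment-based versions, which may not match the estimators behind the table. The step that added skew is left out because the article does not describe how it was done.

```python
import math
import random
import statistics


def describe(sample):
    """Return (mean, SD, skewness, excess kurtosis) for one sample."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Central moments with an n denominator (simple moment-based formulas).
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    sd = math.sqrt(m2 * n / (n - 1))      # sample SD (n - 1 denominator)
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return mean, sd, skew, excess_kurtosis


def average_descriptives(n, n_samples=1000, seed=1):
    """Average each descriptive over n_samples standard-normal samples of size n."""
    rng = random.Random(seed)
    rows = [describe([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(n_samples)]
    return tuple(statistics.fmean(column) for column in zip(*rows))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        mean, sd, skew, kurt = average_descriptives(n)
        print(f"n = {n}: mean {mean:.3f}, SD {sd:.3f}, "
              f"skew {skew:.3f}, kurtosis {kurt:.3f}")
```

Averaging over many samples matters here: a single small sample can show noticeable skew or kurtosis by chance even when the underlying distribution is exactly normal.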
And here are the results of the two sets of tests (Table 2), shown as the percentage of samples flagged in each condition:
|Sample||S-W||K-S||Skew > ±2||Kurtosis > ±2|
|n = 10, normal||5.1%||3.8%||0%||0%|
|n = 100, normal||4.1%||3.4%||0%||.2%|
|n = 1000, normal||5.8%||4.3%||0%||0%|
|n = 5000, normal||4.4%||4.7%||0%||0%|
|n = 10, skewed||4.3%||48.5%||0%||.1%|
|n = 100, skewed||4.6%||100%||0%||0%|
|n = 1000, skewed||20.3%||100%||0%||0%|
|n = 5000, skewed||78.3%||100%||0%||0%|
There are two things to note here. First, there are competing forces at work: smaller samples give these tests less power to detect real deviations, while larger samples make them sensitive enough to flag trivial ones. Second, the S-W test is a little more likely than the K-S test to flag normally distributed data as non-normal, while the K-S test is drastically more likely than the S-W test to flag slightly skewed data as non-normal. This might speak to the accuracy of the K-S test, but some caution is warranted: the skewed samples also differ from the normal ones in the first two numeric columns of Table 1 (about .80 and 1.25 rather than 0 and 1), and a K-S test run against a fixed reference distribution reacts to differences in location and spread as well as in shape. Either way, given that almost no real-world data will be exactly normal, it may mean that the K-S test is not particularly useful for answering questions with real data.
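The tension between power and sensitivity can be illustrated with a rough sketch of a one-sample K-S test against a standard normal, again in standard-library Python and not the article's R code. Several things here are assumptions: the function names are hypothetical, the p-value uses the asymptotic Kolmogorov series with a common small-sample adjustment (so it is approximate), and the "trivial deviation" is a mean shift of 0.1 chosen for simplicity, not the skew manipulation used in the simulations above.

```python
import math
import random
from statistics import NormalDist


def ks_statistic(sample, dist=NormalDist(0.0, 1.0)):
    """Largest gap between the empirical CDF and the reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def ks_p_value(d, n, terms=100):
    """Approximate p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series converges poorly here and p is essentially 1
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def rejection_rate(n, shift=0.0, n_samples=1000, alpha=0.05, seed=1):
    """Share of simulated N(shift, 1) samples that the test flags at alpha."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_samples):
        sample = [rng.gauss(shift, 1.0) for _ in range(n)]
        if ks_p_value(ks_statistic(sample), n) < alpha:
            rejected += 1
    return rejected / n_samples


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"n = {n}: exact normal {rejection_rate(n):.1%}, "
              f"mean shifted by 0.1 {rejection_rate(n, shift=0.1):.1%}")
```

With exactly normal data the rejection rate should stay near the nominal 5% at every sample size, while the rate for the shifted data should climb as n grows, even though the deviation itself never changes.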
In the last two columns of Table 2, I have included the percentage of simulations that show skewness and kurtosis greater than ±2. General guidelines vary as to what constitutes “problematic” skew or kurtosis, and while West, Finch, and Curran (1995) suggest a threshold of ±2 for skew and ±7 for kurtosis (and Kline, 2005 is even more liberal), I decided to go with the more conservative guidelines of Tabachnick and Fidell (2013), who say ±2 for both. So going by this rule of thumb for what violates assumptions of normality, we can see that both the S-W and K-S tests are flagging much, much higher rates of non-normality. In contrast, virtually none of the simulations had highly skewed or kurtotic data according to the ±2 rule of thumb.
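As a small sketch of how such a rule of thumb could be applied to a single variable, the snippet below checks a sample against the two sets of thresholds mentioned above. The function names are hypothetical, the formulas are simple moment-based estimates, and treating the kurtosis thresholds as applying to excess kurtosis (where a normal distribution scores 0) is an assumption.

```python
import random

# Thresholds as (max |skew|, max |excess kurtosis|) from the guidelines cited above.
GUIDELINES = {
    "Tabachnick & Fidell (2013)": (2.0, 2.0),
    "West, Finch, & Curran (1995)": (2.0, 7.0),
}


def skew_and_kurtosis(sample):
    """Moment-based skewness and excess kurtosis of one sample."""
    n = len(sample)
    mean = sum(sample) / n
    m2 = sum((x - mean) ** 2 for x in sample) / n
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def rule_of_thumb_flags(sample):
    """For each guideline, report whether skew or kurtosis exceeds its threshold."""
    skew, kurt = skew_and_kurtosis(sample)
    return {name: abs(skew) > max_skew or abs(kurt) > max_kurt
            for name, (max_skew, max_kurt) in GUIDELINES.items()}


if __name__ == "__main__":
    rng = random.Random(1)
    normal = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    lognormal = [rng.lognormvariate(0.0, 1.0) for _ in range(1000)]
    print("normal:   ", rule_of_thumb_flags(normal))
    print("lognormal:", rule_of_thumb_flags(lognormal))
```

The lognormal sample is included only as an example of data that is skewed enough to matter; it is not one of the distributions simulated in the article.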
So what does this all mean? Well, if your question is, “Is this variable normally distributed?” then the K-S test seems to perform better at finding any deviation, with a low false positive rate and a high true positive rate. And this is a completely sensible reason to use a test of normality. But if your question is, “Is this variable non-normal enough that it violates the assumptions of the statistical test I want to run?” then the sensitivity of these tests is so high (especially with large sample sizes) that they will flag even trivial deviations as “significantly non-normal”. Thus, if you go by the significance of one of these normality tests, you may end up using a more conservative “robust” test that is completely unnecessary. In contrast, even a relatively conservative rule of thumb, such as requiring skewness and kurtosis to fall within ±1.5, would still provide a better indication of whether you’re violating assumptions.
In summary, tests of normality aren’t really set at the same threshold as what is most useful for detecting violations of assumptions. Feel free to use these tests if you want, but you should know that you are setting an extremely conservative threshold for yourself, and if you have a reasonably large sample size and use the K-S test, be prepared to always use a robust test. But let me be clear: You should still inspect your variables for normality. Check the skew and kurtosis, plot a histogram or a Q-Q plot, etc. Just don’t rely on a convenient p < .05 threshold from one of these normality tests to do the job for you.
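For the Q-Q plot, the coordinates are easy to compute by hand. The sketch below pairs each sorted observation with the standard-normal quantile at the same rank, using only Python's standard library. The function name `qq_points` and the plotting positions (i + 0.5) / n are assumptions made for illustration, and other conventions for the positions exist.

```python
import random
from statistics import NormalDist


def qq_points(sample):
    """Pair each sorted observation with the matching standard-normal quantile."""
    xs = sorted(sample)
    n = len(xs)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n keep every probability strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, xs))


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.gauss(0.0, 1.0) for _ in range(20)]
    print("theoretical  observed")
    for expected, observed in qq_points(sample):
        print(f"{expected:11.2f}  {observed:8.2f}")
```

Plotting the observed values against the theoretical ones gives the Q-Q plot. Points that fall roughly along a straight line suggest the variable is close to normal, while systematic curvature at the ends points to skew or heavy tails.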