Several years ago, Uri Simonsohn (along with Leif Nelson and Joe Simmons) introduced the psychology community to the idea of p-hacking, and his related concept for detecting p-hacking, the p-curve. He later demonstrated that this p-curve could be used as an estimate of true effect size in a way that was better at correcting for bias than the common trim-and-fill method.
More recently, Ulrich Schimmack has been making a few waves himself with his own metric, the R-index, which he has stated is useful as a test of how likely a result is to be replicable. He has also gained some attention for using it as what he refers to as a “doping test” to identify areas of research (and researchers themselves) that are likely to have used questionable research practices (QRPs) that may have inflated the results. In his paper, he shows that his R-index indicates an increase in QRPs from research in 1960 to research in 2011. He also shows that this metric is able to predict the replicability of studies, by analyzing data from the Reproducibility Project and the Many Labs Project.
While Schimmack’s evidence is indeed interesting, one of the difficulties with relying on existing published studies (or openly shared replications) to evaluate metrics such as these is that we do not know for sure the true effect size underlying the research. All we have are estimates of the effect, produced after the fact, by the studies themselves. So it is difficult to say “my metric X adequately removes bias from research to provide a good estimate of the true effect size” without being able to compare to the true effect size. This is where simulations are incredibly important, because we can set the true effect size ourselves. So in this article, I’d like to offer a test of these two metrics, the p-curve and the R-index, to see how they stack up in their ability to identify bias and remove it to provide an estimate of the true effect size.[1]
I won’t go into great detail on how to calculate the R-index and p-curve. Details for how to do so are available here and here. However, in general terms, the R-index examines the median observed (i.e., post hoc) statistical power of a set of studies, and then compares that power to the proportion of significant results. Over the long run, the reasoning goes, if one has 80% ability to detect effects, one should find 80% significant results. To the extent that there are more significant results than this, it is indicative either of chance (which gets increasingly unlikely as more data is examined) or of bias. Schimmack has indicated that the R-index offers an estimate of the true underlying power of the data (which can be easily converted to an estimate of effect size, given a particular sample size and alpha level).
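As a rough illustration of the calculation just described, here is a minimal Python sketch (it is not the R code used for the simulations). It assumes the R-index is the median observed power minus the “inflation”, where inflation is the proportion of significant results minus the median observed power. It also approximates observed power from each two-sided p-value with a normal distribution rather than a t distribution. The function names are hypothetical, chosen only for this example.

```python
from statistics import NormalDist, median

_NORM = NormalDist()

def observed_power(p_value, alpha=0.05):
    """Post hoc power implied by a two-sided p-value (normal approximation).

    Assumes 0 < p_value < 1.
    """
    z_obs = _NORM.inv_cdf(1 - p_value / 2)   # |z| corresponding to the p-value
    z_crit = _NORM.inv_cdf(1 - alpha / 2)    # two-sided critical value
    return _NORM.cdf(z_obs - z_crit) + _NORM.cdf(-z_obs - z_crit)

def r_index(p_values, alpha=0.05):
    """Median observed power minus inflation (2 * median power - success rate)."""
    median_power = median(observed_power(p, alpha) for p in p_values)
    success_rate = sum(p < alpha for p in p_values) / len(p_values)
    inflation = success_rate - median_power
    return median_power - inflation
```

For instance, a set in which every study is only just significant (all p = .04) has a success rate of 1 but a median observed power only slightly above .5, so `r_index([0.04] * 5)` comes out close to 0.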
The p-curve, on the other hand, analyzes the distribution of reported p-values below .05. Where no effect exists, the p-values should be uniformly distributed across the range (i.e., by chance). Where a real effect exists, the p-values should tend to be right-skewed, with a greater number of low p-values. Where no effect exists, but the data have been “p-hacked”, the data should instead be left-skewed, with more values just below the p = .05 threshold. By comparing this to known distributions of p-values for particular effect sizes, one can determine to what extent a given observed distribution of p-values lines up with that expected distribution. Using some fancy optimization, that can be converted to an estimate of the effect size.
In comparing these two metrics, I ran several types of simulations. The general process across all of them was as follows: I randomly selected n cases from two normal distributions separated by a particular effect size, d. From there, I calculated the observed effect size, the p-value associated with an independent samples t-test, and the observed power. That data, with a single t-test, constituted a “study”. From there, I generated k studies by repeating that process. Since both of these metrics are designed to be calculated across a set of studies, this is where I then calculated the R-index and the p-curve for each set of k studies. Finally, I repeated this process of generating sets of studies 1000 times. The full R code for generating these simulations is available here, as is the data (so you can skip redoing the simulation process if you wish—it can take some time!).
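The process above can be sketched in a few lines of Python. This is only an illustration of the structure, not the R code linked above: the function names and the seed are hypothetical, the p-value and observed power use a normal approximation to the t distribution, and only 10 sets are generated here (rather than 1000) to keep it quick.

```python
import random
from statistics import NormalDist, mean, variance

_NORM = NormalDist()
Z_CRIT = _NORM.inv_cdf(0.975)  # two-sided alpha = .05

def simulate_study(n, d, rng):
    """One two-group study: n cases per group, true standardized difference d."""
    control = [rng.gauss(0, 1) for _ in range(n)]
    treatment = [rng.gauss(d, 1) for _ in range(n)]
    pooled_sd = ((variance(control) + variance(treatment)) / 2) ** 0.5
    d_obs = (mean(treatment) - mean(control)) / pooled_sd
    t = d_obs * (n / 2) ** 0.5               # equal-n independent-samples t
    p = 2 * (1 - _NORM.cdf(abs(t)))          # normal approximation to the t-test
    power_obs = _NORM.cdf(abs(t) - Z_CRIT) + _NORM.cdf(-abs(t) - Z_CRIT)
    return {"d": d_obs, "t": t, "p": p, "observed_power": power_obs}

def simulate_set(k, n, d, rng):
    """A set of k independent studies sharing the same n and true d."""
    return [simulate_study(n, d, rng) for _ in range(k)]

rng = random.Random(2015)  # arbitrary seed, used only for reproducibility
study_sets = [simulate_set(k=50, n=50, d=0.5, rng=rng) for _ in range(10)]
```

Each element of `study_sets` is one set of k studies, which is the unit over which a metric such as the R-index or p-curve would then be calculated.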
No Bias, Varying Effect Size
To start off, I wanted to see how these metrics performed when there is no bias present in the data. In other words, this is a scenario in which all data are reported, regardless of whether they are significant or not, and all studies are published. I ran simulations for this scenario across several different effect sizes: d = 0 (no effect), d = .2 (small effect), d = .5 (moderate effect), and d = .8 (large effect). I will note that I ran into errors when calculating the p-curve for d = 0, as the p-curve only analyzes significant results, and it ran into cases where there was insufficient data to estimate the effect (as one might expect when most studies should be non-significant!). In this set of simulations, I generated studies that used 50 participants per condition (100 total), and generated 50 studies per set (i.e., it was based on 50 t-tests each time). With these values, we can calculate the a priori power for each effect size at .050, .168, .697, and .977, respectively.
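For readers who want to check the a priori power figures, the following Python sketch gives values within about .01 of those listed. It is an approximation under a stated assumption: it treats the test statistic as normal rather than using the noncentral t distribution, so it will not match an exact power calculation to the third decimal. The function name is hypothetical.

```python
from statistics import NormalDist

def a_priori_power(d, n_per_group, alpha=0.05):
    """Two-sided power of a two-group comparison (normal approximation)."""
    norm = NormalDist()
    ncp = d * (n_per_group / 2) ** 0.5        # expected value of the test statistic
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region
    return norm.cdf(ncp - z_crit) + norm.cdf(-ncp - z_crit)

for d in (0.0, 0.2, 0.5, 0.8):
    print(d, round(a_priori_power(d, n_per_group=50), 3))
```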
So how did the metrics stand up? Taking a look at the R-index first, we can see that the measure fared reasonably well for the moderate and large effect sizes (see Table 1). However, it was somewhat upwardly biased when there was no effect, and when the effect was small. This might suggest that as a measure of true power/effect size, the R-index is at least somewhat useful. However, in conditions of no bias, simply examining the proportion of significant results offered a better estimate of true power at all effect sizes. In addition, the standard deviation (SD) of the proportion of significant results was somewhat smaller than the SD of the R-index estimates, suggesting that under conditions of no bias, the proportion of significant results also offers at least slightly more precise estimates.
Table 1

| Model | M prop. significant | SD prop. significant | M R-index | SD R-index | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| d = 0, pwr = .05 | .050 | .031 | .160 | .042 | .078 | .242 |
| d = .2, pwr = .17 | .285 | .064 | .305 | .088 | .132 | .478 |
| d = .5, pwr = .70 | .696 | .067 | .691 | .090 | .515 | .866 |
| d = .8, pwr = .98 | .977 | .021 | .975 | .028 | .920 | 1.029 |
Note: After discussing these results with Dr. Schimmack, he noted being puzzled at how the R-index performed at low effect sizes with no bias in the data. The eventual conclusion we reached was that the problem was a matter of how power is bounded between 0 and 1. When meta-analyzing two-tailed statistics, like t-tests or Cohen’s d, small positive effects will cancel out small negative effects. However, due to the nature of power, both small effects show small positive power. As such, when the effect size is small/zero, the R-index will move upward because the positive and negative effects don’t cancel out in the same way. His suggestion was that instead of calculating observed power for each individual study, then using the median, one could instead meta-analyze across effect sizes, then calculate observed power on that final value. I decided to re-run these simulations, recalculating the R-index using that method. This did greatly improve things, particularly for the small effect sizes. For instance, when d = 0, the mean R-index estimate was now .054, with SD of .031, and 95% of the estimates fell between -.008 and .116. In other words, the R-index was no longer biased upwards when the data was not biased. However, this change in calculation, it should be noted, did not change the results of the other simulations described below.[2]
In contrast, the p-curve performed excellently, providing very accurate and precise estimates of true effect size at small, moderate, and large sizes (though again, the estimates were more precise as effect size increased; see Table 2).[3] In general, it seemed that the p-curve worked well, though examining the proportion of significant results worked just fine in this case.
Table 2

| Model | M p-curve | SD p-curve | M p-curve (trimmed) | SD p-curve (trimmed) | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| d = .2, pwr = .17 | .161 | .176 | .185 | .095 | -.001 | .371 |
| d = .5, pwr = .70 | .495 | .055 | .495 | .055 | .388 | .602 |
| d = .8, pwr = .98 | .800 | .034 | .800 | .034 | .734 | .867 |
No Bias, Varying Number of Studies
The second set of simulations I performed kept the effect size constant, but instead varied the number of studies over which these estimates were aggregated. In these simulations, once again, the data were unbiased. I selected a moderate effect size (d = .5), used a sample size per condition that provided approximately .80 power (n = 65), and then varied the number of studies between 10, 30, and 80. I tried to ensure that these simulations offered a fair test of both the R-index and the p-curve. In Schimmack’s paper, one of his examples (Bem’s paper on extrasensory perception) calculates the R-index over 10 data points, so he at least implies that the R-index is suitable for this number of studies.[4] For the p-curve, one of Simonsohn’s papers has suggested that the p-curve will generally provide accurate estimates of “evidential value” with 20 or more p-values. This would at least suggest that my choices of 10, 30, and 80 should provide a fair test for both of these metrics.
In these tests, the R-index fared very well, in general estimating the true power across simulations pretty much spot on (see Table 3). However, the estimates did get more precise as the number of studies increased. When calculated across only 10 studies, the 95% confidence interval of the estimates ranged anywhere from .47 (very poor power, essentially a 50-50 chance) to 1.10 (which I suppose should be interpreted as 100% power to detect an effect[5]). In effect size terms, for 10 studies the R-index would estimate a true effect size of d = .56 (a little higher than the true effect size), with 95% of estimates ranging from .34 to .77. While this is not a poor estimate, it should be noted that simply examining the proportion of significant results was a more accurate and more precise (slightly smaller SDs) estimate than the R-index.
Table 3

| Model | M prop. significant | SD prop. significant | M R-index | SD R-index | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| k = 10 | .802 | .122 | .786 | .160 | .473 | 1.100 |
| k = 30 | .808 | .072 | .798 | .105 | .593 | 1.003 |
| k = 80 | .806 | .045 | .811 | .064 | .686 | .937 |
The p-curve fared similarly well (see Table 4). Its estimates were very accurate, especially after trimming off the artifactual outliers. Similarly to the R-index, the precision of the estimates increased with the number of studies considered. When calculated across only 10 studies, the 95% confidence interval of the effect size estimates ranged from .32 to .67, a slightly smaller range than for the R-index. This greater precision was also true for simulations with a greater number of studies, and in fact the precision of the p-curve converged more quickly than the precision of the R-index as the number of studies increased.
Table 4

| Model | M p-curve | SD p-curve | M p-curve (trimmed) | SD p-curve (trimmed) | 95% CI lower | 95% CI upper |
|---|---|---|---|---|---|---|
| k = 10 | .172 | .609 | .497 | .091 | .318 | .674 |
| k = 30 | .378 | .402 | .498 | .051 | .398 | .598 |
| k = 80 | .488 | .147 | .502 | .030 | .442 | .562 |
Biased Data, Varying Effect Size
Finally, it is important to examine data that is biased, as the proponents of both of these metrics claim that they are able to compensate for bias. Thus, for the final set of simulations, I varied the effect size as before, with d = 0, d = .2, d = .5, and d = .8; and kept the sample size and studies per set constant at n = 50 and k = 50, respectively. However, instead of collecting 50 studies per set no matter their significance, in this set of simulations, the p-value of the t-test was first examined to see whether it was significant (p < .05), and only included in the set if it was. In other words, each set of studies still included 50 studies, but more biased sets would draw from a greater pool of studies to collect all 50. This method of selecting studies is analogous to several sorts of QRPs, such as dropping conditions that are not significant, or dropping studies that show null effects; it also could be analogous to publication bias, where only significant results are published. In addition, it should be noted that increasing the effect size should generally decrease the amount of bias in this case, as greater power should mean that more studies are able to detect an effect. Thus, simulations with larger effect sizes should be less (though still somewhat) biased overall.
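The selection rule described above amounts to a simple loop: keep running studies, discard the non-significant ones, and stop once k significant results have been collected. A minimal Python sketch follows. It is not the R code used for the simulations; the names are hypothetical, and the p-value again relies on a normal approximation to the t-test.

```python
import random
from statistics import NormalDist, mean, variance

_NORM = NormalDist()

def study_p_value(n, d, rng):
    """Two-sided p-value for one simulated two-group study (normal approximation)."""
    control = [rng.gauss(0, 1) for _ in range(n)]
    treatment = [rng.gauss(d, 1) for _ in range(n)]
    pooled_sd = ((variance(control) + variance(treatment)) / 2) ** 0.5
    t = (mean(treatment) - mean(control)) / pooled_sd * (n / 2) ** 0.5
    return 2 * (1 - _NORM.cdf(abs(t)))

def biased_set(k, n, d, rng, alpha=0.05):
    """Collect k significant studies; also report how many were run in total."""
    kept, attempts = [], 0
    while len(kept) < k:
        attempts += 1
        p = study_p_value(n, d, rng)
        if p < alpha:          # non-significant studies are silently dropped
            kept.append(p)
    return kept, attempts
```

The `attempts` count makes the degree of bias visible: on average it should be about k divided by the true power, so it is far larger than k when d = 0 and barely above k when d = .8.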
Examining the R-index revealed that it did not seem to adequately correct for bias (see Table 5). This problem was especially pronounced when the true effect size was 0 or small, most likely because greater bias exists under these conditions.[6] When the true effect size was d = 0 and true power was thus .05, the mean R-index estimate was .23, erroneously suggesting a small effect size of d = .25. In fact, the absolute lowest R-index in any of the 1000 sets of studies for true d = 0 was .14, suggesting an effect of d = .17, and 95% of the estimates fell between .16 and .31. The R-index did not fare much better in the small effect simulation, in which 95% of the estimates fell between .22 and .44, a range that did not include the true power of .17. However, when the effect was moderate or large, the R-index offered at least a reasonable estimate of the power, though it did underestimate it slightly.[7]
Table 5

| Model | M R-index | SD R-index | 95% CI lower | 95% CI upper | Comparison with unbiased |
|---|---|---|---|---|---|
| d = 0, pwr = .05 | .235 | .040 | .156 | .314 | t(1998) = 40.53, p < .001 |
| d = .2, pwr = .17 | .331 | .056 | .220 | .441 | t(1998) = 7.82, p < .001 |
| d = .5, pwr = .70 | .632 | .068 | .497 | .766 | t(1998) = -16.48, p < .001 |
| d = .8, pwr = .98 | .955 | .019 | .919 | .992 | t(1998) = -18.55, p < .001 |
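The effect sizes quoted above for particular R-index values come from treating the R-index as an estimate of power and asking which d would produce that power at n = 50 per condition. A minimal Python sketch of that inversion is below. It assumes a normal approximation to the power function and uses simple bisection; the function names are hypothetical, and this is not necessarily how the original conversion was carried out.

```python
from statistics import NormalDist

_NORM = NormalDist()

def power_for_d(d, n_per_group, alpha=0.05):
    """Two-sided power for a two-group comparison (normal approximation)."""
    ncp = d * (n_per_group / 2) ** 0.5
    z_crit = _NORM.inv_cdf(1 - alpha / 2)
    return _NORM.cdf(ncp - z_crit) + _NORM.cdf(-ncp - z_crit)

def d_for_power(power, n_per_group, alpha=0.05, tol=1e-6):
    """Find the non-negative d whose power matches the target, by bisection."""
    if not alpha <= power < 1:
        raise ValueError("power must lie in [alpha, 1) to be inverted")
    low, high = 0.0, 10.0
    while high - low > tol:
        mid = (low + high) / 2
        if power_for_d(mid, n_per_group, alpha) < power:
            low = mid      # power too low, so the effect must be larger
        else:
            high = mid
    return (low + high) / 2
```

Under these assumptions, a power of .235 at n = 50 corresponds to a d of roughly .25, and a power of .14 to a d of roughly .17. Values of 1 or above, which the R-index can produce, cannot be inverted at all, which is why the sketch rejects them.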
The relatively poor performance of the R-index under conditions of large bias would seem to result from the metric’s reliance on observed (post hoc) power in its calculation.[8] While Schimmack does observe that using QRPs to gain significant results will tend to increase the proportion of significant results, he seems to underestimate the influence this also has on the observed power. In fact, when examining the unbiased data, at all effect sizes the observed power was significantly positively correlated with the proportion of significant results, in some cases quite strongly, at approximately .70.[9] Although Schimmack attempts to control for this by using the median observed power across all studies, the results of these simulations suggest that this does not adequately correct for bias when the bias is large. This is the case even when aggregating results across a relatively large number of studies (50 in this case). To the extent that observed power and the proportion of significant results are correlated, this will tend to undercut the bias correction included in the R-index calculation.
In contrast to the performance of the R-index under conditions of bias, the p-curve worked very well at correcting for bias to accurately estimate the true effect size (see Table 6). Similar to the R-index, the p-curve estimates were more precise as effect size increased (and consequently, bias decreased), but even under conditions of large bias when true d = 0, 95% of the p-curve’s estimates fell between -.19 and .18. This is perhaps not as precise as one might like, but it at least includes the true value of 0. In addition, the mean of the estimates was d = -.005, quite impressively close.
Table 6

| Model | M p-curve | SD p-curve | 95% CI lower | 95% CI upper | Comparison with unbiased |
|---|---|---|---|---|---|
| d = 0, pwr = .05 | -.005 | .096 | -.193 | .183 | n/a |
| d = .2, pwr = .17 | .189 | .074 | .045 | .334 | t(1973) = 1.13, p = .26 |
| d = .5, pwr = .70 | .494 | .044 | .407 | .580 | t(1998) = -.53, p = .59 |
| d = .8, pwr = .98 | .799 | .033 | .734 | .863 | t(1998) = -.92, p = .36 |
In addition to examining the distributions of estimates for the R-index and p-curve, I also conducted t-tests to compare the estimates for the biased data to the estimates for the unbiased data with the same true underlying effect size. This provides an additional test of the extent to which a metric is able to correct for bias in the data: A non-significant t-test indicates that the estimates of true power/effect size are similar regardless of the extent of bias in the data, while a significant t-test suggests that the metric fails to fully correct for bias. As shown in the last column of Tables 5 and 6, the R-index in all cases showed different estimates in the biased simulations as compared to the unbiased simulations.[10] The p-curve, on the other hand, showed no difference in its estimates (though there were no unbiased estimates for the d = 0 simulation).[11]
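These comparisons can be approximately reconstructed from the summary statistics in the tables. The Python sketch below assumes a pooled-variance independent-samples t-test with 1000 estimates per condition, which is consistent with the 1998 degrees of freedom reported. Because the tabled means and SDs are rounded, the result will differ slightly from the reported t values. The function name is hypothetical.

```python
def pooled_t(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Independent-samples t statistic and df from summary statistics."""
    df = n_1 + n_2 - 2
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / df
    standard_error = (pooled_var * (1 / n_1 + 1 / n_2)) ** 0.5
    return (mean_1 - mean_2) / standard_error, df

# R-index at d = 0: biased estimates (Table 5) vs. unbiased estimates (Table 1)
t_value, df = pooled_t(0.235, 0.040, 1000, 0.160, 0.042, 1000)
print(round(t_value, 2), df)
```

For the d = 0 row this gives a t of roughly 41 on 1998 degrees of freedom, in line with the 40.53 reported in Table 5.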
As a final note, it should be mentioned that this final set of simulations is quite a rigorous test in some sense and also quite fair in another sense. It is fair in that it collates results across 50 studies each time, quite generous in comparison to some of the examples used by both original authors to demonstrate their respective metrics. However, these simulations are quite rigorous in the sense of the bias they depict. For each set of studies, 100% of the results were required to be significant; this is especially unlikely to be the case when the true effect is tiny or non-existent, where studies may sometimes get published for showing “no evidence of an effect, contrary to previous research”. Were this standard relaxed to, say, 80% significant results, both these metrics would presumably fare better in the accuracy and precision of their estimates. However, as previously mentioned, the positive correlation between observed power and proportion of significant results should still unduly influence the R-index even in this case.
In light of these simulations, which covered a variety of effect sizes, numbers of aggregated studies, and degrees of bias, it appears that the p-curve offers a more accurate and precise estimate across this range of scenarios. Where no bias exists, the R-index may still be useful as an estimate of the true power, but can still suffer from low precision when aggregating across fewer than 30 studies. However, given that a) in real research, we have no a priori knowledge of how much bias exists, and b) this metric is itself intended to be used as a measure of bias, it seems problematic that the R-index does not adequately correct for bias, especially when said bias is large.
The p-curve appears to adequately “cut through” bias in research, even when bias is large. However, researchers should still be careful in their interpretations when they have prior reason to believe that a great amount of bias exists in the data, as this influenced the precision of the p-curve’s estimate of effect size. Where the p-curve indicates a small effect, it may be reasonable to exercise caution in placing confidence in such an estimate, because there is considerable overlap between the p-curve estimates of a null effect and a small effect, even aggregating over 50 studies. However, the p-curve still offers an excellent choice for cases where bias may be less extreme, for larger effect sizes, or for cases where distinguishing between null effects and small effects is not particularly meaningful.
Some Notes on “Doping Tests”
In my analyses above, I have focused on using these two metrics as estimates of true effect size/power. It should be noted that both Schimmack and Simonsohn have made claims about the use of their respective metrics for identifying cases where QRPs or p-hacking are likely to be present. Indeed, while Schimmack does state that his R-index can be used as an estimate of true power, his primary use for it seems to have been (in his words) as a “doping test” to detect cases of QRPs.
I have not designed my simulations to test the use of either of these metrics for efficacy as a doping test in particular. However, my analyses do at least offer some insight into the usefulness of these tools for that purpose; one would first want to ensure that the test was accurate in identifying bias where it exists, and was precise enough that the false positive rate was low. Accuracy would correspond to how closely the mean estimates of the R-index and p-curve lined up with true power/effect size. Precision would correspond to the width of the distribution of these estimates. The evidence I have presented would suggest that the R-index does not offer a sufficiently accurate test of bias, as it is itself unduly influenced by bias in the data. In situations where the true underlying power is not known (i.e., actual research), it would be very difficult to distinguish between a case where d = 0 but bias exists, and a case where the true effect size is small but no bias exists.
With that said, however, my simulations are not optimal for determining the efficacy of these metrics as a “doping test”. It would be much preferable to use simulations with a signal detection framework to identify cases where bias is correctly identified (hits), where it is missed (false negatives), where no bias is correctly identified (correct rejections) and where no bias is mistaken for bias (false positives).[12] Given the possible negative impact on the reputations of researchers in the field, any such possible test for bias should be thoroughly validated in this manner. If it were only able to offer sufficient precision when aggregated across 200 significant tests, for example, it would be inappropriate to apply it to a single paper or set of papers.
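To sketch what the bookkeeping for such a validation might look like: suppose some decision rule has already been applied to many simulated sets of studies, some generated with bias and some without. The Python below simply tallies the four signal detection outcomes. It is purely illustrative. The function name is hypothetical, and no particular decision rule for the R-index or p-curve is assumed; the inputs are just the yes/no decisions such a rule would produce.

```python
def detection_rates(flagged_when_biased, flagged_when_unbiased):
    """Signal detection rates from lists of booleans.

    Each list holds one True/False per simulated set of studies:
    True means the test flagged that set as biased.
    """
    hit_rate = sum(flagged_when_biased) / len(flagged_when_biased)
    false_positive_rate = sum(flagged_when_unbiased) / len(flagged_when_unbiased)
    return {
        "hit_rate": hit_rate,                              # bias present, flagged
        "false_negative_rate": 1 - hit_rate,               # bias present, missed
        "false_positive_rate": false_positive_rate,        # no bias, flagged anyway
        "correct_rejection_rate": 1 - false_positive_rate, # no bias, not flagged
    }

rates = detection_rates(
    flagged_when_biased=[True, True, True, False],
    flagged_when_unbiased=[False, False, True, False],
)
```

In a real validation, each list would hold one decision per simulated set, and the rates would be examined separately for different effect sizes and numbers of studies.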
However, my primary reason for not testing the efficacy of these metrics as a doping test is that I am not particularly convinced that such a test is necessary. First of all, I would generally prefer to rely on other ways (particularly those focused on making structural changes to the academic environment as a whole[13]) to encourage the field to reduce its reliance on QRPs. In addition, it can be incredibly difficult to distinguish some forms of bias from other characteristics of data that are common in psychology. As an analogy, meta-analyzing across several positive and several negative effect sizes might provide the appearance that the “true effect size” is zero. However, it might just as easily suggest the presence of as-yet-undiscovered moderators that clarify when and where an effect is reduced or reversed. As I have written previously, discussing the idea of “real effects” can get complicated, especially in a field where all phenomena are complex and multiply determined. In the same way, finding a discrepancy between power and the proportion of significant results could indicate bias, or it could indicate undetermined moderators that amplify or attenuate the effect to a greater extent than what power calculations would suggest. Similarly, a p-curve assessing p-hacking may not arrive at the correct conclusion if aggregating across studies with differing effect sizes. In any meta-analytic technique such as these, it is important to consider the role that undiscovered moderators (of which there are always many) may have on the results in any particular case, making claims about “bias” or “QRPs” or “p-hacking” or “researcher degrees of freedom” potentially problematic.
I hope that this analysis of the R-index and p-curve has been helpful for better understanding how these metrics perform under a variety of conditions.[14] Finding ways to better estimate power/effect size in a literature fraught with biases of various sorts is an important addition to our field. However, it is important to ensure that these tools are properly validated with regard to their accuracy and precision. It is also important to use and interpret them wisely, with the recognition that psychological phenomena, to a greater extent than perhaps many scientific fields, are influenced by a multitude of variables that can complicate the interpretation of estimates of “true effect size”.
1. I do acknowledge that Simonsohn has done quite extensive simulations to test the p-curve already. So in this article, to a large extent I am duplicating tests that Simonsohn has already himself done. In addition, Schimmack offers on his blog a small simulation using Excel. But I wanted to offer a direct comparison between the two metrics on the same data, so that they are directly comparable.
2. I have chosen to keep the analyses I originally performed in the main text for a couple reasons. First, the way I originally calculated the R-index was in keeping with the way Schimmack describes how to calculate the R-index. As such, it is consistent with how he has used it in his paper and on his blog, so my simulations show the performance of his measure in the way he himself uses it. Second, Schimmack outlined in his paper his reasoning for using median observed power, where power is first calculated separately for each study individually. He noted that this avoids the “assumptions…that effect sizes in one study are related to effect sizes in other studies” (p. 16). He also notes that this helps to deal with the non-normality of the sampling error of observed power estimates. Which calculation on balance is preferable, then, may be something to which Schimmack is better able to speak.
3. When looking at the graph, one notes that there were some strange negative effect size outliers at d = .2. This, I highly suspect, is part of the optimization process going a little wrong, getting stuck optimizing a local minimum instead of finding the global minimum. Notably, the R code that Simonsohn provides in his paper includes a method of graphing the optimization process, which would allow one to easily correct for situations where the optimization process doesn’t quite work properly. For a single estimation of a p-curve, as one would usually do, it is easy enough to examine the graph and make sure it worked properly. For several thousand simulations, it’s not exactly feasible for me. So in the tables of data, I present the p-curve estimates both with and without trimming these outliers.
4. In fact, in at least one case on his blog, he applies the R-index to just two significance tests.
5. Schimmack states in his paper that the R-index ranges from 0 to 1, similar to statistical power. However, this is not technically accurate, as the estimate can exceed 1 if the proportion of significant results is below the median observed power, as we might expect sometimes even just by chance. Mathematically, the R-index is bounded between -1 (when the proportion of significant results is 1 and power is 0), and 2 (when the proportion of significant results is 0 and power is 1), though practically speaking, most estimates should still fall between 0 and 1.
6. When d = 0, for example, only 1 in 20 studies should generally show a significant effect. This means that finding 50 significant studies should generally be drawing from about 1000 studies, 950 of which are non-significant.
7. It is interesting to note that, since the R-index overestimated small effects, and underestimated large effects, the data is at least consistent with the notion that the R-index may itself tend toward moderate levels. However, more simulations would need to be done to confirm that this is indeed the case.
8. Indeed, in the biased simulations here, the proportion of significant results is always 1, so the remaining bias in the R-index must be due to the only other variable in the equation: observed power.
9. Other researchers have noted similar concerns with the use of observed/post hoc power calculations, and the general recommendation is to avoid calculating them entirely. For an example, see Daniël Lakens’ blog article on the subject; see also Hoenig and Heisey (2001); Yuan and Maxwell (2005).
10. Using the revised calculations for R-index in the “Note” above did not change the significance of the t-tests. In fact, in all cases the t values were larger than those reported in Table 5 above.
11. For the p-curve tests, I opted to use the trimmed p-curve estimates for the unbiased data, as I felt that this was more closely analogous to what would occur in an actual p-curve analysis where the optimization process could be easily supervised by the researcher. However, this decision only impacted the comparison for the d = .2 simulations. Using the untrimmed data did show a significant difference in this situation, t(1998) = 4.66, p < .001, due to the large effect of the outliers on the distribution of estimates for the unbiased data. However, this would mean that, if anything, the p-curve estimates for the biased data were more accurate.
12. It should be noted that Simonsohn’s paper on the p-curve does include such simulations. To my knowledge, such an analysis has not been done for the R-index, partly because Schimmack to this point appears to have used the R-index in a continuous manner rather than providing a specific test distinguishing between “bias” or “no bias” (or perhaps various discrete degrees of bias). On his blog, he has offered some benchmarks that could possibly be turned into a discrete test. However, given that some of these benchmarks differ by as little as .10, while the simulations above for small or moderate effect sizes have confidence intervals much larger than this, additional work would need to be done to ensure that such a discrete test is precise enough to be usable.
13. For example, I am highly in favour of developing constructive initiatives that reduce the “publish or perish” mentality, and encourage open sharing of data and materials, among other possibilities.
14. Note that this is not intended to be the final word on the validation of these metrics. I would encourage others to run simulations of their own, perhaps testing other QRPs, like data peeking or dropping conditions. The R code I used is freely available for you to use, and is set up in a modular fashion that should (hopefully!) make it easy to modify. To test a different QRP, you can create a function that generates study results, and then pass that new function to the existing simulation function. If you have any questions or issues with the code, feel free to comment below, or use the contact information in the sidebar to contact me.