A little over a week ago, I had the opportunity to go to yet another meeting of the Society for Personality and Social Psychology (SPSP). It’s always a great time, with plenty of very interesting talks and posters! It’s also always a pleasure to travel from the harsh Canadian winter to someplace warm to talk about psychology. Walking around in a t-shirt in February is not a common experience for me.
Perhaps this is just my perception, but over the past few years there seems to be a growing trend toward people doing meta-analyses of the studies they present. I’m sure you know what I’m talking about: they present three studies, and maybe the last one has only a marginal effect, but then they say, “But when you meta-analyze over all three studies, the overall effect is highly significant.” This year I saw at least a couple people do this in their talk, and I’ve seen it before at previous conferences and in other contexts. So I want to talk just a little bit about these informal mini-meta-analyses—to distinguish them from more formal meta-analyses, I’m going to call them “meso-analyses”—and talk about some of the caveats of this technique.
A Wave of the Hand
One thing that worries me when people mention the result of a meso-analysis is whether they would have run the same analysis had all their effects been p < .05. It’s worrisome because it sometimes (often?) seems like it’s done merely to bolster an effect that wasn’t quite as strong as the researchers had hoped. That, in and of itself, isn’t necessarily a bad thing, but when people only reach for a meso-analysis to bolster their argument, it becomes more of a rhetorical flourish than a statistical technique.
Of course, the results are what they are, but if you only run the analysis to reassure people that p = .07 in one study is okay, how much information do we really gain from it? Since people will only mention the meso-analysis if it works in their favour, it seems more like a case of an additional researcher degree of freedom. (In other words, if Study 3 doesn’t work well but the meso-analysis is significant, mention the meso-analysis; otherwise, don’t mention it, and drop Study 3 instead.) In order to really take meso-analyses seriously, they would need to become standard practice for all programs of research, rather than something one does to cover up slightly-tarnished results.1
But is it something that should become standard practice in the first place? Just how much information does it actually give us? Well, it’s complicated. Meta-analyses are most convincing when the effects they aggregate are as homogeneous as possible: the same manipulations/measures, the same populations, the same phenomenon. To the extent that the inputs differ from this ideal, things can start to get a little hairy. As an example, consider a set of three studies, where two are strong, hit-participants-over-the-head manipulations leading to d = .8, and the third is a very subtle, subliminal prime manipulation leading to d = .03. Theoretically, we may have reason to argue that these manipulations are influencing the same psychological phenomenon, but to what extent is it really useful to aggregate across these three studies? The meso-analysis might reveal an overall effect of d = .54, but that’s only because we had two strong manipulations and one weak one. If we added another subtle prime for Study 4, this estimate would change. Do you think the “true underlying effect size” should change as a result of how many (or what kind of) studies we decided to run? Of course not; the effect size here is largely a product of how many strong vs. weak manipulations we choose to run.
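To make the arithmetic concrete, here’s a minimal sketch of inverse-variance (fixed-effect) pooling of Cohen’s d. The d values mirror the hypothetical studies above; the sample size of 40 per group is my own assumption, added purely for illustration. Note how adding a second subtle-prime study pulls the pooled estimate down:

```python
# Minimal sketch of fixed-effect (inverse-variance) pooling of Cohen's d.
# The d values mirror the hypothetical example in the text; n = 40 per
# group is an assumption made purely for illustration.

def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d (Borenstein et al., 2009)."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

def pooled_d(effects, n=40):
    """Fixed-effect pooled estimate: weight each d by its inverse variance."""
    weights = [1 / d_variance(d, n, n) for d in effects]
    return sum(w * d for w, d in zip(weights, effects)) / sum(weights)

three_studies = [0.8, 0.8, 0.03]       # two strong manipulations, one subtle prime
four_studies = [0.8, 0.8, 0.03, 0.03]  # add a second subtle prime as "Study 4"

print(f"{pooled_d(three_studies):.2f}")  # roughly .53
print(f"{pooled_d(four_studies):.2f}")   # drops to roughly .40
```

The pooled estimate isn’t a property of the phenomenon; it’s a property of the particular mix of studies we happened to run.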
Of course, a high-quality meta-analysis takes these sorts of issues into account. Calculating Cochran’s Q or an I² statistic will show you the extent of heterogeneity in a sample, and using subgroups or meta-regression can help to assess which characteristics of the studies account for that heterogeneity. In this case, it would make the most sense to calculate the effect size for the strong manipulations as one group, and the weak manipulations as a separate group.2 The problem is that I’ve never heard anyone mention this in a meso-analysis. Maybe they’ve checked and have adequate homogeneity to make the aggregate effect size worthwhile, but just didn’t mention it in their talk. Maybe. But I have seen cases where it was obviously applied inappropriately. And given what I’ve suggested above, that it’s often used more as a rhetorical rather than statistical tool, I suspect that in many cases researchers have not bothered to do their due diligence. And that’s bad.
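As a sketch of what that due diligence might look like, here is Cochran’s Q and I² computed for the hypothetical three-study example (again assuming 40 participants per group, my own assumption). For these numbers the heterogeneity is substantial, flagging that a single pooled estimate would be misleading:

```python
# Sketch: Cochran's Q and I-squared for the hypothetical three-study example.
# Q compares each study's effect to the pooled effect, weighted by precision;
# I-squared re-expresses Q as the percentage of variability attributable to
# heterogeneity rather than chance. n = 40 per group is an assumption.

def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

def heterogeneity(effects, n=40):
    """Return (Q, I-squared %) for a set of standardized mean differences."""
    weights = [1 / d_variance(d, n, n) for d in effects]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    df = len(effects) - 1
    i_squared = max(0.0, (q - df) / q) * 100
    return q, i_squared

q, i2 = heterogeneity([0.8, 0.8, 0.03])
print(f"Q = {q:.2f} (df = 2), I^2 = {i2:.0f}%")
# Q exceeds the chi-square critical value of 5.99 (df = 2, alpha = .05),
# so we would reject homogeneity and subgroup the strong vs. weak studies.
```

With Q around 7.7 on 2 degrees of freedom and I² around 74%, most of the spread in these effects reflects genuine differences between the manipulations, not sampling noise.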
The Inconvenience of Convenience
A final note about meso-analyses is that they are particularly susceptible to issues of sampling. Obviously, if I have 3 (or 300) studies on the effects of watching violent TV on the aggression of adolescent boys, I can’t aggregate the results of these studies and then apply the effect size estimate to girls as well. That much is clear. But given that most psychological research does not use random sampling (understatement of the year), this can also make it very dicey to talk about a meta-analysis providing an estimate of the “true effect size in the population”.
While meta-analyses can suffer from this, they are at least often dealing with a greater number of studies, from multiple researchers in multiple labs; in a meso-analysis, problems with sampling can be amplified. Not only are they typically done on one researcher’s work, but they combine multiple convenience samples, meaning that it’s questionable whether you can even make claims about the “true effect size” in that researcher’s university population! A meso-analysis might be useful for an individual researcher to answer the question, “Given these four studies I have already run, what effect size am I likely to get if I run a fifth one?” But it offers limited usefulness for grand generalizations to the world outside that researcher’s laboratory.
Given these issues, is there any real reason to do a meso-analysis? In some cases, sure. Just like a larger meta-analysis, the quality of the studies you put in will determine the quality of the effect size estimate you get out. If you have studies that are very similar in nature, there’s nothing wrong with doing a little meso-analysis to get a better estimate of the underlying effect size. It’s just important to interpret the estimate in light of the studies you put into it. Aggregating over a small number of studies is not going to average out any biases of convenience sampling, choice of measures, or researcher-specific idiosyncrasies. But a meso-analysis of your past research might help you plan out your next study. It might even help others be more convinced by your findings, provided you aren’t grossly misusing the technique to average over different effects to make a marginal effect look better. Considering sampling strategies, heterogeneity, and methodological differences is important for any meta-analytic technique, but it’s especially important if you’re aggregating over a small number of studies. When you have a total of three studies, calling it a “meta-analysis” might lend it more credibility than it deserves.
One reasonable use of a meso-analysis in a conference talk might be to allow you to present a subsample of representative studies: “We’ve done five studies on this with similar methods; I only have time today to highlight two studies, but meta-analyzing across all five shows a strong overall effect, and these two studies are representative of that effect.” That can help indicate to your audience that you aren’t cherry-picking the best-looking studies from the overall package.
More generally, here are some practical questions to ask yourself when determining whether to run a meso-analysis:
- Am I considering this only to smooth over some troublesome result I don’t like? Would I do this if all my studies were significant?
- Are there important methodological differences between the studies that might suggest it is inappropriate to aggregate across them all? (Or, at least, should I be using a random effects model to account for these differences?)
- Have I calculated Cochran’s Q to assess the heterogeneity of these effects?
- How am I planning to use the results? Am I trying to generalize to the general population? Plan out future studies? Estimate the true effect size? Would the results I get from this meso-analysis actually serve that purpose?
To those of you who sit in an audience listening to someone talk about their meso-analysis, I would encourage you to bring a healthy dose of skepticism. Ask yourself, “Is this something they would have done anyway if that last study they presented had been significant? Are their studies drastically different in methodology? Are they sampling from entirely different populations?” If they demonstrate that they’ve done their due diligence in considering these issues, and show you that the effects are homogeneous, then great! Offer them kudos for being a good scientist. But don’t accept a meso-analysis at face value. Like the magician’s art of misdirection, it can be easy to hide a glaring methodological issue with a fancy meta-analysis showing an amazingly small p-value. This doesn’t mean that all presenters are hiding things, but it does mean that we must interpret a meso-analysis with caution.
I hope that this article will help psychologists and audience members to understand what to look for when evaluating a meso-analysis, so they can be the obnoxious audience member who raises their hand to ask, “What was the Cochran’s Q on that analysis?” Okay, on second thought, don’t be that person. But do encourage your fellow researchers to be mindful of the perils and pitfalls of meso-analyses, so they can use and interpret them properly.
- Let’s also keep in mind that part of the problem has to do with the aspersion people cast on “marginally significant” results. If people didn’t mind p = .06, then researchers wouldn’t need to reach for the meso-analysis when they get p = .06. [↩]
- Another solution that can be used is a random effects model. The decision of whether to use a fixed or random effects model is beyond the scope of this article, but see Borenstein, Hedges, Higgins, and Rothstein (2009); or Hedges and Olkin (1985) for information on when each is appropriate. [↩]