Minding the Meta-Analysis

A little over a week ago, I had the opportunity to go to yet another meeting of the Society for Personality and Social Psychology (SPSP). It’s always a great time, with plenty of very interesting talks and posters! It’s also always a pleasure to travel from the harsh Canadian winter to someplace warm to talk about psychology. Walking around in a t-shirt in February is not a common experience for me.

Perhaps this is just my perception, but over the past few years there seems to be a growing trend toward people doing meta-analyses of the studies they present. I’m sure you know what I’m talking about: they present three studies, and maybe the last one has only a marginal effect, but then they say, “But when you meta-analyze over all three studies, the overall effect is highly significant.” This year I saw at least a couple of people do this in their talks, and I’ve seen it before at previous conferences and in other contexts. So I want to talk just a little bit about these informal mini-meta-analyses—to distinguish them from more formal meta-analyses, I’m going to call them “meso-analyses”—and discuss some of the caveats of this technique.

A Wave of the Hand

One thing that worries me when people mention the result of a meso-analysis is whether they would have run the same analysis had all their effects been p < .05. It’s worrisome because it sometimes (often?) seems like it’s done merely to bolster an effect that wasn’t quite as strong as the researchers had hoped. That, in and of itself, isn’t necessarily a bad thing, but when people only reach for a meso-analysis to bolster their argument, it becomes more of a rhetorical flourish than a statistical technique.

Of course, the results are what they are, but if you only run the analysis to reassure people that p = .07 in one study is okay, how much information do we really gain from it? Since people will only mention the meso-analysis if it works in their favour, it seems more like a case of an additional researcher degree of freedom. (In other words, if Study 3 doesn’t work well but the meso-analysis is significant, mention the meso-analysis; otherwise, don’t mention it, and drop Study 3 instead.) For meso-analyses to be taken seriously, they would need to become standard practice for all programs of research, rather than something one does to cover up slightly-tarnished results.1

Valuating Variance

But is it something that should become standard practice in the first place? Just how much information does it actually give us? Well, it’s complicated. Meta-analyses are most convincing when the effects they aggregate are as homogeneous as possible: the same manipulations/measures, the same populations, the same phenomenon. To the extent that the inputs differ from this ideal, things can start to get a little hairy. As an example, consider a set of three studies, where two are strong, hit-participants-over-the-head manipulations leading to d = .8, and the third is a very subtle, subliminal prime manipulation leading to d = .03. Theoretically, we may have reason to argue that these manipulations are influencing the same psychological phenomenon, but to what extent is it really useful to aggregate across these three studies? The meso-analysis might reveal an overall effect of d = .54, but that’s only because we had two strong manipulations and one weak one. If we added another subtle prime for Study 4, this estimate would change. Do you think the “true underlying effect size” should change as a result of how many (or what kind of) studies we decided to run? Of course not; the effect size here is largely a product of how many strong vs. weak manipulations we choose to run.
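
To make the arithmetic concrete, here is a toy sketch in Python using the hypothetical effect sizes above. The simplifying assumption that every study has the same sample size (so a fixed-effect, inverse-variance-weighted estimate roughly collapses to a simple mean) is mine:

    # Toy illustration (hypothetical numbers): with equal sample sizes,
    # a fixed-effect (inverse-variance weighted) estimate reduces to
    # roughly a simple mean, so the pooled d tracks the mix of
    # strong vs. weak studies we happened to run.

    def pooled_d(effects):
        """Unweighted mean, assuming roughly equal study variances."""
        return sum(effects) / len(effects)

    three_studies = [0.80, 0.80, 0.03]     # two strong manipulations, one subtle prime
    four_studies = three_studies + [0.03]  # add a second subtle prime as Study 4

    print(f"Pooled d over 3 studies: {pooled_d(three_studies):.2f}")  # about 0.54
    print(f"Pooled d over 4 studies: {pooled_d(four_studies):.2f}")   # drops toward 0.4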

Of course, a high-quality meta-analysis takes these sorts of issues into account. Calculating Cochran’s Q or an I² statistic will show you the extent of heterogeneity in a sample, and using subgroups or meta-regression can help to assess the characteristics of studies that account for heterogeneity. In this case, it would make the most sense to calculate the effect size for the two strong manipulations as one group, and the two weak manipulations as a separate group.2 The problem is that I’ve never heard anyone mention this in a meso-analysis. Maybe they’ve checked and have adequate homogeneity to make the aggregate effect size worthwhile, but just didn’t mention it in their talk. Maybe. But I have seen cases where it was obviously applied inappropriately. And given what I’ve suggested above, that it’s often used more as a rhetorical than a statistical tool, I suspect that in many cases researchers have not bothered to do their due diligence. And that’s bad.
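
For the curious, here is a minimal sketch of that heterogeneity check, applying the standard formulas for Q and I² to the hypothetical example above; the per-cell sample size of 30 is an invented number of my own:

    # A minimal sketch of the heterogeneity check. Cochran's Q is a weighted
    # sum of squared deviations from the pooled estimate; I-squared expresses
    # the share of variability beyond what chance alone would produce.
    # Sample sizes here are made up.

    def var_d(d, n1, n2):
        """Large-sample approximation to the sampling variance of Cohen's d."""
        return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

    effects = [0.80, 0.80, 0.03]                     # the example above
    variances = [var_d(d, 30, 30) for d in effects]  # hypothetical n = 30 per cell

    weights = [1 / v for v in variances]
    d_bar = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - d_bar) ** 2 for w, d in zip(weights, effects))
    df = len(effects) - 1
    i_squared = max(0.0, (q - df) / q) * 100

    print(f"Q = {q:.2f} on {df} df; I^2 = {i_squared:.0f}%")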

The Inconvenience of Convenience

A final note about meso-analyses is that they are particularly susceptible to issues of sampling. Obviously, if I have 3 (or 300) studies on the effects of watching violent TV on the aggression of adolescent boys, I can’t aggregate the results of these studies and then apply the effect size estimate to girls as well. That much is clear. But given that most psychological research does not use random sampling (understatement of the year), this can also make it very dicey to talk about a meta-analysis providing an estimate of the “true effect size in the population”.

While full meta-analyses can suffer from this too, they are at least often dealing with a greater number of studies, from multiple researchers in multiple labs; in a meso-analysis, problems with sampling can be amplified. Not only are meso-analyses typically done on one researcher’s work, but they combine multiple convenience samples, meaning that it’s questionable whether you can even make claims about the “true effect size” in that researcher’s university population! A meso-analysis might be useful for an individual researcher to answer the question, “Given these four studies I have already run, what effect size am I likely to get if I run a fifth one?” But it offers limited usefulness for grand generalizations to the world outside that researcher’s laboratory.
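
To make that “fifth study” question concrete, here is a rough sketch of how one could compute an approximate 95% prediction interval for a new study under a random-effects model. Every number below (the four effect sizes, the sample sizes, and the choice of the DerSimonian-Laird estimator) is my own invention for illustration:

    # Hypothetical sketch: an approximate 95% prediction interval for the
    # effect in a new (fifth) study, given four completed ones. Uses the
    # DerSimonian-Laird estimate of between-study variance (tau^2).

    def var_d(d, n1, n2):
        """Large-sample approximation to the sampling variance of Cohen's d."""
        return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

    effects = [0.60, 0.20, 0.55, 0.10]               # four invented studies
    variances = [var_d(d, 40, 40) for d in effects]  # n = 40 per cell, invented

    weights = [1 / v for v in variances]
    d_fixed = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - d_fixed) ** 2 for w, d in zip(weights, effects))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)    # DerSimonian-Laird tau^2

    re_weights = [1 / (v + tau2) for v in variances]
    d_pooled = sum(w * d for w, d in zip(re_weights, effects)) / sum(re_weights)
    var_pooled = 1 / sum(re_weights)

    # Prediction interval: d_pooled +/- t * sqrt(tau^2 + var(d_pooled)),
    # with df = k - 2. t(0.975, df=2) is hard-coded to avoid dependencies.
    t_crit = 4.303
    half = t_crit * (tau2 + var_pooled) ** 0.5
    print(f"A fifth study's d would plausibly land in "
          f"[{d_pooled - half:.2f}, {d_pooled + half:.2f}]")

With only a handful of studies, this interval tends to be embarrassingly wide, which is rather the point: the meso-analysis can inform your own planning, but it pins down far less than the pooled estimate alone suggests.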


Conclusion

Given these issues, is there any real reason to do a meso-analysis? In some cases, sure. Just like a larger meta-analysis, the quality of the studies you put in will determine the quality of the effect size estimate you get out. If you have studies that are very similar in nature, there’s nothing wrong with doing a little meso-analysis to get a better estimate of the underlying effect size. It’s just important to interpret the estimate in light of the studies you put into it. Aggregating over a small number of studies is not going to average out any biases of convenience sampling, choice of measures, or researcher-specific idiosyncrasies. But a meso-analysis of your past research might help you plan out your next study. It might even help others be more convinced by your findings, provided you aren’t grossly misusing the technique to average over different effects to make a marginal effect look better. Considering sampling strategies, heterogeneity, and methodological differences is important for any meta-analytic technique, but it’s especially important if you’re aggregating over a small number of studies. And if you have a total of three studies, calling it a “meta-analysis” might lend the result more credibility than it deserves.

One reasonable use of a meso-analysis in a conference talk might be to allow you to present a subsample of representative studies: “We’ve done five studies on this with similar methods; I only have time today to highlight two studies, but meta-analyzing across all five shows a strong overall effect, and these two studies are representative of that effect.” That can help indicate to your audience that you aren’t cherry-picking the best-looking studies from the overall package.

More generally, here are some practical questions to ask yourself when determining whether to run a meso-analysis:

  1. Am I considering this only to smooth over some troublesome result I don’t like? Would I do this if all my studies were significant?
  2. Are there important methodological differences between the studies that might suggest it is inappropriate to aggregate across them all? (Or, at least, should I be using a random effects model to account for these differences? See the sketch after this list.)
  3. Have I calculated Cochran’s Q to assess the heterogeneity of these effects?
  4. How am I planning to use the results? Am I trying to generalize to the general population? Plan out future studies? Estimate the true effect size? Would the results I get from this meso-analysis actually serve that purpose?
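
On question 2, here is a rough sketch of what the fixed- versus random-effects comparison might look like for the strong/weak example from earlier. The helper function is repeated so the snippet runs on its own, and the effect sizes and sample sizes are again hypothetical:

    # Sketch for question 2: fixed-effect vs. DerSimonian-Laird
    # random-effects pooling on the strong/weak example.
    # All numbers are hypothetical.

    def var_d(d, n1, n2):
        """Large-sample approximation to the sampling variance of Cohen's d."""
        return (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

    effects = [0.80, 0.80, 0.03]
    variances = [var_d(d, 30, 30) for d in effects]
    weights = [1 / v for v in variances]

    d_fixed = sum(w * d for w, d in zip(weights, effects)) / sum(weights)

    q = sum(w * (d - d_fixed) ** 2 for w, d in zip(weights, effects))
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)    # between-study variance

    re_weights = [1 / (v + tau2) for v in variances]
    d_random = sum(w * d for w, d in zip(re_weights, effects)) / sum(re_weights)
    se_random = (1 / sum(re_weights)) ** 0.5

    print(f"Fixed: d = {d_fixed:.2f}; random: d = {d_random:.2f} "
          f"(tau^2 = {tau2:.2f}, SE = {se_random:.2f})")

Note that a random effects model mostly widens the uncertainty around the pooled estimate; if the studies are tapping genuinely different effects, no weighting scheme will rescue the aggregation.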

To those of you who sit in an audience listening to someone talk about their meso-analysis, I would encourage you to bring a healthy dose of skepticism. Ask yourself, “Is this something they would have done anyway if that last study they presented was significant? Are their studies drastically different in methodology? Are they sampling from entirely different populations?” If they demonstrate that they’ve done their due diligence in considering these issues, and show you that the effects are homogeneous, then great! Offer them kudos for being a good scientist. But don’t accept a meso-analysis at face value. Like the magician’s art of misdirection, it can be easy to hide a glaring methodological issue behind a fancy meta-analysis showing an amazingly small p-value. This doesn’t mean that all presenters are hiding things, but it does mean that we must interpret a meso-analysis with caution.

I hope that this article helps psychologists and audience members understand what to look for when evaluating a meso-analysis, so they can be the obnoxious audience member who raises their hand to ask, “What was the Cochran’s Q on that analysis?” Okay, on second thought, don’t be that person. But do encourage your fellow researchers to be mindful of the perils and pitfalls of meso-analyses, so they can use and interpret them properly.

Notes:

  1. Let’s also keep in mind that part of the problem has to do with the aspersions people cast on “marginally significant” results. If people didn’t mind p = .06, then researchers wouldn’t need to reach for the meso-analysis when they get p = .06.
  2. Another solution is to use a random effects model. The decision of whether to use a fixed or random effects model is beyond the scope of this article, but see Borenstein, Hedges, Higgins, and Rothstein (2009) or Hedges and Olkin (1985) for information on when each is appropriate.

2 responses to “Minding the Meta-Analysis”

Judith A. Hall

I fear you are warning people off of doing very valuable summaries of their own studies. There is no minimum number of studies that makes for a suitable meta-analysis; even comparing or combining just two effect sizes is a reasonable meta-analysis, assuming one has good theoretical reasons for putting them together or comparing them. One just has to remember that with fewer studies, inferences must be more guarded. Also, people summarizing their own studies would not be likely to be claiming to have uncovered the ‘true’ effect size; they would have the more modest goal of summarizing studies from their own lab, with all of the limitations and methodological biases/gaps this entails. Actually chances are their studies are rather homogeneous in design, which is something you advocated for.

In fact, however, I don’t agree with the idea that ideally meta-analysis should be done on methodologically homogeneous studies. Especially if one uses random effects models, there is great value in having lots of study designs and populations represented–because then one can make a strong inference about generality. Of course, the meta-analyst must have good theoretically grounded reasons for putting a set of studies together in a meta-analysis; it is indeed possible to mix fruit and rocks–but that might make little sense. However, mixing apples and oranges (as fair representatives of ‘fruit’) is perfectly reasonable, and one can and should compare the apples to the oranges to see how they differ. Of course, meta-analysts routinely do moderator analyses so I don’t think they need to be told this.

Finally, I think you are yourself mixed up (or the blog post makes it sound this way) on effect size versus p-values. Whether the p-value is .05 or .07 matters little to a meta-analyst, but you seem to be as hung up on p-values as the rest of the people who are obsessing about p-hacking. The effect size associated with a researcher’s p = .07 may actually be bigger than the effect sizes in her other studies that did reach .05. It’s the effect size that matters, and that’s what the meta-analysis tells us about. Your advice will confuse people.

Sincerely,
Judith A. Hall
University Distinguished Professor of Psychology
Northeastern University

Jeff

Hello Judith,

Thank you for your comment! I do appreciate the points that you’ve brought up. In fact, I meant to at least touch on random effects as useful in some cases, but forgot to do that, so I’ve edited the article to include that.

I hope it did not come across as though I am advocating never doing meta-analyses, or even “meso-analyses”, or that they are never helpful. Certainly, meta-analyses can be incredibly useful, and summarizing one’s work can be useful as well. Where I take the most issue with this is where it seems to be done either a) selectively, such that if the meso-analysis did not turn out favourably it would be silently discarded; or b) haphazardly, not taking into account the issues that a proper meta-analyst knows are important. Perhaps it is my anecdotal experience, but when I have seen these “study summaries” done (as opposed to a formal meta-analysis), they have not included the methodological rigor that would allow for proper interpretation. So my article was simply trying to highlight the importance of considering issues such as homogeneity, moderators, etc. even for these informal meta-analyses.

With regard to the emphasis on p-values: I certainly understand that meta-analyses are about effect sizes. Again, my critique here is primarily with regard to situations I have encountered where, after presenting a study in a talk where an effect is marginal or not quite the “right” pattern, the presenter tries to reassure the audience by mentioning that a meta-analysis shows the overall effect to be significant. It’s this particular use of a meta-analysis that rubs me the wrong way, as it sounds like it would to you as well. I have no problem if people want to summarize their research and say “Hey, here’s the best estimate of the effect size.” If that is done routinely, not just when the result is favourable, that’s great. More emphasis on effect sizes instead of arbitrary p-value thresholds is great. But if it is silently discarded when the overall effect isn’t so convenient for the researcher, it makes it difficult to know to what extent we can trust it. It’s roughly equivalent to the file-drawer problem, where studies are dropped if they aren’t so convenient. Perhaps it is the combination of the file-drawer problem and meso-analyses that concerns me.

I’m sorry if that did not come across clearly, and I would be happy to hear any suggestions you might have for how I could make it more clear. But I do appreciate your contribution to the discussion, so thank you!

Cheers,
Jeff
