
I see that in one of the twenty tests they run, $p < 0.05$, so they wrongly conclude that the result of that one test is significant ($0.05 = 1/20$).

[xkcd comic 882, "Significant" (the jelly bean comic)]

  • Title: Significant
  • Hover text: "'So, uh, we did the green study again and got no link. It was probably a--' 'RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!'"

  • 95% confidence will mean that on average in 5% of experiments (one out of 20) we'll get an opposite conclusion. Which is exactly what has happened here. I.e., if you also make the same experiment with orange jelly beans 1000 times, ~ 50 of those will give a positive result. :) – sashkello Feb 27 '14 at 00:37
  • Is this a candidate for CW, perhaps? – Glen_b Feb 27 '14 at 00:47
  • This cartoon also appeared in [What is your favorite "data analysis" cartoon?](http://stats.stackexchange.com/questions/423/what-is-your-favorite-data-analysis-cartoon) – Nick Stauner Feb 27 '14 at 00:58
  • Who said it's funny? – whuber Feb 27 '14 at 01:12
  • [Myself, besides the other 59 voters here](http://stats.stackexchange.com/a/9254/32036), so it's $\text{funniness}>0$ at least! ;-P (This comment definitely doesn't represent my opinion of XKCD in general.) Unless there's such a thing as $\text{funniness}<0$, in which case we'd probably want better data. Nobody's downvoted it yet though, FWIW, as an available operationalization of "negative funniness"...and with that, I've probably taken this comment into negative funniness territory $(p<.05)$. – Nick Stauner Feb 27 '14 at 01:24
  • See also [this discussion on explainxkcd.com](http://www.explainxkcd.com/wiki/index.php?title=882:_Significant) – Jeromy Anglim Feb 27 '14 at 03:21
  • @Glen_b, the favorite data analysis cartoon thread is appropriately CW; however, I see no reason this one should be. 'Why funny' aside, the question asks for an understanding of the statistical point at issue in the cartoon, which has an answer & should be on-topic & not-CW (& which I think you handled well below). – gung - Reinstate Monica Feb 27 '14 at 03:31
  • @gung yes, I think I agree. I was worried about the opinion involved in the humor part, but as I came to answer it, it became less of a concern. – Glen_b Feb 27 '14 at 04:12
  • @Nick I was implicitly trusting this community to recognize a tongue-in-cheek comment. – whuber Feb 27 '14 at 17:15
  • Of course; despite appearances, I recognized it too :) As a somewhat self-indulgent psychologist with professional interest in operationalizing abstract constructs, I tend to brandish [Maslow's hammer](http://en.wiktionary.org/wiki/if_all_you_have_is_a_hammer,_everything_looks_like_a_nail) overzealously. – Nick Stauner Feb 27 '14 at 18:07
  • If someone has to explain the joke it makes the joke not funny anymore. – MDMoore313 Feb 27 '14 at 20:42
  • @MDMoore313 - hardly. In college, I had a classmate who was brilliant, and had a unique sense of humor. I set a goal to make a joke in class only he would understand. I found my chance, and was successful. He laughed, no one else did, and I explained it all after class. I forget the joke, but as long as one person found it funny. – JTP - Apologise to Monica Feb 28 '14 at 15:29
  • @JoeTaxpayer after you explained the joke, did the rest of the class find it funny? Did you still find it *as* funny? – MDMoore313 Feb 28 '14 at 15:32
  • @MDMoore313 - They understood how the genius found it funny. Humor is one of those things that's not always universal. I was born/raised in NYC. When I saw a Woody Allen movie in Texas with a NYC friend, we laughed at times no one else did, and vice versa. Some overlap, but not 100%. There are jokes about parking and/or traffic that aren't universal. – JTP - Apologise to Monica Feb 28 '14 at 15:50

3 Answers

Score: 71

Humor is a very personal thing - some people will find it amusing, but it may not be funny to everyone - and attempts to explain what makes something funny often fail to convey the funny, even if they explain the underlying point. Indeed, not all xkcd comics are even intended to be funny. Many do, however, make important points in a way that's thought-provoking, and at least sometimes they're amusing while doing that. (I personally find it funny, but I find it hard to explain clearly what, exactly, makes it funny to me. I think it's partly the recognition of the way that a doubtful, or even dubious, result turns into a media circus (on which see also this PhD comic), and perhaps partly the recognition of the way some research may actually be done - if usually not consciously.)

However, you can appreciate the point whether or not it tickles your funny bone.

The point is about doing multiple hypothesis tests at some moderate significance level like 5%, and then publicizing only the one that came out significant. Of course, if you do 20 such tests when there's really nothing of any importance going on, the expected number of those tests that give a significant result is 1. As a rough in-head approximation for $n$ tests at significance level $\frac{1}{n}$, there's roughly a 37% chance ($\approx e^{-1}$) of no significant result, roughly a 37% chance of exactly one, and roughly a 26% chance of more than one (I just checked the exact answers; they're close enough to that).
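
A minimal sketch of that calculation in Python (scipy.stats is just one convenient way to get the binomial probabilities; nothing in the answer depends on it):

```python
from scipy.stats import binom

n, alpha = 20, 0.05  # 20 independent tests, each at the 5% level, all nulls true

p_none = binom.pmf(0, n, alpha)       # no test comes out "significant"
p_one = binom.pmf(1, n, alpha)        # exactly one false positive
p_more = 1 - p_none - p_one           # two or more false positives

print(f"P(none)        = {p_none:.3f}")   # ~0.358
print(f"P(exactly one) = {p_one:.3f}")    # ~0.377
print(f"P(more than 1) = {p_more:.3f}")   # ~0.264
print(f"expected false positives = {n * alpha:.1f}")  # 1.0
```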

In the comic, Randall depicted 20 tests, so this is no doubt his point (that you expect to get one significant even when there's nothing going on). The fictional newspaper article even emphasizes the problem with the subhead "Only 5% chance of coincidence!". (If the one test that ended up in the papers was the only one done, that might be the case.)


Of course, there's also the subtler issue that an individual researcher may behave much more reasonably, but the problem of rampant publicizing of false positives still occurs. Let's say that these researchers only do 5 tests, each at the 1% level, so their overall chance of discovering a bogus result like that is only about five percent.

So far so good. But now imagine there are 20 such research groups, each testing whichever random subset of colors they think they have reason to try. Or 100 research groups... what chance of a headline like the one in the comic now?
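
A quick back-of-the-envelope version of that scenario (a sketch only; the group counts of 20 and 100 simply mirror the numbers above):

```python
# One careful group runs 5 tests, each at the 1% level, on effects that are
# all actually null; then many such groups independently chase similar questions.

alpha = 0.01
tests_per_group = 5

p_group = 1 - (1 - alpha) ** tests_per_group   # one group's chance of a bogus "find"
print(f"one group:  {p_group:.3f}")            # ~0.049, about five percent

for n_groups in (20, 100):
    p_headline = 1 - (1 - p_group) ** n_groups  # at least one group gets a headline
    print(f"{n_groups:3d} groups: {p_headline:.3f}")
# 20 groups  -> ~0.63
# 100 groups -> ~0.99
```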

More broadly, the comic may also be referencing publication bias. If only significant results are trumpeted, we won't hear about the dozens of groups that found nothing for green jelly beans, only about the one that did.

Indeed, that's one of the major points being made in this article, which has been in the news in the last few months (e.g. here, even though it's a 2005 article).

A response to that article emphasizes the need for replication. Note that if there were to be several replications of the study that was published, the "Green jellybeans linked to acne" result would be very unlikely to stand.

(And indeed, the hover text for the comic makes a clever reference to the same point.)

– Glen_b
Score: 12

The effect of hypothesis testing on the decision to publish was described more than fifty years ago in the 1959 JASA paper Publication Decisions and Their Possible Effects on Inferences Drawn from Tests of Significance - or Vice Versa (sorry for the paywall).

Overview of the paper

The paper points out evidence that published results of scientific papers are not a representative sample of results from all studies. The author reviewed papers published in four major psychology journals. 97% of the reviewed papers reported statistically significant outcomes for their major scientific hypotheses.

The author advances a possible explanation for this observation: research which yields nonsignificant results is not published. Such research, being unknown to other investigators, may be repeated independently until eventually, by chance, a significant result occurs (a Type 1 error) and is published. This opens the door to the possibility that the published scientific literature includes an over-representation of incorrect results arising from Type 1 errors in statistical significance tests - exactly the scenario that the original XKCD comic was poking fun at.
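
A small simulation sketch of that mechanism (not from the 1959 paper; it assumes the p-value of a study of a true null is uniform on (0, 1), and that a question keeps being re-studied until something "significant" appears):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.05
n_questions = 10_000   # distinct research questions, all with true nulls

published = []
for _ in range(n_questions):
    # Re-run the study until chance delivers p < .05, then "publish" that one.
    while True:
        p = rng.uniform()        # p-value of one study of a true null
        if p < alpha:
            published.append(p)
            break

# Every published result is a Type 1 error, yet each one reports p < .05.
print(len(published), "published findings, all false positives")
print("largest published p-value:", round(max(published), 4))
```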

This general observation has been subsequently verified and re-discovered many times in the intervening years. I believe that the 1959 JASA paper was the first to advance the hypothesis. The author of that paper was my PhD supervisor; we updated his 1959 paper 35 years later and reached the same conclusions: Publication Decisions Revisited: The Effect of the Outcome of Statistical Tests on the Decision to Publish and Vice Versa, The American Statistician, Vol. 49, No. 1, Feb 1995.

– Wilf Rosenbaum
Score: -2

What people overlook is that the actual p-value for the green jelly bean case is not .05 but around .64; only the pretend (nominal) p-value is .05. There's a difference between actual and pretend p-values: the probability that at least one of 20 tests reaches the nominal level, even if all the nulls are true, is NOT .05 but $1 - 0.95^{20} \approx .64$.

On the other hand, if you appraise evidence by looking at comparative likelihoods (the most popular view aside from the error-statistical one, within which p-values reside), you WILL say there's evidence for H: green jelly beans are genuinely correlated with acne. That's because $P(x;\,\text{no effect}) < P(x;\,H)$. The left side is < .05, whereas the right side is fairly high: if green jelly beans did cause acne, then finding the observed association would be probable. Likelihoods alone fail to pick up on error probabilities because they condition on the actual data attained; the appraisal is no different than if this one test of green jelly beans and acne had been the only one.

So although this cartoon is often seen as making fun of p-values, the very thing that's funny about it demonstrates why we need to consider the overall error probability (as non-pretend p-values do) and not merely likelihoods. Bayesian inference is also conditioned on the outcome, ignoring error probabilities. The only way for a Bayesian to avoid finding evidence for H would be to have a low prior on H. But we would adjust the p-value no matter what the subject matter, and without relying on priors, because of the hunting procedure used to find the hypothesis to test. Even if the H that was hunted was believable, it's still a lousy test. Errorstatistics.com
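
One way to unpack the .64 figure (a sketch of a Šidák-style family-wise adjustment; reading it as that particular adjustment is an interpretation, not something the answer names):

```python
# Nominal p-value of the one "significant" colour, and the family-wise
# ("actual") probability of at least one such hit among 20 colours hunted through.

nominal_p = 0.05
m = 20                                      # number of colours tested

sidak_p = 1 - (1 - nominal_p) ** m          # ~0.64, matching the figure above
bonferroni_p = min(1.0, nominal_p * m)      # cruder upper bound: 1.00

print(f"nominal p      : {nominal_p:.2f}")
print(f"Sidak-adjusted : {sidak_p:.2f}")
print(f"Bonferroni     : {bonferroni_p:.2f}")
```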

  • It is very hard to tell exactly what this post is trying to say. Let me focus on one part of it, hoping that a clarification might reveal the meaning of the rest: exactly what do you mean by "the overall error probability"? – whuber Aug 21 '14 at 19:41
  • @whuber I believe that the post is referring to the multiple comparisons problem. – Matt Aug 26 '14 at 21:03