13

"How large must a class be to make the probability of finding two people with the same birthday at least 50%?"

I have 360 friends on Facebook, and, as expected, the distribution of their birthdays is not uniform at all. On one day, 9 of my friends share a birthday. (Nine months after big holidays and Valentine's Day seem to be big ones, lol.) So, given that some days are more likely for a birthday, I'm assuming the number 23 is an upper bound.

Is there a better estimate for this problem?

Tim
Adam
  • 3
    A sample of 360 persons does not make a large sample for the distribution of birthdays over 365 days of the year... You certainly cannot check for uniformity over such a small sample. – Xi'an Jan 31 '12 at 11:00
  • A person has a birthday, what are the odds that a second person *doesn't* share the same birthday? `364/365`, what are the odds that a third person *doesn't* share either birthday? `(364/365) * (363/365)`. Expand on this until you've got a probability `< 50%`. It would mean the odds that *no one* has the same birthday, which would in turn mean that the odds for at least two to share a birthday would be `> 50%`. – zzzzBov Jan 31 '12 at 15:07
  • 8
    Are we to assume you have *random* friends? – James Jan 31 '12 at 15:45
  • 1
    @zzzzBov - you don't understand what the OP is asking for. This is the approach where we assume each birthday is equally likely, each with chance $\frac{1}{365}$ of being yours. The OP is asking for what the estimate would be when say being born on Jan 1 is not as likely as being born on Feb 15. – probabilityislogic Feb 01 '12 at 01:20
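The equal-likelihood calculation described in the comments above can be sketched in a few lines of R. This is only the uniform baseline (365 equally likely days, leap day ignored), not the non-uniform case the question asks about; the variable names `n`, `p_no_match` and `p_shared` are chosen here for illustration.

```r
# Probability that at least two of n people share a birthday,
# assuming 365 equally likely days (no leap day).
n <- 1:50
# (365/365) * (364/365) * ... : chance that the first n birthdays are all distinct
p_no_match <- cumprod((365 - n + 1) / 365)
p_shared <- 1 - p_no_match

min(n[p_shared >= 0.5])     # smallest class size with at least a 50% chance: 23
round(p_shared[22:23], 4)   # 0.4757 0.5073
```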

1 Answer

18

Luckily, someone has posted some genuine birthday data, along with a bit of discussion of a related question (is the distribution uniform?). We can use this data and resampling to show that the answer to your question is apparently 23, the same as the theoretical answer.

> x <- read.table("bdata.txt", header=TRUE)
> birthday <- data.frame(date=as.factor(x$date), count=x$count)
> summary(birthday) 
      date         count     
 101    :  1   Min.   : 325  
 102    :  1   1st Qu.:1266  
 103    :  1   Median :1310  
 104    :  1   Mean   :1314  
 105    :  1   3rd Qu.:1362  
 106    :  1   Max.   :1559  
 (Other):360                 
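> results <- rep(0,50)
> reps <- 2000 # big number needed as there is some instability otherwise
> for (i in 1:50) # i is the class size
+ {
+ count <- 0
+ for (j in 1:reps)
+ {
+ # draw i birthdays, weighting each date by its observed count
+ samp <- sample(birthday$date, i, replace=TRUE, prob=birthday$count)
+ # add 1 if at least two of the sampled birthdays coincide
+ count <- count + 1*(max(table(samp))>1)
+ }
+ results[i] <- count/reps # estimated probability of a shared birthday
+ }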
> results
 [1] 0.0000 0.0045 0.0095 0.0220 0.0210 0.0395 0.0570 0.0835 0.0890 0.1165
[11] 0.1480 0.1770 0.1955 0.2265 0.2490 0.2735 0.3105 0.3350 0.3910 0.4165
[21] 0.4690 0.4560 0.5210 0.5310 0.5745 0.5975 0.6240 0.6430 0.6950 0.7015
[31] 0.7285 0.7510 0.7690 0.8025 0.8225 0.8280 0.8525 0.8645 0.8685 0.8830
[41] 0.8965 0.9020 0.9240 0.9435 0.9350 0.9465 0.9545 0.9655 0.9600 0.9665
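As a cross-check on the simulation noise, the match probability for any fixed set of day weights can also be computed exactly. The sketch below is not part of the original answer: `p_match_exact` is a hypothetical helper, and the weights `w` are made-up seasonal weights standing in for the real counts in `bdata.txt`. It relies on the identity P(all n birthdays distinct) = n! · e_n(p), where e_n is the n-th elementary symmetric polynomial of the day probabilities.

```r
# Exact probability that n people, whose birthdays are drawn independently
# from day weights p, include at least one shared birthday.
p_match_exact <- function(p, n) {
  p <- p / sum(p)           # normalise counts or weights to probabilities
  e <- c(1, rep(0, n))      # e[k + 1] holds e_k over the days processed so far
  for (pd in p) {
    # adding one day: e_k <- e_k + pd * e_{k-1}
    # (the right-hand side is evaluated with the old values)
    e[2:(n + 1)] <- e[2:(n + 1)] + pd * e[1:n]
  }
  1 - factorial(n) * e[n + 1]
}

# Hypothetical, mildly seasonal weights (NOT the real data)
w <- 1 + 0.1 * sin(2 * pi * (1:365) / 365)

p_match_exact(rep(1, 365), 23)  # uniform case, about 0.5073
p_match_exact(w, 23)            # non-uniform case, slightly larger
```

With the real data, passing `birthday$count` as `p` should give the exact counterpart of the `results` vector above, one class size at a time.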
Peter Ellis
  • 8
    Indeed, one can show via [Schur convexity](http://en.wikipedia.org/wiki/Schur-convex_function), that for *any* nonuniform distribution of birthdays, the probability of a match is at least as great as in the uniform case. This is **Exercise 13.7** of J. Michael Steele, *[The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities](http://www.amazon.com/Cauchy-Schwarz-Master-Class-Introduction-Mathematical/dp/052154677X)*, Cambridge University Press, 2004, **pg. 206**. – cardinal Jan 31 '12 at 13:16
  • @cardinal: Terrific book!!! – Xi'an Jan 31 '12 at 13:31
  • 2
    @Xi'an: Indeed. Now, if only I knew someone who did book reviews for a high-quality, high-readership stats magazine, I'd suggest they review it to give it higher visibility to statisticians...but where to find such a person... – cardinal Jan 31 '12 at 13:51
  • @cardinal: yes, I wonder too...! Thanks for the suggestion, I will write to CUP (I could have grabbed the book there last week!) – Xi'an Jan 31 '12 at 13:54
  • 3
    (For those who may be wondering about my immediately preceding comment, it references the fact that @Xi'an is the newly appointed [book reviewer for *Chance*](http://chance.amstat.org/2011/11/book-reviews/).) – cardinal Jan 31 '12 at 19:36
  • 2
    @Xi'an, check this out and see what you think: `table(replicate(10^5, max(tabulate(sample(1:365,360,rep=TRUE)))))`. – whuber Jan 31 '12 at 20:12
  • @whuber, I now fully agree with this distribution: `extreme=rep(0,360); for (t in 1:10^5){ i=max(diff((1:360)[(!duplicated(sort(sample(1:365,360,rep=TRUE))))])); extreme[i]=extreme[i]+1 }; extreme=extreme/10^5` in a less elegant coding. Congrats! And thanks. – Xi'an Jan 31 '12 at 20:36
  • 3
    It's probably not clear, except to R cognoscenti, that the code in previous comments by @Xi'an and myself simulates the OP's situation. Running it establishes that the chance of 9 or more people sharing a birthday, out of 360 *randomly* chosen from a uniformly distributed population, is only around 40 out of 100,000. The most likely value for the maximum number of shared birthdays is 5. – whuber Jan 31 '12 at 20:41
  • So, people are not any more likely to be born at certain times of the year? People are equally likely to be having sex and producing babies at all times? Tax time doesn't get people down, while wedding seasons and holiday seasons have no effect? The days 9 months after Christmas, New Year's, and Valentine's Day show the same likelihood of people being born as other days? If 99% of people were born on the same day, the answer to this question would be 2. I don't see how the distribution isn't important. – Adam Feb 01 '12 at 02:12
  • 2
    @Adam - I don't think anyone said the distribution isn't important. As a matter of fact, I'm ready to believe it's not exactly uniform (although I doubt the holiday or tax effects are as significant as some people think in terms of sexual behaviour for reproducing couples). However, my answer shows that, with real data, it's close enough to uniform to come up with the same answer as if it were uniform. – Peter Ellis Feb 01 '12 at 02:50
  • @Adam: The distribution *is* important, as Peter also remarks. For more evidence of this, reread the first comment under the answer. :) – cardinal Feb 01 '12 at 03:20
  • @cardinal, I read it, I just misunderstood it. – Adam Feb 01 '12 at 05:17
  • @Adam: Well, the wording was a little awkward. Sorry about that. – cardinal Feb 01 '12 at 14:43
  • @whuber, I posted a 'barplot' representation of the distribution of the number of people sharing a birthdate [on my blog](http://xianblog.wordpress.com/2012/02/01/the-birthday-problem) (as a conclusion to our discussion here-and below). – Xi'an Feb 01 '12 at 18:10
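The Schur-convexity argument cited in the first comment under this answer can be summarised as follows. This is a sketch only; the notation ($m$ days with probabilities $p_1, \dots, p_m$, a class of $n$ people, and $e_n$ for the $n$-th elementary symmetric polynomial) is introduced here and is not taken from the thread. Assuming birthdays are independent draws,

$$
P(\text{no match among } n \text{ people}) = n!\, e_n(p_1, \dots, p_m),
\qquad
e_n(p) = \sum_{1 \le i_1 < \cdots < i_n \le m} p_{i_1} p_{i_2} \cdots p_{i_n}.
$$

The function $e_n$ is Schur-concave on nonnegative vectors, and the uniform vector $(1/m, \dots, 1/m)$ is majorized by every probability vector $p$, so

$$
e_n(p) \le e_n\!\left(\tfrac{1}{m}, \dots, \tfrac{1}{m}\right)
\quad\Longrightarrow\quad
P(\text{match} \mid p) \ge P(\text{match} \mid \text{uniform}).
$$

Any departure from uniformity can therefore only raise the probability of a shared birthday, which is why 23 is an upper bound on the class size needed to reach 50%.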