12

The following problem has been posted on Mensa International Facebook Page:

$\quad\quad\quad\quad\quad\quad\quad\quad$enter image description here

The post itself received 1000+ comments but I won't go into details about the debate there since I know this is Bertrand's box paradox and the answer is $\frac23$. What make me interested here is how does one answer this problem using a Monte Carlo approach? How is the algorithm to solve this problem?

Here is my attempt:

  1. Generate $N$ uniformly distributed random numbers between $0$ and $1$.
  2. Let the event of box contains 2 gold balls (box 1) selected be less than half.
  3. Count the numbers that less than $0.5$ and call the result as $S$.
  4. Since it's a certainty to get a gold ball if the box 1 is selected and it's only 50% chance of getting a gold ball if the box 2 is selected, hence the probability of getting a sequence GG is $$P(B2=G|B1=G)=\frac{S}{S+0.5(N-S)}$$

Implementing the algorithm above in R:

N <- 10000
S <- sum(runif(N)<0.5)
S/(S+0.5*(N-S))

The output of program above is around $0.67$ which almost match with the correct answer but I'm not sure this is the correct way. Is there a proper way to solve this problem programmatically?

amoeba
  • 93,463
  • 28
  • 275
  • 317

3 Answers3

14

Like @Henry, I don't really feel that your solution is a Monte Carlo. Surely, you sample from the distribution, but it doesn't have much to do with imitating the data generating process. If you wanted to use Monte Carlo to convince someone that the theoretical solution is correct, you would need to use solution that imitates the data generating process. I would imagine it to be something like below:

boxes <- list(
  c(0, 0),
  c(0, 1),
  c(1, 1)
)

count_successes = 0
count_valid_samples = 0

for (i in 1:5000) {
  sampled_box <- unlist(sample(boxes, 1)) # sample box
  sampled_balls <- sample(sampled_box)    # shuffle balls in the box

  if (sampled_balls[1] == 1) {            # if first ball is golden
    if (sampled_balls[2] == 1) {          # if second ball is golden
      count_successes = count_successes + 1
    }
    count_valid_samples = count_valid_samples + 1
  }
}
count_successes / count_valid_samples

or using "vectorized" code:

mean(replicate(5000, {       # repeat 5000 times, next calculate empirical probability
  x <- boxes[[sample(3, 1)]] # pick a box
  if (x[sample(2, 1)] == 1)  # pick a ball, check if it is golden
    return(sum(x) == 2)      # check if you have two golden balls in the box
  else
    return(NA)               # ignore if sampled ball is silver
  }), na.rm = TRUE)          # not count if silver

Notice that since you condition of the fact that first ball was already drawn and it is golden, so the above code could simply use two boxes boxes <- list(c(0, 1), c(1, 1)) and then sample from them x <- boxes[[sample(2, 1)]], so the code would be faster since it wouldn't make the 1/3 empty runs that we discount. However since the problem is simple and the code runs fast, we could afford to explicitly simulate the whole data generating process "to be sure" that the result is correct. Besides that, this step is not needed since it would yield same results in both cases.

Tim
  • 108,699
  • 20
  • 212
  • 390
  • 7
    @Anastasiya-Romanova秀 you could instead sample from two boxes `boxes – Tim Dec 18 '17 at 10:31
  • Alright Tim, thanks for your answer. Give me time to understand your answer first since I'm fairly new in R. For now, +1 for you & @Henry. – Anastasiya-Romanova 秀 Dec 19 '17 at 02:20
  • I have a question. Does `return(sum(x) == 2)` mean if `x` values (1,1) the program will return 1 and 0 otherwise? – Anastasiya-Romanova 秀 Dec 19 '17 at 07:24
  • 1
    @Anastasiya-Romanova秀 yes, exactly. The code samples a box, then samples a ball from the box, if it is golden (=1) then it checks if the other ball from the box is also golden (1+1=2), if yes then it counts it and in the end it divides the sum by total count (i.e. uses `mean`). – Tim Dec 19 '17 at 08:14
  • OK, understood. Another question, does `return(NA)` mean it won't be included in calculating mean? For example, suppose we replicate only 5 times then the result are 1, 0, 1, NaN, 1. Does this mean the program will calculate the mean as $\frac{3}{4}$, *not* $\frac{3}{5}$? – Anastasiya-Romanova 秀 Dec 19 '17 at 09:12
  • 1
    @Anastasiya-Romanova秀 `return(NA)` returns missing value and then `mean(, na.rm = TRUE)` is used, where `na.rm = TRUE` argument tells the function to ignore the missing values. In other programming languages this could be done differently, e.g. using `continue` or `pass` keywords. – Tim Dec 19 '17 at 09:14
4

I feel your S/(S+0.5*(N-S)) calculation is not really simulation

Try something like this

N <- 10^6
ballsinboxes <- c("G","G", "G","S", "S","S")
selectedbox <- sample(c(1,2,3), N, replace=TRUE)
selectedball <- sample(c(1,2), N, replace=TRUE)
selectedcolour <- ballsinboxes[(selectedbox-1)*2 + selectedball ]
othercolour <- ballsinboxes[(selectedbox-1)*2 + 3 - selectedball ]
sum(selectedcolour == "G" & othercolour == "G") / sum(selectedcolour == "G")
Henry
  • 30,848
  • 1
  • 63
  • 107
-2

Why not simply list the cases?

Here: G is for "gold", S is for "silver", capital is for the initial extraction:

Gg

gG

Gs

... all other cases invove an initial silver (S) extraction and don't satisfy the conditional (initial extraction G).

Such, P(g|G) = 2/3.

ghuezt
  • 1
  • 1
  • 7
    The question asks about Monte Carlo solution. – Tim Dec 18 '17 at 11:21
  • Well, listing ALL the possibilities is an extreme case of Monte Carlo. – ghuezt Dec 18 '17 at 14:13
  • 4
    No it is not, Monte Carlo is about simulation/randomization. – Tim Dec 18 '17 at 14:15
  • Tim, get your math right. With infinitely many draws, you get exactly a list of all cases with equal probabilities. I sad "extreme case" and meant limit. – ghuezt Dec 18 '17 at 15:04
  • 1
    Sure, but listing all cases is not Monte Carlo. Square is a rectangle, but rectangle is not a square. – Tim Dec 18 '17 at 15:33