K successes in Bernoulli trials, or George Lucas movie experiment

Question

I'm reading "The Drunkard's Walk" now and cannot understand one story from it.

Here it goes:

Imagine that George Lucas makes a new Star Wars film and in one test market decides to perform a crazy experiment. He releases the identical film under two titles: "Star Wars: Episode A" and "Star Wars: Episode B". Each film has its own marketing campaign and distribution schedule, with the corresponding details identical except that the trailers and ads for one film say "Episode A" and those for the other, "Episode B".

Now we make a contest out of it. Which film will be more popular? Say we look at the first 20,000 moviegoers and record the film they choose to see (ignoring those die-hard fans who will go to both and then insist there were subtle but meaningful differences between the two). Since the films and their marketing campaigns are identical, we can mathematically model the game this way: Imagine lining up all the viewers in a row and flipping a coin for each viewer in turn. If the coin lands heads up, he or she sees Episode A; if the coin lands tails up, it’s Episode B. Because the coin has an equal chance of coming up either way, you might think that in this experimental box office war each film should be in the lead about half the time.

But the mathematics of randomness says otherwise: the most probable number of changes in the lead is 0, and it is 88 times more probable that one of the two films will lead through all 20,000 customers than it is that, say, the lead continuously seesaws"

I, probably incorrectly, attribute this to a plain Bernoulli trials problem, and must say I fail to see why the leader won't seesaw on average! Can anyone explain?

MånsT · Accepted Answer · 2012-09-05T09:09:21.020

23

Here is some R code to simulate the George Lucas experiment:

B<-20000
steps<-2*rbinom(B,1,0.5)-1
rw<-cumsum(steps)
ts.plot(rw,xlab="Number of customers",ylab="Difference")

Running it, we get pictures like these:

enter image description here

where the difference in sold tickets between A and B is on the y-axis.

Next, we run $10,000$ such simulated George Lucas experiments. For each experiment, we compute the proportion of time spent $\geq 0$, i.e. the proportion of the lined-up viewers for which the number of tickets sold to A is greater or equal to the number of tickets sold to B. Intuitively, you'd say that this proportion should be roughly $1/2$. Here is a histogram of the results:

enter image description here

The proportion is $1/2$ on average in the sense that the expected value is $1/2$, but $1/2$ is an unlikely value compared to values close to $0$ or $1$. For most experiments, the differences are either positive or negative most of the time!

The red curve is the density function of the arcsine distribution, also known as the $\mbox{Beta}(1/2,1/2)$ distribution. What is illustrated in the above picture is a theorem known as the first arscine law for random walks, which says that as the number of steps of the simple symmetric random walk approaches infinity, the distribution of the proportion of time spent above $0$ tends to the arcsine distribution. A standard reference for this result is Section III.4 of An introduction to probability theory and its applications , Vol 1 by William Feller.

The R code for the simulation study is

prop<-vector(length=10000)
for(i in 1:10000)
{
    steps<-2*rbinom(B,1,0.5)-1
    rw<-cumsum(steps)
    prop[i]<-sum(rw>=0)/B
}
hist(prop,freq=FALSE,xlab="Proportion of time spent above 0",main="George Lucas experiment")
curve(dbeta(x,1/2,1/2),0,1,col=2,add=TRUE)

edited Sep 05 '12 at 09:09

answered Sep 05 '12 at 08:06

MånsT

10,213
1
46
65

Thanks! I installed R and would like to repeat all your steps - how can I run 10,000 simulations and compute the proportion of time spent? – andreister Sep 05 '12 at 09:07
1

@andreister: I edited my answer, adding the code for the simulation at the end. I hope you find it useful! – MånsT Sep 05 '12 at 09:10
Thanks, that's very useful! To make sure I understand the stuff, I made http://pastebin.com/mtRdsPkP based on your code - can you flick though? – andreister Sep 05 '12 at 12:59
1

@andreister: That looks good! To answer the question about why `cumsum` is used instead of `sum` imagine that the viewers are standing in line, and that we check which movie they bought a ticket to one by one. `cumsum` gives a vector of partial sums, so that the 1st element tells us how far ahead/behind A is after 1 viewer, the 2nd element how far ahead A is after 2 viewers, 3rd element after 3 viewers, and so on. If element $i$ is positive, then A has had more viewers after the first $i$ viewers. If it is negative, B has had more viewers and if it is 0 they have had the same number of viewers – MånsT Sep 05 '12 at 13:02
1

(contd.) This is the information that we're interested in, since we want to see if the leader seesaws. `sum` would just sum all 1's and -1's, which would give you the final result after _all_ 20,000 viewers have been accounted for (i.e. the last element of the `cumsum` vector). – MånsT Sep 05 '12 at 13:03
I'm not sure if "each positive number in the vector is the length of the "Episode A" viewers chain until it's interrupted by an "Episode B" viewer" is an accurate description of the `cumsum`vector though. – MånsT Sep 05 '12 at 13:04
let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/4739/discussion-between-andreister-and-manst) – andreister Sep 05 '12 at 13:17

score 11 · Answer 2 · answered Sep 04 '12 at 23:23

Both A and B have a $1/2$ chance to be ahead after any odd number of trials $t$ (odd to avoid ties). However, these events are related. Whichever is ahead after $t=1$ has a $3/4$ chance to be ahead after $t=3$, and this gets more dramatic as $t$ increases.

The average number of lead changes does grow to infinity as the total number of trials increases, but slowly. A random walk without drift in $1$ dimension is recurrent, so however far you may be in the lead, the chance you will be tied at some point in the future (with an infinite number of trials) is $1$. However, even if you lead by only one, the expected time until you are even again is infinite. There is a significant chance that it will take an extremely long time to get back to even.

That said, the mode is being used to exaggerate the effect. It actually would be a surprise to see no lead changes at all in $20,000$ trials.

If you would like to compute some of the probabilities, you have to count something akin to lattice walks which don't cross the diagonal. There is a great combinatorial method which applies to random walks (and to Brownian motion) which don't cross such a line, called the reflection principle or reflection method. This is one method to determine the Catalan numbers. Here are two other applications:

The number of sequences so that $A$ ends up ahead $10,200-9,800$ is $20,000 \choose 9,800$. In each sequence ending up at $(10,200, 9,800)$, either $B$ is never in the lead, or there is some point at which $B$ is first in the lead. If $B$ gains the lead, then if you reverse the later trials, you get a sequence ending up at $(9,799, 10,201)$, and this is a bijection. So, the number of sequences which end up at $(10,200, 9,800)$ so that $B$ was never in the lead is ${20,000 \choose 9,800} - {20,000 \choose 10,201} = {20,000 \choose 9,800} - {20,000 \choose 9,799} = {20,000 \choose 9,800} \frac{401}{10,201}.$ So, you can see that the chance $B$ was ahead at some point, given that you end up at $(10,200, 9,800),$ is about $96\%$.

The total number of sequences with any endpoint so that $A$ is never behind is ${20,000 \choose 10,000} \approx 2^{20,000}/\sqrt{10,000 \pi}.$ So, the probability that $A$ is never behind is about $\frac{1}{100 \sqrt{\pi}}$. The chance that the lead never changes is about $\frac{1}{50 \sqrt{\pi}} \approx 1/89.$ The average number of lead changes is about $56$.

Thanks! I need to understand the notation before I understand your answer though! What does it mean "ends up ahead 10,200−9,800" etc, where do you get the numbers from? How do you see 20K is the mode? — andreister, Sep 05 '12 at 09:11
The values $10,200-9,800$ were an example. That's just one of the possible results. You can do the same sort of analysis for $11,000-9,000$ or $10,001-9,999.$ I don't think I said $20,000$ was the mode of anything. Your quote said, " the most probable number of changes in the lead is $0$" which means $0$ is the mode. However, this is similar to a geometric distribution with $p$ close to $0$. The most likely value is $0$ (if you use the $0$-based convention), but it's not likely. There are a lot of other possibilities with slightly lower probabilities. — Douglas Zare, Sep 05 '12 at 10:13

score 0 · Answer 3 · edited Apr 13 '17 at 12:44

"it is 88 times more probable that one of the two films will lead through all 20,000 customers than it is that, say, the lead continuously seesaws"

In plain English: one of the movies gets an early lead. It has to, as the first customer has to go to A or B. That movie is then just as likely to keep its lead as lose it.

88 times more likely sounds, well, unlikely, until you remember that perfect seesawing is very improbable. The chart in MansT's answer, showing this graphically, is fascinating isn't it.

ASIDE: Personally, I think it'll be more than 88 times - due to <buzzword-alert> viral marketing </buzzword-alert>. Each person will ask other people what they saw, and are more likely to visit the same movie. They'll even do this subconsciously: people are more likely to join a long queue to go an see something. I.e. as soon as randomness amongst the first few customers has created a leader, human psychology will keep it as a leader :-).

K successes in Bernoulli trials, or George Lucas movie experiment

3 Answers3