10

I'm working through Think Bayes (free here: http://www.greenteapress.com/thinkbayes/) and I'm on exercise 3.1. Here's a summary of the problem:

"A railroad numbers its locomotives in order 1..N. One day you see a locomotive with the number 60. Estimate how many locomotives the railroad has."

This solution is found with the likelihood function and exponential prior like so:

class Train(Suite):
  def __init__(self, hypos, alpha=1.0):
    # Create an exponential prior
    Pmf.__init__(self)
    for hypo in hypos:
      self.Set(hypo, hypo**(-alpha))
    self.Normalize()
  def Likelihood(self, data, hypo):
    if hypo < data:
      return 0
    else:
      return (1.0/hypo)

Conceptually this is saying, if we see a train number larger than one of our hypotheses (1...1000) then every hypothesis that's smaller has a zero chance of being correct. The rest of the hypotheses have a 1/number_of_trains chance of showing us a train with this number.

In the exercise I'm working on the author then adds on a little extra. This assumes there's only one company. In real life however you'd have a mixture of big and small companies and bigger companies (both equally likely). However, this would mean that you're more likely to see a train from a bigger company since they'd have more trains.

Now the question is how to reflect this in the likelihood function?

This isn't Stack Overflow so I'm not really asking for coding help, but instead perhaps just help about how I might think about this problem in terms of a likelihood function.

Justin Bozonier
  • 1,119
  • 2
  • 11
  • 24
  • The same problem is in 50 Challenging Problems in Probability by Mosteller. The book is widely available. I don't think think bayes soon is correct. –  Nov 03 '13 at 04:28
  • Bought the book @Hogan but it doesn't include the part about other companies being mixed in though. – Justin Bozonier Nov 03 '13 at 19:49

2 Answers2

9

I'm first outlining an approach for two companies in detail, the extension to even more companies then should be intuitive (at least for the likelihood, the prior could be more tricky).

Imagine there are two companies A and B, where A has $N_A$ locomotives and B has $N_B$ locomotives. We assume $N_A \ge N_B$ (you can always switch A and B to make this hold). The total number for that hypothesis of locomotives is $N_{tot} = N_A + N_B$.

Imagine you see a locomotive with the number $n$. There are three cases for the likelihood:

  1. $N_A < n$: This can't happen, so the likelihood is zero.
  2. $N_B < n \le N_A$: This locomotive must come from company A, so there is only one locomotive with this number. Thus the likelihood is $1/N_{tot}$
  3. $n \le N_B$: This locomotive can be either from A or from B, so there are two locomotives with this number. The likelihood to see one of them is $2/N_{tot}$.

As a quick sanity check: The likelihood to see any number at all is $$\sum_{i=1}^\infty L(i) = \sum_{i=1}^{N_B} \frac{2}{N_{tot}} + \sum_{i=N_B+1}^{N_A} \frac{1}{N_{tot}} \\ = \frac{2\cdot N_B}{N_{tot}} + \frac{N_A-N_B}{N_{tot}} = \frac{N_A+N_B}{N_{tot}} = 1$$.


Generally, there will be (number of companies + 1) cases, one for each interval $N_i < n \le N_{i+1}$. Luckily, we can look at the problem from a different angle and see that what we need for the likelihood are actually just two numbers: $N_{tot}$, the total number of locomotives; and $N_n$, the number of locomotives that have the number $n$. How likely are we to see one of the $N_n$ locomotive, out of $N_{tot}$ locomotives? This will happen in $\frac{N_n}{N_{tot}}$ of all cases, so this fraction is the likelihood. In Python, you can calculate this with two sum generators (and you don't even have to order the companies by size). If Ns contains a list (or tuple) of company sizes according to your hypothesis, then this will give the likelihood for seeing a locomotive with number n:

total_number_of_locomotives = sum(N for N in Ns)
number_of_locomotives_with_that_number = sum(1 for N in Ns if n<=N)
likelihood = (number_of_locomotives_with_that_number / total_number_of_locomotives)

Note that the trivial case with one company is also handled by this code (the first sum just will be $N$, the second sum will be 0 or 1, depending on whether $n\le N$).


For the priors, Zipf's law could be a good starting point for a realistic distribution of company sizes.

dobiwan
  • 787
  • 5
  • 8
  • This is a great answer and you're right that I can definitely see how it generalizes. Thank you for taking the time. – Justin Bozonier Sep 13 '14 at 18:30
  • It's worth nothing that the resulting likelihood function has the same value independent of the hypothesis. That is, `Likelihood(data=60, hypo=60)` and `Likelihood(data=60, hypo=1000)` evaluate to the same value. So if the prior distribution was uniform the posterior will also be uniform (minus the values for what the likelihood was 0) – RubenLaguna May 21 '16 at 15:29
-1

I am not going to analyze the code, but below is the solution.

Let

  • P(loc60) be the probability that a random locomotive has number 60
  • P(N) be the prior probability that there are exactly N locomotives
  • P(loc60|N) be the probability that a random locomotive has number 60, if the total number of locomotives is N,
  • P(N|loc60) be the probability that there are exactly N locomotives, if a random locomotive has number 60

Then

$$ P(N|\text{loc60}) = {P(\text{loc60}|N) P(N) \over P(\text{loc60})} = {P(\text{loc60}|N) P(N) \over \sum_M P(\text{loc60}|M)}$$

But $$ P(\text{loc60}|N) = \cases {1/N & if $N\ge 60$ \\ 0 & otherwise } $$

From now on, we assume that $N \ge 60$.

$$ P(N|\text{loc60}) = {P(N)/N \over \sum_{M=60}^\infty P(M)/M} $$

Now we should select P(N), otherwise we are stuck. Since we don't know even the order of magnitude of P(N), it is reasonable to assume that $\log N$ is uniformly distributed between 0 and some $\log N_\max$ (i. e. the probability that $10^2\le N<10^3$ is the same as the probability that $10^3\le N<10^4$). Guestimating $N_\max$ is a tricky task, but from my prior knowledge about railroads and locomotives, I can assume that $N_\max \gg 60$ .

The uniform distribution of $\log N$ means that $$P(N) = c(\log (N+1)-\log N) \approx c/N$$, where c is a constant independent on N.

Substituting this to the previous formula, we have: $$ P(N|\text{loc60}) \approx {c/N^2 \over \sum_{M=60}^{N_\max} c/M^2} $$

But $$\sum_{M=60}^{N_\max} c/M^2 \approx \int_{60}^{N_\max} {c\over M^2}dM = {c \over 60} - {c \over N_\max} \approx {c\over60} $$

Now we have

$$ P(N|\text{loc60}) \approx {60/N^2} $$

What is the median value of N? Let it be $N_\text{med}$ , then

$$ \int_{60}^{N_\text{med}} {60 \over N^2} dN = 1/2 $$

$$ 60/N - {60 \over N_\text{med}} = 1/2 $$

$$ N_\text{med} = 120 $$

If what we need is mathematical expectation rather than median, then

$$ E(N) = \int_{60}^{N_\max} {60\over N^2} N dN = 60 \log {N_\max \over 60} $$

From what I know about railroads, $N_\max$ should be between $10^3$ and $10^6$, so E(N) is somewhere between 170 and 600.

user31264
  • 1,694
  • 10
  • 14
  • 1
    This seems to address the simple problem. But what of the case when you might have different rail road companies of different sizes? – Justin Bozonier Nov 05 '13 at 00:42
  • This addresses exactly the case when there are different rail road companies of different sizes. "$\log N$ is uniformly distributed between 0 and some $\log N_\max$" is the distribution of sizes. – user31264 Nov 05 '13 at 00:50
  • 4
    If you say so. It's odd that the word "company" doesn't show up once in your answer. Sorry, I don't see the connection. – Justin Bozonier Nov 09 '13 at 01:04