
I'm trying to model some data on train arrival times. I'd like to use a distribution that captures "the longer I wait, the more likely the train is going to show up". It seems like such a distribution should look like a CDF, so that P(train shows up | waited 60 minutes) is close to 1. What distribution is appropriate to use here?

foobar
  • If you wait 25 hours and there has been no train, I suspect the chance of a train turning up in the next minute may be close to $0$ as it is quite possible that the line has been closed temporarily or permanently – Henry Jul 05 '18 at 11:32
  • @Henry, this depends entirely on your beliefs about prior probabilities. For instance, the least used railway station in Britain, https://www.theguardian.com/uk-news/2016/dec/09/brief-encounter-at-britains-least-used-railway-station-shippea-hill , does have gaps between arrivals of more than one day (on Sundays there is no service). – Sextus Empiricus Jul 05 '18 at 15:01
  • @MartijnWeterings - perhaps thanks to journalists, Shippea Hill saw a 1200% increase in usage and did not even make [the lowest 10 of usage the following year](https://www.globalrailnews.com/2017/12/01/these-are-the-10-least-used-railway-stations-in-great-britain/), some of which such as Teesside Airport have one train a week in one direction – Henry Jul 05 '18 at 16:59

2 Answers


Multiplication of two probabilities

The probability of a first arrival at a time between $t$ and $t+dt$ (the waiting time) is the product of

  • the probability for an arrival between $t$ and $t+dt$ (which can be related to the arrival rate $s(t)$ at time $t$)
  • and the probability of no arrival before time $t$ (or otherwise it would not be the first).

This latter term is related to:

$$P(n=0,t+dt) = (1-s(t)dt) P(n=0,t)$$

or

$$\frac{\partial P(n=0,t)}{\partial t} = -s(t) P(n=0,t) $$

giving:

$$P(n=0,t) = e^{-\int_0^t s(\tau) d\tau}$$

and the probability density for the waiting time is:

$$f(t) = s(t)e^{-\int_0^t s(\tau) d\tau}$$
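To make the formula concrete, here is a minimal numerical sketch. The function name `waiting_time_pdf`, the callable `hazard` (standing in for $s(t)$) and the rate value are my own illustrative assumptions, and the integral is only approximated with the trapezoidal rule.

```python
import math

def waiting_time_pdf(hazard, t_max, n_steps=10_000):
    """Return (times, pdf) with pdf[i] = s(t_i) * exp(-integral_0^{t_i} s(u) du).

    `hazard` is a hypothetical callable giving the arrival rate s(t).
    The integral is approximated with the trapezoidal rule.
    """
    dt = t_max / n_steps
    times = [i * dt for i in range(n_steps + 1)]
    pdf = []
    cumulative = 0.0  # running integral of s from 0 to the current time
    previous = hazard(0.0)
    for i, t in enumerate(times):
        current = hazard(t)
        if i > 0:
            cumulative += 0.5 * (previous + current) * dt
        pdf.append(current * math.exp(-cumulative))
        previous = current
    return times, pdf

lam = 0.1  # assumed constant rate, in arrivals per minute
times, pdf = waiting_time_pdf(lambda t: lam, t_max=60.0)

# With a constant rate the result should agree with lam * exp(-lam * t).
max_error = max(abs(f - lam * math.exp(-lam * t)) for t, f in zip(times, pdf))
```

Passing a different `hazard`, for example one that grows with `t`, gives the density for a train that becomes more likely to show up the longer you wait.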

Derivation of cumulative distribution.

Alternatively, you could start from the probability of fewer than one arrival up to time $t$

$$P(n<1 \vert t) = F(n=0 \vert t)$$

and the probability density for the first arrival at time $t$ is equal to minus its derivative

$$f_{\text{arrival time}}(t) = - \frac{d}{d t} F(n=0 \vert t)$$

This approach is useful, for instance, in deriving the gamma distribution as the waiting time for the n-th arrival in a Poisson process (see waiting-time-of-poisson-process-follows-gamma-distribution).
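As a sketch of that derivation, assume a constant rate $\lambda$ and write $k$ for the index of the arrival we are waiting for (both are notational choices made here). The number of arrivals up to time $t$ is then Poisson distributed, so

$$P(n<k \vert t) = \sum_{j=0}^{k-1} \frac{(\lambda t)^j}{j!} e^{-\lambda t}$$

and taking minus the derivative makes the sum telescope:

$$f_k(t) = -\frac{d}{dt} P(n<k \vert t) = \sum_{j=0}^{k-1} \lambda e^{-\lambda t} \frac{(\lambda t)^j}{j!} - \sum_{j=1}^{k-1} \lambda e^{-\lambda t} \frac{(\lambda t)^{j-1}}{(j-1)!} = \frac{\lambda^k t^{k-1}}{(k-1)!} e^{-\lambda t}$$

which is the gamma density with shape $k$ and rate $\lambda$. For $k=1$ it reduces to $\lambda e^{-\lambda t}$.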


Two examples

You might relate this to the waiting paradox (see "Please explain the waiting paradox").

  • Exponential distribution: If the arrivals are random like a Poisson process then $s(t) = \lambda$ is constant. The probability of a next arrival is independent of the previous waiting time without arrival (say, if you roll a fair die many times without a six, then for the next roll you will not suddenly have a higher probability of a six, see gambler's fallacy). You will get the exponential distribution, and the pdf for the waiting times is: $$f(t) = \lambda e^{-\lambda t} $$

  • Uniform distribution (constant density): If the arrivals occur at regular intervals (such as trains arriving according to a fixed schedule), then the probability of an arrival, when a person has already been waiting for some time, is increasing. Say a train is supposed to arrive every $T$ minutes; then the arrival rate after already waiting $t$ minutes is $s(t) = 1/(T-t)$ for $0 \le t < T$, and the pdf for the waiting time will be: $$f(t)= \frac{e^{-\int_0^t \frac{1}{T-\tau} d\tau}}{T-t} = \frac{1}{T}$$ which makes sense, since every time between $0$ and $T$ should be equally likely to be the first arrival.
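The contrast between the two examples can be checked with a small simulation. This is only a sketch under assumed values: a train every `T = 10` minutes, a Poisson rate chosen to give the same mean spacing, and a hypothetical helper `conditional_arrival_prob` that estimates the chance of a train in the next minute given the time already waited.

```python
import random

def conditional_arrival_prob(waits, waited, window=1.0):
    """Estimate P(train arrives within `window` | already waited `waited`)."""
    still_waiting = [w for w in waits if w > waited]
    arrived = sum(1 for w in still_waiting if w <= waited + window)
    return arrived / len(still_waiting)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200_000
T = 10.0       # assumed: a train every 10 minutes
lam = 1.0 / T  # assumed: Poisson arrivals with the same mean spacing

# Fixed schedule: the passenger shows up at a uniformly random moment,
# so the wait until the next train is uniform between 0 and T.
scheduled = [rng.uniform(0.0, T) for _ in range(n)]
# Poisson arrivals: the wait is exponential with rate lam.
poisson = [rng.expovariate(lam) for _ in range(n)]

for waited in (0.0, 4.0, 8.0):
    print(waited,
          round(conditional_arrival_prob(scheduled, waited), 3),
          round(conditional_arrival_prob(poisson, waited), 3))
```

The scheduled column should rise roughly like $1/(T-t)$ as the time already waited grows, while the Poisson column should stay close to $1-e^{-\lambda}$ regardless of how long you have waited.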


So it is this second case, in which the probability of an arrival increases the longer a person has already been waiting, that relates to your question.

It might need some adjustments depending on your situation. With more information the probability $s(t) dt$ for a train to arrive at a certain moment might be a more complex function.

Sextus Empiricus

The classical distribution to model waiting times is the exponential distribution.

The exponential distribution occurs naturally when describing the lengths of the inter-arrival times in a homogeneous Poisson process.
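A rough way to see this is to approximate a homogeneous Poisson process by many small time slots, each with an independent arrival probability of `lam * dt`. The rate, slot width and number of slots below are assumed values for illustration, and the discretisation makes the gaps geometric, which is only approximately exponential.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable
lam = 0.5               # assumed arrival rate per minute
dt = 0.01               # small time slot, in minutes
n_slots = 2_000_000

# In each slot an arrival occurs with probability lam * dt, independently
# of everything before it: a discrete approximation of a homogeneous
# Poisson process.
arrival_slots = [i for i in range(n_slots) if rng.random() < lam * dt]

# Inter-arrival times in minutes.
gaps = [(b - a) * dt for a, b in zip(arrival_slots, arrival_slots[1:])]

mean_gap = sum(gaps) / len(gaps)
# For an exponential distribution, P(gap > 1/lam) = exp(-1), about 0.368.
share_long = sum(1 for g in gaps if g > 1.0 / lam) / len(gaps)
```

If the exponential model is a fair description, `mean_gap` should land near `1 / lam` and `share_long` near $e^{-1}$.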

Stephan Kolassa