Developing a prediction model for bus stops

Question

I need to predict if a bus will stop at a particular scheduled route stop. I have GPS coordinates, I can check door signals, the stops have radii, and I also have historical data about if the bus skipped a stop or not. I want to use this information to train a model predicting whether a bus will stop at a particular stop or not.

I was looking at Bayes' rule, and I was thinking it might be possible to use a likelihood ratio to predict whether the bus reaches a stop. But I'm not sure if this is the right way to look at this problem. I'm not a statistician, so all the jargon involved in choosing the right way to model this problem makes me nervous. I would like to get your advice on this. I'm willing to put in time to learn, but I don't want to spend a few days only to find out this might not be the way to look at it in the first place.

I think my problem is limited understanding of various probability based modelling techniques. If my statement of my problem needs any clarification, or any section needs more detail, I'm happy to add it. Any help would be much appreciated. Cheers!

UPDATE: Sorry guys, these are buses that take employees to work and back again. So these buses are free to skip a stop if there are no passengers for it. The buses can even skip the last two stops: for example, when taking passengers from work to home, they usually go back to a different place to pick people up from. I have GPS data from which I can check whether a bus reached a stop, and there's also data for the route configuration. For each route there is also data on whether a stop is mandatory. With this data in hand I'd like to predict whether a bus will reach a certain stop. Please let me know if you require further information. Thanks!

I find quite confusing the link between "reach a certain point" and "avoid that stop" and "missed a stop or not". I don't think the problem is statistical jargon, could you maybe re-word the problem more precisely about what in the real world it is about? — Peter Ellis, Jan 08 '14 at 07:12
Agree with Peter Ellis. Describe what the bus does and what is in the data and how the two are related. For example, why would a bus avoid a stop? Isn't the last stop on a bus route normally also the first stop on the route going in the other direction? Bus routes are usually circles, conceptually, even if we draw them as lines. So wouldn't the bus have to go to the last/first stop even if the bus was empty in order to see if anyone wants to get on there? — Bill, Jan 08 '14 at 20:06
I've tried expanding the original question guys - but I do agree that I'm finding it hard to formulate my problem definition, but I'm willing to expand on any areas if you require further information. thanks again! — opensourcegeek, Jan 08 '14 at 20:45
OK, Alex Williams answer is good if you want a relatively simple method. If you want a more complex method (which might give you more accurate predictions), then you want a Hidden Markov Model where the hidden state variable is the number of guys on the bus at any given time. — Bill, Jan 09 '14 at 19:38

score 8 · Accepted Answer · edited Apr 13 '17 at 12:44

It seems like simple logistic regression would work well. Hopefully this matches well to the data you have. I've tried to lay off jargon as much as possible.

Let's confine our analysis to a single bus route for simplicity (you can simply repeat this procedure for other routes). The dependent/predicted variable you are trying to measure is a binary variable $y$; that is, $y=0$ if the bus misses the stop, and $y=1$ if the bus makes the stop.

From your GPS data you can extract the value of $y$ for a bunch of previous bus runs. Let's say you have $N$ observations in this data. This is best conceptualized as a vector/list, $y= \{ y_1, y_2, ... ,y_N\} $. For example, $y= \{ 0,1,0,0,1\}$ would correspond to miss, stop, miss, miss, stop, for five observations.
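As a rough sketch of how $y$ might be extracted for one stop, the Python below marks a run as a stop ($y=1$) if any GPS ping falls inside the stop's radius while the door signal is on. The ping layout `(lat, lon, door_open)`, the function names, and the "door open inside the radius" rule are all assumptions for illustration; the question does not say how the raw data is stored.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def made_stop(pings, stop_lat, stop_lon, radius_m):
    """Return 1 if any ping is inside the stop radius with the door open, else 0.

    `pings` is a list of (lat, lon, door_open) tuples for one run
    (a hypothetical layout, not taken from the question).
    """
    for lat, lon, door_open in pings:
        if door_open and haversine_m(lat, lon, stop_lat, stop_lon) <= radius_m:
            return 1
    return 0


# One y value per historical run gives the vector y = {y_1, ..., y_N}.
runs = [
    [(51.5000, -0.1000, True)],   # door opened at the stop
    [(51.5000, -0.1000, False)],  # drove past with the door closed
    [(51.6000, -0.1000, True)],   # door opened, but about 11 km away
]
y = [made_stop(run, 51.5000, -0.1000, radius_m=30.0) for run in runs]
```

If the door signal is unreliable, a dwell-time rule (several consecutive pings inside the radius) could replace the door check; that is a design choice, not something the data description settles.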

Now you want to develop a series of predictor/independent variables that you can use to predict $y$ for future observations. AccidentalStatistician has mentioned a few possibilities. Here are a few simple ones:

Whether the bus stopped at or missed the previous bus stop (call this variable $x_1$). This is also a binary variable. It could potentially be very informative: for example, if $x_1 = \{ 0,1,0,0,1\}$ matched the $y$ above entry for entry, that would be evidence that $y=x_1$. Of course, there is no reason to check only the previous bus stop. To be complete, you could try using ALL previous bus stops on the line as predictor variables. Each will be a binary vector like $x_1$ above.
The distance between the bus and the bus in front of it (call this $x_2$). In contrast to $x_1$, this variable is continuous, and could look something like $x_2 = \{0.53,0.9,0.72,0.81,0.62 \}$, where each entry corresponds to the distance (e.g., in miles) between the two buses at the time of the stop (or averaged over some period before the stop). It might be more informative to measure this distance in minutes than in miles.
Time of day. $x_3 = \{8.5,9.2,10.1,11.2,14.9\}$ in hours.
Day of year... You get the recipe by now hopefully. Feel free to come up with more ideas.

The important step is figuring out what you think might be important in your data and distilling it into some simple form (e.g. zeros and ones).
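To make the "distilling" step concrete, here is a minimal sketch that turns one raw observation into the three predictors listed above. The record layout and field names (`prev_stop_made`, `headway_miles`, `hour`, `minute`) are hypothetical; substitute whatever your data actually contains.

```python
def to_features(obs):
    """Turn one observation record (hypothetical field names) into [x1, x2, x3]."""
    return [
        1 if obs["prev_stop_made"] else 0,   # x1: previous stop made (binary)
        float(obs["headway_miles"]),         # x2: distance to the bus in front
        obs["hour"] + obs["minute"] / 60.0,  # x3: time of day in hours
    ]


example = {"prev_stop_made": True, "headway_miles": 0.53, "hour": 8, "minute": 30}
row = to_features(example)
```

Applying this to every historical run gives one row of predictors per entry of $y$.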

Once you have your data in this form, you can run a logistic regression to predict the probability that $y=1$ for any observed values of $x_1, x_2, \ldots, x_p$ (where $p$ is the number of independent variables). If you have just one independent variable, $x$, the result will look something like the following plot (image source):

Here, the black dots are your observed values of $y$ plotted against your observed values of $x$. The red line is the predicted probability of $y=1$ for an arbitrary value of $x$ (here $x$ is a continuous variable).
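For readers who prefer code to plots, the following is a small self-contained sketch of fitting a logistic regression by plain gradient ascent on the log-likelihood. It is not the procedure from the linked R tutorial, just an illustration of what the fit does. The data values are made up, and the learning rate and iteration count are arbitrary choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, x):
    """Predicted P(y = 1 | x) for weights w = [intercept, b1, ..., bp]."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit [intercept, b1, ..., bp] by batch gradient ascent on the log-likelihood."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = yi - predict_proba(w, xi)  # observed minus predicted
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Made-up observations: [x1 = previous stop made, x2 = distance to bus in front]
X = [[0, 0.53], [1, 0.90], [0, 0.72], [0, 0.81],
     [1, 0.62], [1, 0.70], [0, 0.60], [1, 0.85]]
y = [0, 1, 0, 0, 1, 0, 1, 1]

w = fit_logistic(X, y)
p_prev_made = predict_proba(w, [1, 0.70])     # previous stop was made
p_prev_skipped = predict_proba(w, [0, 0.70])  # previous stop was skipped
```

In practice you would use a library routine instead (for example `glm` with `family = binomial` in R, or `LogisticRegression` in scikit-learn), which also handles predictors on very different scales far better than this toy loop.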

The following source explains how to fit a logistic model in R: LINK. I recommend the following textbook as an introduction to logistic regression and multiple linear regression (which has pretty similar motivations) LINK. And the following book for advanced understanding of logistic regression and other classification methods: LINK. This last reference will cover a lot of really important methods for variable selection -- it is really easy to come up with too many independent variables and over-fit your data: LINK. Don't do this!

score 2 · Answer 2 · answered Jan 09 '14 at 16:45

In order to predict any outcome (bus stop in your case), you need some information other than the outcome you want to predict. These variables are often called predictors / covariates / independent variables.

So the answer to your question depends on what information you have.

1. If the bus GPS signal and the door-open signal are the only data you have, your predictors can include

historical data on the route
bus stop data from the morning (from home to work) to predict the after-work stops
data on the previous stops (if real time)

In this case, you can probably find a correlation between pick-up stops and drop-off stops, i.e., the people from the same pick-up sites are likely to get off at the same drop-off stops. So if a particular pick-up site was skipped, then the corresponding drop-off site may be skipped as well. You can use logistic regression for this purpose. You probably cannot model the pick-up stops in this case.

If you are working in real time, you can also use the previous-stop information on the same route. The method for real-time modelling is Markov Chain Monte Carlo, but you can use regression if that is beyond your knowledge.

2. If you have other information such as

day of the week
time of the day
# of people on the drop-off bus

they can also be used in a regression as predictors.

In short, you need Markov Chain Monte Carlo if you know enough statistics and logistic regression otherwise. Your predictors will be anything you think is relevant.

score 1 · Answer 3 · answered Jan 09 '14 at 15:40

I'm assuming, from your description of your data, that you effectively have a perfect record of where the bus opens its doors. So, for predicting what a bus will do at a particular stop, your data is what the bus did on previous shifts (entire routes), and what the bus did for previous stops on the current shift.

For the non-mandatory stops, the simplest model would be to assume the probability of the bus opening its doors at a stop is independent of what happens at other stops. In that case each stop would have a simple likelihood ratio, since it's equivalent to the likelihood for, say, seeing heads when flipping a biased coin.
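A minimal sketch of this independent "biased coin" model: estimate each stop's probability as the fraction of past runs on which the doors opened there. The add-one smoothing (the `prior_stops` and `prior_skips` pseudo-counts) is my own addition, so that a stop never seen served does not get a probability of exactly zero, and the stop names are hypothetical.

```python
def stop_probabilities(history, prior_stops=1.0, prior_skips=1.0):
    """Estimate P(doors open) per stop, treating stops as independent coins.

    `history` maps a stop id to a list of 0/1 outcomes from past runs.
    The prior_* pseudo-counts are a smoothing assumption, not part of the data.
    """
    return {
        stop: (sum(outcomes) + prior_stops) / (len(outcomes) + prior_stops + prior_skips)
        for stop, outcomes in history.items()
    }


history = {"stop_A": [1, 1, 0, 1], "stop_B": [0, 0, 0, 0]}
probs = stop_probabilities(history)

# Odds of stopping versus skipping for one stop, p / (1 - p).
odds_A = probs["stop_A"] / (1.0 - probs["stop_A"])
```

With four past runs the smoothing matters a lot; with months of data it barely changes the raw frequencies.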

The next step would be a model where the probability of opening doors at a stop depends on what the bus did at stops earlier along the route. For that you'd want some historical data to look at, so you can look for correlation between stops.
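One simple way to look for that dependence, sketched under the assumption that each historical run is stored as a list of 0/1 outcomes in route order, is to compare how often the bus stops at a given stop after serving the previous one versus after skipping it.

```python
def conditional_stop_rates(runs, stop_index):
    """Estimate P(stop at stop_index | outcome at the previous stop).

    `runs` is a list of per-run outcome lists (0 = skipped, 1 = stopped),
    ordered along the route. Returns {previous_outcome: rate or None}.
    """
    counts = {0: [0, 0], 1: [0, 0]}  # previous outcome -> [runs seen, times stopped]
    for run in runs:
        prev, cur = run[stop_index - 1], run[stop_index]
        counts[prev][0] += 1
        counts[prev][1] += cur
    return {prev: (stopped / seen if seen else None)
            for prev, (seen, stopped) in counts.items()}


runs = [[1, 1], [1, 1], [1, 0], [0, 0], [0, 0], [0, 1]]
rates = conditional_stop_rates(runs, stop_index=1)
```

If the two rates are close, the independent model is probably good enough for that stop; if they differ a lot, the previous stop is worth including as a predictor.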

For making the model more complete, it depends on the specifics of your problem.

I've assumed above, for example, that what the bus does on different shifts are independent, but if it does several shifts, at different times of day, that might not be the case. Then you'd have to account for what time of day it is, if you've got measurements for that.

I've also assumed there's only one bus, but if two buses happened to turn up at the same stop soon after each other, that would lower the probability the second bus opens its doors there. So then you'd have to account for what other buses are doing.

score 0 · Answer 4 · answered Jan 08 '14 at 05:59

In your question, you need to describe what data you have that are complete, jagged, and missing.

Firstly, how much data do you have for a single bus?
Does a bus always use the same route (for the data you have)? Or does the route change over time?
If you can assemble data that are not jagged, you may be able to develop a model.
However, if you have multiple realizations representing, say, several months of stop data for a single bus route where all the trips are over the same route and the same stops, then you will be in a better position.
What seems to be missing is a description of your data as multiple measurements of the same experiment (one bus, same route, 100 days of data for the same route, with stop data and characteristics of bus conditions --> #passengers, etc., when each stop is made).

Overall, you should think about assembling your data to determine what is missing, what is unique, what is repeated, and what is asymmetric and jagged (different). Then use a "divide and conquer" method to reduce your large problem into many small problems, and solve each small problem.
