Hierarchical regression in a time series dataset

Question

I have a dataset of 250 lines and 11 columns. Each line contains data referring to a school. For each school I have its name and the number of applications, from 2012 to 2016, and I would like to predict the number of applications for 2017. I also have the neighborhood where each school is located, and the population of children in that neighborhood, from 2012 to 2017.

I know I should apply some kind of algorithm related to time series, but 1) I do not have much experience with time series, and 2) maybe a time series of 5 years is too short, right?

My question is whether it would make sense to transform my database to long format, previously removing the population variable of the year 2017, and apply a hierarchical regression model, with the number of applications as a dependent variable and the neighborhood and the population of children as independent variables. If this does not make sense, what type of analysis do you think is appropriate with this data to be able to make a prediction of the number of applications for the year 2017?

By the way, I'm doing the analysis in Python, but I would not have any trouble doing it in R.

Gijs · Accepted Answer · 2017-11-05T12:22:44.423

I think a reasonable model that you could still estimate from this data is that the number of kids at a school is a percentage of the number of kids in the neighborhood. This percentage might have a trend, so including a linear trend for it is worth a try.

#kids ~ binom(n = #pop, p = inverse_logit(a_school + b_school * (#year)))
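As a rough illustration of the formula above, here is a minimal sketch in plain Python. Because every school gets its own a_school and b_school, each school can be fitted on its own. The numbers, the variable names and the helper `fit_binomial_trend` are all made up for the example, and the fit is a bare-bones Newton iteration, not what a GLM library would do internally in every detail.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_binomial_trend(years, kids, pop, n_iter=25):
    """Maximum-likelihood fit of kids ~ Binomial(pop, inv_logit(a + b * t))
    for a single school, where t = year - first observed year."""
    t = [y - years[0] for y in years]
    # Start from the overall share with no trend.
    p0 = sum(kids) / sum(pop)
    a, b = math.log(p0 / (1.0 - p0)), 0.0
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for ti, k, n in zip(t, kids, pop):
            p = inv_logit(a + b * ti)
            r = k - n * p            # score contribution
            w = n * p * (1.0 - p)    # binomial variance (information weight)
            g_a += r
            g_b += r * ti
            h_aa += w
            h_ab += w * ti
            h_bb += w * ti * ti
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Hypothetical data for one school (not from the question's dataset).
years = [2012, 2013, 2014, 2015, 2016]
kids = [40, 44, 47, 52, 55]
pop = [1000, 1010, 1020, 1030, 1040]
pop_2017 = 1050  # the 2017 child population is known in advance

a, b = fit_binomial_trend(years, kids, pop)
predicted_2017 = pop_2017 * inv_logit(a + b * (2017 - years[0]))
print(round(predicted_2017, 1))
```

The 2017 prediction uses the known 2017 population and extrapolates the share one year ahead along the fitted trend.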

This is a time series model in that time is involved, but I wouldn't be looking at AR, MA or ARIMA for this, if that's your idea. Something that is not in this model is that schools might have a maximum number of kids allowed each year. If there is such a system, it will of course influence the numbers and the predictions. Also, this simple model assumes each kid has an independent chance of picking this school; if the choices are not independent, you might see overdispersion. Finally, the model assumes the school only gets kids from its own neighborhood, so kids travelling to school over larger distances are not accounted for. Whether these things will actually affect the quality of your predictions remains to be seen.
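One quick way to see whether overdispersion is an issue is the Pearson statistic divided by its degrees of freedom. The sketch below does this for the simplest case, a constant share per school with no trend; the counts are invented for illustration.

```python
# Hypothetical counts for one school; a constant share p is assumed here.
kids = [40, 60, 35, 70, 45]
pop = [1000, 1000, 1000, 1000, 1000]

p_hat = sum(kids) / sum(pop)  # pooled estimate of the share

# Pearson chi-square: squared residuals scaled by the binomial variance.
pearson = sum((k - n * p_hat) ** 2 / (n * p_hat * (1 - p_hat))
              for k, n in zip(kids, pop))
df = len(kids) - 1            # one parameter (p) was estimated
dispersion = pearson / df
print(dispersion)
```

Values well above 1 suggest more year-to-year variation than a plain binomial allows, in which case a quasi-binomial or beta-binomial model may be a better fit.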

Now, if there are important structural changes going on, like a neighborhood expansion or the opening or closing of a school nearby, these are absorbed (probably badly) by the trend term. If you know something about them, you should definitely include it, perhaps with an indicator term. Let's say you know the neighborhood in 2017 is twice the size of the neighborhood in 2016. Then a good bet is a doubling of the number of kids at the school, and no model, be it ARIMA, linear, or a neural network, will see that coming based only on the application numbers up to that year.
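To make the doubling example concrete, here is a small sketch with invented numbers. It compares a straight-line extrapolation of the counts alone with a prediction based on the school's share of the neighborhood population.

```python
# Hypothetical data: a stable school in a neighborhood that doubles in 2017.
years = [2012, 2013, 2014, 2015, 2016]
kids = [50, 51, 49, 50, 50]
pop = [1000, 1000, 1000, 1000, 1000]
pop_2017 = 2000  # assumed known in advance

# (1) Least-squares line through (year, kids), ignoring population.
n = len(years)
mean_year = sum(years) / n
mean_kids = sum(kids) / n
slope = (sum((y - mean_year) * (k - mean_kids) for y, k in zip(years, kids))
         / sum((y - mean_year) ** 2 for y in years))
count_only_2017 = mean_kids + slope * (2017 - mean_year)

# (2) Share of the neighborhood, applied to the known 2017 population.
share = sum(kids) / sum(pop)
share_based_2017 = share * pop_2017

print(count_only_2017, share_based_2017)
```

The count-only line stays near 50, while the share-based prediction doubles with the population.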

Your question is about long format. The model you are describing is

#kids ~ norm(a + b_neighborhood * neighborhood + b_population * population, sigma)

I think the coefficient of the population should vary by school. Some schools attract a large percentage of the kids in a neighborhood, others only a small percentage, so a single coefficient here is probably not optimal. Also, normal errors are not strictly appropriate for counts, although they could work fine in practice.
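For reference, a sketch of the wide-to-long reshaping and of a population coefficient that varies by school. The column names (`apps_2012`, `pop_2012`, ...), the helper `wide_row` and the data are assumptions for the example, and the per-school coefficient is a plain least-squares slope through the origin, which is only one of several ways to let it vary.

```python
YEARS = range(2012, 2017)

def wide_row(school, neighborhood, apps, pop):
    """Build one wide-format row with a column per year (hypothetical layout)."""
    row = {"school": school, "neighborhood": neighborhood}
    for year, a, p in zip(YEARS, apps, pop):
        row[f"apps_{year}"] = a
        row[f"pop_{year}"] = p
    return row

wide = [
    wide_row("A", "North", [40, 44, 47, 52, 55], [1000, 1010, 1020, 1030, 1040]),
    wide_row("B", "South", [200, 205, 198, 210, 207], [1000, 1005, 1010, 1015, 1020]),
]

# Long format: one record per school and year.
long = [
    {"school": row["school"], "neighborhood": row["neighborhood"], "year": year,
     "apps": row[f"apps_{year}"], "pop": row[f"pop_{year}"]}
    for row in wide for year in YEARS
]

# Per-school coefficient of population: least squares of apps on pop, no intercept.
coef = {}
for school in sorted({r["school"] for r in long}):
    rows = [r for r in long if r["school"] == school]
    coef[school] = (sum(r["apps"] * r["pop"] for r in rows)
                    / sum(r["pop"] ** 2 for r in rows))

print(len(long), coef)
```

With these numbers school B gets a much larger coefficient than school A, which a single shared b_population could not capture.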

The model you are describing, by the way, unless I'm missing something, is not a hierarchical model as commonly understood by statisticians. That term is used for a model with random effects, see https://en.wikipedia.org/wiki/Mixed_model. Including random effects in your model can be very beneficial as well, depending on the correlations between the schools.
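To give an idea of what random effects do, here is a minimal sketch of a random-intercept model on school-level shares, using moment estimates of the variance components for a balanced design. The data are invented, and in practice you would more likely use a mixed-model package such as lme4; this only shows the partial pooling that such a model performs.

```python
# Hypothetical shares (applications / population) for three schools, five years each.
rates = {
    "A": [0.040, 0.044, 0.046, 0.050, 0.053],
    "B": [0.200, 0.195, 0.205, 0.210, 0.198],
    "C": [0.110, 0.120, 0.105, 0.115, 0.112],
}
J = len(rates)  # number of schools
n = 5           # observations per school (balanced)

school_mean = {s: sum(v) / n for s, v in rates.items()}
grand_mean = sum(school_mean.values()) / J

# Within-school and between-school mean squares (one-way ANOVA).
msw = sum((y - school_mean[s]) ** 2 for s, v in rates.items() for y in v) / (J * (n - 1))
msb = n * sum((m - grand_mean) ** 2 for m in school_mean.values()) / (J - 1)

sigma2 = msw                          # residual variance
tau2 = max(0.0, (msb - msw) / n)      # variance of the school effects

# Each school mean is pulled towards the grand mean by this factor.
shrink = tau2 / (tau2 + sigma2 / n)
pooled = {s: grand_mean + shrink * (m - grand_mean) for s, m in school_mean.items()}
print(shrink, pooled)
```

When schools differ a lot relative to the year-to-year noise, as here, the shrinkage factor is close to 1 and the random-effects estimates are close to the per-school means; with noisier or more similar schools the pooling becomes stronger.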

Thank you very much for your comments, @Gijs. You have given me two good starting points. I like your idea of approaching the problem as a binomial distribution. But I'm not sure if the year should be a predictor variable. How will I predict the percentage of applications for 2017, if my model does not have data on that year? On the other hand, I totally agree with you on the notes you make about the limitations of the model. Unfortunately I am working with open data, and this is the only thing I have. Thank you very much for your help. — giltrapo, Nov 06 '17 at 07:51
Hey @giltrapo, you got it. The year could be used as a predictor in a linear trend. That's what I intended in the first formula. Let's say the percentages of kids at the school you observe are 40 - 45 - 50 - 52 over the period 2014 - 2017. Then you could infer there is a trend there. You can make it linear in the formula, which will work with these numbers, or perhaps use a logistic growth formula, see http://study.com/academy/lesson/logistic-population-growth-equation-definition-graph.html. — Gijs, Nov 06 '17 at 09:30
I'm having trouble getting good predictions. The best predictions I get with a very simple model ('glm(applications/population ~ school, family = binomial, weights = population, data=schools)'). You can try with the data I am using ('schools — giltrapo, Nov 06 '17 at 11:48
Hey, I have the same results, see http://gijskoot.nl/aic/predictions/glm/2017/11/06/school-admissions.html. I don't think it's a big problem that the percentages are small. I do think that in such a case, the size of the neighborhood isn't actually too important relative to the particular popularity of the schools. I found that a Gaussian model works better than all the binomial models. — Gijs, Nov 06 '17 at 13:31
I have obtained almost the same results as you, @Gijs. I also used lme4 to adjust multilevel models, but I did not get much better results than with the binomial. Thank you very much for your attention and your help. By the way, if you are working on a project related to schools, and you are interested, I could email you my full dataset. Tell me and I'll send it without problems. — giltrapo, Nov 06 '17 at 15:58
Well that's reassuring for both of us I think, something that makes sense! Thanks for the offer. At the moment, data is (unusually) not my problem, because the open data in the Netherlands on schools is very comprehensive. Best of luck! — Gijs, Nov 06 '17 at 16:03
