Identity of residual distribution, and identification of correct model in multiple categorical linear regression

Question

I am using R for this analysis, and so examples and graphics will be produced in this language. I am willing to provide equivalent examples in similar languages if it will help someone, and am willing to accept answers in terms of other languages.

In this question, I intend to display graphs produced in order to verify assumptions, and ask for help in getting a better model. I understand that this may be considered too specific. However, it is my opinion that it would be helpful to have more examples of bad models and how to correct them on this site. If a moderator finds this not to be the case, I will happily delete this post.

I have fitted an initial linear model (lm) in R. It is a multiple categorical regression with approximately 100,000 cases, two categorical regressors and a continuous regressand. The goal of this regression is prediction: specifically, I would like to estimate prediction intervals. Find below some diagnostics of the initial model:

Residuals histogram (full) below. It may be difficult (impossible) to see, but there exist (sparse) values between 300 and 2000, as well as -50 and -500. Between -50 and 300, values are very dense. This indicates, to my understanding, heavy tails.

Residuals histogram (partial) below. Same image as above, but zoomed to the dense area.

A normal quantile-quantile (QQ) plot is found below. Again, according to the holy grail of QQ plots, (super) heavy tails are indicated.

Below is predicted vs residuals. Clearly, funky stuff is going on, suggesting heteroscedasticity:
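For readers who want to reproduce diagnostics of this kind, here is a minimal sketch. The data are simulated stand-ins, not the data from the question: the names `dat`, `f1`, `f2` and `y`, and the log-normal noise used to create heavy right tails, are all assumptions made for illustration.

```r
# Simulated stand-in data: two categorical regressors and a positive, skewed response.
set.seed(1)
n  <- 5000
f1 <- factor(sample(letters[1:4], n, replace = TRUE))
f2 <- factor(sample(c("x", "y", "z"), n, replace = TRUE))
# Multiplicative (log-normal) noise gives a heavy right tail on the raw scale.
y   <- exp(2 + 0.5 * as.integer(f1) + 0.3 * as.integer(f2) + rnorm(n, sd = 0.8))
dat <- data.frame(y, f1, f2)

fit <- lm(y ~ f1 + f2, data = dat)
res <- resid(fit)

op <- par(mfrow = c(2, 2))
# Full histogram: sparse extreme residuals are hard to see at this scale.
hist(res, breaks = 100, main = "Residuals (full)")
# Zoomed histogram: keep only the central 95% of residuals.
central <- res[res > quantile(res, 0.025) & res < quantile(res, 0.975)]
hist(central, breaks = 100, main = "Residuals (central 95%)")
# Normal QQ plot: heavy tails bend away from the reference line at both ends.
qqnorm(res); qqline(res)
# Residuals vs predicted: a fan shape suggests heteroscedasticity.
plot(fitted(fit), res, xlab = "Predicted", ylab = "Residual")
abline(h = 0, lty = 2)
par(op)
```

With purely categorical regressors the predicted values take only as many distinct values as there are factor-level combinations, so the last panel shows vertical strips rather than a cloud.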

I first tried some transformations. Box-Cox yields an estimated power parameter $\lambda$ very close to zero, which corresponds to the log transform, so I will take the log of the regressand (in accordance with the Wikipedia page).
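A Box-Cox check like the one described can be sketched with `boxcox()` from the MASS package. The data below are simulated (log-normal by construction, so the estimate should land near zero); `dat`, `g` and `y` are hypothetical names, not the variables from the question.

```r
library(MASS)  # provides boxcox()

set.seed(2)
n   <- 2000
dat <- data.frame(g = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
# Response is log-normal given the factor, so the log transform is the "right" one here.
dat$y <- exp(1 + 0.4 * as.integer(dat$g) + rnorm(n, sd = 0.5))

# Profile log-likelihood over a grid of power parameters (no plot drawn).
bc <- boxcox(y ~ g, data = dat, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Grid value that maximises the profile log-likelihood.
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

A value of `lambda_hat` near 0 points to the log, near 0.5 to the square root, and near 1 to no transformation. Setting `plotit = TRUE` also draws the profile with an approximate 95% interval for $\lambda$.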

Log Transform:

Log transformed histogram, looks a lot better, but we still have some skew:

And the normal QQ plot. It still seems that the residuals are not normally distributed.

Log-transformed residuals vs predicted. It seems we have some decreasing variance now, but I would be willing to accept the constant-variance assumption.

Other transformations I tried: raising regressand to powers 1/2, 1/3 and -1. None of these had satisfactory results; I choose not to include information about these transformations in order to save space, but will happily provide such information should it be requested.

Here lie my questions:

1) Is the solution to this problem simply to keep trying increasingly wacky transformations (e.g. $1/\log(x^{\pi/3})$)?

2) I have been looking (intermittently over a period of weeks) at Generalized Linear Models, which seem to allow a non-normal distribution of residuals. Unfortunately, I have not been able to understand them, and none of my (undergraduate statistics) peers have knowledge of them. If GLMs present a solution to this issue, I would be grateful if someone could explain them in this context. (Even if they are not a solution, I would be grateful for a simple explanation, or a reference to one).

2i) If GLMs are a good fit, I believe I would still need a distribution to model the error by. What ways are there of detecting which family of distributions is the best fit for the residuals, after which I assume I can perform MLE to get the parameters? I've been having issues trying to evaluate heavy-tailed distributions with respect to skew, because they often lack finite moments, and so have infinite or undefined skewness.

3) Is there another class of models not aforementioned I should look into?

4) Is my current model sufficient for prediction intervals, despite the non-normality of residuals?

Some more information about the model: I am predicting a cost, thus the log transform is appealing in that my predicted values are positive reals.
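To make the prediction-interval goal concrete, here is a hedged sketch of how intervals from a log-scale model can be mapped back to the cost scale. It uses simulated data with hypothetical names (`dat`, `f1`, `f2`, `cost`) and assumes the log-scale residuals are roughly normal with constant variance, which is exactly the assumption in doubt above.

```r
set.seed(3)
n   <- 3000
dat <- data.frame(
  f1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  f2 = factor(sample(c("u", "v"), n, replace = TRUE))
)
# Simulated positive "cost" with multiplicative noise (an assumption for illustration).
dat$cost <- exp(3 + 0.5 * as.integer(dat$f1) - 0.2 * as.integer(dat$f2) +
                rnorm(n, sd = 0.6))

fit_log <- lm(log(cost) ~ f1 + f2, data = dat)

# 95% prediction intervals on the log scale; columns are fit, lwr, upr.
pi_log  <- predict(fit_log, newdata = dat, interval = "prediction", level = 0.95)
# exp() is monotone, so the interval endpoints carry over to the cost scale.
pi_cost <- exp(pi_log)

# Empirical check: share of observed costs that fall inside their interval.
coverage <- mean(dat$cost >= pi_cost[, "lwr"] & dat$cost <= pi_cost[, "upr"])
coverage
```

The back-transformed interval endpoints are always positive. Note that `exp()` of the fitted value estimates the conditional median of the cost rather than its mean. If the empirical coverage on held-out data is far from the nominal level, that is a sign the normality assumption matters for the intervals.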

I will be hanging around my computer all day, and have R gui open on my other monitor, so should be able to fulfill most requests for additional information.

AntoniosK · Accepted Answer · 2015-08-21T07:18:27.020


This is a question that any statistician has when building models when prediction is the objective.

You can try many transformations of the dependent variable (log seems reasonable if you deal with positive values) and of the independent ones until you find something successful. That sounds like a lot of combinations, especially when you have many variables. You might also try including interaction terms.
GLMs are a good way to build regression models when you assume a specific link function between the mean of your dependent variable and your independent variables. The most common one is logistic regression (binomial family with a logit link), used when you want your estimate to be a probability (from 0 to 1), but that is of course not suitable for your case.
For your case, try a GLM that performs Poisson regression with a log link function as a starting point. I expect it to be similar to your model with the log-transformed dependent variable (but it's not exactly the same). https://onlinecourses.science.psu.edu/stat504/node/169
If your independent variables are correlated, then a regression tree could be more suitable, as it accounts automatically for interactions and automatically picks the variables that best predict your dependent variable. http://www.statmethods.net/advstats/cart.html
For any method you use, you can always check predictions by plotting actual vs. predicted values, to spot where you underestimate or overestimate.
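A minimal sketch of the comparison suggested above, on simulated data rather than the asker's: a linear model on log(cost) next to a GLM with `family = poisson(link = "log")`, followed by an actual vs. predicted plot. The names `dat`, `f1`, `f2` and `cost` are hypothetical. `suppressWarnings()` is used only because a non-integer response makes the Poisson family emit warnings.

```r
set.seed(4)
n   <- 4000
dat <- data.frame(
  f1 = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
  f2 = factor(sample(c("u", "v"), n, replace = TRUE))
)
dat$cost <- exp(3 + 0.5 * as.integer(dat$f1) - 0.2 * as.integer(dat$f2) +
                rnorm(n, sd = 0.6))

# Transformation approach: log(y) = Xb + error
fit_log <- lm(log(cost) ~ f1 + f2, data = dat)

# GLM approach: E[y] = exp(Xb), Poisson-type variance function
# (warnings about non-integer values are silenced here on purpose)
fit_glm <- suppressWarnings(
  glm(cost ~ f1 + f2, family = poisson(link = "log"), data = dat)
)

# Slopes are on the same (log) scale in both models and tend to be close.
cbind(lm_log = coef(fit_log), glm_pois = coef(fit_glm))

pred_log <- exp(fitted(fit_log))  # back-transformed: targets the median of cost
pred_glm <- fitted(fit_glm)       # targets the mean of cost directly

# Actual vs. predicted on log axes; points above the line are underestimates.
plot(pred_glm, dat$cost, log = "xy", xlab = "Predicted", ylab = "Actual")
abline(0, 1, lty = 2)
```

The two fits usually agree on the effects of the factors but differ in the intercept: the GLM models the mean of the cost, while the back-transformed linear model tracks something closer to its median, so `pred_glm` sits above `pred_log` when the noise is right-skewed.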

edited Aug 21 '15 at 07:18

answered Aug 20 '15 at 23:09


What do you mean by a "poisson link function"? You seem to be conflating two different ideas there. – Glen_b Aug 21 '15 at 01:14
I mean setting family = poisson and link = log, within the glm command. – AntoniosK Aug 21 '15 at 07:16
Does setting family equal to poisson mean that I expect my errors to have such a distribution? Or is it my response? Or am I missing its point entirely? – John Madden Aug 21 '15 at 13:19
Yes, that's it. You specify the family based on what you think your errors' distribution is. Just in case you are interested in differences between setting family = poisson (log link function) and just transforming y to log(y) check this : http://www.r-bloggers.com/do-not-log-transform-count-data-bitches/ . My opinion is to try both and see which is better, but the main difference is : transformation -> log(y) = ax+b+error, GLM -> y=e^(ax+b)+error – AntoniosK Aug 21 '15 at 13:43
Ok gotchya. I tried the GLM with suggested configurations, and it looks a lot better: http://imgur.com/a/4gZGr I'll admit that I don't fully understand the model (how did I even get likelihood for non-integer poisson values???), but it seems to look good so I'll accept the answer. – John Madden Aug 21 '15 at 14:15
There is a lot of stuff going on in the background that I don't know in detail. Clearly, if something is not a non-negative integer it can't be modeled using the actual (let's call it "strict") definition of the Poisson distribution. But there are techniques (approximations maybe?) that allow this kind of GLM to work for non-integers. You get the warnings though, right? You can also apply that GLM to a range of (0,1) if you want. Nothing stops you, even if it intuitively seems wrong. – AntoniosK Aug 21 '15 at 15:00
If you did a non-integer poisson regression, after seeing the warnings, I assume you saw that the AIC of your model output is Inf. This is another way to spot that something is going wrong (behind the scenes) even if it has an acceptable predictive capability. – AntoniosK Aug 21 '15 at 15:12
OK. I guess I'll have to save such queries for when I have a fuller understanding of statistics. I've been getting reasonable prediction intervals though, so thank you for your comments and reply. – John Madden Aug 21 '15 at 17:08
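The behaviour discussed in these comments (warnings and an infinite AIC when a Poisson GLM is fitted to non-integer values) can be reproduced with a small sketch on simulated data. The `quasipoisson` fit at the end is not something suggested in the thread. It is added here as an assumption-labelled aside, because it uses the same estimating equations without evaluating a Poisson likelihood. `dat`, `g` and `cost` are hypothetical names.

```r
set.seed(5)
n   <- 1000
dat <- data.frame(g = factor(sample(c("a", "b", "c"), n, replace = TRUE)))
dat$cost <- exp(2 + 0.4 * as.integer(dat$g) + rnorm(n, sd = 0.5))  # non-integer

# Fit the Poisson GLM while collecting (and muffling) the warnings it raises.
warn_msgs <- character(0)
fit_pois <- withCallingHandlers(
  glm(cost ~ g, family = poisson(link = "log"), data = dat),
  warning = function(w) {
    warn_msgs <<- c(warn_msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
head(warn_msgs, 3)   # warnings raised when the Poisson density is evaluated
AIC(fit_pois)        # Inf: the Poisson density is 0 at non-integer values

# Aside (not from the thread): quasipoisson solves the same estimating
# equations, so the coefficients match, but it defines no likelihood or AIC.
fit_quasi <- glm(cost ~ g, family = quasipoisson(link = "log"), data = dat)
all.equal(coef(fit_pois), coef(fit_quasi))
AIC(fit_quasi)       # NA
```

The coefficients come from iteratively reweighted least squares, which only needs the mean and variance functions, so they are well defined for non-integer responses. Only likelihood-based summaries such as the AIC break down.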
