Impact of regression normality-assumption on model comparison & prediction?

Question

This question is a continuation of the discussion here:

How to test the statistical significance for categorical variable in linear regression?

Following Macro's suggestion, I started a new thread.

The new question is not limited to the inclusion or exclusion of a categorical variable.

It's about general model comparison and prediction.

I found that my data are highly non-normal. In the QQ plot, the curve lies below the straight 45-degree line and is tangent to it. Shape-wise, the curve looks like f(x) = -x^2.

To be precise, not every point is below the 45-degree line. The curve, shaped like f(x) = -x^2, is "tangent" to the line, and by "tangent" I mean that the points around the point of tangency are actually above the line, though only very slightly. Visually speaking, most of the data (~98%) are below the 45-degree line.

This is the residual QQ plot produced by "plot(lmModel)".

Using qqnorm(lmModel$res); qqline(lmModel$res) I got exactly the same curves and lines.

My questions are:

1. If my end goal is to use yhat for prediction on a wide data set, does the non-normality of the data matter?
2. For model comparison, the approaches pointed out by Macro probably won't work. Are there alternative approaches that don't assume a Gaussian distribution?
3. What should I do to fix the non-normality problem? (Data size: 10 variables, 1700 observations.)

Thank you!

Of general interest are two threads: http://stats.stackexchange.com/questions/2492/normality-testing-essentially-useless and http://stats.stackexchange.com/questions/29731/regression-when-the-ols-residuals-are-not-normally-distributed which may be useful - these are mostly related to (1). There are various threads on transformation so you may want to do a search to find something specific, which would mostly address (3). — Macro, Jul 05 '12 at 20:00
Thank you Macro. However the two threads you posted above don't address the concern about prediction. Am I missing anything there? Thank you again! — Luna, Jul 05 '12 at 22:04

score 2 · Answer 1 · answered Jul 05 '12 at 20:51


1. Because you are using least squares, the nonnormality affects the regression coefficients and can hurt prediction. You may want to try a robust regression method.
2. Criteria like AIC and BIC look at closeness of fit penalized by the number of parameters used. I do not think that normality is important when choosing between models using this type of criterion. But keep in mind that if all these models have nonnormal residuals, the fact that they all use least squares may mean that they could all be improved by a more robust fitting technique.
3. If you apply robust regression, you do not have to "fix the nonnormality problem". Finding suitable transformations for the covariates in the model might be another way to fix it, but appropriate transformations may not be apparent.
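
As a rough illustration of the first two points, the following R sketch fits the same formula by least squares and by a robust M-estimator, then compares two nested least-squares models by AIC and BIC. The data are simulated, and the names `dat`, `x1`, `x2`, `fit_ols`, `fit_rob` and `fit_big` are hypothetical. `rlm` comes from the MASS package.

```r
library(MASS)  # provides rlm(); its default is a Huber-type M-estimator

set.seed(1)
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# Simulated response with left-skewed, mean-zero errors (illustration only)
dat$y <- 1 + 2 * dat$x1 + (1 - rexp(n))

fit_ols <- lm(y ~ x1, data = dat)   # least squares
fit_rob <- rlm(y ~ x1, data = dat)  # robust fit of the same formula

# Side-by-side coefficients: a large gap suggests influential observations
coefs <- cbind(ols = coef(fit_ols), robust = coef(fit_rob))
print(coefs)

# Information criteria for two nested least-squares models (smaller is better)
fit_big <- lm(y ~ x1 + x2, data = dat)
ic <- data.frame(model = c("y ~ x1", "y ~ x1 + x2"),
                 AIC = c(AIC(fit_ols), AIC(fit_big)),
                 BIC = c(BIC(fit_ols), BIC(fit_big)))
print(ic)
```

One caveat when reading the output: `AIC()` and `BIC()` on an `lm` fit are computed from the Gaussian log-likelihood, so the comparison is only as trustworthy as that working likelihood.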


Michael R. Chernick



Thank you! Why does the non-normality hurt the prediction? Does it ever "help" the prediction? How does a "robust regression" method help in this case? Thank you again! – Luna Jul 05 '12 at 22:06
@Luna The regression coefficients based on least squares are sensitive to outliers, so the slope of the regression is forced to fit outlying observations. This is because least squares minimizes the sum of squared errors. Robust regression uses different fitting criteria that don't penalize large individual errors as much. For example, one robust method minimizes the sum of the absolute values of the errors, which does not grow as quickly: a term with an absolute error of 2 would have a squared error of 4. – Michael R. Chernick Jul 05 '12 at 22:18
I cannot conceive of a situation where a least squares fit to nonnormal data will improve prediction. – Michael R. Chernick Jul 05 '12 at 22:20
Thank you Mike. But I don't have outliers. It is just that I have highly non-normal data, and I was wondering what an optimal solution is under non-normality and how it affects my yhat. Thanks again! – Luna Jul 05 '12 at 22:22
Nonnormality can occur because of heavy skewness or heavy kurtosis. If you have heavy kurtosis there are probably some outliers in the data. They don't necessarily show up as large residuals. – Michael R. Chernick Jul 05 '12 at 22:33
From the QQ plot, the residuals follow a curve with the shape of f(x) = -x^2, which is a very nice and smooth shape, so these shouldn't be called outliers. Hi Mike, could you please show some math as to why robust regression helps here? Also, >95% of the data are below the 45-degree straight line. Are you suggesting that >95% of my data are outliers? – Luna Jul 05 '12 at 23:21
@Luna The QQ plot of the residuals will not reveal the outliers. If the plot curves away from the 45-degree line, it indicates either heavy or short tails, depending on the direction of the curvature. Scatter plots of the data in the x-y plane, where x is a covariate and y is the response, would be more revealing, and outliers may show up that way. I have no idea whether or not there are any outliers in your data. Regarding the mathematics, it is difficult to show how robust regression helps without actually looking at the two fits. – Michael R. Chernick Jul 05 '12 at 23:57
Take a simple example: $y_i = a + b x_i + e_i$ for $i = 1, 2, \dots, n$, where $e_i$ is the random error component. Ordinary least squares, which is optimal under normality, minimizes $\sum_i w_i$ where $w_i = (y_i - (a + b x_i))^2$. On the other hand, MAD regression (a robust alternative) minimizes $\sum_i v_i$ where $v_i = |y_i - (a + b x_i)|$. – Michael R. Chernick Jul 06 '12 at 00:08
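
To make the two criteria in the last comment concrete, here is a small R sketch that minimizes each objective for a straight-line model. The data are simulated, and the helper names `sse` and `sae` are hypothetical. The absolute-error fit is found with a general-purpose optimizer started at the least-squares solution, so it is only approximate. A dedicated routine such as `rq` in the quantreg package would normally be used instead.

```r
set.seed(2)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n)
y[1:5] <- y[1:5] + 15   # a few contaminated responses (illustration only)

# The two objective functions, with p = c(a, b)
sse <- function(p) sum((y - (p[1] + p[2] * x))^2)   # sum of squared errors
sae <- function(p) sum(abs(y - (p[1] + p[2] * x)))  # sum of absolute errors

ols <- coef(lm(y ~ x))      # exact minimizer of sse
lad <- optim(ols, sae)$par  # approximate minimizer of sae (Nelder-Mead)

print(rbind(ols = ols, lad = lad))
```

In this setup the shifted points pull the least-squares line more than the absolute-error line, because each of them enters the first objective squared.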

score 1 · Answer 2 · answered Aug 05 '12 at 00:35

The residual plot that you describe sounds like a right-skewed distribution. One possibility is to fit a regression model that assumes a right-skewed distribution rather than a normal distribution. The glm function can be used to fit a model with a gamma-distributed response (the gamma distribution is right-skewed).
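
A minimal sketch of such a fit, assuming a strictly positive response and a log link (both are assumptions made for illustration, not something stated in the question). The data are simulated and the names `fit_gamma`, `new_x` and `pred` are hypothetical.

```r
set.seed(3)
n  <- 300
x  <- runif(n)
mu <- exp(0.5 + 1.2 * x)                    # true mean, linear on the log scale
y  <- rgamma(n, shape = 2, rate = 2 / mu)   # positive, right-skewed, mean mu

# Gamma regression with a log link
fit_gamma <- glm(y ~ x, family = Gamma(link = "log"))
print(coef(fit_gamma))

# Predicted means on the original scale of y for new covariate values
new_x <- data.frame(x = c(0.1, 0.5, 0.9))
pred  <- predict(fit_gamma, newdata = new_x, type = "response")
print(pred)
```

With `type = "response"` the predictions are means on the original scale, so no back-transformation is needed.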

Another approach is to transform the data. A log transform of the y-variable, or another Box-Cox transform, can help with skewness.
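
The following R sketch shows one way this could look, using simulated data in which log(y) is linear in x by construction. The names `dat`, `skew`, `bc` and `lambda_hat` are hypothetical. `boxcox` is from the MASS package, and the grid of lambda values is an arbitrary choice.

```r
library(MASS)  # provides boxcox()

set.seed(4)
n   <- 300
dat <- data.frame(x = runif(n))
dat$y <- exp(1 + 2 * dat$x + rnorm(n, sd = 0.5))  # right-skewed on the raw scale

fit_raw <- lm(y ~ x, data = dat)
fit_log <- lm(log(y) ~ x, data = dat)

# Simple moment-based skewness of the residuals, before and after the transform
skew <- function(r) mean((r - mean(r))^3) / sd(r)^3
print(c(raw = skew(resid(fit_raw)), log = skew(resid(fit_log))))

# Profile log-likelihood over the Box-Cox parameter; a value near 0 points to log
bc <- boxcox(y ~ x, data = dat, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)
```

Keep in mind that a model fitted to log(y) predicts on the log scale, so predictions need to be back-transformed with care before they are compared with the original y.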

The biggest problem with skewed data and regression is that the usual tests are based on normality. So you can fit the regression model using regular least squares or robust methods and then, instead of the normal-based tests, use permutation or bootstrap tests that do not depend on normality (but make sure you understand what assumptions you are making).
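
As one possible version of the bootstrap idea, the sketch below resamples rows with replacement, refits the least-squares model each time, and reads a percentile interval off the resampled slopes. The data are simulated, and the names `slope`, `B`, `boot_slopes`, `est` and `ci` are hypothetical. Case resampling assumes the observations are independent.

```r
set.seed(5)
n   <- 200
dat <- data.frame(x = runif(n))
dat$y <- 1 + 2 * dat$x + (rexp(n) - 1)   # skewed, mean-zero errors (illustration only)

# Least-squares slope for a given data frame
slope <- function(d) coef(lm(y ~ x, data = d))[["x"]]

# Case-resampling bootstrap: refit on rows drawn with replacement
B <- 1000
boot_slopes <- replicate(B, slope(dat[sample(n, replace = TRUE), ]))

est <- slope(dat)
ci  <- quantile(boot_slopes, c(0.025, 0.975))  # 95% percentile interval
print(est)
print(ci)
```

An interval that excludes zero plays the role of the usual t-test on the slope, without relying on normal errors.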

For any of these make sure that they make sense with the science and the questions that you are asking.
