This question is a continuation of the discussion here:
How to test the statistical significance for categorical variable in linear regression?
Following Macro's suggestion, I started a new thread.
The new question is not limited to the inclusion or exclusion of a categorical variable.
It is about model comparison and prediction in general.
I found that my data are highly non-normal. In the QQ plot, the curve lies almost entirely below the straight 45-degree line and touches it like a tangent. Shape-wise, the curve looks like f(x) = -x^2.
To be precise, not every point is below the 45-degree line. The curve is "tangent" to the line in the sense that the points around the "tangent" point are actually above it, though only very slightly. Visually speaking, most of the data (~98%) are below the 45-degree line...
This is the residual QQ plot produced by "plot(lmModel)"...
Using qqnorm(lmModel$res); qqline(lmModel$res)
I got exactly the same curve and line.
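To make the shape concrete, here is a minimal sketch that reproduces a similar QQ plot. The data are simulated stand-ins (the variables x, y and the left-skewed noise are assumptions for illustration, not my real data). As far as I understand, plot(lmModel) draws standardized residuals while qqnorm(lmModel$res) uses raw ones, so the axes differ but the shape should match, and the reference line from qqline passes through the first and third quartiles.

```r
# Simulated stand-in for the real data (hypothetical): left-skewed errors
set.seed(1)
n <- 1700
x <- rnorm(n)
err <- -(rexp(n) - 1)          # mean-zero noise with a long left tail
y <- 1 + 2 * x + err
lmModel <- lm(y ~ x)

res <- residuals(lmModel)      # same values as lmModel$res

# Normal QQ plot of the raw residuals
qq <- qqnorm(res)              # returns theoretical (x) and sample (y) quantiles
qqline(res)                    # reference line through the quartiles

# Numerical summary of the asymmetry visible in the plot
skew <- mean((res - mean(res))^3) / sd(res)^3
skew
```

A concave curve of this kind, sitting below the line at both ends, goes together with a negative sample skewness, i.e. a long left tail in the residuals.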
My questions are:
If my end goal is to use yhat for prediction on a wide data set, does the non-normality of the data matter?
For model comparison, the approaches pointed out by Macro probably won't work. Are there alternative approaches that don't assume a Gaussian distribution?
What should I do to fix the non-normality problem? (Data size: 10 variables, 1700 observations.)
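To illustrate the kind of alternative I have in mind for the second question, here is a rough sketch of comparing two models by out-of-sample prediction error with k-fold cross-validation, which does not lean on Gaussian residuals. Everything in it is an assumption for illustration: the simulated data, the variable names x1 and g, and the helper cv_rmse are hypothetical, not part of my actual analysis.

```r
# Hypothetical data: one numeric predictor, one categorical predictor,
# and left-skewed (non-Gaussian) noise
set.seed(2)
n <- 1700
dat <- data.frame(
  x1 = rnorm(n),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
g_effect <- c(a = 0, b = 0.5, c = -0.5)
dat$y <- 1 + 2 * dat$x1 + unname(g_effect[as.character(dat$g)]) - (rexp(n) - 1)

# Hypothetical helper: cross-validated RMSE of an lm() fit for given folds
cv_rmse <- function(formula, data, fold) {
  response <- all.vars(formula)[1]
  sq_err <- numeric(nrow(data))
  for (i in sort(unique(fold))) {
    test <- fold == i
    fit  <- lm(formula, data = data[!test, ])
    pred <- predict(fit, newdata = data[test, ])
    sq_err[test] <- (data[[response]][test] - pred)^2
  }
  sqrt(mean(sq_err))
}

# Use the same 10 folds for both models so the comparison is paired
fold <- sample(rep(1:10, length.out = n))
rmse_small <- cv_rmse(y ~ x1, dat, fold)       # without the categorical variable
rmse_full  <- cv_rmse(y ~ x1 + g, dat, fold)   # with the categorical variable
c(small = rmse_small, full = rmse_full)
```

The model with the lower cross-validated RMSE predicts better on held-out data. Whether this is an appropriate criterion for my case is part of what I am asking.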
Thank you!