So I have this data frame of 26 predictors and 418 observations, which I want to use to predict an outcome that is continuous and normally distributed. The outcome variable looks like this when compared to the theoretical Gaussian distribution:
But for some reason the GLM I’ve created for it does not predict it very well. I used the Gaussian family with an identity link and reduced the number of predictors to only the most significant ones (p < 0.05).
Some cross-validation results:
It seems that the distribution of the predictions is too “thin”: the model doesn’t predict anything outside the interval [3.5, 6.5]. What am I doing wrong? At first I thought that the Gaussian family was not the correct one for this model, but the first picture supports it. It looks like the model squeezes the prediction distribution, because its shape is very similar to the actual one.
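A minimal sketch of why the predictions can look squeezed, using simulated data rather than the data from this question. For a Gaussian GLM with identity link (ordinary least squares), the in-sample variance of the fitted values equals R² times the variance of the outcome, so the fitted values are always less spread out than the outcome unless the fit is perfect. The coefficients, the single predictor and the noise level below are assumptions chosen only for illustration, and under cross-validation the relationship holds only approximately.

```python
import random
import statistics

random.seed(0)
n = 418  # same number of observations as in the question; the data itself is simulated

# Hypothetical data: one predictor, Gaussian noise, outcome centred near 5.
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [5.0 + 0.6 * xi + random.gauss(0.0, 1.0) for xi in x]

# Ordinary least squares fit, which is what a Gaussian/identity GLM estimates.
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Variance decomposition: Var(y) = Var(fitted) + Var(residuals) in-sample.
var_y = statistics.pvariance(y)
var_fitted = statistics.pvariance(fitted)
var_resid = statistics.pvariance(residuals)
r_squared = var_fitted / var_y

print(f"variance of outcome:       {var_y:.3f}")
print(f"variance of fitted values: {var_fitted:.3f}")
print(f"variance of residuals:     {var_resid:.3f}")
print(f"R squared:                 {r_squared:.3f}")
print(f"range of outcome:          [{min(y):.2f}, {max(y):.2f}]")
print(f"range of fitted values:    [{min(fitted):.2f}, {max(fitted):.2f}]")
```

The fitted values estimate the conditional mean of the outcome, not the outcome itself, so a narrower histogram of predictions is expected rather than a sign that the Gaussian family is wrong.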
EDIT 2016/10/20: After some research and tons of Google searches I came across this question and its answer: Assumptions of generalised linear model. It helped me a lot in understanding this subject. Apparently, for a GLM the distribution of the dependent variable itself is not that important to analyse; what matters is homoscedasticity (constant error variance) and the normality of the residuals.
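A rough sketch of how the two residual checks mentioned in the edit could be done numerically, again on simulated data and with an ordinary least squares fit standing in for the Gaussian/identity GLM. The split into low and high fitted values and the moment-based normality summary are simple illustrative choices, not formal tests; in practice one would usually look at a residuals-versus-fitted plot and a Q-Q plot of the residuals.

```python
import random
import statistics

random.seed(1)
n = 418  # matches the number of observations in the question; the data is simulated

# Hypothetical data with constant-variance Gaussian noise.
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [5.0 + 0.6 * xi + random.gauss(0.0, 1.0) for xi in x]

# Least squares fit and residuals.
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Homoscedasticity: compare residual variance for low and high fitted values.
# A ratio far from 1 would suggest the error variance changes with the mean.
order = sorted(range(n), key=lambda i: fitted[i])
low = [residuals[i] for i in order[: n // 2]]
high = [residuals[i] for i in order[n // 2:]]
variance_ratio = statistics.pvariance(high) / statistics.pvariance(low)

# Normality: skewness and excess kurtosis of the standardised residuals.
# Both are close to 0 for Gaussian residuals.
resid_sd = statistics.pstdev(residuals)
z = [r / resid_sd for r in residuals]  # residual mean is ~0 for a fit with intercept
skewness = statistics.mean([zi ** 3 for zi in z])
excess_kurtosis = statistics.mean([zi ** 4 for zi in z]) - 3.0

print(f"residual variance ratio (high/low fitted): {variance_ratio:.3f}")
print(f"residual skewness:                         {skewness:.3f}")
print(f"residual excess kurtosis:                  {excess_kurtosis:.3f}")
```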