When I am developing a predictive statistical model, why do I need to ensure the error is normally distributed? (I have a very small statistical background, so I apologize in advance if this is a very, very basic question).

- Could you please cite where you read that this is important? I dispute the claim and want to see the context. – Dave Sep 09 '20 at 15:39
- Closely related threads, which provide many answers, include https://stats.stackexchange.com/questions/16381, https://stats.stackexchange.com/questions/148803, https://stats.stackexchange.com/questions/86835, https://stats.stackexchange.com/questions/395011, *etc.* – whuber Sep 09 '20 at 15:46
- A predictive model has a somewhat different nuance than regression. Why close? – BigBendRegion Sep 09 '20 at 16:01
- I second what @BigBendRegion said, but will nevertheless read the indicated question and the other questions linked in the comments closely. Thank you :) – Johanna Sep 09 '20 at 16:43
- The suggested answers really are not adequate, since the focus of predictive modeling is different, so I will provide answers in replies. – BigBendRegion Sep 09 '20 at 18:52
- The conditional distributions of the target variable do matter a great deal for predictive modeling. In the process of checking for normality, you may find obvious indications of non-normality that point to alternative models and/or methods. Examples: – BigBendRegion Sep 09 '20 at 18:53
- 1. The data are very discrete. In the most extreme case, the data have only two possible values, in which case you should use logistic regression for your predictive model. Similarly, with only a small number of ordinal values you should use ordinal regression, and with only a small number of nominal values you should use multinomial regression. – BigBendRegion Sep 09 '20 at 18:53
- 2. The data are censored. You might realize, in the process of investigating normality, that there is an upper bound. In some cases the upper bound is not really data, just an indication that the true value is higher. In this case, ordinary predictive models must not be used because of gross biases; censored-data models must be used instead. – BigBendRegion Sep 09 '20 at 18:54
- 3. In the process of investigating normality (e.g., using q-q plots) it may become apparent that there are occasional extreme outlier observations (part of the process you are studying) that will grossly affect ordinary predictive models. In such cases it would be prudent to use a predictive model that minimizes something other than squared error, such as median regression, or (the negative of) a likelihood function that assumes a heavy-tailed distribution. Similarly, you should evaluate predictive ability in such cases using something other than squared error. – BigBendRegion Sep 09 '20 at 18:54
- 4. If you do use an ordinary predictive model, you will often want to bound the prediction error in some way for any particular prediction. The usual 95% bound $\hat Y \pm 1.96 \hat \sigma$ is valid for normal distributions (assuming that $\hat \sigma$ correctly estimates the conditional standard deviation), but not otherwise. With non-normal conditional distributions the interval should be asymmetric and/or a different multiplier is needed (see the sketch after this comment thread). – BigBendRegion Sep 09 '20 at 18:54
- All that having been said, there is no "thou shalt check normality" commandment. You don't have to do it at all. It's just that in certain cases you can do better by using alternative methods when the conditional distributions are grossly non-normal. – BigBendRegion Sep 09 '20 at 18:54
- Thank you for your answer, @BigBendRegion! – Johanna Sep 09 '20 at 20:47
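Not part of the original thread: the following is a minimal Python sketch of point 4 above, using an arbitrarily chosen simulated dataset with skewed (shifted-exponential) errors. It compares the symmetric normal-theory interval $\hat Y \pm 1.96 \hat \sigma$ with an asymmetric interval built from empirical residual quantiles; the simulation details are assumptions for illustration only.

```python
# Sketch: symmetric vs. asymmetric prediction intervals under skewed errors.
# All data-generating choices (exponential noise, slope 2.0, n, etc.) are
# illustrative assumptions, not anything prescribed in the thread.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.exponential(scale=2.0, size=n) - 2.0  # skewed, mean-zero errors

# Fit a simple linear predictive model on a training split.
idx = np.arange(n)
train, test = idx < n // 2, idx >= n // 2
b1, b0 = np.polyfit(x[train], y[train], deg=1)     # slope, intercept
resid = y[train] - (b0 + b1 * x[train])
pred = b0 + b1 * x[test]

# Symmetric normal-theory 95% interval.
s = resid.std(ddof=2)
lo_n, hi_n = pred - 1.96 * s, pred + 1.96 * s

# Asymmetric interval from empirical residual quantiles.
q_lo, q_hi = np.quantile(resid, [0.025, 0.975])
lo_q, hi_q = pred + q_lo, pred + q_hi

for name, (lo, hi) in {"normal-theory": (lo_n, hi_n),
                       "quantile-based": (lo_q, hi_q)}.items():
    below = np.mean(y[test] < lo)
    above = np.mean(y[test] > hi)
    print(f"{name:15s} coverage={1 - below - above:.3f}  "
          f"missed below={below:.3f}  missed above={above:.3f}")
```

With skewed errors the symmetric interval tends to pile essentially all of its misses on one side while wasting width on the other, whereas the quantile-based interval splits its misses roughly evenly, which is the asymmetry the comment is pointing at.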
1 Answer
Normal errors are much more important to inference (hypothesis testing and confidence intervals) than to prediction. See my recent answer here. Depending on the model, assuming normal errors might be ridiculous, e.g. in logistic regression, where the output is a probability in $[0,1]$ and the truth is either $0$ or $1$ (so the errors lie in $[-1,1]$ and cannot be normal).
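Not from the answer itself: a small sketch, under assumed simulated data, of why such residuals cannot be normal; the sklearn usage and data-generating choices are my own illustrative assumptions.

```python
# Sketch: residuals from a logistic model are bounded (and bimodal),
# so they cannot be normally distributed. Simulated data for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 1))
p_true = 1 / (1 + np.exp(-2 * X[:, 0]))   # true success probability
y = rng.binomial(1, p_true)               # binary target

prob = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
resid = y - prob                          # every residual falls in [-1, 1]

print(resid.min(), resid.max())
```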
When you're making predictions, the evidence that your model is good is whether or not it makes accurate predictions on unseen data. This is the legendary "out-of-sample" test or validation data (the two aren't synonyms, but they are related in that the model being developed never sees those data during training... think of not showing students the exam questions while they study from your old exams).
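A minimal sketch of that out-of-sample idea, assuming a toy simulated dataset and an ordinary linear model; the specific dataset, model, and metric are illustrative choices, not anything the answer prescribes.

```python
# Sketch: judge a predictive model by its accuracy on held-out data that
# it never saw during training. Data and model choices are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(1000, 1))
y = 3 * X[:, 0] + rng.standard_t(df=3, size=1000)   # heavy-tailed noise

# The test fold plays the role of "exam questions the students never saw".
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

print("held-out MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```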
