When I am developing a predictive statistical model, why do I need to ensure the error is normally distributed? (I have a very small statistical background, so I apologize in advance if this is a very, very basic question).

- Could you please cite where you read that this is important? I dispute the claim and want to see the context. – Dave Sep 09 '20 at 15:39
- Closely related threads, which provide many answers, include https://stats.stackexchange.com/questions/16381, https://stats.stackexchange.com/questions/148803, https://stats.stackexchange.com/questions/86835, https://stats.stackexchange.com/questions/395011, *etc.* – whuber Sep 09 '20 at 15:46
- A predictive model has a somewhat different nuance than regression. Why close? – BigBendRegion Sep 09 '20 at 16:01
- I second what @BigBendRegion said, but will nevertheless read the indicated question and the other questions linked in the comments closely. Thank you :) – Johanna Sep 09 '20 at 16:43
- The suggested answers really are not adequate, since the focus of predictive modeling is different, so I will provide answers in replies. – BigBendRegion Sep 09 '20 at 18:52
- The conditional distributions of the target variable do matter a great deal for predictive modeling. In the process of checking for normality, you may find obvious indications of non-normality that point to alternative models and/or methods. Examples: – BigBendRegion Sep 09 '20 at 18:53
- 1. The data are very discrete. In the most extreme case, the data have only two possible values, in which case you should use logistic regression for your predictive model. Similarly, with only a small number of ordinal values you should use ordinal regression, and with only a small number of nominal values you should use multinomial regression. – BigBendRegion Sep 09 '20 at 18:53
- 2. The data are censored. You might realize, in the process of investigating normality, that there is an upper bound. In some cases the upper bound is not really data, just an indication that the true value is higher. In this case, ordinary predictive models must not be used because of gross biases; censored-data models must be used instead. – BigBendRegion Sep 09 '20 at 18:54
- 3. In the process of investigating normality (e.g., using q-q plots) it may become apparent that there are occasional extreme outlier observations (part of the process you are studying) that will grossly affect ordinary predictive models. In such cases it would be prudent to use a predictive model that minimizes something other than squared error, such as median regression, or (the negative of) a likelihood function that assumes a heavy-tailed distribution. Similarly, you should evaluate predictive ability in such cases using something other than squared error. – BigBendRegion Sep 09 '20 at 18:54
- 4. If you do use an ordinary predictive model, you will often want to bound the prediction error in some way for any particular prediction. The usual 95% bound $\hat Y \pm 1.96 \hat \sigma$ is valid for normal distributions (assuming that $\hat \sigma$ correctly estimates the conditional standard deviation), but not otherwise. With non-normal conditional distributions the interval should be asymmetric and/or a different multiplier is needed (see the sketch after this comment thread). – BigBendRegion Sep 09 '20 at 18:54
- All that having been said, there is no "thou shalt check normality" commandment. You don't have to do it at all. It's just that in certain cases you can do better by using alternative methods when the conditional distributions are grossly non-normal. – BigBendRegion Sep 09 '20 at 18:54
- Thank you for your answer, @BigBendRegion! – Johanna Sep 09 '20 at 20:47
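Not part of the original thread: the following is a minimal Python sketch of point 4 above, using an arbitrarily chosen simulated dataset with skewed (shifted-exponential) errors. It compares the symmetric normal-theory interval $\hat Y \pm 1.96 \hat \sigma$ with an asymmetric interval built from empirical residual quantiles; the simulation details are assumptions for illustration only.

```python
# Sketch: symmetric vs. asymmetric prediction intervals under skewed errors.
# All data-generating choices (exponential noise, slope 2.0, n, etc.) are
# illustrative assumptions, not anything prescribed in the thread.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.exponential(scale=2.0, size=n) - 2.0  # skewed, mean-zero errors

# Fit a simple linear predictive model on a training split.
idx = np.arange(n)
train, test = idx < n // 2, idx >= n // 2
b1, b0 = np.polyfit(x[train], y[train], deg=1)     # slope, intercept
resid = y[train] - (b0 + b1 * x[train])
pred = b0 + b1 * x[test]

# Symmetric normal-theory 95% interval.
s = resid.std(ddof=2)
lo_n, hi_n = pred - 1.96 * s, pred + 1.96 * s

# Asymmetric interval from empirical residual quantiles.
q_lo, q_hi = np.quantile(resid, [0.025, 0.975])
lo_q, hi_q = pred + q_lo, pred + q_hi

for name, (lo, hi) in {"normal-theory": (lo_n, hi_n),
                       "quantile-based": (lo_q, hi_q)}.items():
    below = np.mean(y[test] < lo)
    above = np.mean(y[test] > hi)
    print(f"{name:15s} coverage={1 - below - above:.3f}  "
          f"missed below={below:.3f}  missed above={above:.3f}")
```

With skewed errors the symmetric interval tends to pile essentially all of its misses on one side while wasting width on the other, whereas the quantile-based interval splits its misses roughly evenly, which is the asymmetry the comment is pointing at.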
1 Answer
Normal errors are much more important to inference (hypothesis testing and confidence intervals) than to prediction. See my recent answer here. Depending on the model, assuming normal errors might be ridiculous, e.g. in logistic regression, where the output is a probability in $[0,1]$ and the truth is either $0$ or $1$ (so the errors lie in $[-1,1]$ and cannot be normal).
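Not from the answer itself: a small sketch, under assumed simulated data, of why such residuals cannot be normal; the sklearn usage and data-generating choices are my own illustrative assumptions.

```python
# Sketch: residuals from a logistic model are bounded (and bimodal),
# so they cannot be normally distributed. Simulated data for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 1))
p_true = 1 / (1 + np.exp(-2 * X[:, 0]))   # true success probability
y = rng.binomial(1, p_true)               # binary target

prob = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
resid = y - prob                          # every residual falls in [-1, 1]

print(resid.min(), resid.max())
```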
When you're making predictions, the evidence that your model is good is whether or not it makes accurate predictions on unseen data. This is the legendary "out-of-sample" test or validation data (the two aren't synonyms, but they are related in that the model being developed never sees those data during training... think of not showing students the exam questions while they study from your old exams).
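A minimal sketch of that out-of-sample idea, assuming a toy simulated dataset and an ordinary linear model; the specific dataset, model, and metric are illustrative choices, not anything the answer prescribes.

```python
# Sketch: judge a predictive model by its accuracy on held-out data that
# it never saw during training. Data and model choices are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(1000, 1))
y = 3 * X[:, 0] + rng.standard_t(df=3, size=1000)   # heavy-tailed noise

# The test fold plays the role of "exam questions the students never saw".
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

print("held-out MAE:", mean_absolute_error(y_te, model.predict(X_te)))
```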
