
As the title suggests, I'm curious to know why we base most of regression theory on the assumption of normality in the response (and of the residuals). I understand that we have GLMs, where the distribution of the response is a member of the exponential family. But why do we use the normal distribution rather than any of the other exponential-family distributions? And is there a name for models that use non-exponential-family distributions?

Bill

1 Answer


Why is most of the regression theory based on the assumption of normality?

It isn't.

Although some pedagogical approaches make that seem to be the case,* the only place where normality really comes into play in linear regression is in assigning parametric p-values and confidence limits. The Gauss-Markov theorem shows that linear regression provides best linear unbiased estimates (BLUE) whether or not responses or residuals are normally distributed. All you need is that the error terms around the linear model have zero mean, are uncorrelated with each other, and have constant variance.
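As a quick, minimal simulation sketch of that point (the dataset, numbers, and variable names here are all illustrative, not from any particular source): fit OLS repeatedly on data whose errors are skewed but zero-mean with constant variance, and check that the slope estimates stay centered on the true value.

```python
# Sketch: OLS slope estimates remain unbiased under skewed (non-normal)
# errors, as long as the errors have zero mean, constant variance, and
# are uncorrelated. All values below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, true_beta, n_sims = 200, 2.5, 5000
estimates = np.empty(n_sims)

for i in range(n_sims):
    x = rng.uniform(0, 10, size=n)
    # Centered exponential errors: zero mean, constant variance, but skewed.
    eps = rng.exponential(scale=1.0, size=n) - 1.0
    y = 1.0 + true_beta * x + eps
    # OLS fit of y on [1, x] via least squares.
    X = np.column_stack([np.ones(n), x])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    estimates[i] = beta_hat[1]

print(np.mean(estimates))  # close to 2.5: unbiased despite skewed errors
```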

Some statistical tests, for example t-tests on coefficient values, start from an assumption of normally distributed error terms. But what's really critical for those tests is that the sampling distribution of the coefficient values is close enough to normal, and you don't always need normally distributed error terms for that. See for example this brief discussion of the robustness of t-tests when error terms aren't normal, along with alternative approaches for such cases. For further reading, see the links provided in a comment on your question, or this discussion about whether normality testing is essentially useless.
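To make the point about sampling distributions concrete, here is a hedged simulation sketch (mine, not from the linked discussions): even with heavily skewed error terms, the distribution of the OLS slope across repeated samples comes out close to normal, which is what the t-test actually leans on.

```python
# Sketch: the sampling distribution of the OLS slope is approximately
# normal (a central-limit effect) even when the errors are skewed.
import numpy as np

rng = np.random.default_rng(1)
n, n_sims = 100, 10000
slopes = np.empty(n_sims)
x = rng.uniform(0, 10, size=n)           # fixed design across simulations
X = np.column_stack([np.ones(n), x])

for i in range(n_sims):
    eps = rng.exponential(1.0, size=n) - 1.0   # skewed, zero-mean errors
    y = 1.0 + 2.5 * x + eps
    slopes[i] = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Standardize and compare empirical quantiles to the normal benchmark.
z = (slopes - slopes.mean()) / slopes.std()
print(np.quantile(z, [0.025, 0.975]))  # close to ±1.96 if nearly normal
```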


*For example, you might sense an over-emphasis on normality if you start learning about linear regression from the viewpoint of correlation coefficients, in which the standard significance tests assume bivariate normality.

EdM
  • In a given example, how can one verify whether the sampling distribution of a regression coefficient is close to normal? Bootstrap? (See the sketch after these comments.) – JTH Nov 20 '20 at 13:36
  • @JTH Basically you want to know how the estimator is distributed in order to estimate the accuracy. For a linear estimator the estimate is a linear sum of the observations $y_i$, so the estimator will have a related variance (variances [add up](https://en.m.wikipedia.org/wiki/Variance#Properties) *independently* of the underlying distribution). This variance of the estimator does not depend on whether the variable is normally distributed, so there is no *need* to verify that the sampling distribution of a coefficient is close to normal. – Sextus Empiricus Nov 20 '20 at 13:52
  • @JTH one might argue that bootstrapping is worth doing in any event, for validating results to detect optimism and bias and for calibrating the model, while you get direct measures of sampling (co)variances of coefficient estimates. – EdM Nov 20 '20 at 14:00
  • Hmm, isn't the Gauss-Markov theorem nearly useless? The interesting estimators are not linear functions of the data, and they are not unbiased. So non-normality is quite an important indicator that you can do much better than OLS. – BigBendRegion Nov 21 '20 at 19:43
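Regarding the bootstrap question in JTH's comment above, here is a minimal case-resampling sketch (the data are simulated and every name and number is illustrative): resample (x, y) pairs with replacement, refit OLS each time, and inspect the resulting distribution of the slope.

```python
# Sketch: case-resampling bootstrap of a regression slope to inspect
# its sampling distribution. Illustrative data only.
import numpy as np

rng = np.random.default_rng(2)
n = 150
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.5 * x + rng.exponential(1.0, size=n) - 1.0  # one observed sample

boot_slopes = np.empty(4000)
for b in range(boot_slopes.size):
    idx = rng.integers(0, n, size=n)   # resample (x, y) pairs with replacement
    Xb = np.column_stack([np.ones(n), x[idx]])
    boot_slopes[b] = np.linalg.lstsq(Xb, y[idx], rcond=None)[0][1]

# Near-symmetric quantiles around the median suggest the sampling
# distribution of the slope is close to normal.
print(np.quantile(boot_slopes, [0.025, 0.5, 0.975]))
```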