Let us consider two column vectors of random variables ${\bf Y} = (Y_1,\ldots,Y_n)^{\intercal}$ and ${\bf X} = (X_1,\ldots,X_m)^{\intercal}$.

The general linear regression model is written as \begin{equation} Y_i = \alpha_i + \sum_{j=1}^m A_{ij}X_j + \epsilon_i \qquad i = 1,\ldots,n~, \end{equation} where $\alpha_i$ are the intercepts, $A$ is an $n\times m$ matrix of regression coefficients, and $\epsilon_i$ are the residuals of the model.

Why are these residuals required to be Gaussian, $\epsilon_i \sim \mathcal{N}(0,\sigma^2_i)$? Some sources state this hypothesis explicitly, others do not. I have read other answers saying that it is needed for inference; could someone explain what that means?

AbateFaria
  • See https://stats.stackexchange.com/questions/16381/what-is-a-complete-list-of-the-usual-assumptions-for-linear-regression – Tim Dec 26 '21 at 11:52
  • A simple answer is that you need this assumption to apply the usual prediction intervals. With non-normal distributions, the intervals should be asymmetric and/or use different critical values. Another answer is that the usual OLS estimators are less accurate than other estimators when the distributions are non-normal. – BigBendRegion Dec 26 '21 at 12:20
  • @BigBendRegion are you referring to the confidence intervals of the regression coefficients? – AbateFaria Dec 26 '21 at 16:50
  • No, the prediction intervals for an individual value; e.g., what the return on Microsoft stock will be on a day when the Nasdaq is up 1%. – BigBendRegion Dec 26 '21 at 19:40

1 Answer

Let's look at a simple example with one dependent variable and one independent variable. Let's say the independent variable is "male age" and the dependent variable is "height", so the model describes how tall a male is given his age.

Essentially, what you are trying to do with a linear regression model is estimate the mean of the dependent variable at each value of the independent variable. So here, for each age, we want the average height. Now suppose you collect a sample of this data, say by measuring a group of 100 thirty-year-old males. They aren't all going to be the same height: their heights will range between, say, 5'5" and 6'5" in your sample, but most of the group will be close to an average of about 5'10". What you will typically see is that the heights in your sample are approximately normally distributed around the average height for a 30-year-old. The errors are the differences between the average height and the sampled heights, so that's where the normal errors come from.

Here is a bad drawing, but hopefully it makes sense.

[Image: hand-drawn sketch illustrating the example above]
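
As a rough illustration of the height example, here is a minimal Python sketch (standard library only) that simulates heights scattered normally around a linear mean and fits a line by ordinary least squares. The numbers used (intercept 80 cm, slope 6 cm per year, scatter 5 cm, ages 2 to 16) and the helper `fit_line` are made up for the example; they are not real growth data.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical "true" model, chosen only for illustration:
# mean height (cm) = 80 + 6 * age, with Gaussian scatter of 5 cm.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 80.0, 6.0, 5.0

ages = [rng.uniform(2, 16) for _ in range(2000)]
heights = [TRUE_INTERCEPT + TRUE_SLOPE * a + rng.gauss(0, NOISE_SD) for a in ages]

intercept, slope = fit_line(ages, heights)

# Residuals: observed height minus the fitted mean height at that age.
residuals = [h - (intercept + slope * a) for a, h in zip(ages, heights)]

print(f"fitted intercept = {intercept:.2f}, fitted slope = {slope:.2f}")
print(f"residual mean = {statistics.fmean(residuals):.3f}")
print(f"residual sd   = {statistics.stdev(residuals):.2f}")
```

The fitted line recovers the mean height at each age, and the residuals are just the simulated scatter around that mean, which is why they inherit its normal shape.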

Whether we assume the errors are normally distributed depends on whether we believe the data are normally distributed around the mean at each value of the independent variable (e.g. age 30). The point is that our assumption about the errors comes from what we expect the distribution of the data to look like if we could see all of the possible data. If for some reason you didn't believe the data should be normally distributed about the mean, then you would not assume that the errors are normally distributed.

The key is to take a step back and pick the tool that you think will model the data best. A standard linear regression with normally distributed errors is designed for data whose mean is a linear function of the predictors and whose values are normally distributed around that mean. So we pick linear regression when we have data that fits the assumptions of a linear regression. If not, we might think about finding another tool to model the data instead.
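
One rough way to judge whether the normal-error assumption fits is to look at the shape of the errors, for instance their skewness, which is zero for a normal distribution. The sketch below is only an illustration under made-up assumptions: the helper `skewness` is written by hand, and the shifted exponential is just one hypothetical example of a skewed error distribution. In practice you would apply this kind of check to the residuals of your fitted model.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

rng = random.Random(1)  # fixed seed so the run is reproducible
n = 5000

# Errors that satisfy the normality assumption.
gaussian_errors = [rng.gauss(0, 1) for _ in range(n)]

# Hypothetical skewed alternative: exponential errors shifted to mean zero.
skewed_errors = [rng.expovariate(1.0) - 1.0 for _ in range(n)]

print(f"skewness of Gaussian errors: {skewness(gaussian_errors):.2f}")
print(f"skewness of skewed errors:   {skewness(skewed_errors):.2f}")
```

A skewness far from zero is a hint, not a proof, that the normal-error model is a poor description of the data.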

ryan132442
  • My confusion stems from the following: if, in your example, I had used Money instead of Height, then even if the model is linear over a certain range of ages, $\text{Money} \sim c +\beta \text{Age}$, I would expect the distribution around the mean not to be normal (some power law, I suspect), so the errors would be non-Gaussian. Is it possible to fit a regression in such situations? – AbateFaria Dec 26 '21 at 12:30
  • So in the case of money you might expect some observations to be extremely far away from your regression line, and as age increases you would also expect the large gaps to get larger (heteroscedasticity: non-constant variance). So to use a linear regression you would need to look for a transformation that can squash the extreme values (such as a log transform) and you need to find a method that can deal with the heteroscedasticity. It might be the case that you can't find a transformation that allows you to use a linear regression, but in that case, you could find another type of model... – ryan132442 Dec 26 '21 at 12:50
  • ...designed for cases like that. The general idea is to figure out how the assumptions of the model you want to use are violated (if they are violated), and then search for additional tools that are designed to deal with those cases. The reason we prefer to transform if possible, rather than use another type of model, is that linear regression is easy to understand and interpret. It's just not always possible to use one. – ryan132442 Dec 26 '21 at 12:52
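
The log-transform idea from the comments above can be sketched as follows. This is a minimal illustration under stated assumptions, not a recipe: the data are simulated so that log(money) is linear in age with Gaussian noise (the coefficients 9.0, 0.03 and the noise level 0.6 are invented), and `fit_line`, `skewness` and `residuals_of_fit` are hypothetical helpers written for this example.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def skewness(values):
    """Sample skewness: mean of the cubed standardized deviations."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - m) / sd) ** 3 for v in values])

def residuals_of_fit(x, y):
    """Fit a line to (x, y) and return observed minus fitted values."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

rng = random.Random(2)  # fixed seed so the run is reproducible

# Hypothetical data: log(money) is linear in age with Gaussian noise, so
# money itself is right-skewed and its spread grows with age.
ages = [rng.uniform(20, 60) for _ in range(3000)]
money = [math.exp(9.0 + 0.03 * a + rng.gauss(0, 0.6)) for a in ages]

raw_resid = residuals_of_fit(ages, money)
log_resid = residuals_of_fit(ages, [math.log(m) for m in money])

print(f"residual skewness, raw money: {skewness(raw_resid):.2f}")
print(f"residual skewness, log money: {skewness(log_resid):.2f}")
```

Here the transform works because the data were built to be log-linear; with real data there is no guarantee that a log (or any other) transform will produce well-behaved residuals.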