
The book *Introduction to Categorical Data Analysis* (Agresti, 2007) says:

Historically, early analyses of nonnormal responses often attempted to transform Y so it is approximately normal, with constant variance. Then, ordinary regression methods using least squares are applicable. In practice, this is difficult to do. With the theory and methodology of GLMs, it is unnecessary to transform data so that methods for normal responses apply. This is because the GLM fitting process uses ML methods for our choice of random component, and we are not restricted to normality for that choice.

I understand that when we have a binary or count response variable, we can use other link functions to do logistic regression or Poisson regression. But how do we deal with a linear regression response variable that is far from Gaussian (still a continuous number, but skewed and with many outliers)?

If we no longer transform the response to be approximately Gaussian, does that mean the coefficient estimates are still good, but the standard errors, $t$-values, and $p$-values are not valid?
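For concreteness, here is a minimal sketch (Python with statsmodels; the data-generating process, coefficients, and variable names are made up for illustration) contrasting the two approaches the quote mentions: transforming a skewed positive response and then running OLS, versus fitting a Gamma GLM with a log link directly, with no transformation:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 2, n)
X = sm.add_constant(x)

# Positive, right-skewed continuous response: Gamma-distributed with log-linear mean
mu = np.exp(0.5 + 1.0 * x)
y = rng.gamma(shape=2.0, scale=mu / 2.0)   # E[y] = mu, skewed around it

# Option 1: transform the response, then use ordinary least squares
ols_log = sm.OLS(np.log(y), X).fit()

# Option 2: model y directly with a Gamma GLM and a log link
glm_gamma = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log())).fit()

print(ols_log.params)     # coefficients in a model for E[log y]
print(glm_gamma.params)   # coefficients in a model for log E[y]
```

Note that the two fits answer slightly different questions: OLS on $\log y$ models $\text{E}(\log Y)$, while the Gamma GLM models $\log \text{E}(Y)$.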

Richard Hardy
Haitao Du
  • Possible duplicate of [Linear regression with strongly non-normal response variable](http://stats.stackexchange.com/questions/74372/linear-regression-with-strongly-non-normal-response-variable) – kjetil b halvorsen Feb 11 '17 at 17:21
  • @kjetilbhalvorsen thanks for the comment; so does it mean that the coefficient estimates are good, but the standard errors, $t$-values and $p$-values are not valid? – Haitao Du Feb 11 '17 at 17:22
  • First, look at the distribution of the residuals, not of $Y$ itself! Then worry first about model structure: are the variables correct, are the effects linear, and so on? Normality of residuals is the last thing to bother with. Also, what Agresti refers to in the quote is that nowadays we have other methods, such as GLMs, which are often better alternatives to transforming the response. Better to ask a question about your real modelling problem! (A minimal residual-diagnostic sketch follows these comments.) – kjetil b halvorsen Feb 11 '17 at 17:24
  • Look up the Gauss Markov theorem. – G. Grothendieck Feb 11 '17 at 17:55
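Following up on the residuals comment above, a minimal diagnostic sketch (Python with statsmodels and matplotlib; the data here are simulated placeholders, not from any real problem) showing how to inspect the residuals rather than $Y$ itself:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 200)   # placeholder data

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Diagnose the residuals, not y: QQ plot against the normal distribution
sm.qqplot(fit.resid, line="45", fit=True)
plt.title("QQ plot of OLS residuals")
plt.show()
```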

1 Answer


As @Repmat points out in a comment, the Gauss–Markov theorem states that the OLS estimators are unbiased and have the lowest variance among all linear unbiased estimators, even when the disturbance term is not normal. So the point estimates $\hat\beta_0,\dotsc,\hat\beta_k$ should be good, regardless of the probability distribution of $Y$.
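A small simulation sketch of this point (Python with NumPy; the sample size, error distribution, and coefficients are arbitrary choices for illustration): with heavily skewed, mean-zero errors, the OLS slope estimate still averages out to the true value.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, beta = 30, 5000, 2.0
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])

est = np.empty(reps)
for r in range(reps):
    # heavily skewed, mean-zero errors: shifted exponential
    eps = rng.exponential(1.0, n) - 1.0
    y = 1.0 + beta * x + eps
    est[r] = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(est.mean())  # close to 2.0: the OLS slope is unbiased despite non-normal errors
```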

On the other hand, the finite-sample distribution of the least squares estimators is exactly Normal if and only if the responses $Y_1,\dotsc,Y_n$ (given the design matrix) are independent Normal random variables. This means that when the response is not Normal, the $t$-statistic does not follow an exact Student-$t$ distribution in small samples, so individual significance tests and the usual confidence intervals rest on the normality assumption there (in large samples, asymptotic normality takes over).
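A sketch of what this means in practice (Python with NumPy and SciPy; the setup is arbitrary): simulate a small sample under the null of a zero slope with skewed errors, and check how often the nominal 5% $t$-test rejects. The empirical rejection rate need not match the nominal level exactly in small samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 10, 20000
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

rej = 0
crit = stats.t.ppf(0.975, df=n - 2)   # two-sided 5% critical value
for r in range(reps):
    eps = rng.exponential(1.0, n) - 1.0   # skewed, mean-zero errors
    y = 1.0 + 0.0 * x + eps               # true slope is zero
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)
    se = np.sqrt(s2 * XtX_inv[1, 1])
    rej += abs(b[1] / se) > crit

print(rej / reps)  # compare with the nominal level 0.05
```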

Ernest A
  • Inference works regardless of any assumed distribution on outcome, the conditional outcome, the residuals, and the explanatory variables. This follows directly from the Gauss Markov theorem. You don't need normality for OLS to be useful – Repmat Feb 11 '17 at 18:30
  • First of all, the distributional assumptions are made on model errors, not on the dependent variable; so your first sentence is wrong. Second, in large samples distributional assumptions do not matter as we have asymptotic normality. Third, what other consequences are there? The need for normality is precisely for small sample inference. – Richard Hardy Feb 11 '17 at 20:56
  • @RichardHardy As far as I know, normality of the errors implies normality of the response, so I don't really know what to make of your comment. – Ernest A Feb 11 '17 at 21:22
  • No, normality of errors does not imply normality of response. Take the $X$ matrix to consist of a single column which is a linear trend or a dummy variable; in both cases $Y$ will be nonnormal. – Richard Hardy Feb 12 '17 at 07:27
  • @RichardHardy In linear regression we assume $\text{E}(Y|X=\mathbf{x})=\mathbf{x}^\top\mathbf{\beta}$, and the error is defined as $\varepsilon_i = y_i - \text{E}(Y|X=\mathbf{x}_i)$, where $y_i$ is an observation of $Y$ conditioned on $X=\mathbf{x}_i$. Therefore it is clear that $\varepsilon_i$ has the same distribution as $y_i$ shifted by $\text{E}(Y|X=\mathbf{x}_i)$. – Ernest A Feb 12 '17 at 10:01
  • Your last comment is correct. But it does not imply that $Y$ is normal. In fact, you can easily generate $Y$ such that $Y=\beta X+\varepsilon$ with $\varepsilon$ being normal and $Y$ being not even remotely close to normal (see the sketch after these comments). Try the examples I mention above. The general condition for $Y$ being normal is that not only $\varepsilon$ but also $X$ is normal, which is pretty restrictive. Otherwise $Y$ is nonnormal. – Richard Hardy Feb 12 '17 at 10:27
  • It doesn't imply that $Y_1,\dotsc,Y_n$ are identically distributed; it does imply, however, that each $Y_i$ is normal, which is a sufficient and necessary condition (along with independence) for the normality of the OLS estimators. I never claimed it implied that $Y$ was normal i.i.d. – Ernest A Feb 12 '17 at 11:58
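To illustrate the point made in the comment exchange above, a minimal sketch (Python with NumPy; the coefficient 5 and the dummy design are arbitrary choices): with a dummy regressor and perfectly normal errors, the marginal distribution of $Y$ is a bimodal mixture, nowhere near normal.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
x = rng.integers(0, 2, n)      # dummy regressor
eps = rng.normal(0, 1, n)      # perfectly normal errors
y = 5.0 * x + eps              # marginal Y is a 50/50 mixture of N(0,1) and N(5,1)

# kurtosis of the marginal distribution of Y
z = (y - y.mean()) / y.std()
print((z ** 4).mean())  # roughly 1.5, well below the normal value of 3: Y is bimodal
```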