
Consider the attached histogram of the outcome ($Y$), which is going to be the outcome in a linear regression. Clearly, the histogram shows the outcome is not normally distributed. How can I come up with a transformation that makes the data normal so that I can fit a linear regression?

My goal is to compare the effect of TRT (treatment) vs. CTRL (control). One obvious regression is:

$Y = \mathrm{TRT} + \text{other covariates}$

Since $Y$ is not normal, do you think I can assess the TRT effect by treating $Y$ as a predictor and TRT as the outcome, and fitting a logistic regression?

[Histogram of the outcome $Y$, showing a right-skewed distribution]

user48405
  • $Y$ (the outcome) does *not* have to be normally distributed to fit a linear regression. The *residuals* of the regression are assumed to be normally distributed (see [here](http://stats.stackexchange.com/questions/12262/what-if-residuals-are-normally-distributed-but-y-is-not), for example). Could you explain what the outcome is? – COOLSerdash Sep 29 '13 at 22:22
  • Using TRT as the DV makes no sense. Treatment isn't a dependent variable. – Peter Flom Sep 29 '13 at 22:37
  • The Linear Regression Model does _NOT_ need any kind of distributional assumptions. If a normality assumption is made, then we have the Normal Linear Regression Model, which is a special case of the former. – Alecos Papadopoulos Sep 29 '13 at 22:45
  • @COOLSerdash Actually, it's the random errors that are assumed normal (and then only when using normality to produce inferences such as hypothesis tests or intervals). The residuals should approximate the errors, though, so they are useful for seeing if the assumptions that were made are at least reasonable. – Glen_b Sep 29 '13 at 23:25
  • user48405 There's no assumption of unconditional normality for Y (that is, the raw observations don't necessarily tell you anything), even for a normal theory regression. If you don't have normality, you can still fit a regression (though if you're using least squares, a very strong deviation from normality may lead to relatively poor estimates); it doesn't look like you could be in a situation where that's a big concern. If you want to perform a test or compute an interval you may want to assume normality then, but other alternatives exist if the assumptions are untenable... (ctd) – Glen_b Sep 29 '13 at 23:32
  • (ctd)... further, aside from *prediction* intervals, the normal theory inference from least squares regression isn't all that sensitive to moderate non-normality if the sample sizes are reasonably large; it's more important to worry about the other assumptions (linearity, equality of variance, independence). – Glen_b Sep 29 '13 at 23:34
  • @Glen_b Thanks for this clarification. Just out of curiosity: What would be an example where the residuals are not a good approximation of the random errors? – COOLSerdash Sep 29 '13 at 23:37
  • @COOLSerdash Sorry, correction: If we take the other assumptions as satisfied, then when $I-H$ is too far from $I$, $e$ will generally be a poor approximation for $\varepsilon$; we can adjust for the differences in variance, but they're still dependent and in some cases, highly so – Glen_b Sep 30 '13 at 00:04
  • OP, See the discussion [here](http://stats.stackexchange.com/questions/12262/what-if-residuals-are-normally-distributed-but-y-is-not) – Glen_b Sep 30 '13 at 00:23
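The point made in these comments can be illustrated with a minimal sketch on synthetic data (all numbers are illustrative, not from the question): the marginal distribution of $Y$ can look very non-normal simply because the treatment shifts the mean, while the residuals, which is what the normality assumption is about, behave fine.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
trt = np.repeat([0, 1], 100)              # two equal groups: CTRL and TRT
y = 5.0 * trt + rng.normal(0, 1, 200)     # large treatment effect -> bimodal y

# Residuals from a regression of y on trt are just deviations from group means
resid = y - np.where(trt == 1, y[trt == 1].mean(), y[trt == 0].mean())

# y is strongly bimodal, so a normality test rejects it decisively;
# the residuals come from a single normal distribution, so they do not
# show the same systematic departure.
print("Shapiro p for y:    ", shapiro(y).pvalue)
print("Shapiro p for resid:", shapiro(resid).pvalue)
```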

1 Answer


COOLSerdash is right that only the residuals need to be normally distributed. However, when $Y$ is skewed, the residuals will often be skewed in the same direction, so look at a residual plot before deciding on any transformation. For a right-skewed distribution like this one, you can try a $\log_{10}(x)$, $1/x$, or $\sqrt{x}$ transformation.

You probably don't want to use logistic regression here: turning the model around so that TRT is the outcome answers a different question, and you would no longer be able to adjust for the other covariates' effects on $Y$.

Hotaka
  • No, you can't use log or root, since some y are negative. And if any y are 0 (unclear from histogram) then you can't use 1/x either. (At least, if the histogram is of y, as is implied). – Peter Flom Sep 29 '13 at 22:34
  • Oh, well you can always add a constant number to the variable to make the values all above 0 – Hotaka Sep 29 '13 at 22:38
  • That would make for a very hard to interpret model. The DV would then be, e.g. log10(y + 10). – Peter Flom Sep 29 '13 at 22:49
  • If one would insist on taking the logarithm, despite negative values, then a [GLM with a log-link](http://blog.stata.com/2011/08/22/) would be an alternative. But without seeing the regression diagnostics or knowing more about the outcome variable, this is all speculation at this point. More info from the OP is definitely needed in order to give more complete answers. – COOLSerdash Sep 29 '13 at 22:53
  • OP is just concerned with fitting the model. Wouldn't interpretation be difficult with any transformation in this case? I almost always prefer to find some way around transforming the Y in multiple regression unless I am working with a latent variable – Hotaka Sep 29 '13 at 23:13