3

Is it OK to mix different data transformations in the same analysis? I've had to square some variables and cube others in order to meet normal distribution requirements. Is it then OK to use the transformed variables together in a regression analysis?

Confuser
    It's unusual for *any* variables in a regression analysis to need normal distributions. (The standard situation concerns a single *dependent* variable whose *residuals* are compared to a normal distribution.) Could you please tell us a little more about the nature of these variables and the intended analysis? – whuber Jul 27 '12 at 16:02

2 Answers

10

From your question, I wonder if you are referring to transforming your covariates. It is important to realize that regression models make no assumptions about the distribution of the covariates; the assumptions concern only the distribution of the residuals (not even the distribution of the response variable per se).
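To see this concretely, here is a minimal simulated sketch (Python with statsmodels, made-up data that has nothing to do with your variables): a strongly skewed covariate causes OLS no trouble as long as the errors themselves are well behaved.

```python
# Simulated sketch: a heavily right-skewed covariate paired with normal errors.
# OLS is fine here because its assumptions concern the residuals, not x.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)             # strongly skewed covariate
y = 1.0 + 0.5 * x + rng.normal(scale=1.0, size=500)  # normally distributed errors

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.params)              # intercept and slope close to 1.0 and 0.5
print(jarque_bera(fit.resid))  # residual normality check: (stat, p-value, skew, kurtosis)
```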

Of course, it is OK to transform your covariates, and to use different transformations with different covariates. But this is done to fit a model with a curvilinear relationship between the covariate and the response variable, not to normalize the distribution of the covariates.

If you do transform some covariates, there are a couple of things to remember:

  • you need to include the untransformed covariate as well (see here for more discussion of this)
  • if you include interactions between that covariate and others, you need to include interaction terms composed of all of the corresponding covariates, e.g.: $$ \hat{y}=\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2+\hat{\beta}_3x_2^2+\hat{\beta}_4x_1x_2+\hat{\beta}_5x_1x_2^2 $$
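As an illustrative sketch only (Python/statsmodels with made-up data; `df`, `x1`, `x2`, and `y` are hypothetical names), the model in the equation above could be specified with all the lower-order terms kept in:

```python
# Made-up data; the formula mirrors the equation above, keeping the
# untransformed x2 and the plain x1:x2 interaction alongside the squared terms.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["y"] = (1 + 2 * df.x1 + 0.5 * df.x2 - 0.3 * df.x2**2
           + 0.8 * df.x1 * df.x2 + 0.2 * df.x1 * df.x2**2
           + rng.normal(size=300))

fit = smf.ols("y ~ x1 + x2 + I(x2**2) + x1:x2 + x1:I(x2**2)", data=df).fit()
print(fit.params)  # one coefficient for each term in the equation above
```

Dropping `x2` or `x1:x2` from that formula would silently force the corresponding coefficients to zero, which is rarely what you intend.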
gung - Reinstate Monica
  • Does your comment about including the original covariates only apply to the polynomial case (e.g., x^2 and x^3)? You may want to note that other transformations, such as log() or sqrt(), don't have this requirement. Also, there are transformation schemes such as fractional polynomials (Royston) or splines where this requirement does not hold. – B_Miner Jul 27 '12 at 16:54
  • Firstly, let me apologise for not replying earlier. I'm on vacation in Cornwall with limited internet access. – Confuser Aug 01 '12 at 08:51
  • I'm doing some ad hoc tests on customer feedback data. There's a lot of kurtosis and skew in the independent variables, most of it more than twice its standard error. I believe that I should use data transformations to help reduce the effects of skew and kurtosis before running the regression analysis. For most of the variables I have, squaring the data produces the best reduction in skew and kurtosis, but for a couple I get the best results by using the square root. I should add that I'm halfway through an MSc in psychology, so my stats knowledge is nowhere near as good as yours. – Confuser Aug 01 '12 at 09:02
  • No need to apologize. The important point here is that you **do not** need to transform IVs to address skew or kurtosis (although you *can* transform them for other reasons, such as to achieve a linear relationship, interpretability, or b/c the theory dictates it). B/c this issue of normality in regression has come up a lot recently, I added an answer to [this question](http://stats.stackexchange.com/questions/12262/what-if-residuals-are-normally-distributed-but-y-is-not); it's worth your time to read it & the answers there. – gung - Reinstate Monica Aug 01 '12 at 13:10
  • Thanks, I checked out your previous post and it makes sense now. – Confuser Aug 01 '12 at 15:33
4

In addition to the excellent points made by @gung and whuber, consider what the model will mean once the variables are transformed. It would be nice if you could tell us the context of the problem, but suppose it is, in fact, the case that the residuals are not normally distributed with the untransformed variables yet look fine after various transformations. Will the resulting model be sensible in your field?

OLS regression assumes normality of the errors. Sometimes it is better to use a different model rather than force the data to fit the OLS model; there are many alternatives. Again, if you tell us what you are trying to do, we may be able to suggest some.
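For instance, here is a hedged sketch (Python/statsmodels, made-up data) of two of the many alternatives, robust regression and quantile regression, which relax or drop the normal-errors assumption:

```python
# Made-up data with heavy-tailed errors that would trouble a plain OLS fit.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

rng = np.random.default_rng(2)
x = rng.normal(size=400)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.standard_t(df=2, size=400)  # heavy-tailed noise

# Robust regression: downweights outliers instead of assuming normal errors.
robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

# Median (quantile) regression: makes no distributional assumption on the errors.
median_fit = QuantReg(y, X).fit(q=0.5)

print(robust_fit.params, median_fit.params)
```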

Peter Flom