How can I use a control variable that is non-normal in linear regression with other variables that are normal?

Question

I am doing a moderation analysis. I have a predictor, a moderator and an outcome variable, all of which are normally distributed data.

I very much need to add in a control variable. Otherwise my results are pretty meaningless. However, my control variable is not normally distributed at all. Transforming the data and ranking the data do not help. How can I account for this? Do I actually need the control variable to be normally distributed?

score 10 · Accepted Answer · edited Jun 01 '21 at 20:26

There is no need for ANY of the variables in multiple linear regression to be normally distributed.

There is a very common misunderstanding that the outcome/response needs to be normal, but even this is not correct. Depending on the use to which the model is to be put, it might be desirable for the conditional distribution of the outcome/response to be normal - which means the residuals, not the variable itself.

But when it comes down to explanatory variables - be they moderators, confounders, main exposures, competing exposures etc., there is definitely no requirement for them to follow any kind of distribution.
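The point can be checked with a small simulation. The following is only a minimal sketch, assuming NumPy is available: the coefficients, the sample size, and the helper names `simulate_and_fit` and `skewness` are invented for illustration and are not from the question. It fits a moderation model (predictor, moderator, their interaction) plus a heavily skewed exponential control variable by ordinary least squares, then compares the skewness of the control with the skewness of the residuals.

```python
import numpy as np


def simulate_and_fit(n=5000, seed=0):
    """Simulate a moderation model with a skewed control and fit it by OLS.

    All coefficient values here are hypothetical, chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                   # predictor (normal)
    m = rng.normal(size=n)                   # moderator (normal)
    c = rng.exponential(scale=1.0, size=n)   # control: strongly right-skewed

    # Columns: intercept, predictor, moderator, interaction, control
    X = np.column_stack([np.ones(n), x, m, x * m, c])
    true_beta = np.array([1.0, 0.5, -0.3, 0.4, 2.0])

    # The errors are normal; nothing is assumed about the columns of X
    y = X @ true_beta + rng.normal(scale=1.0, size=n)

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    return true_beta, beta_hat, c, residuals


def skewness(v):
    """Sample skewness: third central moment over variance to the power 1.5."""
    v = np.asarray(v, dtype=float)
    d = v - v.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5


if __name__ == "__main__":
    true_beta, beta_hat, c, residuals = simulate_and_fit()
    print("true coefficients:     ", true_beta)
    print("estimated coefficients:", np.round(beta_hat, 3))
    print("skewness of control:   ", round(skewness(c), 3))
    print("skewness of residuals: ", round(skewness(residuals), 3))
```

In this setup the estimates should land close to the chosen coefficients even though the control is far from normal, and the residuals, not the control variable, are the quantity to inspect if normality matters for the intended use of the model.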

edited Jun 01 '21 at 20:26

Scortchi - Reinstate Monica


answered May 31 '21 at 09:11

Robert Long


Thank you very much!!! Do I need to report the results differently? – Harry Huntington May 31 '21 at 09:16
You're welcome. "differently" than what? – Robert Long May 31 '21 at 09:21
This. A simple example is that a requirement for normally distributed predictors would exclude the use of (0, 1) indicator variables as predictors, but it's excellent technique. (Some say "dummy variables".) – Nick Cox May 31 '21 at 09:56
@Robert Long. By differently I mean, do I need to report that some of the data are distributed differently from others or is it not relevant? My supervisor, before I asked the above question, vaguely mentioned that I would need to run and report tests for normality but your answer makes me think that is not the case. Thanks again. – Harry Huntington May 31 '21 at 10:29
A lot depends on who the audience for the work is. There's nothing *wrong* with reporting some kind of summaries of the data, and if your supervisor suggests that you should, then by all means go ahead and do so, but there's really no need for it, and even less need to test normality. There are some situations where normality would be needed, such as a simple t-test, but in multiple regression, it is not needed. If this is for coursework/homework and your supervisor says to do it, then you probably should in order to get full marks, but keep in mind that your supervisor is probably wrong. – Robert Long May 31 '21 at 11:37
Thank you for the reply. If you have time, could you explain why it is not important? I would like to be able to give a reason if I am asked. – Harry Huntington May 31 '21 at 12:35
You're welcome. As for independent variables, I would rather turn that around and ask why you think their distribution is important at all? As @NickCox mentioned, if their distribution were important, how would we be able to incorporate binary variables, or any other categorical variable? I would be surprised if you could find a textbook or lecture notes that say the distribution of IVs is important. As for the response/outcome, take a look at [my answer here](https://stats.stackexchange.com/questions/525735/how-to-model-heavily-left-skewed-data/526215#526215) (and the ones it links to) – Robert Long May 31 '21 at 12:42
Unfortunately very poor textbooks can be found all too easily. One published by Cambridge University Press explains that normal distribution is needed for a Mann-Whitney test. (It isn't.) – Nick Cox May 31 '21 at 13:24
@NickCox Oh dear :O Perhaps we should start a thread on here: "What are some examples of textbooks that contain egregiously wrong information about assumptions and conditions for regression and related models?" It could be a useful community wiki – Robert Long May 31 '21 at 13:35
Can I clarify that normality is not necessary for a simple regression analysis in which I have two IVs and an interaction plus two control variables? Sorry, I just want to be really certain as there is a lot of conflicting information out there. Thanks – Harry Huntington May 31 '21 at 15:22
No problem. YES there is no necessity for any of the variables to be normally distributed, even the outcome. In some cases you would like the conditional distribution of the outcome to be normal, in which case you inspect the residuals arising from the model, not the raw variable. – Robert Long May 31 '21 at 15:24
