I need to run hundreds of linear regression models, with the same set of independent variables but with varying dependent variables. I have checked normality for a few dozen of them. Some are normally distributed and some are not.

My intention, for practical reasons, is to write a macro that will run this automatically and store the p-values of the final model (I will use stepwise or a similar method), along with the associations between the predicting variables and the predicted variables. My question is: since I can't use linear regression for all models, can I simply use robust regression for all models, without checking for normality? Or maybe loess regression?

kjetil b halvorsen
user3275222
  • Likely none are actually normally distributed. Why do you need marginal normality of the DVs? – Glen_b Jul 10 '16 at 09:30
  • Isn't it one of the assumptions of linear regression? I mean, the errors should be normal, and if they are, so will the dependent variable. If it isn't, one uses transformations. Practically, I can't do that. – user3275222 Jul 10 '16 at 09:36
  • 1. Approximate normality of the error term would be needed for inference that relies on normality (not all inference has to, but the usual inference does), though even then, in large samples the issue is more with power than with the level (when testing). 2. Normality of the error term does not in general imply marginal normality of the response. Consider a single X variable that only takes two values; then the marginal distribution of Y will be a location-mixture of two normals. – Glen_b Jul 10 '16 at 09:40
  • So how can I know? And if I need to run regression on hundreds of variables, automatically, which model will fit most cases? – user3275222 Jul 10 '16 at 09:43
  • You can't really assess normality without fitting the model. Do you really need it? What are you using the models to do? What's the point of the exercise? – Glen_b Jul 10 '16 at 09:45
  • Exploratory analysis. I have a set of variables X1 to X30 and a set of variables Y1 to Y60, and I wish to find associations between the X variables and each of the Y variables. I was asked to use stepwise (or another method) for each Y, using all X's (the numbers 30 and 60 are just to illustrate; I have more). – user3275222 Jul 10 '16 at 09:50
  • How big is your sample size? If it's exploratory, is it necessary that your inference retain any specific properties? – Glen_b Jul 10 '16 at 10:23
  • I have 400 subjects in all 3 groups together. I want to find associations between the X's and the Y's, for all significant variables. Clearly I will need an automatic method like stepwise or other subset selection. – user3275222 Jul 10 '16 at 10:26
  • You may like to search out our posts on stepwise selection. Quite a few of them will discuss the many problems with such an approach. (For example, your subsequent inference will likely be much more badly affected than you would often have from a bit of non-normality). I'd worry much more about the consequences of that. – Glen_b Jul 10 '16 at 10:31
  • OK. What do you suggest instead? – user3275222 Jul 10 '16 at 10:34
  • I don't really understand your actual needs sufficiently to suggest anything (or I'd already have done so). I'm not sure "exploratory analysis" makes sense in this context. – Glen_b Jul 10 '16 at 10:39

1 Answer

There are a lot of misunderstandings here, mostly pointed out in the comments, so I will summarize them here.

  1. You should not use stepwise methods in any form; they lead to invalid inferences. There are many questions on this site about that. A good one is "Algorithms for automatic model selection", which has good answers explaining why it is a bad idea.
  2. If you have many variables and need some model reduction, consider lasso or ridge regression instead. See "Ridge, lasso and elastic net".
  3. Linear regression does not assume that the response variable has a normal (or any other) distribution. It is the error term that should be normal (if you want to use the usual normal-based inference), and that can be checked by plotting the distribution of the residuals, not the response. See "Why do we use residuals to test the assumptions on errors in regression?" and "Does the assumption of Normal errors imply that Y is also Normal?"
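
Point 3 can be illustrated with a small simulation. The following is only a minimal sketch in plain Python, assuming a toy setup like the one described in the comments: a single X that takes two values and exactly normal errors. The function names `simulate` and `excess_kurtosis`, the slope of 6, the sample size, and the seed are all illustrative assumptions, not anything taken from the question's data.

```python
import random
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis: close to 0 for normally distributed data."""
    values = list(values)
    mean = statistics.fmean(values)
    m2 = statistics.fmean([(v - mean) ** 2 for v in values])
    m4 = statistics.fmean([(v - mean) ** 4 for v in values])
    return m4 / m2 ** 2 - 3.0


def simulate(n=4000, slope=6.0, seed=1):
    """Simulate y = 1 + slope * x + normal error, with x in {0, 1}.

    Returns the response, the least-squares residuals, and the fitted slope.
    """
    rng = random.Random(seed)
    x = [rng.choice((0.0, 1.0)) for _ in range(n)]
    y = [1.0 + slope * xi + rng.gauss(0.0, 1.0) for xi in x]

    # Ordinary least squares for a single predictor (closed form).
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    return y, residuals, b1


if __name__ == "__main__":
    y, residuals, b1 = simulate()
    # The response is a mixture of two normals (bimodal, far from normal) ...
    print("excess kurtosis of y:        ", round(excess_kurtosis(y), 2))
    # ... while the residuals look normal, which is what the model needs.
    print("excess kurtosis of residuals:", round(excess_kurtosis(residuals), 2))
```

In this setup a normality check on the response would reject, even though the model's error assumption holds exactly; only the residuals show that. In a real analysis one would plot the residuals (for example with a QQ plot) rather than rely on a single summary number such as kurtosis.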
kjetil b halvorsen