
I have 1 dependent variable and 33 independent variables (continuous, categorical & dichotomous). Correlation analyses (2-tailed) show that the DV is only correlated to 7 of the IVs although most of the correlations are very weak, e.g. about 0.1 or less than 0.1.

Is it correct to put only IVs that are correlated with the DV into the regression model?

P.S. What's the use of the correlation matrix (1-tailed) produced with the regression analysis?

  • What kind of $R^2$ are you getting? Dropping predictors will not improve your accuracy, so you may want to re-think your model. –  Aug 25 '15 at 04:23
  • Thanks for your reply, Bey. Sorry, what do you mean by "what kind of $R^2$"? – statistics newbie Aug 25 '15 at 04:33
  • How much explained variance are you getting with your model? If you plot $Y_{pred}$ vs $Y_{actual}$, what is the coefficient of determination? –  Aug 25 '15 at 04:35
  • I tried the stepwise method. With the 7 IVs entered, the best $R^2$ I got was 0.340, and only 3 IVs remained in the model. – statistics newbie Aug 25 '15 at 06:42
  • Don't forget that regression works based on the combination of your predictor/independent variables. So, think about a) if there are strong correlations between your predictor variables, b) whether a predictor variable not so strongly correlated to your dependent variable (individually) might be important when other variables are present. – AntoniosK Aug 25 '15 at 10:47
  • AntoniosK, thanks for the reminders. a) There seems to be strong correlations between some of the predictors, e.g. whether someone has taken Biology and whether someone has taken Chemistry. However, it seems not appropriate to just drop either Biology or Chemistry. What should I do? b) You're right that a predictor might be important when other variables are present. So, should I simply put all 33 predictors in the regression model since there's very little literature on which might be the predictors? That is, this is more exploratory in nature. – statistics newbie Aug 25 '15 at 13:26
  • What correlation matrix produced with the regression analysis are you talking about? – Richard Hardy Sep 04 '15 at 18:13
  • @statisticsnewbie, Bio and Chem is a good example. For instance, if you're studying students of philology, the fact that someone took both Bio and Chem may say something about the student, because he could have got away with general requirements by taking only Bio. Maybe this guy's interested in natural sciences. On the other hand, if you're studying med students, you may not need both or either, because they're probably required to take them. If you look at fine art students, they're probably required to take some Bio but no Chem, so the ones taking it could be interesting subjects, etc. – Aksakal Sep 04 '15 at 18:41
  • Richard, the correlation matrix I'm talking about is the one produced by SPSS when selecting the "Descriptives" options under "Statistics" of Linear Regression. – statistics newbie Sep 06 '15 at 11:02
  • Aksakal, in my case, taking science subjects is optional for the students, so whether someone took both Bio and Chem, or just one of them, does say something about the student. So, does that mean that even though taking Bio and taking Chem are highly correlated, I should still include both variables in the regression model? – statistics newbie Sep 06 '15 at 11:11

2 Answers


Correlating the dependent variable with each of the potential regressors will not generally reveal all the useful information. It might very well be that a linear combination of a subset of regressors will be highly correlated with the dependent variable while each of the regressors in the subset will be only weakly correlated with the dependent variable. In other words, a regression with multiple regressors cannot be substituted with multiple pairwise regressions.
To answer your question, it is generally not wise to put only the IVs that are correlated with the DV into the regression.
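As a sketch of this point (not from the original thread), here is a small NumPy simulation with hypothetical data: two predictors are each only weakly correlated with the dependent variable, yet together they explain most of its variance, because the dependent variable tracks their *difference*.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two highly correlated predictors; y depends on their difference
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
y = x1 - x2 + 0.05 * rng.normal(size=n)

# Pairwise correlations with y are weak...
print(np.corrcoef(y, x1)[0, 1])  # near 0
print(np.corrcoef(y, x2)[0, 1])  # roughly -0.09

# ...but the joint OLS regression explains most of the variance
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
r2 = 1 - resid.var() / y.var()
print(r2)  # roughly 0.8
```

Screening on pairwise correlations alone would have discarded both predictors here, even though the joint model is strong.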

For variable inclusion/exclusion strategies, you may check the posts under the feature-selection tag.

Richard Hardy
  • 54,375
  • 10
  • 95
  • 219

Generally, the answer is no, because the presence or absence of a third variable in the model may change the relationships between the independent and dependent variables. There are known phenomena like mediation and confounding, for instance. There are also sampling biases, which may mess up the correlations, hide relationships etc.
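To illustrate confounding with a minimal NumPy sketch (hypothetical data, not from the original question): a confounder drives both the predictor and the outcome, so the pairwise correlation looks substantial, but the predictor's coefficient collapses toward zero once the confounder enters the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Confounder z drives both x and y; x has no direct effect on y
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

print(np.corrcoef(x, y)[0, 1])  # roughly 0.5: x looks predictive on its own

# Once z enters the model, x's coefficient shrinks toward zero
X = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta[1])  # near 0: x adds nothing given z
```

The reverse can also happen: a variable that looks useless pairwise can become important once a third variable is added, which is why variable selection should not rest on pairwise correlations alone.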

Having said this, the bi-variate correlations are informative. You should take them into account. It's just you shouldn't base your variable selection solely on bi-variate correlations.

Aksakal
  • 55,939
  • 5
  • 90
  • 176
  • My goal is to find out whether science background can predict money spent on organic food. Since I think taking Physics and taking Biology are quite different, I don't want to use "taking science" as a whole as one variable. Instead I use three different variables: "taking Biology", "taking Chemistry" and "taking Physics". I also thought about including the variables "taking science" and "no. of science subjects taken". – statistics newbie Sep 06 '15 at 11:33
  • However, as you can imagine, these variables are usually highly correlated, and I don't know if it's appropriate to put them all in the regression model. Moreover, for example, "taking physics" is only weakly correlated (< .1) with $ spent on organic food. So, is it still appropriate to include it in the model? I got stuck here and I'm really confused. – statistics newbie Sep 06 '15 at 11:33
  • I'm using the stepwise method. – statistics newbie Sep 06 '15 at 11:53
  • What's the purpose of the analysis? Multicollinearity is not an issue in forecasting (generally). So, if you're forecasting then you can throw in all variables and not worry too much about correlations. – Aksakal Sep 06 '15 at 21:15