11

I have eight independent variables and one dependent variable. I have run a correlation matrix, and 5 of the IVs have a low correlation with the DV. I then ran a stepwise multiple regression to see whether any/all of the IVs can predict the DV. The regression showed that only two IVs can predict the DV (together they account for only about 20% of the variance), and SPSS removed the rest from the model. My supervisor reckons that I have not run the regression correctly: given the strength of the correlations, I should have found more predictors in the regression model. But the correlations were tiny, so my question is: if the IVs and the DV hardly correlate, can the IVs still be good predictors of the DV?

Nick Stauner
Elle
  • 5
    Your title and contents show some confusion between the terms "dependent" and "independent". Please check that my edit preserves your intended meaning. The fact that people get confused about which is which strengthens the case for more evocative terminology, such as "response" or "outcome" rather than "dependent variable". Finally, on abbreviations, note that for many people "IV" means **instrumental variable**. – Nick Cox Mar 20 '14 at 13:11
  • 4
    Yes, it's possible. One reason is high sample size. Another reason is confounding: the main independent variable may show a low correlation with the dependent variable because it is confounded by another independent variable. Once that confounder is added to the model, it can make the original independent variable change from not predictive to predictive (or from predictive to not predictive, depending on the type of confounding). Regression will fully agree with all correlation tests only when all independent variables are uncorrelated, and that nearly never happens. – Penguin_Knight Mar 20 '14 at 13:14
  • 3
    Saying a step-wise regression "showed that only two IV can predict the DV" suggests you don't understand how it works. If two IVs are strongly correlated, & either predicts the DV about equally well, a stepwise procedure can remove one quite arbitrarily. What's the problem with using the full 8-IV model? – Scortchi - Reinstate Monica Mar 20 '14 at 15:01
  • Scortchi - my grasp on everything stats related is greatly lacking :( None of the IVs are strongly correlated either with each other, or with the DV. I did put all 8 IVs into the stepwise regression analysis, and it excluded all but two of the IVs (which indicates that they are not sig predictors?). – Elle Mar 20 '14 at 15:09
  • 3
    If tempted to use stepwise, reach for Frank Harrell, _Regression modeling strategies_ Springer, NY, 2001 as an antidote. He's active on this site and likely to shoot rockets if he hears the word "stepwise". – Nick Cox Mar 20 '14 at 15:20
  • 2
    The weaker your grasp of statistics, the less you ought to be messing about with variable selection procedures. If your goal's to examine how each IV relates to the DV after controlling for the others, that's exactly what the coefficient estimates (with their confidence intervals) from the full model are telling you. Looking at variance inflation factors alongside indicates how correlations between IVs are contributing to the uncertainty. Use a cross-validated or adjusted coefficient of determination, $R^2$, to assess the predictive capability of the whole model & to check for over-fitting. – Scortchi - Reinstate Monica Mar 20 '14 at 15:24
  • 1
    @Nick: If people use step-wise after finding the full model over-fits & then confirm that it's improved matters, I'm not inclined to "shoot rockets", but it worries me when they seem to use it for no reason at all without checking anything. – Scortchi - Reinstate Monica Mar 20 '14 at 15:34
  • 1
    @Scortchi Here as almost everywhere else I tend to agree with you. But an 8-predictor fit carefully examined usually allows judicious choice of a simpler model without invoking any of the stepwise machinery. I wouldn't delegate a choice that should be sensitive to the underlying science to a program. – Nick Cox Mar 20 '14 at 15:47
  • 1
    @Nick: I agree entirely; it's that judicious examination I'm recommending - of the full model before the stepwise procedure has mutilated it for its own inscrutable reasons, & in ignorance of the scientific or other considerations that might militate for a simpler model in this particular case. – Scortchi - Reinstate Monica Mar 25 '14 at 16:31

3 Answers

9

With a correlation matrix, you are examining unconditional (crude) associations between your variables. With a regression model, you are examining the joint associations of your IVs with your DV, thus looking at conditional associations (for each IV, its association with the DV conditional on the other IVs). Depending on the structure of your data, these two can yield very different, even contrary, results.
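As a hedged sketch of this in R (the variables and effect sizes below are invented for illustration, not taken from the question's data): here a confounder `z` is built so that the positive path through `z` cancels the negative direct effect of `x`, making `x`'s marginal correlation with `y` near zero even though `x` is a strong predictor once `z` is conditioned on.

```r
# Invented example: a confounder z masks x's association with y.
set.seed(42)
n <- 200
z <- rnorm(n)                # confounder
x <- z + rnorm(n)            # IV, correlated with the confounder
y <- -x + 2 * z + rnorm(n)   # DV: the direct effect of x is -1

cor(x, y)                    # marginal (unconditional) association: near zero
coef(lm(y ~ x + z))          # conditional on z: x's coefficient is close to -1
```

By construction the marginal covariance of `x` and `y` is zero, so the correlation matrix suggests `x` is useless, while the conditional model recovers its effect.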

miura
7

Coincidentally, I was just looking at an example I had created earlier to show similar concepts (actually, to show one of the problems with stepwise regression). Here is R code to create and analyze a simulated dataset:

set.seed(1)
x1 <- rnorm(25)
x2 <- rnorm(25, x1)
y <- x1-x2 + rnorm(25)
pairs( cbind(y,x1,x2) )    # Relevant results of each following line appear below...
cor( cbind(y,x1,x2) )      # rx1y  =   .08      rx2y = -.26      rx1x2 = .79
summary(lm(y~x1))          # t(23) =   .39         p = .70
summary(lm(y~x2))          # t(23) = -1.28         p = .21
summary(lm(y~x1+x2))       # t(22) =  2.54, -2.88  p = .02, .01 (for x1 & x2, respectively)

The correlations and simple linear regressions show low (not statistically significant) relationships between $y$ and each of the $x$ variables. But $y$ was defined as a function of both $x$s, and the multiple regression shows both as significant predictors.
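Since the example was built to show a problem with stepwise regression, here is a hedged follow-up sketch using base R's `step` on the same simulated data: judged one predictor at a time, neither $x_1$ nor $x_2$ improves AIC enough to enter (the larger single-predictor $F$ is about $1.28^2 \approx 1.6$, below the threshold of roughly 2 that AIC implies for one added parameter), so forward selection can stop at the intercept-only model and miss the relationship the joint fit detects.

```r
# Reproduce the simulated data above, then run forward stepwise selection by AIC.
set.seed(1)
x1 <- rnorm(25)
x2 <- rnorm(25, x1)
y  <- x1 - x2 + rnorm(25)

null_fit <- lm(y ~ 1)
step_fit <- step(null_fit, scope = ~ x1 + x2, direction = "forward", trace = 0)
coef(step_fit)   # with this seed: intercept only; neither predictor enters
```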

Nick Stauner
Greg Snow
4

Your question would be easier to answer if we could see quantitative detail from your software output and ideally have a sight of the data too.

What is "low correlation", in particular? What significance level are you using? Are there built-in relationships between predictors that result in SPSS dropping some?

Note that we have no scope for judging whether you used the best or most appropriate syntax for your purpose, as you don't state exactly what you did.

In broad terms, low correlations between predictors and outcome imply that regression may be disappointing, in much the same way that you can't make chocolate cake without chocolate. Give us more detail, and you should get a better answer.

Also in broad terms, the disappointment of your supervisor doesn't imply that you did the wrong thing. If your supervisor knows less statistics than you do, you need to seek advice and support from other people in your institution.

Nick Cox
  • Thank you everyone. I know this is a bit of a baby question. I have perceived stress as my DV and my IVs are Locus of Control (with 3 subscales), Social Support, Coping Self Efficacy (3 subscales) and Emotional Intelligence (these all relate to self-report questionnaires), and I want to know how/whether the IVs are able to predict perceived stress. I looked at correlations between all variables; they are all mostly below .40, significance level is .001. I ran a Pearson's correlation first to see if the IVs correlate with perceived stress, then the regression to see if they can predict stress. – Elle Mar 20 '14 at 14:27
  • 1
    As @miura rightly emphasises, funny things can happen, but these results seem perfectly consistent with relatively low $R^2$. – Nick Cox Mar 20 '14 at 14:49