
I am trying to ascertain which independent variables matter the most as they pertain to the dependent variable.

I have tried two methods, and they are giving slightly different answers: a correlation matrix and scaled (standardized) coefficients in a regression.

What I need clarified:

The correlation matrix and regression coefficients are giving me slightly different X-variables that matter the most. Which method would you use? For example:

The correlation matrix shows that crime has a strong negative correlation with income (Y), while public transit, education, and population each have a strong positive correlation with income (Y).

The scaled coefficients from the regression show that access to public transit, education, and access to tutors each have a strong positive relationship with income (Y).

Correlations in R with corrr::correlate(data):

  • crime
  • transit
  • education
  • population

Scaled coefficients from a regression in R with lm(scale(y) ~ scale(x1) + scale(x2) + scale(x3) + ...):

  • transit
  • education
  • tutors

Which would you use? And why? I believe the regression because it specifies an actual relationship. Or would you do something else?

And I thought that collinearity/high correlation was a problem in regressions, which makes these two methods seem at odds with each other. Thank you in advance for any clarification/guidance.
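To see how the two lists can disagree, here is a small simulated sketch in R (all variable names and effect sizes are made up for illustration): when two predictors overlap, a variable can correlate strongly with Y yet have a near-zero standardized coefficient once the variable it proxies is in the model.

```r
# Hypothetical simulation: "tutors" tracks "education", so it correlates
# with income even though income depends only on education and transit.
set.seed(42)
n <- 500
education <- rnorm(n)
tutors    <- 0.8 * education + rnorm(n, sd = 0.6)
transit   <- rnorm(n)
income    <- 1.0 * education + 0.5 * transit + rnorm(n)
d <- data.frame(income, education, tutors, transit)

# Pairwise correlations with the outcome: tutors looks "important" here
cor(d)[, "income"]

# Standardized coefficients: the coefficient for tutors is near zero
# once education is controlled for
fit <- lm(scale(income) ~ scale(education) + scale(tutors) + scale(transit),
          data = d)
coef(summary(fit))
```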

HITHERE

  • It is not an either-or choice. Your last paragraph, expanded, is key. The regression coefficients, scaled for units of measurement, are contingent on what other predictors are in the model (and on everything else about the model too, especially if the specification is a poor choice). The correlations absolutely aren't. In any case, "variable importance" is not a clear-cut thing, although researchers (me too) would really like to know it. Contradictory results when comparing correlations and regression coefficients are par for the course. Indeed, the art is to choose predictors so that the signs make sense. – Nick Cox Dec 21 '21 at 15:16
  • There is no free lunch here. Anything can be misleading or irrelevant. For example, correlations and regressions can be enigmatic because the (conditional) relationship is nonlinear. – Nick Cox Dec 21 '21 at 15:35
  • Understanding this question requires us to appreciate what you mean by "pertain": could you explain? – whuber Dec 21 '21 at 21:21
  • See https://stats.stackexchange.com/search?q=regression+strength for the same question as asked in various ways. Due to the ambiguity of your current formulation, I have identified a well-regarded thread that addresses one plausible interpretation of your question. If that doesn't work for you, please indicate in an edit to this post how your interest differs from the duplicate. – whuber Dec 21 '21 at 21:24

2 Answers


With a correlation matrix, you are looking only at pairwise relationships. Regression analysis tells you about each relationship while also accounting for the other variables. To give a completely made-up example: say that you are interested in predicting the probability of committing a crime. Having a tattoo, in your dataset, strongly correlates with crime. However, if you include in the regression both having a tattoo and gang membership, the regression parameter for tattoo may go to zero, because tattoo was only a proxy for loosely identifying gang members. Pairwise correlations alone should not be the reason for including or excluding features from an analysis.
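A quick simulation of that made-up example (all names hypothetical; a logistic regression, since the outcome is binary) shows the proxy effect:

```r
# Tattoo is only a noisy marker of gang membership; crime depends on gang.
set.seed(1)
n <- 2000
gang   <- rbinom(n, 1, 0.2)
tattoo <- rbinom(n, 1, ifelse(gang == 1, 0.9, 0.1))
crime  <- rbinom(n, 1, plogis(-2 + 3 * gang))

cor(tattoo, crime)  # clearly positive on its own

# With gang membership in the model, the tattoo coefficient shrinks toward zero
coef(glm(crime ~ tattoo + gang, family = binomial))
```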

Tim

What you're describing is called Feature Importance. If you search with this keyword, I'm sure you will find a lot of interesting materials, and quite probably someone pointing to Shapley values as a good way to identify good predictors. Other techniques include analyses with permutation, trying to identify the Markov blanket, and so on.

The issue with regression coefficients and with correlation or variance-covariance matrices is that you don't really get the unique contribution of each predictor. You are probably aware of this, since you mentioned multicollinearity. Is it a problem in regression? That depends on whether your goal is prediction or inference. If it's prediction, then unless the collinearity is almost perfect, it's fine. If it's inference, it can be an issue, as you can see in the answers to many other questions already asked here.
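As one concrete alternative among the techniques mentioned above, here is a minimal permutation-importance sketch in R. It assumes only a fitted model and a data frame containing the outcome column, and it measures how much shuffling each predictor degrades the model's predictions (the model is never refitted):

```r
# Permutation importance: how much does mean squared error worsen when a
# single predictor's values are shuffled?
perm_importance <- function(fit, data, outcome, n_rep = 50) {
  base_mse <- mean((data[[outcome]] - predict(fit, newdata = data))^2)
  vars <- setdiff(names(data), outcome)
  sapply(vars, function(v) {
    mean(replicate(n_rep, {
      shuffled <- data
      shuffled[[v]] <- sample(shuffled[[v]])
      mean((data[[outcome]] - predict(fit, newdata = shuffled))^2)
    })) - base_mse  # increase in MSE attributable to losing this predictor
  })
}

# Example with a built-in dataset:
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
perm_importance(fit, mtcars[, c("mpg", "wt", "hp", "qsec")], "mpg")
```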

mribeirodantas
  • Thank you for your reply. I'll google feature importance and Shapley values. My goal is not strictly to predict ("income will be 45000") but to see which variables are most impactful. So, if I were to look into other towns, I could see which have (for example) high educational attainment and access to public transit and plot this against income. Just as a pointer. – HITHERE Dec 21 '21 at 21:08
  • As commenters and the respondents are trying to explain, the sense of "impactful" in multiple regression is rarely well defined. Even then, it can have several meanings. – whuber Dec 21 '21 at 21:23
  • I'm not saying your goal is to predict something. Instead of prediction, your goal could be inference, for example. You can make great predictions with variables that are not even causally related to your target variable, so I completely understand your desire to identify which independent variables are most relevant. It's a very broad question, though. I tried to answer you without being too broad, but it is indeed broad. – mribeirodantas Dec 21 '21 at 21:29
  • Being "impactful" isn't easier to assess than being important. If anything, it implies what should make the most difference if you changed it, which, outside of an experiment or intervention, is even harder to answer than what helps most in prediction. – Nick Cox Dec 22 '21 at 14:24