I'm looking at the analysis for the red wine dataset on https://datauab.github.io/red_wine_quality/. This is a Jupyter notebook, and if you go down to about In [9], it states: "From all these features, we are going to select the ones with bigger numbers since these are the ones that will give us more information. To do so we are going to stablish a minimum threshold of correlation approximately around 0.2 (absolut value) since we do not have to take into account features whose values might be redundant and not provide information at all."
I don't see any theoretical justification for this. The correlation between a feature and the response only tells us how strong the linear relationship is in a simple linear regression (a single explanatory variable), but theoretically, just because an explanatory variable has little to no marginal correlation with the response doesn't mean it will have the same poor predictive power in a multiple linear regression. Is this understanding correct? If so, is this user's approach not a very good one? Is what this user is doing what is typically done in practice?
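To make my concern concrete, here is a small simulation (not from the notebook; all variable names are made up) of a classic "suppressor" situation, where a feature has essentially zero marginal correlation with the response yet is indispensable once the other feature is in the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)   # the actual signal
e = rng.normal(size=n)   # noise shared by x1 and x2
x1 = z + e               # clearly correlated with y
x2 = e                   # marginally uncorrelated with y
y = z

# Marginal correlation of x2 with y is ~0, so a 0.2 threshold would drop it
print(np.corrcoef(x2, y)[0, 1])

# But in the multiple regression, y = x1 - x2 exactly, so x2 is essential
X = np.column_stack([x1, x2, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                           # roughly [1, -1, 0]

resid = y - X @ beta
print(1 - resid.var() / y.var())      # R^2 close to 1 once x2 is included
```

So a marginal-correlation filter like the notebook's would throw away x2 even though the two-variable model is nearly perfect.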
I know that methods like forward selection take a greedy approach, adding variables one at a time and starting with the variable most correlated with the response, although later steps re-evaluate candidates conditional on the variables already included. So from that perspective, I guess what this user did doesn't seem incorrect.
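For comparison, I imagine the two approaches would look something like this on the wine data (a sketch, assuming the standard UCI winequality-red.csv file with a 'quality' column; the file path, separator, and threshold are just my assumptions, not from the notebook):

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Load the red wine data (path and separator are assumptions)
df = pd.read_csv("winequality-red.csv", sep=";")
X, y = df.drop(columns="quality"), df["quality"]

# Notebook-style filter: keep features with |corr(feature, quality)| >= 0.2
corr_kept = X.columns[X.corrwith(y).abs() >= 0.2]
print("correlation filter keeps:", list(corr_kept))

# Forward selection: adds one feature at a time, but scores the *joint* model
# by cross-validation at each step rather than using marginal correlations
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=len(corr_kept), direction="forward"
)
sfs.fit(X, y)
print("forward selection keeps:", list(X.columns[sfs.get_support()]))
```

My understanding is that the two selected sets need not agree, since only the filter ignores what the other included variables already explain.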