I'm looking at the analysis for the red wine dataset on https://datauab.github.io/red_wine_quality/. This is a Jupyter notebook, and if you go down to about In [9], it states: "From all these features, we are going to select the ones with bigger numbers since these are the ones that will give us more information. To do so we are going to stablish a minimum threshold of correlation approximately around 0.2 (absolut value) since we do not have to take into account features whose values might be redundant and not provide information at all."
I don't see any theoretical justification for this. The correlation between a feature and the response only tells us how strong the linear relationship is in a simple linear regression (a single explanatory variable), but theoretically, just because an explanatory variable has little to no marginal correlation with the response doesn't mean it will have the same poor predictive power in a multiple linear regression. Is this understanding correct? If so, is this user's approach not a very good one? Is what this user is doing what is typically done in practice?
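To make my concern concrete, here is a small simulation (not from the notebook; all variable names are made up) of a classic "suppressor" situation, where a feature has essentially zero marginal correlation with the response yet is indispensable once the other feature is in the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)   # the actual signal
e = rng.normal(size=n)   # noise shared by x1 and x2
x1 = z + e               # clearly correlated with y
x2 = e                   # marginally uncorrelated with y
y = z

# Marginal correlation of x2 with y is ~0, so a 0.2 threshold would drop it
print(np.corrcoef(x2, y)[0, 1])

# But in the multiple regression, y = x1 - x2 exactly, so x2 is essential
X = np.column_stack([x1, x2, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                           # roughly [1, -1, 0]

resid = y - X @ beta
print(1 - resid.var() / y.var())      # R^2 close to 1 once x2 is included
```

So a marginal-correlation filter like the notebook's would throw away x2 even though the two-variable model is nearly perfect.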
I know that methods like forward selection take a greedy approach, adding variables one at a time and starting with the variable most correlated with the response, although later steps re-evaluate candidates conditional on the variables already included. So from that perspective, I guess what this user did doesn't seem incorrect.
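For comparison, I imagine the two approaches would look something like this on the wine data (a sketch, assuming the standard UCI winequality-red.csv file with a 'quality' column; the file path, separator, and threshold are just my assumptions, not from the notebook):

```python
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Load the red wine data (path and separator are assumptions)
df = pd.read_csv("winequality-red.csv", sep=";")
X, y = df.drop(columns="quality"), df["quality"]

# Notebook-style filter: keep features with |corr(feature, quality)| >= 0.2
corr_kept = X.columns[X.corrwith(y).abs() >= 0.2]
print("correlation filter keeps:", list(corr_kept))

# Forward selection: adds one feature at a time, but scores the *joint* model
# by cross-validation at each step rather than using marginal correlations
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=len(corr_kept), direction="forward"
)
sfs.fit(X, y)
print("forward selection keeps:", list(X.columns[sfs.get_support()]))
```

My understanding is that the two selected sets need not agree, since only the filter ignores what the other included variables already explain.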