Why are important features not correlated with the target variable?

Question

I'm testing whether a feature's importance is related to how strongly that feature correlates with the target.

I'm working on the Titanic dataset.

I plotted the feature importances (using xgboost):

I checked the correlation between the two most important features (Fare, Age) and the target (Survived).
I also checked the correlation between the least important feature (Sex) and the target (Survived).
I used three different correlation methods.

Results:

Type: pearson, fare cor: 0.2573065223849625
Type: pearson, Age cor: -0.06980851528714314
Type: pearson, Sex cor: -0.5433513806577555

Type: spearman, fare cor: 0.32373613944480834
Type: spearman, Age cor: -0.03910946205127973
Type: spearman, Sex cor: -0.5433513806577551
   
Type: kendall, fare cor: 0.2662286416742869
Type: kendall, Age cor: -0.03268974393136027
Type: kendall, Sex cor: -0.5433513806577552

As the results show, there seems to be no relation at all between how important a feature is and how strongly it correlates with the target.

Am I right?
If so, when is it a good idea to use correlation? (In this example, whether a feature is correlated with the target seems to say nothing about how important it is.)

Although you are asking about xgboost, this question is answered in the context of linear regression here: https://stats.stackexchange.com/questions/33888/x-and-y-are-not-correlated-but-x-is-significant-predictor-of-y-in-multiple-regr/34016#34016 https://stats.stackexchange.com/questions/28474/how-can-adding-a-2nd-iv-make-the-1st-iv-significant? https://stats.stackexchange.com/questions/73869/suppression-effect-in-regression-definition-and-visual-explanation-depiction — rep_ho, Apr 12 '21 at 13:58

score 3 · Answer 1 · edited Apr 12 '21 at 14:25

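Not necessarily. Pearson correlation measures the strength of a linear relationship. Age appears to have a weak correlation, but the relationship between age and the outcome may not be linear. Rank correlations such as Spearman and Kendall relax this to monotonic relationships, but they can still miss a relationship that is not monotonic. See the Wikipedia entry for correlation for some examples in which x and y are related but the correlation is 0.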
I'm not a big fan of correlation. Feature importance via correlation seems to miss a lot of important variables. I demonstrate this in one of my blog posts. Correlation feature selection (which would be akin to what you're doing here) fails to result in superior performance over other methods across 2 real datasets and 1 simulated dataset. I have little confidence in its ability to successfully pick out good predictors (unless those predictors are linearly related to the outcome and not confounded by any other variables).
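
To make the point about nonlinearity concrete, here is a minimal sketch on synthetic data (not the Titanic data). The helpers `pearson` and `kendall_tau_a` are hand-rolled for illustration; Kendall's tau is computed in its simplest tau-a form, where tied pairs contribute zero. With y = x² on a range symmetric around zero, y is completely determined by x, yet both coefficients are essentially zero. On the non-negative half alone the relationship is monotonic, so Kendall's tau reaches 1 while Pearson stays below 1.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        score += (product > 0) - (product < 0)  # +1, -1, or 0 for a tie
    n = len(xs)
    return score / (n * (n - 1) / 2)


# Synthetic, non-monotonic relationship: y depends entirely on x.
x_full = list(range(-50, 51))
y_full = [x * x for x in x_full]

# Same function restricted to x >= 0, where it is monotonic but not linear.
x_half = list(range(0, 51))
y_half = [x * x for x in x_half]

pearson_full = pearson(x_full, y_full)
kendall_full = kendall_tau_a(x_full, y_full)
pearson_half = pearson(x_half, y_half)
kendall_half = kendall_tau_a(x_half, y_half)

print(f"full range:  pearson={pearson_full:.3f}, kendall={kendall_full:.3f}")
print(f"x >= 0 only: pearson={pearson_half:.3f}, kendall={kendall_half:.3f}")
```

A tree model has no trouble with either case, since it can split on x at several thresholds, which is one way a feature can be useful to xgboost while showing little correlation with the target.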

Kendall correlation doesn't assume a linear relationship, does it? So I assumed (wrongly) that Kendall would catch the relation between those features and the target. — Boom, Apr 13 '21 at 03:35

Ben Reiniger · Answer 2 · 2021-04-12T15:36:00.907

Tree models' measures of feature importance have been called into question in general.

But also, in xgboost's Python implementation, get_score defaults to "weight", which counts the number of splits that use a feature. This obviously hurts low-cardinality features like sex (which should be highly predictive in Titanic): a binary feature can be split on only once along any root-to-leaf path, so even if it is used for the first split of every tree, its weight-importance will be low if your trees are deep enough.
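
To see why split counts can mislead, here is a toy sketch in plain Python; it does not call xgboost. The `trees` structure and every gain value in it are invented for illustration and are not taken from a fitted model. Each tree is listed as the splits it contains, and the binary feature is used once, at the root, with a large gain.

```python
from collections import Counter

# Hypothetical model: each tree is a list of (feature, gain) splits.
# All gain values are made up for illustration only.
trees = [
    [("Sex", 120.0), ("Fare", 8.0), ("Age", 6.0), ("Fare", 3.0),
     ("Age", 2.0), ("Fare", 1.5), ("Age", 1.0)],
    [("Sex", 60.0), ("Fare", 5.0), ("Age", 4.0), ("Fare", 2.0),
     ("Age", 1.5), ("Fare", 1.0)],
    [("Sex", 30.0), ("Age", 3.0), ("Fare", 2.5), ("Fare", 1.0),
     ("Age", 0.5)],
]

weight = Counter()      # number of splits that use each feature
total_gain = Counter()  # summed gain of those splits

for tree in trees:
    for feature, gain in tree:
        weight[feature] += 1
        total_gain[feature] += gain

# Rank features from most to least important under each measure.
rank_by_weight = [name for name, _ in weight.most_common()]
rank_by_gain = [name for name, _ in total_gain.most_common()]

print("by split count:", rank_by_weight)
print("by total gain: ", rank_by_gain)
```

In this toy setup, counting splits ranks Sex last while summing gain ranks it first. In xgboost itself, get_score accepts an importance_type argument, so comparing the default against importance_type="gain" or "total_gain" on the real model is a quick check.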
