
I'm testing whether the most important features are also the ones most strongly correlated with the target.

I'm working on the titanic dataset.

I plotted the feature importance (using xgboost): [image: feature importance plot]

  • I checked whether there is a relation (correlation) between the top 2 important features (Fare, Age) and the target (Survived).
  • I also checked the correlation between the least important feature (Sex) and the target (Survived).
  • I used 3 different correlation methods.

Results:

Type: pearson, fare cor: 0.2573065223849625
Type: pearson, Age cor: -0.06980851528714314
Type: pearson, Sex cor: -0.5433513806577555

Type: spearman, fare cor: 0.32373613944480834
Type: spearman, Age cor: -0.03910946205127973
Type: spearman, Sex cor: -0.5433513806577551
   
Type: kendall, fare cor: 0.2662286416742869
Type: kendall, Age cor: -0.03268974393136027
Type: kendall, Sex cor: -0.5433513806577552
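
For readers who want to run the same kind of check, here is a minimal sketch of how such numbers could be computed with pandas. It is not the code used in the question: the data frame is synthetic (random values, not the Titanic data), the 0/1 encoding of Sex is an assumption, and the spearman and kendall methods need scipy installed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Titanic columns (NOT the real dataset).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Sex": rng.integers(0, 2, size=n),        # assumed 0/1 encoding
    "Age": rng.uniform(1, 80, size=n),
    "Fare": rng.exponential(30.0, size=n),
})
# Hypothetical outcome that depends only on Sex.
p_survive = np.where(df["Sex"] == 1, 0.2, 0.7)
df["Survived"] = (rng.random(n) < p_survive).astype(int)

# Correlation of each feature with the target, for each method.
results = {}
for method in ("pearson", "spearman", "kendall"):
    for col in ("Fare", "Age", "Sex"):
        results[(method, col)] = df[col].corr(df["Survived"], method=method)
        print(f"Type: {method}, {col} cor: {results[(method, col)]}")
```

When both variables are binary, the three coefficients coincide (all reduce to the phi coefficient), which would explain why the Sex values in the results above agree to many decimal places if Sex was encoded as 0/1.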

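As the results show, there seems to be no relationship between how important a feature is and how strongly it correlates with the target: the two most important features (Fare, Age) are only weakly correlated with Survived, while the least important one (Sex) has the strongest correlation.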

  1. Am I right?
  2. If so, when is it a good idea to use correlation? (In this example, whether a feature is correlated with the target seems to have no bearing on its importance.)
Boom
    Although you are talking about xgboost, this question is answered in the context of linear regression here: https://stats.stackexchange.com/questions/33888/x-and-y-are-not-correlated-but-x-is-significant-predictor-of-y-in-multiple-regr/34016#34016 https://stats.stackexchange.com/questions/28474/how-can-adding-a-2nd-iv-make-the-1st-iv-significant? https://stats.stackexchange.com/questions/73869/suppression-effect-in-regression-definition-and-visual-explanation-depiction – rep_ho Apr 12 '21 at 13:58

2 Answers

  1. Not necessarily. Pearson correlation measures the strength of a linear relationship, and the rank-based Spearman and Kendall coefficients measure the strength of a monotonic one. Age appears to have a weak correlation, but the relationship between age and the outcome may be neither linear nor monotonic. See the Wikipedia entry for correlation for some examples in which x and y are related but the correlation is 0.

  2. I'm not a big fan of correlation. Feature importance via correlation seems to miss a lot of important variables. I demonstrate this in one of my blog posts. Correlation feature selection (which would be akin to what you're doing here) fails to result in superior performance over other methods across 2 real datasets and 1 simulated dataset. I have little confidence in its ability to successfully pick out good predictors (unless those predictors are linearly related to the outcome and not confounded by any other variables).
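
To make point 1 concrete, here is a small sketch (a toy example, not Titanic data) in which y is fully determined by x, yet all three coefficients come out at essentially zero because the relationship is symmetric rather than monotonic. The variable names are made up for illustration, and the rank-based methods assume scipy is installed.

```python
import numpy as np
import pandas as pd

# y is an exact function of x, but the relationship is U-shaped.
x = pd.Series(np.arange(-100, 101))
y = x ** 2

# Compute all three coefficients between x and y.
corrs = {m: x.corr(y, method=m) for m in ("pearson", "spearman", "kendall")}
for method, value in corrs.items():
    print(f"{method}: {value:.6f}")
```

A tree model has no trouble with this kind of pattern, since it can split on x at two thresholds, which is one way a feature can be important while showing almost no correlation with the target.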

Arya McCarthy
Demetri Pananos
  • 1
    Kendall correlation doesn't assume a linear relationship, does it? So I assumed (wrongly) that Kendall would catch the relation between those features and the target. – Boom Apr 13 '21 at 03:35

Tree models' measures of feature importance have been called into question in general.

But also, in xgboost's Python implementation, get_score defaults to importance_type="weight", which counts the number of times a feature is used to split. This obviously hurts small-cardinality features like sex (which should be highly predictive in titanic): a binary feature can be used to split only once along any root-to-leaf path, so even if it is used for the first split of every tree, its weight-importance will be low if your trees are deep enough.

Ben Reiniger