
I am using CatBoost for a ranking task, with QueryRMSE as the loss function. I notice that for some features the feature importance values are negative, and I don't know how to interpret them.

The documentation says the i-th feature importance is calculated as the difference loss(model with i-th feature excluded) - loss(model).
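
For context, this is roughly how I obtain the importances. This is a minimal sketch on synthetic data rather than my real dataset; the names (`train`, `model`) and the data shapes are just for illustration, and it assumes a catboost version that provides `CatBoostRanker`:

```python
import numpy as np
from catboost import CatBoostRanker, Pool

# Synthetic stand-in for my data: 100 queries of 10 documents each.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)
group_id = np.repeat(np.arange(100), 10)
train = Pool(data=X, label=y, group_id=group_id)

model = CatBoostRanker(loss_function="QueryRMSE", iterations=200, verbose=False)
model.fit(train)

# LossFunctionChange = loss(model without feature i) - loss(model);
# it needs a dataset to evaluate the loss on, hence data=train.
importances = model.get_feature_importance(data=train, type="LossFunctionChange")
print(importances)  # some entries can come out negative
```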

So a negative feature importance value means that including the feature makes my loss go up? What would that suggest?

Kemeng Zhang

1 Answer


Having negative feature importances suggests that CatBoost is misled by the inclusion of those features during the modelling procedure. Simply put, these features give false information about the task at hand. In fairness, we would expect a good learning procedure to avoid such features, but their presence might point to over-fitting and/or poorly chosen hyper-parameters (e.g. a very large learning rate). This can also occur if the encoding of a categorical feature is very sparse and certain levels are over-fitted.

The above being said, please note that Shapley values can be negative and that is fine. A negative Shapley value simply suggests that a particular feature has a negative influence overall (e.g. age has a negative influence on adult human bone density).
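
To illustrate the distinction, CatBoost can return per-object SHAP values through the same `get_feature_importance` call. A minimal sketch on synthetic data (not the OP's setup; the second feature is deliberately constructed to have a negative effect on the target, so its SHAP contributions track its value negatively):

```python
import numpy as np
from catboost import CatBoostRanker, Pool

# Synthetic ranking data where feature 1 is built to hurt relevance.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)
group_id = np.repeat(np.arange(100), 10)
train = Pool(data=X, label=y, group_id=group_id)

model = CatBoostRanker(loss_function="QueryRMSE", iterations=200, verbose=False)
model.fit(train)

# "ShapValues" returns an (n_objects, n_features + 1) array; the last
# column is the expected value of the model, and the per-feature columns
# sum (with it) to each prediction.
shap = model.get_feature_importance(data=train, type="ShapValues")
contrib = shap[:, :-1]

# A feature with a negative overall influence shows up as a negative
# association between its value and its SHAP contribution.
for j in range(X.shape[1]):
    print(j, np.corrcoef(X[:, j], contrib[:, j])[0, 1])
```

A feature with a strongly negative correlation here is informative, just with a downward effect; that is very different from a negative LossFunctionChange importance, which says the model would fit better without the feature entirely.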

usεr11852