I have a model (GBDT) where a feature X is not important when I add it (according to SHAP), but when I add other features and then add X again, X is suddenly the second most important feature!
What could explain that? How do I investigate what is going on?
You don't need a boosted tree, or even interaction terms/deep trees, to get this type of behavior. This is an example of omitted-variable bias, which can show up in a context as fundamental as ordinary least-squares regression.
This answer shows a simple example in which adding a new categorical predictor actually flips the direction of a continuous predictor's relationship to the outcome. That was just from adding the categorical predictor to the model, without an interaction term. In ordinary linear regression this is a risk if you omit a predictor that is correlated both with the outcome and with the included predictors. In logistic regression, more closely related to your boosted decision trees, an omitted predictor doesn't even need to be correlated with the included predictors to cause trouble; see this page for an analytical demonstration in probit probability models.
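For intuition, here is a minimal simulated sketch in the spirit of that example (hypothetical data, using statsmodels): the ordinary least-squares coefficient of x comes out negative when the confounding group indicator is omitted, and recovers its true positive sign once the indicator is included.

```python
# Hypothetical simulation of omitted-variable bias: the sign of x's coefficient
# flips once the (confounding) group indicator is added to the regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
group = rng.integers(0, 2, n)                 # omitted categorical predictor
x = rng.normal(0, 1, n) + 3 * group           # x is correlated with group
y = 2 * x - 10 * group + rng.normal(0, 1, n)  # group also drives the outcome

# Model 1: x alone -- its coefficient is biased (here it even turns negative)
print(sm.OLS(y, sm.add_constant(x)).fit().params)

# Model 2: x plus the group indicator -- x recovers its true positive effect
X_full = sm.add_constant(np.column_stack([x, group]))
print(sm.OLS(y, X_full).fit().params)
```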
It will probably be hard to figure out directly from your gradient-boosted trees, with their high interaction depths, just what is going on. I suspect that a logistic regression involving feature X and the features that you added to the model, along with an examination of the correlations of those added predictors with X, will provide important and readily interpretable hints.
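A rough sketch of that check is below. The data here are synthetic placeholders; in practice, substitute your own data frame containing feature X, the features you added, and your binary target.

```python
# Sketch of the suggested investigation: check how X correlates with the added
# predictors, then compare logistic-regression fits with and without them.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
A = rng.normal(size=n)                     # placeholder for an added feature
X = 0.8 * A + rng.normal(size=n)           # X correlated with the added feature
y = (A + 0.5 * X + rng.normal(size=n) > 0).astype(int)
df = pd.DataFrame({"X": X, "A": A})

# How strongly is X correlated with the added predictor(s)?
print(df.corr()["X"])

# Logistic regression with X alone, then with the added feature(s) included;
# compare the coefficient of X across the two fits.
fit_x = sm.Logit(y, sm.add_constant(df[["X"]])).fit(disp=0)
fit_all = sm.Logit(y, sm.add_constant(df)).fit(disp=0)
print(fit_x.params["X"], fit_all.params["X"])
```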
It sounds like you have a response that has a very non-linear relationship to the features you're using. By itself, X may have a weak relationship to the target, but in combination with another feature it becomes important. If you're using neural nets or random forests (or other decision-tree ensembles) that can identify interactions between variables, this type of relationship will reveal itself. With a decision-tree ensemble you can inspect the trees to identify where two variables co-occur in the same tree, but this is much harder to do with a neural network.
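As a toy illustration (synthetic data, made-up feature layout), here two features are individually uninformative but jointly determine the target through an XOR-style interaction; a GBDT picks this up, and SHAP then ranks both features highly even though each is weak on its own. TreeExplainer's interaction values make the pairwise contribution explicit for tree models (this assumes the shap package is installed).

```python
# Toy example: the target depends only on the interaction of the first two
# columns (an XOR), yet mean |SHAP| ranks those two columns at the top.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
import shap

rng = np.random.default_rng(0)
n = 2000
x0 = rng.integers(0, 2, n)
x1 = rng.integers(0, 2, n)
noise = rng.normal(0, 1, (n, 3))            # irrelevant filler features
y = x0 ^ x1                                 # target is the pure interaction
features = np.column_stack([x0, x1, noise])

model = GradientBoostingClassifier().fit(features, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(features)
print(np.abs(shap_values).mean(axis=0))     # columns 0 and 1 dominate

# Pairwise interaction contributions, per sample and feature pair:
interaction = explainer.shap_interaction_values(features)
```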