Let's say you don't have much data to fit a model, but you still want a sense of feature importance for that model. SHAP values are a useful tool for that.

Without much data, how can we trust the SHAP values returned?

For example, suppose we run K-fold cross-validation on our small dataset, fit 5 models, and calculate the SHAP values for each fold. Those SHAP values are going to vary a lot between folds simply because we are data-poor and each model could fit the data very differently. The same would happen if we fit multiple models on different bootstrap samples.
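As a rough illustration of that setup, here is a minimal sketch that produces one importance vector per fold. To stay self-contained it uses only NumPy and a plain least-squares linear model, for which the SHAP value of feature i is `w_i * (x_i - mean(x_i))` when features are assumed independent. That model choice, the synthetic data, and the helper names `linear_shap` and `per_fold_importance` are all assumptions made for the example, not part of the original question.

```python
import numpy as np

def linear_shap(weights, X_background, X_eval):
    """SHAP values of a linear model, assuming independent features."""
    return weights * (X_eval - X_background.mean(axis=0))

def per_fold_importance(X, y, n_folds=5, seed=0):
    """Return an (n_folds, n_features) array of mean |SHAP| per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    importances = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Ordinary least squares with an intercept column.
        A = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
        # Explain the held-out rows using the training rows as background.
        shap_values = linear_shap(coef[:-1], X[train_idx], X[test_idx])
        importances.append(np.abs(shap_values).mean(axis=0))
    return np.array(importances)

# Synthetic small dataset (made up for illustration): 30 rows, 4 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 4))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=30)

fold_importance = per_fold_importance(X, y)
print(np.round(fold_importance, 3))
```

Each row of `fold_importance` is one fold's mean absolute SHAP value per feature, so comparing rows shows how much the importances move from fold to fold.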

The SHAP variable importance rank can change drastically from one model to another.

What would be a good measure of how much we can trust the SHAP feature importance ranking?

My guess would be to calculate the variance of each variable's importance rank across the models, then take the mean of those rank variances across all variables. The more the variable ranks change, the less I can trust them. But I was wondering if there is a better measure I should use, or any paper I could read that covers this subject.
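The rank-variance idea described above could be sketched as follows. The importance numbers are made up, and `rank_stability` is a hypothetical helper name. The input is assumed to be an array with one row per model (fold or bootstrap sample) and one column per feature, holding something like the mean absolute SHAP value.

```python
import numpy as np

def rank_stability(fold_importance):
    """fold_importance: (n_models, n_features), larger = more important."""
    # Rank 1 = most important feature within each model (ties broken by column order).
    order = np.argsort(-fold_importance, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(order.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, order.shape[1] + 1)
    # Sample variance of each feature's rank across models, then the mean over features.
    rank_variance = ranks.var(axis=0, ddof=1)
    return ranks, rank_variance, rank_variance.mean()

# Made-up importances for 3 models and 3 features.
fold_importance = np.array([
    [0.90, 0.40, 0.10],
    [0.80, 0.10, 0.30],
    [0.70, 0.35, 0.20],
])

ranks, rank_variance, mean_rank_variance = rank_stability(fold_importance)
print(ranks)               # per-model ranks
print(rank_variance)       # per-feature rank variance
print(mean_rank_variance)  # single summary number
```

A value of 0 means every model produced the same ranking. The per-feature variances are worth looking at alongside the mean, since they show which features are responsible for the instability.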

Thanks,

EtienneT

1 Answer

This is quite an old question without any replies, but I have a suggestion here: https://medium.com/@lucasramos_34338/visualizing-variable-importance-using-shap-and-cross-validation-bd5075e9063a

There I compute SHAP values for all cross-validation iterations and combine them. You can check the individual results, compare them to the combined one, and see whether they actually deviate a lot.

Lucas Ramos
  • Hi @Lucas Ramos, your article is really good, thanks for that. Could you explain this part? I don't understand it: test_set = list_test_sets[0] shap_values = np.array(list_shap_values[0]) for i in range(0, len(list_test_sets)): test_set = np.concatenate((test_set, list_test_sets[i]), axis = 0) shap_values = np.concatenate((shap_values, np.array(list_shap_values[i])), axis = 1) Why are you appending list_shap_values[0] twice, once outside the for loop and once inside it? Should the range in the for loop start from 1? – JSVJ Apr 29 '21 at 07:08
  • Could you please tell me how to plot the SHAP dependence plots? I am kind of new to this. Thanks in advance. – JSVJ Apr 29 '21 at 08:40
  • Dear @JSVJ, thanks for your comment. You actually found a mistake there: the for loop should indeed start at 1. I will update that in the code. – Lucas Ramos May 03 '21 at 09:22
  • @JSVJ I've updated the code and included a short example for dependence plots. Let me know if you need further help. – Lucas Ramos May 03 '21 at 09:33
  • Thank you very much. I will check it. – JSVJ May 04 '21 at 07:33
  • Hi @Lucas Ramos, thank you. But I couldn't find the code in the blog. Could you kindly share the link to the code? – JSVJ May 06 '21 at 03:04
  • Sure, there you go, hope it helps: https://gist.github.com/L-Ramos/743319d0c405b386d481c924e0fc6789#file-shap_cross_validation-ipynb – Lucas Ramos May 07 '21 at 05:55
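For reference, the fix agreed on in the comments (starting the loop at 1 so fold 0 is not included twice) could look like the sketch below. The random arrays are hypothetical stand-ins for real per-fold outputs. The shape `(n_classes, n_samples, n_features)` for each fold's SHAP values is an assumption inferred from the `axis = 1` concatenation in the quoted snippet, not something stated in the thread.

```python
import numpy as np

# Hypothetical per-fold outputs: three folds with 4, 3 and 3 test rows.
rng = np.random.default_rng(0)
fold_sizes = [4, 3, 3]
n_features, n_classes = 5, 2
list_test_sets = [rng.normal(size=(n, n_features)) for n in fold_sizes]
# Assumed shape per fold: (n_classes, n_samples, n_features).
list_shap_values = [rng.normal(size=(n_classes, n, n_features)) for n in fold_sizes]

# Seed the accumulators with fold 0 ...
test_set = list_test_sets[0]
shap_values = np.array(list_shap_values[0])

# ... then append the remaining folds, starting at 1 so fold 0 is not duplicated.
for i in range(1, len(list_test_sets)):
    test_set = np.concatenate((test_set, list_test_sets[i]), axis=0)
    shap_values = np.concatenate((shap_values, np.array(list_shap_values[i])), axis=1)

print(test_set.shape)     # rows from all folds stacked together
print(shap_values.shape)  # sample axis matches the combined test set
```

With `range(0, ...)` the combined arrays would contain fold 0 twice, which would overweight that fold in any plot or summary built from them.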