Let's say you don't have much data to fit a model, but you still want a sense of feature importance for that model. SHAP values are a very interesting tool for this.
Given so little data, how much can we trust the SHAP values we get back?
For example, suppose we run 5-fold cross-validation on our small dataset, fit a model on each fold, and compute SHAP values for each of the five models. Those SHAP values can vary a lot between folds, simply because we are data-poor and each model may fit the data quite differently; the same would happen if we fit multiple models on different bootstrap samples. The SHAP variable-importance ranking can then change drastically from one model to another (a sketch of this setup is below).
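For concreteness, here is a minimal sketch of what I mean, with a toy dataset, a tree-based model, and `shap.TreeExplainer` as placeholder choices (nothing here is specific to my actual data):

```python
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Toy stand-in for a small dataset
X, y = make_regression(n_samples=100, n_features=8, noise=10.0, random_state=0)

fold_importances = []  # mean |SHAP| per feature, one row per fold
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = RandomForestRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    shap_values = shap.TreeExplainer(model).shap_values(X[test_idx])
    fold_importances.append(np.abs(shap_values).mean(axis=0))

fold_importances = np.vstack(fold_importances)  # shape: (n_folds, n_features)
# Each row implies a feature-importance ranking; across rows they can disagree a lot
print(np.argsort(-fold_importances, axis=1))  # per-fold order, most important first
```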
What would be a good measure of how much we can trust the SHAP feature-importance ranking?
My guess would be to compute the variance of each variable's importance rank across the folds, then take the mean of that rank variance over all variables (sketched below): if the variables' ranks change a lot, I can trust the ranking less. But I was wondering whether there is a better measure I should use, or a paper I could read that covers this subject.
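As a sketch of that measure (the `fold_importances` matrix here is made-up per-fold mean |SHAP| values, e.g. the output of the loop above):

```python
import numpy as np

# Hypothetical mean |SHAP| values per fold: rows = folds, columns = features
fold_importances = np.array([
    [0.90, 0.50, 0.40, 0.10],
    [0.80, 0.30, 0.60, 0.10],
    [0.70, 0.60, 0.50, 0.20],
])

# Rank features within each fold (0 = most important) via a double argsort
ranks = np.argsort(np.argsort(-fold_importances, axis=1), axis=1)

# Variance of each feature's rank across folds...
per_feature_rank_variance = ranks.var(axis=0)
# ...averaged over features: lower = more stable ranking = (presumably) more trustworthy
mean_rank_variance = per_feature_rank_variance.mean()
print(per_feature_rank_variance, mean_rank_variance)
```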
Thanks,