This question is somewhat related to: What are variable importance rankings useful for?

I am performing some basic churn modeling.

Let's say, for instance, that I am training a GBM on my dataset, which has about 10M rows and 100 columns.

Some of these columns are somewhat correlated: for example, one is the cost, and others are indicator variables for discount codes. When I run my model, the cost is one of the top features, but the indicator variables for discount codes rank very low, even below my random variable.

The GBM still uses a discount-code indicator for splits, so there may be some signal in it, but it ranks below the random variable. In this case, would it be worthwhile to remove it to prevent overfitting?
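To make the setup concrete, here is a minimal sketch of the comparison I have in mind, using scikit-learn on small synthetic data rather than my real dataset. Everything in it is an assumption for illustration: the column names (`cost`, `discount`, `noise`), the way churn is generated, and the choice of permutation importance and validation AUC as the yardsticks. It fits the model with and without the discount indicator and measures importances on held-out rows, so the indicator can be judged against the random probe by something other than split counts.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 3000

# Synthetic stand-in data (hypothetical, not the real churn dataset).
discount = rng.integers(0, 2, size=n)             # indicator for a discount code
cost = 50 - 15 * discount + rng.normal(0, 5, n)   # cost is correlated with the indicator
noise = rng.normal(size=n)                        # random probe feature with no signal
logit = 0.08 * (cost - 42) + 0.3 * discount
churn = rng.random(n) < 1 / (1 + np.exp(-logit))

names = ["cost", "discount", "noise"]
X = np.column_stack([cost, discount, noise])
X_tr, X_va, y_tr, y_va = train_test_split(X, churn, test_size=0.3, random_state=0)


def fit_auc(cols):
    """Fit a GBM on the given column indices and return (model, validation AUC)."""
    model = GradientBoostingClassifier(random_state=0).fit(X_tr[:, cols], y_tr)
    auc = roc_auc_score(y_va, model.predict_proba(X_va[:, cols])[:, 1])
    return model, auc


full_model, auc_full = fit_auc([0, 1, 2])   # all three features
_, auc_without = fit_auc([0, 2])            # discount indicator dropped

# Permutation importance on held-out data: the drop in AUC when a column is shuffled.
perm = permutation_importance(
    full_model, X_va, y_va, scoring="roc_auc", n_repeats=5, random_state=0
)
importances = dict(zip(names, perm.importances_mean))

print(f"validation AUC with discount:    {auc_full:.4f}")
print(f"validation AUC without discount: {auc_without:.4f}")
for name, value in importances.items():
    print(f"{name:>8}: {value:+.4f}")
```

If the validation AUC is essentially unchanged without the indicator and its held-out permutation importance is indistinguishable from that of `noise`, that would seem to argue for dropping it, though with correlated features like cost and discount the importance can be shared between them, which is part of what I am unsure about.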

BrianW

0 Answers