
Since decision trees don't use all the input features and select them during training, is it useful to do feature selection beforehand?

As I see it, choosing features beforehand will decrease computing time (and perhaps reduce the risk of overfitting on a small dataset?), but since several weak features combined can perform better than a few strong ones, I may also end up with worse predictions.

EDIT: Bonus question: Is there a way to select features before a decision tree, or should I let it do the work?
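
To make the bonus question concrete, here is a minimal sketch of the kind of pre-selection I have in mind, assuming scikit-learn and using an L1-penalised logistic regression as a classification stand-in for LASSO (the dataset and parameter values are purely illustrative):

    # Hypothetical sketch: L1-based feature selection, then a decision tree
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                               random_state=0)

    # Keep only the features the L1-penalised model assigns non-zero weights to
    selector = SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1))
    X_reduced = selector.fit_transform(X, y)

    # Fit the tree on the reduced feature set
    tree = DecisionTreeClassifier(random_state=0).fit(X_reduced, y)
    print(X_reduced.shape)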

CoMartel
  • The question is how can you select them before the decision tree? – Metariat Sep 19 '16 at 07:36
  • I was thinking of using a feature selection technique, maybe LASSO or something else – CoMartel Sep 19 '16 at 07:46
  • Variables that are important in LASSO don't necessarily have the same relationship with the outcome as in a decision tree. You can see a related question here: http://stats.stackexchange.com/questions/164048/can-random-forest-be-used-for-feature-selection-in-multiple-linear-regression – Metariat Sep 19 '16 at 07:53
  • Ok, I see your point. The remaining question is: is there a way to select features before a decision tree, or should I let it do the work? – CoMartel Sep 19 '16 at 08:11
  • In my personal experience, I don't see any way to select features before building the tree. – Metariat Sep 19 '16 at 08:12
  • If your goal is to reduce features to make a smaller or more general tree, you may want to use dimensionality reduction like PCA rather than feature selection like LASSO. Unlike feature selection, which just keeps a subset of features, dimensionality reduction may merge features. Another option is to just prune your tree (allow only n levels). Yet another option is to get the "importance" (predictive power) of each feature by predetermining its information gain, a common split criterion in trees, and removing the unimportant ones. (See the sketch below this comment thread.) – Victor Stoddard Jul 04 '17 at 22:24
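
A minimal sketch of the first two options above (PCA before the tree, and pruning by capping the depth), assuming scikit-learn; the dataset and parameter values are only illustrative:

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=30, random_state=0)

    # (a) dimensionality reduction: project onto 10 principal components,
    #     then fit the tree on the merged features
    pca_tree = make_pipeline(PCA(n_components=10),
                             DecisionTreeClassifier(random_state=0))
    pca_tree.fit(X, y)

    # (b) pruning: allow only n levels in the tree
    pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)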

1 Answer


Decision Trees are pretty good at finding the most important features: they consider all features and split on the one that separates the class labels best (in terms of entropy).
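
For example, with scikit-learn you can simply fit a tree and inspect the importance it assigned to each feature afterwards; this is only a minimal sketch on a toy dataset:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    # criterion="entropy" makes the tree split on information gain
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
    print(tree.feature_importances_)  # one importance score per input feature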

Random Forests are even better in this regard, because some implementations (like scikit-learn's) sample the features and use only a subset of them at each split. In general, Random Forests are also more robust than single decision trees.
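
A minimal sketch of that feature subsampling in scikit-learn, where max_features controls how many features each split considers (the values here are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    # Each split in each tree looks at a random subset of sqrt(n_features) features
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                    random_state=0).fit(X, y)
    print(forest.feature_importances_)  # importances averaged over the forest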

If you want, you can compute the Information Gain before building a Decision Tree to see how much information a particular feature carries about the label:

https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
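
As a rough sketch of that idea, scikit-learn's mutual_info_classif estimates the mutual information (the information-gain quantity above) between each feature and the label without building any tree; the dataset is just an example:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import mutual_info_classif

    X, y = load_iris(return_X_y=True)
    # One score per feature: how much knowing the feature reduces label uncertainty
    print(mutual_info_classif(X, y, random_state=0))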

Peter Csizsek