Does it make sense to use the top n features by importance from Random Forest in a logistic regression?

Question

I am new to machine learning and am very lost trying to decide on features from a data set. The data set has over 25,000 observations and just under 500 features. I have a 0/1 churn variable, where 0 is churn, and I am attempting to build a classification model for it. Does it make sense to fit a random forest, take the top 10 variables by importance from that model, and use them in a logistic regression? I am hoping the logistic regression will make the results more interpretable for presentation purposes.

score 0 · Answer 1 · edited Apr 13 '17 at 12:44

Using the top 10 variables from a random forest, ranked by importance, seems like a completely valid idea to me. However, it also depends on the data and on how the importance is distributed across the variables: if it is spread thinly over many variables, the top 10 may capture only a small part of the signal. I would definitely try more than one method. I am a big fan of random forests, and they can even be made interpretable, as discussed here.
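As a rough illustration of the approach discussed above, here is a minimal sketch using scikit-learn. It is not the asker's data or a definitive recipe: the data set is synthetic (generated with `make_classification`), and `TOP_N`, the number of trees, and the variable names are assumptions chosen for the example. The selection step is placed inside the pipeline so that the random forest ranking is refit on each cross-validation training fold rather than on the full data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption: the real data is not available).
X, y = make_classification(
    n_samples=2000, n_features=50, n_informative=8, random_state=0
)

TOP_N = 10  # number of features to keep, as in the question

# Rank features with a random forest and keep the TOP_N most important ones.
# threshold=-np.inf makes max_features the only selection criterion.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    max_features=TOP_N,
    threshold=-np.inf,
)

# Selection, scaling, and logistic regression in one pipeline.
pipeline = make_pipeline(selector, StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validated AUC of the logistic regression on the selected features.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print("mean cross-validated AUC:", scores.mean())

# Fit once on all data to inspect what was selected.
pipeline.fit(X, y)
fitted_selector = pipeline.named_steps["selectfrommodel"]
importances = fitted_selector.estimator_.feature_importances_

# Share of total importance held by the TOP_N features: a low value suggests
# the importance is spread over many variables and 10 may be too few.
order = np.argsort(importances)[::-1]
top_share = importances[order[:TOP_N]].sum()
print("importance share of top features:", top_share)

# Column indices of the selected features and their logistic regression coefficients.
selected = np.flatnonzero(fitted_selector.get_support())
coefs = pipeline.named_steps["logisticregression"].coef_.ravel()
for idx, coef in zip(selected, coefs):
    print(f"feature {idx}: coefficient {coef:+.3f}")
```

Comparing the cross-validated score here with that of the random forest alone, or of a logistic regression on a different feature subset, is one way to follow the advice to try more than one method.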

answered Feb 21 '17 at 23:10

discipulus
