I am new to machine learning, so please excuse me if my question is naive.

Is it possible to use trees, for example rpart or ctree, after a variable selection procedure such as Lasso or Random Forest, to study the interactions among the selected variables? Are there any published academic articles on this?

Thank you

user3571389
    Related: http://stats.stackexchange.com/questions/164048/can-random-forest-be-used-for-feature-selection-in-multiple-linear-regression/164068#164068 – Sycorax Feb 16 '16 at 16:23

1 Answer

No, this would not be correct. Methods that properly combine penalization with variable selection (elastic net is one of the better ones) cannot be used to obtain variables that are then fed into a regular unpenalized analysis. The first, comprehensive analysis properly penalizes for context. Example: if you have to screen 1000 candidate predictors to arrive at 10 good predictors, the regression coefficients for these 10 will be tremendously discounted (shrunk). If you had started with 20 candidate variables instead of 1000 and arrived at 10, far less shrinkage would be needed. Put simply: the harder you have to work to achieve good predictions, the more you should distrust the analysis.
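To make the 1000-versus-20 example concrete, here is a rough simulation sketch. It rests on assumptions that are not in the answer above: every candidate predictor is pure noise, selection is a simple top-10 screen on absolute marginal correlation (a stand-in for Lasso or Random Forest, not an implementation of either), and the function names, sample size, and seed are arbitrary illustrative choices. It compares how strong the 10 selected variables look on the data used to select them with how strong they look on fresh data.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def screen(n_candidates, n_keep=10, n_obs=50, seed=0):
    """Keep the n_keep candidates most correlated with y (hypothetical helper).

    All candidates are noise, so any apparent association is produced
    by the selection step alone. Returns the mean absolute correlation
    of the kept variables on the selection data and on fresh data.
    """
    rng = random.Random(seed)
    # Data used for selection.
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    X = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)]
    # Independent data for the same variables (still pure noise).
    y_new = [rng.gauss(0, 1) for _ in range(n_obs)]
    X_new = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_candidates)]

    apparent = [abs(pearson(col, y)) for col in X]
    keep = sorted(range(n_candidates), key=lambda j: apparent[j], reverse=True)[:n_keep]

    mean_apparent = sum(apparent[j] for j in keep) / n_keep
    mean_fresh = sum(abs(pearson(X_new[j], y_new)) for j in keep) / n_keep
    return mean_apparent, mean_fresh


big = screen(1000)   # 10 variables kept out of 1000 candidates
small = screen(20)   # 10 variables kept out of 20 candidates

print("1000 candidates: apparent %.2f, fresh data %.2f" % big)
print("  20 candidates: apparent %.2f, fresh data %.2f" % small)
```

In this setup the 10 survivors of the 1000-candidate screen look much stronger on the selection data than the 10 survivors of the 20-candidate screen, although none of them is related to the outcome. An unpenalized second-stage analysis sees only the 10 survivors and has no record of how many candidates were screened to find them, which is the context the answer says must be penalized for.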

Frank Harrell