I'm planning on doing feature selection with `RandomForestClassifier`, using its `feature_importances_` and `oob_score_` attributes. My plan is to recursively drop the 20% least important features and measure the OOB error until I see a significant drop, as recommended here.
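To make the plan concrete, here is a minimal sketch of what I have in mind (toy data from `make_classification`; the 20% drop ratio and stopping point are just placeholders):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data, just for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
features = np.arange(X.shape[1])  # indices of the currently kept features

base_rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)

# Recursively drop the 20% least important features, tracking the OOB score
while len(features) > 2:
    model = clone(base_rf).fit(X[:, features], y)
    print(len(features), "features, OOB score:", model.oob_score_)
    order = np.argsort(model.feature_importances_)  # ascending importance
    n_drop = max(1, int(0.2 * len(features)))
    features = features[order[n_drop:]]  # keep all but the least important 20%
```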
BUT, I'm baffled by a comment I saw in a scikit-learn example for tracking OOB errors. It says: "Setting the `warm_start` construction parameter to `True` is necessary for tracking the OOB error trajectory during training."
What do they mean by that? I was planning on using `clone` for each iteration with a new subset of features, and comparing the `oob_score_` of each classifier. Am I missing something, or is `warm_start` just a performance recommendation specific to that particular example?