
I am trying to run several decision tree models (CHAID, C&RT, QUEST), but I have learned that several researchers apply a logistic regression model first in order to select risk factors. Once they have the significant factors, they use them to build a decision tree.

Is this the best way to do it? Do I really need to do this before fitting a decision tree? If so, what are the advantages?

Andy
    Do you have any references? It sounds a bit odd to use logistic regression to select features for decision tree. – Zhubarb Mar 04 '15 at 08:54
  • See whether the material on CHAID at http://stats.stackexchange.com/questions/7815/what-skills-are-required-to-perform-large-scale-statistical-analyses is helpful. – rolando2 Mar 04 '15 at 12:48
  • Zhurbarb, this is the reference that talks about doing Logistic regression before decision tree model: http://www.biomedcentral.com/1472-6963/14/382 – Rogelio Pujol Mar 05 '15 at 11:35

2 Answers


Doing feature selection based on statistical significance is a bad idea.

There are no real advantages to doing this. With enough data, all effects will be significant, so you would end up selecting all the variables. Not only that, but p-values don't tell you anything about the size of an effect, so you might end up selecting features that affect the outcome in a negligible way.

There is no need to do this.

Demetri Pananos

Perhaps some think of it as a way to do variable selection, via automated stepwise methods or otherwise, but I think it might hide some problems with the quality of the data-mining results.

Why not first fit all the models independently and do some quality checking via ROC curves or the like?
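That comparison might look something like this minimal scikit-learn sketch (the synthetic dataset and model settings are illustrative assumptions, not from the thread): fit each model on the same training split and compare out-of-sample AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

aucs = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    # AUC on held-out data, using predicted probabilities
    aucs[name] = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

Each model is judged on its own held-out performance, rather than letting one model pre-filter the inputs of another.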

It could also be appropriate to ensemble the models, for example by averaging their probability estimates.
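A minimal sketch of that kind of averaging, again on an illustrative synthetic dataset rather than anything from the thread:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Simple ensemble: average the two models' predicted probabilities
p_avg = (lr.predict_proba(X_te)[:, 1] + tree.predict_proba(X_te)[:, 1]) / 2
auc_avg = roc_auc_score(y_te, p_avg)
print(f"averaged ensemble: AUC = {auc_avg:.3f}")
```

scikit-learn's `VotingClassifier` with `voting="soft"` does essentially the same averaging, if you prefer a ready-made wrapper.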

Analyst