
I am trying to run several decision tree models (CHAID, C&RT, QUEST), but I have learned that several researchers apply a logistic regression model first in order to select risk factors. Once they have the significant factors, they use them to build a decision tree.

Is this the best way to do it? Do I really need to do this before fitting a decision tree? If so, what are the advantages?

Andy
    Do you have any references? It sounds a bit odd to use logistic regression to select features for decision tree. – Zhubarb Mar 04 '15 at 08:54
  • See whether the material on CHAID at http://stats.stackexchange.com/questions/7815/what-skills-are-required-to-perform-large-scale-statistical-analyses is helpful. – rolando2 Mar 04 '15 at 12:48
  • Zhurbarb, this is the reference that talks about doing Logistic regression before decision tree model: http://www.biomedcentral.com/1472-6963/14/382 – Rogelio Pujol Mar 05 '15 at 11:35

2 Answers


Doing feature selection based on statistical significance is a bad idea.

There are no real advantages to doing this. With enough data, all effects will be significant, so you would end up selecting all the variables. Not only that, but p-values don't tell you anything about the size of an effect, so you might end up selecting features that affect the outcome in a negligible way.

There is no need to do this.

Demetri Pananos

Perhaps some think of it as a way to do variable selection, via automated stepwise methods or otherwise, but I think it might hide some problems with the quality of the data-mining results.

Why not first fit all the models independently and do some quality checking via ROC curves or the like?
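That comparison might look something like this minimal scikit-learn sketch (the synthetic dataset and model settings are illustrative assumptions, not from the thread): fit each model on the same training split and compare out-of-sample AUC.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

aucs = {}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    # AUC on held-out data, using predicted probabilities
    aucs[name] = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {aucs[name]:.3f}")
```

Each model is judged on its own held-out performance, rather than letting one model pre-filter the inputs of another.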

It could also be appropriate to ensemble the models, for example by averaging their probability estimates.
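A minimal sketch of that kind of averaging, again on an illustrative synthetic dataset rather than anything from the thread:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

# Simple ensemble: average the two models' predicted probabilities
p_avg = (lr.predict_proba(X_te)[:, 1] + tree.predict_proba(X_te)[:, 1]) / 2
auc_avg = roc_auc_score(y_te, p_avg)
print(f"averaged ensemble: AUC = {auc_avg:.3f}")
```

scikit-learn's `VotingClassifier` with `voting="soft"` does essentially the same averaging, if you prefer a ready-made wrapper.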

Analyst