I have a complex dataset in which the number of features is much larger than the number of samples. The question is: which features are important for classifying the samples into 2 groups?

I think that ctree (after some feature engineering that takes possible interactions into account) is a good instrument for this. However, I need to present the results in a paper.

Do I need to cross-validate ctree in order to present some kind of "significance", e.g. "feature X appears 10 times out of 12 as the root split, so maybe it is important"? I would go with random forest feature importance (and shuffle the labels to get p-values), but as far as I know RF is parametric and ctree is non-parametric, which is preferable...
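
To make the label-shuffling idea concrete, here is a minimal sketch in Python. It assumes scikit-learn's `RandomForestClassifier` and its impurity-based `feature_importances_` as the importance measure; the function name `rf_importance_pvalues`, its parameters, and the synthetic data are hypothetical and only for illustration. It is not ctree and not necessarily the best importance measure for this problem.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_importance_pvalues(X, y, n_permutations=100, n_estimators=100, seed=0):
    """Empirical per-feature p-values for RF importances via label shuffling.

    Hypothetical helper: fits a forest on the real labels, then refits on
    shuffled labels to build a null distribution for each feature.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)

    def importances(labels, random_state):
        rf = RandomForestClassifier(n_estimators=n_estimators,
                                    random_state=random_state)
        rf.fit(X, labels)
        return rf.feature_importances_

    observed = importances(y, seed)

    # Count how often a null importance is at least as large as the observed one.
    exceed = np.zeros(X.shape[1])
    for i in range(n_permutations):
        null = importances(rng.permutation(y), seed + 1 + i)
        exceed += null >= observed

    # The +1 terms keep the empirical p-value strictly above zero.
    pvalues = (exceed + 1) / (n_permutations + 1)
    return observed, pvalues


# Synthetic example (made-up data): 40 samples, 50 features,
# only feature 0 carries signal about the two groups.
rng = np.random.default_rng(42)
y_demo = np.array([0, 1] * 20)
X_demo = rng.normal(size=(40, 50))
X_demo[:, 0] += 3.0 * y_demo

obs, pvals = rf_importance_pvalues(X_demo, y_demo,
                                   n_permutations=20, n_estimators=50)
print("most important feature:", int(np.argmax(obs)))
print("p-value of feature 0:", pvals[0])
```

The smallest attainable p-value here is 1 / (n_permutations + 1), so many more permutations would be needed in practice, and with far more features than samples the p-values would still need a multiple-testing correction.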

German Demidov
  • To my knowledge, random forest is non-parametric; why would you assume otherwise? – Scholar Feb 01 '19 at 10:40
  • But aren't the decision trees that a random forest builds based on parametric assumptions? I am sure that regression trees are: each split is performed according to the distribution of the residuals. I also know that, theoretically, a random forest can be built from any type of tree, ctree included, but I do not know where that is implemented... – German Demidov Feb 01 '19 at 11:25
  • What assumptions? – Scholar Feb 01 '19 at 13:32
  • Perhaps you should step back and look at the bigger picture: parametric vs non-parametric statistics: https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726 – Peter Teoh Feb 02 '19 at 03:29
  • https://stats.stackexchange.com/questions/147587/are-random-forest-and-boosting-parametric-or-non-parametric – Peter Teoh Feb 02 '19 at 03:30
  • @bi_scholar I was sure that the split at each point is performed according to some metric such as RSS in case of continuous output - I was wrong, sorry – German Demidov Feb 06 '19 at 08:15
  • @GermanDemidov That is indeed the case, but it doesn't make the algorithm parametric, because it does not imply that any particular distribution of the data-generating process is assumed. In theory, random forests can model any distribution, while (parametric) models such as LDA cannot. – Scholar Feb 06 '19 at 10:22
  • @bi_scholar Yep, agreed, so the crucial mistake in my question was writing "non-parametric" where I meant "robust to outliers" =( My fault. But thanks to you and Peter Teoh I understand the definitions much better now... – German Demidov Feb 06 '19 at 11:43

0 Answers