0

$\newcommand{\R}{\mathbb{R}}$

In the context of the implementation of tree based models, suppose that a predictor space $X \in \R^n$ is under a partition process of the algorithm recursive binary splitting.

So the recursive binary splitting compute a metric across those predictors, in the case of continuous variables used the residual sum of the squares and when the values are predefined and can identify under a label, categorical variables, used cross-entropy or gini index.

How the implementations that use the mechanism describe above can automatically identify which of the predictor variables, $X_1, ..., X_n$, are categorical and which ones are continuos? or just the identification stage of the algorithm, the stage to define which metric compute for the predictor $X_i$, it's based in the data type of each variable?

  • Could you provide a sufficiently clear definition of "categorical" and "continuous" that would distinguish them *as implemented in a digital computing system*? If the dataset is sufficiently large, then it might be safe to characterize any variable with many ties in its values as "categorical," but even that is not necessarily the case. The problem is that these characterizations--"categorical" and "continuous"--are really *conceptual* distinctions that help guide your modeling decisions more than they are inherent properties of anything. See https://stats.stackexchange.com/questions/206. – whuber Aug 30 '17 at 12:48
  • [whuber](https://stats.stackexchange.com/users/919/whuber) my question is more oriented to the implementation. For example, if is inspected the variable `Years` of a patient ${14, 28, 32, 12, ..}$, how the algorithm determine: *"ahh! this is a continuos variable, use the RSS to compute the loss function based in the cut points and find the minimum"*? or just the implementation relies on: *"variable `Years` is a float type variable, so I use RSS, or the case of `Years` is a integer type then I use gini or entropy measures"*. – Cristóbal Alcázar Aug 30 '17 at 13:38
  • You would usually just tell that to your algorithm. Either explicitly through an additional function parameter which contains a list of the types of variables or implicitly. The algorithm would for example assume that objects of the type double are continuous and hat strings are categorical. – David Ernst Aug 30 '17 at 17:27

0 Answers0