
I'm having trouble understanding exactly how to obtain the regularization parameter when pruning a decision tree with the minimal cost complexity approach. Assume the cost complexity function is represented as

$$C(T) = R(T) + \alpha|T|,$$

where $\alpha$ is the regularization parameter to be chosen.

Using the entire data set, we then apply weakest-link cutting to obtain a sequence of $\alpha$'s and the corresponding sub-trees that minimize the cost for each given $\alpha$.
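For concreteness, here is a rough sketch of that step using scikit-learn's `cost_complexity_pruning_path`; the data set and variable names are just placeholders of mine:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; any X, y would do.
X, y = load_breast_cancer(return_X_y=True)

# Weakest-link cutting on the entire sample: scikit-learn returns the
# increasing sequence of alpha values at which the optimal sub-tree changes.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# The sub-tree minimizing the cost for a given alpha is recovered by
# refitting with ccp_alpha set to that value.
subtrees = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
            for a in alphas]
```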

Next, we generally use K-fold cross-validation, and this is where the pruning approach becomes unclear to me. For K-fold CV we grow K trees. I would think we would then use the original $\alpha$'s, obtained from the entire sample, to identify the sequence of optimal sub-trees within each fold. We would then proceed with CV, selecting the $\alpha$ with the smallest average error.

However, several sources (these lecture notes and Introduction to Statistical Learning, p. 309) seem to suggest that a new set of $\alpha$'s is obtained within each fold. Let's refer to the set of $\alpha$'s obtained within the $k$th fold as $\alpha^{(k)}$. This does not make sense to me. It is unlikely that each entry of the set $\alpha$ (i.e. the set of $\alpha$'s obtained from the entire data set) will match an entry of $\alpha^{(k)}$, or even that the elements of $\alpha^{(k)}$ will match those of $\alpha^{(j)}$. How can we pick the entry of $\alpha$ that minimizes the cost when $\alpha^{(k)}$ potentially shares no entries with $\alpha$?

Jacob H

1 Answer


It is as you say. For each of the K folds you obtain a sequence $\alpha^{(k)}$. Each of these sequences is in general different. Now, let $\alpha^*$ be the "union" of all the sequences: in other words, $\alpha^*$ is the set of all values of the cost-complexity parameter at which, in at least one of the folds, we transition from one sub-tree to another.

The idea is then to compute the cross-validated error at values halfway between contiguous values of $\alpha^*$ (I seem to recall that in the original reference on CART the authors propose the geometric mean of contiguous values of $\alpha^*$) and pick the value which minimizes that cross-validated error. With that value of $\alpha$ you then prune the tree grown on the whole sample.
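If it helps, here is a rough sketch of that procedure with scikit-learn, reusing the `X`, `y` placeholders from the snippet in the question; the misclassification rate as error measure and all variable names are my own choices:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

# alpha^(k): the weakest-link sequence obtained within each fold's training part.
fold_alphas = [
    DecisionTreeClassifier(random_state=0)
    .cost_complexity_pruning_path(X[tr], y[tr]).ccp_alphas
    for tr, _ in folds
]

# alpha*: the "union" of all the per-fold sequences.
alpha_star = np.unique(np.concatenate(fold_alphas))

# Candidate values: geometric means of contiguous elements of alpha*.
candidates = np.sqrt(alpha_star[:-1] * alpha_star[1:])

# Cross-validated misclassification error for every candidate alpha.
fold_errors = np.empty((len(candidates), len(folds)))
for i, a in enumerate(candidates):
    for k, (tr, te) in enumerate(folds):
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X[tr], y[tr])
        fold_errors[i, k] = 1.0 - tree.score(X[te], y[te])
cv_error = fold_errors.mean(axis=1)

# Pick the minimizer and use it to prune a tree grown on the whole sample.
best_alpha = candidates[np.argmin(cv_error)]
final_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```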

F. Tusell
  • I am not completely sure how to select the final value of alpha given [scikit-learn's example](https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py). Having trained a tree on the entire training set, one can find a finite number of $\alpha$-values (illustrated on the x-axis of the example's graph). Next, suppose you have found an optimal $\alpha^*$ from the cross-validation procedure. How do you match this $\alpha^*$ with the $\alpha$-values found from the entire training set? – Tanguy Apr 07 '20 at 11:28
  • You do not find values from the entire training sample. You obtain the optimal $\alpha^*$ as you have described, then grow a (possibly oversize) tree from the full training sample and prune it to size using $\alpha^*$. – F. Tusell Apr 08 '20 at 12:08
  • The problem is that there is no correspondence between the $\alpha^*$ (found during the cross-validation procedure) and the $\alpha$-values corresponding to the sub-trees (which were found by pruning the tree trained on the full training sample). The $\alpha^*$ is expected to lie between two different $\alpha$-values, so which $\alpha$-value (each corresponding to a specific sub-tree) do we end up choosing? – Tanguy Apr 08 '20 at 16:18
  • [Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html) selects "the subtree with the largest cost complexity that is smaller than [the given] $\alpha$" (see documentation on the `ccp_alpha` parameter), but why not choose the subtree which has the cost complexity just above that value? – Tanguy Apr 08 '20 at 16:25
  • Don't know about Scikit-learn. You could select either of the two $\alpha$'s on either side of $\alpha^*$ and it would not make much of a difference: one leaf more or less in the tree. In the original book by Breiman et al. on CART they propose the 1 S.E. rule: pick the largest $\alpha$ (= smallest tree) such that the cross-validated error is within 1 standard error of that corresponding to $\alpha^*$ (sketched below). This is in my experience a very conservative rule unless you have oodles of data; I prefer something closer to $\alpha^*$ (without much concern about whether to the right or left). – F. Tusell Apr 10 '20 at 15:56
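
To make that concrete, here is a small sketch of the 1 S.E. rule mentioned in the last comment, reusing the `candidates` and `fold_errors` arrays from the earlier snippet (again, just one possible way to code it):

```python
import numpy as np

# Per-candidate CV error and its standard error across the K folds.
cv_error = fold_errors.mean(axis=1)
cv_se = fold_errors.std(axis=1, ddof=1) / np.sqrt(fold_errors.shape[1])

# 1 S.E. rule: largest alpha (smallest tree) whose CV error stays within
# one standard error of the minimum.
best = np.argmin(cv_error)
one_se_alpha = candidates[np.where(cv_error <= cv_error[best] + cv_se[best])[0].max()]
```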