0

I have implemented the tree building part of CART, but I'm wondering if it can be optimized. According to this quora answer, the complexity should be $\mathcal O(v \cdot n \log n)$ where $v$ is the number of features and $n$ the number of records.

Now my implementation looks as follows (in python, with some implementation details left out):

best_cost = float('inf')
best_split = None
for i in range(n_features):
  for features, target in data:
    left, right = list(), list()
    for x, y in data:
      if x[i] < features[i]:
        left.append((x, y))
      else:
        right.append((x,y))

    gini = 0
    for node in [left, right]:
       p = sum([1 if target == 1 else 0 for (feature, target) in node]) / len(node)
       gini += calc_gini(p)

    if gini < best_cost:
      best_cost = gini
      best_split = (left, right)

If I'm not mistaken the first loop (for i...) is the $v$, the second loop (for feature, target in...) is $n$ but then the third loop is identical to the second, making it $n^2$. So together $\mathcal O(vn^2)$.

I have a feeling that by sorting the data in some way, some loops can be stopped early. But to do that you have to order the data first which is at best $\mathcal O(vn \log n)$. Perhaps by sorting the data once, the whole tree can be build in the $nv \log n$ time order, tuhs the whole tree is built in $vn \log n + mvn \log n = (vn \log n)(1+m)$ where $m$ is the total resulting number of nodes, but this is pure speculation on my part.

How does the CART algorithm find the greedy best split in only $\mathcal O(v\cdot n \log n)$ time?

Jurgy
  • 111
  • 1
  • 3

0 Answers0