Is there an ExtraTreesClassifier-like classifier that has a decision boundary function like SVM?
The SVM boundary arises directly from how SVMs are defined. It's not something that can be "tacked on" to an arbitrary classifier because non-SVM classifiers are not defined in the same way. There are alternative ways for decision trees to make predictions, but I don't think these alternative tree-based methods will result in more informative predictions.
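For concreteness, here is a minimal sketch of that API difference in sklearn, on made-up synthetic data: SVC exposes a decision_function that returns signed distances to its separating hyperplane, whereas ExtraTreesClassifier exposes only partition-based class proportions via predict_proba.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC

# Synthetic binary data purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

svm = SVC().fit(X, y)
trees = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# SVC: signed distance to the separating hyperplane (the "boundary function").
print(svm.decision_function(X[:3]))

# ExtraTreesClassifier: class proportions aggregated over the trees' partitions.
print(trees.predict_proba(X[:3]))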
Tree-based models work by partitioning the space and then assigning scores to those partitions. When the contents of the partitions are pure, the scores are binary. This seems to be the case with your data: for whatever reason, every terminal node in every tree contains a single class. The model is telling you that it is very certain about the partitioning.
Gradient-boosted trees assign weights to each terminal partition, and then sum up these weights across all trees and use a sigmoid function to return probabilities. This method is more ornate than just predicting according to the proportion of the class in the terminal node. However, I believe that because your terminal nodes are all pure, what will happen is that the gradient-boosted tree will just assign very large weights (in the case of positive instances) or very small weights (in the case of negative instances) to each leaf. As a result, these predicted probabilities will tend to be very close to 0 or 1, but I wouldn't expect them to be much more informative than the binary predictions you have now because these predictions result from weights assigned to pure nodes.
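As a rough numerical sketch of that intuition (the per-tree leaf weights below are invented for illustration, not taken from any fitted model), when every pure leaf pushes the score in the same direction, the summed score is large and the sigmoid saturates:
import numpy as np

def sigmoid(score):
    """Map a summed leaf score to a probability, as gradient boosting does."""
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical leaf weights for one positive instance across 100 boosted trees.
# Pure leaves mean every tree nudges the score in the same direction.
leaf_weights = np.full(100, 0.15)

raw_score = leaf_weights.sum()   # boosting sums the leaf weights: 15.0
print(sigmoid(raw_score))        # ~0.9999997, i.e. essentially 1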
ExtraTreesClassifier is working as intended.
This seems to be a quirk of your data, not a property of sklearn's software.
We can verify this with a toy model. Many of the predictions below are strictly between 0 and 1, so we can conclude that ExtraTreesClassifier does give continuous-valued predictions. The fact that continuous-valued predictions don't occur in your data must be a quirk of how your data work. For whatever reason, ExtraTreesClassifier applied to your data gives predictions that are all 0 or 1.
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    # Load the iris data and hold out a test set.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # Fit an extremely randomized trees ensemble and predict class probabilities.
    xtc = ExtraTreesClassifier(n_estimators=129)
    xtc.fit(X=X_train, y=y_train)

    y_test_hat = xtc.predict_proba(X_test)
    print(y_test_hat)
This prints
[[0. 0.94573643 0.05426357]
[0.98449612 0.01550388 0. ]
[0. 0.00775194 0.99224806]
[0. 0.9379845 0.0620155 ]
[0. 0.85271318 0.14728682]
[1. 0. 0. ]
[0. 1. 0. ]
...]
Comparing models on the basis of continuous proper scoring rules is more powerful.
In your question, you compare two models on the basis of $F_1$ score. This is a discontinuous scoring rule because it discretizes the prediction information (probabilities, distances) when it calculates precision and recall. We have a number of posts about this, such as Why is AUC higher for a classifier that is less accurate than for one that is more accurate?
So let's compare these two toy models on the basis of $F_1$ and log-loss. One model is an ExtraTreesClassifier with unlimited depth, and the other is an ExtraTreesClassifier with depth 3.
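A sketch of that comparison, continuing the iris example above (macro-averaged $F_1$ is used here as one reasonable choice, and the exact numbers depend on the random train/test split, so don't expect to reproduce the table below digit for digit):
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score, log_loss
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

# One model with unlimited depth, one restricted to depth 3.
deep = ExtraTreesClassifier(n_estimators=129).fit(X_train, y_train)
shallow = ExtraTreesClassifier(n_estimators=129, max_depth=3).fit(X_train, y_train)

for name, model in [("Depth Unlimited", deep), ("Depth 3", shallow)]:
    probs = model.predict_proba(X_test)
    preds = model.predict(X_test)
    print(name,
          "F1:", round(f1_score(y_test, preds, average="macro"), 3),
          "Log-loss:", round(log_loss(y_test, probs), 3))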
When we limit depth to 3, the out-of-sample predictions are worse in the sense that more probability is allocated to the incorrect classes.
[[0.07983777 0.56589267 0.35426955]
[0.83532454 0.12041077 0.04426469]
[0.00446764 0.18980284 0.80572952]
...]
Comparing these models, we find that their $F_1$ scores are tied, while there is a dramatic difference in the log-loss.
$$
\begin{array}{l|r|r}
\text{Model} & F_1 & \text{Log-loss} \\\hline
\text{Depth Unlimited} & 1.0 & 0.041 \\\hline
\text{Depth 3} & 1.0 & 0.222
\end{array}
$$
The reason that $F_1$ and log-loss disagree should be obvious: discarding the information about confidence suppresses the differences between the two models. Log-loss retains that information: when the model is confident and correct, its log-loss is lower; contrariwise, when the model is confident but incorrect, its log-loss is higher.
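For reference, the log-loss in the table is the usual negative log-likelihood of the predicted probabilities,
$$
\text{log-loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik},
$$
where $y_{ik}$ is 1 if observation $i$ belongs to class $k$ and 0 otherwise, and $\hat{p}_{ik}$ is the predicted probability of class $k$. A confident, correct prediction puts $\hat{p}_{ik}$ near 1 for the true class and contributes almost nothing; a confident, incorrect one puts it near 0 and contributes a large penalty.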
But for $F_1$ score, the scoring is entirely binary. Either the sample is correctly classified, or it's not, and the precision and recall are computed as summaries of these binary decisions.
Indeed, there's a symmetry between your underlying problem -- your predictions are not differentiated -- and the mechanics of using continuous proper scoring rules. You want a continuous predictor, which is just another way of saying that you want the predicted probabilities to be informative. Continuous proper scoring rules reward models with informative probabilities.
Are binary predictions really a problem?
It's worth considering whether the binary predictions are a real problem. When you compare the out-of-sample predictions to their true labels, does the model do a good job? Does this model solve the problem that you have? If so, it might be sufficient to pick a number between 0 and 1 as your threshold and call it a day. Unfortunately, you won't have much information about the risks involved in choosing a poor threshold. All you can say for sure is that choosing a larger threshold decreases both the true positive rate and the false positive rate, while choosing a smaller threshold increases them both.
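If you do go the thresholding route, it is a one-liner. The sketch below assumes a binary problem (unlike the three-class iris example above) and uses synthetic data and a 0.5 cutoff purely as placeholders, not as recommendations:
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the asker's problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

threshold = 0.5                                    # placeholder, not a recommendation
probs_positive = clf.predict_proba(X_test)[:, 1]   # P(class = 1) for each test point
y_test_hat = (probs_positive >= threshold).astype(int)
print(y_test_hat[:10])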