Is there an ExtraTreesClassifier-like classifier that has a decision boundary function like SVM?
The SVM boundary arises directly from how SVMs are defined. It's not something that can be "tacked on" to an arbitrary classifier because non-SVM classifiers are not defined in the same way. There are alternative ways for decision trees to make predictions, but I don't think these alternative tree-based methods will result in more informative predictions.
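For concreteness, here is a minimal sketch of that API difference in sklearn, on made-up synthetic data: SVC exposes a decision_function that returns signed distances to its separating hyperplane, whereas ExtraTreesClassifier exposes only partition-based class proportions via predict_proba.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.svm import SVC

# Synthetic binary data purely for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

svm = SVC().fit(X, y)
trees = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# SVC: signed distance to the separating hyperplane (the "boundary function").
print(svm.decision_function(X[:3]))

# ExtraTreesClassifier: class proportions aggregated over the trees' partitions.
print(trees.predict_proba(X[:3]))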
Tree-based models work by partitioning the space and then assigning scores to those partitions. When the contents of the partitions are pure, the scores are binary. This seems to be the case with your data: for whatever reason, every terminal node in every tree contains a single class. The model is telling you that it is very certain about the partitioning.
Gradient-boosted trees assign weights to each terminal partition, and then sum up these weights across all trees and use a sigmoid function to return probabilities. This method is more ornate than just predicting according to the proportion of the class in the terminal node. However, I believe that because your terminal nodes are all pure, what will happen is that the gradient-boosted tree will just assign very large weights (in the case of positive instances) or very small weights (in the case of negative instances) to each leaf. As a result, these predicted probabilities will tend to be very close to 0 or 1, but I wouldn't expect them to be much more informative than the binary predictions you have now because these predictions result from weights assigned to pure nodes.
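As a rough numerical sketch of that intuition (the per-tree leaf weights below are invented for illustration, not taken from any fitted model), when every pure leaf pushes the score in the same direction, the summed score is large and the sigmoid saturates:
import numpy as np

def sigmoid(score):
    """Map a summed leaf score to a probability, as gradient boosting does."""
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical leaf weights for one positive instance across 100 boosted trees.
# Pure leaves mean every tree nudges the score in the same direction.
leaf_weights = np.full(100, 0.15)

raw_score = leaf_weights.sum()   # boosting sums the leaf weights: 15.0
print(sigmoid(raw_score))        # ~0.9999997, i.e. essentially 1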
ExtraTreesClassifier is working as intended.
This seems to be a quirk of your data, not a property of sklearn's software.
We can verify this with a toy model. Many of the predictions below are strictly between 0 and 1, so we can conclude that ExtraTreesClassifier does give continuous-valued predictions. The fact that continuous-valued predictions don't occur in your data must be a quirk of how your data work. For whatever reason, ExtraTreesClassifier applied to your data gives predictions that are all 0 or 1.
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

if __name__ == "__main__":
    # Load the iris data and hold out a test set.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    # Fit an extremely randomized trees ensemble and predict class probabilities.
    xtc = ExtraTreesClassifier(n_estimators=129)
    xtc.fit(X=X_train, y=y_train)

    y_test_hat = xtc.predict_proba(X_test)
    print(y_test_hat)
This prints
[[0. 0.94573643 0.05426357]
[0.98449612 0.01550388 0. ]
[0. 0.00775194 0.99224806]
[0. 0.9379845 0.0620155 ]
[0. 0.85271318 0.14728682]
[1. 0. 0. ]
[0. 1. 0. ]
...]
Comparing models on the basis of continuous proper scoring rules is more powerful.
In your question, you compare two models on the basis of $F_1$ score. This is a discontinuous scoring rule because it discretizes the prediction information (probabilities, distances) when it calculates precision and recall. We have a number of posts about this, such as Why is AUC higher for a classifier that is less accurate than for one that is more accurate?
So let's compare these two toy models on the basis of $F_1$ and log-loss. One model is an ExtraTreesClassifier with unlimited depth, and the other is an ExtraTreesClassifier with depth 3.
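A sketch of that comparison, continuing the iris example above (macro-averaged $F_1$ is used here as one reasonable choice, and the exact numbers depend on the random train/test split, so don't expect to reproduce the table below digit for digit):
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score, log_loss
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

# One model with unlimited depth, one restricted to depth 3.
deep = ExtraTreesClassifier(n_estimators=129).fit(X_train, y_train)
shallow = ExtraTreesClassifier(n_estimators=129, max_depth=3).fit(X_train, y_train)

for name, model in [("Depth Unlimited", deep), ("Depth 3", shallow)]:
    probs = model.predict_proba(X_test)
    preds = model.predict(X_test)
    print(name,
          "F1:", round(f1_score(y_test, preds, average="macro"), 3),
          "Log-loss:", round(log_loss(y_test, probs), 3))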
When we limit depth to 3, the out-of-sample predictions are worse in the sense that more probability is allocated to the incorrect classes.
[[0.07983777 0.56589267 0.35426955]
[0.83532454 0.12041077 0.04426469]
[0.00446764 0.18980284 0.80572952]
...]
Comparing these models, we find that their $F_1$ scores are tied, while there is a dramatic difference in the log-loss.
$$
\begin{array}{l|r|r}
\text{Model} & F_1 & \text{Log-loss} \\\hline
\text{Depth Unlimited} & 1.0 & 0.041 \\\hline
\text{Depth 3} & 1.0 & 0.222
\end{array}
$$
The reason that $F_1$ and log-loss disagree should be obvious: discarding the information about confidence suppresses the differences between the two models. Log-loss retains that information: when the model is confident and correct, its log-loss is lower; contrariwise, when the model is confident but incorrect, its log-loss is higher.
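For reference, the log-loss in the table is the usual negative log-likelihood of the predicted probabilities,
$$
\text{log-loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log \hat{p}_{ik},
$$
where $y_{ik}$ is 1 if observation $i$ belongs to class $k$ and 0 otherwise, and $\hat{p}_{ik}$ is the predicted probability of class $k$. A confident, correct prediction puts $\hat{p}_{ik}$ near 1 for the true class and contributes almost nothing; a confident, incorrect one puts it near 0 and contributes a large penalty.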
But for $F_1$ score, the scoring is entirely binary. Either the sample is correctly classified, or it's not, and the precision and recall are computed as summaries of these binary decisions.
Indeed, there's a symmetry between your underlying problem -- your predictions are not differentiated -- and the mechanics of using continuous proper scoring rules. You want a continuous predictor, which is just another way of saying that you want the predicted probabilities to be informative. Continuous proper scoring rules reward models with informative probabilities.
Are binary predictions really a problem?
It's worth considering whether the binary predictions are a real problem. When you compare the out-of-sample predictions to their true labels, does the model do a good job? Does this model solve the problem that you have? If so, it might be sufficient to pick a number between 0 and 1 as your threshold and call it a day. Unfortunately, you won't have much information about the risks involved in choosing a poor threshold. All you can say for sure is that choosing a larger threshold decreases both the true positive rate and the false positive rate, while choosing a smaller threshold increases them both.
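If you do go the thresholding route, it is a one-liner. The sketch below assumes a binary problem (unlike the three-class iris example above) and uses synthetic data and a 0.5 cutoff purely as placeholders, not as recommendations:
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the asker's problem.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

threshold = 0.5                                    # placeholder, not a recommendation
probs_positive = clf.predict_proba(X_test)[:, 1]   # P(class = 1) for each test point
y_test_hat = (probs_positive >= threshold).astype(int)
print(y_test_hat[:10])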