
I have a general question about asymmetric costs. In machine learning problems, there are times when the cost of a false positive is different from the cost of a false negative. Accordingly, models should be built differently to account for this asymmetry in costs.

How is this done for a random forest?

Some possible ways are:

  1. Changing the information gain calculated when considering different splits in a given branch of a decision tree to account for asymmetry
  2. Adjusting the threshold from 0.5 within each leaf when assigning the predicted label of a positive class in a given decision tree
  3. Adjusting the threshold from 0.5 within the collection of decision trees when "voting" on the predicted label for the random forest (see the sketch below)
  4. Using ROC curves and choosing a different threshold than what is typically chosen (typically, the threshold closest to the top-left corner of the ROC graph is chosen as the "ideal")

Which of these ways are implemented in practice to account for asymmetric costs?
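For concreteness, here is a minimal sketch of what I mean by (3), assuming scikit-learn (the toy dataset and the 0.3 threshold are made up):

```python
# Minimal sketch of option (3): shift the forest-level voting threshold
# away from 0.5. The toy data and the 0.3 threshold are made up.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict_proba averages the per-tree class probabilities; predicting the
# positive class whenever that average exceeds 0.3 (instead of 0.5)
# accepts more false positives in exchange for fewer false negatives.
proba_pos = clf.predict_proba(X)[:, 1]
y_pred = (proba_pos >= 0.3).astype(int)
```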

mnmn
  • Random Forest classifiers are generally built to node purity, so (2) is not an option. – Scholar May 22 '19 at 09:11
  • Shouldn't this be done though? I agree the code that random forest is built on generally does not do this, but you could conceivably customize the node purity calculation to take into account different weightings for a positive and negative class. – mnmn May 22 '19 at 14:16
  • How is that possible? Since the leaf nodes consist of only one class, the ratio of class labels is either 0 or undefined, so applying different weights does nothing. – Scholar May 22 '19 at 14:55
  • (2) You do not want specific threshold values between nodes; you are really after tuning how the tree is built, not how the tree evaluates. (3) Voting thresholds would be quite bad too, since most forests will collect probabilities from the trees rather than 0s or 1s. (1) is implemented pretty much everywhere (and works well), since adding weights to samples is how AdaBoost is performed. Never thought about (4); I need to think on it a while. – grochmal May 22 '19 at 15:04
  • @grochmal Something to add regarding the points you've made for (3) and (1): First, (3) and (4) are equivalent and common practice. Second, I don't see any reason why AdaBoost is relevant as a justification for why (1) is common practice. – Scholar May 22 '19 at 15:28
  • @bi_scholar - I guess you're quite right that (3) and (4) are very similar. In a Random Forest you will not allow the decision tree to reach clean leaves; you will stop after 2-3 splits and leave the tree at that. The voting results from such a tree are the ratios of each class in the final leaf. I'll still argue that you will apply the threshold on a sum of tree outputs and do ROC on binary votes, but that's a minimal difference (or can you do ROC on non-binary votes?). As for AdaBoost, it is really an implementation point - there are pretty much no RF implementations without sample weights, because of AdaBoost. – grochmal May 22 '19 at 16:07
  • @grochmal please get your facts straight. Random Forests are generally built to node purity, as this minimizes the model bias, and Random Forests work by reducing model variance only. Performing only 2-3 splits makes no sense in Random Forests! You can verify this in [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf) or even the original paper by Breiman. – Scholar May 22 '19 at 16:13
  • @grochmal "As for ada boost is really an implementation point - there are pretty much no RF implementations without sample weights, because of ada boost". What is this supposed to mean? You repeatedly mentioned Ada boost. Why? It's completely irrelevant in this discussion. Are you mixing these models up by any chance? In tree boosting, performing only a small number of splits is very common. Again Ada boost is NOT an implementation of Random Forest and vice versa, they are completely different models. – Scholar May 22 '19 at 16:15
  • @bi_scholar - You will want to add some limits to the forest tree builds, e.g. minimum samples per leaf, impurity thresholds. If you end up with an isolated sample somewhere, it can take several splits to get it out - which will be costly. P.S. language - I understand I'm the implementation guy and you're the theoretician, but that is no reason to be calling each other names. – grochmal May 22 '19 at 16:20
  • @grochmal Where are you getting your information from? Did you bother to check my citations? I'm all for a pleasant Q&A environment, but you are actively spreading false information and this needs to be addressed, especially since the OP is a new user. There is little to no reason to add regularization to trees in a random forest, and most definitely not extreme cases such as stopping after 2-3 splits. PS: Consider providing factual evidence in the form of proper citations whenever your arguments are challenged, and don't view it as a personal attack. – Scholar May 22 '19 at 16:31
  • @bi_scholar - I'm thinking directly of sklearn forests - that's what I use. They will always have some form of min_samples_leaf enabled, and will not do binary voting unless asked. The second part is easy to find in [the code - around line 600 we have the voting from the individual classifiers](https://github.com/scikit-learn/scikit-learn/blob/b7b4d3e2f/sklearn/ensemble/forest.py) - predict_proba will give the probability output instead of binary outputs. – grochmal May 22 '19 at 16:45
  • @grochmal I'm asking for actual evidence, a peer reviewed paper where the authors apply regularization to a Random Forest model and lay down their reasoning and their results. The fact that sklearn provides min_leaf_size as an hyper-parameters is meaningless and how you deduce your earlier statement, that one should only do 2-3 splits per tree, from this is beyond me. **Breiman, the actual author of Random Forests, suggest to build to node purity.** PS: This board is about statistics, not code. – Scholar May 22 '19 at 16:56
  • @bi_scholar - I think we are hitting the [classic meta question of this SE - where do code and implementation come in?](https://stats.meta.stackexchange.com/questions/4990/are-machine-learning-questions-about-an-incorrectly-working-piece-of-code-consid). And the answer seems to be "what the OP needs to have explained", which I admit is not 100% clear either way for this question. But yes, this is really a meta discussion. That said, I got to think a bit and exercised, which is a plus for me. – grochmal May 22 '19 at 17:07
  • @grochmal that's fair. I assume that this comment chain will be removed at some point anyway. – Scholar May 22 '19 at 18:42
  • Just a note about weighting classes in the splitting criterion: https://stats.stackexchange.com/q/68940/2719 – Simone May 23 '19 at 06:45
  • @Simone that's a great link. Do you know if that is the right (or only appropriate) procedure to account for asymmetric costs in classification with a random forest? – mnmn May 24 '19 at 13:19
  • @mnmn reweighting classes according to costs yields effects similar to cost-sensitive classification. In that link I was mentioning MetaCost: http://weka.sourceforge.net/doc.stable/weka/classifiers/meta/MetaCost.html – Simone May 24 '19 at 13:59

1 Answer


Misclassification costs can often be dealt with through class weights, the same way unbalanced classes can. This means that if the misclassification cost is higher for a class, elements of that class will be more influential when making predictions.
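As a minimal sketch of the idea, scikit-learn's RandomForestClassifier accepts a class_weight argument; the 5:1 cost ratio and toy data below are made up, and whether a given implementation pushes these weights into the splitting criterion is a separate matter, as discussed further down:

```python
# Sketch: encode a 5:1 false-negative vs. false-positive cost ratio as
# class weights (the ratio is illustrative, not a recommendation).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# class_weight multiplies each sample's contribution according to its
# class, so misclassifying the positive class is penalized more heavily.
clf = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 1, 1: 5},
    random_state=0,
).fit(X, y)
```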

For decision trees and random forests, this has been shown in the Weighted Random Forest paper by Chen, Liaw, and Breiman ("Using Random Forest to Learn Imbalanced Data", 2004), which I would say puts together points 1 and 2 of your question.

Indeed, Weighted Random Forest uses a weighted version of the Gini impurity to make the splits. This means that the weighted Gini impurity is at its maximum when the weighted sums of the elements of each class are equal (normally, Gini impurity is maximal when the elements are evenly distributed across the classes) (1). At the same time, it also means that the threshold used when determining the majority class of a node is not 0.5 but comes from the ratio of the class weights. Finally, the same applies to predictions (2), since the threshold is shifted by the weights in the same way.
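To make this concrete, here is a sketch for a binary problem with class counts $n_0, n_1$ in a node and class weights $w_0, w_1$ (the notation is mine, not the paper's):

$$p_k^{w} = \frac{w_k\, n_k}{w_0 n_0 + w_1 n_1}, \qquad G_w = 1 - \left(p_0^{w}\right)^2 - \left(p_1^{w}\right)^2 .$$

The node is labeled positive when $w_1 n_1 > w_0 n_0$, i.e. when the raw positive proportion $n_1/(n_0 + n_1)$ exceeds $w_0/(w_0 + w_1)$ instead of $0.5$.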

Unfortunately, to this day I do not know of any major statistical package that uses this method, as class weights are usually used for over/undersampling of the classes, which is more specific to the unbalanced-classes problem.

Finally, using the ROC curve is always advisable when your classes and/or your costs are not balanced, so that you can tweak the threshold to balance the results of your classifier.
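A sketch of that threshold tweak, assuming scikit-learn and made-up costs of 1 per false positive and 5 per false negative: evaluate the expected cost at every ROC operating point and keep the cheapest threshold.

```python
# Sketch: choose the probability threshold that minimizes expected cost
# on held-out data (the costs and toy data are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

c_fp, c_fn = 1.0, 5.0  # assumed asymmetric misclassification costs

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba_val = clf.predict_proba(X_val)[:, 1]

# roc_curve yields one (fpr, tpr) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_val, proba_val)
n_pos, n_neg = (y_val == 1).sum(), (y_val == 0).sum()

# Expected cost at each threshold: false positives cost c_fp each,
# false negatives cost c_fn each.
cost = c_fp * fpr * n_neg + c_fn * (1 - tpr) * n_pos
best_threshold = thresholds[np.argmin(cost)]
```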

Davide ND