My question is about binary classification, say separating good customers from bad customers; it is not about regression or multi-class classification. In this context, a random forest is an ensemble of classification trees. For each observation, every tree votes "yes" or "no", and the average of all the trees' votes is the forest's final probability.
My question is about modifying the behavior of the underlying trees: how can the randomForest function (in the randomForest package in R) be modified so that each tree votes a decimal instead of a binary yes/no? To clarify what I mean by a decimal, let's think about how decision trees work.
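For reference, this is the default behaviour I am referring to. The sketch below is only illustrative; the data frame train and the response y are placeholder names, not real data.

    ## Default behaviour: each tree casts a hard 0/1 vote and the forest
    ## averages them. `train` and the factor response `y` are placeholders.
    library(randomForest)

    rf <- randomForest(y ~ ., data = train, ntree = 500)

    ## Fraction of (out-of-bag) trees voting for each class, per training case
    head(rf$votes)

    ## The same vote-averaging applied to new data
    head(predict(rf, newdata = train, type = "vote"))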
A fully grown decision tree has a single good or bad instance in each terminal node. Now assume I set the minimum terminal node size to 100. Then the terminal nodes will look like:
Node1 = 80 bad, 20 good
Node2 = 51 bad, 49 good
Node3 = 10 bad, 90 good
Notice that even though Node1 and Node2 both vote "bad", their "strength of bad-ness" is very different. That is what I am after. Instead of having each tree produce a 1 or a 0 (the default behavior), can the R package be modified so that the trees vote 80/100, 51/100, 10/100, and so on?
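To make the idea concrete, here is a rough sketch of the behaviour I am after, approximated outside the package rather than by changing its C code. It assumes data frames train and test with a factor response y (levels "bad" and "good"); all of these names are placeholders. It uses the nodes = TRUE option of predict.randomForest to recover, for each tree, the terminal node every observation lands in, and then averages the per-node class fractions across trees.

    ## Sketch: make each tree "vote" the fraction of bad cases in its
    ## terminal node, then average those fractions over the forest.
    ## `train`, `test`, and the response `y` are placeholder names.
    library(randomForest)

    set.seed(1)
    rf <- randomForest(y ~ ., data = train,
                       nodesize = 100,      # minimum terminal node size, as in the example
                       keep.forest = TRUE)

    ## Terminal node each observation falls into, one column per tree
    train_nodes <- attr(predict(rf, train, nodes = TRUE), "nodes")
    test_nodes  <- attr(predict(rf, test,  nodes = TRUE), "nodes")

    ## For one tree, the fraction of "bad" training cases in each terminal node
    node_frac_bad <- function(tree_idx) {
      tapply(train$y == "bad", train_nodes[, tree_idx], mean)
    }

    ## For each tree, look up the node fraction for every test case, then
    ## average across trees -- each tree now contributes a decimal vote.
    frac_per_tree <- sapply(seq_len(rf$ntree), function(k) {
      fracs <- node_frac_bad(k)
      fracs[as.character(test_nodes[, k])]
    })
    soft_vote <- rowMeans(frac_per_tree)
    head(soft_vote)   # e.g. 0.80, 0.51, 0.10 instead of hard 1/0 votes

Is this (or something equivalent) achievable inside randomForest itself, so that the per-tree decimal votes are produced directly rather than reconstructed afterwards?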