
Hi, I am developing a fraud prediction model. Because this is a highly imbalanced classification problem, I have chosen to tackle it with random forests.

Inspired by this article
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
I have chosen to try Balanced Random Forests.

For now I am not sure how to implement these forests in R.
The article suggests the following: for each iteration of the random forest, draw a bootstrap sample from the minority class,
then randomly draw the same number of cases, with replacement, from the majority class.

Is this achieved by specifying these parameters?

replace = TRUE  
strata = fraud.variable  
sampsize = c(x,x) where x is the size of samples to be drawn
C Ried
MiksL

4 Answers


You can balance your random forests using case weights. Here's a simple example:

library(ranger) #Best random forest implementation in R

#Make a dataset
set.seed(43)
nrow <- 1000
ncol <- 10
X <- matrix(rnorm(nrow * ncol), ncol=ncol)
CF <- rnorm(ncol)
Y <- (X %*% CF + rnorm(nrow))[,1]
Y <- as.integer(Y > quantile(Y, 0.90))
table(Y)

#Compute weights to balance the RF
w <- 1/table(Y)
w <- w/sum(w)
weights <- rep(0, nrow)
weights[Y == 0] <- w['0']
weights[Y == 1] <- w['1']
table(weights, Y)

#Fit the RF
data <- data.frame(Y=factor(ifelse(Y==0, 'no', 'yes')), X)
model <- ranger(Y~., data, case.weights=weights)
print(model)
Zach
  • I read that case.weights in ranger determines the sampling for training. Does it also affect the sampling for OOB samples? – David May 07 '20 at 01:15

For reference and adding to @zach's answer:

The ranger package now (*) implements a sample.fraction argument that accepts a vector of class-specific values, which gives a stratified sampling scheme suitable for imbalanced classes.

(*) see issue #167 and the fix #263 allowing class-wise sample.fraction
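
As a rough sketch of what a class-wise sample.fraction could look like: the simulated data, num.trees = 200 and the choice of fractions below are my own assumptions and do not come from the linked issues. As I understand the ranger documentation, each fraction is relative to the total number of rows, so passing n_min / n for every class should draw about as many cases from each class as there are minority cases. The fractions are equal here, so the order of the classes does not matter.

```r
library(ranger)

# Simulated imbalanced data (illustrative only): 10% positives
set.seed(43)
n <- 1000
X <- matrix(rnorm(n * 10), ncol = 10)
score <- (X %*% rnorm(10) + rnorm(n))[, 1]
Y <- factor(ifelse(score > quantile(score, 0.90), "yes", "no"))
data <- data.frame(Y = Y, X)

# Fraction of all rows to draw per class, sized to the minority class
n_min <- min(table(data$Y))
frac <- rep(n_min / n, nlevels(data$Y))

# Bootstrap (replace = TRUE) with one sampling fraction per class
model <- ranger(Y ~ ., data = data, num.trees = 200,
                replace = TRUE, sample.fraction = frac)
print(model)
```

With unequal fractions it would be worth checking in the documentation which class each vector element refers to before relying on the result.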

Cazz

The authors gave a presentation on these techniques, available here: http://www.interfacesymposia.org/I04/I2004Proceedings/ChenChao/ChenChao.presentation.pdf

According to the authors, there’s an add-on package to R that implements their original Fortran:

Here are the working links to the R package:

Unfortunately, if you search the documentation for that package here, there is no mention of "balanced" or "brf". This paper provides a clue: "we estimate balanced RF models using the sampsize argument from the randomForest package"

This can save you from having to implement this manually.

Afflatus

The "randomForest" function in the "randomForest" R package supports Balanced Random Forests. You need to specify the "strata" and "sampsize" parameters to enable balanced bootstrap resampling.

  • strata
    A (factor) variable that is used for stratified sampling.
  • sampsize
    Size(s) of sample to draw. For classification, if sampsize is a vector of the length the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from the strata.
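
A minimal sketch of how these two parameters might be combined, assuming simulated data and ntree = 200 (both are my own choices for illustration): the response is used as the strata, and sampsize asks for as many cases from each class as there are minority cases, drawn with replacement.

```r
library(randomForest)

# Simulated imbalanced data (illustrative only): 10% positives
set.seed(43)
n <- 1000
X <- data.frame(matrix(rnorm(n * 10), ncol = 10))
score <- (as.matrix(X) %*% rnorm(10) + rnorm(n))[, 1]
y <- factor(ifelse(score > quantile(score, 0.90), "yes", "no"))

# One sample size per stratum, each equal to the minority class size
n_min <- min(table(y))

fit <- randomForest(x = X, y = y, ntree = 200,
                    strata = y,
                    sampsize = rep(n_min, nlevels(y)),
                    replace = TRUE)
print(fit)
```

The printed confusion matrix is based on out-of-bag predictions, so it gives a first impression of how the balanced sampling trades off errors between the two classes.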

A reference can be found here: http://appliedpredictivemodeling.com/blog/2013/12/8/28rmc2lv96h8fw8700zm4nl50busep

Hope it helps!

Heng Li