Is there a specific way of sampling that maintains the class ratio of an imbalanced set? For example, let's say I want to do k-fold cross-validation on my training set, and my training set is very imbalanced (let's say 1:10). How can I sample such that each of the k folds also has the same ratio of +/- (in this case, 1:10)?

user41799
    It is called "stratified" cross-validation. Split each class into k groups, then make sure that each training set gets k-1 groups from each class and each test set gets the remaining group from each class. – amoeba Mar 12 '14 at 16:42

1 Answer

You cannot guarantee perfectly balanced fold-sets, because the class counts are not always divisible by the number of folds. However, you can get the fold-sets as balanced as possible. Here is one solution:

# Function to assign the elements of a factor to n fold-sets
# Returns a list (of length n) of index vectors into the factor
nfoldsets <- function(x, n) {
  if (!is.numeric(n)) stop("n must be numeric")
  x <- as.factor(x)

  # list of indices for each level
  types <- split(seq_along(x), x)

  # fn to create random assignments for a vector of length l into n groups
  nblocks <- function(l, n) sample(rep(sample(1:n), (l+n)/n)[1:l])

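  # assign indices for each level to n groups; fixing the factor levels keeps
  # all n groups (possibly empty) even when a level has fewer than n elements
  assignments <- lapply(types, function(x)
    split(x, factor(nblocks(length(x), n), levels = 1:n)))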

  # merge assignments for same group from each type
  out <- as.data.frame(do.call(rbind, assignments))
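  # shuffle by position: sample() on a single number would sample from 1:that number
  lapply(out, function(x) { idx <- unname(unlist(x)); idx[sample.int(length(idx))] })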
}

Usage:

# create a vector of length 110, with ratio 1:10, a:b
> x <- sample(c(rep("a", 10), rep("b", 100)))

# assign elements to each of 5 fold-sets, keeping as balanced as possible
> fs <- nfoldsets(x, n=5)

# check assignments are balanced
> sapply(fs, function(i) table(x[i]))
   1  2  3  4  5
a  2  2  2  2  2
b 20 20 20 20 20
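
As a rough sketch of how stratified folds might then be used for cross-validation, the snippet below assigns a fold label to every element with base R's `ave()` and loops over the folds, using each one once as the test set. This is an alternative formulation rather than the function above; the names `k`, `fold`, `test_idx` and `train_idx` are illustrative assumptions, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Same toy data as above: 10 "a" and 100 "b", shuffled
x <- sample(c(rep("a", 10), rep("b", 100)))
k <- 5  # number of folds (assumed for this example)

# Within each class, deal out fold labels 1..k as evenly as possible,
# then shuffle which element of the class receives which label
fold <- ave(seq_along(x), x, FUN = function(i) {
  labels <- rep_len(sample.int(k), length(i))
  labels[sample.int(length(i))]
})

# Each fold serves once as the test set; the remaining folds form the training set
for (j in seq_len(k)) {
  test_idx  <- which(fold == j)
  train_idx <- which(fold != j)
  cat("fold", j,
      "- test a/b:", as.vector(table(x[test_idx])),
      "- train a/b:", as.vector(table(x[train_idx])), "\n")
}
```

With these class counts (10 and 100, both divisible by 5), every test fold should contain 2 "a" and 20 "b", so the 1:10 ratio is preserved in both the test and training portions. When the counts are not divisible by `k`, fold sizes within a class differ by at most one.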
waferthin