Is there a specific way of sampling that maintains the class ratio of an unbalanced set? For example, let's say I want to do k-fold cross-validation on my training set, and my training set is very imbalanced (let's say 1:10). How can I sample so that each of the k folds also has the same ratio of +/- (in this case, 1:10)?
It is called "stratified" cross-validation. You split each class into k groups, and make sure that each training set gets k-1 groups from each class and each test set gets 1 group from each class. – amoeba Mar 12 '14 at 16:42
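A minimal base R sketch of the per-class split described in this comment. The names `y`, `k`, and `fold` are illustrative, and the class sizes (7 positives, 70 negatives) are made up to show what happens when a class does not divide evenly by k.

```r
set.seed(1)                                   # only for reproducibility
y <- rep(c("pos", "neg"), times = c(7, 70))   # hypothetical 1:10 labels
k <- 5

# Within each class, deal out the labels 1..k and shuffle them.
# Indexing with sample.int() avoids sample()'s special case for a single number.
fold <- ave(seq_along(y), y, FUN = function(i) {
  rep_len(seq_len(k), length(i))[sample.int(length(i))]
})

# Per-class counts in each fold differ by at most one
counts <- table(y, fold)
print(counts)
```

In this sketch the leftover elements of every class always land in the lowest-numbered folds, so with many small classes it may be worth permuting the labels 1..k per class first.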
1 Answer
You cannot guarantee perfectly balanced fold-sets (a class whose size is not a multiple of the number of folds cannot be split evenly), but you can make the fold-sets as balanced as possible. Here is one solution:
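# Function to assign the elements of a factor to n fold-sets
# Returns a list (of length n) of the indices in the factor
nfoldsets <- function(x, n) {
  if (!is.numeric(n) || length(n) != 1 || n < 1) stop("n must be a single positive number")
  x <- as.factor(x)
  folds <- seq_len(n)
  # list of indices for each level
  types <- split(seq_along(x), x)
  # fn to create random assignments for a vector of length l into n groups;
  # shuffling via sample.int() avoids sample(k) meaning sample(1:k) when l is 1
  nblocks <- function(l) rep(sample(folds), length.out = l)[sample.int(l)]
  # assign indices for each level to n groups, keeping empty groups so that
  # levels with fewer than n elements still line up across fold-sets
  assignments <- lapply(types, function(idx) {
    split(idx, factor(nblocks(length(idx)), levels = folds))
  })
  # merge assignments for same group from each type, then shuffle the indices
  out <- lapply(folds, function(k) {
    idx <- unlist(lapply(assignments, `[[`, k), use.names = FALSE)
    idx[sample.int(length(idx))]
  })
  setNames(out, folds)
}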
Usage:
# create a vector of length 110, with ratio 1:10, a:b
> x <- sample(c(rep("a", 10), rep("b", 100)))
# assign elements to each of 5 fold-sets, keeping as balanced as possible
> fs <- nfoldsets(x, n=5)
# check assignments are balanced
> sapply(fs, function(i) table(x[i]))
   1  2  3  4  5
a  2  2  2  2  2
b 20 20 20 20 20
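To use fold-sets like these for cross-validation, each one can serve as the test set once while the remaining ones are pooled for training. The sketch below is self-contained, so it builds a stand-in list of index vectors rather than calling `nfoldsets`; the assumption is that any list of this shape, including the one returned above, can be used the same way. The names `fold_id` and `ratios` are illustrative.

```r
set.seed(42)                                   # only for reproducibility
x <- sample(c(rep("a", 10), rep("b", 100)))    # same shape as the usage example

# Stand-in for the fold-sets: a list of 5 index vectors with 2 "a" and 20 "b" each.
# (Assumption: the list returned by nfoldsets(x, n = 5) can be used the same way.)
k <- 5
fold_id <- ave(seq_along(x), x, FUN = function(i) rep_len(seq_len(k), length(i)))
fs <- split(seq_along(x), fold_id)

# Each fold-set is the test set once; the remaining k-1 fold-sets form the training set
ratios <- sapply(seq_along(fs), function(i) {
  test_idx  <- fs[[i]]
  train_idx <- unlist(fs[-i], use.names = FALSE)
  # a model would be fitted on x[train_idx] and evaluated on x[test_idx] here
  c(train_a = sum(x[train_idx] == "a"), train_b = sum(x[train_idx] == "b"),
    test_a  = sum(x[test_idx]  == "a"), test_b  = sum(x[test_idx]  == "b"))
})
print(ratios)
```

With 10 and 100 elements split into 5 fold-sets, every training set holds 8 "a" and 80 "b" and every test set holds 2 "a" and 20 "b", so the 1:10 ratio is preserved on both sides of each split.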

waferthin