I have a dataset of 30 observations of two variables: one is a binary class, the other is a continuous percentage. My ultimate goal is to build a classifier that can predict the class from the percentage.
For this I simply used ROC analysis and identified the best threshold (the one with the lowest cost, given my own estimates of the costs of a false positive and of a false negative).
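To be concrete about what I mean by "cost", this is roughly the quantity I minimise over candidate thresholds (a toy sketch with my own function name, assuming the class is coded 0/1; ROCR's built-in "cost" measure is, as far as I understand, a normalised version of the same quantity):
cost.at.threshold = function(class, percentage, t, cfp, cfn) {
  # classify as positive whenever the percentage reaches the candidate threshold t
  pred = as.integer(percentage >= t)
  fp = sum(pred == 1 & class == 0)  # false positives
  fn = sum(pred == 0 & class == 1)  # false negatives
  cfp * fp + cfn * fn               # total misclassification cost
}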
I would like to use cross-validation to enhance the robustness of this classifier, but I seem to be unable to do this, both conceptually and in R.
This is what I have done so far, using 10-fold cross-validation.
Conceptually, my understanding is that I should:
- randomly divide the data into 10 folds
- use 9 folds to develop a classifier (i.e. find the best threshold for the percentage variable)
- see how well this threshold classifies the remaining fold (the "test" fold)
- (repeat 10 times using each of the 10 folds in turn as the test fold)
This then leaves me with 10 thresholds and 10 confusion matrices. I think I can estimate the performance on unseen data by summing these 10 confusion matrices and calculating sensitivity and specificity from there. However, to get a final threshold it seems naive to take the weighted average or the median of the 10 thresholds, because that would be essentially the same threshold I got in the first place without doing any cross-validation. Where am I mistaken?
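To make the pooling step concrete, this is what I mean by summing the confusion matrices (cm.list here is a hypothetical list holding the ten per-fold 2x2 tables, each with rows = predicted and columns = actual, positives first):
pooled = Reduce(`+`, cm.list)           # element-wise sum of the 10 matrices
sens = pooled[1, 1] / sum(pooled[, 1])  # TP / (TP + FN)
spec = pooled[2, 2] / sum(pooled[, 2])  # TN / (TN + FP)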
As regards the R part, I gave up on trying to use the caret package and simply did the following:
library(ROCR)
find.cutoff = function(data, cfp, cfn) {
  # Find the threshold on the percentage that minimises the misclassification
  # cost, given the cost of a false positive (cfp) and of a false negative (cfn)
  pr = prediction(data$percentage, data$class)
  cost = performance(pr, "cost", cost.fp=cfp, cost.fn=cfn)
  ind = which.min(cost@y.values[[1]])  # index of the lowest-cost cutoff
  best.cutoff = pr@cutoffs[[1]][ind]   # cutoffs line up with the cost values
  best.cutoff
}
test.on.unseen = function(data, cutoff) {
  # Apply the threshold learned on the training folds to the held-out fold and
  # read off TP, TN, FP, FN at the cutoff of 1 (i.e. using the 0/1 predictions as-is)
  pr = prediction(ifelse(data$percentage >= cutoff, 1, 0), data$class)
  ind = pr@cutoffs[[1]] == 1
  print(paste(pr@tp[[1]][ind], pr@tn[[1]][ind], pr@fp[[1]][ind], pr@fn[[1]][ind]))
}
# thanks to https://stats.stackexchange.com/a/105839/232872
data <- data[sample(nrow(data)), ]                         # shuffle the rows
folds <- cut(seq(1, nrow(data)), breaks=10, labels=FALSE)  # assign each row to one of 10 folds
for (i in 1:10) {
  testIndexes <- which(folds == i, arr.ind=TRUE)
  testData <- data[testIndexes, ]                          # the held-out fold
  trainData <- data[-testIndexes, ]                        # the other 9 folds
  cutoff = find.cutoff(trainData, 5, 1)                    # cost of a false positive = 5, of a false negative = 1
  test.on.unseen(testData, cutoff)
}
Then I couldn't really make out how to do this properly, so I used the printed output of the second function to compute sensitivity and specificity by hand.
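If it helps to see what I computed by hand, here is a sketch of the same thing done in code: a variant of test.on.unseen (my own naming, assuming the class is coded 0/1) that returns the counts instead of printing them, so the loop can accumulate a pooled confusion matrix:
count.on.unseen = function(data, cutoff) {
  pred = as.integer(data$percentage >= cutoff)  # apply the threshold learned on the training folds
  c(tp = sum(pred == 1 & data$class == 1),
    tn = sum(pred == 0 & data$class == 0),
    fp = sum(pred == 1 & data$class == 0),
    fn = sum(pred == 0 & data$class == 1))
}
totals = c(tp = 0, tn = 0, fp = 0, fn = 0)
for (i in 1:10) {
  testIndexes = which(folds == i)
  cutoff = find.cutoff(data[-testIndexes, ], 5, 1)
  totals = totals + count.on.unseen(data[testIndexes, ], cutoff)
}
totals["tp"] / (totals["tp"] + totals["fn"])  # pooled sensitivity
totals["tn"] / (totals["tn"] + totals["fp"])  # pooled specificity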
My knowledge of R and statistics is still developing, so I apologize for any clunkiness or terrible mistakes.
Thanks in advance!