I have a dataset of 30 observations of two variables: one is a binary class, the other is a continuous percentage. My ultimate goal is to build a classifier that can predict the class from the percentage.
For this I simply used ROC analysis and identified the best threshold (the one with the lowest cost, given my own estimates of the costs of a false positive and of a false negative).
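To be concrete about what I mean by "cost", this is roughly the quantity I minimise over candidate thresholds (a toy sketch with my own function name, assuming the class is coded 0/1; ROCR's built-in "cost" measure is, as far as I understand, a normalised version of the same quantity):
cost.at.threshold = function(class, percentage, t, cfp, cfn) {
  # classify as positive whenever the percentage reaches the candidate threshold t
  pred = as.integer(percentage >= t)
  fp = sum(pred == 1 & class == 0)  # false positives
  fn = sum(pred == 0 & class == 1)  # false negatives
  cfp * fp + cfn * fn               # total misclassification cost
}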
I would like to use cross-validation to enhance the robustness of this classifier, but I seem to be unable to do this, both conceptually and in R.
This is what I have done so far, using 10-fold cross-validation.
Conceptually, my understanding is that I should:
- randomly divide the data into 10 folds
- use 9 folds to develop a classifier (i.e. find the best threshold for the percentage variable)
- see how well this threshold classifies the remaining fold (the "test" fold)
- (repeat 10 times using each of the 10 folds in turn as the test fold)
This then leaves me with 10 thresholds and 10 confusion matrices. I think I can estimate the performance on unseen data by summing these 10 confusion matrices and calculating sensitivity and specificity from there. However, to get a final threshold it seems naive to take the weighted average or the median of the 10 thresholds, because that would be essentially the same threshold I got in the first place without doing any cross-validation. Where am I mistaken?
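To make the pooling step concrete, this is what I mean by summing the confusion matrices (cm.list here is a hypothetical list holding the ten per-fold 2x2 tables, each with rows = predicted and columns = actual, positives first):
pooled = Reduce(`+`, cm.list)           # element-wise sum of the 10 matrices
sens = pooled[1, 1] / sum(pooled[, 1])  # TP / (TP + FN)
spec = pooled[2, 2] / sum(pooled[, 2])  # TN / (TN + FP)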
As regards the R part, I gave up on trying to use the caret package and simply did the following:
library(ROCR)
find.cutoff = function(data, cfp, cfn) {
  # Find the threshold on the percentage that minimises the misclassification
  # cost, given the cost of a false positive (cfp) and of a false negative (cfn)
  pr = prediction(data$percentage, data$class)
  cost = performance(pr, "cost", cost.fp=cfp, cost.fn=cfn)
  ind = which.min(cost@y.values[[1]])  # index of the lowest-cost cutoff
  best.cutoff = pr@cutoffs[[1]][ind]   # cutoffs line up with the cost values
  best.cutoff
}
test.on.unseen = function(data, cutoff) {
  # Apply the threshold learned on the training folds to the held-out fold and
  # read off TP, TN, FP, FN at the cutoff of 1 (i.e. using the 0/1 predictions as-is)
  pr = prediction(ifelse(data$percentage >= cutoff, 1, 0), data$class)
  ind = pr@cutoffs[[1]] == 1
  print(paste(pr@tp[[1]][ind], pr@tn[[1]][ind], pr@fp[[1]][ind], pr@fn[[1]][ind]))
}
# thanks to https://stats.stackexchange.com/a/105839/232872
data <- data[sample(nrow(data)), ]                         # shuffle the rows
folds <- cut(seq(1, nrow(data)), breaks=10, labels=FALSE)  # assign each row to one of 10 folds
for (i in 1:10) {
  testIndexes <- which(folds == i, arr.ind=TRUE)
  testData <- data[testIndexes, ]                          # the held-out fold
  trainData <- data[-testIndexes, ]                        # the other 9 folds
  cutoff = find.cutoff(trainData, 5, 1)                    # cost of a false positive = 5, of a false negative = 1
  test.on.unseen(testData, cutoff)
}
Then I couldn't really make out how to do this properly, so I used the printed output of the second function to compute sensitivity and specificity by hand.
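If it helps to see what I computed by hand, here is a sketch of the same thing done in code: a variant of test.on.unseen (my own naming, assuming the class is coded 0/1) that returns the counts instead of printing them, so the loop can accumulate a pooled confusion matrix:
count.on.unseen = function(data, cutoff) {
  pred = as.integer(data$percentage >= cutoff)  # apply the threshold learned on the training folds
  c(tp = sum(pred == 1 & data$class == 1),
    tn = sum(pred == 0 & data$class == 0),
    fp = sum(pred == 1 & data$class == 0),
    fn = sum(pred == 0 & data$class == 1))
}
totals = c(tp = 0, tn = 0, fp = 0, fn = 0)
for (i in 1:10) {
  testIndexes = which(folds == i)
  cutoff = find.cutoff(data[-testIndexes, ], 5, 1)
  totals = totals + count.on.unseen(data[testIndexes, ], cutoff)
}
totals["tp"] / (totals["tp"] + totals["fn"])  # pooled sensitivity
totals["tn"] / (totals["tn"] + totals["fp"])  # pooled specificity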
My knowledge of R and statistics is still developing, so I apologize for any clunkiness or terrible mistakes.
Thanks in advance!