
I want to see whether different k-fold cross-validation settings (with repeats) will improve my classification accuracy.

For that, I have tested repeated CV with three settings: k-fold = 50 with 10 repeats, k-fold = 100 with 10 repeats, and k-fold = 1000 with 10 repeats.

But the accuracy values in the confusion matrix are the same for all of these repeated cross-validations.

Please see my code below. I am using the caret package to train the model.

I am wondering whether I am getting the same accuracy results because my model is stable/robust, or for some other reason.

Any help is appreciated.

Thanks in advance.

library(caret)

## Drop the first 10 (non-predictor) columns
standardized.X <- LossT[, -c(1:10)]

## 85/15 stratified train/test split
set.seed(50)
ind <- createDataPartition(LossT$LRPPcat, p = 0.85, list = FALSE)
train.X <- standardized.X[ind, ]
test.X  <- standardized.X[-ind, ]
train.Y <- LossT$LRPPcat[ind]
test.Y  <- LossT$LRPPcat[-ind]

## Reproducible seeds: 50 folds x 10 repeats = 500 resamples, each needing
## one seed per candidate tuning value (tuneLength = 20), plus one seed
## for the final model fit
set.seed(123)
seeds <- vector(mode = "list", length = 501)
for (i in 1:500) seeds[[i]] <- sample.int(1000, 20)
seeds[[501]] <- sample.int(1000, 1)

## Train the model
ctrl <- trainControl(method = "repeatedcv", number = 50, repeats = 10,
                     preProcOptions = list(thresh = 0.85, k = 5),
                     seeds = seeds, selectionFunction = "best",
                     savePredictions = TRUE, allowParallel = TRUE)
modelknn <- train(train.X, train.Y, method = "knn", preProcess = "pca",
                  trControl = ctrl, metric = "Accuracy", tuneLength = 20)

## Evaluate on the held-out test set
fitted.results <- predict(modelknn, test.X)
cm <- confusionMatrix(data = fitted.results, reference = test.Y)
  • I think you have a misunderstanding about what cross validation is for. K is not a tuning parameter in a model; it does not change the complexity of the model being fit to the data. Different k's give you different estimates of the same quantity, for the same model, for the same population. – Matthew Drury Aug 08 '17 at 14:28
  • Thanks for your reply, I appreciate it. But when using "knn", the number of nearest neighbors (k) is the tuning parameter; that's why I have set the tuning length there. But I cannot understand why different repeated CV settings give me the same classification output. Any idea? Thanks again. – user3408139 Aug 08 '17 at 17:12
  • As Matthew Drury pointed out, the $k$ in CV is not a tuning parameter. Conversely, the $k$ in KNN is. There is no relationship between the two. Perhaps you might want to review a textbook on these matters. – Ami Tavory Aug 08 '17 at 21:18
  • @Ami, thanks for your comment. I totally understand it. Do you think my code needs to be revised? If yes, where should I change it? Please let me know. Thanks a lot again. – user3408139 Aug 09 '17 at 14:25

1 Answer


But the accuracy values in the confusion matrix are the same for all of these repeated cross-validations.

This is how it should ideally be. The actual choice of the number of folds does not influence the results much; see also Choice of K in K-fold cross-validation.
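For illustration, here is a minimal sketch (reusing train.X and train.Y from the question; the fold counts 5/10/50 and the seed are arbitrary choices) that re-estimates the CV accuracy for several fold counts:

## Minimal sketch: compare the CV accuracy estimate across fold counts.
for (nf in c(5, 10, 50)) {
  ctrl_nf <- trainControl(method = "repeatedcv", number = nf, repeats = 10)
  set.seed(123)  # same seed, so only the fold count changes
  m <- train(train.X, train.Y, method = "knn", preProcess = "pca",
             trControl = ctrl_nf, metric = "Accuracy", tuneLength = 20)
  cat(nf, "folds: CV accuracy =", round(max(m$results$Accuracy), 4), "\n")
}

You should see the estimates agree to within their random fluctuation.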

I am wondering whether I am getting the same accuracy results because my model is stable/robust, or for some other reason.

The aggregated confusion matrix (summing up the results of all surrogate models) is not the best place to characterize stability [of the predictions with respect to small changes in the training set]. You can get at that more directly by comparing predictions across the repetitions, as sketched below.
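For example, a rough sketch under the question's setup (savePredictions = TRUE, 10 repeats); modelknn$pred is the data frame caret saves, with columns pred, obs, rowIndex, k, and Resample:

## Rough sketch: per-case agreement of predictions across repetitions.
p <- modelknn$pred
p <- p[p$k == modelknn$bestTune$k, ]  # keep only the finally chosen k

## Each training case is held out once per repetition, so it collects
## 10 predictions; compute the fraction agreeing on the majority class.
agree <- tapply(p$pred, p$rowIndex,
                function(x) max(table(x)) / length(x))
summary(agree)  # values near 1 indicate stable predictions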

The "other reason" may be that you have many cases (as you do k = 1000), so the random (relative) differences you observe in the final result are typically smaller than those observed with small data sets.

– cbeleites unhappy with SX