Is it necessary to split dataset for cross validation?

Question

I am using the caret package in R to train a model and run cross-validation, and I am confused about how the cross-validation process works.

At the moment, I split the dataset into two subsets, training and testing:

inTraining <- createDataPartition(dataset$class, p = .75, list = FALSE)
training <- dataset[ inTraining,]
testing  <- dataset[-inTraining,]

After that, I use the code below to train the model on the training set:

fitControl_cv <- trainControl(## 10-fold CV
  method = "cv",
  number = 10,
  verboseIter = TRUE)  # trainControl's logging flag is verboseIter, not verbose

model <- train(class ~ ., data = training, method = "<a_name>", trControl = fitControl_cv)

Is this correct? I am not sure why I split the dataset first. In my opinion, I don't need to split the data into training and testing subsets, because I assume the cross-validation process already does that job.

This is my alternative approach in R:

model <- train(class ~ ., data = dataset, method = "<a_name>", trControl = fitControl_cv)

Which one is correct: applying cross-validation to the whole dataset, or only to the training set?

So do you mean we have to split the data into train and test sets and then apply cross-validation on the train set? — Rawia, Feb 22 '18 at 12:03

score 5 · Accepted Answer · edited Apr 13 '17 at 12:44

You need to split your data into training and testing subsets for cross-validation. In $k$-fold cross-validation you do it $k$ times repeatedly.

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). (Wikipedia)

What seems to be confusing is that in most cases, when you use a software package for cross-validation (as opposed to coding it from scratch), the software does the splitting for you and you do not need to split the data by hand.
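To make the hidden splitting concrete, here is a minimal hand-coded sketch of $k$-fold cross-validation. It is not what caret does internally line for line; it assumes the iris data, a 5-nearest-neighbour classifier from the class package, and simple random fold assignment, all chosen only for illustration.

```r
library(class)  # provides knn(); a recommended package shipped with R

set.seed(42)
data(iris)
k <- 10

# Randomly assign every row to one of k folds of (roughly) equal size
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  held_out <- fold_id == i

  # Train on the other k - 1 folds, predict the held-out fold
  pred <- knn(train = iris[!held_out, 1:4],
              test  = iris[held_out, 1:4],
              cl    = iris$Species[!held_out],
              k     = 5)  # number of neighbours (illustrative choice)

  fold_accuracy[i] <- mean(pred == iris$Species[held_out])
}

# The cross-validated estimate is the average over the k held-out folds
mean(fold_accuracy)
```

Every observation is used for validation exactly once and for training $k - 1$ times, which is the train/test splitting the question asks about, just repeated $k$ times.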

The documentation of the train function from the caret package shows its usage with the "cv" method (cross-validation). As you can see in the example below, you only need to set method in trainControl, optionally along with other parameters, e.g. number for the number of folds, or p for the proportion of the sample used as the training set (p applies to leave-group-out cross-validation rather than to "cv").

library(caret)
library(e1071)

data(iris)
TrainData <- iris[,1:4]
TrainClasses <- iris[,5]

knnFit1 <- train(TrainData, TrainClasses,
                 method = "knn",
                 preProcess = c("center", "scale"),
                 tuneLength = 10,
                 trControl = trainControl(method = "cv"))
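As a rough sketch of how to see the splits caret made on your behalf, the fitted object can be inspected after training. The object name fit and the seed are assumptions for illustration; results, bestTune and resample are components of the object returned by train.

```r
library(caret)

set.seed(1)
data(iris)

# Same kind of fit as the documentation example, with the fold count made explicit
fit <- train(iris[, 1:4], iris[, 5],
             method = "knn",
             preProcess = c("center", "scale"),
             tuneLength = 10,
             trControl = trainControl(method = "cv", number = 10))

fit$results   # accuracy and kappa averaged over the folds, one row per candidate k
fit$bestTune  # the number of neighbours selected by cross-validation
fit$resample  # per-fold accuracy and kappa for the selected model
```

The resample table has one row per fold, which shows that the data were indeed partitioned and held out ten times without any manual splitting.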

To learn more about cross-validation, check this thread and the blog entry "Why every statistician should know about cross-validation" by Rob Hyndman; the paper he quotes also seems to provide a nice review.

This is, unfortunately, a drawback of new users investing immediately in tools like caret, which hide many details. It's always worth doing these things the hard way first, so you can really see how things work, and why they work that way. Tools like caret should be viewed as a shortcut for those who already have some experience. — Matthew Drury, Jan 02 '16 at 18:08
