
I would like someone to tell me whether what I am doing is correct. I have labelled data on which I want to train different machine learning models to predict the outcome. Here are the steps I went through (a sketch of the workflow in code follows the list):

  1. I divided the data into two sets: 80% for training and 20% for testing.
  2. I cross-validated the training set (and only the training set) with 10 folds using different models (kNN, ANN, SVM, etc.).
  3. I kept tuning the parameters of each model until I got a satisfactory root mean squared error (RMSE).
  4. I used the parameters that produced the lowest RMSE to build each model on the training set (80% of the data).
  5. I fed the testing set (the remaining 20%) into each model and got a prediction from each.
  6. I evaluated the testing-set prediction error of each model using MSE, RMSE, MAPE and MAE.
  7. I compared the models and recommended the one that produced the lowest error.
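
For concreteness, here is roughly what steps 1–6 could look like for a single model (kNN) in scikit-learn; the synthetic dataset and the parameter grid are just placeholders, not my actual data:

```python
# Sketch of the workflow for one model (kNN); synthetic data stands in
# for the real labelled dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Step 1: 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 2-4: 10-fold CV on the training set only, tuning k to minimise RMSE;
# GridSearchCV then refits the best parameters on the whole training set.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
    cv=10,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# Steps 5-6: predict on the held-out 20% and compute the error metrics
# (note: MAPE is unstable when the target is close to zero).
y_pred = search.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("MSE :", mse)
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_test, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_test, y_pred))
```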

My questions:

  1. Is using 10-fold cross-validation on the training set alone similar to dividing the data into 70% training, 10% validation and 20% testing? It would be really helpful if you could point me to research papers that adopt such a technique.

  2. Does this procedure make sense, or am I doing something wrong?

    There are [999 posts](https://stats.stackexchange.com/search?q=[cross-validation]%20how%20to%20use) dealing with "how to use cross validation" on this site. Did you have a look at them? For example [this one](https://stats.stackexchange.com/q/187881/163572) or [this one](https://stats.stackexchange.com/q/250282/163572)? – Jan Kukacka Mar 16 '18 at 15:49

1 Answer


Though I cannot recall having seen a strategy such as this used in the literature, my initial impression is that it is doing a cross-validation on a cross-validated conclusion. A “second-order” cross-validation, if you will.

For example, the 80/20 split can itself be seen as a form of cross-validation: a single, unbalanced hold-out fold. Doing a 10-fold cross-validation on the 80% subset is therefore essentially just producing a different model that will then be tested on the 20% testing subset. Concretely, each of those 10 folds trains on 90% of the 80% subset, i.e. 72% of the full data, and validates on the remaining 8%, so the scheme resembles a repeated 72/8/20 split rather than 70/10/20.

My suggestion is to just do the 10-fold CV. Additionally, it may help to think of validation and testing as synonymous in this context.
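
For illustration, comparing several models with 10-fold CV alone could look like this in scikit-learn (the models, their settings, and the synthetic data are placeholders, not a prescription):

```python
# Minimal sketch: compare models by 10-fold cross-validated RMSE,
# with no separate hold-out split. Synthetic data stands in for real data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

models = {
    "kNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVR(C=1.0)),
    "ANN": make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000,
                                      random_state=0)),
}

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    # cross_val_score returns negative MSE; flip the sign and take the root
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    rmse = np.sqrt(-scores)
    print(f"{name}: RMSE = {rmse.mean():.3f} +/- {rmse.std():.3f}")
```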

Hope this helps.

Gregg H