I would like someone to tell me whether what I'm doing is correct. I have labelled data on which I want to train different machine learning models that predict the outcome. Here are the steps I have gone through:
- I divided the data into two sets: 80% for training and 20% for testing.
- I cross-validated the training set (and only the training set) with 10 folds using different models (KNN, ANN, SVM, etc.).
- I kept tuning the parameters of each model until I got a satisfactory root mean squared error (RMSE).
- I used the parameters that produced the lowest cross-validation RMSE to build each model on the training set (80% of the data).
- I fed the testing set (the remaining 20%) into each model and obtained its predictions.
- I evaluated each model's prediction error on the testing set using MSE, RMSE, MAPE, and MAE.
- I compared the models and recommended the one that produced the lowest error (a minimal sketch of this pipeline is included after this list).
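To make the procedure concrete, here is a minimal sketch of how I understand my own pipeline, assuming scikit-learn; the synthetic `X`/`y`, the particular hyperparameter grids, and the two model choices are placeholders, not my actual data or settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_absolute_percentage_error)

# Placeholder data standing in for my labelled dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=200)

# Step 1: 80/20 train/test split; the test set is never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Steps 2-4: 10-fold CV on the training set only, tuning hyperparameters by RMSE
candidates = {
    "KNN": (KNeighborsRegressor(), {"n_neighbors": [3, 5, 7, 11]}),
    "SVM": (SVR(), {"C": [0.1, 1, 10], "gamma": ["scale", "auto"]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=10,
                          scoring="neg_root_mean_squared_error")
    search.fit(X_train, y_train)  # refits the best model on all of X_train

    # Steps 5-7: predict on the held-out 20% and compute the error metrics
    y_pred = search.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    results[name] = {
        "best_params": search.best_params_,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": mean_absolute_error(y_test, y_pred),
        "MAPE": mean_absolute_percentage_error(y_test, y_pred),
    }

# Step 8: compare models by their test-set errors
for name, metrics in results.items():
    print(name, metrics)
```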
My questions:
Is using 10-fold cross-validation on the training set alone similar to dividing the data into 70% training, 10% validation, and 20% testing? It would be really helpful if you could point me to research papers that adopt such a technique.
Does this procedure make sense, or am I doing something wrong?