Hey, I want to build a model (which includes choosing the significant variables) and validate it. Is the following approach correct?
- Split the data into training (80%) and test (20%) sets
- Use the training data to build the model (in particular, to do the variable selection)
- Once the variables are chosen, run k-fold cross-validation of this model on the TRAINING data
- If everything is OK, that is, the results across the k folds are close to one another, fit the model on all of the training data and use it to check the model's accuracy on the test data
Is this approach correct, or did I miss something? My main concern is the point at which the variables for the model get selected. And if everything is OK, should I build the final model on ALL the data or only on the training data?
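
To make the steps concrete, here is a minimal Python sketch of the workflow as listed, assuming scikit-learn. Everything specific in it is an assumption for illustration and not part of my real setup: the synthetic data from `make_regression`, `SelectKBest` with `f_regression` as the selection method, keeping 5 variables, plain linear regression, 5 folds, and R^2 as the score. Note that the selection is done once on the whole training set, exactly as in the steps, and not repeated inside each fold.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic data standing in for the real dataset (assumption).
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Step 1: split into 80% training and 20% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: choose variables using the training data only.
# SelectKBest/f_regression is just one possible selection method (assumption).
selector = SelectKBest(score_func=f_regression, k=5).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Step 3: k-fold cross-validation of the chosen model on the training data.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(
    LinearRegression(), X_train_sel, y_train, cv=cv, scoring="r2"
)

# Step 4: fit on all training data, then check accuracy on the test data.
final_model = LinearRegression().fit(X_train_sel, y_train)
test_r2 = final_model.score(X_test_sel, y_test)

print("selected columns:", selector.get_support(indices=True))
print("R^2 per fold:", cv_scores)
print("R^2 on test data:", test_r2)
```

If the per-fold scores printed here are close to one another, that corresponds to the "everything is OK" case in the last step.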