
Hey I want to build a model (choose significant variables) and validate it. Is this way correct?

  1. Divide data into train (80%) and test (20%) data
  2. Use train data to build a model (especially variable selection)
  3. When we have chosen variables, we can do k-fold validation of this model on TRAIN data
  4. If everything is OK (that is, the results across the k-fold validation folds are close), we can build a model on all the train data and use it to check the accuracy of our model on the test data

Is my approach correct, or did I miss something? My main concern is the point at which we select the variables for the model. And if everything is OK, should I build my final model on ALL the data or only on the training data?

Math122
  • Everything is wrong with that approach. See [Regression Modeling Strategies](https://hbiostat.org/rms) for details and safe approaches. – Frank Harrell May 23 '21 at 11:16
  • Can I ask what exactly is wrong with that approach? I found it in many YouTube videos. – Math122 May 23 '21 at 12:25
  • Even if automated variable selection seems to work on your data sample, the resulting model generally won't work well on a new sample from the same population. See [this page](https://stats.stackexchange.com/q/20836/28500) among many others on this site, in addition to the extensive discussion (and better approaches) in the resources Professor Harrell links to. Professor Harrell's work has been vetted by peer review for decades. Who vets YouTube videos? – EdM May 23 '21 at 16:53

1 Answer


Divide data into train (80%) and test (20%) data

  1. If you do not have separate test data, it does not make sense to split your original data into train and test; use k-fold cross-validation instead. A proper test set is one that does not come from the same source as your training data. A simple example would be data from two different hospitals: data from hospital H1 is used for model building, and data from H2 is used for final validation.
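
The k-fold idea can be shown in a minimal sketch (Python/scikit-learn here, since the question mentions caret but the thread contains no code; the toy data is made up): every observation is used for validation exactly once, so nothing is permanently locked away in a fixed hold-out split.

```python
# Minimal sketch: 5-fold CV on 50 observations. Each observation appears
# in the validation role exactly once and in the fitting role four times.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)  # toy data: 50 rows, 2 columns
sizes = [(len(train), len(val)) for train, val in KFold(n_splits=5).split(X)]
print(sizes)  # → [(40, 10), (40, 10), (40, 10), (40, 10), (40, 10)]
```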

Use train data to build a model (especially variable selection)

  1. Do variable selection using k-fold cross-validation on the train data, so that variables are selected on the training part of each fold and validated on the corresponding internal validation fold.
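
A hedged sketch of this step (Python/scikit-learn rather than caret, on made-up toy data): putting the selector inside a Pipeline guarantees it is re-fit on each training fold only, never on the fold being validated.

```python
# Illustrative sketch: selection happens INSIDE each CV fold because the
# SelectKBest step sits inside the Pipeline that cross_val_score re-fits
# per fold. The data, k=5 features, and model choice are all assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),    # re-fit on every training fold
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores.mean())  # average internal-validation accuracy across folds
```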

When we have chosen variables, we can do k-fold validation of this model on TRAIN data

  1. This is wrong: you have already chosen the variables on the entire train data, so any subsequent k-fold split will validate on data that was already used for selection, and the results would be optimistically biased.
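
That bias can be demonstrated directly. In this illustrative Python sketch (data and thresholds are made up), the outcome is pure noise, yet selecting features on the full data before cross-validating tends to produce deceptively good scores, while selection inside each fold does not.

```python
# Illustrative demo of selection leakage: y is random noise, so any
# honest accuracy estimate should hover around chance (0.5).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))                  # 1000 pure-noise features
y = rng.permutation(np.repeat([0, 1], 25))       # random balanced labels

# WRONG: pick the 20 features most associated with y using ALL the data...
keep = np.argsort(f_classif(X, y)[0])[-20:]
# ...then cross-validate on the pre-selected features. The validation folds
# already influenced the selection, so this estimate tends to look good.
leaky = cross_val_score(LogisticRegression(max_iter=1000), X[:, keep], y, cv=5)

# RIGHT: selection re-done inside each fold via a Pipeline.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("model", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=5)

print(leaky.mean(), honest.mean())
```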

If everything is OK (that is, the results across the k-fold validation folds are close), we can build a model on all the train data and use it to check the accuracy of our model on the test data

  1. If you do not have external test data, stop at step 2 and report the average train and internal-validation results across the k folds. If you do have separate test data (as explained in step 1), build the model on the entire train data with the selected variables and apply it to the external test data. There are many ways of selecting the final variables; one option is to keep the variables that occur most often across the train folds of step 2.
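
A sketch of the occurrence-counting heuristic mentioned above (Python/scikit-learn; the toy data and the "appears in at least 4 of 5 folds" cutoff are illustrative assumptions, not part of the answer):

```python
# Count how often each variable is selected across the k training folds,
# then keep the variables that recur in most folds as the stable choices.
from collections import Counter
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

counts = Counter()
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    sel = SelectKBest(f_classif, k=5).fit(X[train_idx], y[train_idx])
    counts.update(np.flatnonzero(sel.get_support()))  # indices chosen this fold

# Assumed cutoff: keep variables selected in at least 4 of the 5 folds.
stable = [int(v) for v, c in counts.most_common() if c >= 4]
print(stable)
```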

I hope this helps.

  • Thank you very much. I hope this approach is easily applicable with the 'caret' package. – Math122 May 23 '21 at 13:57
  • This goes much, much deeper than understanding a software package. Read about the extensive problems with variable selection and with split-sample validation for starters. And don't trust the majority of YouTube videos about statistics that are produced by non-statisticians. – Frank Harrell May 23 '21 at 15:14
  • For learning, do not use a package to do the entire job for you. E.g., rather than using the train function from caret, which does the cross-validation all in one call, try writing the loops yourself and understand how the data is split into folds. This would help you look at the selected variables, their occurrence counts, scores, and much more. – Iram shahzadi May 23 '21 at 15:59