
I want to build a machine learning model using the caret package in R. Some of the features in my dataset are dummies taking the value 0 or 1. I would like to know which resampling methods can be used in the presence of dummy variables.

k-fold cross-validation does not seem to be an option, as explained in this post. I could potentially use leave-one-out cross-validation; however, that seems too expensive when N is large (my dataset has 100,000+ observations).

Are there any suitable resampling methods in the presence of dummies and large N?
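
For reference, here is a minimal sketch of how plain k-fold CV is specified in caret; the synthetic data frame, column names, and the glm method are illustrative assumptions, not the actual dataset:

```r
## Minimal sketch: plain k-fold CV in caret. The data frame, column names,
## and the glm method are illustrative assumptions, not the real data.
library(caret)

set.seed(1)
df <- data.frame(
  y  = factor(sample(c("yes", "no"), 1000, replace = TRUE)),  # binary outcome
  x1 = rnorm(1000),                                           # continuous feature
  d1 = rbinom(1000, 1, 0.05)                                  # 0/1 dummy feature
)

ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
fit  <- train(y ~ ., data = df, method = "glm", trControl = ctrl)
fit
```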

kanimbla
    (+1) "Not an option" is rather strong - if there aren't too few 1s (or 0s) it'll be very unlikely to get none in an 80% or 90% sample. See also *stratified cross-validation*. – Scortchi - Reinstate Monica Sep 06 '16 at 18:08
  • Thanks for the useful hint regarding stratified CV! I actually do have some very rare groups in my sample (lots of 0s) and, having looked at stratified CV, this seems to be a promising approach to dealing with this issue, i.e. making sure that the training data also includes data on the rare class. A good discussion of the topic can be found [here](http://stats.stackexchange.com/questions/117643/why-use-stratified-cross-validation-why-does-this-not-damage-variance-related-b) – kanimbla Sep 06 '16 at 20:57
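
Along the lines of the stratified-CV suggestion in the comments, here is a minimal sketch: folds are built with `createFolds()` on the rare dummy so every training set keeps roughly the same share of 1s, and the resulting index list is passed to `trainControl()`. The data frame and column names are illustrative assumptions.

```r
## Minimal sketch: stratified folds on a rare 0/1 dummy, then passed to caret.
## The data frame and column names (df, y, x1, d1) are illustrative assumptions.
library(caret)

set.seed(1)
df <- data.frame(
  y  = rnorm(1000),             # numeric outcome
  x1 = rnorm(1000),             # continuous feature
  d1 = rbinom(1000, 1, 0.02)    # rare dummy: roughly 2% ones
)

## createFolds() samples within each level of its first argument, so folding on
## the dummy keeps the proportion of 1s roughly constant across training sets.
train_idx <- createFolds(factor(df$d1), k = 5, returnTrain = TRUE)
sapply(train_idx, function(i) mean(df$d1[i]))  # share of 1s per training fold

ctrl <- trainControl(method = "cv", index = train_idx)
fit  <- train(y ~ ., data = df, method = "lm", trControl = ctrl)
fit
```

The more common usage is to stratify on the outcome rather than on a feature, which simply means passing the outcome to `createFolds()` instead of the dummy.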

0 Answers