
I have a question about model selection using cross validation.

As far as I understand from many other replies about model selection here, one should use nested cross-validation in order to properly (1) select and then (2) assess a model. But almost all of those questions concerned relatively small data sets, so I assume this method is aimed at relatively small data sets.

My question is: what should one do in the case of a large data set ($\sim 18$ million observations, trying to select input variables out of $\sim 10^2$ features), given that training even a single model on such a data set can be quite time consuming? Would it be fair to leave out a test set for model assessment (after CV is done) and perform cross-validation on the rest of the data to select a model?
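To make the idea concrete, here is a minimal sketch of the split I have in mind (scikit-learn and a synthetic data set assumed; logistic regression is just a placeholder estimator):

```python
# Sketch of "hold out a test set, cross-validate on the rest" (assumptions:
# scikit-learn, synthetic data, placeholder estimator and hyper-parameter grid).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=10000, n_features=100, random_state=0)

# (1) Hold out a test set that is touched only once, for the final assessment.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

# (2) Select a model by cross-validation on the remaining data.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# (3) Assess the selected model once on the held-out test set.
print(search.best_params_, search.score(X_test, y_test))
```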

Ira Z.

1 Answer


If we have $\sim 18$ million observations, the first question I would attack is how much information is actually in these data, i.e., would it be a bad idea to use only a sample of the data set?

An extreme case: suppose 80% of these $18$ million observations are redundant (in the extreme, exact duplicates). Then we do not need to use all of the data for model building.

The learning curve is a tool (built on cross validation) that tells us how much data we need and how complex a model we need. Here is an example: How to know if a learning curve from SVM model suffers from bias or variance?

If you plot the learning curve for a specific model, you may find that

  • We are under-fitting, which means the model is too simple and the data are complex / contain a lot of information. Then we can increase the model complexity.

  • We are over-fitting, which means the model is too complex and the data are relatively simple (say, the 18 million observations do not contain as much information as we expected). Then using a sample of the data would be sufficient.

I would suggest you use cross validation in a "learning curve" way: start with a sample of the data and increase the sample size, observing the training and testing loss in order to make further decisions on model selection.
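A minimal sketch of this idea, assuming scikit-learn and a synthetic data set as a stand-in for a subsample of the real 18-million-row data (logistic regression is just a placeholder estimator):

```python
# Cross-validated learning curve: train/validation scores at increasing
# sample sizes (assumptions: scikit-learn, synthetic data, placeholder model).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=20000, n_features=100, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, n_jobs=-1)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={n:6d}  train={tr:.3f}  validation={va:.3f}")

# Reading the curve: if the validation score has plateaued and train and
# validation scores are close, more data adds little (a sample suffices);
# a persistent gap suggests over-fitting; two low scores suggest
# under-fitting (try a more complex model).
```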

Haitao Du
  • Thank you very much @hxd1011. This is a great point. I would like to try out this approach. I just have several questions about it. I apologize in advance if the questions are stupid; I probably just need to read more about learning curves. I would greatly appreciate it if you could answer or point me to resources where I could find relevant information. My questions are: (1) Could you please give me some advice on how to increase the sample if the data I need to use are very skewed? Should I use some sort of stratification when I add samples, or just add them randomly to the training set? – Ira Z. Feb 16 '17 at 16:21
  • (2) Regarding CV, should I apply the procedure from (1) to each fold? I mean increasing the size of the folds (I assume by taking further samples from some kind of pool?). (3) If I want to conduct model selection, do I use this procedure for each hyper-parameter/feature/etc. combination? Thank you very much for your help! – Ira Z. Feb 16 '17 at 16:21
  • @IraZ. check this video. https://www.youtube.com/watch?v=e3edL-_fUTo – Haitao Du Feb 16 '17 at 16:28
  • @IraZ. no problem. You are thinking too much about how to sample the data. If you have 18 million data points, how do you know they are skewed? We can start with a small sample, observe the trend, and increase it. During the process you will get to know the data better. – Haitao Du Feb 16 '17 at 16:38
  • Thanks again @hxd1011! I just have to ask one more question. How do I compare different models using this method? I have a big set of features from which I need to select the relevant ones, and for each feature set I would then need to find the optimal model parameters (the number of neurons in the hidden layer of a feedforward neural network, in my case). Is there some conventional way to do this using the learning curve? – Ira Z. Feb 22 '17 at 12:26
  • @IraZ. your additional question is too big to be answered in comments. Even if you start a new question, it might be closed as too broad. – Haitao Du Feb 22 '17 at 15:33
  • But how can I use the method for my work then? Or is this only a preliminary method and for actual model selection I should employ something else? (My apologies for too many questions.) – Ira Z. Feb 22 '17 at 16:07
  • Ok, thank you! I'm not sure I know how to transfer this conversation there. Could you please help me do that? – Ira Z. Feb 22 '17 at 16:13
  • @IraZ. please read this answer to see how to do model selection in general. http://stats.stackexchange.com/questions/261537/how-to-chose-the-order-for-polynomial-regression/261544#261544 – Haitao Du Feb 22 '17 at 16:17
  • @IraZ. does it help? – Haitao Du Feb 22 '17 at 16:30
  • Thank you, @hxd1011. As you wrote at the end, in my work I use a combination of the two (both data-driven and knowledge-based approaches). I just need more details about the implementation itself, so that I can conduct the model selection procedure in a 'fair' way and describe it for my academic advisor. I was hoping to somehow combine the CV and learning curve approaches, but couldn't find any past studies doing that. That's why I asked you initially if you had any ideas about that. – Ira Z. Feb 22 '17 at 16:37
  • @IraZ. remember to upvote :) – Haitao Du Feb 22 '17 at 16:41
  • I would :) But unfortunately I can't with this reputation yet :( – Ira Z. Feb 22 '17 at 16:43
  • @IraZ. Yes, I also do not know how to use this site well; I think if we have too many comments it will give us the option to chat. – Haitao Du Feb 22 '17 at 16:44
  • It should, but again, unfortunately I need a reputation score of more than 20 for that, and I'm new here so mine is still low. – Ira Z. Feb 22 '17 at 16:46
  • @IraZ. anyway, good luck with your work. The overall advice would be: try to understand the data and the model, think about the bias-variance trade-off (http://scott.fortmann-roe.com/docs/BiasVariance.html), and use tools to see whether you are under-fitting or over-fitting – Haitao Du Feb 22 '17 at 16:49
  • That's a cool document! Thanks a lot for your time and help! – Ira Z. Feb 22 '17 at 17:02