4

Suppose we have a training data set and a test data set. The outcome variable is binary. Is it usually necessary to split the training data set further so that there is a cross-validation data set? Or can you use the whole training data set to build a model and then use this model on the test data set? For logistic regression, for example, would cross-validation really help? If so, what type would be best?

svmguy
    There are plenty of **great answers** on this site that try to answer your question. Please see [Training with the full dataset after cross-validation?](http://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation), [this answer](http://stats.stackexchange.com/a/72324/2798) and [many](http://stats.stackexchange.com/questions/79905/cross-validation-including-training-validation-and-testing-why-do-we-need-thr?rq=1) of the [other](http://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection) threads on this topic. – Amelio Vazquez-Reina Jul 16 '14 at 18:34
    One more: [Internal vs external cross-validation and model selection](http://stats.stackexchange.com/questions/64147/internal-vs-external-cross-validation-and-model-selection) – Amelio Vazquez-Reina Jul 16 '14 at 19:10

2 Answers

6

Cross-validation has two purposes:

  • If you don't use cross-validation and instead randomly select one part of the data for training and another for testing, you may get high accuracy on that particular split but much lower accuracy on a different split. Methods such as k-fold cross-validation average performance over all parts of the data, which helps you find the best-fitting model with the lowest error across the whole data set.

  • In some cases cross-validation is also used to tune hyperparameters of the model, such as the regularization parameter C in logistic regression; you can find documentation about this in the MATLAB help center or in the R documentation files.

So, as discussed, cross-validation plays a critical role in finding a reliable model for your data set. You should select the cross-validation technique based on your model structure and your sample size. 5-fold cross-validation is a well-known default; you can increase the k in k-fold cross-validation if you have a larger sample size.
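The k-fold procedure described above can be sketched in a few lines. This is a minimal illustration using scikit-learn (which a later answer links to); the synthetic dataset and the choice of 5 folds are assumptions for demonstration, not part of the original answer:

```python
# 5-fold cross-validation for logistic regression (sketch).
# The dataset here is synthetic, purely for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Each of the 5 folds serves once as the held-out part; the score
# averaged over folds is more stable than a single random split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```

The spread of the fold scores (`scores.std()`) is exactly the split-to-split variability the first bullet warns about.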

user2991243
1

In general, cross-validation is needed whenever you have to determine the optimal hyperparameters of a model; for logistic regression this would be the regularization parameter $C$.

As a starting point, look into k-fold cross-validation. If you are using R, see the caret package (http://caret.r-forge.r-project.org/training.html); for Python, see scikit-learn (http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).
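Using cross-validation to pick $C$ amounts to a grid search scored by k-fold CV. A minimal sketch with scikit-learn's `GridSearchCV` (the candidate grid and synthetic data below are assumptions for illustration):

```python
# Tune the regularization parameter C of logistic regression
# by 5-fold cross-validation over a small candidate grid.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Note that `grid.best_score_` is an optimistic estimate of performance, since the same folds were used to choose $C$; the comment below explains how to correct for that.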

mike1886
    The goal of CV is **not** to estimate parameters but to estimate the **generalization performance** and **stability** of your **full learning procedure**. If you choose to determine parameters with CV (you probably mean with a grid search), you **should** add another layer of cross validation, to cross-validate the grid search process itself. In other words, the full learning procedure, regardless of whether you are estimating parameters or hyper-parameters, should be cross validated. Please see the links I provided in the comments to the OP. – Amelio Vazquez-Reina Jul 16 '14 at 18:42
  • Am I right that in principle two CVs have to be performed? The first time to get the model and then again to find the best parameters? – Ben Oct 15 '19 at 05:59
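The nested ("two-layer") procedure the comments describe can be sketched by wrapping the grid search itself in an outer cross-validation loop. A minimal sketch with scikit-learn; the synthetic data and grid are assumptions for illustration:

```python
# Nested cross-validation: the inner loop picks C, the outer loop
# scores the entire tune-then-fit procedure on held-out folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner CV: hyperparameter selection.
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={"C": [0.01, 0.1, 1, 10]},
                     cv=5)

# Outer CV: generalization estimate for the full learning procedure,
# so the reported score is not biased by the tuning step.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```

This is the sense in which "two CVs" are performed: the inner one selects the parameters, the outer one estimates how well the whole selection-plus-fitting pipeline generalizes.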