4

Suppose we have a training data set and a test data set. The outcome variable is binary. Is it usually necessary to split the training data set further so that there is a cross-validation data set? Or can you use the whole training data set to build a model and then use this model on the test data set? For logistic regression, for example, would cross-validation really help? If so, what type would be best?

svmguy
    There are plenty of **great answers** on this site that try to answer your question. Please see [Training with the full dataset after cross-validation?](http://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation), [this answer](http://stats.stackexchange.com/a/72324/2798) and [many](http://stats.stackexchange.com/questions/79905/cross-validation-including-training-validation-and-testing-why-do-we-need-thr?rq=1) of the [other](http://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection) threads on this topic. – Amelio Vazquez-Reina Jul 16 '14 at 18:34
    One more: [Internal vs external cross-validation and model selection](http://stats.stackexchange.com/questions/64147/internal-vs-external-cross-validation-and-model-selection) – Amelio Vazquez-Reina Jul 16 '14 at 19:10

2 Answers

6

Cross-validation has two purposes:

  • If you don't use cross-validation and instead randomly select one part of the data for training and another for testing, you may get high accuracy on that particular split but much lower accuracy on a different split. Methods such as k-fold cross-validation average performance over all parts of the data, which helps you find the best-fitting model with the lowest error across the whole data set.

  • In some cases cross-validation is also used to tune hyperparameters of the model, such as the regularization parameter C in logistic regression; you can find documentation about this in the MATLAB help center or in the R documentation files.

So, as discussed, cross-validation plays a critical role in finding a reliable model for your data set. You should select the cross-validation technique based on your model structure and your sample size. 5-fold cross-validation is a well-known default; you can increase the k in k-fold cross-validation if you have a larger sample size.
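The k-fold procedure described above can be sketched in a few lines. This is a minimal illustration using scikit-learn (which a later answer links to); the synthetic dataset and the choice of 5 folds are assumptions for demonstration, not part of the original answer:

```python
# 5-fold cross-validation for logistic regression (sketch).
# The dataset here is synthetic, purely for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Each of the 5 folds serves once as the held-out part; the score
# averaged over folds is more stable than a single random split.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```

The spread of the fold scores (`scores.std()`) is exactly the split-to-split variability the first bullet warns about.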

user2991243
1

In general, cross-validation is needed whenever you have to determine the optimal hyperparameters of a model; for logistic regression this would be the regularization parameter $C$.

As a starting point, look into k-fold cross-validation. If you are using R, see the caret package (http://caret.r-forge.r-project.org/training.html); for Python, see scikit-learn (http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).
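Using cross-validation to pick $C$ amounts to a grid search scored by k-fold CV. A minimal sketch with scikit-learn's `GridSearchCV` (the candidate grid and synthetic data below are assumptions for illustration):

```python
# Tune the regularization parameter C of logistic regression
# by 5-fold cross-validation over a small candidate grid.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Note that `grid.best_score_` is an optimistic estimate of performance, since the same folds were used to choose $C$; the comment below explains how to correct for that.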

mike1886
    The goal of CV is **not** to estimate parameters but to estimate the **generalization performance** and **stability** of your **full learning procedure**. If you choose to determine parameters with CV (you probably mean with a grid search), you **should** add another layer of cross validation, to cross-validate the grid search process itself. In other words, the full learning procedure, regardless of whether you are estimating parameters or hyper-parameters, should be cross validated. Please see the links I provided in the comments to the OP. – Amelio Vazquez-Reina Jul 16 '14 at 18:42
  • Am I right that in principle two CVs have to be performed? The first time to get the model and then again to find the best parameters? – Ben Oct 15 '19 at 05:59
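The nested ("two-layer") procedure the comments describe can be sketched by wrapping the grid search itself in an outer cross-validation loop. A minimal sketch with scikit-learn; the synthetic data and grid are assumptions for illustration:

```python
# Nested cross-validation: the inner loop picks C, the outer loop
# scores the entire tune-then-fit procedure on held-out folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner CV: hyperparameter selection.
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     param_grid={"C": [0.01, 0.1, 1, 10]},
                     cv=5)

# Outer CV: generalization estimate for the full learning procedure,
# so the reported score is not biased by the tuning step.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```

This is the sense in which "two CVs" are performed: the inner one selects the parameters, the outer one estimates how well the whole selection-plus-fitting pipeline generalizes.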