
I have a couple conceptual questions on cross-validation and sample splitting.

  1. For sample splitting, sometimes we split into just train/test sets, while other times we split into train/validation/test sets. What is the rule for when to split into 2 vs. 3 data sets?

So far, the way I've explained it is that we use 3 sets when we have hyperparameters and 2 when we do not. However, I'm not satisfied with that explanation. Is there a better way to explain it?

For example, if building a KNN, we have one hyperparameter, k, meaning we need 3 data sets. Obviously, k = 1 gives zero error when evaluated on the training set itself, so we need a validation set to determine a better k. For linear regression, we would only need train/test: the beta coefficients can be estimated on the training set, and then we see how the model performs on the test set. However, for something like Lasso regression, we have a hyperparameter, lambda, yet we can still seemingly split into 2 data sets.

What is a better way to explain when to split into 3 vs. 2? Maybe a better explanation/rule is that we split into 3 data sets whenever we have a parameter that can lead to overfitting the training set?
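To make the KNN case concrete, here is a rough sketch of how I currently understand the 3-way split (using scikit-learn; the data, split sizes, and range of k are just illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# toy data: noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.3, size=200)

# 60/20/20 train/validation/test split
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# k = 1 always "wins" on the training set itself,
# so pick k by its error on the held-out validation set
best_k, best_mse = None, np.inf
for k in range(1, 21):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    if mse < best_mse:
        best_k, best_mse = k, mse

# the untouched test set gives one final estimate of generalization error
final = KNeighborsRegressor(n_neighbors=best_k).fit(X_train, y_train)
test_mse = mean_squared_error(y_test, final.predict(X_test))
```

The point I'm trying to capture: the validation set is spent on choosing k, so its error is optimistically biased, and only the test set stays untouched.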

  2. When doing cross-validation, like leave-one-out or k-fold, is cross-validation built into the optimization process? And conceptually, how does this work? I know k-fold and leave-one-out CV are used when we have smaller data sets. How does this relate to sample splitting into 2 or 3 sets? Are the observations that are left out viewed as the "validation" set or the "test" set?

If the left-out observations are viewed as a validation set, could we jointly optimize over them? For example:

For a linear regression, we can do cross-validation and calculate the MSE on the data points not used to train the model. So would we have to go through the entire cross-validation cycle for the entire parameter space? Let's say we have 10 observations, where $Y_j$ is the observation left out in fold $j$. Would the cost function be:

$\sum_{j = 1}^{10}\left(\sum_{k \neq j}\left(Y_{k} - (b_{0} + b_{1}X_{k})\right)^2 + \left(Y_{j} - (b_{0} + b_{1}X_{j})\right)^2\right)$
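For comparison, here is how I currently understand the standard (non-joint) LOOCV procedure, as a numpy sketch with made-up data: fit on the 9 remaining points, score only on the held-out point, and average the 10 squared errors.

```python
import numpy as np

# toy data: 10 observations from a noisy line
rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=10)
Y = 2.0 + 1.5 * X + rng.normal(0, 0.5, size=10)

errors = []
for j in range(len(X)):
    keep = np.arange(len(X)) != j                  # the 9 training points
    b1, b0 = np.polyfit(X[keep], Y[keep], 1)       # OLS fit on this fold
    errors.append((Y[j] - (b0 + b1 * X[j])) ** 2)  # error on the left-out point

loocv_mse = np.mean(errors)  # the CV estimate of prediction error
```

Note that in this version nothing is jointly optimized across folds: each fold's coefficients minimize only that fold's training error, and the held-out errors are used purely for evaluation.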

For something like KNN, would we select the K with the lowest MSE via CV? However, if we view the left-out observations as the validation set, wouldn't we still need a separate test set to see how the model performs or generalizes?

  3. I guess a better question is: besides using k-fold when the data set is small, when else would you actually use it?

Would you use it to estimate the b0/b1 parameters of a linear regression? Or is it used to pick among several linear regression models, each with a different number of features? I vaguely remember reading somewhere that after selecting a model, you rebuild it with ALL the available data, which is something not done in plain sample splitting. However, I forgot how this ties in with everything else. Maybe after determining you need 3 features instead of 2, you rebuild the 3-feature model using all the data?
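Here is a sketch of the workflow I half-remember (scikit-learn; the candidate models and data are made up): use CV only to pick the number of features, then refit the chosen model on all the data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

# toy data: 3 informative features plus noise
rng = np.random.default_rng(2)
n = 100
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.5, n)

# candidate models: use the first 1, 2, or 3 features
cv_mse = {}
for p in (1, 2, 3):
    scores = cross_val_score(LinearRegression(), X[:, :p], y,
                             cv=5, scoring="neg_mean_squared_error")
    cv_mse[p] = -scores.mean()

# CV's job here is model selection: pick the feature count with lowest CV MSE
best_p = min(cv_mse, key=cv_mse.get)

# then refit the chosen model on ALL the data to get the final coefficients
final_model = LinearRegression().fit(X[:, :best_p], y)
```

If this is right, CV never directly estimates the final b0/b1; it only scores the candidate model structures, and the final coefficients come from the full-data refit.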

Blah, I think I just have several concepts confused.
