
Suppose I have some data which I have split into $k$ folds (where $k$ is less than the number of data points $n$). I train the model on the training folds and want to test on the remaining fold.

For $k$-fold cross-validation with $k < n$ (so each validation fold contains multiple data points), I believe the following is correct: for each fold I calculate $\sqrt{\frac{1}{m}\sum_{i=1}^m (\hat{y}_i-y_i)^2}$, where $m$ is the number of data points in that fold. I then average these fold-level errors to get the error for the model. I can repeat this for different tuning parameter values to see which value gives the lowest overall error.
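For concreteness, here is a minimal sketch (with made-up numbers and hand-rolled folds, not any particular library's CV splitter) of the fold-averaged RMSE I have in mind:

```python
import numpy as np

def fold_rmse(y_hat, y):
    """RMSE over the m points in one validation fold."""
    return np.sqrt(np.mean((y_hat - y) ** 2))

# Toy data: n = 6 points, hypothetical out-of-fold predictions.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_pred = np.array([1.1, 1.8, 3.3, 3.9, 5.2, 5.7])

# Split the indices into k = 3 folds and score each fold separately.
folds = np.array_split(np.arange(len(y_true)), 3)
per_fold = [fold_rmse(y_pred[idx], y_true[idx]) for idx in folds]

# Cross-validation error = average of the k fold-level RMSEs.
cv_error = np.mean(per_fold)
```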

My question is then: what do you do when $k = n$ (LOOCV)? There are two different approaches I can think of.

Approach 1: the same as above. However, since each validation set now consists of a single data point, $\sqrt{\frac{1}{m}\sum_{i=1}^m (\hat{y}_i-y_i)^2} = \sqrt{(\hat{y}-y)^2} = |\hat{y}-y|$. Averaging these over the $n$ folds is then identical to calculating the mean absolute error (MAE).

Approach 2: square each error, average the squared errors over all $n$ validation sets, and then take the square root. This pools all left-out points into a single RMSE, and it wouldn't in general be the same as the MAE.
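A toy example (with hypothetical per-point LOOCV residuals) showing that the two aggregations generally disagree:

```python
import numpy as np

# Hypothetical residuals (y_hat - y), one per left-out point in LOOCV.
residuals = np.array([0.5, -0.1, 0.3, -0.7])

# Approach 1: each fold's "RMSE" collapses to |y_hat - y|;
# averaging them gives the mean absolute error.
approach_1 = np.mean(np.abs(residuals))

# Approach 2: pool the squared errors across folds, then square-root,
# i.e. the RMSE over all left-out points.
approach_2 = np.sqrt(np.mean(residuals ** 2))

# By the power-mean inequality, approach_2 >= approach_1, with equality
# only when all |residuals| are equal.
```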

Is either of these the correct approach? And if so, why is one used over the other?

user112495
  • Are you trying to ask what to do instead of $k$-fold cross-validation if you have only few datapoints? – Tim Nov 15 '21 at 12:28
  • @Tim Sorry if it wasn't clear - no. I guess I'm essentially asking how you calculate the root mean square error in LOOCV. Do you sum the squared errors over all the validation sets and then square root, or do you calculate the RMSE for each validation set and then average? – user112495 Nov 15 '21 at 12:29
  • Have you already seen https://stats.stackexchange.com/questions/85507/what-is-the-rmse-of-k-fold-cross-validation#:~:text=The%20RMSE,observations%20of%20CV%20instance%20j ? – jros Nov 15 '21 at 13:57

0 Answers