5

I have a data set with 16 samples and 250 predictors, and I'm being asked to perform CV on it. In the examples I've looked at, you create training and testing subsets. The sample size already seems quite small to me to be split into even smaller subsets. My question is: does CV make sense with such a small sample?

Steffen Moritz
zach

3 Answers

7

I have concerns about involving 250 predictors when you have 16 samples. However, let's set that aside for now and focus on cross-validation.

You don't have much data, so any split of the full set into training and validation sets will leave very few observations to train on. However, there is something called leave-one-out cross-validation (LOOCV) that might work for you. You have 16 observations: train on 15 and validate on the one that is left out, and repeat until each of the 16 samples has been held out once. The software you use should have a function to do this for you. For instance, Python's sklearn package has utilities for LOOCV. I'll include some code from the sklearn website.

# https://scikit-learn.org/stable/modules/generated/
# sklearn.model_selection.LeaveOneOut.html
#
>>> import numpy as np
>>> from sklearn.model_selection import LeaveOneOut
>>> X = np.array([[1, 2], [3, 4]])
>>> y = np.array([1, 2])
>>> loo = LeaveOneOut()
>>> loo.get_n_splits(X)
2
>>> print(loo)
LeaveOneOut()
>>> for train_index, test_index in loo.split(X):
...    print("TRAIN:", train_index, "TEST:", test_index)
...    X_train, X_test = X[train_index], X[test_index]
...    y_train, y_test = y[train_index], y[test_index]
...    print(X_train, X_test, y_train, y_test)
TRAIN: [1] TEST: [0]
[[3 4]] [[1 2]] [2] [1]
TRAIN: [0] TEST: [1]
[[1 2]] [[3 4]] [1] [2]

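That snippet is just the two-sample toy example from the documentation. As a rough sketch of how it might look on data shaped like yours (16 samples, 250 predictors), you could pair LeaveOneOut with cross_val_score; the ridge model and the random arrays below are placeholders for whatever model and data you actually have:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 250))   # stand-in for your 16 x 250 predictor matrix
y = rng.normal(size=16)          # stand-in for your response

# One surrogate model per left-out sample; strong regularization is sensible when p >> n.
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print(len(scores))       # 16 folds, one per held-out sample
print(-scores.mean())    # average squared error across the 16 held-out samples
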
Do you, by any chance, work in genetics?

Dave
3

I'm being asked to perform CV on the set.

I'm going to assume that this cross validation will be for internal validation (part of verification) of the performance of the model you get from your 16 x 250 data set.
That is, you are not going to do any data-driven hyperparameter optimization (which can also use cross validation results).

Yes, cross-validation does make sense here. The results will be very uncertain because only 16 samples contribute to the validation, but given your small data set, repeated k-fold (8-fold would probably be the best choice) or a similar resampling validation (out-of-bootstrap, repeated set validation) is the best you can do in this situation.
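
If you happen to work in Python, a minimal sketch of repeated 8-fold CV with scikit-learn is shown below; the ridge model and the random 16 x 250 arrays are placeholders, not a recommendation for your actual data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 250))   # placeholder for your predictor matrix
y = rng.normal(size=16)          # placeholder for your response

# 8-fold CV repeated 25 times: 200 surrogate models, 2 held-out cases per fold.
cv = RepeatedKFold(n_splits=8, n_repeats=25, random_state=1)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print(-scores.mean())       # point estimate of the error
print(scores.std(ddof=1))   # spread across folds and repetitions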

This large uncertainty, by the way, also means that data-driven optimization is basically impossible with such a small data set: the uncertainty due to the limited number of tested cases depends on the absolute number of tested cases, and in validation there is no way to mitigate the small sample size (unlike in training, not even using fewer features helps).

Since few cases and many features in training come with a risk of overfitting, it is important to check the stability of the modeling. This can be done in a very straightforward fashion from repeated (a.k.a. iterated) cross-validation: any difference in the predictions for the same case between runs (repetitions/iterations) cannot be due to the tested case, so it must be due to differences in the models (i.e. the training does not lead to stable models).
Have a look at our paper for more details: Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations, Anal Bioanal Chem, 390, 1261-1271 (2008). DOI: 10.1007/s00216-007-1818-6
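
One way to operationalize this check (a sketch assuming Python/scikit-learn; estimator and data are again placeholders) is to collect the out-of-fold prediction for every case under several differently shuffled 8-fold splits and look at how much the predictions for the same case vary between runs:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 250))   # placeholder data
y = rng.normal(size=16)

# One column of out-of-fold predictions per repetition (20 repetitions here).
preds = np.column_stack([
    cross_val_predict(Ridge(alpha=1.0), X, y,
                      cv=KFold(n_splits=8, shuffle=True, random_state=r))
    for r in range(20)
])

# The case itself is fixed across repetitions, so the spread of its predictions
# reflects differences between the surrogate models: large spread = unstable training.
per_case_sd = preds.std(axis=1, ddof=1)
print(per_case_sd.round(2))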

There are only $\binom{16}{2} = 120$ ways to leave out 2 of the 16 cases, so you may want to consider running all of those combinations instead of randomly assigned folds.
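
In scikit-learn terms this exhaustive leave-2-out scheme is available as LeavePOut; again only a sketch with placeholder model and data:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeavePOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 250))   # placeholder data
y = rng.normal(size=16)

lpo = LeavePOut(p=2)
print(lpo.get_n_splits(X))       # 120: every possible pair of held-out cases
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=lpo,
                         scoring="neg_mean_squared_error")
print(-scores.mean())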

In contrast to @Dave and @olooney, I do not recommend leave-one-out CV, for two reasons:

  • LOO does not allow the aforementioned measurement of stability (each surrogate model is tested with exactly one case, so we cannot distinguish whether variation is due to the case or due to the model). But checking stability is really crucial with such a small cases-to-features ratio.
  • The second reason applies to classification only: LOO on a classification task always tests a case that belongs to a class which is underrepresented in the respective training split. For very small sample sizes, this can cause a huge pessimistic bias. If that's your situation, you're probably better off with a stratified resampling validation that does not (or hardly) disturb the relative class frequencies (see the sketch after this list).
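
If your task is indeed classification, a stratified scheme in scikit-learn might look like the sketch below (the logistic model and the made-up 10-vs-6 class labels are placeholders for your own setup):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 250))        # placeholder predictors
y = np.array([0] * 10 + [1] * 6)      # made-up class labels (10 vs 6)

# Each fold keeps the 10:6 class ratio roughly intact, unlike LOO, where the
# left-out case's class is always underrepresented in the training split.
cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=25, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")
print(scores.mean())
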
cbeleites unhappy with SX
1

The theory behind cross-validation works all the way down to the case where $k = n$, which is called leave-one-out cross-validation. LOOCV is the best choice when $n$ is small. The upside of using cross-validation is that your estimate of generalization error will be unbiased and you'll be able to form non-parametric confidence intervals for estimated parameters. The downside is that it doesn't magically create samples from nothing: the generalization error will probably be very large, and the confidence intervals will be very wide.

If you're planning to use CV for model selection or feature selection, you probably won't have much luck with 16 observations and 250 features. Let's say you use BIC for model selection, and you consider all 250 models, each with a single predictor. You can use CV to estimate and draw a confidence interval around the BIC for each model, but you'll likely find that the confidence intervals overlap considerably. There may be a "best" model with BIC $= 10 \pm 50$ (lower is better), but if the other 249 models have BIC $= 11 \pm 50$, then it's extremely unlikely that the "best" model is actually the best. The upside is that CV will let you estimate confidence intervals, so you'll know whether this is the case. The downside is that it won't necessarily allow you to choose a single best model with any degree of confidence.
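
To make the overlap concrete, here's a rough sketch with synthetic data (using LOOCV mean squared error as a stand-in for BIC to keep the code short): fit all 250 single-predictor models, attach a crude standard error to each estimate, and count how many competitors' intervals overlap the apparent winner's interval:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 250))
y = X[:, 0] + rng.normal(size=16)    # only feature 0 is truly informative

means, ses = [], []
for j in range(X.shape[1]):
    # LOOCV squared errors for the model that uses feature j alone
    errs = -cross_val_score(LinearRegression(), X[:, [j]], y,
                            cv=LeaveOneOut(), scoring="neg_mean_squared_error")
    means.append(errs.mean())
    ses.append(errs.std(ddof=1) / np.sqrt(len(errs)))

best = int(np.argmin(means))
# How many of the 249 other models have an interval overlapping the winner's?
overlap = sum(means[j] - ses[j] <= means[best] + ses[best]
              for j in range(X.shape[1]) if j != best)
print(best, overlap)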

olooney
    The OP doesn't have 250 models, but 250 features, also known as independent variables. – jbowman Jul 30 '19 at 19:42
  • @jbowman I'm discussing [subset selection](https://en.wikipedia.org/wiki/Feature_selection#Subset_selection) to illustrate the limitations of CV. The two are often used together, so much so that when people say "cross-validation", they often really mean "subset selection using cross-validation." Subset selection is a special case of model selection that is also a feature selection method. In particular for greedy forward selection, we would start with one model per feature. Greedy forward selection is often used when sample size is small and the number of candidate features is large. – olooney Jul 30 '19 at 20:18