10

I understand the concept of k-fold cross validation, but I do not understand what a "fold" means. Quoting from the linked page on wikipedia:

The cross-validation process is then repeated k times (the folds)

This seems very vague. Does the 'fold' refer to each repeat of the process? Or is it a noun to refer to the paired training-testing dataset?

shadowtalker
  • 11,395
  • 3
  • 49
  • 109
Alex
  • 3,728
  • 3
  • 25
  • 46
  • 2
    I confess that I don't even know what cross-validation is, but isn't this just the usual English-language meaning of "$k$-fold" meaning "$k$ times", as in "There has been a fourfold increase in violent crime since the legalization of hand-held nuclear weapons." – David Richerby Sep 05 '16 at 20:43
  • that is a very good point. yet, as you can see in the answer the folds can be used to refer to the data. – Alex Sep 06 '16 at 23:39
  • Yeah, though that sounds an awful lot like a misunderstanding by a non-native speaker that's caught on. – David Richerby Sep 07 '16 at 07:24
  • I think you should reconsider which answer to accept. The only one that proceeds from the English-language background of the phrase is @eraould's, the rest is invention. – A. Donda Dec 23 '20 at 18:48
  • I updated the wikipedia entry after reading this question and making my reply below. It's fixed now for Wikipedia but we'll still run into this elsewhere in the community so it's great that the question appears here. – eraoul Dec 24 '20 at 20:21

3 Answers3

10

The wording is definitely awkward there.

Recall that cross-validation partitions a dataset into $K$ roughly equal "sub-datasets." Each one of these "sub-datasets" is called a "fold." $K$-fold cross validation requires re-fitting a model $K$ times, omitting exactly one fold from the data each time, so the term "fold" can also be used to refer to each repetition.

Since there is a one-to-one correspondence between folds and repetitions, there usually isn't a problem with this lax terminology. It is usually apparent from the context which usage is intended, and other times it doesn't make a difference.

shadowtalker
  • 11,395
  • 3
  • 49
  • 109
  • Right, so this interpretation makes each of the disjoint test sets a 'fold'. Thus, the training data can be referred to as 'data not in the fold'. Do you have a reference for this? – Alex Sep 05 '16 at 03:53
  • @Alex only my diploma – shadowtalker Sep 05 '16 at 03:54
  • 1
    And yes, "out-of-fold" is a valid term – shadowtalker Sep 05 '16 at 03:55
  • some references to usage like this on the internet or in papers would be very helpful to me, as I am trying to convince someone that your proposed usage is correct. – Alex Sep 05 '16 at 04:00
  • It's all over this site, for one thing... – shadowtalker Sep 05 '16 at 04:23
  • Look in the kaggle forums; barely a topic goes by without mention of cross-validation(CV). – JDL Sep 05 '16 at 07:32
  • well, I wouldn't exactly say that in general kaggle or this site are authoritative references. – Alex Sep 05 '16 at 10:59
  • 2
    The $k$ models are sometimes referred to as surrogate models, reference e.g. Braga-Neto UM, Dougherty ER.: Is cross-validation valid for small-sample microarray classification? Bioinformatics. 2004 Feb 12;20(3):374-80. http://dx.doi.org/10.1093/bioinformatics/btg419. "folds" is often used in distinction to a "run" (iteration/repetition) of the cross validation (a run then consists of $k$ folds in the "procedure" meaning) – cbeleites unhappy with SX Sep 05 '16 at 11:03
  • 2
    +1 but "data not in the fold" phrase sounds very awkward and extremely unclear @Alex. Don't use it. – amoeba Sep 06 '16 at 13:26
  • @amoeba Thanks. Similary, do you find 'data in the fold' unclear, as it could plausible refer to the test set, or the paired collection of test and training set? – Alex Sep 06 '16 at 23:37
  • @Alex Yes, I find this very unclear too. Actually, on reflection I think I disagree that "fold" can refer to the test set (ssdecontrol, what would be an example sentence with this meaning of "fold"?). I think "fold" refers to the repetition itself. For each fold, there is a training set and a test set. – amoeba Sep 06 '16 at 23:40
  • 1
    I often use "fold" lazily to mean each chunk of the data set. As in "fold 5 is imbalanced compared to the rest of the data" – shadowtalker Sep 07 '16 at 00:29
  • People may indeed call each partition a "fold", but for consistency and clarity they shouldn't; this is only due to unfamiliarity with the English language. More detail in my answer below. – eraoul Dec 24 '20 at 20:23
3

"Fold" refers to a partition (in the set-theoretic meaning of the word) of the sample, $S$, into a training set, $T_j$, and validation set, $V_j$. This means:

  1. $T_j \cap V_j = \emptyset$,
  2. $T_j \cup V_j = S$,

($1 \leq j \leq k$).

Note that in "classic" $k$-fold cross-validation (CV) an additional condition is placed on the validation sets:

  1. $V_i \cap V_j = \emptyset$ ($i \neq j$).

Finally, note that the $k$ in classic $k$-fold CV controls both the number of times the train-validate procedure is carried out, as well as the size of the validation and training sets: $|V_j| \approxeq \frac{1}{k} |S|$, therefore $|T_j| \approxeq \frac{k-1}{k} |S|$.

Jim
  • 1,912
  • 2
  • 15
  • 20
2

I agree with the OP that this terminology is awkward and confusing. Here's my take on it: native English speakers who are well-educated are used to terms such as "twofold" or "threefold", which sound just a bit antiquated but still usable. Critically, however, we don't see these words as containing the noun "fold"; "fold" is more of a suffix here, a funny special construction that is combined with a number to make a colorful variant on "double" or "triple", etc. It has absolutely nothing to do with the verb "to fold" or the noun "fold" that might come up while doing origami and referring to a folded piece of paper.

I suspect that the word "fold" started being used as a noun meaning "partition" in the context of k-fold cross validation when a speaker/writer not as familiar with English or with cross-validation thought that "k-fold" literally meant "making k 'folds' of the data". It's quite understandable that someone would come to this conclusion. However, "k-fold" doesn't mean "making k 'folds'" -- instead, it means "doing cross-validation k times", where the detail of having to also make k partitions of the data is implied.

Personally I never use "fold" in this strange way; I call the data segments in question "partitions", and it's much more clear.

Also, just because this usage has spread through the community doesn't make it reasonable English usage, IMO. I prefer straightforward and clear communication to inventing and using confusing new jargon.

eraoul
  • 121
  • 3