
In Max Kuhn's book "Applied Predictive Modeling" this is written about k-fold cross-validation:

As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller (i.e., the bias is smaller for k = 10 than k = 5).

I understand why the bias is smaller for k = 10: each model is fit on a larger share of the data, so the fit is closer to what we would get on the full training set, and in exchange the variance increases.

But I do not understand what is meant by:

As k gets larger, the difference in size between the training set and the resampling subsets gets smaller.

If k gets bigger, we will have fewer points in the resampling subset and more points in the training set. So, if k = 2 and N = 100, the size of the training set is 50 and the size of the resampling subset is also 50; the difference in size is 0, even though k = 2. If k = 4 and N = 100, the size of the training set is 75 and the size of the resampling subset is 25; the difference in size is 75 - 25 = 50. So the difference increases with bigger k.
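Here is a small Python sketch of the arithmetic I am doing, under my reading where the training set is the k - 1 folds we fit the model on and the resampling subset is the single held-out fold:

```python
N = 100  # total number of points in the data set

for k in (2, 4, 10):
    holdout_size = N // k            # points in the single held-out fold
    train_size = N - holdout_size    # points in the k - 1 folds we fit the model on
    print(f"k = {k:2d}: training = {train_size}, resampling = {holdout_size}, "
          f"difference = {train_size - holdout_size}")

# k =  2: training = 50, resampling = 50, difference = 0
# k =  4: training = 75, resampling = 25, difference = 50
# k = 10: training = 90, resampling = 10, difference = 80
```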

I guess I am just not understanding what is meant by the sizes of the sets.

Anni
  • Related, probably not duplicate: http://stats.stackexchange.com/questions/1826/cross-validation-in-plain-english – Sycorax Apr 02 '16 at 23:44
  • @Anni, I hope your confusion was cleared by the answers below. Would you mind *accepting* an answer if that's the case? If you still have doubts, let us know. – Vishal Apr 15 '16 at 21:17

2 Answers


I broke out my copy. What I believe is being said is that the training set is the original data set.

Example

With an original data set of N = 100:

k = 2: subset size = 50, difference = 50
k = 4: subset size = 75, difference = 25
k = 10: subset size = 90, difference = 10

As k increases, the difference between the original data set and the cross-validation subsets becomes smaller.
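If it helps, here is the same arithmetic as a quick Python sketch (here "subset size" means the portion of the original N points each cross-validation model is fit on, i.e. (k - 1)/k of them):

```python
N = 100  # size of the original data set

for k in (2, 4, 10):
    subset_size = N * (k - 1) // k   # size of each cross-validation subset the model is fit on
    difference = N - subset_size     # = N / k, which shrinks as k grows
    print(f"k = {k:2d}: subset size = {subset_size}, difference = {difference}")

# k =  2: subset size = 50, difference = 50
# k =  4: subset size = 75, difference = 25
# k = 10: subset size = 90, difference = 10
```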

Matt L.

Kuhn is comparing the size of the overall (original) training set with the size of each individual resampling subset that is used to build the models.

You are thinking that the comparison is between the size of the original training set and the size of each hold-out subset.

Let's say:

$$ n = \text{size of the training set} $$
$$ k = \text{number of folds} $$
$$ n_k = \text{size of each resampling subset} $$

Let's consider the example used in this book (in the same section):

[Figure from the book: a schematic of 3-fold cross-validation on a training set of 12 samples.]

In this example: $$ k = 3, \quad n = 12, \quad n_k = 8 $$ Hence, the difference is $n - n_k = 4$.

Now, if we take the extreme case (leave-one-out, where $k = n$): $$ k = 12, \quad n = 12, \quad n_k = 11 $$ Hence, the difference is $n - n_k = 1$.

Notice that $n$ always remains the same, and as you can see, the difference $n - n_k$ shrinks as $k$ increases.
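As a numeric check, here is a small Python sketch using scikit-learn's KFold (my own choice of tool; the book itself works in R with caret). It uses values of $k$ that divide $n = 12$ evenly, with $k = 12$ as the leave-one-out extreme:

```python
import numpy as np
from sklearn.model_selection import KFold

n = 12                  # size of the original training set
samples = np.arange(n)

for k in (3, 4, 6, 12):
    # n_k = size of the data each model is fit on (the k - 1 retained folds)
    n_k = min(len(train_idx) for train_idx, _ in KFold(n_splits=k).split(samples))
    print(f"k = {k:2d}: n_k = {n_k:2d}, difference n - n_k = {n - n_k}")

# k =  3: n_k =  8, difference n - n_k = 4
# k =  4: n_k =  9, difference n - n_k = 3
# k =  6: n_k = 10, difference n - n_k = 2
# k = 12: n_k = 11, difference n - n_k = 1
```

Either way, $n - n_k = n/k$, which can only get smaller as $k$ grows.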

Vishal