
I've been really enjoying the Introduction to Statistical Learning textbook so far, and I'm currently working my way through Chapter 6. I'm quite confused by the process used in Lab 3 of this chapter (pages 256-258).

First, they use the pcr() function's cross-validation option and the entire data set to calculate the optimal number of principal components. Great! All set (I thought...)

pcr.fit = pcr(Salary ~ ., data = Hitters, scale = TRUE, validation = "CV")

Next, they "perform PCR on the training data and evaluate its test set performance":

pcr.fit = pcr(Salary ~ ., data = Hitters, subset = train, scale = TRUE, validation = "CV")

I'm confused because I thought that cross-validation (which they did first) is basically a better version of doing exactly this! To confuse me even more, they go on to say that with the training/test set approach, they get the "lowest cross-validation error" when 7 components are used. It seems like they are using a validation set together with cross-validation?


1 Answer


It is indeed not very clearly explained in the text, but here is what I think is going on.

First, they perform cross-validation on the whole dataset. They say that "the smallest cross-validation error occurs when $M = 16$ components are used", but also remark that the cross-validation errors for the different values of $M$ are all very similar.

Second, they split the dataset into a training set and a validation set. They put the validation set aside and use cross-validation on the training set only to get the optimal value of $M$. Curiously, they say that "the lowest cross-validation error occurs when $M = 7$ components are used" (there is no comment on why it is now so much smaller than 16). Then they take the model with $M = 7$ and test its performance on the validation set.
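In code, that second step looks roughly like this (a minimal sketch in the spirit of the lab, assuming the Hitters data from the ISLR package and the pls package; because the split is random, the chosen $M$ and the test MSE will vary with the seed):

library(ISLR)   # Hitters data
library(pls)    # pcr()

Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

# Random half/half split into a training set and a held-out validation set
set.seed(1)
train <- sample(1:nrow(x), nrow(x) / 2)
test <- (-train)
y.test <- y[test]

# Cross-validation on the training half only, to choose M
pcr.fit <- pcr(Salary ~ ., data = Hitters, subset = train,
               scale = TRUE, validation = "CV")
validationplot(pcr.fit, val.type = "MSEP")

# Evaluate the chosen model (M = 7 in the book's run) on the held-out half
pcr.pred <- predict(pcr.fit, x[test, ], ncomp = 7)
mean((pcr.pred - y.test)^2)   # validation-set MSE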

It seems like they are using a validation set together with cross-validation?

Yes, exactly! This is a very sensible thing to do, because you want to measure the performance of your algorithm on data that was not used for training in any way, including hyper-parameter tuning. So you use the validation set for measuring performance and the training set to build the model, but in order to choose the value of $M$ you need to do cross-validation on the training set; i.e. the training set gets additionally split into training-training and training-test parts many times.

I'm confused because I thought that cross-validation (which they did first) is basically a better version of doing exactly this

Not exactly. When you perform a single cross-validation, you get a good estimate of optimal $M$, but a potentially bad estimate of the out-of-sample performance.
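For instance, with the full-data fit from the question you can pull out the cross-validated MSEs via pls's MSEP() accessor and pick $M$ (a sketch; I'm assuming the usual array layout where the first entry is the intercept-only model, hence the -1):

cv.mse <- MSEP(pcr.fit, estimate = "CV")$val[1, 1, -1]
which.min(cv.mse)   # the CV-optimal M (16 in the book's run)
min(cv.mse)         # optimistic if quoted as test error, since M was chosen to minimize it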

There are two ways of doing it properly:

  1. Have a separate validation set and do cross-validation on the training set to tune hyperparameters. (That's what they do here.)

  2. Perform nested cross-validation. Search our site for "nested cross-validation" to read up on it; a rough sketch of the idea is below.
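Here is what nested CV could look like for PCR on this dataset (not code from the book; the fold count and the pick-the-minimum rule in the inner loop are my own choices):

library(ISLR)
library(pls)

Hitters <- na.omit(Hitters)
x <- model.matrix(Salary ~ ., Hitters)[, -1]
y <- Hitters$Salary

set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(x)))
outer.mse <- numeric(K)

for (k in 1:K) {
  train.idx <- which(folds != k)

  # Inner loop: pcr()'s built-in 10-fold CV on the outer-training data chooses M
  fit <- pcr(Salary ~ ., data = Hitters[train.idx, ],
             scale = TRUE, validation = "CV")
  cv.mse <- MSEP(fit, estimate = "CV")$val[1, 1, -1]   # drop intercept-only entry
  best.M <- which.min(cv.mse)

  # Outer loop: evaluate that M on the fold that played no part in choosing it
  pred <- predict(fit, x[-train.idx, ], ncomp = best.M)
  outer.mse[k] <- mean((pred - y[-train.idx])^2)
}

mean(outer.mse)   # estimate of out-of-sample MSE that is not biased by tuning M

The average over the outer folds estimates the performance of the whole procedure (PCR plus CV-based choice of $M$), which is exactly what a single round of cross-validation does not give you.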

amoeba