I've been really enjoying the Introduction to Statistical Learning textbook so far, and I'm currently working my way through chapter 6, but I've realized I'm quite confused by the process used in lab 3 of this chapter (pages 256-258).
First, they use the pcr() function's cross-validation option on the entire data set to find the optimal number of principal components. Great! All set (I thought...)
pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, validation="CV")
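For reference, this is how I've been reading off the "optimal" number of components from that fit; the summary() and validationplot() calls are what I believe the lab uses, so correct me if this is the wrong way to inspect it:

library(ISLR)   # Hitters data
library(pls)    # pcr() and validationplot()
Hitters=na.omit(Hitters)                     # drop players with missing Salary, as in the lab
set.seed(2)                                  # the lab sets a seed before the CV, if I remember right
pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, validation="CV")
summary(pcr.fit)                             # CV error (RMSEP) and % variance explained for each M
validationplot(pcr.fit, val.type="MSEP")     # pick M at/near the minimum of this curve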
Next, they "perform PCR on the training data and evaluate its test set performance":
pcr.fit=pcr(Salary~., data=Hitters, subset=train, scale=TRUE, validation="CV")
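For context, here's roughly what the lab does around that second fit. The x, y, train, test and y.test objects are defined earlier in the chapter's labs (before the fit above); I'm reproducing them from memory, so treat the exact setup as my assumption:

x=model.matrix(Salary~., Hitters)[,-1]       # predictor matrix, with dummy variables for factors
y=Hitters$Salary
set.seed(1)
train=sample(1:nrow(x), nrow(x)/2)           # random half of the rows for training
test=(-train)
y.test=y[test]

validationplot(pcr.fit, val.type="MSEP")     # CV curve for the training-half fit; this is where M=7 comes from
pcr.pred=predict(pcr.fit, x[test,], ncomp=7) # predict on the held-out half using M=7
mean((pcr.pred-y.test)^2)                    # test-set MSE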
I'm confused because I thought that cross-validation (which they did first) is basically a better version of doing exactly this! To confuse me even more, they go on to say that with the training/test set approach, they get the "lowest cross-validation error" when 7 components are used. It seems like they are using a validation set together with cross-validation?
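To make my expectation concrete, I assumed that after the first (full-data) CV fit the final step would just be something like this, with M read straight off that CV curve, rather than a second fit on a training subset:

pcr.fit=pcr(Salary~., data=Hitters, scale=TRUE, ncomp=7)   # 7 = whatever M the full-data CV suggests
summary(pcr.fit)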