
I am currently working on a project where my cross-validation error rate is very low but the test-set error rate is high, which might indicate that my model is overfitting. But why doesn't the cross-validation show the overfitting, while the test set does?

More specifically, I have about 2 million samples with 100 variables (n >> p). I randomly split the dataset 80/20 into a training set and a test set. I then fit a model (XGBoost) using 5-fold cross-validation on the training set, and the estimated error rate is quite low. Next, I used the same parameter settings and fit the model on the entire training set. Surprisingly, when I evaluate the model on the test set, the error rate is significantly higher than the CV error rate. WHY?
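To make the setup concrete, here is a minimal sketch of the workflow described above, with synthetic placeholder data and hypothetical parameter settings standing in for my actual pipeline:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Synthetic placeholder standing in for the real ~2M x 100 dataset
rng = np.random.RandomState(42)
X = rng.randn(10000, 100)
y = rng.randint(0, 3, size=10000)          # 3 classes, purely illustrative

# Random 80/20 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hypothetical parameter setting (not my tuned values)
params = {"objective": "multi:softprob", "num_class": 3,
          "max_depth": 6, "eta": 0.1}
num_rounds = 50

# (5-fold CV on the training set to estimate the error rate goes here; see below.)

# Refit on the entire training set with the same parameter setting
booster = xgb.train(params, xgb.DMatrix(X_train, label=y_train),
                    num_boost_round=num_rounds)

# Evaluate multinomial log loss on the held-out test set
p_test = booster.predict(xgb.DMatrix(X_test))
print("test log loss:", log_loss(y_test, p_test))
```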

+++ 1. Edit about the error rate +++

The error rate is actually multinomial log loss. I achieved a CV error rate of 1.320044 (+/- 0.002126) and a test error rate of 1.437881. These two numbers might look close at first glance, but they are not: the performance range in this project runs roughly from ~1.55 to ~1.30, so on that scale the gap is substantial.
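For reference, by multinomial log loss I mean the standard multiclass log loss,

$$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log p_{ik},$$

where $y_{ik}$ is 1 if sample $i$ belongs to class $k$ (and 0 otherwise) and $p_{ik}$ is the predicted probability of class $k$ for sample $i$.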

The 5-fold cross-validation procedure works as follows (a small code sketch follows the list):

  1. Divide the training set into 5 subsets.
  2. Iteratively fit a model on 4 of the subsets and evaluate the performance on the remaining subset.
  3. Average the performance over all five iterations.
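In code, the loop above roughly corresponds to something like this (continuing the sketch from the first snippet, i.e. reusing the hypothetical `X_train`, `y_train`, `params`, and `num_rounds` defined there):

```python
from sklearn.model_selection import KFold

# Step 1: divide the training set into 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for fit_idx, val_idx in kf.split(X_train):
    # Step 2: fit on 4 folds, evaluate on the remaining fold
    dfit = xgb.DMatrix(X_train[fit_idx], label=y_train[fit_idx])
    dval = xgb.DMatrix(X_train[val_idx], label=y_train[val_idx])
    booster = xgb.train(params, dfit, num_boost_round=num_rounds)
    p_val = booster.predict(dval)          # class probabilities on the held-out fold
    fold_scores.append(log_loss(y_train[val_idx], p_val))

# Step 3: average the performance over the five folds
print("CV log loss: %.6f (+/- %.6f)" % (np.mean(fold_scores), np.std(fold_scores)))
```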

I mean, if my parameter settings make the model overfit, then I should see it in this cross-validation procedure, right? But I don't see it until I use the test set. Under what circumstances on earth could this happen?

Thanks!

++++++++++++++ 2. Added ++++++++++++++

The only reason I can think of for why the CV error rate would differ from the test-set error rate is:

> Cross-Validation will not perform well to outside data if the data you do have is not representative of the data you'll be trying to predict! -- here

But I randomly split the 2-million-sample data set 80/20, so I believe the training set and test set should come from the same distribution of variables.
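A quick sanity check along these lines (reusing the hypothetical `X_train`, `X_test`, `y_train`, `y_test` from the sketch above) would be to compare label proportions and feature summaries between the two splits:

```python
# Class proportions should be nearly identical for a random 80/20 split of 2M samples
print("train class proportions:", np.bincount(y_train) / len(y_train))
print("test class proportions: ", np.bincount(y_test) / len(y_test))

# Rough distribution check: per-feature means should also be very close
print("max abs difference in feature means:",
      np.max(np.abs(X_train.mean(axis=0) - X_test.mean(axis=0))))
```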

++++ 3. Edit about data leakage ++++

From the comments, @Karolis Koncevičius and @darXider raised an interesting guess: data leakage. I think this might be the culprit here. What exactly is data leakage? How can it be avoided, and how can it be detected? I'll do more research about it.
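For reference, the kind of leak described in the comments comes from fitting a transformation (PCA, scaling, target statistics, etc.) on all the data before splitting. One way I could avoid it, if I understand correctly, is to fit every preprocessing step inside the cross-validation folds, e.g. with a scikit-learn `Pipeline` around the XGBoost classifier. A minimal sketch, again using the hypothetical `X_train`/`y_train` from above and PCA purely as an example transformation:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Every step is re-fitted inside each CV fold, so no information from the
# validation fold (or from the test set) leaks into the engineered features.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=20)),         # illustrative transformation only
    ("xgb", XGBClassifier(n_estimators=50, max_depth=6,
                          learning_rate=0.1, objective="multi:softprob")),
])

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="neg_log_loss")
print("CV log loss:", -scores.mean(), "+/-", scores.std())
```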

THANKS!

Frank Fan
  • Interesting issue. Maybe you can add more concrete information? For example now you say "test error is significantly higher" - what makes you say it is significantly higher? I would include the actual numbers instead. And as a first guess - you can try checking error rates in each fold of cross validation separately. Maybe the procedure has variable accuracy and your final classifier falls on the bad end of it. But with 2 million samples that should hardly be the case... – Karolis Koncevičius Mar 01 '17 at 21:38
  • Thanks, @KarolisKoncevičius! Edited accordingly. The sd is in the (+/-). – Frank Fan Mar 01 '17 at 21:50
  • Have you tried implementing a *nested* CV scheme? It'll give you a better estimate of the real performance of your model(s). – darXider Mar 01 '17 at 21:54
  • Hi @darXider, I could try that, but I am not sure nested CV will help because, in my previous experience, the two have been quite close to each other. I am happy to try, though (even if nested CV is slower)! Still, I am curious why my standard k-fold cross-validation doesn't work here. Or more generally, under what circumstances does k-fold cross-validation fail? – Frank Fan Mar 01 '17 at 21:58
  • A few other guesses: 1) Was this the first model you tried? If you tried a lot of models and selected the best performing based on cross validation then that model would most likely be overfitting at least to some degree. 2) Maybe you are doing something invalid - like subtracting the mean from the whole training set, instead of doing it in each fold separately? – Karolis Koncevičius Mar 01 '17 at 22:00
  • Hi @KarolisKoncevičius. 1) No, this is not the first model I tried. I did a lot of feature engineering on the entire data set (BEFORE the train/test split). At the beginning, my CV error rate was almost the same as the test error rate. But after I added some specific features, the CV error rate started to disagree with the test error rate. But why would the CV error rate differ from the test error rate? This doesn't make sense to me. 2) I need to think about this some more. But I tried my best not to mess it up that way; I did the feature engineering on the entire data set and THEN split it. – Frank Fan Mar 01 '17 at 22:05
  • Let me try to understand the situation better first. How many times have you randomly split your data set in 8/2 **AND** repeated the 5-fold CV? Are you using the same random seed when splitting (in other words, do you do the same splitting every time)? If not, do you get more or less the same performance every time you repeat your different random split + CV? The +/- SD is not the standard deviation of your scores from trying your 5-fold CV many times over, is it (I am guessing it's just the SD of the 5 scores from *ONE* run of 5-fold CV)? – darXider Mar 01 '17 at 22:05
  • Hi @darXider, thanks. I only did the 8/2 split and the 5-fold CV **one** time. I could do it again with a different seed, but given that I have 2 million samples, I suspect I would observe a similar issue. Correct, the SD is of the 5 scores from ONE run of 5-fold CV. So would you suggest running the same procedure again with a different splitting seed and seeing if I still get the same issue (which would cost another ~10 hours of fitting)? – Frank Fan Mar 01 '17 at 22:09
  • @Karolis Koncevičius has a good point. Sometimes when feature engineering, you have to be careful to avoid any data leak between the training and test sets. For example, if you do a PCA on your original, untouched data, use PC1 and PC2 as "new" features, and then split your dataset into train and test, you are leaking information from the training set into the test set. That will boost your score. You mentioned that after some feature engineering, your CV score and test score started to disagree. That could suggest some sort of information leak between the training set and the test set. – darXider Mar 01 '17 at 22:10
  • I'd say do *at least* one more run with a different seed and compare the CV score and test score from the new run to those from the original run. The reason for doing a nested CV is to get a better estimate of the variation of the performance with different test sets. Also, identify which of your engineered features resulted in a large disagreement between CV score and test score; there could be some information leak there, and you might need to do proper `Pipeline()`ing and `FeatureUnion`ing (if you are using Python, that is). – darXider Mar 01 '17 at 22:14
  • Interesting! Interesting about the data leakage! Is it true that even if I do the feature engineering BEFORE splitting, I can still run into a data leakage issue? This is the first time I have encountered (or even heard of) the "leakage" problem. Could you please point me to some literature (blog posts, discussions, anything) about data leakage? – Frank Fan Mar 01 '17 at 22:16
  • Sure! [one](http://stats.stackexchange.com/questions/55718/pca-and-the-train-test-split) and [two](http://stats.stackexchange.com/questions/239898/is-it-actually-fine-to-perform-unsupervised-feature-selection-before-cross-valid). I think there is some mention of this scenario in the book Elements of Statistical Learning (which is free and available online as a PDF file). – darXider Mar 01 '17 at 22:24
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/54650/discussion-between-frank-fan-and-darxider). – Frank Fan Mar 01 '17 at 22:25
  • In some sense, you should expect this. You used the cross-validation error rate to tune hyperparameters in your model. In the same way that the decisions made in your algorithm bias the training estimate of the error rate, the decisions you make using CV bias the cross-validation error rate. Also, the CV and hold-out errors measure slightly different things: CV averages over training sets, the hold-out does not. The comments above about feature engineering on the entire data set are also correct; that is going to leak information from the CV validation sets. – Matthew Drury Mar 01 '17 at 22:56
  • Thanks @MatthewDrury. I agree with you. I may be optimizing my hyperparameters too optimistically, even though the *number of iterations* in XGBoost is the only parameter I am tuning with CV. On the other hand, I only ran into this issue after some features were added. I need to think about why my features leak information. Do you have any suggestions on how to identify which feature is leaking, besides experience? – Frank Fan Mar 01 '17 at 23:07
  • The two general approaches I use are: 1) variables that show a high importance score are always worth suspicion, and 2) sit in an armchair and think. But given what you said, any time you do transformations using information obtained from looking at your entire data set, you're cheating, and you're always going to leak some information. – Matthew Drury Mar 01 '17 at 23:42
  • @MatthewDrury Thanks. Your comment is thought-provoking. Indeed, any kind of feature engineering on the entire data set causes data leakage. However, even if information leaks from the training set to the test set (or vice versa), that shouldn't make the CV error differ from the test error rate, right? I did the feature engineering BEFORE splitting. – Frank Fan Mar 02 '17 at 17:42
  • @darXider I've encountered a similar problem [here](http://datascience.stackexchange.com/questions/17755/will-cross-validation-performance-be-an-accurate-indication-for-predicting-the-t). In my case, I did not do any feature engineering, so I don't think I have an information leakage issue. I found that the worse my CV score is, the better my test score is. I have the specific numbers. – KevinKim Mar 26 '17 at 00:35
