Recently I have been working on a project where my cross-validation error rate is very low, but the test set error rate is high, which might indicate that my model is overfitting. But why does the overfitting not show up in cross-validation, only on the test set?
More specifically, I have about 2 million samples with 100 variables (n >> p). I randomly split the dataset 80/20 into a train set and a test set. Then I fit a model (XGBoost) using 5-fold cross-validation on the train set, and the estimated error rate is pretty low. Then I used the same parameter settings and the entire train set to fit the model. Surprisingly, when I evaluate this model on the test set, the error rate is significantly higher than the CV error rate. Why?
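For concreteness, here is a minimal sketch of the pipeline above. The dataset is a small synthetic stand-in for my real data, and the XGBoost parameters are placeholders, not my actual tuned settings:

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_score, train_test_split

# Small synthetic stand-in for the real ~2M x 100 dataset (5 classes here).
X, y = make_classification(n_samples=20_000, n_features=100,
                           n_informative=20, n_classes=5, random_state=0)

# Random 80/20 split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Placeholder parameters -- not my actual settings.
clf = xgb.XGBClassifier(objective="multi:softprob", n_estimators=100)

# 5-fold CV estimate of multinomial log loss on the train set.
cv = cross_val_score(clf, X_train, y_train, cv=5, scoring="neg_log_loss")
print("CV log loss: %.6f (+/- %.6f)" % (-cv.mean(), cv.std()))

# Refit on the whole train set with the same parameters, score on test.
clf.fit(X_train, y_train)
print("Test log loss: %.6f" % log_loss(y_test, clf.predict_proba(X_test)))
```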
+++ 1. Edit about the error rate +++
The error rate is actually multinomial log loss. I achieved a CV log loss of 1.320044 (+/- 0.002126) and a test log loss of 1.437881. These two numbers may look close at first glance, but they are not: the performance range of this project runs from roughly 1.55 (worst) to 1.30 (best), and on that scale the gap is substantial.
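For reference, by multinomial log loss I mean the standard definition

$$-\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log p_{ik},$$

where $y_{ik}$ is 1 if sample $i$ belongs to class $k$ and 0 otherwise, and $p_{ik}$ is the predicted probability of class $k$ for sample $i$.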
My 5-fold cross-validation works as follows (a code sketch is given after the list):

- divide the train set into 5 folds;
- iteratively fit a model on 4 folds and evaluate its performance on the remaining fold;
- average the performance over all five iterations.
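In code, this is equivalent to something like the following sketch (`params` is a placeholder for my actual XGBoost settings):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

def cv_log_loss(X_tr, y_tr, params=None, n_splits=5, seed=0):
    """Manual 5-fold CV: fit on 4 folds, score the held-out fold, average."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    losses = []
    for fit_idx, val_idx in kf.split(X_tr):
        clf = xgb.XGBClassifier(**(params or {}))
        clf.fit(X_tr[fit_idx], y_tr[fit_idx])       # fit on 4 folds
        proba = clf.predict_proba(X_tr[val_idx])    # score the held-out fold
        losses.append(log_loss(y_tr[val_idx], proba, labels=np.unique(y_tr)))
    return float(np.mean(losses)), float(np.std(losses))
```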
I mean, if my parameter settings make the model overfit, then I should see it during this cross-validation procedure, right? But I don't see it until I use the test set. Under what circumstances on earth could this happen?
Thanks!
+++ 2. Edit about train/test distributions +++
The only reason I can think of for why the CV error rate differs from the test set error rate is:

> Cross-validation will not perform well on outside data if the data you do have is not representative of the data you'll be trying to predict! -- here
But I randomly split the 2-million-sample dataset 80/20, so I believe the train set and the test set should come from the same distribution of variables.
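One way to sanity-check this belief is "adversarial validation": train a classifier to distinguish train rows from test rows; if the two sets really come from the same distribution, its cross-validated AUC should be close to 0.5. A sketch, reusing `X_train` and `X_test` from the first snippet:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# Label train rows 0 and test rows 1, then see whether a classifier
# can tell them apart.
X_all = np.vstack([X_train, X_test])
is_test = np.r_[np.zeros(len(X_train), dtype=int),
                np.ones(len(X_test), dtype=int)]

adv = xgb.XGBClassifier(n_estimators=50)
auc = cross_val_score(adv, X_all, is_test, cv=5, scoring="roc_auc").mean()
print("Adversarial AUC: %.3f" % auc)  # ~0.5 => no detectable shift
```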
+++ 3. Edit about data leakage +++
In the comments, @Karolis Koncevičius and @darXider raised an interesting guess: data leakage. I think this might be the devil here. But what exactly is data leakage? How do I avoid it, and how do I detect it? I'll do more research about it.
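As a first check while I read up on it, here is a small sketch that looks for one common leakage source: exact duplicate rows in the train set. If duplicates are common, CV folds share identical samples, which can make the CV estimate optimistically low. (Reusing `X_train` from the first snippet; assumes the data fits in memory.)

```python
import pandas as pd

# Count exact duplicate feature rows in the train set.
n_dup = pd.DataFrame(X_train).duplicated().sum()
print("Duplicate rows in train set: %d of %d" % (n_dup, len(X_train)))
```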
THANKS!