Background:
I am having some issues with generalization in my robustness tests. I have a large data set (a little over 6.5 million images) that I am feeding through a CNN-SVM pipeline: the CNN is used only to extract features, and the SVM is then trained on those features.
My SVM comes from the R package "e1071" (which wraps libsvm) and uses a polynomial kernel of degree 4. I cannot train the SVM on all of this data because it simply takes too long, so the solution I thought would be best was to divide my data into smaller sets and vote.
I have divided my training set (75% of the data, so a little less than 5 million samples) into 41 chunks, since I have 39 classes and 41 is the next coprime number; that way, if every model votes for a different class I should still have something to go on. This gives me about 120k samples per chunk, which is really about the maximum I can train efficiently (a couple of days per model).
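For concreteness, the chunking step looks roughly like this (a minimal sketch, not my exact code; trainSet and trainClasses stand in for my actual feature matrix and label vector):

    # Sketch: randomly assign each training sample to one of 41 chunks
    set.seed(42)                                   # for reproducibility
    nChunks <- 41
    chunkId <- sample(rep(1:nChunks, length.out = nrow(trainSet)))

    # chunkSets[[k]] holds the row indices belonging to chunk k
    chunkSets <- split(seq_len(nrow(trainSet)), chunkId)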
Test Results:
With 10-fold cross-validation I get right around 80% accuracy on every one of the chunks, which was excellent for my problem. So I saved the models, voted to predict my test data, and got only 15% accuracy. I have found that if I test one chunk's model on any other chunk's data I get more or less the same 15%. Of course, if I test a chunk against the model trained on that same chunk I get 80%. I thought that maybe something was wrong with my code, so I tested it on an XOR problem, which came out perfectly.
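The voting itself is essentially a majority vote over the 41 chunk models. A sketch of what I do (with placeholder names models, testSet, and testClasses, not my exact code):

    # Each chunk model predicts the test set; each test sample then gets the
    # majority label across the 41 models.
    # 'models' is a list of the 41 saved svm objects, 'testSet' the test features.
    votes <- sapply(models, function(m) as.character(predict(m, testSet)))

    # One row per test sample, one column per model; take the most frequent label
    majorityVote <- apply(votes, 1, function(v) names(which.max(table(v))))

    accuracy <- mean(majorityVote == testClasses)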
Question:
It seems odd that the gap between cross-validation and unseen test data would be so large, especially given that SVMs are famously resilient to over-fitting. I always believed that cross-validation was a good estimator of the generalization error/empirical risk. I am hoping that someone who understands this better than I do can point out where I went wrong.
If anyone requires more information I would be happy to provide it.
More information requested:
Here are my model parameters:

    library(e1071)

    KERNEL = "polynomial"
    DEGREE = 4
    model = svm(trainSet, trainClasses, type = "C-classification",
                kernel = KERNEL, degree = DEGREE, coef0 = 1,
                cost = 1000, cachesize = 10000, cross = 10)
About Cross-Validation:
The data was divided randomly when split into the 41 chunks, and I believe the 10-fold cross-validation in R shuffles again within each chunk. (I also tested this without doing the random split, just for fun, and the result was basically the same, since my dataset is fairly balanced.) The data all comes from the same source and was collected under the same conditions, so I don't believe the data differs much between one chunk and another.
I am also running 10 folds and taking the mean accuracy.
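As far as I understand e1071, the per-fold accuracies are stored on the fitted object when cross = 10, so the ~80% figure comes from something like:

    # Per-fold cross-validation accuracies reported by e1071 when cross = 10
    model$accuracies          # vector of 10 per-fold accuracies
    mean(model$accuracies)    # the ~80% figure I quoted
    model$tot.accuracy        # overall accuracy across the folds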