
Background:

I am having some issues getting my models to generalize in robustness tests. I have a large data set (a little over 6.5 million images) which I feed through a CNN-SVM pipeline: the CNN is used only to extract features, and the SVM is then trained on those features.

My SVM comes from the R package "e1071", which uses libsvm, with a polynomial kernel of degree 4. I cannot train the SVM on all of this data because it simply takes too long, so the solution I thought would be best was to divide my data into smaller sets and vote.

I have divided my training set (75% of the data, so a little less than 5 million samples) into 41 chunks, since I have 39 classes and 41 is coprime with 39; in case every chunk votes differently I should still have something to go on. This gives me about 120k samples per chunk, which is really about the maximum I can train efficiently (a couple of days).
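Roughly, the chunk-and-vote setup looks like the sketch below (names such as `features`, `labels`, and `testFeatures` are placeholders for the CNN feature matrix, the class labels, and the held-out test features, not my actual code):

```r
library(e1071)

## Randomly assign each training sample to one of 41 chunks
n_chunks <- 41
chunk_id <- sample(rep(1:n_chunks, length.out = nrow(features)))

## Train one SVM per chunk (same settings as given further down)
models <- lapply(1:n_chunks, function(k) {
  svm(features[chunk_id == k, ], labels[chunk_id == k],
      type = "C-classification", kernel = "polynomial",
      degree = 4, coef0 = 1, cost = 1000)
})

## Every chunk model predicts the test set; the final label is the majority vote
votes <- sapply(models, function(m) as.character(predict(m, testFeatures)))
predicted <- apply(votes, 1, function(v) names(which.max(table(v))))
```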

Test Results:

With 10-fold cross-validation I get right around 80% accuracy on every one of the chunks, which was excellent for my problem. So I saved the models and used them to vote on my test data, and got only 15% accuracy. I have also found that if I test one chunk's model on any other chunk's data I get more or less the same 15%; of course, if I test a chunk against the model trained on that chunk I get 80%. I thought that maybe something was wrong with my code, so I tested it on a XOR problem, which came out perfectly.
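(The cross-chunk check above is essentially the following, reusing the placeholder `models`, `chunk_id`, `features`, and `labels` from the sketch in the Background section:)

```r
## Accuracy of chunk i's model evaluated on chunk j's data
cross_chunk_acc <- function(i, j) {
  preds <- predict(models[[i]], features[chunk_id == j, ])
  mean(preds == labels[chunk_id == j])
}
cross_chunk_acc(1, 1)  # chunk 1's model on its own chunk: around 80%
cross_chunk_acc(1, 2)  # chunk 1's model on another chunk's data: around 15%
```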

Question:

It seems odd that cross-validation accuracy and accuracy on unseen test data would differ this much, especially given that SVMs are famously resilient to over-fitting. I always believed that cross-validation was a good estimator of generalization error/empirical risk. I am hoping that someone who understands this better than I do can point out where I went wrong.

If anyone requires more information I would be happy to provide it.

More information requested:

Here are my model parameters: `KERNEL = "polynomial"`, `DEGREE = 4`, and the model is fit with `svm(trainSet, trainClasses, type = "C-classification", kernel = KERNEL, degree = DEGREE, coef0 = 1, cost = 1000, cachesize = 10000, cross = 10)`.
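For reference, a runnable version of that call would look roughly like the following, where `trainSet` / `trainClasses` hold one chunk's CNN features and (factor) labels:

```r
library(e1071)

KERNEL <- "polynomial"
DEGREE <- 4

model <- svm(trainSet, trainClasses, type = "C-classification",
             kernel = KERNEL, degree = DEGREE, coef0 = 1,
             cost = 1000, cachesize = 10000, cross = 10)

## With cross = 10, e1071 also reports the cross-validation results:
model$accuracies    # accuracy of each of the 10 folds
model$tot.accuracy  # overall 10-fold cross-validation accuracy
```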

About Cross-Validation:

The data were divided randomly when split into the 41 chunks, and I believe the cross-validation in R randomizes again within each chunk. (I also tested this without the random split, just for fun, and the result was basically the same, since my dataset is fairly balanced.) The data all come from the same source and the same conditions; I don't believe the data are very different between one chunk and another.

Also I am running 10 folds and taking the mean.

badner
  • Check [this](http://stats.stackexchange.com/questions/35276/svm-overfitting-curse-of-dimensionality) answer. I think you should adjust your tuning parameter. If this doesn't work, try a different kernel. Gaussian radial basis function tends to perform well in general. – Marcel10 Nov 07 '16 at 10:49
  • It sounds like you are doing cross-validation wrong, but you don't give enough details to debug. – seanv507 Nov 07 '16 at 11:01
  • @Marcel10 my C is set at 1000 which I found by gridsearch. I will update my question. – badner Nov 07 '16 at 11:32
  • @seanv507 please let me know what you need and I will post here, so as not to pollute with too much info – badner Nov 07 '16 at 11:34
  • So, since the cross-validation is not working as expected, you have to explain how the cross-validation and data splitting are set up. It sounds like it is not random (can you think of any reason why the data might change between chunks, e.g. different data sources?). Is each chunk getting the same classes? Please describe this as fully as possible. [More info is better than playing 20 questions!] – seanv507 Nov 07 '16 at 13:11
  • @seanv507 ok I will update my question – badner Nov 07 '16 at 14:18
  • Is there any dependency in the data, such that some images will be more similar than others? For example, perhaps images could be grouped as coming from the same person, place, or time? If this is the case, then you need to make sure that images from each group are kept together and not allowed to reach across folds. – Jeffrey Girard Nov 07 '16 at 14:35
  • @JeffreyGirard In my case the images come from acoustic parameters of my speech corpus for my PhD. The corpus is of non-natives, but the setup is very similar to the TIMIT database (phonetically rich and balanced, and very repetitive from speaker to speaker). Their level of English may vary. There is quite a bit of shuffling going on, which I believed to be the correct approach so as not to create a strong bias towards any speaker or class. Could you explain a little better how your solution would work and what it could guarantee me statistically that my current approach would not? – badner Nov 07 '16 at 14:57
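Regarding the grouping suggestion in the comments, a minimal sketch of speaker-disjoint folds (assuming a hypothetical `speaker` vector with one speaker ID per sample, and reusing the placeholder `features` / `labels` from above) could look like this:

```r
library(e1071)

## Assign whole speakers, not individual samples, to folds, so no speaker
## appears in both the training and the validation part of any fold.
set.seed(1)
speakers <- unique(speaker)
speaker_fold <- sample(rep(1:10, length.out = length(speakers)))
names(speaker_fold) <- speakers
fold_id <- speaker_fold[as.character(speaker)]  # fold of each sample

## Manual 10-fold CV with speaker-disjoint folds (same SVM settings as above)
accs <- sapply(1:10, function(k) {
  train <- fold_id != k
  test  <- fold_id == k
  m <- svm(features[train, ], labels[train], type = "C-classification",
           kernel = "polynomial", degree = 4, coef0 = 1, cost = 1000)
  mean(predict(m, features[test, ]) == labels[test])
})
mean(accs)  # speaker-independent estimate of generalization accuracy
```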

0 Answers