
I have data for 200 subjects and am building a neural network to classify each subject into one of two states (binary classification). I am doing this in MATLAB and need to divide my data into training, validation, and testing sets. Originally, I set up a 6-fold cross-validation scheme in which I tested on a different set of roughly 33 subjects in each fold, repeated this 6 times, and averaged the accuracies.
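
For concreteness, here is a minimal MATLAB sketch of that kind of 6-fold loop, assuming `cvpartition` (Statistics and Machine Learning Toolbox) and `patternnet` (Deep Learning Toolbox); `X`, `Y`, and the 10-unit hidden layer are illustrative placeholders rather than my actual pipeline:

```matlab
% Hypothetical data: X is nFeatures-by-200, Y is a 1-by-200 vector of 0/1 labels.
nSubjects = 200;
T = full(ind2vec(Y + 1));                   % one-hot targets, 2-by-200
cvp = cvpartition(nSubjects, 'KFold', 6);   % six test folds of ~33-34 subjects

foldAcc = zeros(cvp.NumTestSets, 1);
for k = 1:cvp.NumTestSets
    trainIdx = training(cvp, k);            % logical index of training subjects
    testIdx  = test(cvp, k);                % logical index of held-out subjects

    net = patternnet(10);                   % 10 hidden units, an arbitrary choice
    net.trainParam.showWindow = false;      % suppress the training GUI
    net = train(net, X(:, trainIdx), T(:, trainIdx));

    [~, pred]  = max(net(X(:, testIdx)), [], 1);   % predicted class, 1 or 2
    foldAcc(k) = mean(pred == Y(:, testIdx) + 1);
end
meanAccuracy = mean(foldAcc);
```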

I am unsure if this is the best way to go about cross-validation or divvying up my data.

Other things I have tried include a simple 50% training / 15% validation / 35% testing split, repeated many times with random assignment. Since I cannot guarantee that the test subjects differ from one repeat to the next (selecting the same 35% twice is unlikely, but possible), I went with the 6-fold method described above.
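
The ratio-based split itself can be expressed directly through `patternnet`'s data-division parameters; a sketch with the same placeholder `X` and `Y` (the `divideParam` fields are real toolbox options, the surrounding setup is illustrative):

```matlab
% Hypothetical X (nFeatures-by-200) and Y (1-by-200, entries 0/1).
T = full(ind2vec(Y + 1));             % one-hot targets, 2-by-200
net = patternnet(10);                 % 10 hidden units, arbitrary
net.divideFcn = 'dividerand';         % random division of the 200 subjects
net.divideParam.trainRatio = 0.50;
net.divideParam.valRatio   = 0.15;
net.divideParam.testRatio  = 0.35;
net.trainParam.showWindow  = false;

[net, tr] = train(net, X, T);         % tr.testInd records which subjects were held out
[~, pred] = max(net(X(:, tr.testInd)), [], 1);
testAcc   = mean(pred == Y(tr.testInd) + 1);
```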

Now I am trying a leave-one-out approach: train/validate on all subjects except one, test on the remaining subject, and repeat the process 200 times, once for each subject.

Any ideas on what is the best/common way of dividing my data?

a13a22

2 Answers


Probably the most common method is k-fold cross-validation.

Lately I've been combining k-fold cross-validation with bagging.

The algorithm is as follows:

  1. Assign each observation to one of folds $1:K$.
  2. For each $k$ in $1:K$: train on the data not in fold $k$, selecting hyperparameters (weight decay, dropout, number of layers/nodes, etc.) to minimize prediction error on fold $k$.
  3. Step 2 yields $K$ networks. When predicting, average the predictions of all $K$ models; averaging reduces the variance of the prediction.

Another motivation for this sort of averaging is that you avoid "throwing away" all of the work you did in each fold.
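
A rough MATLAB sketch of this scheme, assuming `cvpartition` and `patternnet`; the hyperparameter search is reduced to hidden-layer size for brevity, and `X`, `Y`, and `Xnew` are placeholder arrays rather than anything from the question:

```matlab
% Hypothetical X (nFeatures-by-N) and Y (1-by-N, entries 0/1).
K = 6;
T = full(ind2vec(Y + 1));                 % one-hot targets, 2-by-N
cvp = cvpartition(size(X, 2), 'KFold', K);
hiddenSizes = [5 10 20];                  % placeholder hyperparameter grid
nets = cell(K, 1);

for k = 1:K
    trainIdx = training(cvp, k);          % data not in fold k
    foldIdx  = test(cvp, k);              % fold k itself

    bestErr = Inf;
    for h = hiddenSizes                   % keep the setting that does best on fold k
        net = patternnet(h);
        net.trainParam.showWindow = false;
        net = train(net, X(:, trainIdx), T(:, trainIdx));
        [~, pred] = max(net(X(:, foldIdx)), [], 1);
        err = mean(pred ~= Y(:, foldIdx) + 1);
        if err < bestErr
            bestErr = err;
            nets{k} = net;                % best network for this fold
        end
    end
end

% Prediction for new data Xnew (nFeatures-by-M): average the K networks' outputs.
outs   = cellfun(@(n) n(Xnew), nets, 'UniformOutput', false);
avgOut = mean(cat(3, outs{:}), 3);        % 2-by-M scores, averaged over K networks
[~, idx] = max(avgOut, [], 1);
yhat = idx - 1;                           % back to 0/1 labels
```

Averaging the soft outputs before taking the argmax is one simple way to combine the $K$ networks; majority voting on the hard labels would behave similarly.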

This can be computationally intensive when working with large datasets, however. With N = 200, though, do you really think a neural net is the best tool for your problem?

generic_user
  • I tried SVM and KNN but neural networks were getting me the best accuracy. Now I'm just trying to increase the accuracy based on those initial tests – a13a22 Dec 11 '17 at 22:17
  • What about a random forest or boosted trees? – generic_user Dec 11 '17 at 22:35
  • I've tried random forests, and my initial results still showed the NN performing at around 96%, while the forests and various ensembles barely crossed 85%. I'll be using the NN for now, but do you see any downside to the (n-1) validation? Is the method I described of training on all but one subject the same as 200-fold CV? – a13a22 Dec 11 '17 at 23:02
  • See here regarding leave-one-out vs. k-fold: https://stats.stackexchange.com/questions/154830/10-fold-cross-validation-vs-leave-one-out-cross-validation. Also read the relevant chapter of The Elements of Statistical Learning, which is free online. – generic_user Dec 12 '17 at 00:20
  • When you say that you got 96% from the net, was that in-sample or out of sample? – generic_user Dec 12 '17 at 00:21
  • The 96% was obtained by doing leave one out and then averaging all of those accuracies (train on n-1, test on remaining 1 and repeat, then average all accuracies) – a13a22 Dec 12 '17 at 00:24
  • How balanced are your classes? – generic_user Dec 12 '17 at 00:28
  • It's roughly evenly split (110 labeled as 1 and 90 labeled as 0) – a13a22 Dec 12 '17 at 00:38
  • Sounds like you've got a pretty good model! – generic_user Dec 12 '17 at 00:46
  • Thanks; I think that since my model is small enough, I can just use leave-one-out rather than k-fold – a13a22 Dec 12 '17 at 00:50
  • If I were you I'd try k-fold. The results should be similar. If they aren't, I'd poke around looking for why. – generic_user Dec 12 '17 at 00:51
  • Results are very similar between k-fold and leave-one-out. Is there any reason to pick one over the other? I tried 10-fold with 100 iterations and it gave me comparable accuracy. From what I gather, k-fold may be preferable because it likely has a different training set each time (billions of possible subsets of 10%). The only advantage I see to k-fold is that it offers that variation in the training set, but does that matter here? – a13a22 Dec 12 '17 at 00:55
  • It doesn't seem to matter much. With such a small training set however, I'd be on the lookout for covariate shift as soon as the model begins to be used. – generic_user Dec 12 '17 at 00:57
  • Could you explain that issue of covariate shift? – a13a22 Dec 12 '17 at 01:00
  • That's when you fit a model on some data and then use it to predict, but the new data coming in differs from the data used to train the model, because the world shifts over time. It's related to stationarity. If the population is large relative to the sample, the sample may not capture much of the support of the population, which may or may not be stationary. – generic_user Dec 12 '17 at 01:02
  • Thank you for your help; I now think k-fold is the best way to go. (I read some more posts: the overlap between leave-one-out training sets leads to higher variance, and since each test set contains only one subject, each fold's accuracy is a noisy 0% or 100%.) – a13a22 Dec 12 '17 at 01:07

If computational resources are not a problem, then LOOCV (leave-one-out cross-validation) is the optimal approach: train 200 models, each predicting the one left-out subject, as you are already doing.

The reason is that cross-validation techniques generally overestimate the out-of-sample error: because some observations are held out, each model is trained on fewer samples than you actually have. The model is therefore undertrained relative to one fit on all your available data, and it makes less accurate predictions.

The less data you use to train each model (the larger the average hold-out sample), the more pessimistically biased your error estimate.
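
For illustration, a minimal MATLAB sketch of such a LOOCV loop (assuming `cvpartition`'s 'LeaveOut' option and `patternnet`; `X`, `Y`, and the network size are placeholders):

```matlab
% Hypothetical X (nFeatures-by-200) and Y (1-by-200, entries 0/1).
T = full(ind2vec(Y + 1));                    % one-hot targets, 2-by-200
cvp = cvpartition(size(X, 2), 'LeaveOut');   % 200 folds of one subject each

correct = false(cvp.NumTestSets, 1);
for i = 1:cvp.NumTestSets
    trainIdx = training(cvp, i);
    testIdx  = test(cvp, i);

    net = patternnet(10);                    % 10 hidden units, arbitrary
    net.trainParam.showWindow = false;
    net = train(net, X(:, trainIdx), T(:, trainIdx));

    [~, pred]  = max(net(X(:, testIdx)), [], 1);   % predicted class for the one subject
    correct(i) = (pred == Y(:, testIdx) + 1);
end
loocvAccuracy = mean(correct);               % the averaged accuracy described above
```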

Sam
  • Is there any downside to the fact that the training sets are nearly the same for every test? – a13a22 Dec 14 '17 at 14:12
  • Since the training sets are nearly the same, the 200 estimated out-of-sample errors CAN be highly correlated, so the central limit theorem does not kick in and the average can have a large variance. In terms of bias, however, it's the optimal measure, and usually the variance is not too bad. – Sam Dec 15 '17 at 08:22