I have labelled data for a set of experiments: 100 experiments (shots) were conducted, and each experiment is associated with 1000 distinct data points that I have labelled. I am using classification methods but am a little uncertain about how best to partition the dataset. For example, I've observed that a Random Forest classifier's accuracy differs depending on whether it is trained only on experiments 1-80 and tested on 81-100, or whether I mix the data from all shots and randomly choose 80% for the training set and the remaining 20% for the test set.
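For concreteness, here is a minimal sketch of the two splits I mean, assuming scikit-learn and placeholder arrays `X`, `y` and `groups` (the experiment ID of each row); none of these names come from my actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GroupShuffleSplit

# Placeholder data: 100 experiments x 1000 points each, 10 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100 * 1000, 10))
y = rng.integers(0, 2, size=100 * 1000)
groups = np.repeat(np.arange(100), 1000)   # experiment (shot) ID for every row

# Option 1: random 80/20 split that mixes points from all shots.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Option 2: 80/20 split on whole experiments, so no shot appears in both sets.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

clf = RandomForestClassifier(random_state=0)
clf.fit(X[train_idx], y[train_idx])
print("experiment-wise test accuracy:", clf.score(X[test_idx], y[test_idx]))
```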
In the above scenario, is there a general consensus on which approach to take? My worry is that the latter option will lead the classifier to overfit, since data points within a single shot are very similar to one another while points from separate shots typically differ much more, so the classifier may perform poorly on future shots. Of course this is where a validation set becomes essential, which I am using, but I am simply wondering whether there are any general guidelines on how to best structure the training and test sets.
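For the validation step, this is roughly what I have in mind: group-aware cross-validation with scikit-learn's `GroupKFold`, again on placeholder data, so that every validation fold consists of whole held-out shots rather than individual points:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Placeholder data as before: 100 shots x 1000 points, with a shot ID per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(100 * 1000, 10))
y = rng.integers(0, 2, size=100 * 1000)
groups = np.repeat(np.arange(100), 1000)

# Each fold holds out ~20 complete shots, mimicking evaluation on future shots.
cv = GroupKFold(n_splits=5)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, groups=groups, cv=cv)
print("per-fold accuracy on held-out shots:", scores)
```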