A critical distinction is whether you want to:
1. Construct an estimate of performance on new subjects (drawn from the same population as your data). [Most common case.]
2. Construct an estimate of performance on new observations from the same subjects as in your sample.
The far more common case is case (1). E.g., how well do you predict heart attacks for someone who's coming into the emergency room? And if you're in case (1), you almost certainly should do (a) subject-wise cross-validation rather than (b) record-wise cross-validation. Doing record-wise validation in case (1) will likely lead to unreasonably high, bogus estimates of performance on new subjects.
I don't know precisely what you're trying to do (perhaps it's self-study, so the question isn't entirely realistic), so I don't know which case you're in. If you're in the much less common case (2), record-wise validation may be OK.
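To make the split mechanics concrete, here's a minimal sketch using scikit-learn's `KFold` and `GroupKFold`; the `X`, `y`, and `subjects` arrays are hypothetical stand-ins, not your actual data:

```python
# Minimal sketch: record-wise vs. subject-wise splits in scikit-learn.
# `X`, `y`, and `subjects` are hypothetical stand-ins for real data.
import numpy as np
from sklearn.model_selection import KFold, GroupKFold

rng = np.random.default_rng(0)
n_subjects, records_per_subject = 10, 5
subjects = np.repeat(np.arange(n_subjects), records_per_subject)  # subject ID per record
X = rng.normal(size=(subjects.size, 3))
y = rng.integers(0, 2, size=subjects.size)

# (b) record-wise: rows are shuffled into folds, so records from the same
# subject can land on both sides of the split.
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    shared = set(subjects[train]) & set(subjects[test])
    print(f"record-wise fold: {len(shared)} subjects appear in both train and test")

# (a) subject-wise: folds are formed over subjects, so no subject is ever
# in both the training set and the test set.
for train, test in GroupKFold(n_splits=5).split(X, y, groups=subjects):
    assert not set(subjects[train]) & set(subjects[test])
```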
A general theme in statistics is to think carefully about what's independent and what's correlated. Generally speaking, independent observations tend to come from different subjects. If you want to predict performance on new subjects, you must test on subjects you didn't train on!
Why subject-wise cross-validation rather than record-wise?
In typical settings, repeated observations of the same individual are correlated with each other even after conditioning on features. Hence, with record-wise cross-validation, your test set isn't independent of your training set. In the extreme case of perfect correlation, you'd have exactly the same observations in the training set and the test set: you'd be training on the test set! The performance measured in cross-validation would then not be predictive of performance on new subjects.
For example, this recent paper calls record-wise cross-validation "Voodoo Machine Learning."
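As a toy demonstration of how bad the inflation can get (my own construction, not from that paper): below, the labels are pure noise at the population level, so honest accuracy should be about 50%, yet record-wise CV reports near-perfect accuracy because a 1-nearest-neighbor classifier can always find another record of the same subject in the training set.

```python
# Toy simulation: each subject has an idiosyncratic feature "signature" and
# a label that is pure noise at the population level, so honest accuracy
# should be ~50%. Record-wise CV wildly overstates performance.
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, GroupKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_subjects, records_per_subject = 20, 10
subjects = np.repeat(np.arange(n_subjects), records_per_subject)

signatures = rng.normal(size=(n_subjects, 5))           # per-subject feature profile
X = signatures[subjects] + 0.1 * rng.normal(size=(subjects.size, 5))
y = rng.integers(0, 2, size=n_subjects)[subjects]       # label fixed within subject

clf = KNeighborsClassifier(n_neighbors=1)
record_wise = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0))
subject_wise = cross_val_score(clf, X, y, cv=GroupKFold(5), groups=subjects)

print(f"record-wise accuracy:  {record_wise.mean():.2f}")   # ~1.0, bogus
print(f"subject-wise accuracy: {subject_wise.mean():.2f}")  # ~0.5, honest
```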
What to do with so few subjects, though...
Perhaps some commenters more experienced with cross-validation than I am could chime in, but to me, this looks like a possible candidate for $k = n$ (a.k.a. leave-one-out cross-validation), applied at the subject level?
To maximize the data available for training, you could leave out one subject for cross-validation: in each iteration, test on a different held-out subject and train on all the others.
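In scikit-learn this is `LeaveOneGroupOut` with subject ID as the group. A sketch, again with hypothetical placeholder data (38 subjects, as in your setup, but invented shapes otherwise):

```python
# Sketch of leave-one-subject-out CV: one fold per distinct subject.
# Data shapes here are hypothetical placeholders, not the asker's data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_subjects, records_per_subject = 38, 4
subjects = np.repeat(np.arange(n_subjects), records_per_subject)
X = rng.normal(size=(subjects.size, 6))
y = rng.integers(0, 2, size=subjects.size)

loso = LeaveOneGroupOut()                        # k = number of subjects
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=loso, groups=subjects)
print(f"{loso.get_n_splits(groups=subjects)} folds, "
      f"mean accuracy {scores.mean():.2f}")
```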
If the subjects are all very different, you may effectively have close to $n = 38$ independent observations, and you may want to put as many independent subjects as possible in the training set.