I am currently working on a student project where we do binary classification, but the data is highly skewed!

The train AND test data contain a huge number of duplicates, where every row is identical. We don't know exactly how to handle this, because the test data has a lot of duplicates as well, and we cannot delete rows there since we need to predict the binary outcome for them.

My question is now how we can do cross-validation with the train data to estimate our model's performance, with the best attributes and parameters, for the unknown test data.

If we keep the duplicates in our training data, the binary classification problem is balanced: we have almost the same number of 0s and 1s for the binary class. But if we delete the duplicates, we get a lot more 0s than 1s (2/3 are 0s and 1/3 are 1s). As already mentioned, we also have a lot of duplicates in the test data, so we assume that the test data is balanced as well, and that without the duplicates it would have more 0s and fewer 1s.

How do we do good cross-validation for this problem?

Do we leave the training data as it is and not delete any duplicate rows? Do we delete the duplicate rows and then do the cross-validation and prediction on the unbalanced data? Or do we have to delete the duplicates per CV fold in the training fold, balance it, and then predict on the unbalanced CV test fold where we didn't delete any rows?

asked by janbauer · edited by kjetil b halvorsen
  • *Where* did your duplicates come from? Are these data errors? If so, remove them (and don't worry about unbalancedness). If they are correct observations that just happen to be identical, then leave them in. – Stephan Kolassa Jan 14 '18 at 14:40
  • The duplicates are in the data for some reason; we assume that there is a problem in the ETL process, and we cannot change this. – janbauer Jan 14 '18 at 18:09
  • If there is a problem in the ETL process and you can't change that, then you will probably have to clean up someone else's mess. Welcome to a statistician's job. Seriously, if you already *know* part of your data is spurious, then you shouldn't be thinking about whether or not to use it in your analysis, but about how to identify the bad parts and remove them. – Stephan Kolassa Jan 14 '18 at 18:29
  • This is just a student project; we got some data from a company and have to do a prediction on unknown data of theirs. Our main question was just how to handle the cross-validation with the duplicates, because they are in the unknown data to predict as well. – janbauer Jan 14 '18 at 18:57
  • @janbauer: The real world answer is to go to the company and ask them about the meaning of the duplicates. You also cannot do a proper splitting for the cross validation without knowing what the independent cases are in your data (you can do proper cross validation with e.g. repeated measurements, but you need to know which ones are repetitions of which) – cbeleites unhappy with SX Jan 14 '18 at 20:44

1 Answer

If you have exact duplicates in the training data with equal labels (the very same data point with the same label repeated over and over), delete the duplicates. Make sure they are exactly the same in both observation and label (e.g., for some reason, the exact same data from the same subject was recorded twice).
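
As a minimal sketch of that step with pandas, assuming the training set sits in a DataFrame read from a hypothetical `train.csv` with feature columns plus a `label` column (file and column names are assumptions, not from the question):

```python
import pandas as pd

# Hypothetical layout: feature columns plus a binary "label" column.
train = pd.read_csv("train.csv")

# A row is dropped only if ALL columns (features AND label) match an
# earlier row; rows that share features but differ in label are kept.
train = train.drop_duplicates(keep="first").reset_index(drop=True)
```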

After removing the duplicates, create your folds with respect to the number of observations in each class: try to represent the overall class distribution in each fold, so that each fold reflects the true distribution of your dataset.

If you do so, you train on all folds except one and test on the remaining one, and that test fold also has a realistic class distribution.
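
A sketch of such stratified folds using scikit-learn's `StratifiedKFold`, with random stand-in data in place of the real (deduplicated) training set:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in data; in practice X and y come from the deduplicated training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 2, size=300)

# StratifiedKFold keeps the overall class ratio in every fold, so each
# held-out fold mirrors the true (unbalanced) class distribution.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]
    # fit on (X_tr, y_tr) here; evaluate on the untouched (X_te, y_te)
    print(f"fold {fold}: train class mean {y_tr.mean():.2f}, "
          f"test class mean {y_te.mean():.2f}")
```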

How you use the training data to train your model within cross-validation, though, is another story. If you use neural nets or anything else that requires minibatch training, use stratified minibatches. If not, you can randomly select an equal number of observations from both classes for training, or use other tricks. But create your folds realistically, and keep this decision separate from fixing the training issues caused by unbalanced data.
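
A minimal sketch of the "equal number of observations from both classes" trick, as a hypothetical helper applied only to the training fold (the test fold keeps its realistic class ratio):

```python
import numpy as np

def undersample_balanced(X_tr, y_tr, seed=0):
    """Randomly drop majority-class rows until both classes are the
    same size; apply this to the training fold only."""
    rng = np.random.default_rng(seed)
    idx0 = np.flatnonzero(y_tr == 0)
    idx1 = np.flatnonzero(y_tr == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, size=n, replace=False),
                           rng.choice(idx1, size=n, replace=False)])
    rng.shuffle(keep)
    return X_tr[keep], y_tr[keep]
```

Inside each CV iteration you would call this on the training fold before fitting, while the score is still computed on the untouched, realistically distributed test fold.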

Here is some more information regarding removing duplicates in different scenarios:

1) It does not make any sense to have one data point more than once; it does not add any new information. If the duplicates are not exactly the same series, or match only in some of the feature dimensions, that's another story: keep them as they are.

2) If you have two separate time series recorded from the same subject, and the values at all time steps are identical, then I think you can remove one of them, provided the time step unit is small enough (e.g., a second, not a year).

3) But if your data is something like stock market prices and the time series are prices during a month, it is possible that two identical series are not actually the same (e.g., the price didn't change at all two months in a row); this information is indeed valuable and should be kept.
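
If the rows carry a subject identifier and a timestamp, one way to separate the ETL-artifact case from genuine repetition is the sketch below; the file and column names are assumptions for illustration:

```python
import pandas as pd

# Hypothetical columns: "subject_id", "timestamp", features, "label".
df = pd.read_csv("observations.csv")

# Exact duplicates across ALL columns (including subject and timestamp)
# can only be ETL artifacts: the same subject cannot be measured twice
# at the same instant.
exact = df.duplicated(keep="first")
print(f"{exact.sum()} of {len(df)} rows are exact duplicates")
df = df.loc[~exact]

# Rows with identical feature values but a different timestamp or
# subject are NOT flagged above; that repetition may be real
# (scenario 3) and is kept.
```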

UPDATE: I added a summary of the discussion with @janbauer and @cbeleites from the comments to this answer.

answered by PickleRick
  • Thank you for your answer. What do you think about the problem that we have a lot of duplicates in the test data as well? We cannot delete them because we have to submit a prediction for them. So how do we have to do the cross-validation to estimate the model performance for the unknown test data (which contains duplicates)? The difficulty is also that the data contains time series, so we cannot do random sampling for cross-validation. – janbauer Jan 14 '18 at 08:44
  • Duplicates in the test set do not matter. If you train a good model, your predictions will be valid, even if they are repeated. One thing is still not clear to me, though: if you have time series, do you have only one very long observation or multiple observation series? E.g., if it's an audio recording, do you have one long recording lasting hours, or multiple audio files? If you have multiple audio files, then you can select the files randomly; if not, you can segment them and select different segments. But it depends on your data and what you are trying to do. – PickleRick Jan 14 '18 at 08:46
  • There should be standard ways for different data and tasks to deal with the unbalanced data problem. But CV is another story in my opinion. – PickleRick Jan 14 '18 at 08:50
  • Thank you. And do we have to balance the training folds of the cross-validation? Because after deleting the duplicates we have unbalanced data, and we assume that in the test set with the unknown class we have a balanced class proportion. – janbauer Jan 14 '18 at 08:51
  • The data contains observations over time of the same subjects. But the different subjects are very similar. – janbauer Jan 14 '18 at 08:52
  • I wrote my suggestion in the answer. To summarize one more time: 1) delete duplicates, 2) create CV folds, 3) find a solution for unbalanced data to train your models. Try to keep the subject ratio in each fold as it is in the whole dataset if you want your CV to represent your true data distribution. – PickleRick Jan 14 '18 at 08:52
  • My pleasure, hope it helps! – PickleRick Jan 14 '18 at 08:56
  • (-1) I'm sorry, but at the current state of information we have, deleting is *not* a good recommendation: We do not yet know whether the duplicates are just identical features that happen to be observed for independent cases, or whether there are repeated data of the same cases - and if so, why they were repeated. Without this information, it is not possible to give a sound recommendation how to proceed. (I'd be happy to switch to +1, if OP gives this information and it then turns out that deleting is actually the way to go) – cbeleites unhappy with SX Jan 14 '18 at 17:08
  • I agree with @cbeleites on this. As I wrote in my answer, if we are sure the data is identical, then it's fine to remove it. E.g., if we know it's exactly the same data recorded from the exact same subject, then it can be removed. Otherwise, I cannot tell more, since I am not sure what kind of duplicate we are talking about. – PickleRick Jan 14 '18 at 17:12
  • They are 100% identical. But as far as we know they are somehow always in the data. – janbauer Jan 14 '18 at 18:07
  • So if you have two separate time series, recorded from the same subject, and the values at all time steps are identical, then I think you can remove one of them. What value does it add to the training? You could repeat any of the files you have and it would be the same. – PickleRick Jan 14 '18 at 18:10
  • The accuracy on the training data depends on how we do the cross-validation. We don't have any estimator for the accuracy other than our own cross-validation. If we delete the duplicates in the test fold as well as in the training fold and balance both folds (training and testing), our accuracy goes up. If we don't delete the duplicates in the test fold, delete them in the training fold, and balance only the training fold, the score is a bit lower than before. I hope this is somehow understandable. – janbauer Jan 14 '18 at 18:53
  • @janbauer: we (or rather: you) need to know that *somehow*. Having rows with identical values can be right or wrong. Stephan Kolassa's comment is spot on: you need to find out why the data is the way it is, then judge whether it should be as it is or different. Sure, both your models and the validation results will change depending on whether the data has identical rows or not. But without external knowledge about the task and data generation, you cannot know which of your models and which test set is right and which is wrong. – cbeleites unhappy with SX Jan 14 '18 at 19:01
  • @PickleRick: added information: two separately recorded time series with identical values add the information that this particular development/behaviour/time series is more likely than others. If you often observe identical time series in independent recordings of the same subject, but not from other subjects, you learn that intra-subject variance is very low compared to inter-subject variance, which is an extremely important finding. In contrast, if your database query is wrong and copies recordings, then you learn nothing (besides, hopefully, how to correctly query your database). – cbeleites unhappy with SX Jan 14 '18 at 19:06
  • @cbeleites I think you are right about that, which leads us back to the first question: what kind of data is it, and why are there exact matches? In images and audio, they are mostly just errors in the data. But if your data is stock market prices and the time series are prices during a month, it is possible that two identical series are not actually the same (e.g., the price didn't change at all two months in a row), and this information is indeed valuable and should be kept. But it is very hard to judge without more information about the data and the task. – PickleRick Jan 14 '18 at 19:17
  • The data contains observations of different subjects over one month (over 100 different subjects). Over this month, roughly 50k observations were made at random times; at a given timestamp we get the binary class and additional attributes. It is maybe good to mention that subjects at the same location at the same time are very likely to have the same binary outcome. The duplicates are identical (time, binary class, etc.). Based on this, we have to build a classifier for unknown data. The unknown data is from the next month and has the same proportion of duplicates as the train data. – janbauer Jan 14 '18 at 19:36
  • @janbauer So if the time of the same subject is exactly the same, and the data is also identical, then it should only be an error, because one subject cannot appear twice at the same time. Right? (Unless we are considering parallel universes ;) ) Is it possible to measure a subject twice at the same time in your case? – PickleRick Jan 14 '18 at 19:41
  • @PickleRick: OTOH, sometimes the values are on very coarse scales, and if you have short time series on a coarse scale, there may be truly independent but repeated patterns (imagine one or more subjects reporting headache 4 out of 5 on one-week time series, always on Saturdays at home after Friday night at the local pub). – cbeleites unhappy with SX Jan 14 '18 at 19:48
  • lol... this might go on forever. Alright, what does the time mean in your data, @janbauer? Is it days? Hours? Milliseconds? – PickleRick Jan 14 '18 at 19:50
  • It is a timestamp, but I think you guys have helped me enough; now I know how to do the CV right. Thanks! – janbauer Jan 15 '18 at 07:52