3

Let's say you have a dataset generated from real-world sampling which contains many duplicates (rows whose dependent and independent variables are identical), and you want to train a classifier to predict the dependent variable in future samples from the real world. If you took a train/test split from this dataset you'd inevitably have records that appear in both the training and the test set, so to avoid this you would deduplicate the original dataset before doing the train/test split.

My question is: is it right to deduplicate, train on the deduplicated data and test on totally unseen data, or is it right to train and test on the original distribution coming from the real world, accepting that we would do well on previously seen examples anyway?
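For concreteness, here is a small sketch of the leakage I mean (made-up data; pandas and scikit-learn assumed):

    # Sketch of the leakage described above (hypothetical data).
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Small dataset where several rows are exact duplicates.
    df = pd.DataFrame({"x1": [0, 0, 1, 1, 1, 2, 2, 3],
                       "x2": [5, 5, 6, 6, 6, 7, 7, 8],
                       "y":  [0, 0, 1, 1, 1, 0, 0, 1]})

    train, test = train_test_split(df, test_size=0.25, random_state=0)

    # Count test rows that also occur verbatim in the training set.
    overlap = test.merge(train.drop_duplicates(), how="inner")
    print(f"{len(overlap)} of {len(test)} test rows also appear in the training set")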

Ushnish
    I would treat the duplicates as one entity, that is they are either all in the training set or all in the test set, but not both. – user2974951 Feb 13 '19 at 08:31

2 Answers

5

Interesting question.

The effect of duplicates in the training data is slightly different from the effect of duplicates in the test data.

If an element is duplicated in the training data, it is effectively the same as having its 'weight' doubled. That element becomes twice as important when the classifier is fitting your data, and the classifier becomes biased towards correctly classifying that particular scenario over others.
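A minimal sketch of that equivalence (my illustration, not part of the original answer; scikit-learn assumed): for an estimator that accepts sample weights, physically duplicating a row gives essentially the same fit as weighting that row twice.

    # Sketch: a duplicated training row behaves like a sample weight of 2 (toy data).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])

    # Fit once with the first row physically duplicated...
    clf_dup = LogisticRegression().fit(np.vstack([X, X[:1]]), np.append(y, y[0]))

    # ...and once with the original rows but the first row weighted twice.
    clf_wt = LogisticRegression().fit(X, y, sample_weight=[2, 1, 1, 1])

    print(clf_dup.coef_, clf_dup.intercept_)
    print(clf_wt.coef_, clf_wt.intercept_)  # essentially identical coefficients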

It's up to you whether that's a good or bad thing. If the duplicates are real (that is, if the duplicates are generated through a process you want to take into account), then I'd probably advise against removing them, especially if you're doing logistic regression. There are other questions about dealing with oversampled and undersampled datasets on this SE. When it comes to neural networks and the like, other people may be able to answer better whether it is necessary to worry about this.

If your dataset is, for example, tweets, and you are trying to train a natural language processor, I would advise removing duplicate sentences (mainly retweets) as that doesn't really help to train the model for general language use.

Duplicated elements in the test data serve no real purpose. You've already tested the model on that particular case once; why test it again when you'd expect exactly the same answer? If a high proportion of the duplicated entries in the test set also appear in the training set, you'll get an inflated sense of how well the model performs overall, because the rarer scenarios are less well represented, and the classifier's poor performance on them will contribute less to the overall test score.

If you are going to remove duplicates, I'd recommend doing it before splitting the dataset between train and test.
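In pandas/scikit-learn terms, a minimal sketch of that order of operations (file name and contents are hypothetical):

    # Sketch: deduplicate first, then split (hypothetical file name).
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("samples.csv")   # assumed raw data containing duplicate rows

    # Drop rows whose features *and* label are identical, then split.
    df_unique = df.drop_duplicates()
    train, test = train_test_split(df_unique, test_size=0.2, random_state=42)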

Ingolifs
  • Do you know of any Python packages that do a train-test split with duplicate records, but then pull those out of the test set that are also in train? – chimpsarehungry Mar 09 '20 at 16:51
  • You wrote, and I agree, that duplicates might be meaningful in the training dataset, but they are unnecessary in the testing dataset. And then, in the end you recommend removing them from both datasets. I think the end clashes with your previous reasoning. – Helen Dec 18 '21 at 18:25
  • @helen no it doesn't – Ingolifs Dec 19 '21 at 13:18
3

I'd like to add 2 points to @Ingolifs' nice answer:

  • The main idea behind recommending deduplication or not is to think about what it amounts to wrt. your application. Both have their point, but wrt. testing they probe slightly different kinds of generalization ability:

    • If you want to test the ability to correctly predict new (say, future) cases, nothing guarantees that statistically independent future cases won't have independent-variable vectors your model already encountered during training. So deduplication here would bias your sample distribution, which may or may not influence your model; and if it does, that influence may or may not be what you actually want.
    • On the other hand, it may still be of interest how your model performs for cases whose independent-variable vector is unknown in the sense that it was not encountered during training. Note that while this means constraining the split so that no equal independent-variable vectors appear in both training and test sets, it does not imply deduplication (see the sketch at the end of this answer).
      I'm an analytical chemist, and I do something similar when I test performance for concentrations in samples of (slightly) different matrix* composition.
  • The second point is that if your original sample is representative of your population, then the deduplicated sample will be biased. In other words: do the duplicates occur naturally, or were they produced artificially?

    • In the former case, I'd say you need very good argumentation why you want to change your sample distribution.

    • You can still deduplicate, but it is up to you to correctly account for that treatment. As @Ingolifs said already, you may be able to save computation by replacing duplicates with appropriate weights. That holds for testing as well.

    • For deduplicated test sets, you'll have to be particularly careful about the conclusions you draw. I'm thinking of the somewhat related issue of reporting predictive values, which can be seriously wrong if they do not take into account the actual class distributions; see Buchen: Cancer: Missing the mark, Nature 471, 428-432 (2011) for a famous example.

* In analytical chemistry, the matrix is the stuff surrounding the analyte you're interested in. Say I'm looking at ethanol in wine; then the water, acids and all other substances except the ethanol form the matrix.
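A minimal sketch of the constrained split mentioned above (my addition, assuming pandas/scikit-learn and hypothetical column names): every duplicate is kept, but rows with identical independent-variable vectors are forced onto the same side of the split.

    # Sketch: keep duplicates, but never let identical feature vectors
    # appear in both training and test sets (hypothetical column names).
    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    df = pd.read_csv("samples.csv")
    feature_cols = ["x1", "x2"]              # assumed independent variables

    # One group id per distinct independent-variable vector.
    groups = df.groupby(feature_cols, sort=False).ngroup()

    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    train_idx, test_idx = next(splitter.split(df, groups=groups))
    train, test = df.iloc[train_idx], df.iloc[test_idx]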

cbeleites unhappy with SX