
I am new to machine learning and have a problem.

I have a dataset of 1191 samples with 10 features each, belonging to 5 different classes. I have trained a neural network on this dataset and obtained a good accuracy of about 0.9. I noticed that about 350 samples are duplicated. Every random selection of data for testing the network contains about 180 samples that were also in the training data.
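For concreteness, this is roughly how the duplicates and the train/test overlap can be counted. This is only a sketch: it assumes the data sit in a pandas DataFrame, and the file name and column layout are placeholders.

```python
# Sketch: count exact duplicate rows and check how many test rows
# also appear in the training split (all names are placeholders).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")          # assumed: 1191 rows, 10 features + label
n_dup = df.duplicated().sum()         # rows identical to an earlier row
print(f"{n_dup} duplicated rows")

train, test = train_test_split(df, test_size=0.2, random_state=0)
# test rows that are identical (in every column) to some training row
overlap = test.merge(train.drop_duplicates(), how="inner")
print(f"{len(overlap)} test rows also occur in the training data")
```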

My question is: should I remove the duplicates from the dataset? Do they contribute to this accuracy?

    Hi, welcome. Yes, these duplicates add weight to the fit of those specific observations (cases). Whether this effect is big or small is hard to tell from the information you have provided. – Jim Sep 22 '19 at 09:14

1 Answer


You should probably remove them. Duplicates are an extreme case of nonrandom sampling, and they bias your fitted model. Including them will essentially lead to the model overfitting this subset of points.
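As a concrete illustration, deduplicating before splitting removes the possibility of the same sample appearing in both sets. This is a minimal sketch with placeholder file and column names, not your exact pipeline:

```python
# Sketch: drop duplicate rows before splitting so no sample can appear
# in both the training and test sets (names are illustrative).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")                   # assumed: 10 features + 'label'
df = df.drop_duplicates().reset_index(drop=True)

X = df.drop(columns="label")
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```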

I say probably because you should (1) be sure they are not real data points that coincidentally have identical values, and (2) try to figure out why you have duplicates in your data. For example, sometimes people intentionally 'oversample' rare categories in training data, though whether that is a good idea is not clear, as it probably helps only in rare cases.
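If oversampling is in fact the source of your duplicates, the important point is to oversample only within the training split, after the split, so the copies cannot leak into the test set. A rough sketch continuing the example above (the class name and sample count are hypothetical):

```python
# Sketch: oversample a rare class only in the training split;
# the test set is left untouched.
import pandas as pd
from sklearn.utils import resample

train = pd.concat([X_train, y_train], axis=1)    # from the split above
rare = train[train["label"] == "rare_class"]     # hypothetical class name
boosted = resample(rare, replace=True, n_samples=200, random_state=0)
train_balanced = pd.concat([train, boosted])
```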

As a side note, it's worth reading this thread: Why is accuracy not the best measure for assessing classification models?
