I am currently working on a student project where we do binary classification, but the data is highly skewed!
Both the train AND the test data contain a huge number of duplicates, where every row is identical. We don't know exactly how to handle this problem, because the test data also has a lot of duplicates, and we cannot delete rows there: we need to predict the binary outcome for each of them.
My question is how we can do cross-validation with the train data to estimate our model's performance, with the best attributes and parameters, on the unknown test data.
If we keep the duplicates in our training data, the binary classification problem is balanced: we have almost the same number of 0s and 1s for the binary class. But if we delete the duplicates, we get many more 0s than 1s (about 2/3 are 0s and 1/3 are 1s). As already mentioned, we also have a lot of duplicates in the test data, so we assume that the test data is balanced as well with the duplicates, and has more 0s and fewer 1s without them.
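For context, this is roughly how the class balance can be compared with and without duplicates (a minimal sketch; `df` and the column name `"target"` are placeholders for the actual data):

```python
import pandas as pd

# Placeholder names: `df` is the training DataFrame, "target" its label column.
df = pd.read_csv("train.csv")

# Class proportions with duplicates kept (roughly 50/50 in our case)
print(df["target"].value_counts(normalize=True))

# Class proportions after dropping fully identical rows (roughly 2/3 vs. 1/3)
print(df.drop_duplicates()["target"].value_counts(normalize=True))
```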
How do we do a good cross-validation for this problem?
Do we:

1. leave the training data as it is and not delete any duplicate rows,
2. delete the duplicate rows first and then do the cross-validation and prediction on the unbalanced data, or
3. delete the duplicates per CV fold only in the training fold, balance that fold, and then predict on the unbalanced CV test fold where no rows were deleted (see the sketch below)?
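To make the third option concrete, here is a minimal sketch of what we have in mind (assuming a pandas DataFrame `df`, a list of feature column names `features`, and a label column `"target"`; these names are placeholders, and scikit-learn's `class_weight="balanced"` stands in for the balancing step):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def cv_with_fold_dedup(df, features, target="target", n_splits=5):
    """Cross-validate with per-fold deduplication of the training fold only."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in skf.split(df[features], df[target]):
        # Drop fully identical rows from the training fold only
        train_fold = df.iloc[train_idx].drop_duplicates()
        # The validation fold keeps its duplicates, like the real test data
        val_fold = df.iloc[val_idx]

        # class_weight="balanced" compensates for the imbalance that
        # appears after deduplication (any balancing scheme could go here)
        model = RandomForestClassifier(class_weight="balanced", random_state=42)
        model.fit(train_fold[features], train_fold[target])

        preds = model.predict(val_fold[features])
        scores.append(accuracy_score(val_fold[target], preds))
    return scores

# Usage: scores = cv_with_fold_dedup(df, features)
```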