
I am building a model on a data set that has a high percentage of repeated values.

I am concerned that if I do traditional hold-out or k-fold cross-validation, I will get unreliable results, because the test sets will contain many examples that are basically the same as those included in the training sets.

Is there an approach I can use so that my test sets do not include examples that are basically the same as the examples I have in my training set?

D. Kent
  • Well, if you are really concerned about duplicates biasing predictions, you can train your model on unique values only. But keep in mind that then the sample is no longer representative of the actual population. Further details about the problem would make it possible for us to improve this guess. – David Jul 05 '19 at 11:24
  • Your validation set should be as similar as possible to the data you want to apply the model to. Does that set have the same duplicated values? If not, you should find some method of splitting the data so that no duplicates are spread over the different cross-validation groups (a sketch of one such split follows these comments). – Heikki Pulkkinen Jul 05 '19 at 12:01
  • Welcome to Cross Validated! I think the answers to the linked question essentially answer your question as well. Feel free to say why they don't, or what you require in addition to the answers there; in that case, we can reopen your question. – cbeleites unhappy with SX Jul 05 '19 at 18:48
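
One way to implement the grouped splitting suggested in the comments is to assign every set of identical rows to a single group and then use a group-aware splitter. Below is a minimal sketch using scikit-learn's GroupKFold; the synthetic data, the pandas-based grouping, and the variable names are illustrative assumptions, not part of the original question.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical data set in which a third of the rows are exact duplicates.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 3))
X = np.vstack([base, base[:25]])          # 25 rows repeated verbatim
y = (X[:, 0] > 0).astype(int)

# Give identical feature rows the same group label, so the splitter
# keeps every copy of a duplicate on the same side of each split.
df = pd.DataFrame(X)
groups = df.groupby(list(df.columns), sort=False).ngroup().to_numpy()

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=groups)):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    print(f"fold {fold}: {len(test_idx)} test rows, groups shared with train: {len(shared)}")
```

GroupShuffleSplit from the same module does the analogous thing for a single hold-out split; either way, no test fold contains a row whose duplicate sits in the corresponding training fold.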

0 Answers