Data
My data consists of observations of a set of signal-broadcasting devices. For each second of collected data, the feature vector has a 1 for every device that is observed and a 0 for every device that is not (the absence of a device is important). To collect the data, I stand in several spots around the room for 2 minutes each, then repeat the process with different signal-collecting devices. In other words, my situation is like this one, except that I use 1s and 0s only to indicate presence, not signal strength, and I have different signal-collecting devices.
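For concreteness, this is roughly what the rows look like (the device IDs and room labels below are made up purely for illustration, not my real data):

```python
import pandas as pd

# Each row is one second of observation: 1 = device observed, 0 = not observed.
# Device IDs and room labels are hypothetical, for illustration only.
data = pd.DataFrame(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],  # an exact duplicate of the previous second
        [1, 0, 1, 0],
        [0, 1, 1, 1],
    ],
    columns=["device_A", "device_B", "device_C", "device_D"],
)
data["room"] = ["kitchen", "kitchen", "kitchen", "bedroom"]
print(data)
```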
The Model
I am using this data to build a model that classifies which room I'm in (decision trees for now, via scikit-learn), with 10-fold cross-validation over all the data (no distinction between signal-collecting devices). It recently came to my attention that my sampling method produces a high percentage of duplicate rows (over 60%). I assume this comes from standing in one spot for 2 minutes: ideally, a particular spot should yield one particular set of visible broadcasting devices. That is only the ideal case, though; a device may be missed or appear for a particular second due to signal fluctuation.
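For reference, my setup is roughly like the sketch below (not my exact code; the random X and y are stand-ins for my real feature matrix and room labels, and the last lines show roughly how a duplicate fraction can be measured):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-ins for my real data: X is the (n_seconds, n_devices) 0/1 matrix,
# y holds the room label for each second.
X = rng.integers(0, 2, size=(1200, 20))
y = rng.integers(0, 5, size=1200)

clf = DecisionTreeClassifier(random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("mean accuracy:", scores.mean())

# Fraction of rows that exactly repeat another row.
n_unique = np.unique(X, axis=0).shape[0]
print("duplicate fraction:", 1 - n_unique / X.shape[0])
```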
The Problem
In this situation, I have no guarantee that duplicates won't end up in both the training and testing sets when the folds are created, yet the duplicates seem to carry valuable information precisely because of the noise. I have seen this post, this one, and this one. All three (including the comments) seem to at least imply that in my situation, keeping the duplicates in the training data is important because of the noise. Whether to keep them in the testing set is not clear, and my colleagues have told me that doing so will inflate my model's performance (accuracy, recall, and F1 scores).
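To illustrate the concern, here is a small sketch (with placeholder data containing forced repeats, not my real observations) showing that a plain shuffled KFold happily places identical rows on both sides of a split:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Placeholder 0/1 observations with deliberate repeats, standing in for my data.
X = rng.integers(0, 2, size=(200, 8))
X = np.vstack([X, X])  # force exact duplicates

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
    train_rows = {tuple(row) for row in X[train_idx]}
    leaked = sum(tuple(row) in train_rows for row in X[test_idx])
    print(f"fold {fold}: {leaked}/{len(test_idx)} test rows also appear in training")
```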
What should I do with the duplicates? Can I keep them in the testing set? If I have to take them out for testing, how do I do 10-fold cross-validation?