Is it a big no-no to duplicate data in order to increase sample size? I have a sample size of 27, which really doesn't give me much to work with when running tests. Would this be statistically incorrect? There is probably an easy answer to this, but I can't find anything about it.
For sure, you do not gain anything, as the "information" contained in the data does not increase. Whether you actually lose something depends on which methods you use. For instance, many methods assume i.i.d. data; that assumption does not hold for the doubled dataset, so those methods cannot be applied. Finally, consider the simple case where you have only 1 sample and you replicate it N times to create a dataset with sample size N. Is this correct? Do you actually gain anything? – George Jul 05 '17 at 17:08
I have had success using "jittered" data, where small values, smaller than the measurement's margin of error, are added to or subtracted from the data. This helps with data where the margin of error is significant, but it works best if you have a pretty large amount of data to begin with. – Josiah Jul 10 '17 at 18:13
2 Answers
If this really worked, why wouldn't you just collect one data point, then duplicate it as many times as needed?
The issue here is usually independence. Common statistical procedures assume your data were collected as independent observations, which allows you to pool them together to gain more information about the thing you are studying. Doubling data like this obviously violates any independence assumption, and you do not gain any new information from the duplicated data points.
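To make the point concrete, here is a minimal sketch using only the Python standard library. The data are hypothetical (27 simulated values, not the asker's data), and the one-sample t statistic against a mean of 0 is just an assumed example of "running tests". It shows that duplicating every observation leaves the mean unchanged but shrinks the standard error, so the test statistic looks more significant without any new information being added.

```python
import math
import random
import statistics


def standard_error(values):
    """Standard error of the mean, using the sample standard deviation."""
    return statistics.stdev(values) / math.sqrt(len(values))


random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 27 simulated observations (an assumption for illustration).
sample = [random.gauss(0.0, 1.0) for _ in range(27)]
doubled = sample + sample  # every observation now appears twice

mean_original = statistics.mean(sample)
mean_doubled = statistics.mean(doubled)
se_original = standard_error(sample)
se_doubled = standard_error(doubled)

# One-sample t statistic for the null hypothesis "population mean = 0".
t_original = mean_original / se_original
t_doubled = mean_doubled / se_doubled

print(f"mean: original={mean_original:.4f}, doubled={mean_doubled:.4f}")
print(f"standard error: original={se_original:.4f}, doubled={se_doubled:.4f}")
print(f"t statistic: original={t_original:.4f}, doubled={t_doubled:.4f}")
```

For n = 27, the standard error of the doubled data is smaller by a factor of sqrt(26/53), roughly 0.70, regardless of the values themselves. The apparent gain in precision comes purely from counting each observation twice.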

My opinion is that even if you chose an algorithm to oversample your data (for instance the bootstrap, SMOTE, or ADASYN), you would still have far too few samples to show that the oversampling technique gives more or less the same output as the real data you could have collected.
I would work with the data you have. I hope some others also give their input. Also, I arbitrarily assumed that you want to do supervised learning; please clarify if that is not your purpose.

Hey Philip, I'm not sure exactly what Supervised Learning is lol but I don't believe that's my purpose – Mad Jul 05 '17 at 17:10
I assumed that you wanted to use your data to predict or classify; never mind. – Grzegorz Jul 05 '17 at 17:15
@MichaelChernick https://stats.stackexchange.com/questions/112147/can-bootstrap-be-seen-as-a-cure-for-the-small-sample-size As they suggest, it would overfit; as I described, it wouldn't be representative of the actual population (picking samples with replacement). – Grzegorz Jul 07 '17 at 17:22
Bootstrap mimics the relationship between the sample and the population. It is not a method that adds data, even though the mechanism is to sample with replacement from the original data. – Michael R. Chernick Jul 07 '17 at 18:21
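For readers unfamiliar with the bootstrap discussed in these comments, here is a rough sketch of a percentile bootstrap interval for the mean, using only the Python standard library. The data (27 simulated values), the number of resamples (2000), and the helper name `bootstrap_means` are all assumptions made for illustration. Each resample draws 27 values with replacement from the original 27, so no new observations are created; the procedure only estimates how variable the sample mean is.

```python
import random
import statistics


def bootstrap_means(values, n_resamples, rng):
    """Means of resamples drawn with replacement, each the size of the original data."""
    n = len(values)
    return [statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 27 simulated observations (an assumption for illustration).
sample = [rng.gauss(10.0, 2.0) for _ in range(27)]

boot = bootstrap_means(sample, 2000, rng)

# quantiles(..., n=40) returns 39 cut points; the first and last are the
# 2.5th and 97.5th percentiles, giving a rough 95% percentile interval.
cuts = statistics.quantiles(boot, n=40)
lower, upper = cuts[0], cuts[-1]

print(f"sample mean: {statistics.fmean(sample):.3f}")
print(f"approximate 95% bootstrap interval for the mean: ({lower:.3f}, {upper:.3f})")
```

Every bootstrap mean lies between the smallest and largest original values, which is one way to see that resampling cannot add information the 27 observations did not already contain.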