I have a huge dataset (1.8 million items) and would like to create a small subset of it, say 1k or 10k items. The data is free-form text, with lengths varying from fewer than 10 characters to tens of thousands. I've been exploring the data in length buckets: <10, <100, <1,000, etc.
Now that I've counted the items in each bucket and know each bucket's proportion of the whole dataset, should I sample using those same proportions, so that the ratio between buckets stays the same in my subset (i.e., stratified sampling)? Or should I just pick items completely at random from the whole dataset?
My goal is to later label the items in this subset and use them to train a model that predicts a label/class for new items.
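
For concreteness, here's a minimal sketch of the two options I'm weighing, assuming the data sits in a pandas DataFrame with a hypothetical `text` column (the file name and column name are just placeholders):

```python
import pandas as pd

# Hypothetical setup: full dataset (~1.8M rows) with a free-form "text" column.
df = pd.read_csv("data.csv")

# Bucket by text length: [0, 10), [10, 100), [100, 1000), [1000, 10000), rest.
bins = [0, 10, 100, 1_000, 10_000, float("inf")]
df["bucket"] = pd.cut(df["text"].str.len(), bins=bins, right=False)

n = 10_000  # target subset size

# Option A: simple random sample from the whole dataset.
random_subset = df.sample(n=n, random_state=42)

# Option B: stratified sample; sampling each bucket at the same fraction
# keeps the buckets' proportions (approximately, up to rounding).
stratified_subset = (
    df.groupby("bucket", observed=True)
      .sample(frac=n / len(df), random_state=42)
)
```

With Option B, a bucket that holds, say, 30% of the full dataset also ends up holding roughly 30% of the subset; with Option A the proportions will be similar only in expectation, and tiny buckets may land very few items or none at all.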