I have a huge dataset (1.8 million items) and would like to create a small subset of it, say 1k or 10k items. The data is free-form text, with lengths varying from fewer than 10 characters to tens of thousands. I've been exploring the data in length buckets: <10, <100, <1,000, etc.
Now that I've counted the items in each bucket and know each bucket's proportion of the whole dataset, should I sample using those same proportions, so that the ratio between buckets stays the same in my subset (i.e., stratified sampling)? Or should I just pick items completely at random from the whole dataset?
My goal is to later label the items in this subset and use them to train a model that predicts a label/class for new items.
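
For concreteness, here's a minimal sketch of the two options I'm weighing, assuming the data sits in a pandas DataFrame with a hypothetical `text` column (the file name and column name are just placeholders):

```python
import pandas as pd

# Hypothetical setup: full dataset (~1.8M rows) with a free-form "text" column.
df = pd.read_csv("data.csv")

# Bucket by text length: [0, 10), [10, 100), [100, 1000), [1000, 10000), rest.
bins = [0, 10, 100, 1_000, 10_000, float("inf")]
df["bucket"] = pd.cut(df["text"].str.len(), bins=bins, right=False)

n = 10_000  # target subset size

# Option A: simple random sample from the whole dataset.
random_subset = df.sample(n=n, random_state=42)

# Option B: stratified sample; sampling each bucket at the same fraction
# keeps the buckets' proportions (approximately, up to rounding).
stratified_subset = (
    df.groupby("bucket", observed=True)
      .sample(frac=n / len(df), random_state=42)
)
```

With Option B, a bucket that holds, say, 30% of the full dataset also ends up holding roughly 30% of the subset; with Option A the proportions will be similar only in expectation, and tiny buckets may land very few items or none at all.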