For data reduction, what is this technique called?

Question

I have observations X along with their labels Y. I then create a histogram of Y. I then remove observations such that the histogram still retains the same distinct shape.

Does anyone know what this data reduction technique is called?

If this is done purely visually, perhaps the nicest name for it would be "data fudging." If you are applying a definite algorithm, and are willing to disclose it in your post, then perhaps more could be said. — whuber, Jul 03 '14 at 20:19
Actually, I just thought of this in my head, I thought that there would be something like this already out there. I was thinking of a way to reduce the amount of data that I had, but still retain the original distribution. — , Jul 03 '14 at 20:38
Such data reduction is usually obtained by simple random sampling of the data: "subsampling," but in some cases other forms of *controlled* sampling can be used: I would expect any answers to explain some of the possibilities. For problems that might occur in using histograms for subsampling please see http://stats.stackexchange.com/questions/51718/assessing-approximate-distribution-of-data-based-on-a-histogram. — whuber, Jul 03 '14 at 21:08
It might be a uniform sub-sampling. Might be class-weighted subsampling. Might be related to an unscented transform (think unscented-Kalman Filter). Without more clear details it is hard to tell. — EngrStudent, Jul 03 '14 at 21:39

score 2 · Accepted Answer · answered Jul 03 '14 at 21:21

What you're describing would qualify as a form of Stratified Random Sampling. (Though typically you'd stratify according to things like "Sex" and "Nationality" and not according to the bins of a histogram...)
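
To make the idea concrete, here is a minimal sketch of stratifying on histogram bins of a continuous Y, assuming NumPy is available. The function name `stratified_subsample` and the parameters `fraction`, `bins`, and `seed` are hypothetical choices for illustration, not something from the question; the sketch simply keeps about the same proportion of observations from every bin so the histogram shape is roughly preserved.

```python
import numpy as np

def stratified_subsample(x, y, fraction=0.1, bins=20, seed=0):
    """Keep roughly `fraction` of the observations in each histogram bin of y."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    y = np.asarray(y)
    # Same bin edges that np.histogram would use for y.
    edges = np.histogram_bin_edges(y, bins=bins)
    # Digitize against the interior edges so every value gets a bin id 0..bins-1.
    bin_ids = np.digitize(y, edges[1:-1])
    keep = []
    for b in np.unique(bin_ids):
        members = np.flatnonzero(bin_ids == b)
        # Keep at least one point per non-empty bin (an assumption of this sketch).
        n_keep = max(1, int(round(fraction * members.size)))
        keep.append(rng.choice(members, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return x[keep], y[keep]

# Illustrative data only: 10,000 observations with a continuous label.
rng = np.random.default_rng(1)
y = rng.normal(size=10_000)
x = rng.normal(size=(10_000, 3))
x_small, y_small = stratified_subsample(x, y, fraction=0.1)
```

Because at least one point is kept per non-empty bin, sparse tail bins end up slightly over-represented; whether that is desirable depends on what the reduced data will be used for.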

Steve S

I've heard of stratified random sampling, but my output is a continuous value. I haven't seen stratified random sampling used with continuous outputs before; is it common? – Jul 03 '14 at 21:23
Well, what the technique is and whether or not it's the best technique for the job are two different questions... – Steve S Jul 03 '14 at 21:39
What you initially describe would qualify as stratified random sampling, but you would probably want to do something more along the lines of bootstrapping/resampling. Thus, to deal with what is (presumably) a *huge* dataset, you'd end up taking lots of manageably-sized samples from your dataset. [Note: For more in-depth answers you'd probably need to be more specific about what you're trying to do and what data you have.] – Steve S Jul 03 '14 at 21:46
