1

I have observations X along with their labels Y. I create a histogram of Y and then remove observations in such a way that the histogram retains the same distinctive shape.

Does anyone know what this data reduction technique is called?

  • 3
    If this is done purely visually, perhaps the nicest name for it would be "data fudging." If you are applying a definite algorithm, and are willing to disclose it in your post, then perhaps more could be said. – whuber Jul 03 '14 at 20:19
  • Actually, I just thought of this myself, but I assumed something like it would already exist. I was looking for a way to reduce the amount of data I have while still retaining the original distribution. –  Jul 03 '14 at 20:38
  • 1
    Such data reduction is usually obtained by simple random sampling of the data: "subsampling," but in some cases other forms of *controlled* sampling can be used: I would expect any answers to explain some of the possibilities. For problems that might occur in using histograms for subsampling please see http://stats.stackexchange.com/questions/51718/assessing-approximate-distribution-of-data-based-on-a-histogram. – whuber Jul 03 '14 at 21:08
  • It might be uniform subsampling. It might be class-weighted subsampling. It might be related to an unscented transform (think unscented Kalman filter). Without clearer details it is hard to tell. – EngrStudent Jul 03 '14 at 21:39

1 Answer

2

What you're describing would qualify as a form of Stratified Random Sampling. (Though typically you'd stratify according to things like "Sex" and "Nationality" and not according to the bins of a histogram...)
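
A minimal sketch of what stratifying on histogram bins of a continuous Y could look like, assuming equal-width bins and the same sampling fraction in every bin. The function name `stratified_subsample` and its parameters are hypothetical and are not taken from the question.

```python
import random
from collections import defaultdict

def stratified_subsample(y, n_bins=10, fraction=0.1, seed=0):
    """Return sorted indices of a subsample that keeps roughly the same
    share of observations in each equal-width histogram bin of y."""
    rng = random.Random(seed)
    lo, hi = min(y), max(y)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if y is constant
    bins = defaultdict(list)
    for i, value in enumerate(y):
        # Clamp so that the maximum value lands in the last bin.
        b = min(int((value - lo) / width), n_bins - 1)
        bins[b].append(i)
    keep = []
    for b in sorted(bins):
        members = bins[b]
        # Keep at least one observation from every non-empty bin.
        k = max(1, round(fraction * len(members)))
        keep.extend(rng.sample(members, k))
    return sorted(keep)

# Usage on synthetic data (assumed for illustration only).
data_rng = random.Random(1)
y = [data_rng.gauss(0, 1) for _ in range(10000)]
idx = stratified_subsample(y, n_bins=20, fraction=0.1)
y_small = [y[i] for i in idx]
```

Keeping at least one point per non-empty bin slightly over-represents sparse tail bins, so the reduced histogram matches the original only approximately.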

Steve S
  • I've heard of stratified random sampling, but my output is a continuous value. I haven't seen stratified random sampling used with continuous outputs before; is it common? –  Jul 03 '14 at 21:23
  • 1
    Well, what the technique is and whether or not it's the best technique for the job are two different questions... – Steve S Jul 03 '14 at 21:39
  • What you initially describe would qualify as stratified random sampling, but you would probably want to do something more along the lines of bootstrapping/resampling. Thus, to deal with what is (presumably) a *huge* dataset, you'd end up taking lots of manageably sized samples from your dataset, etc. [Note: For more in-depth answers you'd probably need to be more specific w/r/t what it is you're trying to do and what you have, etc.] – Steve S Jul 03 '14 at 21:46
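
The idea of taking many manageably sized samples can be sketched roughly as follows. This assumes subsampling without replacement and uses the median as an example statistic; a bootstrap in the strict sense would draw with replacement (for example with `random.choices`). The helper name `repeated_subsample_stat` and the synthetic data are hypothetical.

```python
import random
import statistics

def repeated_subsample_stat(data, stat, sample_size, n_samples, seed=0):
    """Apply `stat` to many small random subsamples of `data`,
    each drawn without replacement."""
    rng = random.Random(seed)
    return [stat(rng.sample(data, sample_size)) for _ in range(n_samples)]

# Synthetic "large" dataset, assumed for illustration only.
data_rng = random.Random(2)
data = [data_rng.expovariate(1.0) for _ in range(20000)]

medians = repeated_subsample_stat(data, statistics.median, 500, 200)
estimate = statistics.mean(medians)   # pooled estimate of the median
spread = statistics.stdev(medians)    # variability across subsamples
```

The spread of the statistic across subsamples gives a rough sense of how much is lost by working with samples of that size instead of the full dataset.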