
Say I have a distribution, either described by a probability density function that is integrable and continuous, or by a set of discrete probabilities over a finite set of symbols.

I want to efficiently generate samples from this distribution. I don't care about references to other algorithms; I want to know whether this idea is essentially correct, or how to improve it.

Here is how I currently do it for the discrete case:

  1. Form the cumulative probability function, defined at the ith symbol as the sum of the probabilities of the first through the ith symbols.
  2. This gives an interval, [0, max(CDF)], and it is trivial to output an IID random number anywhere in that interval using a regular PRNG.
  3. Every symbol is now associated with a sub-interval, so output the symbol in whose sub-interval the IID number landed.
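In JavaScript (since that's the target), the three steps above can be sketched as follows; `symbols` and `probs` are hypothetical inputs, and `Math.random` just stands in for whatever PRNG is used:

```javascript
// Sketch of the three steps above. `probs[i]` is the probability of
// `symbols[i]`; `rand` is any PRNG returning uniforms in [0, 1).
function makeDiscreteSampler(symbols, probs, rand = Math.random) {
  // 1. Cumulative sums: cdf[i] = probs[0] + ... + probs[i].
  const cdf = [];
  let total = 0;
  for (const p of probs) {
    total += p;
    cdf.push(total);
  }
  // 2-3. Draw u uniformly in [0, total) and return the first symbol
  // whose cumulative sub-interval contains u. (A linear scan for
  // clarity; a binary search would make this O(log n) per sample.)
  return function sample() {
    const u = rand() * total;
    for (let i = 0; i < cdf.length; i++) {
      if (u < cdf[i]) return symbols[i];
    }
    return symbols[symbols.length - 1]; // guard against rounding
  };
}
```

Note that normalizing by `total` rather than assuming the probabilities sum to exactly 1 sidesteps floating-point drift in the cumulative sums.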

It seems completely obvious to me that the output sequence will have the characteristics of the given discrete distribution, and is optimal up to the original PRNG.

I now want to extend this to the continuous case (this may be the Ziggurat algorithm, and if so, boy am I happy I jumped straight to that). Can we not simply integrate the PDF to get a CDF, again form the interval [0, max(CDF)], and again shoot IID randoms at this interval, and then, and this is the tricky part for me, associate each sample symbol (of which there are now infinitely many) with an interval? Then output the sample symbol in whose interval the IID number landed.
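For what it's worth, when the CDF happens to invert in closed form, this "shoot an IID number at the interval and map it back to a sample" idea collapses to a single line. A sketch using the exponential distribution, chosen only because its CDF, F(x) = 1 − e^(−λx), inverts by hand:

```javascript
// Inverse-transform sampling when the CDF has a closed-form inverse.
// Exponential(rate): F(x) = 1 - exp(-rate * x), so
// F^{-1}(u) = -ln(1 - u) / rate, with u uniform in [0, 1).
function sampleExponential(rate, rand = Math.random) {
  return -Math.log(1 - rand()) / rate;
}
```

Using `1 - rand()` rather than `rand()` avoids taking `Math.log(0)` when the PRNG returns exactly 0.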

So, my question is: how can I associate the sample symbols with intervals? Presumably by discretizing the range of sample values (putting it into buckets) and giving everything in a bucket the same probability. The bucket size could be chosen so that, over the intended output stream length, the samples show no detectable deviation from the underlying continuous distribution.
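The bucketing described above might be sketched like this, assuming the PDF is given as a plain function on a finite support [a, b]; the bucket count `n` is the accuracy knob the paragraph alludes to:

```javascript
// Bucketed approximation to a continuous PDF on [a, b]: split the
// support into n buckets, weight each bucket by pdf(midpoint) * width,
// sample a bucket via its cumulative weights, then return a uniform
// point inside that bucket (rather than one fixed value per bucket).
function makeBucketedSampler(pdf, a, b, n, rand = Math.random) {
  const width = (b - a) / n;
  const cdf = [];
  let total = 0;
  for (let i = 0; i < n; i++) {
    total += pdf(a + (i + 0.5) * width) * width;
    cdf.push(total);
  }
  return function sample() {
    const u = rand() * total;
    let i = 0;
    while (i < n - 1 && u >= cdf[i]) i++;
    return a + i * width + rand() * width;
  };
}
```

Returning a uniform point inside the chosen bucket, rather than the bucket's midpoint, keeps the output genuinely continuous; the residual error is the difference between the true PDF and its piecewise-constant approximation, which shrinks as `n` grows.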

Also, I am doing this in JavaScript. It's a DIY project, so I am not interested in trusting or using someone else's code. But nor am I interested in spending two days on it. I'm asking here because I am not an expert in distributions, and I want to have the ideas corrected before I try to implement them.

  • Actually [Ziggurat](http://en.wikipedia.org/wiki/Ziggurat_algorithm) just sounds easier. Maybe I should just do that. I can't see anything wrong with the above idea...but if someone can make an answer about it I'm sure I will learn something. – Cris Stringfellow Mar 12 '13 at 05:34
  • I lost you in the middle right where you tried to characterize the "continuous case" in terms of "sample symbols" and "intervals." This sounds like a discrete distribution, not a continuous one. Perhaps you could clarify this by specifying how your distribution will be described and what output you are hoping to get in the continuous case. BTW, if your policy is not to "use someone else's code," yet you know you're not an expert, then you're really saying you think that your lack of expertise makes you better qualified than the experts to write correct code in this tricky situation. Good luck! – whuber Mar 12 '13 at 16:11
  • @whuber not really my meaning about the experts. Just that I like to make stuff in order to learn. Even if it breaks first. But actually what I am describing is just the inverted cumulative density function. I admit...it's a bit nastily worded...but sometimes that happens. – Cris Stringfellow Mar 12 '13 at 16:14
  • (I think all your readers will appreciate the motivation for learning and the benefits of writing your own code for comparison to the state of the art.) But then what is your question? If you're inverting a CDF, there are no symbols or intervals involved; that's what's confusing me--and will likely confuse other readers too. – whuber Mar 12 '13 at 16:17
  • @whuber it's just that I didn't understand iCDF when I asked this this morning. Now I do. But I still don't know how I can invert the CDF for say a normal distribution. – Cris Stringfellow Mar 12 '13 at 16:18
  • How accurate do you need the inversion to be? What programming capabilities do you have? There are many, many algorithms for this, ranging from continued fraction approximations through numerical integration or even just interpolation in precomputed tables. You can get some sense of the nature of the techniques from http://stats.stackexchange.com/questions/7200, which discusses computing the Normal CDF; some of the links and references in those answers probably include methods to compute the inverse CDF, too. – whuber Mar 12 '13 at 16:24
  • @whuber cool, thanks. I prefer accurate to 'expensive', and prefer clean algorithms to ones that require handling lots of corner cases or have multiple 'correction/numerical' steps. – Cris Stringfellow Mar 12 '13 at 16:35
  • Inverting CDFs in general, with high accuracy, usually requires an array of techniques: some have explicit inverses (in closed form); others need to be inverted through zero-finding algorithms; yet others--such as the Normal and Gamma--enjoy properties that lead to good approximations and asymptotic methods; still others are inverted by algebraically relating them to other distributions. You will have to think about how general your solution needs to be. A truly general-purpose one can quickly be built with a good zero-finding procedure; it works for any distribution whose PDF you can compute. – whuber Mar 12 '13 at 17:46
  • @whuber that's great to know about the general one. Thanks for focusing me on that. So we can even sample from an unknown, approximated distribution by taking a few observations to estimate the PDF, then inverting it with a zero-finding algorithm. – Cris Stringfellow Mar 13 '13 at 03:41
  • That's right. Moreover, if you estimate the PDF using the EDF (empirical distribution function, which jumps by $1/n$ at each data value), then sampling from the EDF is tantamount to *resampling* from the data. Because this is useful, and because estimated CDFs can have level sections (kernel density estimates do this), it's a good idea to use a robust, fully general root finder that can cope with interval-valued roots and with CDFs that have jumps. – whuber Mar 13 '13 at 03:59
  • @whuber well I am glad I have the right feel for this topic. But you lost me at EDF and 1/$n$, could you give me a reference so I can study more about EDF, *resampling*, level sections, kernel density, and maybe your recommendation for a good root finder that can deal with jumps. – Cris Stringfellow Mar 13 '13 at 05:31
  • The EDF (or ECDF) is a fundamental statistical object--you will find much about it in many textbooks. For a start, though, you might consider [searching here](http://stats.stackexchange.com/search?q=ecdf) :-). $n$, by the way, is the number of values in your dataset. A jump of $1/n$ gives each value equal weight. – whuber Mar 13 '13 at 05:33
  • @whuber Okay good. – Cris Stringfellow Mar 13 '13 at 05:39
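Pulling the comment thread together, a minimal sketch of the general-purpose route whuber suggests (invert the CDF numerically with a root finder) could look like the following. Bisection stands in for a "good zero-finding procedure", and the logistic CDF is used in the usage note only because it is easy to evaluate; a production version would want the robustness to jumps and flat sections that whuber mentions:

```javascript
// General inverse-transform sampling by bisection: find x such that
// cdf(x) ≈ u for a uniform u. Assumes cdf is nondecreasing and that
// [lo, hi] brackets essentially all of the probability mass.
function makeInverseCdfSampler(cdf, lo, hi, rand = Math.random, iters = 60) {
  return function sample() {
    const u = rand();
    let a = lo, b = hi;
    for (let k = 0; k < iters; k++) {
      const mid = (a + b) / 2;
      if (cdf(mid) < u) a = mid;
      else b = mid;
    }
    return (a + b) / 2;
  };
}
```

For example, `makeInverseCdfSampler(x => 1 / (1 + Math.exp(-x)), -50, 50)` draws from the standard logistic distribution. Sixty bisection steps shrink the bracket by a factor of 2^60, so the interval width, not the root finder, stops being the accuracy bottleneck; only the cost of evaluating `cdf` matters.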

0 Answers