I have developed a simple Kernel Density Estimator in Java, based on a few dozen points (maybe up to one hundred or so) and a Gaussian kernel function. The implementation gives me the PDF and CDF of my probability distribution at any point.

I would now like to implement a simple sampling method for this KDE. An obvious choice would of course be to draw from the very set of points making up the KDE, but I would like to be able to retrieve points that are slightly different from the ones in the KDE.

So far I haven't found a sampling technique that I could easily implement to solve this problem (without depending on external libraries for numerical integration or complex computations). Any advice? I don't have especially strong requirements when it comes to precision or efficiency; my main concern is to have a sampling function that works and can be easily implemented. Thanks!

Pierre Lison
  • This is detailed on page 5 of [this document](http://www.stat.cmu.edu/~cshalizi/350/lectures/28/lecture-28.pdf). –  Nov 15 '12 at 18:21
  • thanks, that was useful! And simpler than I thought ;-) – Pierre Lison Nov 15 '12 at 19:49
  • @user10525 the code provided is incorrect, it should be: `rnorm(n, sample(dx$x, n, prob = dx$y, replace = TRUE), dx$bw)` where `dx` is the output of the `density` function. The `prob` argument has to be provided because otherwise you sample uniformly. – Tim Dec 22 '15 at 20:29

1 Answer

As mentioned by Procrastinator, there's a simple way to sample from a Kernel density estimator:

  1. Draw one point $x_i$ uniformly at random from the set of points $x_1, \dots, x_n$ included in the KDE
  2. Once you have the point $x_i$, draw a value from the kernel associated with that point. In this case, draw from the Gaussian $\mathcal{N}(x_i, h^2)$ centered at $x_i$, whose standard deviation is the bandwidth $h$ (so its variance is $h^2$)
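
The two steps work because a Gaussian KDE is an equal-weight mixture of $n$ Gaussians, one per data point. They can be sketched in plain Java with only `java.util.Random`. This is a minimal sketch, not a reference implementation: the class name `KdeSampler` and its fields are hypothetical, and it assumes every point has the same weight and that the bandwidth $h$ is the kernel's standard deviation.

```java
import java.util.Random;

/** Minimal sketch (hypothetical class): samples from a Gaussian KDE. */
final class KdeSampler {
    private final double[] points;   // data points x_1..x_n behind the KDE
    private final double bandwidth;  // h, assumed to be the kernel's standard deviation
    private final Random rng;

    public KdeSampler(double[] points, double bandwidth, Random rng) {
        if (points.length == 0 || bandwidth <= 0) {
            throw new IllegalArgumentException("need at least one point and h > 0");
        }
        this.points = points.clone();
        this.bandwidth = bandwidth;
        this.rng = rng;
    }

    /** Draws one value from the KDE. */
    public double sample() {
        // Step 1: pick one of the data points uniformly at random.
        double center = points[rng.nextInt(points.length)];
        // Step 2: draw from the Gaussian centered there with standard deviation h.
        return center + bandwidth * rng.nextGaussian();
    }
}
```

For example, `new KdeSampler(data, h, new Random()).sample()` returns one draw, where `data` and `h` stand for your own points and bandwidth. No PDF or CDF evaluation is involved.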
jonsca
Pierre Lison
  • (+1) For sharing your solution. –  Nov 19 '12 at 10:15
  • Is $x_i$ one of the original points? If so, looks like we don't really need to construct the actual KDE at all. Just sampling from one of the original points, and $N (x_i,h)$ should suffice? – Ram Apr 08 '13 at 23:19
  • Yes indeed, if you are only using the KDE distribution for sampling, you do not need to explicitly construct the PDF: the only information necessary for the sampling operation is the set of points and the bandwidth. – Pierre Lison Apr 09 '13 at 06:28
  • Just to add to Pierre Lison's answer: in step 2, when sampling from a Gaussian kernel, the bandwidth $h$ should be taken as the standard deviation of the Gaussian distribution around the point $x_i$, not the variance. –  Dec 22 '15 at 18:52
  • Wouldn't you want to sample using standard deviation 1/h or something? As written, the less likely x_i is, the more likely you are to sample another unlikely point nearby because the standard deviation of N is low. – chris Jul 03 '19 at 21:24