4

Given a histogram obtained using given data points, how do I randomly sample from the distribution predicted by the histogram?

Any conceptual comment / R code would be welcome.

Glen_b
  • 257,508
  • 32
  • 553
  • 939
yaitzme
  • 43
  • 1
  • 5
  • 2
    Your title seems to be asking how to sample from a histogram-as-population-pdf while the body text seems to be asking how to sample from a kernel-density-estimate-as-population-pdf (two different problems, the second of which is solved [here](http://stats.stackexchange.com/questions/43674/simple-sampling-method-for-a-kernel-density-estimator) and [here](http://stats.stackexchange.com/questions/82797/how-to-draw-random-samples-from-a-non-parametric-estimated-distribution)). – Glen_b Jan 21 '16 at 09:50

2 Answers2

7

Since the sampling from a kernel density estimate is solved once or twice already, I'll focus on sampling from a histogram-as-population-pdf.

The idea is simply

For each observation in the new sample

  1. choose a histogram bin according to the proportions of 
     the original sample (treated as a discrete pmf)

  2. sample uniformly from that bin-interval

For example in R:

#create an original histogram
x=rgamma(200,4)
xhist=hist(x,freq=FALSE)

#sample from it
samplesize=400
bins=with(xhist,sample(length(mids),samplesize,p=density,replace=TRUE)) # choose a bin
result=runif(length(bins),xhist$breaks[bins],xhist$breaks[bins+1]) # sample a uniform in it
hist(result,freq=FALSE,add=TRUE,bord=3)

Just for completeness, (since sampling from the kernel density estimate* is very simple):

repeat nsim times:
  sample (with replacement) a random observation from the data
  sample from the kernel, and add the previously sampled random observation

* note that some kernels - like fourth order kernels - are not densities and this assumes that the kernel is a density

In R, for a Gaussian kernel and bandwidth h, with data in x:

 dnorm(nsim,m=sample(x,nsim,replace=TRUE), s=h)
Glen_b
  • 257,508
  • 32
  • 553
  • 939
0

I think what you want is:

y <- density(x)
x.new <- rnorm(length(x), sample(x, size = length(x), replace = TRUE), y$bw)
plot(y)
lines(density(x.new), col = "blue")

Note that the density function uses kernel = "gaussian" as default. So to generate gaussian random numbers from a particular density you just need to use rnorm with the mean values equal to the original series and the standard deviation equal to the smoothing bandwidth. In this example I am resampling those mean values to generate different simulations. You can see this example in the density function documentation here.

Regis A. Ely
  • 831
  • 7
  • 14
  • Can you provide some text to explicate this code? – gung - Reinstate Monica Jan 21 '16 at 11:01
  • Sorry, I've added some clarification. Note that this example is taken from the density function documentation. – Regis A. Ely Jan 21 '16 at 12:13
  • As far as I can see you're producing a sample from a particular Gaussian. That's not in general the distribution predicted by any histogram (or kernel density estimate). There seems to be confusion here between the kernel and the data. – Nick Cox Jan 21 '16 at 17:41
  • I had the same feeling when I read the documentation of the density package, but they seem to justify it by saying that a kernel density fit is an equally-weighted mixture. – Regis A. Ely Jan 21 '16 at 18:24
  • I don't use R except very occasionally, so I am guessing what the code. But the bandwidth of a kernel will usually be much less than the SD of the data. Either way, the assumption that the data are Gaussian will in general be quite false. – Nick Cox Jan 21 '16 at 18:27