R: Density estimation vs Histogram Estimation

Question

To draw random samples from a custom distribution, I recall reading that KDE's are better than histograms. (See hadley's comment here.)

When I experimented in R, I am finding that the KDE method yields results significantly different from the histogram.

I use the annual return rate of S&P500 (from 1926 to 2012) and draw from it.

Reproducible Code below:

#SP500 annual returns
x<- c(11.62, 37.49, 43.61, -8.42, -24.9, -43.34, -8.19, 53.99, -1.44, 47.67, 33.92, -35.03, 31.12, -0.41, -9.78, -11.59, 20.34, 25.9, 19.75, 36.44, -8.07, 5.71, 5.5, 18.79, 31.71, 24.02, 18.37, -0.99, 52.62, 31.56, 6.56, -10.78, 43.36, 11.96, 0.47, 26.89, -8.73, 22.8, 16.48, 12.45, -10.06, 23.98, 11.06, -8.5, 4.01, 14.31, 18.98, -14.66, -26.47, 37.2, 23.84, -7.18, 6.56, 18.44, 32.5, -4.92, 21.55, 22.56, 6.27, 31.73, 18.67, 5.25, 16.61, 31.69, -3.11, 30.47, 7.62, 10.08, 1.32, 37.58, 22.96, 33.36, 28.58, 21.04, -9.11, -11.89, -22.1, 28.68, 10.88, 4.91, 15.79, 5.49, -37, 26.46, 15.06, 2.11, 16)


#Drawing from the histogram of X
hist(x)
h <- hist(x, probability=TRUE, breaks=40)
scale <- sum(h$density)
sum(h$density/scale) # check if 1
#F-inverse function (cdf) for the histogram, scaled to total 1
cumprob <- cumsum(h$density/scale)

#To Generate one random sample from the histogram
ret.bucket<- sum(ifelse(runif(1)>cumprob,1,0)) 
h$mids[ret.bucket+1] #This would be the obtained sample return

#Draw from the Kernel Density Estimate  
kde <- density(x, bw=10) #I experimented with various buckets
kde
plot(kde)
kdf <- as.data.frame(kde$x) #512 rows to choose from
names(kdf) <- "sp500Return"

Update: The following is wrong. Cannot simply draw uniformly from the KDE.

#To generate one sample using the KDE  
kdf[runif(1,1,512),]

Now let's generate a few samples and compare the two methods:

#Compare the 2 methods
ret.kde = NULL
ret.hist = NULL
rndi <- NULL
for (i in 1:5000) {
  rnd <- runif(1)
  rndi[i] <- rnd
  rndrow <- as.integer(512 * rnd)
  ret.kde [i] <- kdf[rndrow+1,]  
  bucket<- sum(ifelse(rnd > cumprob, 1, 0))
  ret.hist[i] <- h$mids[bucket+1]
}

  mean(ret.kde)
  sd(ret.kde)
  mean(ret.hist)
  sd(ret.hist)

  cbind(rndi, ret.kde, ret.hist)

Results I got in one run:

> mean(ret.kde);  mean(ret.hist)
[1] 5.691015 #wrong implementation
[1] 11.8404
> sd(ret.kde);    sd(ret.hist)
[1] 45.87001 #wrong implementation. See below for update
[1] 20.15519

Question: Why are these two methods yielding results that are so different? Is my implementation flawed, or is it a matter of needing to clip the KDE at its ends?

UPDATE:

I found the answer in this response to this CV question. My original method of drawing from the KDE is flawed.

Here's an updated implementation of drawing from the KDE:

#Two Step process To Draw from the Kernel Density Estimate  
#Step 1 First get one of the original points, randomly.
sampleFromKDE <- function(x, bw) {
  rnd <- sample(length(x),1)
  xi <- x[rnd]
  #Step 2: Get a N(xi, bw) around the selected Kernel
  #bw=10
  return(rnorm(1, xi, bw) ) 
}

k<-NULL
for (i in 1:5000) {
k[i] <- sampleFromKDE(x, 3)  
}

leading to

#> mean(k)
#[1] 12.20025
#> sd(k)
#[1] 19.88572

score 3 · Accepted Answer · answered Mar 19 '13 at 23:26

The mean and standard deviation of the histogram method are quite close to the sample mean and sample standard deviation of the raw data. The KDE is way off.

I tried this in Matlab (because I do not know R) with the same kernel width and generated 5000 random points. My mean and standard deviation look good:


x = [11.62, 37.49, 43.61, -8.42, -24.9, -43.34, -8.19, 53.99, -1.44, 47.67, 33.92, -35.03, 31.12, -0.41, -9.78,-11.59,20.34, 25.9, 19.75, 36.44, -8.07, 5.71, 5.5, 18.79, 31.71, 24.02, 18.37, -0.99, 52.62, 31.56, 6.56, -10.78, 43.36,11.96, 0.47, 26.89, -8.73, 22.8, 16.48, 12.45, -10.06, 23.98, 11.06, -8.5, 4.01, 14.31, 18.98, -14.66, -26.47, 37.2,23.84, -7.18, 6.56, 18.44, 32.5, -4.92, 21.55, 22.56, 6.27, 31.73, 18.67, 5.25, 16.61, 31.69, -3.11, 30.47, 7.62,10.08,1.32, 37.58, 22.96, 33.36, 28.58, 21.04, -9.11, -11.89, -22.1, 28.68, 10.88, 4.91, 15.79, 5.49, -37, 26.46, 15.06,2.11,16]';

kd = fitdist(x, 'kernel', 'width', 10);
randdata = kd.random(1000,1);
mean(randdata)
std(randdata)

which gives me


ans =

   10.9047


ans =

   23.0171

I guess there a bug in your code.

+1 The code does not generate data from the KDE correctly: it chooses values uniformly from the *density* rather than the distribution. — whuber, Mar 19 '13 at 23:31
Thanks, Atul, @whuber. I will look around for how to choose values from the KDE distribution. — Ram, Mar 20 '13 at 01:10

R: Density estimation vs Histogram Estimation

1 Answers1