How to get percentiles from empirical density in R?

Question

The density() function in R allows me to enter observations and get an empirical density that I can plot x and y values. I like it because it allows me to weight observations according to how important they are, and it allows me to specify the smoothing bandwidth I want.

My question is once I run the density() function how do I obtain percentiles from this density? Note this isnt the same as just getting the sample percentiles from my data, because I'd like to use weights on the observations.

There's an answer here, but it might be unsatisfying: https://stat.ethz.ch/pipermail/r-help/2008-July/168947.html — shadowtalker, Jul 17 '14 at 18:57

Gordimer · Answer 1 · 2014-07-24T11:59:43.103

The command density(), although very useful for a quick inspection of the KDE, is also very restrictive since it only returns the values on a grid. I prefer to code my own KDE (usually with a Gaussian kernel). This can be obtained as shown below (1-line code):

rm(list=ls())
# Constructing your own KDE
set.seed(123)
sample = rnorm(1000,10,1)
# Bandwidth used by density()
hT = bw.nrd0(sample)
kde <- Vectorize(function(x) mean(dnorm((x-sample)/hT)/hT))
# Comparison
plot(density(sample))
curve(kde,6,13,add=T,col="red")

The corresponding nonparametric estimator of the CDF can be obtained as follows:

# Obtaining the corresponding kernel distribution estimator 

KDE <-  Vectorize(function(x) mean(pnorm((x-sample)/hT)))
curve(KDE,6,13,col="blue")

Using these functions, you can manually approximate the percentiles if you can provide an interval where the quantile of interest lies:

# Manual calculation of the percentile (requires the probability and an interval containing the quantile)

QKDE <- function(p,Interval){
tempf <- function(t) KDE(t)-p
return(uniroot(tempf,Interval)$root )
}

QKDE(0.5,c(8,12))

This may not be the most efficient way, but it works, and it is fast and accurate. I hope this helps.

Looks promising. Would help if you would annotate a little more, e.g., for "bw.nrd0" which is not commonly seen. Also, one person's "extremely simple" is another person's bugaboo... — rolando2, Jul 24 '14 at 11:55
@rolando2 Thank you for your feedback. There is a comment about `bw.nrd0`, which is the bandwidth used by the command `density()`. I have added a link to the description of this command. I will try to soften my words also. — Gordimer, Jul 24 '14 at 11:58
It is obvious that the author of this question may not need it already but for all those looking and utilizing this answer I would like to point out: optimal bandwidth (as such used for density command) for PDF and CDF estimation are different as bandwidth for PDF is proportional to $n^{-1/5}$ while for CDF it is proportional to $n^{-1/3}$. — EmptyHead, Jul 11 '17 at 13:38

user603 · Answer 2 · 2015-06-10T22:34:57.763

Why re-invent the wheel? I advise you to use the ewcdf function in the spatstat library. If I understand your question correctly, it does exactly what you want:

library(spatstat)
  x <- rnorm(100)    #data
   w <- runif(100)   #weights
   a1<-ewcdf(x,w)    #empricial *weighted* cdf and quantile function
 quantile(a1,.2)     #calls quantile.ecdf()
 #which is different from quantile because of the effects of the weights:
    quantile(x,.2)

How to get percentiles from empirical density in R?

2 Answers2

Linked