6

The density() function in R allows me to enter observations and get an empirical density that I can plot x and y values. I like it because it allows me to weight observations according to how important they are, and it allows me to specify the smoothing bandwidth I want.

My question is once I run the density() function how do I obtain percentiles from this density? Note this isnt the same as just getting the sample percentiles from my data, because I'd like to use weights on the observations.

Danica
  • 21,852
  • 1
  • 59
  • 115
user52291
  • 61
  • 1
  • 3

2 Answers2

4

The command density(), although very useful for a quick inspection of the KDE, is also very restrictive since it only returns the values on a grid. I prefer to code my own KDE (usually with a Gaussian kernel). This can be obtained as shown below (1-line code):

rm(list=ls())
# Constructing your own KDE
set.seed(123)
sample = rnorm(1000,10,1)
# Bandwidth used by density()
hT = bw.nrd0(sample)
kde <- Vectorize(function(x) mean(dnorm((x-sample)/hT)/hT))
# Comparison
plot(density(sample))
curve(kde,6,13,add=T,col="red")

The corresponding nonparametric estimator of the CDF can be obtained as follows:

# Obtaining the corresponding kernel distribution estimator 

KDE <-  Vectorize(function(x) mean(pnorm((x-sample)/hT)))
curve(KDE,6,13,col="blue")

Using these functions, you can manually approximate the percentiles if you can provide an interval where the quantile of interest lies:

# Manual calculation of the percentile (requires the probability and an interval containing the quantile)

QKDE <- function(p,Interval){
tempf <- function(t) KDE(t)-p
return(uniroot(tempf,Interval)$root )
}

QKDE(0.5,c(8,12))

This may not be the most efficient way, but it works, and it is fast and accurate. I hope this helps.

Gordimer
  • 43
  • 4
  • Looks promising. Would help if you would annotate a little more, e.g., for "bw.nrd0" which is not commonly seen. Also, one person's "extremely simple" is another person's bugaboo... – rolando2 Jul 24 '14 at 11:55
  • @rolando2 Thank you for your feedback. There is a comment about `bw.nrd0`, which is the bandwidth used by the command `density()`. I have added a link to the description of this command. I will try to soften my words also. – Gordimer Jul 24 '14 at 11:58
  • It is obvious that the author of this question may not need it already but for all those looking and utilizing this answer I would like to point out: optimal bandwidth (as such used for density command) for PDF and CDF estimation are different as bandwidth for PDF is proportional to $n^{-1/5}$ while for CDF it is proportional to $n^{-1/3}$. – EmptyHead Jul 11 '17 at 13:38
2

Why re-invent the wheel? I advise you to use the ewcdf function in the spatstat library. If I understand your question correctly, it does exactly what you want:

library(spatstat)
  x <- rnorm(100)    #data
   w <- runif(100)   #weights
   a1<-ewcdf(x,w)    #empricial *weighted* cdf and quantile function
 quantile(a1,.2)     #calls quantile.ecdf()
 #which is different from quantile because of the effects of the weights:
    quantile(x,.2)
user603
  • 21,225
  • 3
  • 71
  • 135