Suppose we want to get a density estimate of some data X. One way is to compute the empirical CDF,
N <- 1e5
x <- seq(min(X), max(X), length=N)
F <- ecdf(X)(x)
and then the density, by taking the derivative of F.
Question 1
Suppose the density is estimated by a finite difference of F:
D <- diff(F, lag=1) / diff(x, lag=1)
Then the area under the density,
A <- sum(D) * diff(x)[1]
will equal 1 (here assuming that x is an evenly spaced vector). However, it may be desirable to smooth the estimate by taking a lag value greater than 1, e.g.
lg <- 50
D <- diff(F, lag=lg) / diff(x, lag=lg)
What do we need to do to D such that
A <- sum(D) * diff(x)[1]
is equal to unity? Namely, when taking lg > 1, how do we ensure that D is still a density?
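To make the issue concrete, here is a minimal sketch of two possible ways to restore unit area. It is not a definitive answer. The sample X (drawn with rexp), the smaller N, and the names Fx, D_norm, x_pad, F_pad and D_pad are assumptions introduced only for illustration. With lag lg, sum(D) * dx reduces to (sum of the last lg ECDF values minus sum of the first lg ECDF values) / lg, so mass near the edges of the grid is lost. One option is to divide D by the computed area; another is to pad the grid by lg steps on each side so that the ECDF is 0 on the first lg points and 1 on the last lg points.

```r
set.seed(1)
X  <- rexp(1000)      # hypothetical sample, for illustration only
N  <- 1e4
lg <- 50

# Grid on the data range, as in the question.
x  <- seq(min(X), max(X), length.out = N)
dx <- x[2] - x[1]
Fx <- ecdf(X)(x)      # named Fx rather than F to avoid masking R's F (FALSE)

D  <- diff(Fx, lag = lg) / diff(x, lag = lg)
A  <- sum(D) * dx     # falls short of 1: edge mass is lost

# Option 1: rescale by the computed area.
D_norm <- D / A
A_norm <- sum(D_norm) * dx

# Option 2: pad the grid by lg steps on each side, so the first lg ECDF
# values are 0 and the last lg ECDF values are 1.
x_pad <- min(X) + dx * ((-lg):(N - 1 + lg))
F_pad <- ecdf(X)(x_pad)
D_pad <- diff(F_pad, lag = lg) / diff(x_pad, lag = lg)
A_pad <- sum(D_pad) * dx
```

Option 1 changes the height of the whole curve slightly, while option 2 leaves the interior values untouched and only adds the missing tails, so which one is preferable depends on how the edges should be treated.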
Question 2
Supposing x is the natural domain of our observable, we may instead want to consider a re-scaled domain. This can be done in two ways:
- Take F on the natural domain x and then graph xp vs. D, where xp is, for instance, a constant re-scaling of x: xp <- x / max(X).
- Take F on the re-scaled domain and proceed as before with density estimation, e.g. F <- ecdf(X / max(X))(x) where x <- seq(0, 1, length=N).
I am interested in the second point, but have essentially the same problem as in question 1. In point 2 we introduce a change of variables, so D should be multiplied by something to account for this. It seems to make sense on paper, but I'm not sure how to implement it numerically. Besides, a constant re-scaling seems too trivial to require computing inverse functions etc. Bonus points if you can combine Q1 and Q2.
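For a constant re-scaling y = x / c, the change of variables gives f_Y(y) = c * f_X(c * y), so the factor in question would be c itself. The sketch below is a hedged illustration rather than a definitive implementation. The sample X, the padded grid, and the names cc, Dp1, Dp2, A1 and A2 are assumptions made for this example. It compares two routes: multiplying the natural-domain density by c, and differencing the ECDF of the re-scaled data against the re-scaled grid, where the factor appears automatically because the denominator diff(xp) is already smaller by c. The lag lg and the grid padding are included so that both questions are handled together.

```r
set.seed(1)
X  <- rexp(1000)      # hypothetical positive sample, for illustration only
N  <- 1e4
lg <- 50
cc <- max(X)          # constant scale factor: xp = x / cc

# Natural grid, padded by lg steps on each side so the area is preserved.
dx <- (max(X) - min(X)) / (N - 1)
x  <- min(X) + dx * ((-lg):(N - 1 + lg))
D  <- diff(ecdf(X)(x), lag = lg) / diff(x, lag = lg)

# Route 1: density from the natural domain times the Jacobian factor cc.
xp  <- x / cc
Dp1 <- D * cc
A1  <- sum(Dp1) * diff(xp)[1]

# Route 2: ECDF of the re-scaled data on the re-scaled grid; no extra
# factor is applied because diff(xp) already carries the 1 / cc.
Fp  <- ecdf(X / cc)(xp)
Dp2 <- diff(Fp, lag = lg) / diff(xp, lag = lg)
A2  <- sum(Dp2) * diff(xp)[1]
```

Both areas should come out as 1 up to floating-point error. The two density vectors can differ at isolated grid points if rounding in X / cc changes a tie, so they are best compared by plotting rather than by exact equality.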