Density estimation from ECDF - numerical derivatives and scaled domains

Question

Suppose we want to get a density estimate of some data X. One way is to compute the empirical CDF,

N <- 1e5
x <- seq(min(X), max(X), length=N)
F <- ecdf(X)(x)

and then the density, by taking the derivative of F.

Question 1

Suppose the density is computed as

D <- diff(F, lag=1) / diff(x, lag=1)

Then the area under the density, A <- sum(D) * diff(x)[1], will equal 1 (assuming that x is an evenly spaced vector). However, it may be desirable to smooth the estimate by taking a lag value greater than 1, e.g.

lg <- 50
D <- diff(F, lag=lg) / diff(x, lag=lg)

What do we need to do to D so that A <- sum(D) * diff(x)[1] is equal to unity? Namely, when taking lg > 1, how do we ensure that D remains a density?

Question 2

Supposing x is the natural domain of our observable, we may instead want to consider a re-scaled domain. This can be done in two ways:

1. Take F on the natural domain x and then graph xp vs. D, where xp is for instance a constant re-scaling of x: xp <- x / max(X).
2. Take F on the re-scaled domain and proceed as before with density estimation, e.g. F <- ecdf(X / max(X))(x) where x <- seq(0, 1, length=N).

I am interested in the second point, but have essentially the same problem as in question 1. In point 2 we introduce a change of variables, so D should be multiplied by something to account for this. It seems to make sense on paper, but I'm not sure how to implement it numerically. Plus, a constant re-scaling seems too trivial to require computing inverse functions etc. Bonus points if you can combine Q1 and Q2.

ECDFs are step functions with discontinuity points, so the derivatives of ECDFs are not well suited for density estimation. You have to define a probability measure via the Dirac delta function. There is some clarification of the process, and of the issues as well, [here](http://courses.ieor.berkeley.edu/ieor165/lecture_notes/ieor165_lec10.pdf). Another source that describes density estimation via the ECDF is [here](https://eml.berkeley.edu/~powell/e241a_sp06/ndnotes.pdf). – Double Ought Not, Apr 20 '21 at 18:21
Hmm. Thanks. I guess @MattF. has the right idea in asking what I'm trying to do with it. I think in reality I don't need an estimate of the PDF. I am looking at densities which vary over time, and I'm checking how they can be collapsed on top of each other by a rescaling. As far as density estimation goes, I don't think the built-in kernel density estimates (e.g. R's `density`) are very helpful at all, too much B.T.S. Histograms are O.K., but in both of these cases one needs to involve parameters (number of bins, etc.). The ECDF tells you exactly what the data represents, no more, no less. – algae, Apr 21 '21 at 00:31
A big problem with histograms in my case is that you can't adequately estimate the tails of the distribution due to inconsistent sampling rates, i.e. you get bins with 0 counts stuffed in between bins with non-zero counts, and plotting on a log scale is a pain. With the ECDF I can choose a `lag` value which smooths this out and doesn't wrongly impact the data. – algae, Apr 21 '21 at 00:33

Matt F. · Accepted Answer · 2021-04-22T17:04:15.567

In comments, you say that your goal is to compare whether two cdfs are equal after scaling. For this goal, there is no need to compute densities, and the numerical procedures will be more stable without them.

Given a cdf $F$, we can compute the cdf for a rescaled variable by $$G(x) = F(m+sx)$$ where $m$ is some measure of central tendency and $s$ is some measure of dispersion. We could use

- $m$ as the population mean and $s$ as the population standard deviation, if $F$ was constructed from a population as an empirical cdf
- $m$ as the median, $F^{-1}(\frac12)$, and $s$ as half of the interquartile range, $\frac12\left(F^{-1}(\frac34)-F^{-1}(\frac14)\right)$, if $F$ was presented in some other way and robust statistics are appropriate
- $m$ as the mean and $s$ as the standard deviation calculated directly from $F$, as described at this question. If the distribution has minimum $a$ and maximum $b$, then we can write these as $m=b - \int_a^b F(x)\,dx$, $s=\sqrt{v}$, $v=b^2-\int_a^b 2xF(x)\,dx -m^2$.
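
As a rough sketch of the three choices above, the following R snippet computes each pair $(m, s)$ for an ECDF and builds the rescaled cdf $G$. The data `X` is simulated purely for illustration, and the names `Fhat`, `m1`..`m3`, `s1`..`s3` and `G` are assumptions of this example, not part of the original answer. Because an ECDF is piecewise constant, the integrals in the third option reduce to exact sums over the sorted observations.

```r
# Illustrative data only; X stands in for the real observations.
set.seed(1)
X <- rexp(1000, rate = 2)

Fhat <- ecdf(X)
b <- max(X)

# Option 1: population mean and population sd (divisor n, not n - 1).
m1 <- mean(X)
s1 <- sqrt(mean((X - m1)^2))

# Option 2: median and half of the interquartile range.
q  <- quantile(X, c(0.25, 0.5, 0.75), names = FALSE)
m2 <- q[2]
s2 <- (q[3] - q[1]) / 2

# Option 3: moments from F alone, via m = b - int F and v = b^2 - int 2xF - m^2.
# F equals k/n on [xs[k], xs[k+1]), so each integral is a finite sum.
xs <- sort(X)
n  <- length(xs)
Fk <- (1:(n - 1)) / n
m3 <- b - sum(Fk * diff(xs))              # int F dx      = sum F_k * (x_{k+1} - x_k)
v3 <- b^2 - sum(Fk * diff(xs^2)) - m3^2   # int 2 x F dx  = sum F_k * (x_{k+1}^2 - x_k^2)
s3 <- sqrt(v3)

# Rescaled cdf G(x) = F(m + s x); defaults to option 1.
G <- function(x, m = m1, s = s1) Fhat(m + s * x)
```

For an ECDF, options 1 and 3 agree up to floating-point error, so the third form is mainly useful when only $F$ (and not the raw sample) is available. Note that `G(x)` is the same as the ECDF of `(X - m) / s` evaluated at `x`.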

Once we have two rescaled cdfs, $G_1$ and $G_2$, we can test their equality via a Kolmogorov-Smirnov test. This test uses a statistic like $\max |G_1(x)-G_2(x)|$, and again does not require the density.
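
A minimal sketch of this comparison in R, assuming two simulated samples `X1` and `X2` (illustrative only) and standardising each with its own mean and standard deviation. The helper `z` and the names `G1`, `G2`, `D_manual` are hypothetical. The statistic is computed once by hand as $\max |G_1(x)-G_2(x)|$ over the pooled jump points, and once with `ks.test` from base R's `stats` package.

```r
# Illustrative samples: same shape (exponential), different scales.
set.seed(2)
X1 <- rexp(500, rate = 1)
X2 <- rexp(800, rate = 3)

# Standardise each sample with its own centre and spread,
# which corresponds to G(x) = F(m + s x) for each ECDF.
z  <- function(x) (x - mean(x)) / sd(x)
Z1 <- z(X1)
Z2 <- z(X2)

# KS statistic by hand: both ECDFs only change at the pooled data points,
# so the maximum absolute difference is attained there.
G1  <- ecdf(Z1)
G2  <- ecdf(Z2)
pts <- sort(c(Z1, Z2))
D_manual <- max(abs(G1(pts) - G2(pts)))

# The same statistic from the built-in two-sample test.
res <- ks.test(Z1, Z2)
```

One caveat: since $m$ and $s$ are estimated from the same data being tested, the p-value reported by `ks.test` should be treated as approximate; the statistic itself is still a reasonable measure of how well the two curves collapse.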

edited Apr 22 '21 at 17:04

answered Apr 21 '21 at 06:09

Thanks. I still need to try this. Why specifically a measure of central tendency and dispersion, i.e. $m$ and $s$ for the rescaling? What about just $G(x) = F(x / c)$ for some constant. – algae Apr 22 '21 at 23:16
The transformation $G(x)=F(x/c)$ can be good, if you have a good choice for $c$ and if you consider the shift of a distribution to be different from the original. My guess is that you would end up with $c$ which is 1 over a measure of dispersion. But if you consider a shifted version of a distribution to be the same as the original, then you would want a measure of central tendency in the transformation too. Perhaps you have an application in mind where that’s not necessary. – Matt F. Apr 23 '21 at 00:54
@algae, if this is helpful, does it deserve an upvote? – Matt F. Apr 27 '21 at 15:42
I suppose so. But it doesn't really answer my question in the OP, which presumably is quite trivial. – algae Apr 28 '21 at 00:45
