4

The empirical cumulative distribution function of a random variable, given observations $x_{(k)} > x_{(k-1)}$, $k \in \mathbb N$, $k \le n$, is defined as $F_{emp}(x_{(k)} > X \ge x_{(k-1)}) = \frac{k}{n+1}$ and $F_{emp}(X \ge x_{(n)}) = 1$.
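
In code, this step definition amounts to something like the following sketch (a minimal illustration with names of my own choosing, reading the convention as "$k$ observations at or below $x$"):

#include <algorithm>
#include <vector>

// Step ecdf under the k/(n+1) convention: for x below the largest observation,
// F(x) = k/(n+1) with k = #{observations <= x}; at and beyond the largest, F(x) = 1.
// Assumes v is sorted ascending and nonempty.
double ecdf_step(const std::vector<double>& v, double x)
{
    if (x >= v.back()) { return 1.0; }
    auto k = std::upper_bound(v.begin(), v.end(), x) - v.begin();
    return static_cast<double>(k) / (v.size() + 1);
}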

Why? As long as we're interpolating, wouldn't it make sense to use some interpolation method with less error? A simple nearest neighbour or piecewise average interpolant would be an improvement, and a cubic interpolant would get us a differentiable empirical density function, too.

The above definition won't even give you the piecewise infimum of the cdf, because the variable is random. It certainly approaches the true function as $n\to\infty$, but then so would any other interpolant. Surely at least linear interpolants were considered.
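
For concreteness, the linear interpolant I have in mind would join the heights $k/(n+1)$ at consecutive order statistics, something like this sketch (same caveats and includes as above):

// Linear interpolation between consecutive order statistics; below the smallest
// observation we simply return the first height, and at or above the largest, 1.
// Assumes v is sorted ascending with at least two elements.
double ecdf_linear(const std::vector<double>& v, double x)
{
    std::size_t n = v.size();
    if (x <= v.front()) { return 1.0 / (n + 1); }
    if (x >= v.back()) { return 1.0; }
    // Locate k such that v[k-1] <= x < v[k]:
    std::size_t k = std::upper_bound(v.begin(), v.end(), x) - v.begin();
    double y0 = static_cast<double>(k) / (n + 1);
    double y1 = static_cast<double>(k + 1) / (n + 1);
    return y0 + (y1 - y0) * (x - v[k - 1]) / (v[k] - v[k - 1]);
}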

Firebug
sesqu

3 Answers

4

The empirical CDF is just one estimator for the CDF. It's consistent, converges pretty quickly in general, and is dead simple to understand. If you want something fancier you could certainly get a kernel density estimate for the PDF and integrate it to get another estimate for the CDF, which would do some kind of interpolation as you suggest. But if it ain't broke....
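
For instance, with a Gaussian kernel the integrated KDE is just an average of normal CDFs centered at the data. A minimal sketch, assuming a user-supplied bandwidth h (the name kde_cdf is mine):

#include <cmath>
#include <vector>

// Integrated Gaussian KDE: Fhat(t) = (1/n) * sum_i Phi((t - x_i)/h), where
// Phi is the standard normal CDF. Smooth in t, unlike the step-function ecdf.
double kde_cdf(const std::vector<double>& x, double t, double h)
{
    double sum = 0;
    for (double xi : x) {
        // Phi(z) = erfc(-z/sqrt(2))/2
        sum += 0.5 * std::erfc(-(t - xi) / (h * std::sqrt(2.0)));
    }
    return sum / x.size();
}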

JMS
  • 2
    Indeed, in the case of a random sample, the empirical CDF converges *uniformly*. [Massart's refinement](http://projecteuclid.org/euclid.aop/1176990746) of the [Dvoretzky-Kiefer-Wolfowitz inequality](http://en.wikipedia.org/wiki/Dvoretzky%E2%80%93Kiefer%E2%80%93Wolfowitz_inequality) shows that this occurs *exponentially* fast in probability. – cardinal May 13 '11 at 23:26
  • @cardinal Wouldn't it also be the case that any reasonable interpolator of the EDF would also have the same convergence properties? – whuber May 14 '11 at 00:08
  • @whuber, I suppose one could quibble with the word *reasonable*. But, I think my answer in any event would be "no". Consider any discrete, non-constant random variable and a linear interpolant. Certainly, the uniform convergence would fail in this case. – cardinal May 14 '11 at 00:33
  • 1
    @cardinal--oh, of course. I was thinking only of continuous CDFs. It wouldn't make much sense to interpolate discrete distributions. – whuber May 14 '11 at 01:30
  • @whuber: Yes, I figured that was likely the case. – cardinal May 14 '11 at 02:12
  • 1
    @whuber: I suspect that for continuous cdfs it must be the case. I'll see if I can come up with a complete and simple argument tomorrow. Intuitively, I would expect that the left-continuous with right-hand limits modification of the empirical cdf could also be shown to satisfy the DKW inequality in the continuous case. That result would then seem to imply your statement. – cardinal May 14 '11 at 02:15
  • 3
    I wonder then if there's an interpolated estimator with better finite sample properties, besides trivial cases like $N=2$ or something. Though I imagine it'd require extra assumptions about the true distribution. The efficiency of the ecdf is quite remarkable, really. – JMS May 14 '11 at 02:40
  • 2
    @JMS, I would guess that even restricting to continuous distributions may not yield a *uniformly* better interpolated estimator. I say this because we can approximate a discrete distribution by a continuous one to arbitrary precision. (Even with continuously differentiable $F$ we may not get such a statement.) It's a good question. – cardinal May 14 '11 at 18:42
4

The EDF is the CDF of the population constituted by the data themselves. This is exactly what you need to describe and analyze any resampling process from the dataset, including nonparametric bootstrapping, jackknifing, cross-validation, etc. Not only that, it's perfectly general: any kind of interpolation would be invalid for discrete distributions.
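
To make the resampling point concrete: drawing i.i.d. values from the EDF is exactly sampling the data with replacement, i.e. one nonparametric bootstrap resample. A minimal sketch (names mine):

#include <random>
#include <vector>

// One bootstrap resample of size m: m i.i.d. draws from the EDF of `data`.
std::vector<double> sample_from_edf(const std::vector<double>& data,
                                    std::size_t m, std::mt19937_64& gen)
{
    std::uniform_int_distribution<std::size_t> idx(0, data.size() - 1);
    std::vector<double> out(m);
    for (auto& o : out) { o = data[idx(gen)]; }
    return out;
}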

whuber
  • For discrete distributions, there is no point in defining the ecdf at all in $\mathbb R$. It makes no sense to ask "what proportion of families have at most 1.5 children", for example. As for sampling from the sample - I have never thought of that as a particularly good idea to begin with, either. – sesqu Nov 25 '10 at 15:02
  • 5
    @sesqu Hmm... You ought to rethink that last statement, because it seems you have rejected out of hand some significant advances in statistical concepts and methodology. The first statement seems to misunderstand what the ecdf is; it perhaps is related to an unusual conception of what a distribution function is. Are you using a standard definition such as that given at http://mathworld.wolfram.com/DistributionFunction.html ? – whuber Nov 26 '10 at 16:01
0

I can't answer this question in full generality, but I think I can state one circumstance where it certainly is not useful: The Anderson-Darling test: \begin{align*} A^2/n &:= \int_{-\infty}^{\infty} \frac{(F_{n}(x) -F(x))^2}{F(x)(1-F(x))} \, \mathrm{d}F(x) \\ &= \int_{-\infty}^{x_0} \frac{F(x)}{1-F(x)} \, \mathrm{d}F(x) + \int_{x_{n-1}}^{\infty} \frac{1-F(x)}{F(x)} \, \mathrm{d}F(x) + \sum_{i=0}^{n-2} \int_{x_i}^{x_{i+1}} \frac{(F_n(x) - F(x))^2}{F(x)(1-F(x))} \mathrm{d}F(x) \end{align*}

Here, $F$ is the cumulative distribution function of the normal distribution, namely, $$ F(x) := \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right) \right] $$ and $F_n$ is the empirical cumulative distribution function $$ F_n(x) := \frac{1}{n} \sum_{i=0}^{n-1} \mathbb{1}_{x_i \le x} $$ (We will abuse notation a bit and let $F_{n}$ denote the linearly interpolated version of $F_n$ as well.)
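
(For reference: under the substitution $u = F(x)$ the two tail integrals have elementary closed forms, $$\int_{-\infty}^{x_0} \frac{F}{1-F} \, \mathrm{d}F = \int_0^{F(x_0)} \frac{u}{1-u} \, \mathrm{d}u = -F(x_0) - \ln\left(1 - F(x_0)\right)$$ and $$\int_{x_{n-1}}^{\infty} \frac{1-F}{F} \, \mathrm{d}F = \int_{F(x_{n-1})}^{1} \frac{1-u}{u} \, \mathrm{d}u = F(x_{n-1}) - 1 - \ln F(x_{n-1}),$$ though evaluating them by quadrature as below is just as serviceable.)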

I repeatedly generated $n$ $N(0,1)$ random numbers, sorted them, and then considered $F_n$ first as a step function, and then as a sequence of linear interpolants. Each interior integral was computed via Gaussian quadrature of ridiculously high degree, and the tails via exp-sinh quadrature.

Did the empirical distribution fit the cumulative distribution better with linear interpolation than step interpolation? No, in fact they are indistinguishable as $n\to \infty$ and one is not uniformly better than the other for small $n$:

[Figure: scatter plot of $\ln(A^2/n)$ against $\log_2(n)$; linear interpolation (blue) vs. step (orange)]

Code to reproduce:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <string>
#include <utility>
#include <vector>
#include <boost/math/distributions/anderson_darling.hpp>
#include <quicksvg/scatter_plot.hpp>

template<class Real>
std::pair<Real, Real> step_vs_linear(size_t n)
{
    std::random_device rd;
    std::mt19937_64 gen(rd()); // seed a fast PRNG once rather than drawing every variate from the device
    Real mu = 0;
    Real sd = 1;
    std::normal_distribution<Real> dis(mu, sd);

    std::vector<Real> v(n);
    for (size_t i = 0; i < n; ++i) {
        v[i] = dis(gen);
    }
    std::sort(v.begin(), v.end());
    Real Asq = boost::math::anderson_darling_normality_step(v, mu, sd);
    Real step = Asq;
    //std::cout << "n = " << n << "\n";
    //std::cout << "Step: A^2 = " << Asq << ", Asq/n = " << Asq/n << "\n";

    Asq = boost::math::anderson_darling_normality_linear(v, mu, sd);
    Real line = Asq;
    //std::cout << "Line: A^2 = " << Asq << ", Asq/n = " << Asq/n << "\n";

    return std::pair<Real, Real>(step, line);
}

int main(int argc, char** argv)
{
    using std::log;
    using std::pow;
    using std::floor;
    size_t samples = 10000;
    std::vector<std::pair<double, double>> linear_Asq(samples);
    std::vector<std::pair<double, double>> step_Asq(samples);

#pragma omp parallel for
    for (size_t sample = 0; sample < samples; ++sample) {
        // Give each iteration its own engine, seeded by the sample index, so the
        // OpenMP loop has no data race on shared RNG state:
        std::default_random_engine generator(static_cast<unsigned>(sample));
        std::uniform_real_distribution<double> distribution(3, 18);
        size_t n = floor(pow(2, distribution(generator)));
        auto [step , line] = step_vs_linear<double>(n);
        step_Asq[sample] =  std::make_pair<double, double>(std::log2(double(n)), std::log(step/n));
        linear_Asq[sample] = std::make_pair<double, double>(std::log2(double(n)), std::log(line/n));
        if (sample % 10 == 0) {
            std::cout << "Sample " << sample << "/" << samples << "\n";
        }
    }

    std::string title = "Linear (blue) vs step (orange) Anderson-Darling test";
    std::string filename = "ad.svg";
    std::string x_label = "log2(n)";
    std::string y_label = "ln(A^2/n)";
    auto scat = quicksvg::scatter_plot<double>(title, filename, x_label, y_label);
    scat.add_dataset(linear_Asq, false, "steelblue");
    scat.add_dataset(step_Asq, false, "orange");
    scat.write_all();
}

Anderson-Darling tests:

#ifndef BOOST_MATH_DISTRIBUTIONS_ANDERSON_DARLING_HPP
#define BOOST_MATH_DISTRIBUTIONS_ANDERSON_DARLING_HPP

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <boost/math/distributions/normal.hpp>
#include <boost/math/quadrature/exp_sinh.hpp>
#include <boost/math/quadrature/gauss.hpp>

namespace boost { namespace math {

template<class RandomAccessContainer>
auto anderson_darling_normality_step(RandomAccessContainer const & v, typename RandomAccessContainer::value_type mu = 0, typename RandomAccessContainer::value_type sd = 1)
{
    using Real = typename RandomAccessContainer::value_type;
    using std::log;
    using std::pow;
    if (!std::is_sorted(v.begin(), v.end())) {
        throw std::domain_error("The input vector must be sorted in non-decreasing order v[0] <= v[1] <= ... <= v[n-1].");
    }

    auto normal = boost::math::normal_distribution(mu, sd);

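    // Left tail: for x < v[0], F_n(x) = 0, so the integrand reduces to F^2/(F(1-F)) = F/(1-F).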
    auto left_integrand = [&normal](Real x)->Real {
        Real Fx = boost::math::cdf(normal, x);
        Real dmu = boost::math::pdf(normal, x);
        return Fx*dmu/(1-Fx);
    };
    auto es = boost::math::quadrature::exp_sinh<Real>();
    Real left_tail = es.integrate(left_integrand, -std::numeric_limits<Real>::infinity(), v[0]);

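    // Right tail: for x > v[n-1], F_n(x) = 1, so the integrand reduces to (1-F)^2/(F(1-F)) = (1-F)/F.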
    auto right_integrand = [&normal](Real x)->Real {
        Real Fx = boost::math::cdf(normal, x);
        Real dmu = boost::math::pdf(normal, x);
        return (1-Fx)*dmu/Fx;
    };
    Real right_tail = es.integrate(right_integrand, v[v.size()-1], std::numeric_limits<Real>::infinity());


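    // Interior: on [v[i], v[i+1]) the step ecdf is constant at F_n = (i+1)/N.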
    auto integrator = boost::math::quadrature::gauss<Real, 30>();
    Real integrals = 0;
    int64_t N = v.size();
    for (int64_t i = 0; i < N - 1; ++i) {
        auto integrand = [&normal, &i, &N](Real x)->Real {
            Real Fx = boost::math::cdf(normal, x);
            Real Fn = (i+1)/Real(N);
            Real dmu = boost::math::pdf(normal, x);
            return (Fn - Fx)*(Fn-Fx)*dmu/(Fx*(1-Fx));
        };
        auto term = integrator.integrate(integrand, v[i], v[i+1]);
        integrals += term;
    }


    return v.size()*(left_tail + right_tail + integrals);
}


template<class RandomAccessContainer>
auto anderson_darling_normality_linear(RandomAccessContainer const & v, typename RandomAccessContainer::value_type mu = 0, typename RandomAccessContainer::value_type sd = 1)
{
    using Real = typename RandomAccessContainer::value_type;
    using std::log;
    using std::pow;
    if (!std::is_sorted(v.begin(), v.end())) {
        throw std::domain_error("The input vector must be sorted in non-decreasing order v[0] <= v[1] <= ... <= v[n-1].");
    }

    auto normal = boost::math::normal_distribution(mu, sd);

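    // The tails are identical to the step version: F_n is 0 below v[0] and 1 above v[n-1], regardless of interpolation.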
    auto left_integrand = [&normal](Real x)->Real {
        Real Fx = boost::math::cdf(normal, x);
        Real dmu = boost::math::pdf(normal, x);
        return Fx*dmu/(1-Fx);
    };
    auto es = boost::math::quadrature::exp_sinh<Real>();
    Real left_tail = es.integrate(left_integrand, -std::numeric_limits<Real>::infinity(), v[0]);

    auto right_integrand = [&normal](Real x)->Real {
        Real Fx = boost::math::cdf(normal, x);
        Real dmu = boost::math::pdf(normal, x);
        return (1-Fx)*dmu/Fx;
    };
    Real right_tail = es.integrate(right_integrand, v[v.size()-1], std::numeric_limits<Real>::infinity());


    auto integrator = boost::math::quadrature::gauss<Real, 30>();
    Real integrals = 0;
    int64_t N = v.size();
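    // Interior: F_n now rises linearly from (i+1)/N at v[i] to (i+2)/N at v[i+1].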
    for (int64_t i = 0; i < N - 1; ++i) {
        auto integrand = [&](Real x)->Real {
            Real Fx = boost::math::cdf(normal, x);
            Real dmu = boost::math::pdf(normal, x);
            Real y0 = (i+1)/Real(N);
            Real y1 = (i+2)/Real(N);
            Real Fn = y0 + (y1-y0)*(x-v[i])/(v[i+1]-v[i]);
            return (Fn - Fx)*(Fn-Fx)*dmu/(Fx*(1-Fx));
        };
        auto term = integrator.integrate(integrand, v[i], v[i+1]);
        integrals += term;
    }


    return v.size()*(left_tail + right_tail + integrals);
}

}}
#endif
user14717
  • I find this graphic impossible to interpret due to the overplotting of the orange and blue points and the lack of any description of what the axes even mean. – whuber Sep 26 '19 at 16:59
  • The x-axis is the logarithm of the number of samples. The y-axis is $\ln(A^2/n)$. If linear interpolation was distinctly better than step interpolation, then you'd expect that the blue points would in general lie below the orange points. (Sadly, the overplotting of the orange and blue points is difficult to avoid with SVG.) – user14717 Sep 26 '19 at 17:06
  • @whuber: Should be much better now, with code to reproduce the graphic posted. Originally, I missed the measure $dF(x)$. – user14717 Sep 26 '19 at 20:08
  • If, as you write, you are extrapolating out into the tails, then all bets are off: the results will depend on the extrapolation and that's not relevant to this thread. However, I still cannot see any systematic difference between the orange and blue dots in the graphic, even at its full original scale. – whuber Sep 26 '19 at 22:13
  • The extrapolation of the ecdf into the tails is identically zero for the left tail, and identically 1 for the right. I don't see why all bets are off due to that. In addition, you can easily verify that the measure of the integral in the tails is negligible. – user14717 Sep 27 '19 at 00:06
  • What, then, do you mean by "and the tails via exp-sinh"?? – whuber Sep 27 '19 at 13:04
  • The CDF of the Gaussian has measure to the left and right of any experimental observations. It is the tails of the integral, not tails of the ecdf. – user14717 Sep 27 '19 at 14:18
  • I'm sorry, I cannot follow what you are doing. It sounds like you are performing some kind of parametric estimation and extrapolation. – whuber Sep 27 '19 at 14:20
  • No. I am only computing a quadrature. – user14717 Sep 27 '19 at 14:22
  • The "quadrature" is only of the "linear interpolants" you mention in your answer. I cannot see where Gaussians should be involved for integrating these expressions when the ecdf $\hat F$ is a piecewise linear function: they are rational functions and can be directly and exactly integrated; and certainly the first two integrals in your sum (which extrapolate the ecdf) should not be included. – whuber Sep 27 '19 at 14:28
  • They do not extrapolate the ecdf. The ecdf is identically zero before any observation, and identically 1 after all observations. – user14717 Sep 27 '19 at 14:30
  • If you have an expression for directly and exactly integrating these integrals, I would gladly use it. I ran it through Mathematica to search for one before doing the integrals numerically, and it found nothing. In either case, whether the integrals are computed exactly or via quadrature is irrelevant to the argument. – user14717 Sep 27 '19 at 14:41
  • I think I might have been misinterpreting your notation. Would you mind indicating explicitly what you mean by "$F_n$" and "$F$"? – whuber Sep 27 '19 at 15:02
  • @whuber: Good point; fixed. – user14717 Sep 27 '19 at 16:40
  • Thank you: now the need for quadrature and for the first two (tail) integrals is perfectly clear. (It probably was before, but I simply misread the notation.) – whuber Sep 27 '19 at 17:49
  • 1
    @whuber: No problem; my communication skills always need work. (BTW I've been reading your stuff on this site; all excellent.) – user14717 Sep 27 '19 at 18:37