4

The empirical cumulative distribution function of a random variable, given observations $x_{(k)} > x_{(k-1)}$, $k \in \mathbb N$, $k \le n$, is defined as $F_{emp}(x_{(k)} > X \ge x_{(k-1)}) = \frac{k}{n+1}$ and $F_{emp}(X \ge x_{(n)}) = 1$.
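
In code, this step definition amounts to something like the following sketch (a minimal illustration with names of my own choosing, reading the convention as "$k$ observations at or below $x$"):

#include <algorithm>
#include <vector>

// Step ecdf under the k/(n+1) convention: for x below the largest observation,
// F(x) = k/(n+1) with k = #{observations <= x}; at and beyond the largest, F(x) = 1.
// Assumes v is sorted ascending and nonempty.
double ecdf_step(const std::vector<double>& v, double x)
{
    if (x >= v.back()) { return 1.0; }
    auto k = std::upper_bound(v.begin(), v.end(), x) - v.begin();
    return static_cast<double>(k) / (v.size() + 1);
}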

Why? As long as we're interpolating, wouldn't it make sense to use some interpolation method with less error? A simple nearest neighbour or piecewise average interpolant would be an improvement, and a cubic interpolant would get us a differentiable empirical density function, too.

The above definition won't even give you the piecewise infimum of the cdf, because the variable is random. It certainly approaches the true function as $n\to\infty$, but then so would any other interpolant. Surely at least linear interpolants were considered.
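
For concreteness, the linear interpolant I have in mind would join the heights $k/(n+1)$ at consecutive order statistics, something like this sketch (same caveats and includes as above):

// Linear interpolation between consecutive order statistics; below the smallest
// observation we simply return the first height, and at or above the largest, 1.
// Assumes v is sorted ascending with at least two elements.
double ecdf_linear(const std::vector<double>& v, double x)
{
    std::size_t n = v.size();
    if (x <= v.front()) { return 1.0 / (n + 1); }
    if (x >= v.back()) { return 1.0; }
    // Locate k such that v[k-1] <= x < v[k]:
    std::size_t k = std::upper_bound(v.begin(), v.end(), x) - v.begin();
    double y0 = static_cast<double>(k) / (n + 1);
    double y1 = static_cast<double>(k + 1) / (n + 1);
    return y0 + (y1 - y0) * (x - v[k - 1]) / (v[k] - v[k - 1]);
}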

Firebug
sesqu

3 Answers

4

The empirical CDF is just one estimator for the CDF. It's consistent, converges pretty quickly in general, and is dead simple to understand. If you want something fancier you could certainly get a kernel density estimate for the PDF and integrate it to get another estimate for the CDF, which would do some kind of interpolation as you suggest. But if it ain't broke....
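
For instance, with a Gaussian kernel the integrated KDE is just an average of normal CDFs centered at the data. A minimal sketch, assuming a user-supplied bandwidth h (the name kde_cdf is mine):

#include <cmath>
#include <vector>

// Integrated Gaussian KDE: Fhat(t) = (1/n) * sum_i Phi((t - x_i)/h), where
// Phi is the standard normal CDF. Smooth in t, unlike the step-function ecdf.
double kde_cdf(const std::vector<double>& x, double t, double h)
{
    double sum = 0;
    for (double xi : x) {
        // Phi(z) = erfc(-z/sqrt(2))/2
        sum += 0.5 * std::erfc(-(t - xi) / (h * std::sqrt(2.0)));
    }
    return sum / x.size();
}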

JMS
  • 2
    Indeed, in the case of a random sample, the empirical CDF converges *uniformly*. [Massart's refinement](http://projecteuclid.org/euclid.aop/1176990746) of the [Dvoretzky-Kiefer-Wolfowitz inequality](http://en.wikipedia.org/wiki/Dvoretzky%E2%80%93Kiefer%E2%80%93Wolfowitz_inequality) shows that this occurs *exponentially* fast in probability. – cardinal May 13 '11 at 23:26
  • @cardinal Wouldn't it also be the case that any reasonable interpolator of the EDF would also have the same convergence properties? – whuber May 14 '11 at 00:08
  • @whuber, I suppose one could quibble with the word *reasonable*. But, I think my answer in any event would be "no". Consider any discrete, non-constant random variable and a linear interpolant. Certainly, the uniform convergence would fail in this case. – cardinal May 14 '11 at 00:33
  • 1
    @cardinal--oh, of course. I was thinking only of continuous CDFs. It wouldn't make much sense to interpolate discrete distributions. – whuber May 14 '11 at 01:30
  • @whuber: Yes, I figured that was likely the case. – cardinal May 14 '11 at 02:12
  • 1
    @whuber: I suspect that for continuous cdfs it must be the case. I'll see if I can come up with a complete and simple argument tomorrow. Intuitively, I would expect that the left-continuous with right-hand limits modification of the empirical cdf could also be shown to satisfy the DKW inequality in the continuous case. That result would then seem to imply your statement. – cardinal May 14 '11 at 02:15
  • 3
    I wonder then if there's an interpolated estimator with better finite sample properties, besides trivial cases like $N=2$ or something. Though I imagine it'd require extra assumptions about the true distribution. The efficiency of the ecdf is quite remarkable, really. – JMS May 14 '11 at 02:40
  • 2
    @JMS, I would guess that even restricting to continuous distributions may not yield a *uniformly* better interpolated estimator. I say this because we can approximate a discrete distribution by a continuous one to arbitrary precision. (Even with continuously differentiable $F$ we may not get such a statement.) It's a good question. – cardinal May 14 '11 at 18:42
4

The EDF is the CDF of the population constituted by the data themselves. This is exactly what you need to describe and analyze any resampling process from the dataset, including nonparametric bootstrapping, jackknifing, cross-validation, etc. Not only that, it's perfectly general: any kind of interpolation would be invalid for discrete distributions.
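
To make the resampling point concrete: drawing i.i.d. values from the EDF is exactly sampling the data with replacement, i.e. one nonparametric bootstrap resample. A minimal sketch (names mine):

#include <random>
#include <vector>

// One bootstrap resample of size m: m i.i.d. draws from the EDF of `data`.
std::vector<double> sample_from_edf(const std::vector<double>& data,
                                    std::size_t m, std::mt19937_64& gen)
{
    std::uniform_int_distribution<std::size_t> idx(0, data.size() - 1);
    std::vector<double> out(m);
    for (auto& o : out) { o = data[idx(gen)]; }
    return out;
}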

whuber
  • For discrete distributions, there is no point in defining the ecdf at all in $\mathbb R$. It makes no sense to ask "what proportion of families have at most 1.5 children", for example. As for sampling from the sample - I have never thought of that as a particularly good idea to begin with, either. – sesqu Nov 25 '10 at 15:02
  • 5
    @sesqu Hmm... You ought to rethink that last statement, because it seems you have rejected out of hand some significant advances in statistical concepts and methodology. The first statement seems to misunderstand what the ecdf is; it perhaps is related to an unusual conception of what a distribution function is. Are you using a standard definition such as that given at http://mathworld.wolfram.com/DistributionFunction.html ? – whuber Nov 26 '10 at 16:01
0

I can't answer this question in full generality, but I think I can state one circumstance where it certainly is not useful: The Anderson-Darling test: \begin{align*} A^2/n &:= \int_{-\infty}^{\infty} \frac{(F_{n}(x) -F(x))^2}{F(x)(1-F(x))} \, \mathrm{d}F(x) \\ &= \int_{-\infty}^{x_0} \frac{F(x)}{1-F(x)} \, \mathrm{d}F(x) + \int_{x_{n-1}}^{\infty} \frac{1-F(x)}{F(x)} \, \mathrm{d}F(x) + \sum_{i=0}^{n-2} \int_{x_i}^{x_{i+1}} \frac{(F_n(x) - F(x))^2}{F(x)(1-F(x))} \mathrm{d}F(x) \end{align*}

Here, $F$ is the cumulative distribution function of the normal distribution, namely, $$ F(x) := \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right) \right] $$ and $F_n$ is the empirical cumulative distribution function $$ F_n(x) := \frac{1}{n} \sum_{i=0}^{n-1} \mathbb{1}_{x_i \le x} $$ (We will abuse notation a bit and let $F_{n}$ denote the linearly interpolated version of $F_n$ as well.)
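
(For reference: under the substitution $u = F(x)$ the two tail integrals have elementary closed forms, $$\int_{-\infty}^{x_0} \frac{F}{1-F} \, \mathrm{d}F = \int_0^{F(x_0)} \frac{u}{1-u} \, \mathrm{d}u = -F(x_0) - \ln\left(1 - F(x_0)\right)$$ and $$\int_{x_{n-1}}^{\infty} \frac{1-F}{F} \, \mathrm{d}F = \int_{F(x_{n-1})}^{1} \frac{1-u}{u} \, \mathrm{d}u = F(x_{n-1}) - 1 - \ln F(x_{n-1}),$$ though evaluating them by quadrature as below is just as serviceable.)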

I repeatedly generated $n$ $N(0,1)$ random numbers, sorted them, and then considered $F_n$ first as a step function, and then as a sequence of linear interpolants. Each interior integral was computed via Gaussian quadrature of ridiculously high degree, and the tails via exp-sinh quadrature.

Did the empirical distribution fit the cumulative distribution better with linear interpolation than step interpolation? No, in fact they are indistinguishable as $n\to \infty$ and one is not uniformly better than the other for small $n$:

[Figure: scatter plot of $\ln(A^2/n)$ against $\log_2(n)$; linear interpolation (blue) vs. step (orange)]

Code to reproduce:

#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <string>
#include <utility>
#include <vector>
#include <boost/math/distributions/anderson_darling.hpp>
#include <quicksvg/scatter_plot.hpp>

template<class Real>
std::pair<Real, Real> step_vs_linear(size_t n)
{
    std::random_device rd;
    std::mt19937_64 gen(rd()); // seed a fast PRNG once rather than drawing every variate from the device
    Real mu = 0;
    Real sd = 1;
    std::normal_distribution<Real> dis(mu, sd);

    std::vector<Real> v(n);
    for (size_t i = 0; i < n; ++i) {
        v[i] = dis(gen);
    }
    std::sort(v.begin(), v.end());
    Real Asq = boost::math::anderson_darling_normality_step(v, mu, sd);
    Real step = Asq;
    //std::cout << "n = " << n << "\n";
    //std::cout << "Step: A^2 = " << Asq << ", Asq/n = " << Asq/n << "\n";

    Asq = boost::math::anderson_darling_normality_linear(v, mu, sd);
    Real line = Asq;
    //std::cout << "Line: A^2 = " << Asq << ", Asq/n = " << Asq/n << "\n";

    return std::pair<Real, Real>(step, line);
}

int main(int argc, char** argv)
{
    using std::log;
    using std::pow;
    using std::floor;
    size_t samples = 10000;
    std::vector<std::pair<double, double>> linear_Asq(samples);
    std::vector<std::pair<double, double>> step_Asq(samples);

#pragma omp parallel for
    for (size_t sample = 0; sample < samples; ++sample) {
        // Give each iteration its own engine, seeded by the sample index, so the
        // OpenMP loop has no data race on shared RNG state:
        std::default_random_engine generator(static_cast<unsigned>(sample));
        std::uniform_real_distribution<double> distribution(3, 18);
        size_t n = floor(pow(2, distribution(generator)));
        auto [step , line] = step_vs_linear<double>(n);
        step_Asq[sample] =  std::make_pair<double, double>(std::log2(double(n)), std::log(step/n));
        linear_Asq[sample] = std::make_pair<double, double>(std::log2(double(n)), std::log(line/n));
        if (sample % 10 == 0) {
            std::cout << "Sample " << sample << "/" << samples << "\n";
        }
    }

    std::string title = "Linear (blue) vs step (orange) Anderson-Darling test";
    std::string filename = "ad.svg";
    std::string x_label = "log2(n)";
    std::string y_label = "ln(A^2/n)";
    auto scat = quicksvg::scatter_plot<double>(title, filename, x_label, y_label);
    scat.add_dataset(linear_Asq, false, "steelblue");
    scat.add_dataset(step_Asq, false, "orange");
    scat.write_all();
}

Anderson-Darling tests:

#ifndef BOOST_MATH_DISTRIBUTIONS_ANDERSON_DARLING_HPP
#define BOOST_MATH_DISTRIBUTIONS_ANDERSON_DARLING_HPP

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <boost/math/distributions/normal.hpp>
#include <boost/math/quadrature/exp_sinh.hpp>
#include <boost/math/quadrature/gauss.hpp>

namespace boost { namespace math {

template<class RandomAccessContainer>
auto anderson_darling_normality_step(RandomAccessContainer const & v, typename RandomAccessContainer::value_type mu = 0, typename RandomAccessContainer::value_type sd = 1)
{
    using Real = typename RandomAccessContainer::value_type;
    using std::log;
    using std::pow;
    if (!std::is_sorted(v.begin(), v.end())) {
        throw std::domain_error("The input vector must be sorted in non-decreasing order v[0] <= v[1] <= ... <= v[n-1].");
    }

    auto normal = boost::math::normal_distribution(mu, sd);

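    // Left tail: for x < v[0], F_n(x) = 0, so the integrand reduces to F^2/(F(1-F)) = F/(1-F).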
    auto left_integrand = [&normal](Real x)->Real {
        Real Fx = boost::math::cdf(normal, x);
        Real dmu = boost::math::pdf(normal, x);
        return Fx*dmu/(1-Fx);
    };
    auto es = boost::math::quadrature::exp_sinh<Real>();
    Real left_tail = es.integrate(left_integrand, -std::numeric_limits<Real>::infinity(), v[0]);

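    // Right tail: for x > v[n-1], F_n(x) = 1, so the integrand reduces to (1-F)^2/(F(1-F)) = (1-F)/F.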
    auto right_integrand = [&normal](Real x)->Real {
        Real Fx = boost::math::cdf(normal, x);
        Real dmu = boost::math::pdf(normal, x);
        return (1-Fx)*dmu/Fx;
    };
    Real right_tail = es.integrate(right_integrand, v[v.size()-1], std::numeric_limits<Real>::infinity());


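    // Interior: on [v[i], v[i+1]) the step ecdf is constant at F_n = (i+1)/N.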
    auto integrator = boost::math::quadrature::gauss<Real, 30>();
    Real integrals = 0;
    int64_t N = v.size();
    for (int64_t i = 0; i < N - 1; ++i) {
        auto integrand = [&normal, &i, &N](Real x)->Real {
            Real Fx = boost::math::cdf(normal, x);
            Real Fn = (i+1)/Real(N);
            Real dmu = boost::math::pdf(normal, x);
            return (Fn - Fx)*(Fn-Fx)*dmu/(Fx*(1-Fx));
        };
        auto term = integrator.integrate(integrand, v[i], v[i+1]);
        integrals += term;
    }


    return v.size()*(left_tail + right_tail + integrals);
}


template<class RandomAccessContainer>
auto anderson_darling_normality_linear(RandomAccessContainer const & v, typename RandomAccessContainer::value_type mu = 0, typename RandomAccessContainer::value_type sd = 1)
{
    using Real = typename RandomAccessContainer::value_type;
    using std::log;
    using std::pow;
    if (!std::is_sorted(v.begin(), v.end())) {
        throw std::domain_error("The input vector must be sorted in non-decreasing order v[0] <= v[1] <= ... <= v[n-1].");
    }

    auto normal = boost::math::normal_distribution(mu, sd);

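    // The tails are identical to the step version: F_n is 0 below v[0] and 1 above v[n-1], regardless of interpolation.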
    auto left_integrand = [&normal](Real x)->Real {
        Real Fx = boost::math::cdf(normal, x);
        Real dmu = boost::math::pdf(normal, x);
        return Fx*dmu/(1-Fx);
    };
    auto es = boost::math::quadrature::exp_sinh<Real>();
    Real left_tail = es.integrate(left_integrand, -std::numeric_limits<Real>::infinity(), v[0]);

    auto right_integrand = [&normal](Real x)->Real {
        Real Fx = boost::math::cdf(normal, x);
        Real dmu = boost::math::pdf(normal, x);
        return (1-Fx)*dmu/Fx;
    };
    Real right_tail = es.integrate(right_integrand, v[v.size()-1], std::numeric_limits<Real>::infinity());


    auto integrator = boost::math::quadrature::gauss<Real, 30>();
    Real integrals = 0;
    int64_t N = v.size();
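    // Interior: F_n now rises linearly from (i+1)/N at v[i] to (i+2)/N at v[i+1].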
    for (int64_t i = 0; i < N - 1; ++i) {
        auto integrand = [&](Real x)->Real {
            Real Fx = boost::math::cdf(normal, x);
            Real dmu = boost::math::pdf(normal, x);
            Real y0 = (i+1)/Real(N);
            Real y1 = (i+2)/Real(N);
            Real Fn = y0 + (y1-y0)*(x-v[i])/(v[i+1]-v[i]);
            return (Fn - Fx)*(Fn-Fx)*dmu/(Fx*(1-Fx));
        };
        auto term = integrator.integrate(integrand, v[i], v[i+1]);
        integrals += term;
    }


    return v.size()*(left_tail + right_tail + integrals);
}

}}
#endif
user14717
  • I find this graphic impossible to interpret due to the overplotting of the orange and blue points and the lack of any description of what the axes even mean. – whuber Sep 26 '19 at 16:59
  • The x-axis is the logarithm of the number of samples. The y-axis is $\ln(A^2/n)$. If linear interpolation was distinctly better than step interpolation, then you'd expect that the blue points would in general lie below the orange points. (Sadly, the overplotting of the orange and blue points is difficult to avoid with SVG.) – user14717 Sep 26 '19 at 17:06
  • @whuber: Should be much better now, with code to reproduce the graphic posted. Originally, I missed the measure $dF(x)$. – user14717 Sep 26 '19 at 20:08
  • If, as you write, you are extrapolating out into the tails, then all bets are off: the results will depend on the extrapolation and that's not relevant to this thread. However, I still cannot see any systematic difference between the orange and blue dots in the graphic, even at its full original scale. – whuber Sep 26 '19 at 22:13
  • The extrapolation of the ecdf into the tails is identically zero for the left tail, and identically 1 for the right. I don't see why all bets are off due to that. In addition, you can easily verify that the measure of the integral in the tails is negligible. – user14717 Sep 27 '19 at 00:06
  • What, then, do you mean by "and the tails via exp-sinh"?? – whuber Sep 27 '19 at 13:04
  • The CDF of the Gaussian has measure to the left and right of any experimental observations. It is the tails of the integral, not tails of the ecdf. – user14717 Sep 27 '19 at 14:18
  • I'm sorry, I cannot follow what you are doing. It sounds like you are performing some kind of parametric estimation and extrapolation. – whuber Sep 27 '19 at 14:20
  • No. I am only computing a quadrature. – user14717 Sep 27 '19 at 14:22
  • The "quadrature" is only of the "linear interpolants" you mention in your answer. I cannot see where Gaussians should be involved for integrating these expressions when the ecdf $\hat F$ is a piecewise linear function: they are rational functions and can be directly and exactly integrated; and certainly the first two integrals in your sum (which extrapolate the ecdf) should not be included. – whuber Sep 27 '19 at 14:28
  • They do not extrapolate the ecdf. The ecdf is identically zero before any observation, and identically 1 after all observations. – user14717 Sep 27 '19 at 14:30
  • If you have an expression for directly and exactly integrating these integrals, I would gladly use it. I ran it through Mathematica to search for one before doing the integrals numerically, and it found nothing. In either case, whether the integrals are computed exactly or via quadrature is irrelevant to the argument. – user14717 Sep 27 '19 at 14:41
  • I think I might have been misinterpreting your notation. Would you mind indicating explicitly what you mean by "$F_n$" and "$F$"? – whuber Sep 27 '19 at 15:02
  • @whuber: Good point; fixed. – user14717 Sep 27 '19 at 16:40
  • Thank you: now the need for quadrature and for the first two (tail) integrals is perfectly clear. (It probably was before, but I simply misread the notation.) – whuber Sep 27 '19 at 17:49
  • 1
    @whuber: No problem; my communication skills always need work. (BTW I've been reading your stuff on this site; all excellent.) – user14717 Sep 27 '19 at 18:37