Does anyone know why, in the Kolmogorov-Smirnov test, the empirical distribution function is compared with the cumulative distribution function and not with the probability density function? Is there a reason behind this?

Source: https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test

Richard Hardy
stats_noob
  • Difficult to know what kind of explanation would satisfy here. The empirical density function for a continuous variable is just a series of spikes at the observed values. That's hard to work with. – Nick Cox Oct 07 '21 at 07:59
  • The empirical cdf is an estimator of the theoretical cdf, so it is hard to fathom why it should be compared with the theoretical pdf. – Xi'an Oct 07 '21 at 08:27
  • @Xian I guess the OP is asking why K-S is not about comparing empirical and theoretical PDFs. The short reply is that that is a different problem. – Nick Cox Oct 07 '21 at 09:40
  • The comparison of cdfs is (i) well defined and (ii) distribution-free. With pdfs you get neither - e.g. what's the definition of a sample pdf? – Glen_b Oct 08 '21 at 05:23
  • This is a very strange question. The K-S test is on CDFs. If you somehow built a test on PDFs, then it wouldn't be the K-S test, right? It's like asking why ping-pong isn't played with hockey sticks. – Aksakal Nov 09 '21 at 01:57

3 Answers

I think it's because comparing an empirical pdf with a theoretical pdf is not feasible (or at least hard to do).

The empirical pdf is just a finite sum of Dirac masses: it is not a function with computable values, so how would you compare it with a nice continuous theoretical pdf?

You could try comparing a histogram with the theoretical pdf, but then comes the problem of choosing the width of the histogram bars. The same holds for any density estimator: if you were to use kernel estimation, you would need to choose a bandwidth, and probably make this bandwidth shrink with your sample size.

You would also need some asymptotic results on the distribution of the mismatch between the estimated and true pdf, which I'm not sure exist.
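
To make the contrast concrete, here is a minimal sketch using only the Python standard library. It assumes a standard normal null distribution and a simulated sample of 200 points; the helper names `ks_statistic` and `histogram_max_gap` are made up for this illustration and do not come from any library. The first computes the largest gap between the ECDF and the theoretical cdf, which needs nothing beyond sorting. The second computes the largest gap between histogram bar heights and the theoretical pdf, which cannot be computed until a bin count is chosen.

```python
import math
import random


def normal_cdf(x):
    """Standard normal cdf (the assumed null distribution)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def ks_statistic(sample, cdf):
    """sup_x |F_n(x) - F(x)|, checked on both sides of each ECDF jump."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The ECDF is i/n just before x and (i + 1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def histogram_max_gap(sample, pdf, n_bins):
    """Largest gap between histogram bar heights and the pdf at bin centres."""
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in sample:
        k = min(int((x - lo) / width), n_bins - 1)  # put the maximum in the last bin
        counts[k] += 1
    n = len(sample)
    gap = 0.0
    for k, c in enumerate(counts):
        centre = lo + (k + 0.5) * width
        gap = max(gap, abs(c / (n * width) - pdf(centre)))
    return gap


random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]

# No tuning parameter: one number per sample.
d_n = ks_statistic(sample, normal_cdf)
print("KS statistic:", d_n)

# One number per arbitrary choice of bin count.
gaps = {b: histogram_max_gap(sample, normal_pdf, b) for b in (5, 20, 80)}
print("histogram gaps by bin count:", gaps)
```

The KS statistic is a single value determined by the data alone, whereas the histogram comparison returns a different value for each bin count, and nothing in the data says which one to use.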

Alexis
Pohoua
  • Why doesn't the point about bandwidth also hold for eCDFs? That is, why are histograms somehow problematic to compare, but eCDFs are not? – Alexis Nov 08 '21 at 06:20
  • The ECDF doesn't require any bandwidth: to plot it you just need to sort your observations, which gives you the x values, and plot as y values regularly spaced points ranging from 0 to 1. That's it. – Pohoua Nov 09 '21 at 01:53
  • I think I see... eCDFs lend themselves to an order statistic representation without any need for binning, whereas histograms require bins, or some form of smoother. – Alexis Nov 09 '21 at 02:10
  • Yes. And as the pdf is the derivative of the cdf, you can liken the problem of choosing a bin size for a histogram to the problem of computing the derivative of a function while only having access to some of its points. – Pohoua Nov 09 '21 at 13:14
  • Comparing the PDF or quantiles might require bins (or a kernel smoother), but does it give worse results? – Sextus Empiricus Nov 09 '21 at 15:11

A kernel density estimate of the probability density function would be hard to interpret, because the width of the kernel placed on each individual Dirac spike can change. That is why the cumulative distribution function is chosen: it gives a holistic picture of the distribution at any point.


The Kolmogorov Smirnov test uses the cumulative distribution function, because that's what the Kolmogorov Smirnov test is.

There are other tests that use a comparison with the density function rather than the cumulative distribution function, and they have different names.

Well-known examples are Pearson's chi-squared test and the G-test.

Those tests work best for distributions describing a categorical or discrete variable, but they can be applied to continuous distributions as well. With continuous distributions you first transform the data to discrete data by binning (this post provides a lot of information about this binning when it is done based on the data).

In a certain sense this binning (based on the data) is almost the same as the Kolmogorov Smirnov test where the binning is done in a cumulative way.
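
As a rough illustration of the binning route, here is a minimal sketch using only the Python standard library. It assumes a fully specified standard normal null and a simulated sample of 200 points. Note that the bin edges here are quantiles of the hypothesised distribution (so every bin has the same expected count), which is a simpler choice than the data-based binning mentioned above. The function name `chi_squared_binned` is made up for this example.

```python
import random
from statistics import NormalDist


def chi_squared_binned(sample, dist, n_bins):
    """Pearson chi-squared statistic after binning a continuous sample.

    Bin edges are quantiles of the hypothesised distribution, so each bin
    has expected count n / n_bins under the null.
    """
    n = len(sample)
    edges = [dist.inv_cdf(k / n_bins) for k in range(1, n_bins)]
    observed = [0] * n_bins
    for x in sample:
        k = sum(1 for e in edges if x > e)  # index of the bin containing x
        observed[k] += 1
    expected = n / n_bins
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, observed


random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(200)]

for n_bins in (4, 10, 20):
    stat, observed = chi_squared_binned(sample, NormalDist(0.0, 1.0), n_bins)
    # With no parameters estimated from the data, df = n_bins - 1.
    print(n_bins, "bins: statistic =", round(stat, 3), "df =", n_bins - 1)
```

The statistic and its degrees of freedom both depend on the number of bins, which is the extra choice that the Kolmogorov-Smirnov test avoids by working with cumulative counts.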

Sextus Empiricus