
I've been reading the Wikipedia articles about various statistical tests, and trying to code up programs that perform them. However, the article on the Kolmogorov-Smirnov test is rather unclear to me. Can somebody explain, in simple language, how I would write a program that takes a list of real numbers and a CDF, and computes a p-value for how closely they match?

Jeromy Anglim
MathematicalOrchid
  • If you are willing to use R, you can use the command [ks.test](http://www.astrostatistics.psu.edu/su07/R/library/stats/html/ks.test.html). –  May 08 '12 at 10:54
  • Is your aim just to be able to run a test with existing software, or to be able to write a program from first principles (e.g. for learning purposes)? – Peter Ellis May 08 '12 at 10:59
  • I'd like to understand what the process is, for curiosity's sake if nothing else. I'm nosey like that. ;-) – MathematicalOrchid May 08 '12 at 11:09
  • Calculating the K-S *statistic* should be straightforward; getting a p-value from that is more complex though, as the distribution of the test statistic (even the asymptotic distribution) isn't a standard one. If you're doing this for learning purposes, would you be satisfied with calculating the statistic? – onestop May 08 '12 at 11:24
  • @onestop: That will do. :-) I guess I can always look up critical values from a table or something. – MathematicalOrchid May 08 '12 at 11:30

5 Answers


Below is SPSS syntax for the one-sample K-S test, following the SPSS algorithms document. In this example, an observed variable X is tested against a uniform theoretical distribution. Although the snippet is not pseudocode, you should be able to follow what it is doing by comparing its comments with the document.

*Compute empirical cumulative distribution cum_e, and also sample size n.
rank X /rfraction into cum_e /n into n /ties= high /print= no .

*Let's obtain parameters for theoretical uniform distribution from the data (but one could specify arbitrary values).
aggregate /outfile= * mode= addvari /p1= min(X) /p2= max(X).

*Compute theoretical cumulative distribution.
compute cum_t= (X-p1)/(p2-p1).

*Two sets of differences between the two cumulative curves.
*d1 holds the differences at the bottom of each step of the empirical curve (the ECDF value just before the jump at each data point).
*d2 holds the differences at the top of each step (the ECDF value at each data point).
*(Both are needed because the theoretical distribution is continuous whereas the empirical one is necessarily discrete.)
*The most negative difference ("D-") is the minimum of d1.
*The most positive difference ("D+") is the maximum of d2.
compute casenum= $casenum.
sort cases by X.
compute d1= lag(cum_e)-cum_t.
compute d2= cum_e-cum_t.
sort cases by casenum.
aggregate /outfile= * mode= addvari over= yes /break= break /d1= min(d1) /d2= max(d2).
         /*Values "D-" (as d1 now) and "D+" (as d2 now)

*The test value, z.
do if $casenum=1.
-compute z= sqrt(n)*max(abs(d1),abs(d2)).

*Its 2-tailed significance sig.
-do if z>=0 and z<.27.
- compute sig= 1.
-else if z>=.27 and z<1.
- compute sig= exp(-1.233701*z**-2).
- compute sig= 1-(2.506628/z)*(sig+sig**9+sig**25).
-else if z>=1 and z<3.1.
- compute sig= exp(-2*z**2).
- compute sig= 2*(sig-sig**4+sig**9-sig**16).
-else if z>=3.1.
- compute sig= 0.
-end if.
end if.
execute.
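For readers who don't use SPSS, here is a rough Python sketch of the same steps. This is my own translation, not part of the original syntax: the uniform parameters are estimated from the data (as in the `aggregate` step above), and the significance branch follows the same piecewise approximation as the `do if` block.

```python
import math

def ks_uniform(xs):
    """One-sample K-S test of xs against a uniform distribution whose
    endpoints are estimated from the data. Returns (D, z, two-tailed sig)."""
    xs = sorted(xs)
    n = len(xs)
    lo, hi = xs[0], xs[-1]                          # p1 = min(X), p2 = max(X)
    d_plus = d_minus = 0.0
    for i, x in enumerate(xs, start=1):
        cum_t = (x - lo) / (hi - lo)                # theoretical uniform CDF
        d_plus = max(d_plus, i / n - cum_t)         # "D+": top of the ECDF step
        d_minus = max(d_minus, cum_t - (i - 1) / n) # "D-": bottom of the ECDF step
    z = math.sqrt(n) * max(d_plus, d_minus)
    # Two-tailed significance: same piecewise approximation as the syntax above.
    if z < 0.27:
        sig = 1.0
    elif z < 1.0:
        q = math.exp(-1.233701 * z ** -2)
        sig = 1 - (2.506628 / z) * (q + q ** 9 + q ** 25)
    elif z < 3.1:
        q = math.exp(-2 * z ** 2)
        sig = 2 * (q - q ** 4 + q ** 9 - q ** 16)
    else:
        sig = 0.0
    return max(d_plus, d_minus), z, sig
```

For example, `ks_uniform([1, 2, 3, 4, 5])` gives D = 0.2 with a significance close to 1, as you would expect for perfectly evenly spaced data.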
ttnphns

I can't make any sense of the code that ttnphns posted. But the linked algorithms document has helped.

First, it appears that if you take your $n$ samples and sort them into ascending order, then the empirical cumulative distribution function can be defined as

$$ \hat{F}(x) = \left\{ \begin{array}{ll} 0 & \text{if } x < x_1 \\ i/n & \text{if } x_i \leq x < x_{i+1} \\ 1 & \text{if } x \geq x_n \\ \end{array} \right. $$

(Assuming that the $x_i$ are your samples in ascending order.)

It now remains only to discover the maximum distance between $\hat{F}(x)$ and your target CDF $F(x)$.

At first glance, finding this maximum distance looks quite hard, since the CDF might cross the EDF anywhere within a given span. However, since a CDF is, by definition, monotonically non-decreasing, and the EDF is constant within each span, the maximum distance over a span has to occur at one end or the other of it.

Edit: Corrected formulas. (Oops!)

So, it seems the complete algorithm is this:

  • Take your samples $s_i$ and sort them into ascending order $x_i$.

  • For $i=1$ to $n$, compute $d_L = |i/n - F(x_i)|$ and $d_H = |(i-1)/n - F(x_i)|$.

  • The largest $d_L$ or $d_H$ found is the KS statistic.
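As a quick sketch, the steps above in Python (the `cdf` argument is any callable returning $F(x)$; the names are just illustrative):

```python
def ks_statistic(samples, cdf):
    """K-S statistic: the largest gap between the empirical CDF of the
    sorted samples and the target CDF, checked at both ends of each step."""
    xs = sorted(samples)
    n = len(xs)
    return max(
        max(abs(i / n - cdf(x)), abs((i - 1) / n - cdf(x)))
        for i, x in enumerate(xs, start=1)
    )
```

For instance, with the uniform CDF `lambda x: x` and samples `[0.2, 0.4, 0.6, 0.8, 1.0]`, every step of the EDF overshoots the diagonal by 0.2, so the statistic is 0.2.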

To get a p-value, you need to compare the statistic against the CDF of the Kolmogorov distribution. Apparently this involves a theta function or something... o_O

PS. It seems not to matter whether you choose $d_H = |i/n - F(x_{i-1})|$ or $d_H = |(i-1)/n - F(x_i)|$; taken over the whole sample, both produce the same set of differences.
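For what it's worth, the asymptotic p-value doesn't require evaluating theta functions directly: the tail of the Kolmogorov distribution has a rapidly converging alternating-series form, $P(\sqrt{n}\,D > z) \approx 2\sum_{k\ge 1}(-1)^{k-1}e^{-2k^2z^2}$. A sketch (note the series converges slowly for very small $z$, where the p-value is near 1 anyway):

```python
import math

def ks_pvalue(d, n, terms=100):
    """Asymptotic two-sided p-value for K-S statistic d on n samples,
    via the alternating-series tail of the Kolmogorov distribution."""
    z = math.sqrt(n) * d
    if z == 0:
        return 1.0
    s = 2 * sum(
        (-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
        for k in range(1, terms + 1)
    )
    return max(0.0, min(1.0, s))  # clamp numerical noise into [0, 1]
```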

MathematicalOrchid
  • The _CDF_ of the uniform is not constant, it is linear. For arbitrary CDF $F$, note that it is monotonic non-decreasing. I believe this allows you to bound the maximum difference by looking only at the 'nodes' $x_i$, but you may have to 'shift' the curves. That is, I think that $|F(x_i) - \hat{F}(x_i)|\vee |F(x_{i+1}) - \hat{F}(x_i)|$ is an upper bound for $|F(x) - \hat{F}(x)|$ for $x \in \left[x_i,x_{i+1}\right).$ – shabbychef May 08 '12 at 20:13
  • Well, I did say "$F(x)$ is a straight line". That sounds fairly linear to me. ;-) Regardless, I will ponder the general case further... – MathematicalOrchid May 08 '12 at 20:19
  • right you are, oops. I misread it as $F(x)$ is constant. For sufficiently large $n$, this is not going to really make any difference if you are looking up p-values in a table, BTW. – shabbychef May 08 '12 at 20:36
  • Note that there is *no* difference between the uniform case and the general (absolutely continuous) case. Though it deals with the *two-sample* version, some hints as to why this is true can be found in the comments [here](http://stats.stackexchange.com/questions/17495/kolmogorov-smirnov-2-sample-p-values). That is, the K-S statistic is invariant under the transform $F$. – cardinal May 09 '12 at 12:58
  • Theta functions were invented a couple hundred years ago in part because they make computation with elliptic functions *easy*: they converge incredibly quickly. You could compute them (for most arguments, anyway) with a spreadsheet :-). – whuber May 10 '12 at 15:18
  • @whuber: Perhaps you can provide a relevant answer to my question on CDF approximation then. :-) – MathematicalOrchid May 10 '12 at 15:29

Here is an answer to a related question that, as part of the answer, gives an example of computing the statistic, along with a graph demonstrating the general idea.

Greg Snow

I recently posted the T-SQL code below for a Kolmogorov-Smirnov test implementation on my blog. To test it against distributions other than the normal, you'd have to replace the NormalDistributionSingleCDFFunction.sql function it references with a different CDF. The parameters allow you to run it on any column in any SQL Server database you have access to.

Keep in mind that I'm an amateur at this and have only tested it on a couple of small datasets, so there may be undiscovered issues with the code; it's more an example of how the test could be coded than a finished, thoroughly tested solution. One advantage, however, is that it can handle tens of millions of records effortlessly even on a desktop computer. There are some unsung benefits to using SQL for certain data mining/machine learning/statistical tasks.

The code would have to be adapted to work on Oracle, MySQL, or another database platform besides SQL Server, but the principles behind it are basically the same. If you don't have SQL Server, be aware that there are free editions that I believe can host up to 10 GB; I'm fairly certain its competitors have free versions in the same size range. For an in-depth discussion of how the code works, including performance tests and possible "gotchas," see my blog post. I hope that helps...

CREATE PROCEDURE [Calculations].[GoodnessOfFitKolomgorovSmirnovAndKuipersTestsSP]
@Database1 as nvarchar(128) = NULL, @Schema1 as nvarchar(128), @Table1 as nvarchar(128), @Column1 AS nvarchar(128), @OrderByCode as tinyint
AS

DECLARE @SchemaAndTable1 nvarchar(400), @SQLString nvarchar(max)
SET @SchemaAndTable1 = @Database1 + '.' + @Schema1 + '.' + @Table1

DECLARE @Mean float,
@StDev float,
@Count float

DECLARE @EDFTable table
(ID bigint IDENTITY (1,1),
Value float,
ValueCount bigint,
EDFValue float,
CDFValue decimal(38,37),
EDFCDFDifference decimal(38,37))

DECLARE @ExecSQLString nvarchar(max), @MeanOUT nvarchar(200), @StDevOUT nvarchar(200), @CountOUT nvarchar(200), @ParameterDefinition nvarchar(max)
SET @ParameterDefinition = '@MeanOUT nvarchar(200) OUTPUT, @StDevOUT nvarchar(200) OUTPUT, @CountOUT nvarchar(200) OUTPUT'
SET @ExecSQLString = 'SELECT @MeanOUT = Avg(' + @Column1 + '), @StDevOUT = StDev(' + @Column1 + '), @CountOUT = Count(' + @Column1 + ')
       FROM ' + @SchemaAndTable1 + '
       WHERE ' + @Column1 + ' IS NOT NULL'

EXEC sp_executesql @ExecSQLString, @ParameterDefinition, @MeanOUT = @Mean OUTPUT, @StDevOUT = @StDev OUTPUT, @CountOUT = @Count OUTPUT

SET @SQLString = 'SELECT Value, ValueCount, SUM(ValueCount) OVER (ORDER BY Value ASC) / CAST(' +
CAST(@Count as nvarchar(50)) + ' AS float) AS EDFValue
       FROM (SELECT DISTINCT ' + @Column1 + ' AS Value, Count(' + @Column1 + ') OVER (PARTITION BY ' + @Column1 + ' ORDER BY ' + @Column1 + ') AS ValueCount
              FROM ' + @SchemaAndTable1 + '
              WHERE ' + @Column1 + ' IS NOT NULL) AS T1'

INSERT INTO @EDFTable
(Value, ValueCount, EDFValue)
EXEC (@SQLString)

UPDATE T1
SET CDFValue = T3.CDFValue, EDFCDFDifference = EDFValue - T3.CDFValue
FROM @EDFTable AS T1
       INNER JOIN (SELECT DistinctValue, Calculations.NormalDistributionSingleCDFFunction (DistinctValue, @Mean, @StDev) AS CDFValue
              FROM (SELECT DISTINCT Value AS DistinctValue
                     FROM @EDFTable) AS T2) AS T3
       ON T1.Value = T3.DistinctValue

SELECT KolomgorovSmirnovSupremum AS KolomgorovSmirnovTest, KolomgorovSmirnovSupremum - KolomgorovSmirnovMinimum AS KuipersTest
FROM (SELECT Max(ABS(EDFValue - CDFValue)) AS KolomgorovSmirnovSupremum, -- the supremum, i.e. the max
Min(ABS(EDFValue - CDFValue)) AS KolomgorovSmirnovMinimum
       FROM @EDFTable
       WHERE EDFCDFDifference > 0) AS T3

SELECT ID, Value, ValueCount, EDFValue, CDFValue, EDFCDFDifference
FROM @EDFTable
       ORDER BY CASE WHEN @OrderByCode = 1 THEN ID END ASC,
CASE WHEN @OrderByCode = 2 THEN ID END DESC,
CASE WHEN @OrderByCode = 3 THEN Value END ASC,
CASE WHEN @OrderByCode = 4 THEN Value END DESC,
CASE WHEN @OrderByCode = 5 THEN ValueCount END ASC,
CASE WHEN @OrderByCode = 6 THEN ValueCount END DESC,
CASE WHEN @OrderByCode = 7 THEN EDFValue END ASC,
CASE WHEN @OrderByCode = 8 THEN EDFValue END DESC,
CASE WHEN @OrderByCode = 9 THEN CDFValue END ASC,
CASE WHEN @OrderByCode = 10 THEN CDFValue END DESC,
CASE WHEN @OrderByCode = 11 THEN EDFCDFDifference END ASC,
CASE WHEN @OrderByCode = 12 THEN EDFCDFDifference END DESC
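For readers without a SQL Server instance handy, the core computation can be sketched in a few lines of Python. This is not a translation of the procedure above, just the same idea: fit a normal distribution to the sample, compare its CDF with the EDF, and report both the K-S supremum and a Kuiper statistic. Note that the Kuiper statistic here uses the standard definition $D^+ + D^-$, which differs from the max-minus-min formulation in the T-SQL.

```python
import math

def normal_cdf(x, mean, sd):
    # Normal CDF via the error function (plays the role of
    # NormalDistributionSingleCDFFunction in the T-SQL above).
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

def ks_and_kuiper(samples):
    """K-S and Kuiper statistics against a normal distribution
    fitted to the sample itself."""
    xs = sorted(samples)
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    d_plus = max(i / n - normal_cdf(x, mean, sd)
                 for i, x in enumerate(xs, start=1))
    d_minus = max(normal_cdf(x, mean, sd) - (i - 1) / n
                  for i, x in enumerate(xs, start=1))
    return max(d_plus, d_minus), d_plus + d_minus  # K-S supremum, Kuiper
```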
SQLServerSteve

Purely for giggles, here is an implementation of the KS test in Haskell:

module KS where

import Data.List (sort)

computeKS :: (Ord x, Num x) => (x -> Double) -> [x] -> Double
computeKS cdf us =
  let
    n  = fromIntegral (length us)
    xs = sort us
    ds = [ abs (i/n - cdf xi) `max` abs ((i-1)/n - cdf xi) | (i, xi) <- zip [1..] xs ]
  in maximum ds

Argue among yourselves whether this is any more readable...

MathematicalOrchid