# Kolmogorov-Smirnov

The Kolmogorov-Smirnov (or KS) tests were developed in the 1930s. The tests compare either one observed frequency distribution, f(x), with a theoretical distribution, g(x), or two observed distributions. In either case the procedure involves forming the cumulative frequency distributions F(x) and G(x) and finding the size of the largest absolute difference between them.

Assumptions: The sample is random (or both samples are random) and independent if two samples are involved. The scale of measurement should be at least ordinal and preferably continuous.

Hypothesis (two-sided case): H0: F(x)=G(x) for all x; H1: F(x)≠G(x) for at least one value of x

Test: Compute the test statistic:

D=sup|F(x)-G(x)|

where sup means supremum, the largest value of a set. With a single sample of size n, critical values of D (or of the scaled statistic Dk, where k=√n) for a given significance level, α, are given in tables or can be computed programmatically (e.g. using Monte Carlo simulation methods).
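As an illustration of the computation, the statistic can be evaluated directly from the ordered sample by checking both sides of each step of the empirical CDF. The following is a minimal Python sketch (the uniform sample and reference distribution are illustrative assumptions, not part of the original text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)          # illustrative sample
x = np.sort(rng.uniform(size=50))
n = len(x)

# Theoretical CDF G(x) evaluated at the ordered sample values
g = stats.uniform.cdf(x)

# The EDF jumps from (i-1)/n to i/n at the i-th order statistic,
# so the supremum must be checked on both sides of each step
d_plus = np.max(np.arange(1, n + 1) / n - g)
d_minus = np.max(g - np.arange(0, n) / n)
D = max(d_plus, d_minus)

# SciPy's one-sample test computes the same statistic
D_scipy = stats.kstest(x, "uniform").statistic
```

The two-sided supremum requires both d_plus and d_minus because the largest gap may occur just below a jump of the EDF rather than at it.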

Strictly speaking, the test as described should not be applied as a goodness of fit test for a single sample against a theoretical distribution whose parameters have been estimated from that same sample. It may, however, be used to compare two observed distributions without this problem arising. The general question of comparing the fit of a sample, or empirical distribution function (EDF), to a theoretical distribution has been the subject of considerable research. Stephens (1974, [STE1]) considers five different cases (which he denotes 0 to 4), of which three are of particular interest:

Case 0: The theoretical distribution is continuous and completely specified. This 'classic' case can be tested using the KS test, but Stephens notes that the appropriate value to use for k is: k=√n+0.12+0.11/√n, and he provides the following percentage point values for Dk: 15%: 1.138; 10%: 1.224; 5%: 1.358; 2.5%: 1.480; and 1%: 1.628

Case 3: The theoretical distribution is a Normal with mean and variance unknown, but estimated from the dataset (i.e. from the EDF). In this case the critical values for Dk are: 15%: 0.775; 10%: 0.819; 5%: 0.895; 2.5%: 0.955; and 1%: 1.035. The appropriate value to use for k is: k=√n−0.01+0.85/√n

Case 4: The theoretical distribution is an Exponential with parameter θ unknown, but estimated from the dataset (i.e. from the EDF). In this case the critical values for Dk are: 15%: 0.926; 10%: 0.990; 5%: 1.094; 2.5%: 1.190; and 1%: 1.308. The appropriate value to use for k is: k=√n+0.26+0.5/√n, and D is modified in this case to (D−0.2/n)
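A sketch of how the Case 3 adjustment might be applied in practice is shown below (Python; the function name and the sample are assumptions introduced for illustration, and the 5% critical value 0.895 is the Case 3 value quoted above):

```python
import numpy as np
from scipy import stats

def stephens_case3(x):
    """Modified KS statistic Dk for Case 3: Normal distribution with
    mean and variance estimated from the sample (Stephens, 1974)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # CDF of the fitted Normal, parameters estimated from the data
    g = stats.norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
    d_plus = np.max(np.arange(1, n + 1) / n - g)
    d_minus = np.max(g - np.arange(0, n) / n)
    D = max(d_plus, d_minus)
    # Scale by k = sqrt(n) - 0.01 + 0.85/sqrt(n)
    return D * (np.sqrt(n) - 0.01 + 0.85 / np.sqrt(n))

dk = stephens_case3(np.random.default_rng(1).normal(10, 2, 200))
# Reject normality at the 5% level if dk exceeds 0.895
```

This is essentially the construction underlying the Lilliefors test mentioned below, with Stephens' scaling in place of a separate table of critical values.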

Stephens tests the KS statistic using these amended expressions and significance levels against alternative goodness of fit tests, some of which are very similar. He concludes that although the KS test has been widely used in this context in the past, it is generally less powerful than some other tests, notably the Anderson-Darling test and the closely related Cramér-von Mises and Shapiro-Wilk statistics.

Most software implementations continue to use the traditional form of the KS statistic, including those in SPSS and R, although some packages (e.g. SAS/STAT) do apply the adjustments recommended by Stephens and others. See also the Lilliefors test.
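For instance, the traditional (unadjusted) form of the statistic is what SciPy's one- and two-sample functions compute; the following minimal sketch, with illustrative samples, shows both usages:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)
b = rng.normal(0.5, 1.0, 200)

# One-sample test against a fully specified N(0,1) -- Stephens' Case 0
one_sample = stats.kstest(a, "norm")

# Two-sample test: compares the two empirical CDFs directly
two_sample = stats.ks_2samp(a, b)
```

With the 0.5 shift in location between the two samples, the two-sample test should report a small p-value, while the one-sample test is applied to data drawn from its own null distribution.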

Example 1: Distribution of trees in a woodland

In the Chi-square test, above, we showed how the distribution of trees in a woodland could be counted using a grid, recording the frequency of a particular species in each grid square. We then compared this frequency distribution to that expected under a Poisson distribution. In that instance the distributions were shown as counts, i.e. the number of squares that had 0 trees, 1 tree, 2 trees etc., where the observed mean was 4 trees per square. The observed counts are shown in the table below, followed by these values scaled so that they sum to 1 (by dividing each value by 25) and then cumulated. The same procedure has been applied to the expected frequencies, and the maximum difference identified. In this example we assume that the mean of the expected distribution has been obtained independently.

| Frequency | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Observed counts | 1 | 0 | 6 | 6 | 2 | 3 | 4 | 2 | 0 | 1 | 0 |
| Cum Observed | 0.04 | 0.04 | 0.28 | 0.52 | 0.60 | 0.72 | 0.88 | 0.96 | 0.96 | 1 | 1 |
| Cum Expected | 0.02 | 0.09 | 0.24 | 0.43 | 0.63 | 0.79 | 0.85 | 0.95 | 0.98 | 0.99 | 1 |

max|O−E| = 0.085

The maximum difference observed is D=0.085, and for large n at a 5% significance level the critical value of the KS one-sample statistic is 1.36/√n; in this case, with n=100, the critical value is 0.136. Our result is smaller than this, so we cannot reject the null hypothesis.
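The calculation above can be reproduced as follows (a Python sketch; the last digit of D may differ slightly from the tabulated 0.085 because the expected frequencies in the table are rounded to two decimal places):

```python
import numpy as np
from scipy import stats

# Observed counts of squares containing 0..10 trees (25 squares in total)
counts = np.array([1, 0, 6, 6, 2, 3, 4, 2, 0, 1, 0])
cum_obs = np.cumsum(counts) / counts.sum()

# Expected cumulative frequencies from a Poisson distribution with mean 4
cum_exp = stats.poisson.cdf(np.arange(11), mu=4)

D = np.max(np.abs(cum_obs - cum_exp))

# Critical value at the 5% level, with n = 100 trees: 1.36/sqrt(n)
critical = 1.36 / np.sqrt(100)
# D < critical, so the null hypothesis is not rejected
```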

Example 2a: Comparing distributions — test dataset

A more usual application of the KS test is comparing two observed distributions to see if they are significantly different. Simple descriptive statistics can suggest that two distributions are very similar, while graphical displays indicate otherwise. For example, the data presented in the two histograms shown below each comprise 75 observations. These two datasets have the same minimum, maximum, mean and median, and almost identical variance (1, 9, 5, 5, 5.6). However, from visual inspection they can be seen to be totally different. In this instance, using the R function ks.test() we obtain a D value of 0.2533 with an approximate p-value of 0.02, so the test does identify the differences as significant. This result is clear from the graphs in this case, but often the differences are more subtle, as shown in Example 2b.

Example 2b: Comparing distributions — insect wingspan data

This example is similar to the previous case, with two distributions to be compared. The means of the distributions are approximately equal and the ranges are similar, but here there are 50 records in sample A and 70 in sample B. The data are the wings.txt sample data from Crawley (2007, [CRA1]), representing the wingspans in mm of two groups of insects of the same species but from different locations. The ks.test() result in this case is: D = 0.2629, p-value = 0.02911. In fact, in this case the variances are also significantly different (see F-tests).

References

[CRA1] Crawley M J (2007, 2015) The R Book. John Wiley & Sons, Chichester, UK

[STE1] Stephens M A (1974) EDF tests for goodness of fit and some comparisons. J Amer. Statistical Assoc., 69(347), 730-737