Rank correlation

Rank correlation statistics are non-parametric procedures that rely on the ordering, or ranking, of data consisting of pairs of measurements, rather than on the original data measurements themselves (if these exist). Although ranking data involves a loss of information, it has some distinct advantages, notably its robustness (extreme values have a relatively small impact on the results, whereas they have a large impact in product moment correlation) and its lack of dependence on the form of the underlying data distribution (i.e. its non-parametric form). Two forms of rank correlation coefficient are widely available: those attributed to Spearman and to Kendall.

Spearman's Rank Correlation, ρ

Given a set of paired observations (xi,yi), i=1,2,...,n, n>2, let (ui,vi) be the ranks of the (xi,yi), such that the largest xi is assigned rank 1, the second largest rank 2, etc., and likewise for ranking the yi using the set {v}. The ranking could equally well be from the smallest to the largest in both cases, as this will not alter the result of the computation. Compute the difference between the rankings of the sets as di=(ui-vi); then Spearman's rank correlation coefficient is:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$$

Charles Spearman, after whom this statistic is named, was an academic psychologist who introduced a number of statistical techniques in the course of his work on the measurement of human intelligence. In the case of the rank correlation coefficient, Spearman actually devised a slightly different formulation, but this is not the measure now used.
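The difference-based calculation can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: it assumes there are no tied values, the function name `spearman_no_ties` and the sample data are invented for the example, and rank 1 is given to the largest value as in the description above.

```python
def spearman_no_ties(x, y):
    """Spearman's coefficient from rank differences; assumes no tied values."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length sequences with n > 2")

    def ranks(values):
        # Rank 1 goes to the largest value, rank n to the smallest.
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    u, v = ranks(x), ranks(y)
    s = sum((ui - vi) ** 2 for ui, vi in zip(u, v))  # S = sum of squared d_i
    return 1 - 6 * s / (n * (n * n - 1))


# Hypothetical data, chosen only to illustrate the arithmetic.
x = [10, 20, 30, 40, 50]
y = [12, 25, 22, 48, 41]
print(spearman_no_ties(x, y))
```

For these data the ranks are u={5,4,3,2,1} and v={5,3,4,1,2}, so the differences are 0,1,-1,1,-1, their squares sum to 4, and the coefficient is 1-24/120 = 0.8.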

If the source data contains values that are equal, the ranks are tied, and each tied value is assigned the average of the ranks it would have received had there been no ties. For example, if the source data contained x-values 12,5,5,3,1 the nominal ranking would be 1,2,3,4,5 and the adjusted ranking would be 1,2.5,2.5,4,5. With adjusted ranks the standard product moment correlation formula must be used, applied to the ranked values rather than to the original data. The resulting statistic has a range [-1,1]; under the hypothesis of no association it has a variance of 1/(n-1) and all odd moments are 0. The formula is:

$$r_s = \frac{\sum_{i}(u_i-\bar{u})(v_i-\bar{v})}{\sqrt{\sum_{i}(u_i-\bar{u})^2\,\sum_{i}(v_i-\bar{v})^2}}$$

where the ui and vi are the ranked values and ū and v̄ are the means of these ranked values. The question of what value of rs is statistically significant presents considerable problems. Modern software packages, such as SPSS and SAS, typically compute the statistic and then provide a probability level based on treating the transformed variable:

$$t = r_s\sqrt{\frac{n-2}{1-r_s^2}}$$

as coming from a t-distribution with (n-2) degrees of freedom (Kendall et al., 1939, p261, [KEN2]). However, with small values of n there is only a very limited range of possible values that the di can take, and hence the sum of squared differences, S=Σdi², can only take a finite set of (even) values, which in turn limits the possible values of rs. For example, with n=5, x={1,2,3,4,5} and y={all possible permutations of x}, the possible values of S are:

{0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40}, i.e. 21 values in all. In fact the number of values that S can take is:

$$\frac{n(n^2-1)}{6} + 1$$

which corresponds to the value 21 just illustrated for n=5. With a complete enumeration there are only 120 permutations (5!) to consider, so each value of S has an associated fixed frequency, and hence a defined probability, under this fixed permutations model. However, should the observed value of the coefficient be compared to all possible permutations or to a large number of random permutations of x? Assuming the latter model, the frequency distribution of r remains far from smooth or Normal for n<8, but for n≥8 the frequency distribution becomes smoother and increasingly close to the Normal distribution (which it approaches as n→∞). Another model against which observed values of r could be compared is one in which tied rankings occur, which greatly increases the number of possible values r can take.

Kendall et al. (1939 [KEN2]) obtain the distribution of S for n≤8 under strict permutation conditions and identify the value of using the t-distribution to determine probability levels, as noted above, for n>6. The authors also note the high level of association between Spearman's rank measure and the Kendall measure, described below, which has a simpler sampling distribution and is recommended as an alternative.
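The tie adjustment, the product moment form of the coefficient, the t-transformation and the permutation distribution of S can be illustrated together in a short Python sketch. The function names are invented for this example, rank 1 is assigned to the largest value, and the code makes no attempt to reproduce the exact procedures used by SPSS, SAS or any other package.

```python
import math
from itertools import permutations


def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j hold ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Product moment correlation applied to the tie-adjusted ranks."""
    u, v = average_ranks(x), average_ranks(y)
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den  # undefined (division by zero) if either set is constant


def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), referred to t with n - 2 d.f."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def s_distribution(n):
    """Frequency of each S = sum of squared d_i over all n! permutations."""
    base = range(1, n + 1)
    counts = {}
    for perm in permutations(base):
        s = sum((a - b) ** 2 for a, b in zip(base, perm))
        counts[s] = counts.get(s, 0) + 1
    return counts


print(average_ranks([12, 5, 5, 3, 1]))   # [1.0, 2.5, 2.5, 4.0, 5.0]
dist = s_distribution(5)
print(len(dist), sum(dist.values()))     # 21 120
```

Dividing each frequency in `dist` by 120 gives the exact probability of each value of S under the fixed permutations model for n=5. The full enumeration is only practical for small n, since the number of permutations grows as n!.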

Kendall's Rank Correlation, τB

Kendall's rank correlation computation has similarities with Spearman's approach, but does not use the numerical rankings directly. Instead it considers the set of pairwise orderings of the first set of values, and compares this with the set of pairwise orderings of the second set of values. It then counts the number of such pairs that appear in only one or other set and not both (this is known as the symmetric difference, Δ), and uses this value to compute the statistic. A simple example, drawn from Abdi (2007 [ABD1]), makes the procedure clearer:

Two wine experts are asked to arrange four wines {a,b,c,d} in order of quality. The first expert proposes the arrangement {a,c,b,d} whilst the second proposes {a,c,d,b}. With n=4 wines there are n(n-1)/2 = 6 pairwise arrangements of each expert's ordering. These two sets of arrangements are:

P1={[a,c],[a,b],[a,d],[c,b],[c,d],[b,d]} and P2={[a,c],[a,d],[a,b],[c,d],[c,b],[d,b]}

In these arrangements the ordered pairs [b,d] and [d,b] each appear in just one of the sets, so the symmetric difference is Δ=2. Both arise from the one pair of wines, b and d, that the experts order differently; such a pair is said to be discordant, and the remaining pairs are said to be concordant. The Kendall statistic is simply the symmetric difference between these two sets, standardized by the number of possible pairs and adjusted to provide a range from [-1,1], where 1 equates to perfect positive correlation and -1 to perfect negative correlation:

$$\tau = 1 - \frac{\Delta}{n(n-1)/2} = 1 - \frac{2\Delta}{n(n-1)}$$

For the example above the statistic yields τ = 1-2/6 = 2/3. Rather than perform the permutations as illustrated above, the sets can be numerically ranked taking the first set as the controlling sequence, to give:

R1={1,2,3,4} and R2={1,2,4,3}

From this ranking we can then easily calculate the number of concordant pairs (Nc) and the number of discordant pairs (Nd). Because the first series is in strict order all pairs are concordant, i.e. are in their natural sequence. However, the second set includes entries that are out of sequence. Each pair in the correct sequence is assigned a value of +1 and each pair out of sequence is assigned a value of -1. For our example we have:

(1,2) +1, (1,4) +1, (1,3) +1, (2,4) +1, (2,3) +1, (4,3) -1, so Nc=5 and Nd=1. The Kendall formula for this method of computation is:

$$\tau = \frac{N_c - N_d}{n(n-1)/2}$$

again yielding the result, τ = 2/3. Different packages perform this computation in various ways, but should yield the same result. Adjustments are made to the formula in cases where ties in the rankings exist.
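Both routes to τ for the wine example can be checked with a short Python sketch. The function names are invented for illustration, the code assumes there are no ties, and it is not intended to mirror how any particular package performs the computation.

```python
from itertools import combinations


def ordered_pairs(ordering):
    """All pairs (a, b) with a placed before b in the given ordering."""
    return set(combinations(ordering, 2))


def tau_symmetric_difference(order1, order2):
    """tau = 1 - 2*delta / (n(n-1)), delta = size of the symmetric difference."""
    n = len(order1)
    delta = len(ordered_pairs(order1) ^ ordered_pairs(order2))
    return 1 - 2 * delta / (n * (n - 1))


def tau_concordance(order1, order2):
    """tau = (Nc - Nd) / (n(n-1)/2), with order1 as the controlling sequence."""
    position = {item: rank for rank, item in enumerate(order1, start=1)}
    r2 = [position[item] for item in order2]  # e.g. [1, 2, 4, 3]
    n = len(r2)
    nc = sum(1 for i, j in combinations(range(n), 2) if r2[i] < r2[j])
    nd = n * (n - 1) // 2 - nc  # no ties, so every pair is one or the other
    return (nc - nd) / (n * (n - 1) / 2)


expert1 = ["a", "c", "b", "d"]
expert2 = ["a", "c", "d", "b"]
print(tau_symmetric_difference(expert1, expert2))  # 2/3, from delta = 2
print(tau_concordance(expert1, expert2))           # 2/3, from Nc = 5, Nd = 1
```

The two functions agree because each discordant pair of items contributes two ordered pairs to the symmetric difference, so Δ=2Nd when there are no ties.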

To evaluate the significance of this statistic we can enumerate all possible arrangements of 4 objects and compute the τ value for each. The first arrangement of 4 objects can be regarded as fixed, ranked as 1,2,3,4. The second set could be any of the 4! permutations of these four rankings, so 24 in total. If we compute the τ value for each arrangement we find 4 arrangements that yield a value ≥2/3. These are {1,2,3,4}, which is perfect correlation with a τ value of 1, plus three arrangements with a τ value of 2/3: {1,2,4,3}, {1,3,2,4} and {2,1,3,4}. Thus the probability of seeing a correlation of 2/3 or greater is 4/24=1/6 (about 17%), so although the two wine experts appear to have broadly similar rankings of the wine, this level of agreement is not significant at the 5% level. In fact, with only 4 wines, only perfect agreement or perfect disagreement would be significant at the 5% level, each having a probability of 1/24.

With n=4 all the probabilities can be computed manually, but as n increases enumerating the arrangements becomes more difficult. Fortunately for n>10 the sampling distribution of the statistic is close to the Normal distribution, with mean μτ=0 and variance στ²=2(2n+5)/(9n(n-1)); equivalently, the difference Nc-Nd has variance n(n-1)(2n+5)/18. This enables a simple z-transform to be made, z=τ/στ, and the resulting value to be compared with the percentage points of the Normal distribution in order to estimate the probability of the result. Where the rankings include tied values the formula for the variance is adjusted; individual software packages provide the details of their precise computational method and the adjustments made for ties. For n≤10, with no ties, the following table can be used:

Critical values of Kendall's τB for α=0.05 and α=0.01

n         4       5       6       7       8       9       10
α=0.05    1       0.8000  0.7333  0.6190  0.5714  0.5000  0.4667
α=0.01    --      1       0.8667  0.8095  0.7143  0.6667  0.6000
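The exact probabilities for small n, and the Normal approximation for larger n, can be sketched in Python as follows. The sketch rests on several assumptions: there are no ties, the probabilities are one-tailed (which is consistent with the 1/24 quoted above for n=4), and the function names are invented for the example. Instead of listing all n! permutations it counts permutations by their number of discordant pairs. The z-score uses the variance of τ itself, 2(2n+5)/(9n(n-1)), which is n(n-1)(2n+5)/18 divided by the square of the number of pairs.

```python
import math


def inversion_counts(n):
    """counts[k] = number of permutations of n items with k discordant pairs."""
    counts = [1]
    for m in range(2, n + 1):
        # Inserting the m-th item can add 0..m-1 new discordant pairs.
        new = [0] * (len(counts) + m - 1)
        for k, c in enumerate(counts):
            for extra in range(m):
                new[k + extra] += c
        counts = new
    return counts


def upper_tail_probability(n, nd):
    """P(tau >= observed) when the observed ranking has nd discordant pairs."""
    counts = inversion_counts(n)
    return sum(counts[: nd + 1]) / math.factorial(n)


def critical_tau(n, alpha):
    """Smallest attainable tau with one-tailed probability <= alpha, or None."""
    pairs = n * (n - 1) // 2
    best = None
    for nd in range(pairs + 1):
        if upper_tail_probability(n, nd) <= alpha:
            best = (pairs - 2 * nd) / pairs
        else:
            break
    return best


def z_approximation(tau, n):
    """z = tau / sigma_tau, intended for larger n with no ties."""
    variance = 2 * (2 * n + 5) / (9 * n * (n - 1))
    return tau / math.sqrt(variance)


print(upper_tail_probability(4, 1))  # 4/24, the wine example
print(critical_tau(5, 0.05))         # 0.8
print(critical_tau(4, 0.01))         # None: no value reaches the 1% level
```

Under these assumptions the first few columns of the table can be checked by hand: for n=5, for example, 5 of the 120 permutations have at most one discordant pair, giving a probability of about 0.042 for τ≥0.8.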

References

[ABD1] Abdi H (2007) The Kendall Rank Correlation Coefficient. In: Salkind N, ed. (2007) Encyclopedia of Measurement and Statistics. Sage Publications, Thousand Oaks, CA, USA. Available from: http://www.utdallas.edu/~herve/Abdi-KendallCorrelation2007-pretty.pdf

[BES1] Best D J, Roberts D E (1975) Algorithm AS 89: The upper tail probabilities of Spearman's rho. Applied Statistics, 24, 377-379

[BES2] Best D J, Gipps P G (1974) Algorithm AS 71: Upper tail probabilities of Kendall's tau. Applied Statistics, 23, 98-100

[KEN1] Kendall M (1938) A New Measure of Rank Correlation. Biometrika, 30, 81-93

[KEN2] Kendall M G, Kendall S F H, Babington-Smith B (1939) The distribution of Spearman's coefficient of rank correlation in a universe in which all rankings occur an equal number of times. Biometrika, 30, 251-273

[SPE1] Spearman C (1904) The proof and measurement of association between two things. Amer. J. Psychol., 15, 72-101

[SIE1] Siegel S (1956) Nonparametric statistics for the behavioral sciences. McGraw-Hill, New York