# Wilcoxon rank-sum/Mann-Whitney U test

The tests described here are non-parametric and can be applied to unpaired or paired datasets. They test to see if two independent samples come from identical continuous distributions with equal medians, against the alternative that they do not have equal medians.

Unpaired experiments

In 1945 Frank Wilcoxon, working for the US fertilizer business American Cyanamid, published a paper describing how the size ordering of data (ranking) could be used in statistical analysis, in place of the data itself [WIL1]. His example data was a set of independent tests on two preparations of fly spray. The test needed to determine which preparation was more effective in killing flies. A total of 8 tests were run on each of two samples, so 16 tests in total. The results presented by Wilcoxon are shown below:

Fly spray tests

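| Sample A: % kill | Sample A: Rank | Sample B: % kill | Sample B: Rank |
|---|---|---|---|
| 68 | 12.5 | 60 | 4 |
| 68 | 12.5 | 67 | 10 |
| 59 | 3 | 61 | 5 |
| 72 | 15 | 62 | 6 |
| 64 | 8 | 67 | 10 |
| 67 | 10 | 63 | 7 |
| 70 | 14 | 56 | 1 |
| 74 | 16 | 58 | 2 |
| | R1=91 | | R2=45 |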

Next to each result he placed a rank based on all 16 tests: the smallest value, 56, is assigned rank 1 and the largest, 74, is assigned rank 16. Tied results share the average of the ranks they occupy, so the two values of 68 are each assigned 12.5 and the three values of 67 are each assigned 10 (rather than 9, 10 and 11). If the two samples were drawn from the same population we would expect the two rank sums to be roughly equal. Wilcoxon then asked "what is the probability of getting a rank sum value, R1 or R2, as low as 45?".

The answer can be obtained by systematically working out how many possible arrangements there are of 8 ranks from 16, which is

$$\binom{16}{8} = \frac{16!}{8!\,8!} = 12870$$

and then counting how many of these arrangements give a rank sum of 45 or less. The smallest possible sum is Smin=1+2+...+8=36 and the largest is Smax=9+10+...+16=100 (the sum of the first n integers is n(n+1)/2, from which the minimum and maximum possible values can be calculated directly). There are 95 arrangements that sum to 45 or less, so the probability is p=95/12870=0.007382. This is a one-tailed result; the two-tailed probability is 0.014763, and on this basis we would generally reject the null hypothesis, i.e. the data suggest that the two samples do not have equal medians. This computation ignores ties, for which an adjustment can be made, but the results are generally very similar.
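The counting argument can be checked by brute force. The following is a minimal Python sketch, not Wilcoxon's own procedure: it enumerates every choice of 8 ranks from the integers 1 to 16 (ties ignored, as in the text) and counts those with a rank sum of 45 or less. The variable names are illustrative.

```python
from itertools import combinations
from math import comb

n_total, n_sample, observed = 16, 8, 45

# Number of ways of choosing which 8 of the ranks 1..16 fall in one sample.
total = comb(n_total, n_sample)

# Count the arrangements whose rank sum is as low as the observed value.
as_low = sum(
    1
    for ranks in combinations(range(1, n_total + 1), n_sample)
    if sum(ranks) <= observed
)

p_one_tailed = as_low / total
p_two_tailed = 2 * p_one_tailed
print(total, as_low, round(p_one_tailed, 6), round(p_two_tailed, 6))
# prints: 12870 95 0.007382 0.014763
```

Full enumeration is only practical for small samples, since the number of arrangements grows very rapidly with sample size.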

Probability levels can also be obtained by Monte Carlo simulation, i.e. by generating a large number of random permutations of the integers 1 to 16 and summing the first 8 (again, this procedure ignores ties). Each random permutation produces a rank sum, and the proportion of these sums that are 45 or less is an estimate of the one-tailed probability sought. Using this method we obtained estimates of 0.014763 +/- 0.00015 for the two-tailed probability (twice this proportion), demonstrating that this procedure can be used to evaluate the probabilities for any desired sample size and each value in the range [Smin,Smax]. By permuting the ranked data including the tied scores, a similar simulation produced an estimated probability of 0.01353 +/- 0.00015.
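A simulation of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the code used to produce the figures quoted above: the function name, the number of iterations and the seed are all arbitrary choices, and ties are ignored.

```python
import random

def simulate_rank_sum_p(n_total=16, n_sample=8, observed=45,
                        n_iter=100_000, seed=1):
    """Estimate P(rank sum <= observed) for a random sample of ranks."""
    rng = random.Random(seed)
    ranks = list(range(1, n_total + 1))
    hits = 0
    for _ in range(n_iter):
        # Drawing n_sample ranks without replacement is equivalent to
        # shuffling all the ranks and summing the first n_sample of them.
        if sum(rng.sample(ranks, n_sample)) <= observed:
            hits += 1
    return hits / n_iter

p_one = simulate_rank_sum_p()
print(p_one, 2 * p_one)  # one-tailed and two-tailed estimates
```

The estimate varies from run to run (and with the seed), with a standard error that shrinks in proportion to the square root of the number of iterations.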

There are a number of aspects of this test that are surprising and warrant comment. The first is that the test makes no assumptions about the form of the distribution of the data, so it is a distribution-free or non-parametric test, and it is robust (it is not sensitive to outliers, because actual values are replaced by ranks). Indeed, the data could simply be ordinal scaled values rather than interval measurements, which is very useful for many types of data.

The second is that extensive research has shown that in almost all cases it performs nearly as well as, or better than, a standard t-test. By this we mean that its efficiency relative to the t-test is high: for large samples it is about 95% as efficient as the t-test when the data are Normal (the case most favourable to the t-test), it is never less than about 86% as efficient whatever the underlying distribution, and for many non-Normal distributions it is more efficient. The robustness and efficiency of the test mean that in almost all instances it is preferable to use the Wilcoxon test rather than a t-test, unless the underlying distribution is known to be Normal or the two samples are known to have been drawn from populations with different distributions.

The third observation is that the test statistic is asymptotically Normal, so for larger samples a simple z-transform will produce an approximately unit Normal variate, and thus a z-test may be applied (see further, below).

Wilcoxon assumed that the same number of replications are run for each sample, i.e. the sample sizes n1=n2. Shortly after his publication two other researchers, H B Mann and D R Whitney (MW), provided an essentially identical test but with the sample sizes no longer assumed to be equal. In practice it is assumed that the two sample sizes are not radically different. The MW or MWW test involves calculating exactly the same rank sums, R1 and R2, and typically using the smaller value to produce their statistic, although either may be selected:

$$U_1 = R_1 - \frac{n_1(n_1+1)}{2}$$

or the equivalent value based on the second set of ranks:

$$U_2 = R_2 - \frac{n_2(n_2+1)}{2}$$

From our earlier observations it can be seen that the statistic is simply the Wilcoxon measure with the minimum possible value subtracted, giving a range from 0. For the Wilcoxon data shown above the smaller rank sum, R2=45, gives U2=45-36=9, and R1=91 gives U1=91-36=55. Note also that the sum U1+U2=n1n2 (here 9+55=64), so the average value of the two alternatives is simply n1n2/2. For large n=n1+n2 the z-transform of the MW statistic U (either U1 or U2) is:

$$z = \frac{U - n_1 n_2/2}{\sqrt{n_1 n_2 (n_1+n_2+1)/12}}$$

Using the simulation procedure described earlier, the frequency distribution of rank sums can be plotted for different sample sizes. The figure below illustrates this with a modest total sample size of 40 using 5000 random permutations; a fitted Normal distribution is also illustrated. It is worth mentioning that the MWW test may be unsatisfactory or fail under certain circumstances, for example if there are an excessive number of ties in the data, or if the sample sizes are very different or very small for one sample.
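As a worked illustration of the U statistics and the z-transform, the sketch below applies the standard large-sample formulas to the fly spray rank sums. The function name is hypothetical, and no correction for ties or continuity is applied; with only 16 observations the Normal approximation is shown purely for illustration.

```python
from math import erfc, sqrt

def mann_whitney_from_rank_sums(r1, r2, n1, n2):
    """Return U1, U2, z and a two-tailed Normal-approximation p-value."""
    u1 = r1 - n1 * (n1 + 1) / 2          # rank sum minus its minimum value
    u2 = r2 - n2 * (n2 + 1) / 2
    u = min(u1, u2)                      # conventionally the smaller value
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_two_tailed = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, u2, z, p_two_tailed

u1, u2, z, p = mann_whitney_from_rank_sums(r1=91, r2=45, n1=8, n2=8)
print(u1, u2, round(z, 3), round(p, 4))
```

For these data U1=55 and U2=9, which sum to n1n2=64 as expected, and the approximate two-tailed probability is close to the exact value obtained by enumeration.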

Figure: Wilcoxon rank sum distribution, simulated for n=40, n1=20

Significance levels and confidence intervals

Many statistical texts provide tabulated values for the MW (or MWW) test, and statistical software packages provide the test as a standard option. The procedures used may or may not be exact (depending on the way they have been implemented, the set of rank sums they opt to choose, and the size of the samples). In many instances current packages (including the R Project implementations) compute exact values for sample sizes up to 20-50 by default, and then switch to random permutation or a Normal approximation for larger samples (the method applied can often be selected by the user). Packages that use the exact permutation procedure with allowance for ties typically report the probability for the fly spray data as 0.01352 (i.e. adjusted for tied values) rather than 0.01476. A review of MWW implementations in a wide range of software packages by Bergmann et al. [BER1] in 2000 concluded that it was almost always safest to use software packages that generated exact probability values, as many did not at that time. Most packages have since resolved this problem.
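The tie-adjusted exact value can be reproduced by enumerating the actual (tied) ranks rather than the integers 1 to 16. The sketch below is one simple way of doing this and is not the algorithm of any particular package; the helper `midranks` is a hypothetical name, and doubling the one-tailed value assumes the symmetry that holds here because the two samples are of equal size.

```python
from itertools import combinations

sample_a = [68, 68, 59, 72, 64, 67, 70, 74]   # % kill, preparation A
sample_b = [60, 67, 61, 62, 67, 63, 56, 58]   # % kill, preparation B

def midranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # 0-based positions i..j-1 share the mean of ranks i+1..j
        rank_of[ordered[i]] = (i + 1 + j) / 2
        i = j
    return [rank_of[v] for v in values]

ranks = midranks(sample_a + sample_b)
r_a = sum(ranks[:len(sample_a)])
r_b = sum(ranks[len(sample_a):])
observed = min(r_a, r_b)

# Every way of assigning 8 of the 16 observed ranks to one sample.
subsets = list(combinations(ranks, len(sample_b)))
as_low = sum(1 for s in subsets if sum(s) <= observed)
p_two_tailed = 2 * as_low / len(subsets)
print(r_a, r_b, as_low, round(p_two_tailed, 5))
# prints: 91.0 45.0 87 0.01352
```

With ties taken into account 87 of the 12870 arrangements have a rank sum of 45 or less, compared with 95 when ties are ignored.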

Paired comparisons

Wilcoxon's 1945 paper also considered cases where the data measurements are paired, for example in a randomized block experiment with 8 replicates of treatments A and B. For each replicate a measurement is made of the results after treatments A and B, the difference in the two values is computed, and these differences are then ranked by absolute size. Where a difference is negative the rank is marked as negative. Wilcoxon then considered the magnitude of the sum of the negative or positive ranks (whichever was less), and provided a table showing the probability of encountering a sum as extreme as that observed. As with the unpaired case, for large n this measure tends to a Normal distribution, so a z-transform will produce a unit Normal variate which can then be used to evaluate the significance of the difference. Writing T for the smaller of the two rank sums and n for the number of pairs:

$$z = \frac{T - n(n+1)/4}{\sqrt{n(n+1)(2n+1)/24}}$$

In most software implementations of MWW the paired comparison version is included as one of the options. The assumptions are essentially the same as for the unpaired case, but no unpaired observations are permitted. The paired comparison can be viewed as an extension of the Sign test that takes account of the relative size of the differences as well as their signs.
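The paired procedure can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, the differences are invented purely to show the mechanics (they are not Wilcoxon's data), zero differences are dropped by a common convention, and the standard large-sample mean n(n+1)/4 and variance n(n+1)(2n+1)/24 are used without a tie correction.

```python
from math import sqrt

def signed_rank(differences):
    """Return W+, W-, T = min(W+, W-) and the large-sample z value."""
    d = [x for x in differences if x != 0]   # drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and abs(d[order[j]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j):
            # tied absolute differences share the mean of ranks i+1..j
            ranks[order[k]] = (i + 1 + j) / 2
        i = j
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_minus = sum(r for r, x in zip(ranks, d) if x < 0)
    t = min(w_plus, w_minus)
    z = (t - n * (n + 1) / 4) / sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w_plus, w_minus, t, z

# Hypothetical differences (treatment A minus treatment B) for 8 replicates.
hypothetical_diffs = [1, -2, 3, 4, -5, 6, 7, 8]
w_plus, w_minus, t, z = signed_rank(hypothetical_diffs)
print(w_plus, w_minus, t, round(z, 3))
```

For these invented values the negative ranks sum to 7 and the positive ranks to 29, which together equal n(n+1)/2=36. With so few pairs an exact table or enumeration would normally be preferred to the z value.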

References

[BER1] Bergmann R, Ludbrook J, Spooren W P J M (2000) Different outcomes of the Wilcoxon-Mann-Whitney test from different statistics packages. The American Statistician, 54(1), 72-77

[WIL1] Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80-83