Random permutations

<< Click to Display Table of Contents >>

Navigation:  Randomness and Randomization >

Random permutations

We have described the notion of random permutations in the Introduction to this topic and we provided the example of the R function: sample(y), which performs a random permutation of the vector y each time it is called (i.e. this is a form of sampling without replacement). The MATLab function, randperm, and the Mathematica function, RandPerm, perform a similar function, but simply permute the first n integers. Essentially, given a vector of ordered integers, such as y={1,2,3....,15}, these functions will generate a random permutation of the vector. For example, producing the sequence: [6, 13, 14, 8, 5, 2, 1, 9, 12, 4, 10, 7, 3, 11, 15]. Each time the permutation function is called a different set of permutations will be generated, but a set of m such permutations will not necessarily be unique — there will always be an m/n! chance that the sequence will recur. For large n and n>>m this is not a significant issue.

Random permutations are used in a number of different ways. In their basic form they can be used as a form of shuffling, or randomizing, the order in which a specific set of events or experiments are conducted. Likewise, they can be used to randomly assign treatments to units grouped into blocks (see further, randomized block designs), in order to remove the effect of block differences. Most statistical software packages provide such functionality built into their experimental design and random sampling support facilities. For example, in the case of R, the sample() function is more general than indicated above. It supports two additional arguments: size and replacements. By default size is the number of elements in the vector, but if it is a smaller number, n, a sample is selected (e.g. it could be the first n from a random permutation of the vector). The replacements argument allow one to specify whether the sampling is with or without replacement — if with replacements, elements may be sampled more than once. In the particular case of R, the function can also support weighted probabilities, so that the assumption of equal probability weighting is altered to reflect the weighting pattern chosen.

An interesting question arises when random permutations produce a valid pattern that is, on separate grounds, felt unsuitable for use. For example, consider the case where three treatments, A, B and C, are to be administered to samples of 5 patients. In order to avoid concerns about effects due to the order in which the treatments are administered, or perhaps to conceal knowledge of the treatments applied from the patients and/or researchers, the team decide that the order be randomized. The sequence {A,B,C} is randomized 5 times to produce the set of sequences to be used. However, suppose the result is the sequence {A,B,C} in every case — although this is a perfectly valid result, arising at random, what should the researchers do? A number of approaches to this kind of problem have been suggested. One is simply to re-randomize, but this implies some (small) level of non-randomness about the selection process; a second is to define the random permutation or selection process as to eliminate certain sets of undesirable cases; and a third approach is to re-examine the design and size of the problem, to see whether changes made can limit the chance of such issues arising. Topics such as these form part of the broader issue of Design of experiments, which we cover later in this Handbook.

A somewhat different use of random permutations is to regard an observed pattern of data as a sample from a population that comprises the same units and values, but which is considered could have occurred in a different arrangement. For example, the observed incidence of a particular illness will generally vary between the counties or districts within a State or Province. The pattern of variation often shows a degree of similarity in values for neighboring areas. The degree to which such patterns exist can be measured using a metric known as the spatial autocorrelation (essential a correlation measure applied to spatially arranged datasets). To test whether the observed value of the measure is significant, one approach is to compute the statistic for a large number of permutations of the values assigned to areas. Each time the values are assigned randomly across the areas the autocorrelation measure is re-calculated, and after 1000 or 10,000 such calculations a range of values will have been obtained. These can be arranged in size order and the observed value compared with this ordering. If p% of the permuted values are greater than the observed value, then we can conclude (with this approach) that there is a probability p% of seeing a value as large as the observed or larger. Similar pseudo-probability distributions can be produced using random permutations in many application areas, notably when analytic distributions are not available, or where the requirements for a specific analytical test are not met. An example is found in the analysis of cross-tabulated data — if a statistic is computed from such tabulated data based on the observed data, the same statistic could be computed after permuting the rows and columns of the table, and by repeating this process a large number of times, a pseudo-probability value for the observed statistic can be generated.