Randomness and Randomization

The notion of randomness is a central one in statistical analysis. Random numbers, random time intervals and random spatial point distributions often provide the default model against which alternative models, such as clustering or uniformity, are tested. In the case of random numbers, for example sets of 4 decimal digits in the range 0000 to 9999, randomness means that every possible sequence of digits has an equal chance of being selected, i.e. sequences are drawn from a Uniform distribution. Strictly speaking, a finite set of (published) random numbers is no longer random, in that it has already been generated and is simply one specific example of such a collection. It is more appropriate to define random numbers in terms of a sequence for which there is no means of predicting the next number. Computer software and/or physical devices used to generate random numbers are tested to check the quality (consistency, uniformity) of the numbers generated. It is essential that such generators, most of which produce pseudo-random rather than truly random numbers, are as close as possible to truly random, and that control is available (if necessary) over the initialization of the generating process. The latter is important in order to specify whether a particular set of random numbers is required (for repeated experiments using the same sequence) or, as is more common, a different set is required every time the procedure is called. This typically requires the inclusion of a seed value, either as a static input that can be altered as required, or derived from some varying quantity, such as the number of milliseconds since 00:00hrs.
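
As a minimal illustration in R (assuming base R's default pseudo-random number generator), the seed may either be fixed, to reproduce the same sequence on every run, or derived from the system clock:

# Fixed seed: the same sequence is produced on each run (repeatable experiments)
set.seed(42)
runif(5)

# Clock-based seed: a different sequence on each run
set.seed(as.integer(Sys.time()))
runif(5)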

Randomness in a temporal or spatial context is often defined in terms of a Poisson process. The assumption made is that the probability of an event occurring at any particular place and/or time is very small, but the mean number of events over the time period or spatial region is well defined and approximately constant. For example, an average of 2 plants of a specific type might be observed per 100 square meter subsection in a grid of 100 square regions (quadrats) covering a 1 hectare (10,000 sq.m.) sample region, but the probability of a plant being found at any specific location in the region is very small. Likewise, the number of failures of light bulbs, computer chip sets or other devices per unit of time (the failure rate) might be measured and observed to be relatively constant, e.g. 2 per week, but the probability of a failure at any particular moment or brief interval is extremely small.
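
The sketch below, again in R, simulates such quadrat counts and evaluates the corresponding Poisson probability (the seed is arbitrary; the 100 quadrats and mean of 2 are taken from the example above):

set.seed(1)
counts <- rpois(100, lambda = 2)   # simulated plant counts in 100 quadrats, mean 2 per quadrat
mean(counts)                       # sample mean, close to the specified rate of 2
dpois(0, lambda = 2)               # probability that a given quadrat contains no plants, approx 0.135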

If a finite number of objects (cards in a pack, numbers in a list, census districts in a State) are given, the sequence in which they occur may be randomized by assigning each object an integer value, and then randomly shuffling, or permuting, the set of values. This can be performed manually (e.g. shuffling a pack of cards) but is most effective when performed programmatically. Thus, for example, the R function sample(y) performs a random permutation of the vector y each time it is called (i.e. this is a form of sampling without replacement). If the number of elements, n, in y is not large, there is a substantial chance that the same permutation will be observed more than once. For example, with 5 objects there are only 5!=120 permutations, so a set of 20 random permutations might well display recurrent sequences. As n increases the chance of repeated sequences diminishes rapidly, assuming the number of permutations generated is not excessively large. The generation and use of random numbers and random permutations is discussed more fully in the next subsection.

Statistical tests can utilize randomization as a means of obtaining a reference distribution, using some form of Monte Carlo simulation in which repeated random permutations are used to estimate the exact reference distribution. For example, when examining whether the difference between two mean values is statistically significant, one can use approximate techniques such as Student's t-test, or one can combine the two datasets used to obtain the observed mean values, randomly draw samples from the pooled data that have the same sizes as the original samples, and compute the difference in the sampled mean values. Repeating this exercise for every possible partition (if computationally feasible) would yield an exact reference distribution, against which the probability of the observed difference could be compared. However, for all but the smallest of datasets this computation is not feasible, and therefore a set of random samples (e.g. 10,000) is taken and an approximate reference distribution created. This is an example of computational statistics and inference rather than the application of classical statistical analysis.
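
A minimal sketch of such a Monte Carlo permutation test in R (the two samples, their sizes and the choice of 10,000 permutations are illustrative assumptions):

set.seed(123)
x <- rnorm(15, mean = 0)                    # first sample
y <- rnorm(12, mean = 0.5)                  # second sample
obs <- mean(x) - mean(y)                    # observed difference in means
pooled <- c(x, y)
n <- length(x)
diffs <- replicate(10000, {
  perm <- sample(pooled)                    # random permutation of the pooled data
  mean(perm[1:n]) - mean(perm[-(1:n)])      # difference under random relabeling
})
mean(abs(diffs) >= abs(obs))                # approximate two-sided p-value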

Closely associated with random number generation and random permutations is the question of random sampling, as previously discussed in the section covering sampling and sample size. Pure random sampling involves selection of objects, events, locations or times at random from a larger set, often an infinite set. Many variations of random sampling exist, notably stratified random sampling, in which the population is first divided into classes or strata, and then random samples are drawn from each stratum, usually with some variation in sample size that reflects differences in the size and/or variability of the strata.
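
For example, a simple random sample and a stratified random sample might be drawn in R as follows (the population frame, stratum sizes and 10% sampling fraction are purely illustrative):

set.seed(99)
pop <- data.frame(id = 1:1000, stratum = rep(c("A", "B"), c(700, 300)))

# Simple random sample of 100 units from the whole population
srs <- pop[sample(nrow(pop), 100), ]

# Stratified sample: draw 10% of the units from each stratum separately
strat <- do.call(rbind, lapply(split(pop, pop$stratum),
                 function(s) s[sample(nrow(s), round(0.1 * nrow(s))), ]))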

The term randomization usually refers to procedures in which objects (or experimental units) are assigned to treatment groups (or blocks) at random. This procedure is used to minimize bias in designed experiments, such as medical and agricultural trials. As with random sampling, randomization in experimental design may not involve pure random assignment, but can involve some degree of grouping or blocking (e.g. randomized block designs). Other issues that arise include whether the randomization process is static or dynamic, whether it requires sequential, clustered or hierarchical implementation, whether unequal allocations can be handled, and whether the method is practical to implement. The application of randomization is discussed further in the section describing the design of experiments.
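
As a simple illustration in R, 20 experimental units might be randomly assigned to two equal-sized groups (the number of units and the group labels are assumptions for the example):

set.seed(7)
groups <- sample(rep(c("treatment", "control"), each = 10))  # balanced random assignment
data.frame(unit = 1:20, group = groups)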

The subsections that follow discuss aspects of randomness and its use in statistical methods in more detail. Initially we discuss random numbers and random permutations. We then examine how the mean values of random samples drawn from different distributions tend to a common distribution, the Normal, in what is known as the Central Limit Theorem. The final two subsections address rather more sophisticated aspects of randomness: so-called random walks, and the related topic of Markov Processes.
