<< Click to Display Table of Contents >>
## Random numbers |

It is possible to create sequences or sets of Real numbers that are regarded as random. By this is meant that the chance of any particular number in a given range being created or selected is identical to that of any other number in that range. For example, with a 6-sided die, we expect the numbers 1, 2, 3, 4, 5, and 6 all to be rolled with equal frequency if the die is true and no tricks are played by the person rolling the die. The same applies for a roulette wheel, a coin being tossed, or a completely shuffled pack of cards (although in this last case each of the card face values will arise 4 times in 52 or 1 in 13). In practice this does not happen — if we keep rolling a die or spinning the roulette wheel each value will appear a certain number of times (with a given frequency) which will tend to show an even distribution of frequencies, but will not be exactly even. Likewise, computer systems will often provide ‘random numbers’ which are generated according to an algorithm with nearly equal frequency behavior, but in this case the numbers are not truly random. Such numbers are called pseudo-random, and are often generated for the interval [0,1]. So, for example, in Excel if you enter =rand() into a cell, a random number like 0.3419 will be displayed. If you create a row or column with =rand() in each cell, a series of different values will be created. New random numbers are produced whenever the spreadsheet is re-calculated or if the function key F9 is pressed. Most computer software packages offer such facilities, and pseudo-random numbers in any range can be generated by multiplying the results provided by an appropriate expression or value. In Excel's case there are known problems with some versions of its random number generation (e.g. in some implementations for Macintosh computers), so over-reliance on the numbers generated is not recommended. If in doubt, for a project that relies heavily on the quality of computer generated random numbers, large test sets should be created and their quality examined carefully. Extensive research into the generation and quality of pseudo-random numbers has been carried out over many years, and a range of random number generation (RNG) algorithms and initialization (seed selection) procedures have been devised (see further, Knuth, 1997 [KNU1]). Some packages, such as the RNG function in R, let you choose from a wide range of alternative methods and seeds, including providing your own procedure. R, SPSS and other packages offer the so-called Mersenne Twister procedure (Matsumoto and Nishimura, 1998 [MAT1]), which has excellent performance properties. Software, like Maple, that is designed to have arbitrary precision as opposed to the finite precision arithmetic of most packages, is well-suited to producing and handling large random numbers. The Maple function rand() will generate a random number of almost any size, for example, as an input to a cryptographic processes such as RSA or Rijndael.

The R library includes the following warning note:

Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 232 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)

In the example below, Excel was used and 20 columns from rows 1 to 20 with random numbers were filled, each cell containing =rand()*10. The average value in each column was then calculated, i.e. the average of the 20 values in that column, by dividing each column total by 20. The results are shown in the bar chart below. We would expect the average (i.e. arithmetic mean value) to be 5.0. However, as can be seen (which shows the mean values for each of the 20 columns) it varies from under 4 to over 6.

Random numbers, average values in range 0-10

The actual pattern of frequencies in this case can be seen by counting the number of values generated in the ranges 0-1, >1-2, >2-3 etc. up to >9-10. This grouping of values is a form of classification of our results into 10 categories that we shall number 1 to 10. We can plot these categories as a bar chart showing frequencies rather than values, and this form of chart is called a histogram (from the Greek, histos, a mast):

Random numbers, sampled frequency of values in range 0-10

This chart shows that with 400 randomly generated numbers in the range [0,10] roughly 40 will be found in each unit interval, but there is variation in these frequencies. The bars in this chart all have the same width, because each interval or class has a width of 1 unit. This is the normal arrangement, but if the intervals are not equal the widths should reflect this, in proportion.

To generate random numbers that reflect other probability distributions, such as the Normal or Poisson distributions, a similar approach is often utilized, with random Uniform numbers being generated and these are then used to obtain random values from the target distribution through a look-up procedure using the cumulative distribution of the target, as described in the section below on random sampling. This is known as the inversion method, as it uses the inverse of the cumulative frequency distribution to generate the random numbers. In practice specific functions that provide random values from a wide selection of distributions are provided in most statistical and mathematical packages. Typically these provide random values for distributions such as the Uniform, Normal, and Gamma distributions.

Because these numbers are program-generated there is a risk that the same set of ‘random’ numbers will be created next time you use the spreadsheet or run the program, so in a sense these would no longer be random. The usual solution to this is to allow you to specify a different start value (or seed) each time the function is called — often the number of seconds or milliseconds recorded by the computer since midnight is used for this purpose. Other computer-related variables that are often used in the generation of random numbers include: the current date-time in milliseconds; the current system status; and the current system memory usage. Interactive programs, such as those that use Microsoft Windows or X-Windows interfaces, may utilize the sequence and timing of recent keystrokes or mouse movements as part of their seed selection process.

If we examine the frequency pattern of the average values from a large number of samples of 20 random numbers in the range [0,10] we obtain a remarkable result. Instead of finding an even pattern of averages, as was obtained in the previous subsection, we find that most of the values are close to the expected average of 5, with fewer results being found as we look above and below this value (see chart, below). The chart shows the number of averaged values, in a sample of multiple sets of data, in evenly spaced ranges (the upper boundary of the ranges are shown). What is even more extraordinary is that this kind of pattern is found whenever the average values of large independent random samples are obtained from any underlying pattern, not just one where all the values are equally likely, or uniform.

Averages of random numbers

Taking averages of measured datasets changes the data into a well-defined and well-understood pattern of the kind we have just seen. This new pattern tends to be the shape of a bell, and follows the form of the Normal Distribution. The fact that datasets exhibit this property when averages are taken is known as the central limit theorem, CLT, and is one of the most important discoveries in the history of mathematics and statistics. More specifically, the datasets must be drawn from independent samples that are identically distributed (iid) and have finite variance, although it is accepted that relaxing of these conditions is possible. The term was introduced by the German mathematician, Polya, as recently as 1920, though has its origins in the work of Abraham De Moivre discussed in our introductory section on Applications. A number of generalizations of CLT exist (GCLT), including permitting weak dependence between the samples (if the dependence is sufficiently weak the asymptotic distribution is still N(0,1); and extension to distributions that have infinite or near infinite variance (for example, the Pareto distribution with α<2). In the latter case the distribution of the average values does not tend to the Normal (in general) but to some other, finite stable distribution. In this case, as α approaches 2 the stable distribution tends to the Normal, as expected under standard CLT.

To re-emphasize this result, the exercise above has been repeated but in this case we created 20 columns of 1000 random numbers (20,000 numbers) using a non-uniform pattern (the Poisson distribution with a mean value of 2). The result is a set of random numbers, 0,1,2,3…, with a mean value of approximately 2, for which the frequencies are far from regular (see the top chart below) — lower values such as 0, 1 or 2 are much more likely than values of 5 or above. The average spread (standard deviation) for the Poisson is √mean so √2 in this case, i.e. about 1.414… Now, when we add across the 20 columns and take the 1000 average values, the result is a bell-shaped frequency pattern again (as shown in the lower of the two charts below) despite the fact that the initial distribution is neither uniform nor bell-shaped. What is more, the new distribution has the same mean as the original distribution (i.e. 2) and we can see that the spread is much smaller — only ranging from 1 to 3. This reduction in spread always happens when we apply averaging in this way, and if s is the average spread of the original distribution then s/√n is the average spread in the new distribution, where n is the number of values used in the averaging process. In our example n=20, so the average spread should be around 0.316, and in fact was 0.321.

Random numbers and the central limit theorem

The above examples provide a rather informal description of the Central Limit Theorem. It is useful to state the theorem in a formal manner, although we do so here without providing a proof — a proof of a slightly restricted version of the CLT is provided in Mood and Graybill (1950, [MOO1]):

The term on the right hand side of this expression is the unit Normal distribution, N(0,1), integrated over the range [a,b]. The term on the left is sometimes called the reduced random variable., because the sample mean has undergone a transformation of location and scale (a z-transform). The divisor (scale transformation) is the standard error of the mean. Notice that re-arranging the left hand side of the expression provides the probability that the sample mean lies in the selected range:

It is important to note that these expressions contain the population variance, σ, not the sample variance, s, and when the sample size is not large (e.g. <50) the Normal distribution is not suitable and the t distribution should be used in its place.

Random sampling from a known distribution

In general random sampling commences with the creation of random numbers from a continuous Uniform distribution, as described above. However, if we wish to select random values from a continuous or discrete distribution that is not uniform, an extra step is required. For example, suppose that we have a discrete distribution with probabilities as follows:

P(0)=0.15, P(1)=0.20, P(2)=0.30, P(3)=0.20, P(4)=0.15

This is a generalized discrete distribution, as described into our topic Discrete Distributions. In order to select a series of random values 0,1,2,3,4 that reflect the different probabilities of the distribution, we first compute the cumulative distribution: 0.15,0.35,0.65,0.85,1.00. Then we generate a random uniform number, and check to see which range it falls into, i.e. 0-0.15, 0.15-0.35 and so forth. Depending on the interval identified, the associated entity value is selected. So if the random uniform values were 0.7809, 0.3462, 0.7400, 0.0020 the selected values would be 3,1,3,0. This idea can readily be extended to most discrete and continuous probability distributions.

Let f(x) be a probability density function (pdf) with F(x) as the cumulative distribution (cdf) of this function:

If f(x) is a continuous Uniform distribution (U) on the interval [0,1], f(x)=1 for all x, and by integration, the cumulative distribution function, F(x)=x. Graphically this translates as a diagonal line, as shown in the first diagram below. Selecting a random uniform value from the vertical axis, this translates into an identical uniform value on the horizontal (x) axis (the x-axis range here is the same as the y-axis, so the line is a simple diagonal, but the method generalizes to any range [a,b]). However, in the second graph, which shows the cumulative unit Normal distribution, N(0,1), a random uniform value selected on the vertical axis translates into a random Normal variate in the range (-∞,+∞). This operation can be expressed as x=F-1(U), i.e. the random value x is obtained using the inverse of the cumulative distribution function, with argument being a random uniform value.

This inversion procedure, which is due to von Neumann, can be applied for any continuous probability distribution in order to generate random values from that distribution. In practice the steps are: (i) select a random number, α, in the range [0,1]; (ii) set α=F(x); and (iii) solve for x using the inverse, x=F-1(α), assuming this is available (analytically or numerically). For example, suppose that one wishes to generate random numbers with an Exponential distribution:

We first form the cumulative distribution function, by integration:

and then set α=F(x), and solve for x, producing:

A number of observations should be made regarding this procedure: (i) the horizontal axis in the diagrams above is showing the x-range for the standard parameters. To obtain a range of values that corresponds to a particular set of parameters, e.g. a Normal distribution with mean value 5 and standard deviation 2, these parameters should be used when forming the cumulative function; (ii) given a large sample of random Normal values, {x}, their cumulative distribution should correspond closely (but not precisely) to the form of the curve illustrated; (iii) the procedure assumes that the cumulative distribution function can be obtained analytically; (iv) the vertical axis can be transformed to reflect the form of the cdf being examined, with the resulting plot being a straight line — this is typically how Normal distribution probability plots are presented.

As mentioned above, the generated data, being random samples, will not exactly match the parameters of the generating distribution. It may be possible to create an adjusted set of random values that does match the target parameters exactly. For example, in order to generate a set of random Normal values, with mean 0 and standard deviation 1, one can generate random values from the Normal distribution, compute the mean and variance of these sample values, and then use a z-transform to adjust the data so that the mean and variance are exactly 0 and 1. Adding a fixed value to the result will shift the mean, if required, and multiplying the data by a constant will alter the standard deviation to match the selected constant. Similar adjustments can be be made to random Uniform values if required.

There are alternative procedures to inversion, which can be used where the functional form of the distribution makes this approach difficult or requires excessive compute resources. Typically, such methods still commence with generating a uniform random value, and then use some form of approximation to generate a value with the distributional characteristics sought. For example, the distribution might be approximated by a function that is more convenient, or a bounding function may be used and values selected that lie within this boundary (so-called rejection methods). More recently Monte Carlo Markov Chain (MCMC) sampling has become the preferred approach, especially in connection with Bayesian inference (e.g. within the UK Medical Research Council's BUGS project).

A wide range of built-in random number generators are available with many software packages. For example, with MATLab random numbers drawn from each of the following distributions can be used simply by adding "rnd" onto the end of the function name, e.g. normrnd(), betarnd(), specifying the distribution parameters required: Beta, Binomial, chiSquare/Noncentral chiSquare, Exponential, F/Noncentral F, Gamma, Geometric, Hypergeometric, Lognormal, Negative Binomial, Normal, Poisson, Rayleigh, Student's t /Noncentral t, Uniform, and Weibull. A similar range of function is provided in many other packages, such as Minitab, XLStat, SPSS etc. SPSS adopts the format RV.xxxx as the form for generating a random variable from pdf xxxx. Likewise, in the R environment, the letter "r" preceding a distribution function name is often used to provide random numbers from that distribution — for example the Uniform and Normal functions: runif() and rnorm() in the base package; and rvm() for the von Mises distribution in the circular statistics package (CircStats).

References

[KNU1] Knuth D E (1997) The Art of Computer Programming. Vol 2, 3rd ed.

[MAT1] Matsumoto M, Nishimura T (1998) Mersenne Twister: A 623-dimensionally equi-distributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation, 8, 3–30

[MOO1] Mood A M, Graybill F A (1950) Introduction to the Theory of Statistics. McGraw-Hill, New York

Mathworld: Central Limit Theorem: https://mathworld.wolfram.com/CentralLimitTheorem.html

Wikipedia: Central Limit Theorem: https://en.wikipedia.org/wiki/Central_limit_theorem

see also:

Wikipedia: Copulas: https://en.wikipedia.org/wiki/Copula_%28statistics%29