Markov Chain Monte Carlo (MCMC)


In previous subsections we have described how the generation of a sequence of independent, Uniformly distributed random numbers can be used to sample values from a variety of other continuous distributions, such as the Normal and Exponential. This procedure relies on the availability of an analytical cumulative distribution function, F(x), i.e. it assumes that the integral of the density function, f(x), is available. However, even relatively simple functions (a classic example being exp(-x^2), which has no elementary antiderivative) are not analytically integrable. This raises the question: how can random samples be generated that follow an arbitrary distribution that cannot be readily integrated? A closely related question is the computation of the expected value of a function, h(x), under the distribution f(x), i.e. E(h(x)). For example, the crude moments, E(x^r), are obtained from an integral of the form:

E(x^r) = ∫ x^r f(x) dx
but it is quite possible that the integral cannot be determined analytically. Hence, whilst evaluating f(x) for any x may be trivial, obtaining a random sample of values from f(x), or computing an expectation, could be problematic. Ideally one wants to be able to use the proven approach of generating Uniformly distributed random numbers to obtain a sample that truly represents the target distribution. As we have seen, however, the CDF approach will only work where the density function is analytically integrable. Conceptually we need to select values in proportion to the density in each region - where the density peaks, more samples should be drawn than where the density is negligible. If the samples taken are truly representative of the distribution, estimates of expected values such as the mean will be good; otherwise they will be poor. Various approaches have been suggested over the past 50+ years as to how such problems may be overcome, all of which have pros and cons.
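
To make the connection between sampling and expectations explicit: if a representative sample x_1, x_2, ..., x_n can be drawn from f(x), then any expectation of the form E(h(x)) can be estimated by the corresponding sample average

E(h(x)) ≈ (1/n) Σ h(x_i), summing over i = 1, ..., n

so the problem of evaluating awkward integrals reduces to the problem of generating a representative sample.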

We have seen, in the previous subsection (Monte Carlo Integration), that obtaining estimates for expected values is possible using pure Monte Carlo methods, although the method described was not very efficient in terms of computational load. We also did not provide a means of generating random samples from the target distribution using these methods. Indeed, if we had been able to generate a random sample of values that truly represented the target distribution then we would also have been able to obtain very good estimates of its expected values. The MCMC procedure, devised by Metropolis (see Metropolis et al., 1953 [MET1]), solves this problem. It uses the fact that an ergodic Markov chain will converge to a target distribution irrespective of its initial state. Since we know the target distribution already, we can use this information to specify the transition mechanism of the chain. We first describe the steps of the algorithm (in its simplest form) and then examine the same example as we described in the previous subsection.

The steps are:

1. Select a starting value from the 'support' of the target distribution, f(x). In a simple case the support is a range [a,b], and the starting position could, for example, be set to the mid-point (a+b)/2. Call this value s_old. Set a loop counter limit, as the process must be repeated for a relatively large number of iterations.

2. Select a random value, r, from any convenient and suitable 'proposal distribution' and add this to the previous value used (initially the starting position), thus s_new=s_old+r. For the present, assume the proposal distribution is a convenient symmetric distribution such as the Normal, e.g. N(0,1). This distribution is used to select a step size and direction (positive or negative), so r can be multiplied by a constant factor to ensure the steps are neither too large nor too small.

3. Compute a=f(s_new) and b=f(s_old), and draw a uniform random value, u, from the range [0,1]. Then set c=min{a/b,1}. If u<c then set s_old=s_new, otherwise leave s_old unchanged. Record the current value of the chain at each iteration, i.e. s_new if the move was accepted, otherwise the unchanged s_old.

4. Increment the loop counter and go to Step 2, repeating until the limit is reached.

If we look closely at these steps we can see that Step 1 establishes a starting position for the simulation, in particular a starting position along the x-axis. Step 2 'proposes' a move (left or right along the x-axis) from the current position, and the decision as to whether to accept or reject (AR) the proposed move is determined by the decision rule in Step 3. The decision rule in Step 3 is of the form: "if the magnitude of the distribution function at the proposed location is greater than at the current location, then accept the move; however, if it is less, only accept the proposed move with a probability that matches the ratio of the two magnitudes". The uniform random variable is simply the means of putting this probabilistic selection into action. What this means in practice is: (a) samples are selected which tend to favor larger values of the target function, and (b) over time (a sufficient number of steps) the samples will reflect the overall structure of the target distribution. The latter is not obvious, but follows from the way the rule is defined and the selection of a suitable proposal distribution. It is a result of the fact that the algorithm defines an ergodic Markov chain, which, as we have seen, converges to a fixed vector (in this case, our known target distribution) whatever the initial vector (see further Brooks, 1998 [BRO1], Chib and Greenberg, 1995 [CHI1] and Gilks et al., 1996 [GIL1], for a full discussion of MCMC and how and why it works). In Markov Chain terminology the process generates a stationary distribution that is the same as our target distribution; in Bayesian statistics, it is the posterior distribution of the model parameters. Indeed, it is in the field of Bayesian statistics that MCMC methods have flourished in recent years.
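
As an illustration of Steps 1 to 4, the short Python sketch below (using NumPy) implements this basic Metropolis sampler for any target density f(x) that can be evaluated pointwise. The function name, its parameters and the choice of a Normal proposal scaled by a constant 'step' factor are purely illustrative, and not taken from any particular package.

import numpy as np

def metropolis_sample(f, s_start, n_iter, step=1.0, seed=None):
    # Basic Metropolis sampler with a symmetric Normal proposal, scaled by 'step'
    rng = np.random.default_rng(seed)
    s_old = s_start                               # Step 1: starting position within the support of f
    samples = np.empty(n_iter)
    for i in range(n_iter):                       # Step 4: repeat for a large number of iterations
        s_new = s_old + step * rng.normal()       # Step 2: propose a move left or right
        c = min(f(s_new) / f(s_old), 1.0)         # Step 3: acceptance probability c = min{a/b, 1}
        if rng.uniform() < c:                     # accept the proposed move with probability c
            s_old = s_new
        samples[i] = s_old                        # record the current state, accepted or not
    return samples

Note that the current value of the chain is recorded at every iteration, whether or not the proposed move is accepted; it is this full record of states that converges to the target distribution.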

We now return to the example of the Exponential distribution discussed in the previous subsection. The algorithm above was applied to the problem of obtaining random samples from an Exponential distribution with mean=0.6, without using its cumulative distribution function, i.e. the target distribution f(x) was the Exponential. The starting value for x was taken as the mid-point of the range [0,5], so 2.5. The proposal function used was the Normal, with a step multiplier of 0.1 units. This leaves open the possibility that proposed steps could take the chain into negative values. As such values are not valid for the target distribution a reflection rule was applied: if s_new was negative then its absolute value was taken. It is usual to ignore a substantial initial set of sampled values in order to allow the Markov chain to converge to the target distribution - this is known as the burn-in stage. All samples during this stage are discarded, and data values thereafter are treated as true random samples from the target distribution. The first chart below illustrates this, where a total of 40,000 samples were taken with the first 2,000 being discarded. The histogram of sampled data shows that the samples closely approximate the form of the true Exponential distribution. Since this is a particular simulation run, the mean does not exactly equal 0.6. As shown in the previous subsection, repeated simulation runs will result in a series of random samples whose mean values will themselves show a distribution (asymptotically Normal) around the true mean of 0.6.

The second chart shows the pattern of steps made at the end of the process, which is very similar to the 1D random walk described earlier. This pattern of random fluctuations with no trend is an indication that the process is performing as expected and is 'mixing' quite well, although the small step size means that the chain has to be run for much longer than would be necessary with a larger step size. The performance of the process, once convergent, can sometimes be improved by thinning. This simply involves retaining only every kth sample, which has the effect of reducing the high local autocorrelation between successive samples, but at the expense of discarding the proportion (k-1)/k of the samples.

Figure: MCMC sampling from an Exponential distribution

Figure: Pattern of movements, last 1500 steps in MCMC process
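
As a usage sketch, the fragment below reproduces the Exponential example described above: mean 0.6, starting value 2.5, Normal proposal with a step multiplier of 0.1, reflection at zero, and 40,000 iterations with the first 2,000 discarded as burn-in. The random seed and the thinning interval of 10 are arbitrary illustrative choices, and because the reflection rule keeps all values non-negative the density need only be defined for x >= 0.

import numpy as np

rate = 1.0 / 0.6                                  # Exponential target with mean 0.6

def f(x):                                         # Exponential density, valid for x >= 0
    return rate * np.exp(-rate * x)

rng = np.random.default_rng(1)
s_old = 2.5                                       # mid-point of the range [0, 5]
samples = np.empty(40000)
for i in range(40000):
    s_new = abs(s_old + 0.1 * rng.normal())       # proposed step, reflected at zero
    if rng.uniform() < min(f(s_new) / f(s_old), 1.0):
        s_old = s_new
    samples[i] = s_old

chain = samples[2000:]                            # discard the burn-in samples
thinned = chain[::10]                             # optional thinning: keep every 10th value
print(chain.mean())                               # close to, but not exactly, 0.6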

MCMC methods, with a wide range of diagnostic and enhanced features, are provided in many modern statistical and mathematical software packages. Most implement the extension of the Metropolis algorithm due to Hastings (1970 [HAS1]), which relaxes the requirement for the proposal distribution to be symmetric at the expense of a slightly more complicated decision function (Step 3, above). For example, MATLAB now includes a function, mhsample(), that performs Metropolis-Hastings sampling, with selectable proposal pdf, target pdf, burn-in and thinning parameters. In SAS/STAT this type of functionality is supported in a number of procedures, including a specific MCMC procedure that provides a wide range of options and diagnostics (see SAS/STAT User Guide, 2009, Ch.7, for more details [SAS1]).

MCMC methods are a central component of most modern Bayesian inference software, notably the BUGS project, which implements models using the Gibbs sampler. Discussion of this procedure is beyond the scope of this Handbook, but it is described in the original paper on the method by Geman and Geman (1984 [GEM1]). Essentially, Gibbs sampling is a variant of the Metropolis sampler in which the proposal distribution is selected so that it exactly matches the posterior conditional distributions for each parameter in a model.

SPSS/PASW includes MCMC methods purely in connection with imputing missing data values (for a thorough discussion of such methods, see van Buuren, 2007 [VAN1]). Essentially the idea in this application area is to draw random samples from a distribution of plausible values (using MCMC), and to use these to provide estimates of the missing data values. By performing this procedure multiple times, and then analyzing the results obtained from each imputation, a set of estimates is produced that can be combined, enabling the uncertainty due to the missing data to contribute to the confidence limits of the fitted model. van Buuren provides worked examples of the procedure based on a collection of human development data in which a substantial proportion of values was missing.
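
Returning briefly to the Gibbs sampler mentioned above, and purely as an illustration of the idea rather than of the BUGS software, the following Python sketch samples from a standard bivariate Normal distribution with correlation rho by drawing alternately from the two full conditional distributions, each of which in this case is known exactly to be Normal with mean rho times the other variable and variance 1-rho^2. The choice of target distribution, the value of rho, the run length and the burn-in are all arbitrary illustrative assumptions.

import numpy as np

rho = 0.8                                         # correlation of the bivariate Normal target (illustrative)
sd = np.sqrt(1.0 - rho**2)                        # standard deviation of each full conditional
rng = np.random.default_rng(1)
x, y = 0.0, 0.0                                   # arbitrary starting values
draws = np.empty((20000, 2))
for i in range(20000):
    x = rng.normal(rho * y, sd)                   # draw x from its full conditional given y
    y = rng.normal(rho * x, sd)                   # draw y from its full conditional given x
    draws[i] = (x, y)
draws = draws[2000:]                              # discard burn-in, as in the Metropolis example
print(np.corrcoef(draws.T)[0, 1])                 # sample correlation, close to rho

Note that every draw is accepted, since each full conditional 'proposal' already matches the target; this is what distinguishes Gibbs sampling from the accept/reject step of the Metropolis scheme.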

References

[BRO1] Brooks S P (1998) Markov Chain Monte Carlo Method and Its Application. Journal of the Royal Statistical Society. Series D (The Statistician), 47(1), 69-100

[CHI1] Chib S, Greenberg E (1995) Understanding the Metropolis-Hastings Algorithm. The American Statistician, 49(4), 327-335

[GEM1] Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6, 721-741

[GIL1] Gilks W R, Richardson S, Spiegelhalter D J (1996) Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC Press

[HAS1] Hastings W K (1970) Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika, 57(1), 97-109

[MET1] Metropolis N, Rosenbluth A W, Rosenbluth M N, Teller A H, Teller E (1953) Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21(6), 1087-1092

[SAS1] SAS/STAT (2009) 9.2 User Guide, Ch.7, "Introduction to Bayesian procedures". SAS Institute Inc., Cary, NC, USA

[VAN1] van Buuren S (2007) Multiple imputation of discrete and continuous data by fully conditional specification. Statistical Methods in Medical Research, 16, 219-242