Design of experiments

Many problems encountered in statistics involve the analysis of data collected by third parties as a result of some form of survey, ongoing data gathering process, remote sensing exercise, or similar activity. The design of the collection process is often outside of the control of the analyst, and whilst those supervising the data collection may have spent a great deal of time considering the best way of gathering the information, in most instances it does not reflect the need to undertake any kind of formal 'experiment'. However, a very different set of circumstances exists where a higher degree of control over the design can be exerted and is required in order to evaluate some process or condition in a formal manner. In general such problems require a series of experiments to be conducted (or run) to ensure that the results are repeatable and the inevitable variation in measured results can be evaluated in a satisfactory manner. With more complex problems this requires very careful prior design in order to minimize the number of runs required and maximize the utility of the results obtained.

Whether a problem requires the use of third party data or involves primary research and design, as is the case with experimental design problems, the principles and step-by-step approach described in the topic "The Statistical Method" still apply.

The term trial in a statistical context refers to the application of a highly structured test or experiment. Typically a trial is conducted in order to compare two or more options, varieties, arrangements or treatments. This distinguishes trials from other forms of statistical analysis, including those experiments that are conducted in order to determine an absolute value, such as the speed of light or the level of particulate air pollutants near a major highway. Trials are also distinct from the large body of statistical research that involves recording and analyzing observations made on the world around us, for example surveys, census data analyses, and the use of remote sensing technologies. Trials are experiments that typically involve human subjects (notably medical trials), animals, plants or industrial processes. Different treatments are applied to experimental units, for example to individual plots within a field, predefined groups of people or animals classified in a particular manner, or two distinct production lines or processes. A trial may have one or more experimental units per treatment. Each experimental unit receives one treatment (this might be a course of tablets, for example, and if the treatment involved the administration of a specific mix of drugs this would still be classed as one treatment). An entire trial may be replicated, in which case a new set of experimental units is selected but otherwise the trial structure remains unaltered. In many instances trials are conducted with two groups: units that receive a treatment, and units, known as controls, that receive no treatment but are otherwise grouped and handled in the same way so that they can be used for comparison.

In experiments involving humans, trials may be open or blind. Open trials are those in which all parties are aware of which treatments are being applied to whom and by whom. In blind trials one or more parties are unaware of the assignment of treatments. For example, in a so-called single-blind trial the individual receiving the treatment is not told whether the treatment they are being given (for example a course of tablets) is an existing drug treatment, a new drug treatment (subject to many controls nowadays) or a placebo (e.g. a tablet that is not a drug at all and has no effect on the patient). In double-blind trials neither the experimenter nor the patient knows which treatment has been assigned to which patient (treatments are coded), thereby minimizing the risk of the experimenter influencing the outcome. Clearly such approaches can only be applied in some situations.
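The idea that "treatments are coded" can be illustrated with a short sketch. Everything in it is an assumption made for illustration: the function name `blinded_allocation`, the `KIT-` code format and the three arm names are hypothetical, and a real trial would use a validated randomization service rather than a few lines of Python.

```python
import random

def blinded_allocation(patient_ids, arms, seed=None):
    """Randomly allocate patients to arms in near-equal numbers.

    Returns two dictionaries:
      blinded: patient id -> opaque kit code (what patient and clinician see)
      key:     kit code   -> arm (held separately until unblinding)
    """
    rng = random.Random(seed)
    n = len(patient_ids)
    # Balanced list of arm labels, shuffled into a random order
    labels = [arms[i % len(arms)] for i in range(n)]
    rng.shuffle(labels)
    # Unique four-digit codes that carry no information about the arm
    codes = rng.sample(range(1000, 10000), n)
    blinded, key = {}, {}
    for pid, arm, code in zip(patient_ids, labels, codes):
        kit = f"KIT-{code}"
        blinded[pid] = kit
        key[kit] = arm
    return blinded, key

# Hypothetical example: 12 patients, three arms
blinded, key = blinded_allocation(
    [f"P{i:02d}" for i in range(1, 13)],
    ["new drug", "existing drug", "placebo"],
    seed=1,
)
```

Only `blinded` would be shared with those running the trial. The `key` dictionary plays the role of the sealed code list that is opened when the trial is analyzed.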

A good experiment (Cox, 1958, [COX1]) is one that: (i) shows no sign of systematic error (randomization is a key tool in avoiding such errors); (ii) achieves precision in its determination of the effects of different treatments (i.e. keeps random error small); (iii) has broad validity (i.e. low study error); (iv) is as simple as possible given the task to be studied; and (v) facilitates the computation of the level of uncertainty associated with the findings. However, if an experiment has a very low level of uncertainty, and may be replicated many times with the same outcome, the application of statistical techniques to this residual uncertainty may not be necessary.

Statistical analysis of the outcome of formally designed experiments is often carried out using the family of techniques known as analysis of variance (ANOVA). Essentially this involves partitioning the observed total variance in the measured results, separating the variance between the groups of interest (e.g. treatments) from that observed within groups. The variation between groups is a measure of how different the treatments are in terms of the measured outcomes, whilst the variation within groups is a measure of how intrinsically variable the groups are, for example due to differences between the experimental units. ANOVA methods examine the relative size of these sources of variation, and if the variation between groups is much greater than that within groups, it suggests that the groups or treatments are indeed significantly different. However, it may well be the case that additional information is known about the experimental units, identifying a source of within-group variation that could be removed, which in turn could help produce a clearer analysis of the data. For example, when examining the effectiveness of a new treatment for diabetes it may be that the effectiveness of the treatment will be different depending on the sex and the body mass index (BMI) of the individual being subjected to the treatment. By conducting the experiment in blocks, with Male/Female and High/Medium/Low BMI levels as blocks, the ANOVA procedure can be used to separate the variation due to these factors from the variation between treatments.
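The between-group and within-group partition can be sketched in a few lines of Python for the simplest case: one factor and no blocking. The function name `one_way_anova` and the data are made up for illustration. The sketch stops at the F ratio, which would then be compared with an F distribution on (k - 1, N - k) degrees of freedom using statistical software.

```python
from statistics import mean

def one_way_anova(groups):
    """Partition variation for a one-factor experiment.

    groups: list of lists, one list of measured responses per treatment.
    Returns (ss_between, ss_within, f_ratio).
    """
    all_values = [y for g in groups for y in g]
    grand_mean = mean(all_values)
    k = len(groups)           # number of treatments
    n_total = len(all_values) # total number of observations

    # Variation of the treatment means about the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual observations about their own treatment mean
    ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ss_between, ss_within, ms_between / ms_within

# Made-up responses for three treatments, three replicates each
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
ss_b, ss_w, f = one_way_anova(groups)
# For these data: SS_between = 54, SS_within = 6, F = 27
```

Here the treatment means (5, 8 and 11) differ by far more than the spread within each treatment, which is why the F ratio is large.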

The design of a statistical experiment can generally be expressed using a statistical model, an equation (often, but not always, a linear expression) that describes how the results of the experiment might be represented mathematically. For example, as we shall see in the next section, if we let $y_{ij}$ represent the data obtained from the experiment (the measured outcome or result) conducted on the jth replicate that receives the ith treatment; and we let $T_i$ be the effect attributable to the ith treatment and let $e_{ij}$ denote residual error, unexplained by other factors, then the statistical model for this kind of experiment could be of the form:

$$y_{ij} = \mu + T_i + e_{ij}$$

This model states that the measured response is a simple linear function of the overall mean value, μ, for all the data, plus a treatment effect, plus some residual error. More complex experimental designs inevitably have more complex models associated with them. But as Box is often quoted as saying: "Essentially all models are wrong, but some are useful".
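As a worked illustration, the model just described can be written out together with its usual estimates. The balanced layout (k treatments, each with n replicates) and the constraint that the treatment effects sum to zero are assumptions added here for the sketch; they are not stated in the text above.

```latex
% One-factor model: response = overall mean + treatment effect + residual error
y_{ij} = \mu + T_i + e_{ij}, \qquad i = 1,\dots,k, \quad j = 1,\dots,n, \qquad \sum_{i=1}^{k} T_i = 0

% Least squares estimates (bars denote averages over the dotted subscript)
\hat{\mu} = \bar{y}_{\cdot\cdot} = \frac{1}{kn}\sum_{i=1}^{k}\sum_{j=1}^{n} y_{ij}, \qquad
\hat{T}_i = \bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}, \qquad
\hat{e}_{ij} = y_{ij} - \bar{y}_{i\cdot}

% The corresponding partition of the total sum of squares
\sum_{i=1}^{k}\sum_{j=1}^{n} \left(y_{ij} - \bar{y}_{\cdot\cdot}\right)^2
  = n\sum_{i=1}^{k} \left(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}\right)^2
  + \sum_{i=1}^{k}\sum_{j=1}^{n} \left(y_{ij} - \bar{y}_{i\cdot}\right)^2
```

The first term on the right is the between-treatment variation and the second is the within-treatment (residual) variation, which is the partition that analysis of variance examines.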

Crawley (2007, Ch.9 and p339, [CRA1]) extends this perspective on modeling by adding the observation that the correct model can never be known with certainty and the simpler the model is, the better it is (Occam's Razor). In his excellent chapter on statistical modeling he notes that if it turns out that the model one starts with is not good enough there are a number of ways to improve the situation (over and above the core issue of reviewing the problem):

transform the response variable

transform one or more of the explanatory variables

try fitting different explanatory variables if you have any

use a different error structure

use non-parametric smoothers instead of parametric functions

use different weights for different values

To clarify some of these ideas in the context of experimental design consider the following practical problems:

Industrial process: an existing industrial process creates a fibre of well-established breaking strength. Researchers believe that the final breaking strength could be increased by including an additive in the initial mix. They wish to design an experiment to test their hypothesis. The experimental unit in this instance is the input mix for the process, the additive is the factor or treatment being examined, and the measured final breaking strength of the output fibre is the response variable:

o if the experiment examines the factor or treatment at a single level or a small number of fixed levels, which is likely in this example, it is known as a fixed effects experiment. The factor being studied is known as the main effect (assuming there are other effects which are being ignored or controlled for). For this kind of problem a simple one-factor completely randomized design is generally the most appropriate

o if the experiment examines the factor at potentially a wide range of levels, for which a continuous range of responses is expected, it may be that the response pattern can be modeled (at least, over some finite range of values) by a simple mathematical equation, such as a linear function or quadratic curve. In such cases a regression-based design may be the most appropriate, since this will enable the regression parameters to be directly estimated from the measured responses. The expected pattern of variation may also be an important determinant of where experimental design is focused: concentrating on the areas of greatest interest, greatest variation (variability) or greatest change (steepest variation)

o in the example above a single factor is to be examined, at a number of pre-defined levels. However, it may be that the chemical additive used will produce a different response depending not just on the levels applied, but also on the underlying initial mix of the standard chemical used. In this instance the response sought will be determined by the mixture that produces the maximum fibre strength, and a mixture design is more appropriate than the simple completely randomized design

o it may be that fibre strength is not the only response variable of interest (for example fibre weight and flexure strength as well as absolute breaking strength might be extremely important) and the variability of the resulting product is also often of great concern. With multiple output variables (at least two) the resulting response surface is of interest, with the objective being to maximize the primary response variables whilst ensuring constraints on other variables are respected and/or to maximize more than one response variable at the same time. Problems of this type are approached using response surface designs and may utilize a range of analytical or heuristic optimization procedures to achieve the best results
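The one-factor completely randomized design mentioned in the first of these points amounts to allocating treatment levels to experimental units purely at random, with equal replication. A minimal sketch follows; the function name, the batch labels and the three additive levels (0.0, 0.5 and 1.0 percent by weight) are hypothetical values chosen for illustration.

```python
import random

def completely_randomized_design(units, treatments, seed=None):
    """Assign treatments to units at random with equal replication.

    The number of units must be a multiple of the number of treatments.
    Returns a dict mapping each unit to its assigned treatment.
    """
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    reps = len(units) // len(treatments)
    # One label per unit, each treatment repeated `reps` times
    labels = [t for t in treatments for _ in range(reps)]
    rng = random.Random(seed)
    rng.shuffle(labels)  # the randomization step
    return dict(zip(units, labels))

# 12 batches of input mix, additive at three hypothetical levels
batches = [f"batch-{i}" for i in range(1, 13)]
plan = completely_randomized_design(batches, [0.0, 0.5, 1.0], seed=42)
for batch, level in plan.items():
    print(batch, level)
```

Fixing the seed makes the plan reproducible for record keeping. In practice the run order of the batches would usually be randomized as well.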

Medical trial: a new treatment for Malaria has been developed and undergone preliminary testing in the laboratory. It is now at the stage for testing on human subjects who have Malaria. What kind of experimental design would be most appropriate? Problems of this type have been previously discussed in the topic Conduct and reporting of medical research. In the context of controlled experiments, the preferred approach to this kind of problem is via the use of a randomized controlled trial (RCT):

o RCTs have a well-established overall approach to their design and implementation, but there will be many aspects of technical detail (as opposed to administrative or ethical considerations) that require determination prior to trials proceeding. These include: very clear problem definition (for example there are several types of Plasmodium parasite that can infect humans and produce Malarial symptoms, but only one type is very serious, so how specific is the new treatment?); determination of desired sample sizes for the treated group and the 'control' group(s) (often set to be approximately equal, but will vary depending on the particular circumstances); identification (recruitment) of cases (who are the experimental units in this example, with possible controls for location, stage of infection, stratification or blocking of cases by age/sex/health etc.); randomized (and blinded) allocation of cases to treatments (e.g. treated with new treatment, old treatment, or placebo); and definition of the response: in the tuberculosis trial described earlier the responses evaluated were Considerable improvement, Lesser improvement or deterioration, and Deaths (after 6 months), but in many studies more complex response variables will be measured, depending on the type of condition and treatment

o Although the above example is applicable to many types of medical trial, there may be occasions where multiple factors are to be studied at the same time, for example two factors, each at two levels. Problems of this type can be reduced to a series of separate pairwise comparisons, but a more compact form of trial that uses a factorial design may also be used and offers some benefits. For example, when testing a tuberculosis treatment program that includes the use of antibiotic A, antibiotic B, antibiotics A and B combined, or no antibiotics, a single 2x2 factorial design might be used, with cases assigned to one of the four groups at random. If this experiment were run as a set of three separate trials it would typically require more cases, as there would be 3 pairs of 2 = 6 groups, and half of those in the study (as opposed to one quarter) would not be treated with either antibiotic. This is perhaps an extreme example but does highlight some of the issues that may arise. In general with m factors, each of which can have n levels, at least $n^m$ groups need to be studied (i.e. $2^2 = 4$ for 2 antibiotics at 2 levels, present and absent). Clearly as the number of factors and levels increases the complexity of both the design and subsequent analysis also increases, and implementing the design becomes less practical. With more than a few factors it is usual either to seek to reduce the number of factors (i.e. simplify the problem and objectives), to carry out a partial or fractional factorial design, or in some instances to use specialized designs that can address problems with many factors (such as Plackett-Burman designs, which work by applying some clever design rules to the mixes required)

o Some forms of trial may be sequential, in that a series of treatments is given, with the treatment at each stage in the process being dependent on the outcome of the previous stage. Note that this is not the same as a multi-stage treatment that takes place in a pre-defined order but does not vary according to the outcome of the previous stage. In this latter case the multi-stage process can be considered as a single treatment

o Cohort studies and case-control studies: although a huge amount of work goes into designing these other forms of medical study, they do not typically fall into the category of experimental design. They can be described as observational rather than experimental and often cover an extended period of time

o With more than one factor to be compared (e.g. additive A, additive B) randomized block designs, such as Latin squares and Graeco-Latin squares, can be used. However, in many industrial applications (including those in the health sector) there may be large numbers of factors to be systematically tested at two or more levels in order to determine the optimum combination of factors. For problems of this kind specialized designs have been developed, notably partial or fractional factorial designs (FrFs) and Plackett-Burman (PB) designs. Whilst such designs use statistical concepts in their design and analysis, they can be viewed as belonging to the broader class of optimization problems. Associated with this focus on optimization methods and computationally intensive procedures are a number of so-called optimal design procedures, aimed particularly at situations in which the classical designs cannot be applied. Amongst these designs are those known as D-optimal, A-optimal, I-optimal and many others (see the NIST website entry for a discussion of these). Optimal designs have become widely available in specialized software packages in recent years
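The factorial and fractional factorial ideas in the points above can be sketched as follows. A full factorial with m factors at n levels each has n^m treatment combinations (n multiplied by itself m times), and the sketch simply enumerates them. The half fraction shown is one standard construction for two-level factors, in which the last factor is set to the product of the others using coded levels -1 and +1. The function names are hypothetical, and dedicated design software offers far more than this.

```python
from itertools import product

def full_factorial(levels_per_factor):
    """Enumerate every treatment combination.

    levels_per_factor: dict mapping factor name -> list of levels.
    Returns a list of dicts, one per combination.
    """
    names = list(levels_per_factor)
    level_lists = [levels_per_factor[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]

def half_fraction_two_level(names):
    """Half fraction of a two-level design with coded levels -1 and +1.

    All combinations of the first k-1 factors are run, and the last
    factor is set to the product of the others, halving the run count.
    """
    *base, last = names
    runs = []
    for combo in product((-1, 1), repeat=len(base)):
        sign = 1
        for level in combo:
            sign *= level
        run = dict(zip(base, combo))
        run[last] = sign
        runs.append(run)
    return runs

# The 2x2 antibiotic example: four treatment groups
arms = full_factorial({
    "antibiotic_A": ["absent", "present"],
    "antibiotic_B": ["absent", "present"],
})

# Three two-level factors in 4 runs instead of 8
half = half_fraction_two_level(["A", "B", "C"])
```

The saving in runs comes at a price: in the half fraction the effect of C cannot be separated from the A-by-B interaction, so such designs are chosen when higher-order interactions are believed to be negligible.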

Agricultural trial: a study of genetically modified (GM) wheat is to be undertaken in order to determine the yield per hectare with and without the application of nitrogenous fertilizers. The fertilizer is to be applied at two different rates, hence there are three levels: 0, rate 1, and rate 2

o at first sight this appears to be a relatively simple problem, with a single factor at three levels, with perhaps a number of replications, n, of the trial to be undertaken. As such, a completely randomized design would seem to be the best approach, with a set of equal size plots (e.g. 1 hectare each) being used for each test. Thus 3n plots would be required, but in what order or arrangement?

o the plot sizes, shapes, locations (relative to one another and over a larger geographic scale), and assignment of tests to plots (the experimental units) warrant careful consideration. Different locations will exhibit variations in moisture, aspect, soil composition, wind etc., and these factors need to be eliminated from the final analysis as far as possible (e.g. by careful site selection, stratification or blocking of sample sites, and randomization)

o with a trial of this type other considerations come into play: for example, control of the area around the study region in order to avoid contamination (from and to the study plots). Severe restrictions may be placed on the set of acceptable locations and the number of trials permitted, introducing the risk of additional study area constraints and limited study size
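One common way to combine the blocking and randomization mentioned in these points is a randomized complete block design: each block (for example a field or a strip of similar soil) contains one plot per fertilizer level, and the order of the levels is randomized separately within every block. The sketch below assumes that arrangement; the function name and the field labels are hypothetical.

```python
import random

def randomized_complete_block_design(blocks, treatments, seed=None):
    """Randomize treatment order independently within each block.

    Every treatment appears exactly once per block.
    Returns a dict mapping block -> list of treatments in plot order.
    """
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)  # fresh randomization for each block
        layout[block] = order
    return layout

# Three fertilizer levels, n = 4 blocks, giving 3n = 12 plots
layout = randomized_complete_block_design(
    ["field-1", "field-2", "field-3", "field-4"],
    ["0", "rate 1", "rate 2"],
    seed=7,
)
for block, order in layout.items():
    print(block, order)
```

Because every level occurs once in each block, differences between blocks (moisture, aspect, soil composition) can be separated from the treatment comparison in the subsequent analysis.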

From these three examples some overall observations may be made:

there are many types of problem that lend themselves to formal experimental design methods, and identifying the most appropriate and effective approach to adopt is an important early stage in the process

most experimental design problems seek to test some form of hypothesis, often achieved by comparing one method (treatment or process) against another, but some are more exploratory in nature (seeking to identify patterns and possible causal relationships) whilst others, such as response surface methods, are more akin to optimization than classical statistical procedures

experimental design problems are characterized by a number of key components: control over the design and conduct of the experiment; identification of the factors or effects to be manipulated during the experiment (e.g. dosage levels, additives in the mix); selection of the specific experimental units to be the subject of the trial; assignment of experimental units to treatments using randomization and where necessary, blinding of this assignment process; and measurement and analysis of the results (dependent variable responses) using a previously defined model or models

Box et al. (2005, p94, [BOX1]) in their excellent book "Statistics for Experimenters", provide the following initial summary regarding the conduct of experiments (modified here, for brevity):

whenever possible, experiments should be comparative

there should be genuine replication

whenever appropriate, blocking (e.g. pairing) should be used to reduce error — as the authors state elsewhere: block what you can and randomize what you cannot

randomization should be planned as an integral part of experimentation. Having eliminated "known" sources of discrepancy, either by holding them constant during the experiment or by blocking, unknown discrepancies should be forced by randomization to contribute homogeneously to the errors of the various treatments

In the subtopics that follow, a range of the designs described briefly above are examined in greater detail, and examples are given of their use. Readers who need a deeper understanding of the ideas and techniques discussed are recommended to read Box et al. (2005, [BOX1]) or similar works appropriate to their subject area (see also the recommended reading topic, above). An excellent web resource on experimental design, particularly in the context of industrial experiments and process improvement, is the NIST/Sematech eHandbook section 5.3, Choosing an experimental design, which provides links to the many design variants used in industrial and other application areas.

Tables of pre-constructed designs have been published by various authors over many years, although these are now largely redundant due to the rise in computer-generated designs. Examples are provided in Box et al. (2005, [BOX1]) and in the R package BHH2, which was designed to accompany their book and includes sample datasets, design generation facilities and graphical tools that are described in the book and not generally available elsewhere. The R packages DoE.base (design of experiments) and FrF2 (fractional factorial designs) are also useful facilities for design generation. A number of commercial software packages include extensive capabilities for the design and analysis of experiments, particularly with an industrial focus. For example, Minitab, Statistica and JMP are particularly strong in this area, as are the SAS packages SAS/STAT (PLAN and ANOVA procedures) and SAS/QC (which handles a wide range of factorial and related designs). Generic mathematical software, such as MATLAB and Mathematica, also provides a wide range of tools for such design and analysis requirements.


[BOX1] Box G E P, Hunter J S, Hunter W G (1978) Statistics for Experimenters: An Introduction to Design, Data Analysis and Model Building. J Wiley & Sons, New York. The second, extended edition was published in 2005 (n.b. this edition is very expensive)

[COX1] Cox D R (1958) Planning of experiments. John Wiley & Sons, New York

[CRA1] Crawley M J (2007, 2012) The R Book. J Wiley & Sons, Chichester, UK

Web site:

NIST/Sematech eHandbook of Engineering Statistics, section 5.3, Choosing an experimental design