Sampling and sample size
Sampling is central to the discipline of statistics. Typically samples are taken in order to obtain a picture of the population as a whole without the need to make observations on every member of that population. This saves time and cost, and may be the only feasible approach, perhaps because the population is infinite or very large, or is dynamic, so that sampling provides a snapshot at a particular moment. Ideally we wish to take a sample that provides an extremely good representation of the population under study, whilst at the same time involving as few observations as possible. These two objectives are clearly related, since a perfect representation is only possible if the entire population is measured or if the population is completely uniform. This latter point highlights the fact that larger and more carefully structured samples may be required to obtain an estimate of a given quality if the population is highly variable. The difference between the measured value for an attribute in a sample and the 'true value' in the population is termed sampling error.
Typically a set of n independent samples is taken based on some form of random selection from the target population, as far as it is possible to define the latter. Randomness in the selection process seeks to help eliminate bias, whilst independence of samples also helps to ensure that bias due to samples being associated with each other in some way is minimized. For example, a sample of leaves from banana trees would seek to take one sample from each of a large number of banana trees that were relatively well separated spatially. It would be inappropriate to take 10 different samples from one tree (the samples would not be independent as they all came from the same tree) and likewise, trees in close proximity may exhibit similarities due to localized effects (e.g. soil, cultivation practice, disease spread etc.) which may result in samples not being independent. Similar considerations apply to samples taken from animals or humans, or from soils and rocks, and in some instances, samples taken over a period of time where time dependencies exist. In cases where space and/or time dependencies are thought to exist, tests for autocorrelation should be carried out and if necessary, sample design and modeling must explicitly take account of the lack of independence between observations. The procedure adopted for any particular sampling exercise is known as the sampling protocol, and should always be carefully planned and designed.
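As a rough illustration of the autocorrelation check mentioned above, the sketch below computes the sample autocorrelation of an ordered sequence of observations at a chosen lag. The function name `lag_autocorrelation` and the example series are illustrative assumptions and not part of any particular package. A value far from zero suggests that successive observations are not independent, although a formal test would also need to take the sample size into account.

```python
def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation of an ordered series at the given lag (hypothetical helper)."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Products of deviations that are `lag` steps apart in the sequence.
    numerator = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
    return numerator / denominator


trend = [1, 2, 3, 4, 5]          # steadily increasing: positive autocorrelation
alternating = [1, -1, 1, -1]     # sign flips each step: negative autocorrelation
print(lag_autocorrelation(trend))
print(lag_autocorrelation(alternating))
```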
The first step in this process is to define the population of interest, from which samples are to be taken. The population may be finite or infinite, and may be very clearly defined (even if difficult to enumerate) — for example, "all adults over the age of 18 living in a given city", or may be less well defined — for example "particulates in the air over London", or "all measurements of outcomes from a particular industrial process". Ensuring the nature of the population to be studied is well understood is an important step in the initial design of any sampling scheme.
The study population is the collection of units that could possibly be measured (known as the sampling frame in survey work). In some instances samples are made on a population that can conveniently be studied (for example, sampling in a laboratory or a particular location) rather than sampling the population itself, with the results being applied to the real population of interest. This results in so-called study error, which again one seeks to minimize. In the example cited in The Statistical Method section, Michelson measured the speed of light in air on the Earth's surface, not the speed of light in a vacuum, so the study error in this case consisted of the difference between these two environments. In many studies the research is carried out at a particular time and location, and the possible effects of temporal and spatial variation are excluded or deemed to have no substantive bearing on the results. However, all research does take place in space and time, so there is always some study error related to these factors. For example, after Michelson had completed his research, one criticism of it was that it was carried out during a brief period of the year, which did not allow for the possibility that the findings would have been different had the research been conducted 3 or 6 months later (reflecting the different position of the Earth relative to the Sun). It is often helpful to ensure that the study population is as tightly defined as possible, thereby ensuring that sampling is only from those individuals or objects that are of direct interest and helping to restrict variation in measured attributes.
The sampling protocol is the detailed procedure followed for sampling the data from the study population. In many instances the sampling protocol makes use of some level of randomization in order to avoid the risk of bias. The time, location and possible selection from subgroups of the study population form key elements of the sampling protocol. Controlled experiments typically involve use of a formal design that seeks to separate explanatory variables into distinct groups, which are then systematically or randomly sampled. In many instances random numbers are required in order to select entities, locations or times to be sampled, and typically these are computer-generated from a uniform distribution over the range [0,1]. Random numbers may also be drawn from other distributions, either using built-in software functions (e.g. the Excel Data Analysis tools Random Number generator facility, or SPSS functions of the form RV.DIST, which return a random value from a distribution DIST with specified parameters), or by using Uniform random numbers in conjunction with the cumulative distribution of interest.
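The last option, combining Uniform random numbers with a cumulative distribution, is often called inverse transform sampling. The sketch below illustrates it for the Exponential distribution, chosen here only because its cumulative distribution F(x) = 1 − exp(−rate·x) has a simple closed-form inverse. The function names and the rate value are assumptions made for the example.

```python
import math
import random


def exponential_inverse_cdf(u, rate):
    """Invert F(x) = 1 - exp(-rate * x) for a Uniform value 0 <= u < 1."""
    return -math.log(1.0 - u) / rate


def sample_exponential(n, rate, seed=None):
    """Draw n Exponential values by pushing Uniform[0,1) values through the inverse CDF."""
    rng = random.Random(seed)
    return [exponential_inverse_cdf(rng.random(), rate) for _ in range(n)]


draws = sample_exponential(10_000, rate=2.0, seed=1)
# The sample mean should be close to the theoretical mean, 1 / rate.
print(sum(draws) / len(draws))
```

The same pattern applies to any distribution whose cumulative distribution can be inverted, analytically or numerically.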
In many instances a sample is required from an empirical distribution or a known (theoretical) distribution with predefined parameters. Typically this involves taking a random sample from the distribution selected or from a subset (e.g. a range) within this distribution. Many software packages provide facilities for generating such random samples, which may then be used to compare with observed datasets or as a frame for sampling (see further, Sampling from a known distribution).
When a large number of records have been obtained and stored in a database, samples from the dataset may be extracted for analysis rather than analyzing the full dataset (which might consist of thousands or millions of records). Typically samples of a prespecified size are selected at random from the stored recordset, with checks made to ensure that key parameters do not vary too greatly from the population values (i.e. the sample is representative, not biased). Samples may be taken with or without replacement (without replacement is the norm) and may be stratified if necessary, depending on the manner in which the data are stored and grouped. Most statistical software packages provide a range of procedures for record selection. These vary from systematic to simple random, stratified random where selection probabilities are proportional to stratum size (PPS), and many variants on these. The table below lists the options provided for SPSS; other packages provide similar facilities. The documentation for SPSS, SAS/STAT and other packages provides exact details of how the variants (e.g. PPS Brewer vs PPS Murthy) are calculated. It is important to note that most standard statistical formulas assume that records are drawn from an infinite population by simple random sampling without replacement (WOR). If this is not the case the analytical tools applied must be adjusted to take the data selection procedure adopted into account. Again, statistical software that facilitates such nonrandom selection will also include facilities for computing core statistical measures and simple models adjusted for the sampling approach adopted.
Sampling procedures: record selection sampling in SPSS

| Method or option | Description |
| --- | --- |
| Simple Random Sampling | Units are selected with equal probability. They can be selected with or without replacement |
| Simple Systematic | Units are selected at a fixed interval throughout the sampling frame (or strata, if they have been specified) and extracted without replacement. A randomly selected unit within the first interval is chosen as the starting point |
| Simple Sequential | Units are selected sequentially with equal probability and without replacement |
| PPS | This is a first-stage method that selects units at random with probability proportional to size (PPS). Any units can be selected with replacement; only clusters can be sampled without replacement |
| PPS Systematic | This is a first-stage method that systematically selects units with probability proportional to size. They are selected without replacement |
| PPS Sequential | This is a first-stage method that sequentially selects units with probability proportional to cluster size and without replacement |
| PPS Brewer | This is a first-stage method that selects two clusters from each stratum with probability proportional to cluster size and without replacement. A cluster variable must be specified to use this method |
| PPS Murthy | This is a first-stage method that selects two clusters from each stratum with probability proportional to cluster size and without replacement. A cluster variable must be specified to use this method |
| PPS Sampford | This is a first-stage method that selects more than two clusters from each stratum with probability proportional to cluster size and without replacement. It is an extension of Brewer's method. A cluster variable must be specified to use this method |
| Use WR estimation for analysis | By default, an estimation method is specified in the plan file that is consistent with the selected sampling method. This allows you to use with-replacement (WR) estimation even if the sampling method implies WOR estimation. This option is available only in stage 1 |
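The two simplest procedures in the table can be sketched in a few lines. The code below is an illustrative outline only, not a description of how SPSS or any other package implements these options. The function names are assumptions, and the stored recordset is represented by a plain Python list.

```python
import random


def simple_random_sample(records, n, replace=False, rng=random):
    """Select n records with equal probability, with or without replacement."""
    if replace:
        return [rng.choice(records) for _ in range(n)]
    return rng.sample(records, n)


def systematic_sample(records, n, rng=random):
    """Select n records at a fixed interval, starting at a random point in the first interval."""
    interval = len(records) // n
    if interval < 1:
        raise ValueError("sample size exceeds the number of records")
    start = rng.randrange(interval)
    return records[start::interval][:n]


records = list(range(1000))          # stand-in for a stored recordset
rng = random.Random(42)              # seeded so that the selection is repeatable
print(simple_random_sample(records, 5, rng=rng))
print(systematic_sample(records, 5, rng=rng))
```

Passing a seeded `random.Random` instance makes a selection reproducible, which is usually desirable when the sampling protocol has to be documented.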
The above concepts apply, in somewhat modified form, to problems in higher dimensions. In particular, in two dimensions (spatial data selection) a number of special procedures may be required to ensure that samples are both randomly selected and yet are also representative (see further, de Smith et al., 2018, section 5.1.2 [DES1]). As a simple illustration of the kind of approaches that can be adopted, the diagram below shows four methods for point sampling within a 100x100 unit study region. One interesting and important feature of this example is that approach B, which is simple random sampling, results in apparent spatial clustering of samples, whilst substantial areas are left unsampled. Sampling approach C is one means of trying to limit this effect. For more details see the reference cited.
Point-based sampling schemes (figure with four panels): A. Regular; B. Random; C. Random offset from regular (random clustered); D. Regular with random start of sequence (y offset)
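The four schemes can be approximated with the sketch below, which generates 100 points in a 100x100 region for each approach. The grid size, function names and the exact form of the random offsets in schemes C and D are assumptions based on the panel titles, not a reproduction of the original figure.

```python
import random


def regular(side=100.0, per_axis=10):
    """A: points at the centres of a regular grid."""
    step = side / per_axis
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(per_axis) for j in range(per_axis)]


def simple_random(side=100.0, n=100, rng=random):
    """B: points placed uniformly at random over the whole region."""
    return [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]


def random_offset_from_regular(side=100.0, per_axis=10, rng=random):
    """C: one point placed at random within each cell of the regular grid."""
    step = side / per_axis
    return [((i + rng.random()) * step, (j + rng.random()) * step)
            for i in range(per_axis) for j in range(per_axis)]


def regular_with_random_start(side=100.0, per_axis=10, rng=random):
    """D: regular spacing, but each column starts at its own random y offset."""
    step = side / per_axis
    points = []
    for i in range(per_axis):
        offset = rng.random() * step
        points.extend(((i + 0.5) * step, offset + j * step) for j in range(per_axis))
    return points


rng = random.Random(7)
schemes = {
    "A": regular(),
    "B": simple_random(rng=rng),
    "C": random_offset_from_regular(rng=rng),
    "D": regular_with_random_start(rng=rng),
}
for name, points in schemes.items():
    print(name, len(points), points[:2])
```

Counting the grid cells that contain no point under scheme B, compared with scheme C, gives a quick numerical view of the clustering effect described above.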
There are many factors that affect the choice of sample size. In public opinion surveys it is very common to hear that the sample taken was of around 1000-1500 people. This figure is obtained from a relatively simplistic calculation, based on achieving an approximately 95% confidence level in the results with estimation of a proportion, p, within a range of roughly ±3% (see also, our discussion on confidence intervals). The figure of 1000-1500 arises from these two requirements: using a Binomial distribution the standard error (SE) of the proportion, p, is √(pq/n), where q=1−p. Note that the term √(pq) is maximized when p=q=0.5, so this assumption provides an upper bound of 1/2 on √(pq) and thereby on the range of expected variation in our estimate. Now from the Normal distribution, which is the limit of the Binomial for large n (the approximation improves reasonably rapidly if p and q are similar in size), we know that 95% of the distribution is included within roughly ±2 standard deviations. Thus the sample size needed to ensure an error in an estimate of no more than x, say x=5%, is obtained by setting x equal to 2SEs, i.e. x=1/√n. This gives the result n=1/x², so for x=5%, i.e. x=0.05, we have n=400, and for 3% we have just over 1100. For a 1% range at 95%+ confidence a sample size of 10,000 would be required, so the choice of 1000-1500 is a compromise between the quality of estimation and the cost and time involved in undertaking the survey.
For some problems wider bands are acceptable on the estimated proportion or mean; thus for a value within ±20% a sample of only 25 is required. If this were an estimate of the concentration of zinc in the soil in parts per million (ppm), an estimate of 100ppm with a range of 80-120ppm may be perfectly acceptable. This method of computing sample size is, of course, simply a rule of thumb that has been found to work in many situations of this particular type. Put more formally, we fix a small probability that the estimated proportion, p̂, will differ from the population proportion, p, by more than some amount x. Writing α for the probability in each tail, so that the total is 2α:
$$P(|\hat{p} - p| > x) = 2\alpha$$
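If we denote by z_α the Normal distribution value for a confidence interval determined by α (e.g. with α=0.025 in each tail of a two-tailed test, 5% in total, z_α=1.96) then this (rather simplified) formula for sample size n becomes:
$$n = \frac{z_\alpha^2 \, p(1-p)}{x^2}$$
With p=0.5 and z_α≈2 this reduces to the rule of thumb n=1/x² used above.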
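The arithmetic in this discussion can be checked with a short sketch. It assumes the usual Normal-approximation form n = z²p(1−p)/x², which reduces to the rule of thumb n = 1/x² when p = 0.5 and z is rounded to 2. The function names are illustrative only.

```python
import math
from statistics import NormalDist


def z_value(tail_probability):
    """Standard Normal value leaving `tail_probability` in the upper tail."""
    return NormalDist().inv_cdf(1.0 - tail_probability)


def sample_size_for_proportion(x, p=0.5, z=2.0):
    """Unrounded sample size needed to estimate a proportion to within +/- x."""
    return (z ** 2) * p * (1.0 - p) / (x ** 2)


# Rule-of-thumb values (z rounded to 2, worst case p = 0.5).
for x in (0.20, 0.05, 0.03, 0.01):
    print(f"+/-{x:.0%}: n is about {round(sample_size_for_proportion(x))}")

# The same calculation with the more exact z for 2.5% in each tail.
z = z_value(0.025)
print(round(z, 2), math.ceil(sample_size_for_proportion(0.03, z=z)))
```

Using the more exact z value rather than 2 gives somewhat smaller sample sizes, which is one reason the quoted survey figures are best read as approximate.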
Sample size selection is thus related to several factors, including: (i) cost, time and risk; (ii) the type of problem being addressed (and the techniques used to address the problem); and (iii) the variability of the data being sampled. If one has prior knowledge of the data variability, or can make an informed estimate of this (for example based on prior research and/or test samples), then the determination of sample size becomes more straightforward. Clearly greater variability in the data will mean that the standard error (SE) is intrinsically larger, which in turn requires a bigger sample size for a given level of precision in the parameter(s) to be estimated. Furthermore, if the population is known to vary in some kind of structured (or stratified) manner, for example spatially or temporally, then it makes sense to sample less frequently in the less variable phases or zones and more frequently in those strata of the study population that are more variable. Thus a given overall sample size, n, might represent the sum of a set of stratified samples {n_i}, where each n_i is separately determined from the estimated variability (standard deviation), s_i, in zone or time slot i. There is an optimal method of determining the component samples given n, assuming that estimates of the variability in each zone, s_i, are available together with some measure of the proportions, w_i, of the overall population represented by each of the separate zones or strata. With countable items (e.g. census data) these proportions can be obtained as the count in zone i divided by the total for all zones; other approaches might be to use areas or length of sampled time slots to determine the proportions. The basic allocation rule is then:
$$n_i = n \, \frac{w_i s_i}{\sum_j w_j s_j}$$
Note that this allocation simply determines how a given sample size (perhaps obtained with reference to some cost or time constraint) may be allocated amongst the selected strata.
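A minimal sketch of this kind of allocation is given below. It assumes the rule takes the form usually known as Neyman allocation, n_i = n·w_i·s_i / Σ w_j·s_j, with s_i the estimated standard deviation in stratum i. The function name and the example strata are hypothetical.

```python
def allocate_sample(n, proportions, std_devs):
    """Split a total sample size n across strata in proportion to w_i * s_i."""
    if len(proportions) != len(std_devs):
        raise ValueError("one proportion and one standard deviation per stratum")
    products = [w * s for w, s in zip(proportions, std_devs)]
    total = sum(products)
    if total <= 0:
        raise ValueError("at least one stratum must have positive weight and variability")
    return [n * value / total for value in products]


# Three hypothetical strata: the largest is the least variable.
shares = allocate_sample(170, proportions=[0.5, 0.3, 0.2], std_devs=[10.0, 20.0, 30.0])
print([round(share) for share in shares])
```

The function returns unrounded values; in practice each n_i would be rounded, with a check that the rounded values still sum to n. When all strata are equally variable the rule reduces to allocation in proportion to stratum size.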
Bartlett et al. (2001, [BAR1]) provide a general discussion of sample size determination for educational and social research, drawing on the earlier work of Cochran (1977, [COC1]). Their guidance draws on the experience of many researchers conducting questionnaire surveys and similar social research programmes. The formulas described are based on those discussed above, but take into account issues such as: finite population size; determination of the estimated variance; dealing with multiple measured variables of various types; and finally, dealing with nonresponse. Cochran recommends, for example, that if the sample size, n, exceeds 5% of the population size, P, then the sample size value should be adjusted by a factor n/(1+n/P). Assuming an initial sample size estimate of 400 and a population of 4000, this would adjust the sample size down to 364. If the expected response rate is 60% the sample size is then increased to approximately 600.
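The two adjustments in this example can be reproduced with the sketch below, which applies the finite population factor n/(1+n/P) quoted above and then inflates the result for the expected response rate. The function names are illustrative, and rounding up to whole respondents is an assumption made here.

```python
import math


def finite_population_adjustment(n, population):
    """Apply the correction n / (1 + n / P) for a finite population of size P."""
    return n / (1.0 + n / population)


def inflate_for_nonresponse(n, response_rate):
    """Number of people to approach so that about n of them respond."""
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    return n / response_rate


adjusted = math.ceil(finite_population_adjustment(400, 4000))
to_approach = math.ceil(inflate_for_nonresponse(adjusted, 0.60))
print(adjusted, to_approach)
```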
A number of statistical tests, such as z-tests and t-tests, yield results that are dependent on the sample size, through the standard error. The sections that describe these tests also provide guidance on how to compute the sample size in order to meet requirements on the levels of Type I and Type II errors that are acceptable. Special graphs, known as Operating Characteristic curves, provide plots of the relationship between sample size and the two main types of error (see Ferris et al., 1946, [FER1], for a number of such charts covering χ², F, Normal or z-tests and t-tests).
In the medical field a range of sample size guidance documents, tables, software tools and formulas are available, many of which are effectively variants on the same general model (see Altman [ALT1], Chow, Shao and Wang [CHO1], Jones et al. [JON1], Carley et al. [CAR1], Dupont and Plummer [DUP1], Whitley and Ball [WHI1], and Machin et al. [MAC1]). Typically sample size estimation in these publications is based on the relationship between three elements, two involving risk assessments and one involving the size of the effect one is seeking to discover (small effects require larger samples in order to detect them reliably): (a) the risk of a false positive (α level, usually taken as 5% or 0.05); (b) the risk of a false negative (β level, usually taken as 20% or 0.20; or using the notion of power=1−β, so 80%); and (c) the size of the effect. The last item can be difficult to determine, but is typically of the form: E=(target difference)/(estimated standard deviation). For example, if a study is trying to detect a difference of size 14 units between a measurement on two equal sized groups (e.g. the blood pressure in mmHg treated using different therapies, with measurements taken 6 hours after therapy commenced) and the estimated standard deviation was 18mmHg, then the standardized effect value would be 14/18=0.78. The chart below, redrawn from Altman [ALT1], enables the required sample size to be read from the central section by drawing a straight line between locations on the left and right hand axes. The left hand axis shows a measure of the size of effect one is trying to detect (in standardized units) whilst the axis on the right shows the power of the test (as noted above, power = 1 − the risk of a Type II error or false negative). The third element, the risk of a Type I error or false positive, is determined by the significance level in the central section.
Altman's Nomogram for computing sample size or power (two equal sized groups)
So if an experiment is to be defined that seeks to be able to identify a standardized effect of size 1.0 with a power of 80% and a risk of a Type I error of 5% we draw the red line (lowest on the nomogram) and choose an overall sample size of around 32, i.e. a target of 16 participants in each of two groups. If the power is increased (e.g. to 90%) and/or the effect size reduced (e.g. to 0.6) the required sample size increases to around 60 per group (120 total; blue line, upper line). For the blood pressure example cited earlier, a total sample size of 52 is required, as shown by the green (middle) line (26x2, based on an 80% power level). Note that this analysis can also identify trials that are inadequately powered: for example, a trial that seeks to identify a relatively small effect with a sample size that is too small will equate to one whose power is low. Essentially these results are based on the use of the (noncentral) t-distribution in a t-test for the difference of two means where the population standard deviation is not known (see further, NIST and Beyer, Table IV.4 [BEY1]; note that these sources cite the sample size required for a single group).
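The values read from the nomogram can be approximated numerically. The sketch below uses the Normal approximation n = 2(z_{1−α/2} + z_{power})²/E² for each of two equal groups, which is an assumption made for simplicity. It is not the noncentral t calculation on which the nomogram is based, so it tends to give slightly smaller values.

```python
import math
from statistics import NormalDist


def n_per_group(effect, power=0.80, alpha=0.05):
    """Approximate size of each of two equal groups (Normal approximation, two-sided test)."""
    if effect <= 0:
        raise ValueError("the standardized effect must be positive")
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1.0 - alpha / 2.0)   # Type I error split across two tails
    z_power = normal.inv_cdf(power)               # power = 1 - risk of a Type II error
    return math.ceil(2.0 * (z_alpha + z_power) ** 2 / effect ** 2)


for effect, power in ((1.0, 0.80), (14 / 18, 0.80), (0.6, 0.90)):
    n = n_per_group(effect, power)
    print(f"effect={effect:.2f}, power={power:.0%}: {n} per group, {2 * n} in total")
```

For the three cases discussed above the approximation gives per-group values at or just below those read from the chart.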
Some have argued that this model is overcautious and results in recommended sample sizes that are larger than are clinically necessary (with important ethical and practical implications), focusing instead on estimation based on clinical effect (e.g. benefits, harm). A related, alternative approach to sample-size determination is to explicitly include measures of cost, in particular attempting to place a cost on each type of error. The total cost is then the risk of a Type I error times the cost of this error plus the risk of a Type II error times the cost of this error plus the cost of the experiment or research exercise. This approach makes a great deal of sense, but allocating costs to the different types of error can be very difficult. If it is possible to produce such costs the impact of increasing sample size can be examined. In broad terms as sample size increases the Type I and Type II errors reduce so the costs associated with these risks will decrease, but the cost of the experiment will increase and may be infeasible for practical or ethical reasons. Incrementally increasing the sample size may achieve a result whereby total costs are minimized and this value can then be used for the research exercise.
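The cost-based approach can be sketched as a simple search over candidate sample sizes. Everything numerical in the example below is hypothetical: the error costs, the cost per participant and the effect size are invented for illustration. The sketch also makes two simplifying assumptions: the Type I risk is held fixed at α, so that only the Type II risk falls as the sample grows, and power is computed with a Normal approximation for two equal groups.

```python
from statistics import NormalDist


def type_ii_risk(n_per_group, effect, alpha=0.05):
    """Approximate risk of a false negative for two equal groups (Normal approximation)."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)
    power = normal.cdf(effect * (n_per_group / 2.0) ** 0.5 - z_crit)
    return 1.0 - power


def total_cost(n_per_group, effect, cost_type_i, cost_type_ii, cost_per_subject, alpha=0.05):
    """Risk-weighted cost of both error types plus the cost of running the study."""
    return (alpha * cost_type_i
            + type_ii_risk(n_per_group, effect, alpha) * cost_type_ii
            + cost_per_subject * 2 * n_per_group)


def cheapest_sample_size(effect, cost_type_i, cost_type_ii, cost_per_subject,
                         candidates=range(2, 501)):
    """Per-group sample size, among the candidates, with the lowest total cost."""
    return min(candidates,
               key=lambda n: total_cost(n, effect, cost_type_i, cost_type_ii, cost_per_subject))


# Hypothetical figures: each error costs 1,000,000 units, each participant 1,000 units.
best = cheapest_sample_size(effect=0.5, cost_type_i=1e6, cost_type_ii=1e6, cost_per_subject=1e3)
print(best, round(total_cost(best, 0.5, 1e6, 1e6, 1e3)))
```

With different cost assumptions the minimum moves, which reflects the practical difficulty noted above: the result is only as good as the costs assigned to each type of error.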
Particular issues arise in connection with rare events, for example when conducting trials of a vaccine that protects against a relatively rare disease, or when investigating suspected links between particular cancers and point sources of environmental pollution. In the former case, it may be necessary to carry out a trial involving very large numbers of individuals in order to identify a statistically significant effect. This was the case with early trials of the Salk Polio vaccine in the USA, in 1954, following Polio epidemics in 1952 and 1953. The estimated normal rate of infection at the time was around 50 per 100,000 population, but this still represented a large number of people (typically children). To obtain a target of approximately 100 confirmed cases of polio based on the normal incidence would require a study group of 200,000 children. In the event, a randomized control trial (RCT) involving two groups of approximately 200,000 children using a double-blind assignment of subjects was undertaken, with one group being given the Salk vaccine and the other a saline placebo. An extract of the core results is shown below. The success of the RCT led to the rapid rollout of the Salk vaccine and then other, preferred vaccines, in the immediate aftermath, ultimately leading to the virtual eradication of Polio worldwide today. However, many aspects of the overall trial process were deeply flawed, with a large part of the trial (which was not in RCT form) described by Brownlee (1955, [BRO1]) as "futile" and "worthless".
USA Salk Vaccine Randomized Control Trial, 1954, Table 2b extract

| Experiment | Study Group | Population | Polio Cases: Paralytic | Polio Cases: Non-Paralytic |
| --- | --- | --- | --- | --- |
| Randomized Control | Vaccinated | 200,745 | 33 | 24 |
| Randomized Control | Placebo | 201,229 | 115 | 27 |

source: Francis and Korns (1955, [FRA1],[FRA2])
A simple form of analysis of this kind of data is to compute the effectiveness of the treatment by comparing the rates of infection per 100,000 in the vaccinated (r1) and placebo (r2) groups. The effectiveness measure is then E=100(1−r1/r2)% giving a result of E=72% in this case. Data of this kind can be analyzed in a number of different ways. A simple approach is to consider the probability of observing x=33 or fewer paralytic cases of polio amongst those who were vaccinated as against 115 in the placebo group, both groups having been drawn from large equal sized populations. Using the null hypothesis that from a total of n=148 severe cases one would expect each group to have roughly half the total, hence p0=0.5, we can use a simple z-transform of Binomial form:
$$z = \frac{x - n p_0}{\sqrt{n p_0 (1 - p_0)}} = \frac{33 - 74}{\sqrt{37}} \approx -6.7$$
which is a very large value in absolute terms, hence extremely unlikely to have arisen by chance. By comparison, the numbers of non-paralytic cases were quite similar and very likely to have arisen by chance. This is the approach adopted by Francis and Korns [FRA2, Administrative content section, pp. 62-63].
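The calculations for the trial extract can be reproduced with the sketch below, using the counts from the table. The z statistic assumes the standard Binomial form z = (x − n·p0)/√(n·p0·(1−p0)). The dictionary layout and function names are illustrative, and the effectiveness computed from this extract alone may differ slightly from the rounded figure quoted in the text.

```python
import math
from statistics import NormalDist

# Counts taken from the Table 2b extract.
vaccinated = {"population": 200_745, "paralytic": 33, "non_paralytic": 24}
placebo = {"population": 201_229, "paralytic": 115, "non_paralytic": 27}


def rate_per_100k(cases, population):
    """Cases per 100,000 population."""
    return 100_000 * cases / population


def effectiveness(rate_treated, rate_control):
    """E = 100 * (1 - r1 / r2), in percent."""
    return 100.0 * (1.0 - rate_treated / rate_control)


def binomial_z(x, n, p0=0.5):
    """z-transform of a Binomial count x out of n under the null proportion p0."""
    return (x - n * p0) / math.sqrt(n * p0 * (1.0 - p0))


r1 = rate_per_100k(vaccinated["paralytic"], vaccinated["population"])
r2 = rate_per_100k(placebo["paralytic"], placebo["population"])
print(f"effectiveness against paralytic polio: {effectiveness(r1, r2):.1f}%")

for kind in ("paralytic", "non_paralytic"):
    x = vaccinated[kind]
    n = vaccinated[kind] + placebo[kind]
    z = binomial_z(x, n)
    # Lower-tail probability of a count this small or smaller under the null hypothesis.
    print(kind, round(z, 2), NormalDist().cdf(z))
```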
References
[ALT1] Altman D G (1982) How Large a Sample? in Gore S M, Altman D G eds. Statistics in Practice. BMA, London
[BAR1] Bartlett J E II, Kotrlik J W, Higgins C (2001) Organizational research: Determining appropriate sample size for survey research. Information Technology, Learning, and Performance Journal, 19(1), 43-50
[BEY1] Beyer W H (1966) Handbook of Tables for Probability and Statistics. Chemical Rubber Co., Cleveland, OH
[BRO1] Brownlee K A (1955) Statistics of the 1954 Polio Vaccine Trials. J American Stat. Assoc., 50, 1005-1013
[CAR1] Carley S, Dosman S, Jones S R, Harrison M (2003) Simple nomograms to calculate sample size in diagnostic studies. Emerg. Med. J (EMJ), 22, 180-181
[CHO1] Chow J, Shao J, Wang H (2009) Sample Size Calculations in Clinical Research. 2nd edition, Chapman & Hall/CRC Biostatistics
[COC1] Cochran W G (1977) Sampling Techniques. 3rd ed., J Wiley, New York
[DES1] de Smith M J, Goodchild M F, Longley P A (2018) Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools. 6th edition, The Winchelsea Press, UK. Available from: http://www.spatialanalysisonline.com/
[DUP1] Dupont W D, Plummer W D (1990) Power and sample size calculations. Controlled Clinical Trials, 11, 116-128
[FER1] Ferris C D, Grubbs F E, Weaver C L (1946) Operating Characteristics for the Common Statistical Tests of Significance. Annals of Mathematical Stats, 17(2), 178-197
[FRA1] Francis T, Korns R F (1955) Evaluation of 1954 Field Trial of Poliomyelitis Vaccine: Synopsis of Summary Report. The Amer. J of the Medical Sciences, 603-612
[FRA2] Francis T, Korns R F (1955) An Evaluation of the 1954 Poliomyelitis Vaccine Trials. American J Public Health, 45(5 Pt 2), entire edition
[HEY1] Hedayat A S, Sinha B K (1991) Design and inference in finite population sampling. John Wiley & Sons, New York
[JON1] Jones S R, Carley S, Harrison M (2003) An introduction to power and sample size estimation. Emerg. Med. J (EMJ), 20, 453-458
[MAC1] Machin D, Campbell M J, Fayers P, Pinol A (1987) Sample Size Tables for Clinical Studies. Blackwell Science Ltd, Oxford
[MAK1] Mackay R J, Oldford R W (2000) Scientific method, statistical method, and the speed of light. Working Paper 2000-02, Department of Statistics and Actuarial Science, University of Waterloo, Ontario, Canada
[WHI1] Whitley E, Ball J (2002) Statistics Review 4: Sample size calculations. Critical Care, 6, 335-341
NIST/Sematech Engineering Statistics Handbook: Sample size: http://www.itl.nist.gov/div898/handbook/prc/section2/prc222.htm