The standard test for a simple proportion, p, is based on the use of the Binomial distribution or a z-transform of the data for large sample sizes. Tests for multiple proportions typically are based on the chi-square distribution, as used in contingency table analysis. With multiple proportions the Multinomial distribution can be used in a Multinomial test.
Assumptions: The sample is random and the population is large (at least 10 times the sample size), possibly infinite.
Hypothesis: H0: p=p0
Test: For small samples use the percentage points of the Binomial distribution or the direct facilities provided in software packages (usually referred to as a one- or two-sample Binomial test). For large samples, typically n>150 (see also the comments on confidence intervals below), compute the z-statistic:

$$ z = \frac{|x - np_0| - \tfrac{1}{2}}{\sqrt{np_0(1-p_0)}} $$
where x is the count of events (i.e. p=x/n) and the 1/2 in the numerator is a correction for continuity (this has a small effect for large n). Note that this expression is of the general form: (p-p0)/SE where SE is the standard error, as described in the subsection on sample size.
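As an illustration, the large-sample test can be sketched in a few lines of Python using only the standard library. The function name and the example figures (60 events in 200 trials, p0=0.25) are hypothetical, chosen purely for demonstration. The sketch keeps the sign of x-n·p0 so that the direction of the departure from p0 remains visible.

```python
import math
from statistics import NormalDist


def proportion_z_test(x, n, p0):
    """Continuity-corrected z-test of H0: p = p0 (large-sample approximation)."""
    expected = n * p0                          # expected count under H0
    se_count = math.sqrt(n * p0 * (1.0 - p0))  # standard error of the count
    deviation = x - expected
    # Continuity correction: shrink the deviation by 1/2 towards zero,
    # without letting it cross zero.
    corrected = math.copysign(max(abs(deviation) - 0.5, 0.0), deviation)
    z = corrected / se_count
    # Two-sided p-value from the unit Normal distribution
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Hypothetical data: 60 events in 200 trials, tested against p0 = 0.25
z, p_value = proportion_z_test(x=60, n=200, p0=0.25)
print(f"z = {z:.3f}, two-sided p = {p_value:.3f}")
```

Dividing the numerator and denominator by n recovers the general form (p-p0)/SE noted above, with the correction term becoming 1/(2n).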
Example: In our earlier discussion on rare events, analysis of the Salk Polio vaccine trials was described and an example significance test of the form above was given. The statistic computed was:
In the original report on the Polio vaccine trials (Appendix pp62-63, [FRA2]) the statistical procedure for evaluating the success of the trials is described. It differs slightly from the procedure described above, and is summarized here. The following variables were defined:
• N1 = number of children vaccinated (200,745)
• N2 = number of children receiving placebo (201,229)
• n1 = number of cases amongst vaccinated children (33)
• n2 = number of cases amongst placebo children (115)
The authors then defined the test expression:
which produces almost the same result as above because N1 is almost the same as N2.
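For readers who wish to experiment with these figures, the following Python sketch computes a conventional pooled two-sample z-statistic for the difference between the two attack rates. This is the standard textbook statistic, offered as an assumption for illustration; it is not necessarily the expression used in the original report. The function name is hypothetical.

```python
import math


def two_proportion_z(n1, N1, n2, N2):
    """Pooled two-sample z-statistic for H0: both groups share one rate."""
    p1 = n1 / N1                    # attack rate, vaccinated group
    p2 = n2 / N2                    # attack rate, placebo group
    pooled = (n1 + n2) / (N1 + N2)  # common rate under H0
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / N1 + 1.0 / N2))
    return (p1 - p2) / se


# Trial figures as listed above
z = two_proportion_z(n1=33, N1=200_745, n2=115, N2=201_229)
print(f"z = {z:.2f}")
```

A large negative value indicates a markedly lower attack rate amongst the vaccinated children than would be expected if the vaccine had no effect.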
For small samples the terms of the Binomial distribution may be used, using standard distribution tables or built-in software functions (e.g. the R function, binom.test). For finite populations the z-value at a given significance level should be multiplied by an adjustment factor (the finite population correction) of √((N-n)/(N-1)), where N is the population size and n the sample size. For large N and small n the adjustment has a negligible impact, as the ratio will be very close to 1.
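The small-sample approach can be illustrated by summing Binomial terms directly. The Python sketch below computes a two-sided exact p-value as the total probability, under H0, of all outcomes that are no more likely than the observed count. This is one common convention for two-sided exact tests, and it is an assumption here that a given package adopts the same rule. The function names and example data are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Probability of exactly k events in n trials with event probability p."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)


def exact_binomial_test(x, n, p0):
    """Two-sided exact p-value for H0: p = p0."""
    terms = [binom_pmf(k, n, p0) for k in range(n + 1)]
    observed = terms[x]
    # Small relative tolerance so that outcomes tied with the observed
    # probability are not lost to floating-point rounding.
    threshold = observed * (1.0 + 1e-7)
    p_value = sum(t for t in terms if t <= threshold)
    return min(1.0, p_value)


# Hypothetical data: 2 events in 10 trials, tested against p0 = 0.5
p_value = exact_binomial_test(x=2, n=10, p0=0.5)
print(f"exact two-sided p = {p_value:.6f}")
```

In this symmetric example the outcomes 0, 1, 2, 8, 9 and 10 contribute, giving 2(1+10+45)/1024 = 0.109375.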
For large samples, 100(1-α)% confidence intervals for p are conventionally defined by:

$$ p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}, \qquad p = \frac{x}{n} $$

where the z-values are the percentage points of the unit Normal (e.g. 1.96 for 95% confidence intervals) and x and n are as defined above. Brown et al. (2001, [BRO1]) have pointed out that the use of this simple approximation to the Binomial as a means of determining confidence intervals is fraught with problems for many specific values of n and p, even when these are large. They recommend using the exact Binomial for testing where possible (see further, below), or, for n>40 and the special 95% case, using the formula above but with n'=n+4 and p'=(x+2)/n'. This adjustment is preferable to the standard formula even for smaller sample sizes.
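The difference between the conventional interval, p ± z·√(p(1-p)/n), and the adjusted form can be seen with a short Python sketch. The function names and the example data (5 events in 50 trials) are hypothetical, and the adjusted version is coded for the 95% case only, as in the recommendation above. Neither function truncates the limits to the range [0,1], which a practical implementation might wish to do.

```python
import math
from statistics import NormalDist


def wald_interval(x, n, confidence=0.95):
    """Conventional large-sample interval: p +/- z * sqrt(p(1-p)/n)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    p = x / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


def adjusted_interval_95(x, n):
    """95% interval using n' = n + 4 and p' = (x + 2)/n'."""
    z = NormalDist().inv_cdf(0.975)  # approximately 1.96
    n_adj = n + 4
    p_adj = (x + 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return p_adj - half_width, p_adj + half_width


# Hypothetical data: 5 events in 50 trials
wald_low, wald_high = wald_interval(5, 50)
adj_low, adj_high = adjusted_interval_95(5, 50)
print(f"conventional: ({wald_low:.4f}, {wald_high:.4f})")
print(f"adjusted:     ({adj_low:.4f}, {adj_high:.4f})")
```

With a small observed proportion the adjusted interval is shifted towards 0.5 and is slightly wider, which is the behaviour that improves its coverage.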
Most statistical software packages provide tools for computing Binomial probabilities and/or calculating significance levels for computations on proportions. Some packages (such as XLStat) provide very simple analyses based on z-tests, whilst others (such as SPSS) provide a range of tests for different data types and sample sizes. The standard R distribution includes two tests for proportions: the exact Binomial test, binom.test(), and the more general prop.test() function. The latter is essentially a chi-square test of the type described in our contingency table analysis section, where equality of proportions is being evaluated. By default it applies Yates' continuity correction.
Probability levels and confidence intervals can be computed by simulation, and some software packages offer this option (e.g. for comparison of multiple proportions), but for the extremes of the distribution this may require a very large number of iterations.
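A minimal Python sketch of the simulation approach, for a single proportion, is given below. It repeatedly draws samples under H0 and records how often the simulated count lies at least as far from its expected value as the observed count. The function name, the seed, the number of iterations and the example data are all hypothetical choices made for illustration.

```python
import random


def simulated_p_value(x, n, p0, iterations=20_000, seed=12345):
    """Monte Carlo two-sided p-value for H0: p = p0."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    expected = n * p0
    observed_dev = abs(x - expected)
    extreme = 0
    for _ in range(iterations):
        # One simulated sample of n Bernoulli trials under H0
        count = sum(1 for _trial in range(n) if rng.random() < p0)
        if abs(count - expected) >= observed_dev:
            extreme += 1
    # Adding 1 to numerator and denominator avoids reporting exactly zero
    return (extreme + 1) / (iterations + 1)


# Hypothetical data: 2 events in 10 trials, tested against p0 = 0.5
p_sim = simulated_p_value(x=2, n=10, p0=0.5)
print(f"simulated two-sided p = {p_sim:.3f}")
```

The smallest value this procedure can report is 1/(iterations+1), which illustrates why probabilities in the extreme tails require a very large number of iterations to estimate reliably.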