The terms of the Binomial give the probability of x successes out of n trials, for example 3 heads in 10 tosses of a coin, where p=probability of success and q=1-p=probability of failure. As discussed at some length in the Historical context section, the Binomial is perhaps the first example of the use of frequency distributions and the study of its properties was a key component in the early development of statistics as a distinct discipline. The Bernoulli distribution is the special case of the Binomial with a single trial (n=1), for which there are two possible outcomes: x=1 with probability p, and x=0 with probability 1-p.
The term “Binomial” is used because the individual terms of the distribution are based on the expansion of the binomial series B(p,q,n)=(p+q)ⁿ. For n=0 this yields B(p,q,0)=1, for n=1 we have simply B(p,q,1)=p+q, for n=2 we have B(p,q,2)=p²+2pq+q², and so on for higher values of n.
Let f(x) denote the probability distribution associated with a discrete-valued variable x. Let p be the probability of an event occurring (e.g. heads, success) and let q=1-p be the probability that the event does not occur (tails, failure), so that p+q=1. We then consider a series of n events and assume all events are completely independent of one another. For example, the sequence ppqqppqp, with 5 p events and 3 not-p events, has probability p⁵q³, and is one of 56 possible arrangements of 5 p events and 3 q events. Then the Binomial probability distribution function (pdf) is defined as:

$$f(x) = \binom{n}{x} p^x q^{n-x}, \qquad x = 0, 1, \dots, n$$

where the binomial coefficient n!/(x!(n-x)!) counts the number of arrangements of x successes among n trials.
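As a minimal sketch of how the individual terms can be evaluated, the snippet below assumes the standard form f(x) = C(n,x) pˣ qⁿ⁻ˣ, where C(n,x) is the binomial coefficient. The function name `binomial_pmf` is illustrative only and does not come from the text.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    q = 1.0 - p
    return comb(n, x) * p**x * q**(n - x)

# Example from the text: 3 heads in 10 tosses of a fair coin (120/1024).
prob_3_heads = binomial_pmf(3, 10, 0.5)

# The terms for x = 0..n are the expansion of (p + q)^n, so they sum to 1.
total = sum(binomial_pmf(x, 10, 0.5) for x in range(11))

print(prob_3_heads, total)
```

For the coin example this gives 120/1024, or roughly 0.117.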
This distribution has mean μ=np and variance σ²=npq, so the standard deviation is σ=√(npq). In a suitable controlled trial, with independent events and constant probabilities, the best estimates for the population mean and variance are the sample mean and variance.
If we take the example of tossing an unbiased coin, then p=0.5, q=0.5 and in 100 tosses we would expect 100×0.5=50 heads on average, or with 18 tosses we expect 9 on average (see Chart A, case p=0.5, below). The typical spread of values about this mean would be ±√(npq) = ±√4.5 ≈ ±2.1, so roughly the range [7,11]. In this example p=q=0.5 and the shape of the distribution is symmetric (Skewness=0). In all other cases the distribution is skewed. For large n the skewness is reduced and disappears as n becomes very large, more rapidly if p is close to 0.5. Also, as n becomes large, the Kurtosis (peakedness) of the distribution tends to a value of 3. Note that the Kurtosis for the Normal Distribution (which provides an approximation to the Binomial, see further below) is quoted either as 3 or, more commonly, as 0, since 3 is generally deducted first.
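The moments quoted above can be checked numerically. The sketch below computes them by direct summation over the Binomial terms rather than from closed-form expressions. The function name `binomial_moments` and the comparison case p=0.1 are assumptions made for illustration.

```python
from math import comb, sqrt

def binomial_moments(n: int, p: float) -> dict:
    """Mean, variance, skewness and kurtosis computed directly from the pmf."""
    q = 1.0 - p
    pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]
    mean = sum(x * f for x, f in enumerate(pmf))
    # Central moments of order 2, 3 and 4.
    m2, m3, m4 = (
        sum((x - mean) ** k * f for x, f in enumerate(pmf)) for k in (2, 3, 4)
    )
    return {
        "mean": mean,
        "variance": m2,
        "sd": sqrt(m2),
        "skewness": m3 / m2**1.5,
        "kurtosis": m4 / m2**2,  # without deducting 3; equals 3 for the Normal
    }

fair = binomial_moments(18, 0.5)    # the 18-toss coin example
skewed = binomial_moments(18, 0.1)  # an illustrative asymmetric case

print(fair)
print(skewed)
```

For the 18-toss case the mean is 9, the variance 4.5 and the skewness 0, while the p=0.1 case shows positive skewness.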
We have noted previously that the Binomial tends to the Normal Distribution when n is large and p is not small, with the approximation being best when p is close to 0.5 because in this case the Binomial is symmetric for all n. This feature of the Binomial was first noted by De Moivre, who provided quite a complicated derivation of a series approximation to the Binomial. More typically the approximation is seen as a manifestation of the Central Limit Theorem, whereby the Binomial variable, x, is standardized in the usual manner (as a z-transform) and z then tends to a unit Normal Distribution as n→∞, where

$$z = \frac{x - np}{\sqrt{npq}}$$
As can be seen, subtracting the mean value, μ=np, from each x and then dividing by the standard deviation (to ensure the standard deviation=1) yields a real-valued variable with mean=0 and a range that is potentially (-∞,+∞).
Using the standard integral expression for the Normal Distribution we have:
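A standard form of this expression for a range of values [α,β] is sketched below. It is given here as an assumption rather than as the source's own equation, and it uses the standardization z=(x-np)/√(npq) with t as a dummy integration variable.

```latex
P(\alpha \le x \le \beta) \approx \frac{1}{\sqrt{2\pi}} \int_{z_\alpha}^{z_\beta} e^{-t^{2}/2}\,dt,
\qquad
z_\alpha = \frac{\alpha - np}{\sqrt{npq}}, \quad
z_\beta = \frac{\beta - np}{\sqrt{npq}}
```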
If we fit a Normal distribution to the Binomial (see Chart B, below) we find that in this case, with n=50, p=q=0.5, the fit appears to be extremely good.
However, to compute the probabilities associated with a particular range of values [α,β] we must integrate the expression over this range and compare the results to the exact values obtained with the Binomial expansion. Unless n is very large and p is close to 0.5, this approximation is actually not all that good. One key reason for this is that the Normal is a continuous distribution being used to approximate a discrete distribution. A useful solution is to adjust the values used to compute the range in the integral by setting:
where the special brackets mean the floor function, i.e. 'the largest integer not greater than'. As can be seen, the adjustment decreases the lower bound slightly and increases the upper bound, effectively providing a form of smoothing or continuity correction. The Normal approximation, and a number of other approximations to the Binomial, were important historically for computational reasons. However, in recent years the use of computers has made them largely redundant.
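The effect of the adjustment can be illustrated numerically. The sketch below assumes the usual half-unit continuity correction for integer bounds (lower bound reduced by 0.5, upper bound increased by 0.5), which may differ in detail from the exact expression intended above. The function names are illustrative, and the range [20,30] is a hypothetical choice.

```python
from math import comb, erf, sqrt

def binomial_range_prob(a: int, b: int, n: int, p: float) -> float:
    """Exact P(a <= x <= b) by summing Binomial terms."""
    q = 1.0 - p
    return sum(comb(n, x) * p**x * q**(n - x) for x in range(a, b + 1))

def normal_cdf(z: float) -> float:
    """Unit Normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_range_prob(a: int, b: int, n: int, p: float,
                      correction: float = 0.5) -> float:
    """Normal approximation to P(a <= x <= b).

    correction=0.5 applies the assumed half-unit continuity correction;
    correction=0.0 integrates over [a, b] exactly as given.
    """
    mu = n * p
    sigma = sqrt(n * p * (1.0 - p))
    z_lo = (a - correction - mu) / sigma
    z_hi = (b + correction - mu) / sigma
    return normal_cdf(z_hi) - normal_cdf(z_lo)

n, p, a, b = 50, 0.5, 20, 30
exact = binomial_range_prob(a, b, n, p)
plain = normal_range_prob(a, b, n, p, correction=0.0)
corrected = normal_range_prob(a, b, n, p)

print(f"exact={exact:.4f} uncorrected={plain:.4f} corrected={corrected:.4f}")
```

For n=50, p=0.5 and this range, the corrected value lies noticeably closer to the exact sum than the uncorrected one.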
The Binomial distribution can be truncated such that lower or upper values are excluded. The simplest truncation is the omission of the event x=0, in which case the distribution function remains unchanged but is divided, as expected, by (1-qⁿ). This variant is sometimes called the Positive Binomial, but since there is a distinctly different distribution known as the Negative Binomial, this terminology is generally avoided. Likewise the mean remains unchanged but is again divided by (1-qⁿ), whilst the variance is:

$$\frac{npq}{1-q^{n}} - q^{n}\left[\frac{np}{1-q^{n}}\right]^{2}$$
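As a quick check on these expressions, the sketch below builds the zero-truncated distribution and compares its mean and variance, obtained by direct summation, with the closed forms quoted above. The values n=10, p=0.2 and the function name are hypothetical choices made for illustration.

```python
from math import comb

def truncated_binomial_pmf(x: int, n: int, p: float) -> float:
    """Binomial pmf with x = 0 excluded, rescaled by 1 / (1 - q^n)."""
    q = 1.0 - p
    if x < 1 or x > n:
        return 0.0
    return comb(n, x) * p**x * q**(n - x) / (1.0 - q**n)

n, p = 10, 0.2
q = 1.0 - p
pmf = [truncated_binomial_pmf(x, n, p) for x in range(n + 1)]

# Moments by direct summation (the x = 0 term contributes nothing).
mean_numeric = sum(x * f for x, f in enumerate(pmf))
var_numeric = sum((x - mean_numeric) ** 2 * f for x, f in enumerate(pmf))

# Closed forms quoted in the text.
mean_formula = n * p / (1.0 - q**n)
var_formula = n * p * q / (1.0 - q**n) - q**n * (n * p / (1.0 - q**n)) ** 2

print(mean_numeric, mean_formula)
print(var_numeric, var_formula)
```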
A number of other important variations of the Binomial should be mentioned at this point. The first is the limit of the Binomial as p→0 whilst n→∞ and np remains fixed. This leads to the Poisson distribution, the derivation of which is described in the section on the Poisson distribution.
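The limiting behaviour can be seen numerically. The sketch below holds np fixed at a hypothetical value λ=2 and measures the largest gap between the Binomial terms and the Poisson terms e^(-λ)λˣ/x! as n grows. The function names are illustrative.

```python
from math import comb, exp, factorial

def binomial_pmf(x: int, n: int, p: float) -> float:
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)

def poisson_pmf(x: int, lam: float) -> float:
    return exp(-lam) * lam**x / factorial(x)

def max_abs_diff(n: int, lam: float, x_max: int = 10) -> float:
    """Largest absolute gap between the two pmfs over x = 0..x_max, with p = lam / n."""
    p = lam / n
    return max(
        abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(x_max + 1)
    )

lam = 2.0  # np is held fixed at this value
for n in (10, 100, 1000, 10000):
    print(n, lam / n, max_abs_diff(n, lam))
```

The printed gap shrinks as n increases and p=λ/n falls towards 0.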
The second related distribution is the Hypergeometric. This distribution is based on the Binomial but takes into account situations in which the population is relatively small (e.g. 100 or less) and sampling takes place without replacement. An example is sampling 20 individuals from a population in which a known proportion, p, have received a certain treatment (e.g. a vaccination). As each individual is sampled, the proportion of remaining individuals who have been vaccinated will have altered, so the Binomial model needs to be adjusted to reflect this.
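To illustrate the difference, the sketch below compares the Hypergeometric and Binomial terms for a sample of 20, using the standard Hypergeometric form C(K,x)C(N-K,n-x)/C(N,n). The population sizes N, the proportion p=0.4 and the function names are hypothetical values chosen for illustration, not figures from the text.

```python
from math import comb

def hypergeometric_pmf(x: int, N: int, K: int, n: int) -> float:
    """P(x marked items in a sample of n drawn without replacement
    from a population of N that contains K marked items)."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)

def binomial_pmf(x: int, n: int, p: float) -> float:
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)

def max_abs_diff(N: int, n: int = 20, p: float = 0.4) -> float:
    """Largest gap between the two pmfs when K = round(N * p) are vaccinated."""
    K = round(N * p)
    return max(
        abs(hypergeometric_pmf(x, N, K, n) - binomial_pmf(x, n, p))
        for x in range(n + 1)
    )

for N in (50, 500, 5000):  # hypothetical population sizes
    print(N, max_abs_diff(N))
```

The gap between the two models narrows as the population grows relative to the sample, which is why the Binomial is usually adequate for large populations.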
The Multinomial distribution is a further variation, in which more than two outcomes are possible (the Binomial only deals with true/false, heads/tails cases). An example would be when examining the probable mix of events when the population contains a number of distinct groups, such as blood types. In principle such problems can be reduced to a Binomial by grouping, for example taking the case of blood type "A negative", and comparing this with all other blood types taken together. Obviously there is more or less loss of information when reducing the problem to two categories, but it avoids the need for using the somewhat awkward multinomial.
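The grouping argument can be checked with a small sketch. It assumes the standard Multinomial form n!/(x₁!…x_k!) p₁^x₁…p_k^x_k and uses three hypothetical category probabilities (not real blood-type frequencies). Summing over every split of the remaining categories recovers the Binomial term for the category of interest.

```python
from math import comb, factorial

def multinomial_pmf(counts, probs) -> float:
    """P(counts[i] items fall in category i) in sum(counts) independent draws."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)  # exact integer division at every step
    prob = float(coeff)
    for c, p in zip(counts, probs):
        prob *= p**c
    return prob

probs = (0.5, 0.3, 0.2)  # hypothetical probabilities for three categories
n, x = 6, 2              # 6 draws, 2 of them in the first category

# Group categories 2 and 3 together by summing over every way the
# remaining n - x draws can be split between them.
grouped = sum(
    multinomial_pmf((x, j, n - x - j), probs) for j in range(n - x + 1)
)

# Binomial term for "first category" versus "everything else".
binom = comb(n, x) * probs[0] ** x * (1.0 - probs[0]) ** (n - x)

print(grouped, binom)
```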
Key measures for the Binomial distribution are provided in the table below:
The Binomial distribution assumes that events are independent and that the probabilities of events occurring are constant over time. Where sampling without replacement takes place, the population size typically needs to be at least 100; if it is not, the Hypergeometric distribution should be used. The Binomial distribution also assumes that events are binary, so that the cases True/False, Heads/Tails etc. apply; it cannot model three or more outcomes explicitly. As noted above, such cases can be modeled using the Binomial as a series of separate binary events, either by grouping or by taking one event at a time and comparing this with the event not occurring.