The hypergeometric distribution is, in essence, the counterpart of the Binomial for sampling without replacement from a finite population. Whereas the Binomial assumes that there are n independent trials of an experiment, with a fixed probability, p, which is the same for every trial, the hypergeometric deals with the situation in which the population size, N, from which events are sampled, is relatively small (<100) and sampling takes place without replacement (hence the probabilities change from one draw to the next). A simple analogy is selecting balls from a bag or urn containing a mix of red and black balls. The Binomial applies if, after each random selection, the ball is replaced, whilst the hypergeometric deals with the case where the balls are not replaced, so the next ball selected is drawn from a slightly different overall mix of red and black. If there are a very large number of balls in the urn, and we are not selecting too many, there is effectively no difference between the two distributions.

To take a tangible example, suppose we are sampling n=5 individuals from a population, N, of 30, for which it is known that p=30% have a particular medical condition. What is the probability that at most 2 have the condition? Since we are considering 'at most 2' we need the sum of the probabilities for the cases x=0, 1 and 2, where x is the number of individuals in the sample found to have the condition. In each case x individuals are drawn from the Np individuals in the population who have the condition, and the remainder (n-x) come from the N-Np who do not have the condition. Letting k=Np (here k=9), the probability distribution is simply:

$$P(X=x) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}$$

This expression states that the probability of obtaining a particular number of cases, x, who will have the condition equals the number of ways of choosing x from k (which is the number of cases in the population that have the condition) times the number of ways of choosing the remaining cases (n-x) from those who do not have the condition (N-k), divided by the total number of ways of choosing n cases from the total of N. The mean of this distribution is nk/N (essentially np, as per the Binomial).
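The expression can be evaluated directly with integer binomial coefficients. The following is a minimal sketch in Python, assuming only the standard library; the function name `hypergeom_pmf` and the variable names are illustrative choices, not part of the original text. It applies the formula to the medical example (N=30, k=9, n=5) and checks that the probabilities sum to 1 and that the mean equals nk/N.

```python
from math import comb

def hypergeom_pmf(x, N, k, n):
    """P(X = x): x 'successes' in a sample of n drawn without replacement
    from a population of N that contains k successes."""
    # Outside the support the probability is zero.
    if x < max(0, n - (N - k)) or x > min(n, k):
        return 0.0
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Medical example from the text: N = 30, k = Np = 9, sample size n = 5.
N, k, n = 30, 9, 5
pmf = [hypergeom_pmf(x, N, k, n) for x in range(n + 1)]

total = sum(pmf)                              # should be 1
mean = sum(x * p for x, p in enumerate(pmf))  # should equal n*k/N = 1.5
p_at_most_2 = sum(pmf[:3])                    # cases x = 0, 1 and 2

print(f"sum = {total:.6f}, mean = {mean:.6f}, nk/N = {n * k / N}")
print(f"P(X <= 2) = {p_at_most_2:.4f}")
```

Using exact integer coefficients and dividing only once keeps rounding error small for populations of this size.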

In the above example the Binomial gives the probability of finding 2 or fewer individuals with the medical condition in the sample as 84%, whereas the hypergeometric yields a value of 86%. Excel uses a more homely example: "A sampler of chocolates contains 20 pieces. Eight pieces are caramels, and the remaining 12 are nuts. If a person selects 4 pieces at random, the hypergeometric returns the probability that exactly 1 piece is a caramel." The answer in this case is roughly 36%. In fact, the full list of probabilities for the cases of 0, 1, 2, 3 and 4 caramels is: 0.1022, 0.3633, 0.3814, 0.1387, and 0.0144. This kind of model can be applied with larger samples and populations, as is illustrated in the 4 charts below. In each instance the population is N=250 with k=100 (so 40% of the population have the desired characteristic), and samples of sizes 50, 20, 10 and 5 are taken.

Hypergeometric distribution, N=250, k=100
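The chocolate sampler probabilities, and the distributions for N=250, k=100, can be recomputed with a few lines of Python. This is a sketch under stated assumptions: the helper `hypergeom_pmf` and the dictionary names are hypothetical, introduced here for illustration, and only the standard library is used.

```python
from math import comb

def hypergeom_pmf(x, N, k, n):
    """Hypergeometric probability of x successes in a sample of n."""
    if x < max(0, n - (N - k)) or x > min(n, k):
        return 0.0
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

# Chocolate sampler: N = 20 pieces, k = 8 caramels, n = 4 pieces selected.
caramels = [round(hypergeom_pmf(x, 20, 8, 4), 4) for x in range(5)]
print(caramels)  # [0.1022, 0.3633, 0.3814, 0.1387, 0.0144]

# Distributions for N = 250, k = 100 and sample sizes n = 50, 20, 10, 5.
charts = {}  # n -> list of probabilities for x = 0..n
modes = {}   # n -> most probable value of x
for n in (50, 20, 10, 5):
    charts[n] = [hypergeom_pmf(x, 250, 100, n) for x in range(n + 1)]
    modes[n] = max(range(n + 1), key=charts[n].__getitem__)
    print(f"n = {n:2d}: most likely x = {modes[n]}, "
          f"P = {charts[n][modes[n]]:.4f}")
```

Each list in `charts` holds the bar heights for one sample size, so it can be passed to any plotting tool to redraw the corresponding chart.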

Although some of these examples suggest that the hypergeometric is unlikely to have any serious application, Johnson and Kotz (1969, [JOH1]) cite a number of real-world examples that are worth mentioning. The first is in estimating the population of animals of a particular type in a capture-recapture programme, for example estimating the stock of a particular breed of fish in a lake. First, a known number of fish, k, are caught and tagged, and then returned to the lake. After a period of time to allow for full re-mixing of the stock (but not breeding) a further n are caught and the number, x, that are tagged is noted. Then the maximum likelihood estimator of N is nk/x. For example, suppose that 20 fish are caught, tagged and returned to the lake, and subsequently 20 fish are again caught, of which 5 are tagged; then the estimated population of fish of this type in the lake is 80.

Other example applications cited by Johnson and Kotz include the analysis of languages and their origins, and modeling exposure of children within a family to an infectious disease (measles). In this latter case the hypergeometric was used by Irwin (1954, [IRW1]) to model infection probabilities in a family of n children, each with a probability p of being infected, but taking into account that the infections are correlated. By assuming that the correlation was a constant (0<r<1), Irwin showed that the probabilities of infection could be modeled using the hypergeometric distribution taking r=-1/(N-1), but this implies N=-(1-r)/r. N is thus negative and, in general, not an integer, so this is a generalization of the standard hypergeometric.

As with the Binomial, the hypergeometric distribution can be extended to a multinomial equivalent (e.g. an urn containing more than two colors of ball), although few applications of this are cited in the literature.

Fisher's exact test, used in the analysis of cross-tabulated count data (contingency tables), relies on the hypergeometric for the computation of the exact probabilities.
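Returning to the capture-recapture example, the estimate nk/x can be checked numerically by scanning candidate population sizes and finding where the hypergeometric likelihood peaks. The sketch below is illustrative only: the function names `estimate_population` and `likelihood`, and the upper scan limit of 500, are assumptions made for this example. Exact fractions are used so that ties in the likelihood are detected reliably.

```python
from fractions import Fraction
from math import comb

def estimate_population(k, n, x):
    """Estimate N as n*k/x: k tagged, n caught later, x of those tagged."""
    if x == 0:
        raise ValueError("no tagged individuals recaptured; estimate undefined")
    return n * k / x

def likelihood(N, k, n, x):
    """Exact hypergeometric probability of x tagged in the second catch."""
    return Fraction(comb(k, x) * comb(N - k, n - x), comb(N, n))

# Fish example from the text: 20 tagged, 20 caught later, 5 of them tagged.
k, n, x = 20, 20, 5
print(estimate_population(k, n, x))  # 80.0

# N must be at least k + (n - x); the upper limit 500 is an arbitrary choice.
candidates = range(k + n - x, 501)
best = max(likelihood(N, k, n, x) for N in candidates)
maximisers = [N for N in candidates if likelihood(N, k, n, x) == best]
print(maximisers)  # [79, 80]
```

Because nk/x is a whole number in this example, the likelihood takes exactly the same value at N=79 and N=80, so the scan returns both; the value 80 quoted in the text is one of the two maximising population sizes.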

Key measures for the distribution are provided in the table below:

| Measure | Formula | Comments |
|---|---|---|
| Mean | nk/N | If the mean is known in advance then the population N can be estimated assuming n and k can be obtained readily |
| Variance | nk(N-k)(N-n)/(N²(N-1)) | Higher moments have a rather complicated form, but are cited in Johnson & Kotz (1969) and on the relevant Wikipedia and Mathworld pages |
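The mean and variance formulas in the table can be checked against values computed directly from the probability distribution. The following is a minimal sketch, assuming the standard library only; the function names are hypothetical and the three parameter sets are simply those used in the examples earlier in this section.

```python
from math import comb, isclose

def hypergeom_pmf(x, N, k, n):
    """Hypergeometric probability of x successes in a sample of n."""
    if x < max(0, n - (N - k)) or x > min(n, k):
        return 0.0
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)

def moments_from_pmf(N, k, n):
    """Mean and variance obtained by summing over the distribution."""
    probs = [hypergeom_pmf(x, N, k, n) for x in range(n + 1)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var

def moments_closed_form(N, k, n):
    """Mean and variance from the formulas in the table."""
    mean = n * k / N
    var = n * k * (N - k) * (N - n) / (N ** 2 * (N - 1))
    return mean, var

for N, k, n in [(30, 9, 5), (20, 8, 4), (250, 100, 50)]:
    m_sum, v_sum = moments_from_pmf(N, k, n)
    m_cf, v_cf = moments_closed_form(N, k, n)
    agree = isclose(m_sum, m_cf) and isclose(v_sum, v_cf)
    print(f"N={N}, k={k}, n={n}: mean={m_cf:.4f}, "
          f"variance={v_cf:.4f}, agree={agree}")
```

The factor (N-n)/(N-1) in the variance is what distinguishes it from the Binomial variance np(1-p) with p=k/N; it approaches 1 as N grows relative to n.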

[JOH1] Johnson N L, Kotz S (1969) Discrete distributions. Houghton Mifflin/J Wiley & Sons, New York

[IRW1] Irwin J O (1954) A distribution arising from the study of infectious diseases. Biometrika, 41, 266-268

Mathworld/Weisstein E W: Hypergeometric Distribution: http://mathworld.wolfram.com/HypergeometricDistribution.html