Contingency tables


The term contingency table appears somewhat outdated, as modern usage of the word contingency usually refers to planning for some kind of emergency or unexpected event. The term was introduced by Karl Pearson at the start of the 20th century to refer to cross-tabulations of data that have been recorded as counts. An example is a count of the number of children in a sample with fair or dark hair, tabulated against the color of their mother's hair (dark or fair). This would be a 2x2 contingency table, with two rows and two columns of counts.

Typically the rows and the columns of such tables represent distinct, i.e. mutually exclusive, categories, and the cell entries are the recorded counts. Sometimes the rows or columns are purely nominal categories (possibly defined somewhat subjectively), such as "Fair" and "Dark", or they might be classes derived from a continuous variable, such as "Low", "Medium" and "High" frequency sounds.
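As a minimal sketch of how such a table might be assembled, the following Python fragment tallies raw paired observations into a table of counts. The observations are invented purely for illustration, and the function name cross_tabulate is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical raw data: one (mother's hair, child's hair) pair per child.
observations = [
    ("Dark", "Dark"), ("Dark", "Fair"), ("Fair", "Fair"),
    ("Dark", "Dark"), ("Fair", "Dark"), ("Fair", "Fair"),
    ("Dark", "Dark"), ("Fair", "Fair"),
]


def cross_tabulate(pairs, row_levels, col_levels):
    """Count each (row, column) combination into a list-of-lists table."""
    counts = Counter(pairs)
    # Counter returns 0 for combinations that never occur.
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]


levels = ["Dark", "Fair"]
table = cross_tabulate(observations, levels, levels)

# The categories are mutually exclusive, so every observation lands in
# exactly one cell and the cells sum to the sample size.
total = sum(sum(row) for row in table)
```

Rows here correspond to the mother's hair color and columns to the child's; swapping the two only transposes the table.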

A contingency table can be regarded as a form of two-dimensional frequency distribution. The row totals form one set of frequencies (the row-wise marginal distribution), and the column totals form a second set (the column-wise marginal distribution). If the rows and columns are independent, the probability of each row-column cell can be estimated by the product of the row and column marginal probabilities, and the expected count is that product multiplied by the sample size. For cell 1 in the example above, the expected value under the assumption of independence is E = (e/n × g/n) × n = eg/n, where e and g are the marginal totals (one row total and one column total) for that cell and n is the overall total. The differences between the expected entries, E, and the actual or observed cell entries, O, can be used to help determine whether there is some form of relationship between the row and column variables. If the result indicates that the assumption of independence does not hold, this suggests that some kind of relationship exists, but it indicates neither the nature of the relationship (for example, whether it is causal) nor its strength. A number of measures of the strength of association have been devised, but all should be treated as 'indicative'.
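The expected-count calculation described above can be sketched in a few lines of Python. The observed counts are hypothetical, and the helper names expected_counts and pearson_chi_square are chosen here for illustration only. The statistic computed is the usual Pearson sum of (O - E)^2 / E over all cells.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Sum of (O - E)^2 / E over all cells of the table."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical observed counts, for illustration only.
observed = [[30, 10],
            [20, 40]]

expected = expected_counts(observed)      # [[20.0, 20.0], [30.0, 30.0]]
statistic = pearson_chi_square(observed)  # 100/20 + 100/20 + 100/30 + 100/30

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)
```

The statistic would then be compared with a chi-square distribution on df degrees of freedom. The sketch assumes every row and column total is positive, since a zero margin would give a zero expected count and a division by zero.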

A number of tests have been devised for analyzing data organized in this manner. The most widely used has been the chi-square contingency table test, which is essentially the same as the chi-square goodness-of-fit test discussed earlier. A similar test, which also uses the chi-square distribution and is now the recommended procedure, is the G-test. Both of these methods are approximations to the exact test, known as Fisher's exact test, which is based on the hypergeometric distribution. Equivalent information-theoretic tests may also be used, as described, for example, by Kullback (1959, Ch. 8 [KUL1]). Indeed, the G-test is the information-theoretic test of independence in a two-way classification table. For 2x2 tables, tables with small n, and tables with small values in individual cells (<5) that must not be combined or grouped with other entries, the exact test should be used.
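The two alternatives mentioned above can be sketched with the Python standard library alone. The function names are hypothetical. The G statistic is computed as 2 × the sum of O × ln(O/E) over all cells. The exact test is shown for a 2x2 table only, and its two-sided p-value uses one common convention (summing the probabilities of all tables with the same margins that are no more probable than the observed one); other two-sided conventions exist.

```python
from math import comb, log


def g_statistic(table):
    """G = 2 * sum(O * ln(O / E)), with E = row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    g = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            if o > 0:  # an empty cell contributes nothing to the sum
                e = row_totals[i] * col_totals[j] / n
                g += o * log(o / e)
    return 2.0 * g


def fisher_exact_two_sided(table):
    """Two-sided exact p-value for a 2x2 table via the hypergeometric distribution."""
    (a, b), (c, d) = table
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    lowest, highest = max(0, c1 - r2), min(r1, c1)
    p_observed = prob(a)
    # Small relative tolerance so floating-point ties are counted as ties.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical small-sample table, for illustration only.
small = [[3, 1],
         [1, 3]]
g = g_statistic(small)
p_exact = fisher_exact_two_sided(small)
```

With counts this small, the chi-square approximation behind the G statistic is unreliable, which is the situation in which the exact p-value is preferred.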

Analysis of more complex contingency tables is possible, for example where counts relating to three or more variables (dimensions) are recorded. Where the rows are a number of samples (representing data relating to one or more independent variables), the columns are outcomes or responses (dependent variables) and the entries are counts, analysis techniques similar to those applied in analysis of variance may be used. Typically this will employ some form of log-linear model, which is discussed further under the topic Generalized Linear Models (or GLIM) and more specifically in the topic on Poisson regression.
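As a rough illustration of the log-linear idea for a three-way table, the sketch below handles only the simplest case: the main-effects-only (mutual independence) model, for which the fitted counts have a closed form, namely the product of the three one-way marginal totals divided by n squared. It is not a general log-linear fitting routine, the function names are hypothetical, and the counts are invented for the example.

```python
from math import log


def mutual_independence_fit(counts):
    """Fitted counts for a 3-way table under the main-effects-only model."""
    I, J, K = len(counts), len(counts[0]), len(counts[0][0])
    cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
    n = sum(counts[i][j][k] for i, j, k in cells)
    # One-way marginal totals for each of the three variables.
    a = [sum(counts[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    b = [sum(counts[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    c = [sum(counts[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    return [[[a[i] * b[j] * c[k] / n ** 2 for k in range(K)]
             for j in range(J)] for i in range(I)]


def deviance(counts, fitted):
    """2 * sum(O * ln(O / fitted)); valid when fitted and observed totals agree."""
    total = 0.0
    for plane_o, plane_f in zip(counts, fitted):
        for row_o, row_f in zip(plane_o, plane_f):
            for o, f in zip(row_o, row_f):
                if o > 0:  # empty cells contribute nothing
                    total += o * log(o / f)
    return 2.0 * total


# Hypothetical 2x2x2 table of counts, for illustration only.
counts = [[[12, 8], [5, 15]],
          [[9, 11], [14, 6]]]
fitted = mutual_independence_fit(counts)
dev = deviance(counts, fitted)

# Residual degrees of freedom: IJK - 1 - (I-1) - (J-1) - (K-1).
df = 2 * 2 * 2 - 1 - 3
```

Models that include interaction terms generally need an iterative fit, for example through a Poisson regression routine in a statistics package.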


[KUL1] Kullback S (1959) Information theory and statistics. John Wiley & Sons Inc.