In statistics, the term confounding usually refers to variables that have been omitted from an analysis but which have an important association (correlation) with both the independent and the dependent variables. Confounding is thus closely connected to the notion of causality and cause-effect relationships. For example, research into the benefits of breastfeeding as compared with bottle feeding using formula milk suggests that many of the widely claimed benefits ("breast is best") simply cannot be proven (Rumbelow, 2009 [RUM1]). In many cases the evidence in favor of breastfeeding is either poorly established or affected by confounding factors, notably the kind of mother who chooses to breastfeed.

Taking another example, variable A may appear to be highly correlated with variable B, but this (by itself) provides no explanation for the relationship. If variable A is the height of 10,000 randomly selected individuals in some study area and variable B is the incidence of lung cancer amongst this sample, it would be extremely dubious to infer that tall people are more prone to such cancers, even though a highly significant association may be detected. The most basic confounding factor here is age (or more generally, time in other areas of data analysis): we have not excluded infants and children from our sample, and they are both shorter and far less likely to have developed lung cancer than adults. In medical studies, age and sex almost always warrant inclusion as possible confounding factors, even where such factors are not expected to be of significance. It is also important to note that factors such as age may be explanatory without being specifically causal; confounding variables may be surrogates for underlying causal factors.
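The height and lung cancer example can be illustrated with a small simulation. The sketch below is not based on real data: the growth curve, the risk function, and the names `simulate` and `pearson` are all invented for illustration. Cancer risk is made to depend on age alone, yet height and cancer are still correlated across the whole sample because age drives both.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=10_000, seed=1):
    """Return (age, height_cm, has_cancer) tuples from an invented model."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        age = rng.uniform(0, 80)
        # Height rises with age until 18, then plateaus (invented figures).
        mean_height = 50 + 120 * min(age, 18) / 18
        height = rng.gauss(mean_height, 8)
        # Cancer risk depends on age only, never on height (assumption).
        risk = 0.005 * max(age - 40, 0)
        cancer = 1 if rng.random() < risk else 0
        people.append((age, height, cancer))
    return people


people = simulate()
crude = pearson([h for _, h, _ in people], [c for _, _, c in people])

# Restricting to adults removes the part of the height variation due to age.
adults = [p for p in people if p[0] >= 18]
within_adults = pearson([h for _, h, _ in adults], [c for _, _, c in adults])

print(f"all ages:    r = {crude:.3f}")
print(f"adults only: r = {within_adults:.3f}")
```

Under these assumptions the correlation over all ages is clearly positive, while the correlation amongst adults alone should be close to zero, since height no longer carries any information about age in that group.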

Although confounding factors are very important, identifying relevant factors and then controlling for their effects can be difficult. A common approach is to stratify the analysis by the confounding factor of interest, e.g. to separate a problem into a series of problems, each corresponding to a separate age group, and/or to analyze the data using time as a variable. Case-control studies, i.e. studies that include carefully matched cases and controls, can help to eliminate some of these effects by ensuring that a similar mix of confounding variables applies to both cases and controls. Of course this assumes that the relevant confounders have been identified in advance and can be measured and assigned in the design process. There is also a risk of over-matching, i.e. ensuring cases and controls are matched on variables that may be (partial) surrogates for one another (e.g. vegetarians and individuals who have a high fiber content in their diet), which could bias the results by over-selection in the sampling process.
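As a rough illustration of stratification, the sketch below compares a crude risk ratio with the risk ratios computed separately within two age groups. The counts, the age bands, and the helper name `risk_ratio` are invented for this example and do not come from any study.

```python
def risk_ratio(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


# Invented counts per age stratum:
# (exposed cases, exposed total, unexposed cases, unexposed total)
strata = {
    "under 50": (4, 200, 16, 800),
    "50 and over": (160, 800, 40, 200),
}

# Pooling the strata ignores age and gives the crude (unadjusted) comparison.
crude_counts = tuple(sum(col) for col in zip(*strata.values()))
crude_rr = risk_ratio(*crude_counts)

# Analyzing each age group separately holds the confounder roughly constant.
stratum_rr = {name: risk_ratio(*counts) for name, counts in strata.items()}

print(f"crude risk ratio: {crude_rr:.2f}")
for name, rr in stratum_rr.items():
    print(f"{name}: risk ratio {rr:.2f}")
```

With these made-up counts the crude risk ratio is about 2.9, yet within each age group the exposed and unexposed have the same risk. The apparent effect arises only because exposure is far more common in the older, higher-risk group.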

The term confounding is also used in a similar context in the design of experiments. Here the concern is with the design of an experiment that seeks to identify the effect of a treatment, for example an additive to a fuel to increase its efficiency, whilst removing the effect of other factors on the analysis. For example, a trial might involve a number of different vehicles and drivers, with tests run under a range of driving conditions and speeds. The impact of these factors can be separated from the effect of the additive on fuel consumption by careful design. However, it may not be possible to run an experiment for every possible combination of such factors, and a fractional design (only including a subset of the possible combinations) may be necessary. In this case some factor combination effects, or interactions, will not be detectable and will thus be confounded with the main effects being studied.
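Confounding in a fractional design can be seen directly in a small case. The sketch below builds a half fraction of a three-factor, two-level experiment using the generator C = AB. The factor coding (A for the additive, B for the vehicle, C for the driver) and the fuel-consumption readings are assumptions made up for illustration.

```python
from itertools import product


def effect(column, y):
    """Effect estimate for a balanced two-level column:
    mean response at +1 minus mean response at -1."""
    half = len(y) / 2
    return sum(x * yi for x, yi in zip(column, y)) / half


# Half fraction of a 2^3 design: 4 runs instead of 8.
# The level of C in each run is set by the generator C = A * B.
runs = [(a, b, a * b) for a, b in product((-1, 1), repeat=2)]

A = [r[0] for r in runs]
B = [r[1] for r in runs]
C = [r[2] for r in runs]
AB = [a * b for a, b in zip(A, B)]  # interaction column for A and B

# Invented fuel-consumption readings, one per run.
y = [8.1, 7.4, 7.9, 6.8]

print("runs (A, B, C):", runs)
print("driver main effect:             ", effect(C, y))
print("additive x vehicle interaction: ", effect(AB, y))
```

Because the C column is identical to the AB column, the two estimates are necessarily the same number: with only these four runs, no analysis can tell a driver effect apart from an additive-by-vehicle interaction. The same holds for A with BC and for B with AC.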


[RUM1] Rumbelow H (2009) Breast is best? Exposing the myths of breastfeeding. The Times, 20th July 2009. An edited version of this article was published in the RSS/ASA magazine, Significance, in December 2010.