Case-control studies

Case-control studies are essentially retrospective in nature: they study patients who have acquired a disease or condition (the cases) and compare them with so-called 'controls' whose profile is similar. By similar we mean that the controls have not exhibited the condition under investigation but otherwise closely resemble those who have; for example, they have the same age/sex mix, live in the same area, do similar work or attended similar schools. Some case-control studies involve multiple control groups, but more typically there is one group of cases and one of controls.

As with other forms of medical research procedure, we provide an example below to illustrate the typical application of this approach. The example analyzes data relating to the incidence of a particular disease amongst individuals who have been exposed to some infectious agent, substance or environmental factor that is suspected of being causative. The simplest model is to assume that the row and column totals of the resulting table of counts are known and fixed, and then to apply Fisher’s exact test. This simple and clear procedure evaluates the hypothesis that there is no association between disease incidence and exposure.

Example:  Esophageal cancer cases in Brittany

In the example table below we show the incidence of esophageal cancer amongst males in part of Brittany in France (205 cases) and a sample of 770 males (the controls) selected at random from the local electoral lists in the same region. Examining the tabulated summary results we see that the odds of being a high consumer of alcohol are 96:109 for cases (i.e. almost 1:1), whereas for controls they are 104:666 (around 1:7). The odds ratio is thus (96/109)/(104/666)=5.64. Put another way, because the odds ratio approximates the relative risk when a disease is rare, the data suggest that you are roughly 5 to 6 times more likely to suffer from esophageal cancer if you are a heavy drinker than if you are a more moderate drinker. To place this into context, a 125ml glass of 8% strength wine (very weak, most wines are 12-13%) contains 10ml of pure alcohol (about 8g), or one unit (in international measure), with the current recommended daily maximum consumption being 2 units (20ml) for women and 3 units (30ml) for men.

                   Alcohol consumption
              High         Lower        Total
Cases         96  (a)      109 (b)      205
Controls      104 (c)      666 (d)      770
Total         200          775          975

These data and their analysis are discussed in detail in Breslow and Day (1980, [BRE1]). The lower case letters identify the notation used in many studies, with the odds ratio being computed simply as ad/bc.
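As a minimal sketch of the cross-product calculation just described, the following Python fragment computes ad/bc from the four cell counts quoted in the text. The function name `odds_ratio` and the choice of Python are illustrative assumptions and are not part of the original analysis.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio ad/bc for a 2x2 table.

    a, b: cases with high / lower exposure
    c, d: controls with high / lower exposure
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Counts quoted in the text: 96 and 109 for cases, 104 and 666 for controls.
case_odds = 96 / 109       # odds of high consumption among cases
control_odds = 104 / 666   # odds of high consumption among controls

print(round(case_odds / control_odds, 2))       # 5.64
print(round(odds_ratio(96, 109, 104, 666), 2))  # 5.64
```

Both routes give the same value, since (a/b)/(c/d) is algebraically identical to ad/bc.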

We can compute Fisher’s exact statistic for this 2x2 table, and the chi-square approximation, on the hypothesis that exposure and disease status are independent. The chi-square statistic is 110.26 (unadjusted) or 108.22 with the Yates adjustment, both highly significant. The exact test, using Fisher’s method as implemented in R, confirms that the result is highly significant and also provides an estimate of the odds ratio in close agreement with the value above (R reports a conditional maximum likelihood estimate rather than the simple ad/bc ratio), together with a 95% confidence interval for this ratio of [3.94,8.06]. These values are the so-called Cornfield confidence intervals (Cornfield, 1956, reported in Breslow and Day [BRE1]). These statistical tests can be seen as tests of the hypothesis that the odds ratio equals 1 against the alternative that it is greater than 1.
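The calculations above can be sketched in a few lines of Python using only the standard library. This is an illustration under stated assumptions, not the R procedure referred to in the text: the function names are hypothetical, the exact test is computed as a one-sided hypergeometric tail probability, and the confidence interval uses Woolf's log-based approximation rather than Cornfield's method, so its limits will generally differ somewhat from the interval quoted above.

```python
from math import comb, exp, log, sqrt


def chi_square(a, b, c, d, yates=False):
    """Pearson chi-square statistic (1 d.f.) for a 2x2 table of counts."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates continuity correction: shrink |ad - bc| by n/2, not below 0.
        diff = max(0.0, diff - n / 2)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def fisher_one_sided(a, b, c, d):
    """P(cell a >= observed value) under the hypergeometric model in which
    all row and column totals are fixed (one-sided Fisher exact test)."""
    row1, col1, col2 = a + b, a + c, b + d
    n = col1 + col2
    k_max = min(row1, col1)
    # Exact integer arithmetic; only the final ratio becomes a float.
    tail = sum(comb(col1, k) * comb(col2, row1 - k)
               for k in range(a, k_max + 1))
    return tail / comb(n, row1)


def woolf_interval(a, b, c, d, z=1.96):
    """Approximate interval for the odds ratio on the log scale
    (Woolf's method, an assumption here); z=1.96 gives roughly 95%."""
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return exp(log_or - z * se), exp(log_or + z * se)


a, b, c, d = 96, 109, 104, 666   # counts quoted in the text
print(round(chi_square(a, b, c, d), 2))              # 110.26
print(round(chi_square(a, b, c, d, yates=True), 2))  # 108.22
print(fisher_one_sided(a, b, c, d))                  # a very small p-value
print(woolf_interval(a, b, c, d))
```

The exact test needs no large-sample approximation, which is why it is preferred when cell counts are small; with counts as large as these, the chi-square and exact approaches lead to the same conclusion.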

The data under discussion have been greatly simplified: the detailed information available has been summarized in a 2x2 table. Whilst this is helpful and demonstrates an apparently very strong relationship between the incidence of this type of cancer and high alcohol consumption, it disguises potentially important information and possible confounding factors. For example, it was known that the controls in this study were, on average, 10 years younger than the cases. Since alcohol consumption may vary with age, an age-related confounding factor may exist. Because the ages of cases and controls are known, the data could be stratified by age group and each stratum analyzed separately. The odds ratios for each group can then be computed and compared to see how homogeneous they are, assuming sufficient data exist in each stratum for such a comparison.

Likewise, the division of alcohol consumption into two levels rather than more is somewhat arbitrary, so the analysis of cases and controls could be extended to obtain a more detailed picture of this relationship (with or without age-based stratification). Clearly, as the number of levels and strata is increased the cell entries will diminish, and without relatively large samples the scope for detailed breakdowns of this type will be limited. Also, the Fisher test for tables larger than 2x2 is typically implemented using simulation methods.

Finally, the study also collected information on tobacco consumption. Perhaps the strong relationship observed for alcohol consumption is not causative but indicative of lifestyle. By including tobacco consumption levels as well as alcohol consumption, for cases and controls, estimates of relative risks (by age group) can be obtained.
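The age-stratified comparison described above can be sketched as follows. The stratum counts in this example are hypothetical, invented purely for illustration, and are not the Brittany data. The pooled estimate uses the Mantel-Haenszel estimator (Mantel and Haenszel, 1959, [MAN1]), which is one common way of combining strata; the text above does not say which method the original analysis used, and the function names are assumptions.

```python
def stratum_odds_ratio(a, b, c, d):
    """Odds ratio ad/bc for one stratum, or None if it is undefined."""
    return None if b * c == 0 else (a * d) / (b * c)


def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled odds ratio over a list of (a, b, c, d)."""
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("pooled odds ratio is undefined for these strata")
    return num / den


# HYPOTHETICAL counts for two age strata, invented for illustration only.
# Each tuple is (cases high, cases lower, controls high, controls lower).
hypothetical_strata = [
    (10, 20, 5, 40),
    (30, 15, 20, 35),
]

for counts in hypothetical_strata:
    print(stratum_odds_ratio(*counts))   # 4.0, then 3.5

print(round(mantel_haenszel_or(hypothetical_strata), 2))   # 3.65
```

If the stratum-specific odds ratios are broadly similar, as in this made-up example, a single pooled figure is a reasonable summary; widely differing values would suggest that the effect of exposure varies with age and should be reported by stratum.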

The above study, dating from more than 40 years ago, contrasts with larger-scale, more recent studies, such as the case-control study of smoking-related deaths in India. The authors (Jha et al., 2008, [JAH1]) describe their methodology as follows:

In a nationally representative sample of 1.1 million homes, we compared the prevalence of smoking among 33,000 deceased women and 41,000 deceased men (case subjects) with the prevalence of smoking among 35,000 living women and 43,000 living men (unmatched control subjects). Mortality risk ratios comparing smokers with nonsmokers were adjusted for age, educational level, and use of alcohol.... In this age group [30-69], smoking was associated with an increased risk of death from any medical cause among both women (risk ratio, 2.0; 99% confidence interval [CI], 1.8 to 2.3) and men (risk ratio, 1.7; 99% CI, 1.6 to 1.8). Daily smoking of even a small amount of tobacco was associated with increased mortality. Excess deaths among smokers, as compared with nonsmokers, were chiefly from tuberculosis among both women (risk ratio, 3.0; 99% CI, 2.4 to 3.9) and men (risk ratio, 2.3; 99% CI, 2.1 to 2.6) and from respiratory, vascular, or neoplastic disease

In both of the above analyses controls were not tightly matched to cases: their age profile, and possibly other important factors, were matched only at a broad level. In some instances carefully matched controls can be identified, which generally increases the power of the analysis. Typically more controls than cases are identified, and analysis proceeds either by taking the first or a randomly selected matched control from each set, or by combining the controls as if they represented a single individual. Procedures for analyzing matched controls, which are variants of those described above, are covered in Breslow and Day [BRE1]. For more complex problems it may be preferable to apply techniques based on logistic regression, with unmatched or matched samples.


[BRE1] Breslow N E, Day N E (1980) Statistical Methods in Cancer Research: Volume 1, The Analysis of Case-Control Studies. IARC Scientific Publications No.32, World Health Organization, IARC Lyon

[BRE2] Breslow N E, Day N E (1987) Statistical Methods in Cancer Research: Volume 2, The Design and Analysis of Cohort Studies. IARC Scientific Publications No.82, World Health Organization, IARC Lyon

[JAH1] Jha P, Jacob B, Gajalakshmi V et al. (2008) A nationally representative case-control study of smoking and death in India. N Engl J Med; 358:1137-1147

[MAN1] Mantel N, Haenszel W. (1959) Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719-748