Statistical significance


The term significance in a statistical context does not equate to its meaning in general usage. The latter implies some idea of overall importance, whereas statistical significance is a purely probabilistic statement regarding the chance of observing a particular result under a stated (null) hypothesis. If it is estimated that this chance is very low, perhaps 1 in 100, we might state that the result is statistically significant. This ignores any measure of true importance, which might be measurable in some way (e.g. using some measure of the size of the observed effect) or may not be readily measurable. For example, publishing results may have a large social or political effect which is not readily understood or measurable at the time. Statistical measures of the size of an effect include correlation coefficients, absolute differences, odds ratios, relative risks, and measures of association for count data.
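Three of the effect-size measures listed above can be illustrated for a simple 2x2 table of counts. The following is a minimal sketch only: the function name `effect_sizes` and the counts used are hypothetical and are not taken from any study.

```python
def effect_sizes(a, b, c, d):
    """Effect-size measures for a 2x2 table of counts.

    a: exposed, with outcome      b: exposed, without outcome
    c: unexposed, with outcome    d: unexposed, without outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        # Difference between the two risks (proportions)
        "absolute_difference": risk_exposed - risk_unexposed,
        # Ratio of the two risks
        "relative_risk": risk_exposed / risk_unexposed,
        # Ratio of the odds of the outcome in each group
        "odds_ratio": (a * d) / (b * c),
    }


# Hypothetical counts chosen purely for illustration
result = effect_sizes(a=20, b=80, c=10, d=90)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

None of these quantities says anything about statistical significance; they describe how large the effect is, which is a separate question from how unlikely it is to have arisen by chance.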

Significance does not provide any indication of causation, even where a highly significant relationship exists between two variables. For example, in a 2010 study published in the British Medical Journal, it was reported that there appears to be a significantly increased risk of cardio-vascular disease (CVD) amongst individuals who do not brush their teeth twice a day (de Oliveira et al., 2010, [DEO1]), a somewhat surprising result that warrants very close examination. Out of around 12,000 individuals included in the sample, those who never or rarely brushed their teeth accounted for 19 deaths from a total of 170 deaths. This death rate was indeed statistically significantly higher than for those who were more vigilant in their dental hygiene, but considerably more investigation and research would be required to establish the importance of this relationship. It would also be necessary to establish whether there was in fact some form of causality (e.g. bacterial infection levels in the blood) or whether the finding is related to lifestyle and diet, and is thus a form of marker for the observed levels of CVD (see further, an extract from this paper at the end of this subsection).

Tests of significance are applied when examining the probability of obtaining a particular result. For example, if we observe a mean value of 2.8 in a sample when the true mean is hypothesized to be 2.5, is this difference statistically significant or could it have occurred by chance? Commonly applied significance tests are those applied to mean values (z-tests and t-tests) and when comparing two variances (F-tests). The use of the expression significance in this context was introduced by R A Fisher; it is specifically statistical in its meaning and refers to probabilities, not practical significance. A statistically significant result (a probability p<α, where α is some predetermined probability level below which results are regarded as significant) may be a purely technical result of no other importance. For example, statistically significant results are often found when using large samples, even though the differences between the values being measured are relatively small. We are generally more interested in large differences: a new anti-cancer drug might be shown to provide a statistically significant improvement in life expectancy in 80% of patients, but if this improvement is from 20 weeks to 22 weeks we might regard it as of very limited value.

If the significance level, α, is small (e.g. 0.001), a much larger difference between the observed and expected results is required than with a larger α value, so a p-value below such a small α lends greater weight to the evidence that the result is genuinely different from the expected. However, because a small α allows larger differences to be regarded as not significant, it increases the chance of not identifying a real difference that exists. This is known as a Type II error (failure to reject the Null Hypothesis (H0) when it is in fact false; see further Type I and Type II errors).
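The example of an observed mean of 2.8 against a hypothesized mean of 2.5 can be sketched as a two-sided one-sample z-test. This is an illustration under stated assumptions rather than a definitive procedure: the population standard deviation (taken as 1.0), the sample sizes, and the function name `z_test_two_sided` are all hypothetical, and the z-test assumes the standard deviation is known.

```python
from math import sqrt
from statistics import NormalDist


def z_test_two_sided(sample_mean, mu0, sigma, n):
    """Return (z, p) for a two-sided one-sample z-test with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability of a value at least as extreme as |z| in either tail
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


SIGMA = 1.0  # assumed population standard deviation (hypothetical)

# The same difference (2.8 vs 2.5) with a small and a moderate sample
z_small, p_small = z_test_two_sided(2.8, 2.5, SIGMA, n=30)
z_mid, p_mid = z_test_two_sided(2.8, 2.5, SIGMA, n=100)

# A much smaller difference (2.55 vs 2.5) with a very large sample
z_large, p_large = z_test_two_sided(2.55, 2.5, SIGMA, n=10_000)

for label, z, p in [("n=30, diff=0.30", z_small, p_small),
                    ("n=100, diff=0.30", z_mid, p_mid),
                    ("n=10000, diff=0.05", z_large, p_large)]:
    print(f"{label}: z = {z:.2f}, p = {p:.2g}")
```

Under these assumptions the difference of 0.3 is not significant at the 5% level with 30 observations but is with 100, while a difference of only 0.05 is highly significant with 10,000 observations. This illustrates how a large sample can make a small difference statistically significant without making it important.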
Note that the p-value obtained when conducting a test is not a measure of whether H0 is true; it is simply a quantitative measure of the strength of evidence against H0. How to interpret specific p-values remains unclear, and many statisticians regard the use of p-values, and even the term significance testing, as unsatisfactory. Using probabilistic methods and effect size measures to distinguish between alternative hypotheses, without over-dependence on such specific measures, appears less likely to result in misinterpretation of results. It is now a widely held view that confidence intervals and related measures are of greater value than significance levels, since they provide an indication of the range of values that are likely to be encountered, embracing and extending the single-valued significance figure. It is also important to note that significance tests are only valid if the reference distribution against which significance is evaluated (e.g. the Normal distribution, or an empirical distribution) is valid for the problem under consideration, and if the data are genuinely random independent samples from the reference distribution. Non-randomness in sampling and/or lack of independence of samples (e.g. autocorrelation) may invalidate any significance computations made.
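To illustrate why a confidence interval can be more informative than a single significance figure, the sketch below computes a Normal-theory interval for a mean. The standard deviation (1.0), the sample sizes, and the function name `mean_confidence_interval` are hypothetical assumptions made purely for illustration, and the calculation assumes independent random samples with known standard deviation.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Normal-theory confidence interval for a mean with known sigma."""
    # Critical value, e.g. approximately 1.96 for a 95% interval
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z_crit * sigma / sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


# Observed mean 2.8 from 100 observations (hypothetical sigma = 1.0)
lo_a, hi_a = mean_confidence_interval(2.8, sigma=1.0, n=100)

# Observed mean 2.55 from 10,000 observations (hypothetical sigma = 1.0)
lo_b, hi_b = mean_confidence_interval(2.55, sigma=1.0, n=10_000)

print(f"mean 2.80, n=100:   95% CI = ({lo_a:.3f}, {hi_a:.3f})")
print(f"mean 2.55, n=10000: 95% CI = ({lo_b:.3f}, {hi_b:.3f})")
```

Both intervals exclude a reference value of 2.5, so both results would be regarded as significant at the 5% level, but the intervals also show how far from 2.5 the true mean is likely to lie. In the second case the whole interval sits very close to 2.5, which a significance figure alone would not reveal.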

Significance and multiple comparisons

Suppose that a test is carried out seeking to identify whether disease incidence is higher amongst group A, who have characteristic X, than amongst group B, who do not. Characteristic X could be any one of a large number of characteristics of individuals, from exposure to particular substances or infectious agents, to particular genetic features. If a large number of characteristics are tested (e.g. 100) then by chance alone one would expect some of these to produce significant results at the 5% or 1% levels: roughly 5 and 1 of the 100 tests respectively, even if none of the characteristics has any real effect. Hence reporting the number of tests of different characteristics is important in such cases.
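The arithmetic behind this can be sketched as follows. The calculation assumes that the tests are independent and that the null hypothesis is true for every characteristic tested; the function name `multiple_testing_summary` is hypothetical.

```python
def multiple_testing_summary(num_tests, alpha):
    """Expected false positives and chance of at least one, assuming
    independent tests with every null hypothesis true."""
    expected_false_positives = num_tests * alpha
    prob_at_least_one = 1.0 - (1.0 - alpha) ** num_tests
    return expected_false_positives, prob_at_least_one


for alpha in (0.05, 0.01):
    expected, prob_any = multiple_testing_summary(100, alpha)
    print(f"alpha = {alpha}: expected false positives = {expected:.1f}, "
          f"P(at least one) = {prob_any:.3f}")
```

With 100 tests, at least one spuriously significant result is close to certain at the 5% level and more likely than not at the 1% level, which is why a significant finding cannot be interpreted without knowing how many tests were carried out.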


[DEO1] de Oliveira C, Watt R, Hamer M (2010) Toothbrushing, inflammation, and risk of cardiovascular disease: results from Scottish Health Survey. British Medical Journal, 27 May 2010, doi:10.1136/bmj.c2451 - see also the following extract and tabulation of data from the paper discussing the analysis of this data:

"Cox proportional hazards models were used, with months as the time scale, to estimate the risk of cardiovascular disease events or death according to oral hygiene. The data were censored to December 2007 in event-free survivors. Initially, tooth-brushing habit was included in the Cox model as a covariate, with adjustment for age and sex. In further multivariate models we adjusted for socioeconomic group using the registrar general classification (professional/intermediate, skilled non-manual, skilled manual, part skilled/unskilled), smoking (never, ex-smoker, current smoker), sex specific thirds of physical activity, frequency of dentist visits (at least once every six months, once every one to two years, rarely or never), BMI (underweight <18.5, normal weight 18.5-25.0, overweight 25.01-30.0, obese 30.1-40.0, morbidly obese >40.0), and family history of cardiovascular disease, hypertension, and diabetes. The proportional hazards assumption was examined by comparing the cumulative hazard plots grouped on exposure, although no violations were noted.

We examined the association between frequency of tooth-brushing and inflammatory markers (C reactive protein and fibrinogen) using general linear models with adjustments for age, sex, socioeconomic group, smoking, visits to dentist, BMI, family history of cardiovascular disease, hypertension, and diabetes, and acute infections (including influenza, pneumonia, bronchitis, and upper respiratory tract infections in the three weeks before assessment). C reactive protein was log transformed to normalize the data. All analyses were performed with SPSS (version 14), and all tests of significance were based on two sided probability."

Table: Cox regression models for tooth-brushing and death from cardiovascular disease (CVD)


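| Frequency of tooth-brushing | Fatal CVD events/total | Hazard ratio (95% CI), adjusted for age/sex | Hazard ratio (95% CI), Model 1* | Hazard ratio (95% CI), Model 2** |
|---|---|---|---|---|
| Twice a day | | 1.0 (reference) | 1.0 (reference) | 1.0 (reference) |
| Once a day | | 1.2 (0.8 to 1.8) | 1.1 (0.8 to 1.5) | 1.1 (0.8 to 1.5) |
| <Once a day | | 2.4 (1.5 to 4.0) | 1.6 (1.0 to 2.8) | 1.5 (0.9 to 2.6) |
| P value for trend | | | | |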




*Adjusted for age, sex, socioeconomic group, smoking, physical activity, and visits to dentist.

** Additionally adjusted for BMI, family history of cardiovascular disease, hypertension (diagnosed by doctor or clinic blood pressure >140/90 mm Hg), diabetes diagnosed by doctor.
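The hazard ratios tabulated above for the group brushing less than once a day show how a confidence interval relates to significance: a 95% interval that excludes 1.0 (no difference from the reference group) corresponds to significance at the 5% level. The sketch below simply checks this for the three reported intervals. It is an illustration and not part of the original analysis; the function name `excludes_null` is hypothetical, and the published figures are rounded to one decimal place, so a bound reported as exactly 1.0 cannot be resolved from the table alone.

```python
# Hazard ratio, lower and upper 95% limits for "<Once a day", as tabulated
hazard_ratios = {
    "Adjusted for age/sex": (2.4, 1.5, 4.0),
    "Model 1": (1.6, 1.0, 2.8),
    "Model 2": (1.5, 0.9, 2.6),
}


def excludes_null(lower, upper, null_value=1.0):
    """True if the interval lies strictly above or strictly below null_value."""
    return lower > null_value or upper < null_value


results = {}
for model, (ratio, lower, upper) in hazard_ratios.items():
    results[model] = excludes_null(lower, upper)
    print(f"{model}: HR {ratio} ({lower} to {upper}), "
          f"interval excludes 1.0: {results[model]}")
```

On the rounded figures, only the model adjusted for age and sex alone gives an interval clearly excluding 1.0. The apparent effect weakens as further adjustments are made, which is consistent with the caution expressed earlier about interpreting the association as causal.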