Correlation and autocorrelation

The term correlation has a statistical meaning that is similar to the commonly understood concept of correlation or association. Specifically, it refers to a measure of the similarity between two (or more) paired datasets. Correlation does not imply causation, although it may suggest a possible causal relationship (see further the earlier discussion on Misinterpretation of statistics, and the Wikipedia discussion of this issue). The clearest example of this difference is the ongoing debate regarding climate change: the observed correlation between mean global temperature rise and carbon dioxide emissions may well indicate that the temperature rise is caused by the burning of fossil fuels, but there might be other factors that directly or indirectly account for the observed relationship.

Two variables that are measured in pairs but for which no apparent correlation exists can be regarded as (probably) independent of one another. On the other hand, if some form of correlation is detected then their independence must be questioned and the possibility of a dependent relationship warrants examination. In this context we often talk about one variable in the pair being regarded as the independent variable and the other as the dependent variable. Again, this does not imply causality, but rather reflects the manner in which the pairwise variation (co-variation) is examined. For further discussion of graphing data pairs and the difficulties associated with interpretation and analysis of this type of data, see the topic Exploratory Data Analysis in this Handbook, including the discussions of Anscombe's Quartet and Scale dependency.

A measure of the degree to which two (or more) variables are correlated is known as a correlation coefficient. This is a statistic, computed from the data, that typically lies in the range [-1,+1], where 0 indicates no correlation, +1 indicates perfect positive correlation and -1 indicates perfect negative (or inverse) correlation. For real-valued data the most widely used correlation coefficient is that due to Karl Pearson, known as the Pearson or Product Moment correlation coefficient. This is essentially a measure of linear association, and the standard inference procedures for it assume that the data is drawn from a Bivariate Normal population. Pearson's coefficient can yield misleading results if the association is non-linear or if the data includes outliers. More robust measures include those in which data is either measured as ordinal, or treated as ordinal and ranked. Two principal coefficients of rank correlation are widely available, due to Spearman and Kendall.
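The contrast between Pearson's coefficient and the rank-based coefficients can be illustrated with a short Python sketch. The function names (pearson, ranks, spearman, kendall_tau_a) and the sample data are illustrative assumptions, not taken from this Handbook; the Kendall version shown is the simple tau-a form, which makes no correction for ties. The data follow a strictly increasing but non-linear relationship, so the rank coefficients report perfect association while Pearson's coefficient falls below 1.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson's coefficient applied to the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return score / (n * (n - 1) / 2)


# Illustrative data only: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

print("Pearson :", pearson(x, y))
print("Spearman:", spearman(x, y))
print("Kendall :", kendall_tau_a(x, y))
```

In practice a statistical library would normally be used for these calculations; the hand-written versions are shown only to make the definitions explicit.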

Correlation can be analyzed between more than two variables, typically by holding one or more variables constant whilst examining the correlation between a specified pair of variables. This approach is known as partial correlation analysis, and is described later in this topic. Another major extension to correlation techniques applies to data recorded in series, especially time series and spatial series (sorted by distance band). Unlike standard correlation, which involves two variables, only a single variable is analyzed: pairs of its values separated by a fixed interval of time or distance band (a lag) are compared. This enables patterns of dependency in time and/or space to be studied, assisting in model development where the common assumption of independence of observations does not hold. The general name for such techniques used to be serial correlation, but they are now more usually referred to as autocorrelation.
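The idea of comparing a series with a lagged copy of itself can be sketched in Python as follows. The text above does not specify an estimator, so the formula used here is an assumption: a commonly used sample form in which the sum of lagged cross-products about the overall mean is divided by the total sum of squared deviations. The function name autocorrelation and the example series are likewise illustrative only.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a single series at the given lag.

    Assumed estimator: lagged cross-products of deviations from the
    overall mean, divided by the total sum of squared deviations.
    """
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    dev = [v - mean for v in series]
    denom = sum(d * d for d in dev)
    # Pair each value with the value 'lag' steps later in the same series.
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom


# Illustrative series that repeats every 6 steps.
series = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2]

for k in (0, 3, 6):
    print("lag", k, "autocorrelation:", autocorrelation(series, k))
```

For this series the coefficient is positive at a lag equal to the period and negative at half the period, which is the kind of dependency pattern that an assumption of independent observations would miss.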