Autocorrelation

We noted in the introduction to this topic that the term autocorrelation applies to data that is arranged in some form of one- or two-dimensional series, i.e. highly ordered datasets, with correlation being computed between pairs of values separated by a fixed time or distance (lag). The formula for correlation in this instance is typically the Pearson or Product Moment correlation, amended for serial data. In the case of time series, or temporal data, this is relatively straightforward, simply requiring (x,y) pairs to be replaced by (x_t, x_{t+k}) pairs, where the subscripts identify the fixed-step discrete times t and t+k.

The population autocorrelation coefficient at lag k, ρk, is then computed as the ratio of the (auto)covariance to the (auto)variance:

$$\rho_k = \frac{\gamma(k)}{\gamma(0)}$$

where γ(k) is the autocovariance at lag k and γ(0) is the autocovariance at lag 0, i.e. the variance of the series. Assuming there is sufficient data, the computation is symmetric, i.e. γ(k)=γ(-k) and thus ρk=ρ-k.
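As an illustration, the ratio of lag-k autocovariance to lag-0 autocovariance can be estimated from a sample series. The sketch below is a minimal version using only the Python standard library. The function name `autocorrelation` and the example series are hypothetical, and it assumes the common estimator that uses the overall sample mean and the same divisor for both covariances, so the divisor cancels in the ratio.

```python
def autocorrelation(x, k):
    """Sample autocorrelation of the sequence x at lag k (k may be negative)."""
    n = len(x)
    k = abs(k)  # symmetry: the estimate at lag -k equals the estimate at lag k
    if n < 2 or k >= n:
        raise ValueError("need at least two values and |k| < len(x)")
    mean = sum(x) / n
    dev = [v - mean for v in x]
    # Lag-0 autocovariance (up to a constant divisor that cancels in the ratio).
    c0 = sum(d * d for d in dev)
    if c0 == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # Lag-k autocovariance: products of deviations k steps apart.
    ck = sum(dev[t] * dev[t + k] for t in range(n - k))
    return ck / c0


# Illustrative data only (assumed, not taken from the text).
series = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5]
for lag in range(4):
    print(lag, round(autocorrelation(series, lag), 3))
```

With this estimator the lag-0 value is always 1, and values at other lags fall within [-1,1].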

As with the product moment correlation coefficient, ρk has a range [-1,1], with the mid-value, 0, being the indicator of no autocorrelation. If the lagged variables are independent then ρk=0, but a zero value from sample data does not guarantee that the variables are independent. The analysis of autocorrelation in the context of temporal datasets is discussed in the first subsection below.

With spatial data, modeling is more complicated because there is no definitive serial ordering. For spatial point data (points with associated attribute values) a serial ordering can be imposed using fixed radius circles, or distance bands, extended around each point. For lattice-type data (e.g. census districts with census data) serial ordering typically uses some measure of adjacency - adjacent regions adjoin each other or share a common boundary, whereas non-adjacent regions do not. These questions are discussed in the subsection below covering spatial autocorrelation, initially examining simple regular lattices, then extending to irregular lattices, and finally to distance bands.
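The two ways of imposing an ordering described above can be sketched in code. This is a minimal illustration under stated assumptions: `rook_neighbours` and `distance_band_neighbours` are hypothetical helper names, adjacency on the regular lattice is taken to mean sharing an edge (one common convention among several), and distances are plain Euclidean distances between point coordinates.

```python
import math


def rook_neighbours(n_rows, n_cols):
    """Map each (row, col) cell of a regular lattice to the cells sharing an edge with it."""
    neighbours = {}
    for r in range(n_rows):
        for c in range(n_cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Keep only candidates that fall inside the lattice.
            neighbours[(r, c)] = [
                (i, j) for i, j in candidates if 0 <= i < n_rows and 0 <= j < n_cols
            ]
    return neighbours


def distance_band_neighbours(points, inner, outer):
    """For each point index, list the other points at distance d with inner < d <= outer."""
    result = {i: [] for i in range(len(points))}
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and inner < math.hypot(xi - xj, yi - yj) <= outer:
                result[i].append(j)
    return result


# Illustrative inputs only (assumed, not taken from the text).
lattice = rook_neighbours(3, 3)
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
first_band = distance_band_neighbours(points, 0.0, 1.5)
second_band = distance_band_neighbours(points, 1.5, 3.5)
```

Successive distance bands play the role that successive lags play in a time series: each band defines which pairs of observations are compared at that "spatial lag".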

Where temporal and/or spatial autocorrelation is detected it will typically vary in intensity by lag. Plotting a statistical measure of autocorrelation against lag (time or distance) will produce one or more line graphs that highlight this pattern of variation - such a diagram is called a correlogram, and is an extremely useful tool in determining the pattern of autocorrelation and the point at which autocorrelation ceases to be an issue (the point beyond which samples can be considered as being effectively independent).

Having identified the intensity and pattern of autocorrelation, this information may then be used in a variety of ways. Most (if not all) real-world time series data exhibits autocorrelation, so it is natural to use this information when constructing time series analysis models, in particular forecasting models. Periodicity or spikes in the correlogram can indicate the frequency interval of particular patterns, which can then be examined in terms of possible causative factors. Autocorrelation can be explicitly built into regression models (for more details see, for example, the topic addressing spatial autoregressive models) and used in interpolation or prediction techniques (such as geostatistical modeling). The techniques developed in this latter field have also been extended and applied to a number of spatio-temporal problems, such as the spread of epidemics and the analysis of remote sensing data.
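A correlogram for a time series can be sketched by computing the sample autocorrelation at each lag and flagging values that exceed a reference bound. The example below is a minimal text-based version. The names `sample_acf` and `correlogram` are hypothetical, the sine-wave series with period 12 is made-up illustrative data, and the bound of ±1.96/√n is the usual large-sample approximation for an uncorrelated series, used here as an assumption and not as a method prescribed by the text.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations of x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


def correlogram(x, max_lag):
    """Return (lag, r_k, exceeds_bound) rows for lags 0..max_lag."""
    bound = 1.96 / math.sqrt(len(x))  # approximate bound, assumes a large sample
    return [(k, r, abs(r) > bound) for k, r in enumerate(sample_acf(x, max_lag))]


# Illustrative periodic series (period 12), not real data.
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
rows = correlogram(series, 24)
for lag, r, flagged in rows:
    bar = "#" * round(abs(r) * 30)  # crude text bar showing the magnitude
    print(f"{lag:3d} {r:+.3f} {'*' if flagged else ' '} {bar}")
```

For a periodic series like this one, the printed values rise and fall with the period, which is the kind of pattern the paragraph above describes as indicating the frequency interval of a repeating effect.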

There are several reasons for carrying out analyses of autocorrelation in datasets. The most obvious is that all datasets represent samples made in space and time, and as such there may be some element of spatial and/or temporal association between data items. Most statistical techniques depend upon the samples being truly independent, so if dependence is suspected it must be checked before continuing with any analysis. In many cases it is reasonable to assume that: (i) spatial and temporal factors have no bearing on the data collected; (ii) the design of the data collection process or the experimental procedure adopted will have removed any such effects - a good example is where the order in which a set of taste tests is carried out, or the sequence in which a trial combination of drugs is administered, is randomized or permuted; or (iii) the time and location of trials are completely unrelated. However, if there is any doubt regarding independence, tests for the existence of autocorrelation should be carried out, and if autocorrelation is found, the analytical procedures adopted should be altered and/or the data collected in a different manner.
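One simple way to check for serial dependence, sketched here as an illustration and not as the procedure the text has in mind, is a permutation test: if the order of the observations carries no information, shuffling them should produce lag-k autocorrelations as extreme as the observed one reasonably often. The function names, the default of 999 permutations, the fixed seed and the trending example series are all assumptions made for this sketch, and the series is assumed not to be constant.

```python
import random


def lag_autocorrelation(x, k=1):
    """Sample autocorrelation of x at lag k (x must not be constant)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return sum(dev[t] * dev[t + k] for t in range(n - k)) / c0


def permutation_test(x, k=1, n_perm=999, seed=0):
    """Two-sided permutation p-value for the lag-k autocorrelation of x."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(lag_autocorrelation(x, k))
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # destroys any serial ordering
        if abs(lag_autocorrelation(shuffled, k)) >= observed:
            hits += 1
    # Count the observed ordering itself as one of the permutations.
    return (hits + 1) / (n_perm + 1)


# Illustrative trending series (assumed data): strongly autocorrelated at lag 1.
trend = list(range(30))
p_value = permutation_test(trend)
print(p_value)
```

A small p-value suggests that the observed ordering is unlikely under independence, in which case the analysis or the data collection design would need to be reconsidered as described above.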