Temporal autocorrelation
We have seen in the introduction to this topic that the population autocorrelation coefficient at lag k, ρk, is defined as the ratio of the (auto)covariance to the (auto)variance:

$$\rho_k=\frac{\gamma_k}{\gamma_0}$$
To clarify how this arises, let us consider the analysis of sample time series data. If we have a sample set {xi,yi} of n pairs of real-valued data, the correlation between them is given by the ratio of the covariance (the way they vary jointly) to the product of the square roots of their individual variances. This is effectively a way of standardizing the covariance by the spread of each variable, ensuring that the sample correlation coefficient, r, falls in the range [‑1,1]. We saw earlier that the standard formula used for this ratio is:

$$r=\frac{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}}\,\sqrt{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^{2}}}$$
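As a quick illustrative check (the values below are made up, not data from the text), the following R snippet evaluates this ratio directly and confirms that it matches R's built-in cor() function:

x <- c(2.1, 3.4, 4.0, 5.6, 7.2)
y <- c(1.9, 3.0, 4.5, 5.1, 7.5)
num <- sum((x - mean(x)) * (y - mean(y)))                       # covariance term
den <- sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2))  # product of spreads
num / den       # same value as...
cor(x, y)       # ...R's built-in product moment correlation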
Now suppose that instead of a set of data pairs {xi,yi} we have a set of n values, {xt}, which represent measurements taken at successive time periods, t=1,2,3,…,n, for example daily levels of rainfall at a particular location, or the closing daily price of a stock or commodity. The graph below shows a typical stock price time series: the blue line is the closing stock price on each trading day; the red and black looped lines highlight the time series for 7- and 14-day intervals or ‘lags’, i.e. the sets {xt,xt+7,xt+14,xt+21,...} and {xt,xt+14,xt+28,xt+42,...}.
Time series of stock price and volume data
The pattern of values recorded and graphed might show that rainfall, or commodity prices, exhibits some regularity over time. For example, it might show that days of high rainfall are commonly followed by another day of high rainfall, and days of low rainfall are also often followed by days of low rainfall. In this case there would be a strong positive correlation between the rainfall on successive days, i.e. on days that are one step or lag apart. We could regard the set of “day 1” values as one series, {xt,1} t=1,2,3…n‑1, and the set of “day 2” values as a second series, {xt,2} t=2,3…n, and compute the correlation coefficient for these two series in the same manner as for the r expression above. Each series has a mean value, which is simply:

$$\bar{x}_{\cdot 1}=\frac{1}{n-1}\sum_{t=1}^{n-1}x_t,\qquad \bar{x}_{\cdot 2}=\frac{1}{n-1}\sum_{t=2}^{n}x_t$$
The numeric subscript indicates which of the two series is being averaged, and the dot indicates that the mean is computed across all usable values of t. Using these two mean values we can then construct a correlation coefficient at lag 1 between our two series. This is essentially the same formula as for r:

$$r_1=\frac{\sum_{t=1}^{n-1}\left(x_t-\bar{x}_{\cdot 1}\right)\left(x_{t+1}-\bar{x}_{\cdot 2}\right)}{\sqrt{\sum_{t=1}^{n-1}\left(x_t-\bar{x}_{\cdot 1}\right)^{2}}\,\sqrt{\sum_{t=2}^{n}\left(x_t-\bar{x}_{\cdot 2}\right)^{2}}}$$
If n is reasonably large then the value 1/(n‑1) will be very close to 1/n, and the values of the two means and standard deviations will be almost the same, so the above expression can be simplified under these circumstances to:

$$r_1=\frac{\sum_{t=1}^{n-1}\left(x_t-\bar{x}\right)\left(x_{t+1}-\bar{x}\right)}{\sum_{t=1}^{n}\left(x_t-\bar{x}\right)^{2}},\qquad\text{where }\ \bar{x}=\frac{1}{n}\sum_{t=1}^{n}x_t$$
This expression is known as the sample serial correlation coefficient or autocorrelation coefficient for a lag of 1 time period. It may be generalized for lags of 2, 3, …, k steps as follows:

$$r_k=\frac{\sum_{t=1}^{n-k}\left(x_t-\bar{x}\right)\left(x_{t+k}-\bar{x}\right)}{\sum_{t=1}^{n}\left(x_t-\bar{x}\right)^{2}}$$
Since the 1950s the term autocorrelation coefficient, rather than serial correlation coefficient, has been used to describe this expression. The numerator of this expression is like the covariance, but at a lag of k, and the denominator is like the covariance at a lag of 0 (i.e. the variance). These two components are sometimes known as the autocovariance at lags k and 0 respectively.
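To make the computation concrete, the following short R sketch (using a simulated series, purely for illustration) evaluates rk directly from this formula and checks the result against R's built-in acf() function; the divisors cancel in the ratio, so the two agree exactly:

set.seed(42)
x <- cumsum(rnorm(200))    # a random walk: strong serial correlation
n <- length(x)
xbar <- mean(x)
r_k <- function(k) {
  num <- sum((x[1:(n - k)] - xbar) * (x[(k + 1):n] - xbar))  # autocovariance term at lag k
  den <- sum((x - xbar)^2)                                   # autocovariance term at lag 0
  num / den
}
sapply(1:5, r_k)                         # hand-computed r_1, ..., r_5
acf(x, lag.max = 5, plot = FALSE)$acf    # same values (first entry is lag 0, which is 1)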
In time series analysis it is usual for the time spacing, or “distance”, to be measured in equal steps. The set of values {rk} can then be plotted against the lag, k, to see how the pattern of correlation varies with lag. This plot is known as a correlogram, and provides valuable insight into the behavior of the time series at different lags or “distances”. For a purely random series the rk values will all be approximately 0; more precisely, they are distributed approximately as N(0,1/n). If there is short-term correlation, as in our rainfall example, the rk will start high (close to +1) and decrease to roughly 0 once the lag exceeds the length (or range) of this correlation. With periodic data the frequency and strength of the periodic components can be detected. A simple example is illustrated below. The data in this case are the maximum temperatures recorded each month since 1948 at Heathrow Airport, London (a total of 744 records). First of all we plot the raw data, which clearly show a strong cyclical or periodic pattern:
Monthly maximum temperature, Heathrow Airport, 1948-2009
The R function acf() has then been used to compute the autocorrelation coefficients, which have been plotted against the lag in months. At lag zero acf(0)=1, and as expected there is a cyclical pattern to the autocorrelation, with strong positive autocorrelation at intervals of 12 months and multiples thereof, and matching negative correlation at 6 months, 18 months, etc. The dotted horizontal lines indicate an approximate confidence interval for the acf values (but this should be treated with caution).
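The following R sketch reproduces the general character of this analysis with a synthetic monthly series (an annual sine cycle plus noise; this is illustrative only, not the Heathrow data), together with a purely random series for comparison:

set.seed(1)
n <- 744
t <- 1:n
temp <- 15 + 8 * sin(2 * pi * t / 12) + rnorm(n, sd = 2)
plot(t, temp, type = "l")     # raw data: a clear 12-month cycle
acf(temp, lag.max = 48)       # peaks near lags 12, 24, 36; troughs near 6, 18, 30
acf(rnorm(n), lag.max = 48)   # random series: values stay close to 0, within the dotted bounds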
Monthly maximum temperature, Heathrow Airport, Autocorrelation function
A clearer picture can be obtained by computing partial autocorrelation values. Partial autocorrelation is computed in much the same manner as partial correlation for standard product moment correlation. With two variables (x and y), controlling for a third (z), we had the following formula for partial correlation:

$$r_{xy\cdot z}=\frac{r_{xy}-r_{xz}\,r_{yz}}{\sqrt{\left(1-r_{xz}^{2}\right)\left(1-r_{yz}^{2}\right)}}$$
For partial autocorrelation the equivalent formula is:

$$r_{13\cdot 2}=\frac{r_{13}-r_{12}\,r_{23}}{\sqrt{\left(1-r_{12}^{2}\right)\left(1-r_{23}^{2}\right)}}$$
In this instance the partial autocorrelation coefficient measures the strength of the autocorrelation between data at time lags 1 and 3, removing the effect of time lag 2 (the correlations between lags 1 and 2, and between lags 2 and 3). In autocorrelation analysis, computing the partial autocorrelation enables the correlation effects of all intermediate periods to be removed (controlled for), thereby concentrating on the individual lag relationships. For the data above, the partial acf values have been computed using the R acf() function again (with type="partial"), and plotted against the lag as before (see below). In this example it can be seen that the majority of the lag effects are accounted for by month-by-month autocorrelation levels up to 12 months, and after this the lag effects are very small. With more complex datasets it is often possible to identify effects at specific lags that can be attributed to some process that might otherwise be difficult to detect in the source data and basic acf. The partial acf is also useful in modeling time series, as it helps to identify over what period short-term autocorrelation dominates (see further, ARIMA modeling).
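As a concrete check (again with a simulated series rather than the Heathrow data), the following R sketch computes r13.2 from the formula above, using the fact that r12 and r23 are both the lag-1 autocorrelation and r13 is the lag-2 autocorrelation, and compares it with R's built-in partial acf:

set.seed(42)
x <- arima.sim(list(ar = 0.6), n = 500)    # simulated AR(1) series, illustration only
r <- acf(x, lag.max = 2, plot = FALSE)$acf # r[1]=lag 0, r[2]=lag 1, r[3]=lag 2
r1 <- r[2]; r2 <- r[3]
(r2 - r1 * r1) / (1 - r1 * r1)             # r_13.2 from the formula above
pacf(x, plot = FALSE)$acf[2]               # R's partial acf at lag 2: same value
pacf(x, lag.max = 20)                      # the partial correlogram plot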
Monthly maximum temperature, Heathrow Airport, Partial autocorrelation function
It is possible, of course, that the overall pattern of rainfall, temperature or commodity prices shows a steady increase over time, in which case the correlograms will not tend to zero in the manner expected. In this case the series is described as non-stationary, and before carrying out such analysis an attempt should be made to remove the trend component. Typically this involves fitting a trend curve (e.g. a best-fit straight line) to the original data points and subtracting the fitted values from the original observations before carrying out the analysis. The original data may also contain outliers, which if left in may distort the analysis, so inspection of the source data and outlier adjustment or removal (e.g. of data errors) may be advisable. Having identified these factors, adjusted the data if necessary, and computed the correlograms, the next step is to examine the results and attempt to interpret the observed patterns. Unfortunately, as is the case with many patterns observed in time or space, more than one process can generate an identical pattern. However, modeling an observed pattern may provide an effective means of estimating missing or sparse data, or predicting values beyond the observed range, even though the process generating the pattern may not be understood or unique. These questions are discussed in greater detail, with examination of other statistical analysis tools for serial datasets, in the section on Time Series Analysis (TSA).
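As a sketch of this detrending step (with simulated data, purely for illustration), the following R code fits a straight line with lm(), subtracts it, and computes the correlogram of the residuals:

set.seed(2)
t <- 1:200
x <- 0.05 * t + arima.sim(list(ar = 0.7), n = 200)  # linear trend plus AR(1) noise
fit <- lm(x ~ t)               # best-fit straight line (the trend component)
detrended <- residuals(fit)    # original values minus the fitted trend
acf(detrended, lag.max = 20)   # correlogram of the detrended (approximately stationary) series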
These comments apply to series which follow a clear sequence of steps in a single dimension, time. At first sight such methods do not translate easily to spatial problems, since there is no obvious single direction to follow. Of course one could select a single transect and take measurements at fixed intervals to produce a well-ordered series, which could then be analyzed in exactly the manner described. But more general procedures are needed if a wide range of practical spatial problems are to be subjected to such analysis. These procedures need to model space in a manner that results in well-ordered data series, ideally in evenly spaced steps, using the general notion of proximity bands. This is described in the next subtopic, on Spatial Autocorrelation.