
# Time series analysis and temporal autoregression



A time series can be regarded as a set of values, {xt}, which represent measurements taken at different (sequential) time periods, t=1,2,3,4,…n. Time itself is clearly a continuous variable, but measurements in many cases are made at specific points in time, so appear as discrete observations. Typically a single variable or value is analyzed at each point in time (univariate analysis) even if multiple variables are recorded for that point (for example, daily weather conditions at a meteorological station; hourly financial stock price and traded volume data). The data are often, but not always, measured or defined for times that have equal intervals between them, e.g. every 5 minutes, close of trading every day, every year. Bivariate and multivariate time series analysis is possible, but is not covered in any detail in this topic.

In the earlier section of this Handbook, discussing temporal autocorrelation, we illustrated a number of time series including stock price and volume data, and temperature data. We showed how such data could be analyzed to identify patterns using the autocorrelation function (acf) and correlograms. In this topic we will take such analysis further, examining other means of analyzing patterns in data of this type, and from this, designing models that seek to have an explanatory and/or predictive function (forecasting). We start by considering statistical methods applied to time series in the time domain, i.e. analysis of data by examining their behavior over time. We then look at similar problems but in the frequency domain — such analyses are particularly applicable to data that exhibit multiple periodicities over time, and the analysis and modeling of these periodicities is the primary objective.

Chatfield (1975 [CHA1]) identifies a variety of different types of time series. These include:

(i) economic time series (share prices, macro-economic statistics, income datasets); (ii) physical time series (river flow data, meteorological data, pollution monitoring data); (iii) marketing time series (sales figures, advertising response data); (iv) demographic time series (population levels over time); (v) manufacturing time series (process output and control charts); (vi) binary processes (digital data sequences, as in switching systems and data transmission systems); and (vii) temporal point processes (1D point processes, and by extension, 2D point processes and spatial point datasets). For some of these series, data measurement is actually continuous, or may be continuous, for example a barograph measuring air pressure, or a data communications channel across which traffic flows and is monitored continuously. Other data are only measured or defined at specific times, for example the daily closing price of a given traded instrument (stock) on the NYSE. A large proportion of the statistical techniques described as time series analysis address problems of this latter type: discrete data, usually recorded at fixed intervals.

There are several reasons that statisticians undertake time series analysis. These include simple descriptive requirements, i.e. using graphs and descriptive measures to identify key aspects of the data — averages, peaks and troughs, periodicity, critical turning points, etc. It may also be a requirement to use the information for prediction, i.e. to obtain estimates of future values of the series based on historic data. Traditionally prediction, or forecasting, was a relatively static procedure, but increasingly real-time predictive modeling is used, especially in highly dynamic environments such as emergency control, infrastructure management (communications, power etc.), and in financial markets. Pure time-series prediction utilizes the raw data to make forecasts. Whilst in some instances, especially for short-term forecasting, this may be an effective approach, for more complex data (i.e. data with complex response behavior) and for longer-term forecasting, explanatory analysis is of great importance. In this case data on many variables might be used to provide estimates of future values of the time series — an example here would be forecasting Gross Domestic Product (GDP), a time series with an extremely complex dependency on many underlying national and international variables. The UK Treasury Model, for example, is used for econometric forecasting and currently utilizes around 30 main equations and around 100 independent variables (variables that are input into the model rather than generated within it). Lastly, the objective of time series analysis may be to enable or assist the control of some dynamic process. For example, engineers forecast peak demand for electricity during the day based on a variety of internal and external inputs, and then use these forecasts to bring on-stream extra generating capacity or supply sources ahead of the demand peaks, and then remove capacity as demand is forecast to fall.
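By way of illustration, pure time-series prediction of the simplest kind can be sketched in a few lines of Python. The example below is a minimal sketch using an invented series, not data or code from this Handbook: it fits an AR(1) model by estimating the lag-1 sample autocorrelation, then uses that coefficient to make a one-step-ahead forecast.

```python
def ar1_forecast(series):
    """One-step-ahead AR(1) forecast: m + phi*(x_n - m), where m is the
    series mean and phi the lag-1 sample autocorrelation coefficient."""
    n = len(series)
    m = sum(series) / n
    dev = [x - m for x in series]          # deviations from the mean
    c0 = sum(d * d for d in dev)           # sum of squared deviations
    phi = sum(dev[t] * dev[t + 1] for t in range(n - 1)) / c0
    return m + phi * (series[-1] - m)

# Illustrative (invented) series: mean 2.5, lag-1 autocorrelation 0.25
print(ar1_forecast([1.0, 2.0, 3.0, 4.0]))  # 2.875
```

Forecasts from such a model revert towards the series mean as the horizon grows, which is one reason explanatory variables become important for longer-term forecasting.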

Note that forecasting relies heavily on what might be called "well behaved" data, i.e. the use of historic data and related information to predict future data values. This can be a very effective procedure, but of course it does not take into account totally unexpected and often dramatic changes. The latter are sometimes referred to as Black Swans (see Taleb, 2007 [TEL1]), because no matter how many white swans people may have seen historically, predicting the existence of black swans would have been almost impossible. Scientists and philosophers who are adherents of the Black Swan perspective argue that in the long term statistical forecasting, and even inductive reasoning, are ineffectual, on the basis that the really important and big changes in the world are simply not possible to predict: major wars, famines, banking crises and similar events seem to come completely 'out of the blue'. Whilst this view clearly has considerable merit, it is too restrictive and negative in many situations, and if widely adopted would result in stagnation in many fields. The Black Swan paradigm has a valuable contribution to make, especially in risk analysis and macro-level forecasting, but in many situations it can be treated as part of the overall picture rather than the dominant component.

As noted above, time series almost always exhibit some degree of autocorrelation, hence analysis of the autocorrelation structure in a series is often one of the first tasks carried out (after visual inspection and data cleansing). However, many time series also exhibit a number of other common characteristics. For example, many exhibit some level of periodicity, and the length and magnitude of such patterns warrant close examination. Likewise, a time series may or may not show a distinct trend in the data, and a longer series might display a trend where none is apparent in a shorter sample, hence the start, end and duration of time series datasets are important. It is usual for any obvious trends to be identified and removed before autocorrelation analysis is carried out. If the precise point at which a series is examined does not affect the results of the analysis, the series is described as stationary. More formally, a stationary time series is one whose joint probability distribution is not altered by a shift in time or space; this in turn requires that the mean and variance of the data are constant across time and/or space, a condition that is rarely achieved. Where series include trend and/or periodic behavior it is usual for these components to be identified and removed (the series is decomposed) before further analysis. Many models used in time series analysis, including simple autoregressive (AR) and moving average (MA) models, and combined ARMA models, assume stationarity. More complex models, such as ARCH and GARCH, that allow for heteroskedasticity, are also supported in specialized software packages, notably in econometric modeling.
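The removal of an obvious linear trend, mentioned above as a usual preliminary step, can be sketched as follows. This is a generic illustration using ordinary least squares on an invented series, not code from any particular package:

```python
def detrend_linear(series):
    """Fit y = a + b*t by least squares and return the residual
    (detrended) series, which can then be examined for stationarity."""
    n = len(series)
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    b = num / den                # estimated slope (trend per time step)
    a = y_mean - b * t_mean      # estimated intercept
    return [y - (a + b * t) for t, y in enumerate(series)]

# A purely linear (invented) series detrends to residuals of (near) zero
residuals = detrend_linear([2.0 + 0.5 * t for t in range(10)])
print(max(abs(r) for r in residuals))
```

Seasonal decomposition follows the same pattern: estimate the systematic component, subtract it, and analyze what remains.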

## Example: Pollution dataset analysis

In order to clarify some of the issues and procedures associated with time series data, we initially examine a pollution monitoring dataset (see chart below). The data represent the daily mean concentration of small particulates (PM10), in micrograms per cubic meter of air, at an automated air pollution monitoring station, GR4, in Greenwich, England. The chart shows the values recorded each day during 1993. This particular dataset for the monitoring station in question extends through to the present.

PM10 daily mean values, Greenwich, Station GR4

Source: London Air Quality Network (LAQN), www.londonair.org.uk

All the data from the start of January 1993 to end December 2009 were downloaded from the LAQN website and examined prior to analysis. Several issues became immediately apparent:

(i) the data for 2009 were classified as "Provisional" rather than "Ratified", and included a very large single outlier (>200), suggesting an equipment error or an exceptional event on that date. The safest option was felt to be to ignore the 2009 data, at least initially, and to examine it subsequently (e.g. following ratification); (ii) with a total of 6209 records, graphing the full (raw) dataset results in an unreadable chart, so graphing subsets, as per the illustration above, with inevitable drawbacks, is often necessary; (iii) the data include many missing values: in fact almost 10% of the overall dataset consists of missing values embedded within the overall sequence. This is a serious issue, as time series analysis is generally not possible unless these missing values are estimated in some manner. Various solutions to this problem were then examined. The first step involved analyzing the size of the various gaps in the data. In some instances these were large, being >30 days, and interpolating across such large gaps was not felt to be acceptable. The data cannot be re-sampled, so the only alternatives are: (a) to analyze subsets of the data between the large gaps, infilling small gaps using some form of interpolation (estimating the missing values); or (b) to find an alternative source for the missing data. Option (b) was initially considered, because there are several PM10 monitoring stations in Greenwich, and records for the nearby station GR5 are available for many of the times when GR4 was out of service. Using records from GR5 would enable the series to be augmented and would reduce the missing values count to 174, i.e. <3%. All of the large gaps could be filled in this way, or using more sophisticated "imputation" methods. The small gaps, typically of <5 days, could then be filled using the series mean value or simple linear interpolation.

Both these procedures raise questions of a technical and scientific nature, and it is the responsibility of the researcher to justify any such decisions and to estimate how they influence the outcome of the analysis. For example, using GR5 rather than GR4 data may be justified on the grounds that the two stations are located near to one another and both are in roadside locations adjacent to major highways. However, the highways will differ in their locational factors and the patterns of traffic carried, so adopting this approach introduces an error into the study. Comparison of the GR4 and GR5 datasets for days on which both stations were monitoring actually shows that GR5 PM10 levels do not provide a usable model for GR4: although the two mean values are very similar (roughly 22 and 26 micrograms per cubic meter respectively), the two datasets are, in fact, not correlated with one another at all. Thus option (b) is not a realistic proposition. Option (a) was then studied, and a series of around 2500 records (almost 5 years' data) was found during which only relatively small sequences of missing data occurred. These were interpolated using the series mean, and the data were then in a suitable form to allow time series analysis to be undertaken.
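The small-gap infilling step can be sketched as below. This is a generic illustration (missing values represented as None, and the 5-day threshold an assumed parameter), not the actual code applied to the GR4 series:

```python
def fill_small_gaps(series, max_gap=5):
    """Linearly interpolate runs of None no longer than max_gap values;
    longer runs are left as None, to be handled by splitting the series
    into separately analyzed segments."""
    filled = list(series)
    i, n = 0, len(filled)
    while i < n:
        if filled[i] is None:
            j = i
            while j < n and filled[j] is None:
                j += 1                      # find the end of this gap
            gap = j - i
            if 0 < i and j < n and gap <= max_gap:
                left, right = filled[i - 1], filled[j]
                for k in range(gap):        # linear interpolation
                    filled[i + k] = left + (right - left) * (k + 1) / (gap + 1)
            i = j
        else:
            i += 1
    return filled

print(fill_small_gaps([10.0, None, None, 16.0]))  # [10.0, 12.0, 14.0, 16.0]
```

Infilling with the series mean instead is a one-line change; either way, the researcher should record how much of the analyzed series is interpolated rather than observed.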

Typically the first step in analysis of this kind of data is to compute and plot the autocorrelation function (see below, first chart). This clearly shows a pattern of local autocorrelation, with a secondary peak at around one month. By computing and plotting the partial autocorrelation levels for each lag it is possible to obtain a clearer picture of periodicity in the data (see below, second chart). This shows that most of the autocorrelation is actually just a lag 1 (1 day) effect, with minor effects at 3, 7 and 25 days.
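The quantities plotted in such charts can be computed as follows. This is a generic sketch (sample autocorrelations, and partial autocorrelations via the Durbin-Levinson recursion), with invented input values, not the code used to produce the charts themselves:

```python
def acf(series, max_lag):
    """Sample autocorrelations r_k, k = 0..max_lag, with r_0 = 1."""
    n = len(series)
    m = sum(series) / n
    dev = [x - m for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

def pacf(r, max_lag):
    """Partial autocorrelations from r = [r_0, r_1, ...] (r_0 = 1),
    computed with the Durbin-Levinson recursion."""
    phi = [[0.0] * (max_lag + 1) for _ in range(max_lag + 1)]
    phi[1][1] = r[1]
    pac = [1.0, r[1]]
    for k in range(2, max_lag + 1):
        num = r[k] - sum(phi[k - 1][j] * r[k - j] for j in range(1, k))
        den = 1.0 - sum(phi[k - 1][j] * r[j] for j in range(1, k))
        phi[k][k] = num / den
        for j in range(1, k):               # update lower-order coefficients
            phi[k][j] = phi[k - 1][j] - phi[k][k] * phi[k - 1][k - j]
        pac.append(phi[k][k])
    return pac

# For an exact AR(1) autocorrelation structure (invented values 0.5, 0.25),
# the partial autocorrelation is zero beyond lag 1
print(pacf([1.0, 0.5, 0.25], 2))  # [1.0, 0.5, 0.0]
```

The recursion makes clear why the PACF isolates the lag-k effect after allowing for shorter lags, which is exactly the property used above to attribute most of the autocorrelation to a 1-day effect.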

Autocorrelation function of PM10 data, Greenwich UK, GR4, Daily records

Partial autocorrelation function of PM10 data, Greenwich UK, GR4

In this particular dataset, the initial analysis, graphing of the long-term dataset, and further analysis of the types discussed below all indicate that there is actually very little pattern in the data: no distinct trend, very little periodicity, and only very short-term, day-to-day correlation between recorded levels. This in turn suggests that: (a) PM10 levels remained stable over the duration of the study period; and (b) fluctuations in observed PM10 levels are largely random. This latter observation is supported by, and perhaps explains, the absence of any observable correlation between PM10 levels at stations GR4 and GR5.

## References

[BOX1] Box G E P, Jenkins G M, Reinsel G C (1994) Time Series Analysis, Forecasting and Control. 3rd ed. Prentice Hall, Englewood Cliffs, NJ

[CHA1] Chatfield C (1975) The Analysis of Time Series: Theory and Practice. Chapman and Hall, London

[PEN1] Pena D, Tiao G C, Tsay R S, eds. (2001) A Course in Time Series Analysis. John Wiley and Sons, New York

[TEL1] Taleb N N (2007) The Black Swan: The Impact of the Highly Improbable. Penguin Books, London