SAR models


Many of the techniques briefly described in this subsection originate from time series analysis and were subsequently developed, from the mid-1950s, within the discipline known as spatial statistics. They have been applied and substantially extended in the last 25 years, notably by econometricians, geographers and medical statisticians. Additional disciplines that have made extensive use of these techniques include the actuarial, ecological and environmental sciences.

Detailed discussion of the methods and underlying theory may be found in Cressie (1993, [CRE1]), Bailey and Gatrell (1995, [BAI1]), Anselin (1988, [ANS1]), Anselin and Bera (1998, [ANS3]), Anselin (2002, [ANS2]), Haining (2003, [HAI1]) and, most recently, Gelfand et al. (2010, [GEL1]). Haining discusses the various forms of model that may be appropriate for the statistical analysis of spatial data. He notes that the choice of a particular modeling approach often reflects the data types, preferences and experience within the disciplines involved, with social scientists, earth scientists and medical researchers all tending to focus on different approaches.

The procedures have been implemented in many software packages, including SpaceStat, S-Plus, the R-Spatial project, the MATLAB Spatial Statistics Toolbox (Pace et al.), WinBUGS, SAM, PySAL and GeoDa. A number of these have been specifically developed to deal with the large (and often sparse) matrix difficulties that arise with detailed regional and national datasets.

A pure spatial autoregressive model simply consists of a spatially lagged version of the dependent variable, y:

$$y = \rho W y + \varepsilon$$

As can be seen, this is similar to a standard linear regression model in which the first term is constructed from a predefined n by n spatial weighting matrix, W, applied to the observed variable, y, together with a spatial autoregression parameter, ρ, which typically has to be estimated from the data. Essentially, a spatial lag model expresses the notion that the value of a variable at a given location is related to the values of the same variable measured at nearby locations, reflecting some kind of interaction effect. The spatial weights matrix, W, is almost always standardized such that its rows sum to 1, so the model effectively includes a weighted average of neighboring values in the regression equation. Note that in this case W is not necessarily symmetric (compare this with the symmetry requirements on conditional autoregressive, or CAR, models).
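The effect of row standardization can be illustrated with a small sketch. The 3 by 3 grid, the rook contiguity rule and the helper names rook_weights and row_standardize below are illustrative assumptions, not part of any package mentioned in this section:

```python
import numpy as np

def rook_weights(nrows, ncols):
    """Binary rook-contiguity matrix for a regular nrows x ncols grid."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            if r + 1 < nrows:              # cell directly below
                j = (r + 1) * ncols + c
                W[i, j] = W[j, i] = 1.0
            if c + 1 < ncols:              # cell directly to the right
                j = r * ncols + (c + 1)
                W[i, j] = W[j, i] = 1.0
    return W

def row_standardize(W):
    """Scale each row to sum to 1; rows without neighbors stay all zero."""
    sums = W.sum(axis=1, keepdims=True)
    return np.divide(W, sums, out=np.zeros_like(W), where=sums > 0)

W_binary = rook_weights(3, 3)      # symmetric 0/1 contiguity
W = row_standardize(W_binary)      # rows sum to 1, symmetry is lost

y = np.arange(1.0, 10.0)           # illustrative values 1..9, stored row by row
lag = W @ y                        # spatial lag: mean of each cell's neighbors

print(lag.reshape(3, 3))
print("W symmetric after standardization?", np.allclose(W, W.T))
```

In this sketch the corner cells have two neighbors and the central cell has four, so the corresponding rows of W carry weights of 1/2 and 1/4 respectively. This is why the standardized matrix is no longer symmetric even though the underlying contiguity relation is.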

For an individual observation the basic spatial lagged autoregression equation is simply:

$$y_i = \rho \sum_{j} w_{ij} y_j + \varepsilon_i$$

Note the similarity of this model to a series of simultaneous equations (hence the description of such models as simultaneous autoregressive, or SAR, models). The model can also be compared with a simple time series autoregressive model, to which it owes its origins:

$$y_t = \rho y_{t-1} + \varepsilon_t$$

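Since the dependent variable, y, appears on both sides of the expression:

$$y = \rho W y + \varepsilon$$

it can be re-arranged to give:

$$y = (I - \rho W)^{-1} \varepsilon$$

from which we can obtain an expression for the variance of y as:

$$\mathrm{Var}(y) = E(y y^T) = (I - \rho W)^{-1} E(\varepsilon \varepsilon^T) \left[(I - \rho W)^{-1}\right]^T$$

hence, taking the errors to have zero mean and to be uncorrelated with constant variance, $E(\varepsilon \varepsilon^T) = \sigma^2 I$,

$$C = \sigma^2 \left[(I - \rho W)^T (I - \rho W)\right]^{-1}$$

where C is the variance-covariance matrix.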

This derivation has made no distributional assumptions regarding the response variable or the errors. In this model the spatial weights matrix, W, is effectively raised to the power 1, so only first-order neighbors are included in the autoregressive function, and for this reason the model is sometimes described as having a ‘first-order’ specification.
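The reduced form and the variance-covariance matrix of the pure SAR model can be explored numerically. The following is a minimal sketch under stated assumptions: the four-zone chain, the values ρ = 0.5 and σ² = 1, and all variable names are hypothetical choices made purely for illustration.

```python
import numpy as np

# Hypothetical chain of four zones, 0-1-2-3, each adjacent to the next
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = W / W.sum(axis=1, keepdims=True)      # row-standardize

rho, sigma2 = 0.5, 1.0                    # assumed parameter values
n = W.shape[0]
A = np.eye(n) - rho * W                   # (I - rho W)
A_inv = np.linalg.inv(A)

# Reduced form: y = (I - rho W)^{-1} eps
rng = np.random.default_rng(0)
eps = rng.normal(scale=np.sqrt(sigma2), size=n)
y = A_inv @ eps

# Variance-covariance matrix, written in two equivalent ways
C = sigma2 * np.linalg.inv(A.T @ A)
C_alt = sigma2 * A_inv @ A_inv.T

# For |rho| < 1 and row-standardized W, the inverse equals the power series
# I + rho W + rho^2 W^2 + ..., truncated here after 200 terms
series = sum(np.linalg.matrix_power(rho * W, k) for k in range(200))

print(np.round(C, 3))
print("covariance between zones 0 and 3:", round(C[0, 3], 3))
```

Zones 0 and 3 are not neighbors in this example, yet their covariance is non-zero. Although only first-order neighbors enter the autoregressive function directly, the inverse of (I − ρW) involves all powers of W, so dependence propagates through chains of neighbors.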

Although SAR and CAR models are widely used, Wall (2004, [WAL1]) has pointed out significant weaknesses in the interpretation of the spatial structure implied by commonly applied SAR and CAR spatial weighting schemes. These models were originally designed for use on infinite regular lattices, rather than finite irregular lattices (which have edges and variable numbers of neighbors per zone), and in this latter context are less well behaved. She recommends considering the use of geostatistical models as an alternative or additional approach when analyzing lattice (zonal) datasets, since in this approach the covariance function is modeled directly. Of course, geostatistical methods also have their weaknesses in this context, notably the need to use zone centroids or similar arbitrary points rather than the zones themselves when constructing the experimental variogram.

If we add additional predictor variables, x, to the pure spatially lagged autoregression model described earlier we have a mixed regressive spatial autoregressive model (mrsa):

$$y = \rho W y + X\beta + \varepsilon$$

As can be seen, this is the same as a standard linear regression model with the addition of the autoregressive component (the SAM software refers to this formulation as a ‘lagged response model’). As before we can re-arrange this expression to solve for y:

$$y = (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1} \varepsilon$$

The design of this kind of mixed model incorporates spatial autocorrelation together with the influence of other (aspatial) predictor variables. The objective of this revised approach is to obtain a significant improvement over a standard OLS model. The level of improvement will depend on how well the revised model represents or explains the source data, and to an extent this will vary depending on the detailed form of the weighting matrix, W.
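As an illustration of how the parameters of a mixed regressive spatial autoregressive model might be estimated, the sketch below uses a grid search over ρ on the concentrated log-likelihood. This is one commonly used approach and not necessarily the procedure implemented in any package named in this section. It assumes normally distributed errors, a row-standardized rook contiguity matrix on a hypothetical 20 by 20 grid, and simulated data with known parameters; the function names rook_weights_rowstd and fit_lag_model are invented for the example.

```python
import numpy as np

def rook_weights_rowstd(nrows, ncols):
    """Row-standardized rook-contiguity weights for a regular grid."""
    n = nrows * ncols
    W = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            if r + 1 < nrows:                      # cell below
                W[i, i + ncols] = W[i + ncols, i] = 1.0
            if c + 1 < ncols:                      # cell to the right
                W[i, i + 1] = W[i + 1, i] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def fit_lag_model(y, X, W, grid=None):
    """Grid-search ML for y = rho W y + X beta + eps (normal errors assumed)."""
    if grid is None:
        grid = np.linspace(-0.95, 0.95, 191)
    n = len(y)
    Wy = W @ y
    # OLS of y on X and of Wy on X; beta(rho) = b0 - rho * bL
    b0 = np.linalg.lstsq(X, y, rcond=None)[0]
    bL = np.linalg.lstsq(X, Wy, rcond=None)[0]
    e0, eL = y - X @ b0, Wy - X @ bL
    identity = np.eye(n)
    best_rho, best_ll = grid[0], -np.inf
    for rho in grid:
        # log|I - rho W| is the Jacobian term that OLS ignores
        _, logdet = np.linalg.slogdet(identity - rho * W)
        resid = e0 - rho * eL
        ll = logdet - 0.5 * n * np.log(resid @ resid / n)
        if ll > best_ll:
            best_rho, best_ll = rho, ll
    return best_rho, b0 - best_rho * bL

# Simulate data from the reduced form with known (assumed) parameters
rng = np.random.default_rng(42)
W = rook_weights_rowstd(20, 20)
n = W.shape[0]
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_rho, true_beta = 0.5, np.array([1.0, 2.0])
eps = rng.normal(size=n)
y = np.linalg.solve(np.eye(n) - true_rho * W, X @ true_beta + eps)

rho_hat, beta_hat = fit_lag_model(y, X, W)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]    # ignores the spatial lag

print("rho estimate:", round(float(rho_hat), 2))
print("beta (lag model):", np.round(beta_hat, 2))
print("beta (plain OLS):", np.round(beta_ols, 2))
```

The dense determinant calculation used here is only practical for small problems. The packages listed earlier rely on sparse-matrix or eigenvalue-based methods for large regional and national datasets.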

Theoretical analyses have shown that this type of model can be derived from a variety of different processes, including direct processes such as spatial diffusion, certain forms of spatial interaction (including spillover and gravity or potential-type process models), and indirect processes such as resource distribution. This lack of a well-defined link between process and form is commonplace in spatial analysis, and is well-documented in fields such as point set clustering and fractal analysis. That it also applies here, in spatial regression modeling, should come as no surprise.

A second approach to SAR modeling is known as the spatial error model. This model is applied when there appears to be significant spatial autocorrelation, but tests for spatial lag effects do not suggest that inclusion of the latter would provide a significant improvement. A decision diagram for selecting the appropriate model based on a set of additional diagnostics (Lagrange multiplier test statistics) is included in GeoDa (see the tutorial materials for a discussion of their use). The spatial error model (from GeoDa) is defined as:

$$y = X\beta + \varepsilon, \qquad \varepsilon = \lambda W \varepsilon + u$$

Hence the basic model is as per the standard linear model, but now the error term is assumed to be made up of a spatially weighted error vector, λWε, and a vector of iid errors, u. We can re-arrange the expression for ε, above, to obtain:

$$\varepsilon = (I - \lambda W)^{-1} u$$

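hence the error variance-covariance matrix, C, in this case is given by:

$$C = E(\varepsilon \varepsilon^T) = \sigma^2 \left[(I - \lambda W)^T (I - \lambda W)\right]^{-1}$$

where σ² is the variance of the iid errors, u.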

The Georgia, USA educational attainment dataset used earlier to illustrate GWR can be analyzed in a similar manner using spatial autoregressive methods. If this is conducted within GeoDa, the OLS results match those within GWR (although the AIC values differ slightly owing to differences in the detailed expressions applied). However, to apply a spatial autoregressive model a spatial weights matrix is required. In the following example we have set the spatial weights to be defined by simple first-order rook’s move contiguity (adjacent edges), and then examined the GeoDa diagnostics to determine which form of regression model seems most appropriate to apply. In this instance the spatial error model was identified as the most appropriate and the regression was re-run using this model. The results are summarized in the table below, which includes the new autoregressive parameter estimates.

Georgia dataset: comparative regression estimates and diagnostics

| Predictor variables | Global parameter estimate | Spatial error model parameter estimates | GWR parameter estimates |
|---|---|---|---|
| Total population, β1 | 0.24 × 10⁻⁴ | 0.24 × 10⁻⁴ | 0.14 to 0.28 × 10⁻⁴ |
| % rural, β2 | -0.044 | -0.046 | -0.06 to -0.03 |
| % elderly, β3 | -0.06* | -0.099* | -0.26 to -0.06 |
| % foreign born, β4 | 1.26 | 1.196 | 0.51 to 2.42 |
| % poverty, β5 | -0.15 | -0.145 | -0.20 to -0.00 |
| % black, β6 | 0.022* | 0.013* | -0.04 to 0.08 |
| Intercept, β0 | 14.78 | 15.46 | 12.62 to 16.49 |
| lambda, λ | | 0.313 | |
| Diagnostics | | | |
| Residual SS (RSS) | 1816 | 1708 | 1506 |
| Adjusted R² | 0.63 | 0.67 | 0.68 |
| Effective parameters | 7 | 7 | 12.81 |
| AIC/AICc | 855.4 | 846.0 | 839.2 |

* not significant

Although the RSS value in the table above is not as low as with GWR, the model is intrinsically far simpler and enables a more global view of the relationship between variables. There is an argument for utilizing both global regression and GWR approaches when analyzing datasets of this type, since they provide different perspectives on the data, and different insights into the use of such data for predictive purposes. The spatial error model applied in this example was defined above as:

$$y = X\beta + \varepsilon, \qquad \varepsilon = \lambda W \varepsilon + u$$

Observing that:

$$\varepsilon = y - X\beta$$

we have:

$$y = X\beta + \lambda W (y - X\beta) + u = X\beta + \lambda W y - \lambda W X\beta + u$$

Hence this expression models the dependent variable y as a combination of a general (global) linear trend component, Xβ, plus a pure spatial autocorrelation component, λWy, plus a (negative) neighboring trend or predictor component, λWXβ, plus a set of iid random errors, u (the SAM software refers to this formulation as a ‘lagged predictor model’). Comparing this to the mrsa model above:

$$y = \rho W y + X\beta + \varepsilon$$

we see that the spatial error model can be viewed as a form of mixed spatial lag model with an additional spatial component, the neighboring trend, λWXβ. These types of model can be generalized still further (Haining, 2003, p355, [HAI1]), for example as:

where the scalars α, ρ and φ, and the vectors β and δ are all parameters to be estimated, and the final term represents spatial autocorrelation on the errors. Clearly one could proceed from the generalized model to the particular, or vice versa. Likewise one could progressively increase or decrease the set of explanatory variables in the model.
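Returning to the rewritten form of the spatial error model, y = Xβ + λWy − λWXβ + u, the equivalence with the original two-equation definition can be checked numerically. The sketch below uses a hypothetical ring of six zones and arbitrary parameter values, all of which are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Hypothetical ring of six zones: each zone has two neighbors, weight 1/2 each
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
beta = np.array([1.0, 2.0])                             # assumed coefficients
lam = 0.3                                               # assumed lambda
u = rng.normal(size=n)                                  # iid errors

# Original definition: y = X beta + eps, with eps = lam W eps + u
eps = np.linalg.solve(np.eye(n) - lam * W, u)
y = X @ beta + eps

# Rewritten form: trend + spatial lag of y - neighboring trend + iid errors
rhs = X @ beta + lam * (W @ y) - lam * (W @ X @ beta) + u

print("max absolute difference:", np.max(np.abs(y - rhs)))
```

The two sides agree to within floating point precision, which confirms that the spatial error model can be read as a lag model on y with an offsetting lagged predictor term.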

Given the considerable complexity of spatial phenomena, Haining suggests a data-driven approach to statistical modeling, which can be seen as fitting comfortably within the Data and Analysis components of the PPDAC framework described earlier in this Handbook. His approach commences with ESDA, proceeds to model specification for the current data, and then progresses to an iterative cycle of selection and implementation of parameter estimation, assessment of model fit and re-specification where necessary.

References

[ANS1] Anselin L (1988) Spatial econometrics: Methods and models. Kluwer Academic Publishers, Dordrecht, NL

[ANS2] Anselin L (2002) Under the hood: Issues in the specification and interpretation of spatial regression models. Agricultural Economics, 27(3), 247–67

[ANS3] Anselin L, Bera A (1998) Spatial dependence in linear regression models with an introduction to spatial econometrics. In Ullah A, Giles D (eds.) Handbook of applied economic statistics. Marcel Dekker, New York, 237–89

[BAI1] Bailey T C, Gatrell A C (1995) Interactive spatial data analysis. Longman, Harlow, UK

[CRE1] Cressie N A C (1991, 1993) Statistics for spatial data. John Wiley, New York (Revised edition 1993)

[GEL1] Gelfand A E, Diggle P J, Fuentes M, Guttorp P eds. (2010) Handbook of Spatial Statistics. Chapman Hall/CRC Press, Boca Raton, Florida

[HAI1] Haining R (2003) Spatial data analysis — theory and practice. Cambridge University Press, Cambridge, UK

[WAL1] Wall M M (2004) A close look at the spatial structure implied by the CAR and SAR models. Journal of Statistical Planning and Inference, 121, 311–24