<< Click to Display Table of Contents >> Navigation: Regression and smoothing > Spatial series and spatial autoregression |
The broad principles of regression analysis and modeling, as described in earlier sections of this Handbook, also apply to two-dimensional (spatial) datasets. In a spatial context one is seeking to model the variation in some spatially distributed dependent variable, for example house prices in a city, from a set of independent variables such as size of property, age of property, distance from transport facilities, proportion of green space in the region etc. (known as hedonic regression). Spatial coordinates (x,y) or (x1,x2) might be explicitly or implicitly incorporated into such models, for example including the coordinates of the centroid of a census tract or event location into the modeling process, or by taking account of the pattern of adjacency of neighboring zones. Methods such as these may be used in the context of spatial description and mapping, as part of a process of data exploration, or for explanatory and/or predictive purposes.
Ideally in regression analysis the form of the chosen model should be as simple as possible, both in terms of the expression employed and the number of independent variables included — these are the simplicity and parsimony objectives. In addition, the proportion of the variation in the dependent variable(s), y, explained by the independent variables, x, should be as high as possible, and the correlation between separate independent variables should be as low as possible. These criteria form part of a strength of evidence objective. If some, or all, of the x are highly correlated (typically having a correlation coefficient above 0.8) the model almost certainly contains redundant information and may be described as being over-specified. If a strong relationship exists between selected independent variables and is also broadly linear, the variables are said to exhibit multi-collinearity. There are a number of standard techniques for reducing multi-collinearity, including: applying the so-called centering transform (deducting the mean of the relevant independent variable from each measured value); increasing the sample size (especially if samples are quite small); removing the most inter-correlated variable; and combining inter-correlated variables into a new single composite variable (e.g. using principal components analysis, PCA, or other data reduction techniques). The latter procedures can be viewed as forms of model re-specification.
With higher-order spatial trend analyses, multi-collinearity is to be expected since various combinations of the coordinates are themselves the independent variables. Technically, imperfect multi-collinearity does not result in the ordinary least squares (OLS) assumptions being violated, but does lead to large standard errors for the coefficients involved, also referred to as variance inflation (VI). This limits their usefulness. As a rule of thumb, if the measure known as the Multi-collinearity Condition Number, MCN, is greater than 30 this problem is quite serious and should be addressed. The MCN is derived from the ratio of the largest to the smallest eigenvalue of the matrix XTX. Generally the square root of this ratio is reported, but it is important to check this for any specific software package (if no square root is taken values greater than 1000 are taken as significant).
When fitting a model using OLS it is generally assumed that the errors (residuals for sample points) are identical and independently distributed (iid). If the spread of errors is not constant, for example if in some parts of the study area the residuals are much more variable than in others these errors are said to exhibit heteroskedasticity. If the latter is the case (which may be detected using a range of standard tests) the estimated variance under OLS will be biased, either being too large or too small (it is not generally possible to know which direction the bias takes). As a result the specification of confidence intervals and the application of standard statistical significance tests (e.g. F tests) will not be reliable.
Many (most) spatial datasets exhibit patterns of data and/or residuals in which neighboring areas have similar values (positive spatial autocorrelation) and hence violate the core assumptions of standard regression models. One result of this feature of spatial data is to greatly over-estimate the effective sample size, and hence the degrees of freedom that may be indicated will be stated as being far higher than they should be.
A common additional requirement for standard regression modeling, as noted above, is that the distribution of the errors is not only identical but also Normal, i.e.
ε~ N(0,σ2I)
If the distribution of the errors is Normal and the regression relationship is linear, this equates to the response variable being Normally distributed, and vice versa, i.e.
y~ N(Xβ,σ2I) , or setting the vector μ= Xβ
y~ N(μ,σ2I)
The assumption of Normality permits the analytical derivation of confidence intervals for the model parameters and significance tests on their values (against the null hypothesis that they are zero). In order to satisfy the Normality conditions required in many models it is common to transform (continuous valued) response data using Log, Square Root or Box-Cox functions prior to analysis where this can be shown to improve the approximation to the Normal. It is also advisable to examine and, where appropriate, remove outliers even if these are not errors in the dataset but represent valid data that would otherwise distort the modeling process. Exclusion must be made explicit and separate investigation of any excluded data carried out. An alternative to exclusion is to weight observations according to some rule. Spatial declustering is an example of such weighting applied to spatial datasets.
If spatial autocorrelation has been detected a more reasonable assumption is:
y~ N(μ,C)
where C is a positive definite covariance matrix. Typically C would be defined in such a way as to ensure that locations that are close to each other are represented with a high value for the modeled covariance, whilst places further apart have lower covariance. This may be achieved using some form of distance-decay function or, for zoned data, using a contiguity-based spatial weights matrix. This kind of arrangement is similar to Generalized Least Squares (GLS), which allows for so-called second-order effects via the inclusion of a covariance matrix. The standard form of expression for GLS is:
y=Xβ+u
where u is a vector of random variables (errors) with mean 0 and variance-covariance matrix C. The GLS solution for β is then:
In spatial analysis this kind of GLS model is often utilized, with the matrix C modeled using some form of distance or interaction function. Note that C must be invertible, which places some restrictions on the possible forms it may take. Simple weighted least squares (WLS) is a variant of GLS in which the matrix C is diagonal and the diagonal values are not equal (hence the errors are heteroskedastic). In this case the weights are normally set to the inverse of the variance, thereby giving less weight to observations which have higher variance.
In addition to regression on dependent variables that take continuous values, similar methods have been developed for binary and count or rate data. In the case of binary data the data typically are transformed using the logit function. The central assumption of this model is that the probability, p, of the dependent variable, y, taking the value 1 (rather than 0) follows the logistic curve, i.e.
This model can be linearized using the logit transform, q=ln(p/(1‑p)). The regression model then becomes:
As with standard linear regression the errors in this model are assumed to be iid and the predictor variables are assumed not to be co-linear. The parameters in such models are determined using non-linear (iterative) optimization methods, involving maximization of the likelihood function. With count or rate data Poisson regression may be applicable. The model is similar in form to the binary logit model, but in this case the response variable, y, is count (integer) data (hence non-continuous). The basic model in this case is of the form:
where A is an (optional) offset or population-based value (e.g. the underlying population at risk in each modeled zone). This model may be linearized, as per the logit model, but this time by simply applying a log transform: q=ln(y). The regression model then becomes:
There are many variants of linear and non-linear regression analysis — choosing between alternative models and methods presents a particular challenge. Each additional parameter may improve the fit of the model to the sample dataset (or data set), but at the cost of increasing model complexity. Model building should be guided by the principal of parsimony — that is the simplest model that is fit for the purpose for which it is built. Overly complicated models frequently involve a loss of information regarding the nature of key relationships. Drawing on this information content view of modeling, an information theoretic statistic is often used as a guide to model selection. The most common statistic now used is the Akaike Information Criterion (AIC) and is often computed in its small-sample corrected version (AICc) since this is asymptotic to the standard version. The measures are defined as:
where n is the sample size, k is the number of parameters used in the model, and L is the likelihood function. It is recommended that the corrected version be used unless n/k>40. If the errors are assumed to be Normally distributed then the standard expression can be re-written as:
The AIC measure is utilized in several of the software packages described in this section and elsewhere in this Handbook (e.g. in the point pattern analysis software, spatstat). For example GeoDa uses the version above in OLS estimation, whilst Geographically Weighted Regression (GWR) uses a variant of the version corrected for small sample sizes. Both packages use appropriate alternative expressions in cases where the likelihood functions are different from the Normal model. A wide range of regression models, including OLS and GWR, and several autocorrelation models (see below) are supported by the SAM package (which also supports AIC-based model selection). Although SAM has been designed for and by macroecologists, it can equally well be applied in other fields of spatial research. A second information theoretic measure, which places greater weight on the number of parameters used, is often also provided. This is known as the Bayesian Information Criterion or BIC, or sometimes as the Swartz Criterion. It has the simpler form:
Simple spatial regression and trend surface modeling
Simple regression techniques have been widely applied in spatial analysis for a very long time. The general form of model employed can be represented by the expression:
where: y is the (possibly transformed) observed value of some continuous dependent variable y at location (x1,x2); f() is a general function to be defined, but often a linear expression in the coefficients; and w is a vector of attributes measured at (x1,x2). For example, if f() is simply modeled as a linear function of the coordinates we can write
Model performance can be assessed using a number of statistics. One such statistic, which will be referred to in our subsequent discussion, is the correlation coefficient squared (coefficient of determination or R2) statistic. This statistic identifies the proportion of variation in the data that is "explained" by the model. A modification this statistic, adjusted R2, takes into account the complexity of the model in terms of the number of variables that are specified. With a set of n observations made at different locations in the study area, yi(x1i,x2i), a series of n equations can be set up and solved for the p=3 unknown parameters using OLS. This is simply a linear trend or best fit linear surface through the point set, {yi}. A set of differences between the observed values and the best fit surface can then be computed:
where the y hat values (the estimated values) are obtained by evaluating the trend surface at each point i. Assuming the value of n is reasonably large and provides a representative coverage of the complete set of locations in the study area, these residuals may be treated as samples from a continuous surface and mapped using simple contouring or filled contouring procedures. If, on the other hand, the sample points represent centroids of zones (pre-defined regions or generated zones, e.g. Voronoi polygons) the visualization of residuals is generally applied to these zones as a choropleth map. Plotting the residuals using a combination of the fitted surface and stick/point display in 3D may also be helpful but is less common.
Whichever approach is adopted the fitting of a trend surface and mapping of residuals is a form of exploratory spatial data analysis (ESDA) and is supported in many Geographical Information Systems (GIS) and related packages. Within GIS packages such residual maps may be overlaid on topographic or thematic maps with a pre-specified transparency (e.g. 50%) enabling spatial associations between large positive or negative residuals to be visually compared to potential explanatory factors.
With higher order polynomials or alternative functions (e.g. trigonometric series) the fit of the trend surface to the sample data may be improved (the sum of the squared residuals reduced) but the overall utility of the procedure may well become questionable. This may occur for many reasons including the increased difficulty in interpreting the fitted surface (e.g. in terms of process), and increased volatility of the surface close to and immediately beyond the convex hull of the sample points and/or in sparsely sampled regions. Higher order polynomial trend surfaces are usually limited to second order equations, although in some applications 3rd order or above have been shown to be useful — for example, Lichstein et al. (2002, [LIC1]) fit a full 3rd order polynomial trend equation to the spatial coordinates of bird species sightings. They then proceed to analyze the distribution of bird species by modeling habitat + trend (large scale spatial pattern) + spatial autocorrelation (small scale spatial pattern) in order to identify separate factors that might be used to explain observed distribution patterns by species. Note that any analysis of trends in this way will result in a fitted surface whose form is heavily dependent upon the set of specific locations at which measured values of y are available and the scale factors operating for the problem being examined. Frequently data locations are not the result of selections made by the analyst but are pre-determined survey locations that may reflect convenience for data gathering (e.g. proximity to a forest path, access to quarry sites, location of wells) rather than an ideal sampling frame for analytic purposes.
The most common measure of the quality of fit of a trend surface or similar regression is the coefficient of determination, which is the squared coefficient of (multiple) correlation, R2. This is simply a function of the squared residuals with standardization being achieved using the sum of squared deviations of observations from their overall mean:
Under appropriate conditions, e.g. ε~N(0,σ2I) the significance of this coefficient can be evaluated in a meaningful way, with values close to 1 generally being highly significant. Frequently, however, the statistical significance of the measure cannot be determined due to distribution conditions not being met, and the value obtained should be seen as no more than an indicator of goodness of fit.
Trend surface and residuals analysis is essentially a form of non-statistical ESDA — it is not necessary to state how significant the fit of the model is or what confidence limits might be placed around the values of the estimated coefficients and the fitted surface. For inferential analysis the standard pre-conditions that apply are extremely unlikely to be met in the context of spatial analysis. Conditions that may apply for statistical purposes include:
•the set {yi} is comprised of independent (uncorrelated) observable random variables. This set does not have to be drawn from a Normal distribution but it is often helpful or necessary that they are at least asymptotically Normal (possibly after transformation)
•the set {εi} is comprised of independently and identically distributed (iid) unobservable random variables with mean 0 and constant variance, σ2, where σ2 is not a function of β or x
•the set {εi} is Normally distributed — this condition is not a pre-requisite but is a requirement for inference on finite sets (as previously noted, this condition is directly connected to the first bullet point above)
•the model applied is appropriate, complete and global. This assumption incorporates the assumption that the independent variables, x, are themselves uncorrelated, and the parameters β are global constants
Tests for the Normality of the observations and the residuals are commonplace (the first and third conditions above), and a variety of tests for heteroskedasticity (resulting in failure to meet the second condition) have been implemented in a several GIS-related packages. Note that heteroskedasticity may or may not be associated with positive spatial autocorrelation. Testing for (global) spatial autocorrelation using the Moran I statistic is a useful starting point for such analyses. A spatial weights matrix, W, is required in order to generate the Moran I statistic. This statistic has the form:
This measure is the spatial equivalent of the standard product moment correlation coefficient. The weighting terms, wij, are defined either from patterns of zone adjacency or using some form of distance-decay function. A value of Moran’s I=1 indicates perfect positive spatial autocorrelation, whilst a value of Moran’s I very close to 0 indicates no spatial autocorrelation. A value such as I=0.2 indicates some positive spatial autocorrelation, although this may or may not be significant — with large spatial datasets small positive I–values are very often technically identified as significant, although the value of this interpretation may not be great. Assuming that significant spatial autocorrelation is identified several options exist for continued analysis using regression methods:
•proceed with the analysis ignoring the observed spatial autocorrelation (particularly if it is small) accepting that the drawing of confidence intervals and determination of significance levels will not be possible, or if computed may be misleading
•drop the requirement that the parameters β are global constants and allow these to vary with location. Effectively this involves fitting a series of regression surfaces to the dataset in such a way as to ensure a continuous prediction surface is created. One procedure for achieving this objective is Geographical Weighted Regression (GWR); or
•include additional elements into the regression model that take explicit account of the observed pattern of spatial autocorrelation, as described in the sections on SAR and CAR models
References
[LIC1] Lichstein J W, Simons T R, Shriner S A, Franzreb K E (2002) Spatial autocorrelation and autoregressive models in Ecology. Ecological Monographs, 72, 445-63
[GET1] Getis A D, Griffith D A (2002) Comparative spatial filtering in regression analysis. Geog. Anal., 34, 130-40