Regression and smoothing

Regression analysis is the term used to describe a family of methods that seek to model the relationship between one (or more) dependent or response variables and a number of independent or predictor variables. However, as Yule and Kendall (1950, p213, [YUL1]) note:

"the term regression is not a particularly happy one.... it was introduced by Galton [over a century ago] in connection with the inheritance of stature. Galton found that the sons of fathers who deviate x inches from the mean height of all fathers themselves deviate from the mean height of all sons by less that x inches... which he described as a 'regression to mediocrity' "

But the term has stuck, and its original usage is now generally forgotten.

If there are two variables, x and y say, and one is regarded as the dependent variable, y, and the other as the independent variable, x, they may be totally independent, or they may show some degree of dependency, i.e. as x increases y either increases or decreases in a systematic manner. If a scatter diagram or scatterplot of sample pairs (x,y) is produced it may be possible to discern the general form of a systematic relationship, as in the example discussed below. If there is no pattern, but only what appears to be a random scattering of points, it is likely that there is no relationship detectable from this particular dataset and the variables may well be regarded as being independent. If, on the other hand, a very distinct pattern is observed, perhaps a simple linear relationship with y increasing in a linear manner as x increases, then a simple linear model may well be appropriate (bearing in mind, however, our earlier examples of scale dependency and non-linearity as illustrated by Anscombe's quartet). Where the relationship is linear and positive, a best fit straight line through the scattering of (x,y) pairs will have a positive slope, b, and unless this line goes through the origin (0,0) it will cross the y-axis at some value y=a, known as the intercept. The equation will then be of the form:

y = a + bx

The diagram below shows a scatterplot of 9 data pairs (from Crawley, 2007, [CRA1]): {y,x}=(12,0) (10,1) (8,2) (11,3) (6,4) (7,5) (2,6) (3,7) (3,8). These data represent the growth of caterpillars (the dependent or y variable), each given a diet differing in its tannin content (the x variable, incremented from 0 to 8). A best fit line is also shown, with the differences between the observed data pairs (large black circles) and the best fit line shown as vertical lines (the residuals). As can be seen, the line drawn has a negative slope: the more tannin in the diet, the smaller the growth of the caterpillars, so the slope b has a negative sign. The line crosses the y-axis (x=0) at approximately y=12, and passes through the mean of the x and y values, which is at x=4, y=6.889.

[Figure: scatterplot of the nine (x,y) data pairs with the best fit line, the residuals shown as vertical lines]

As noted above, in general a best fit line will not pass exactly through the (x,y) pairs of the dataset, but it will pass through the point defined by the mean of the x-values and the mean of the y-values. This partly determines the equation for the line, but the line could still be rotated about this point to have any slope from 0 to +/- infinity. The choice of slope is determined by the best fit criterion selected. Typically this is chosen so that the sum of squared differences between the values on the best fit line and the observed data values (the residuals) is minimized (the least squares principle). This approach combines mathematical convenience with a fit that is widely regarded as highly satisfactory (optimal in certain respects), even though it is not the only option and it can be very sensitive to outliers. Using the least squares approach the location of the line and its slope can be determined in an unambiguous manner, with the slope being positive if y increases as x increases, or negative if y decreases as x increases. In general, the line that is the best fit for y given x (or y on x) is different from the line for x given y, i.e. the order of the variables and the dependency relationship matters. In the example above, the equation of the fitted line is actually:

y = 11.756 - 1.217x
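
As a minimal illustration (a Python/NumPy sketch added here, not part of the original text), the least squares estimates for these nine data pairs can be computed directly from the standard formulas b = Sxy/Sxx and a = ȳ - b·x̄:

```python
import numpy as np

# Tannin content (x) and caterpillar growth (y) for the nine data pairs above
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([12, 10, 8, 11, 6, 7, 2, 3, 3], dtype=float)

# Least squares estimates: slope b = Sxy / Sxx, intercept a = ybar - b * xbar
xbar, ybar = x.mean(), y.mean()
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
a = ybar - b * xbar

fitted = a + b * x        # points on the best fit line
residuals = y - fitted    # the vertical distances shown in the figure

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")  # approx. 11.756 and -1.217
```

The same estimates can be obtained with np.polyfit(x, y, 1) or any standard regression routine.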

More generally we can see that the data are being modeled by an equation of the form:

yᵢ = a + bxᵢ + εᵢ

where the final term is the error or residual associated with observation i. If the scattering of points either side of the best fit line is very small, so that the line represents a very good fit, the simple linear model can be regarded as in some sense 'better' than a model for which the scatter is greater and the fit less tight. The coefficient of correlation is a standardized measure of how good the fit is, because it relates the pattern of co-variation of the x and y variables (the covariance) to the average spread (square root of the variance) of the x and y variables taken independently. Thus a correlation coefficient of +1 indicates a perfect positive relationship with no spread around the best fit line, whereas a value of 0.7, say, indicates a reasonably close relationship but with a fair amount of spread either side of the best fit line. In the example above the correlation coefficient is -0.9, which indicates a strong negative correlation, and suggests that approximately 81% of the variance ((-0.9)² = 0.81) can be explained by the best fit line, with the remaining 19% being residual or error variance.

Where there are multiple independent or predictor variables, they may be measured on very different scales. This can make the coefficients difficult or impossible to interpret and/or lead to rounding error problems (e.g. coefficients that are apparently zero or close to zero, but which are nevertheless significant). Standardization of the source variables (subtracting the mean for each variable and dividing by its sample standard deviation) enables the coefficients to be compared and interpreted more readily, and many software packages will cite both un-standardized and standardized coefficients.
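
The following sketch (again Python/NumPy, included purely for illustration) computes the correlation coefficient and its square for the caterpillar data, and shows the kind of standardization (z-scoring) described above; the variable names are illustrative only:

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([12, 10, 8, 11, 6, 7, 2, 3, 3], dtype=float)

# Correlation: covariance of x and y scaled by their two standard deviations
r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(f"r = {r:.3f}, r squared = {r**2:.3f}")  # approx. -0.903 and 0.816

# Standardizing a variable so that coefficients measured on very different
# scales become directly comparable
x_std = (x - x.mean()) / np.std(x, ddof=1)
```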

The relationship between a pair of variables will not necessarily be linear, but this is the simplest such relationship and is often the starting point for statistical modeling. For this reason we start with a closer examination of linear models applied to continuous real-valued datasets, before extending the discussion to more general cases. In broad terms, the family of linear models (linear in the coefficients) embraces a wide range of expressions and is often used when a relationship of some general form is suspected and the objective is to model this relationship as efficiently and effectively as possible. Such models are essentially empirical, in that they depend to a large degree on examination of the sample data available. They may involve a few parameters or many, and they are typically far better at predicting unobserved values within the sample domain than beyond it. This contrasts with problems for which some prior (often non-linear) expression is known, for example as a result of extensive scientific research (e.g. a diffusion equation, a constrained growth equation). In this latter situation the complete form of the equation to be fitted is known, and the objective is to estimate the parameters that apply for the problem at hand. Such problems, whose form is known a priori, tend to be compact (requiring estimation of very few parameters) and often have predictive power that extends beyond the sample domain.
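
As a sketch of this second situation (the data, function and parameter names here are hypothetical, not taken from the source), a constrained growth curve whose functional form is assumed known in advance can be fitted by estimating its few parameters, for example using SciPy's curve_fit routine:

```python
import numpy as np
from scipy.optimize import curve_fit

# Logistic (constrained) growth: form known a priori, only K, r and t0 estimated
def logistic_growth(t, K, r, t0):
    return K / (1.0 + np.exp(-r * (t - t0)))

# Hypothetical observations of a quantity growing towards a ceiling
t = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
obs = np.array([0.4, 0.9, 2.1, 4.0, 6.3, 8.0, 9.0, 9.6, 9.8])

# Non-linear least squares estimation of the three parameters
params, _ = curve_fit(logistic_growth, t, obs, p0=[10.0, 1.0, 4.0])
print("estimated K, r, t0:", params)
```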

Below we provide a table that summarizes some of the more commonly encountered regression and modeling terminology, together with links to the subtopics that describe many of these in more detail:

Selected regression and related statistical modeling terminology

Form of model

Description and Notes

Simple linear regression

A single approximately continuous response (dependent) variable and one or more predictor variables related by an expression that is linear in its coefficients (i.e. not necessarily linear in the predictor variables)

Multiple regression

This term applies when there are multiple predictor variables and all are quantitative

Multivariate regression

Regression involving more than one response variable. If, in addition, there are multiple predictor variables the composite term multivariate multiple regression is used. Multivariate data of this type are often analyzed using more general multivariate tools than regression

Non-linear regression

Regression that is non-linear in the function coefficients. Note that simple polynomial and exponential expressions can be handled via linear regression techniques, directly or following data transformation

Simultaneous Autoregression (SAR)

Simultaneous autoregressive models (SAR is also used as an abbreviation for Spatial Autoregression). A form of regression model including adjustments for spatial autocorrelation. Many variants of SAR model have been devised

Conditional autoregression (CAR)

Conditional autoregressive models — as per SAR, a form of regression model including adjustments for spatial autocorrelation. Differs from SAR in the specification of the inverse covariance matrix. In this model the expected value of the response variable is regarded as being conditional on the recorded values at all other locations

Logistic regression

Logistic regression applies where the response variable is binary, i.e. of type 1/0, Yes/No, Success/Failure, or Infected/Not infected. Samples of a given size provide sample proportions for each category. Typically modeling such data involves use of the logit transform, with linear regression being conducted on the transformed data (a brief sketch is given after this table). Variants on the basic binary model are available for response variables that represent more than two categories, which may or may not be ordered

Poisson regression

Poisson regression applies where the response variable is a count (e.g. crime incidents, cases of a disease) rather than a continuous variable. This model may also be applied to standardized counts or “rates”, such as disease incidence per capita or trees of a given species per square kilometer. It assumes the response variable has a Poisson distribution whose expected value (mean) is dependent on one or more predictor variables. Typically the log of the expected value is assumed to have a linear relationship with the predictor variables. An excess of zeros in many sample datasets may present problems when attempting to apply this form of regression

Ecological regression

The term ecological regression does not relate directly to the subject of ecology, but to the application of regression methods to data that are aggregated to zones (lattices), as is often the case with census datasets and information collected by administrative districts. The related issue of the so-called ecological fallacy concerns the difficulty of making inferences about the nature of individuals within an aggregated region on the basis of statistics (data values, parameters, relationships) that apply to the aggregated data

Robust regression

A family of techniques that apply regression methods to problems in which outliers (in the response variable, predictor variables, or both) are known or are suspected to exist in the sample data

Hedonic regression

The term hedonic regression is used in economics, especially in real estate (property) economics, to estimate demand or prices as a combination of separate components, each of which may be treated as if it had its own market or price. In the context of regression these separate components are often treated as the independent variables in the modeling process

Analysis of variance

Applies if all of the predictors are either qualitative or classified into a (small) number of distinct groups. Analysis of variance methods are often used to analyze the significance of alternative regression models under the Normality assumption for the distribution of errors

Analysis of covariance

Applies if some of the predictors are qualitative and some are quantitative. Analysis of covariance methods are also widely applied in spatial modeling, where the covariance of observations at selected locations is examined
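
As referenced in the logistic regression entry above, the following short sketch (Python/NumPy; the dose-response counts are hypothetical and purely illustrative) shows the logit transform applied to grouped sample proportions, followed by an ordinary least squares fit on the transformed values. In practice a generalized linear model fitted by maximum likelihood would normally be preferred, but the linearizing idea is the same:

```python
import numpy as np

# Hypothetical grouped data: dose level, group size, number of 'successes'
dose = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n    = np.array([50, 50, 50, 50, 50])
k    = np.array([4, 11, 25, 37, 46])

p = k / n                        # sample proportion responding in each group
logit = np.log(p / (1.0 - p))    # logit transform of the proportions

# Ordinary least squares fit of logit(p) on dose (the linearized model)
b, a = np.polyfit(dose, logit, 1)   # slope, then intercept
print(f"logit(p) is approximately {a:.2f} + {b:.2f} * dose")
```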

References

[CRA1] Crawley M J (2007) The R Book. John Wiley & Sons, Chichester, UK

[YUL1] Yule G U, Kendall M G (1950) An Introduction to the Theory of Statistics. Griffin, London, 14th edition (first edition was published in 1911 under the sole authorship of Yule)