The general ideas behind least squares originated some 200 years ago with the work of Gauss and Legendre. Initially the method was applied to problems in astronomy, in particular fitting a mathematical curve to partial observational data in order to predict the positions of celestial objects. In this section we commence with a discussion of ordinary least squares (OLS) as applied to linear modeling, then extend the discussion to OLS applied to non-linear modeling, before considering generalizations of the procedure (iteratively re-weighted least squares and generalized least squares). In basic statistics ordinary least squares is the form most often encountered and discussed in textbooks, and has the merits of simplicity and a closed-form solution. Other least squares methods apply when the assumptions of OLS are not met, for example when the observations include outliers or exhibit spatial or temporal autocorrelation.
In the introduction to this topic we used the example of a simple linear model of the form:

yᵢ = α + βxᵢ + εᵢ,  i = 1, 2, … n
in which the final term is the error or residual associated with observation i. The procedure known as the least squares method seeks to minimize the sum of squared errors (residuals) in expressions of this type. Initially we perform the analysis using standard algebraic techniques, and then we extend this using a matrix algebra formulation, which is better suited to such problems as their complexity increases. Note that in this procedure it is assumed that: (a) the observations are random, independent samples; (b) the expected value of the errors (residuals) after model fitting is 0; (c) there is no autocorrelation in the data (a risky assumption in temporal or spatial datasets); (d) the independent (predictor) variables, if there is more than one, are linearly independent (they do not exhibit co-linearity); (e) a linear model is appropriate; and (f) the data being modeled are continuous and real-valued. The assumption that the errors are Normally distributed is not a prerequisite for ordinary least squares.
Re-arranging the above equation and taking summations of squared values, we are seeking to minimize an expression of the form:

Σεᵢ² = Σ(yᵢ - α - βxᵢ)²
in order to provide estimates, a and b, for the unknown intercept and slope parameters, α and β. Assuming the variables are continuous and differentiable, we can use simple differential calculus to obtain the solution. First we differentiate the left hand side of the equation with respect to β, and then equate this to 0 in order to identify the minimum for the slope parameter:

-2Σxᵢ(yᵢ - α - βxᵢ) = 0
Multiplying through and canceling the -2 we then have (writing a and b for the values of α and β at the minimum):

Σxᵢyᵢ = aΣxᵢ + bΣxᵢ²
This is essentially a single equation in two unknowns, but it can be solved by noting that the best fit line always passes through the mean of the x- and y-values, hence we have the additional information that:

ȳ = a + bx̄,  i.e.  a = ȳ - bx̄
Substituting for a into the previous expression we have:

Σxᵢyᵢ = (ȳ - bx̄)Σxᵢ + bΣxᵢ²
and simplifying, this gives:

b = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
which determines the value for the slope, b, as an estimate of the population slope, β, and can be seen to be of the form of a covariance/variance ratio. The estimate for α then follows using the mean values by substitution.
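The closed-form estimates above are straightforward to compute directly. The following sketch (plain Python; the data values are illustrative choices of our own, not taken from the text) implements the covariance/variance ratio for b and the substitution through the point of means for a:

```python
def ols_fit(x, y):
    """Ordinary least squares for the model y = a + b*x.

    b is the covariance/variance ratio derived above;
    a follows because the fitted line passes through
    the point of means (xbar, ybar).
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx          # estimated slope
    a = ybar - b * xbar    # estimated intercept
    return a, b

# Illustrative data lying exactly on y = 1 + 2x
a, b = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
```

With data that lie exactly on a line, the procedure recovers the line's intercept and slope exactly (up to floating-point rounding).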
We now re-write the above in matrix notation, extending the result to a more complex form of linear equation with p parameters. The expression:

yᵢ = α + βxᵢ + εᵢ
can be seen as a particular case of the more general form:

y = Xβ + ε
In this equation y is a vector of observed values, X is an (n×p) matrix whose first column consists entirely of the value 1, and β is a (p×1) vector of parameters to be estimated, where β₀≡α. Typically the matrix X will consist of randomly sampled observations (a random design), but could equally well be obtained using a set of pre-defined values of the independent variable (as in the caterpillar growth example provided earlier), in which case the matrix X is often referred to as a fixed design. The least squares method seeks to minimize the sum of the squared residuals, and because this requires vector and matrix multiplication, the transpose operation, T, is required:

εᵀε = (y - Xβ)ᵀ(y - Xβ) = yᵀy - yᵀXβ - βᵀXᵀy + βᵀXᵀXβ
Differentiating this matrix expression and equating the result to 0 (the zero vector) we have:

∂/∂β [yᵀy - yᵀXβ - βᵀXᵀy + βᵀXᵀXβ] = 0
The first term, which does not include β, disappears when differentiated; the two middle terms each lose one power of β and, since each is a scalar equal to the transpose of the other, can be replaced by a single expression multiplied by 2; whilst the final term is quadratic in β, so it differentiates to 2XᵀXβ, giving:

-2Xᵀy + 2XᵀXβ̂ = 0,  hence  β̂ = (XᵀX)⁻¹Xᵀy
where again we have used the ^ symbol to indicate that the result is an estimate of the population parameter vector β. This estimate is unbiased, as can be confirmed by taking the expected value, E(), of the right hand side and replacing y using

y = Xβ + ε
and noting that E(ε|X)=0 (a central assumption of the model). Note that in this least squares method no assumptions have been made regarding the distribution of the errors. However, it can be shown that if the errors are also assumed to be Normally distributed, then the same parameter estimates can be derived using maximum likelihood methods, and hence a linear regression for which the Normality assumption also holds is asymptotically efficient.
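As a concrete illustration of the matrix solution β̂ = (XᵀX)⁻¹Xᵀy, the sketch below (plain Python with no linear algebra library; the data are invented for illustration) forms the normal equations (XᵀX)β = Xᵀy and solves them by Gaussian elimination. In practice a numerical library routine (e.g. one based on a QR decomposition) would be used rather than explicit elimination or inversion:

```python
def solve(A, b):
    """Solve the p x p linear system A beta = b by Gaussian
    elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (M[i][n] - s) / M[i][i]
    return beta

def ols_beta(X, y):
    """beta-hat obtained from the normal equations (X'X) beta = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)]
           for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# First column of X is all 1s, so beta[0] estimates the intercept
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
beta = ols_beta(X, [1, 3, 5, 7])   # data on y = 1 + 2x
```

Because the first column of X is all 1s, the first element of the solution vector plays the role of the intercept α, matching the simple linear case.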
The results above also apply if the model equation is more complex, perhaps involving higher order powers of an independent variable and/or multiple independent (predictor) variables. Expressions such as:

y = β₀ + β₁x + β₂x² + ε  or  y = β₀ + β₁x₁ + β₂x₂ + ε
remain linear in the coefficients even though they are not linear in the predictor variables. Thus multiple regression and polynomial regression can be addressed using the standard least squares method.
It is immediately apparent from the preceding discussion that inversion of the matrix XᵀX (formed from the design matrix) is an essential part of this process. With large datasets (e.g. 10,000 or 100,000 data elements) this computation may be infeasible or too lengthy to be considered, and thus alternative approaches may be needed (e.g. random sampling from the dataset to obtain a usable subset). Likewise, since inversion is a requirement of this procedure, if XᵀX is singular or near-singular it will not be possible to obtain a satisfactory solution. This may occur for a number of reasons, for example if there is a high degree of co-linearity amongst the predictor variables (they are highly correlated in a linear manner). One solution to this problem is to use a technique known as ridge regression, discussed below.
The standard least squares procedure does not apply for models which are non-linear, i.e. models for which the partial differentials remain as functions of the parameters and independent variables, so do not drop out in a convenient manner when equated to zero. For example, a very common three parameter exponential model of the form:

y = α + βe^(γx) + ε
is non-linear, as can be seen if one differentiates the expression for the squared residuals:

S = Σ(yᵢ - α - βe^(γxᵢ))²
For example, evaluating the partial differential with respect to the third parameter, γ, we have:

∂S/∂γ = -2Σ βxᵢe^(γxᵢ)(yᵢ - α - βe^(γxᵢ))
More generally, we have

yᵢ = f(xᵢ; β₁, β₂, … βₚ) + εᵢ
i.e. the dependent variable, y, is a (differentiable) function of a number of independent variables and parameters, and thus we seek to minimize

S = Σ(yᵢ - f(xᵢ; β))²
Minimizing the sum of squared residuals using differential calculus therefore equates to evaluating:

∂S/∂βⱼ = -2Σ(yᵢ - f(xᵢ; β)) ∂f/∂βⱼ = 0,  j = 1, 2, … p
In general, minimization of such functions requires numerical procedures, often based on standard techniques known as gradient methods (e.g. Newton-Raphson, conjugate gradient and related procedures). The choice of optimization procedure, and the manner in which it searches the solution space, is a complex matter: the success of these methods depends on the complexity of the function to be fitted, the dataset supplied, the initialization approach adopted, the quality of the optimization sought (proximity of estimated parameter values to their global optimal values) and the processor time available. For many problems optimization by numerical methods is fast and effective, whereas in other instances it is extremely difficult and it may be impossible to determine the quality of the approximation.
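To make the idea concrete, the sketch below fits a deliberately simplified one-parameter exponential model, y = e^(bx) (not the three-parameter model above), by plain gradient descent on the sum of squared residuals. The data, starting value, step size and iteration count are all illustrative choices:

```python
import math

def fit_exp(xs, ys, b0=0.0, lr=0.005, iters=5000):
    """Minimize S(b) = sum (y_i - exp(b*x_i))^2 by gradient descent.

    The gradient is dS/db = -2 * sum x_i * exp(b*x_i) * (y_i - exp(b*x_i)),
    which still depends on b, so there is no closed-form solution and we
    iterate downhill instead.
    """
    b = b0
    for _ in range(iters):
        grad = -2 * sum(x * math.exp(b * x) * (y - math.exp(b * x))
                        for x, y in zip(xs, ys))
        b -= lr * grad
    return b

# Illustrative noise-free data generated from y = exp(0.5 x)
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.5 * x) for x in xs]
b = fit_exp(xs, ys)
```

With noise-free data the descent recovers b ≈ 0.5; on real data the result depends on the starting value and step size, which is exactly the sensitivity to initialization and search strategy described above.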
In the OLS matrix method above, the sum of squared residuals to be minimized was shown to be of the form:

S = εᵀε = (y - Xβ)ᵀ(y - Xβ)
and it was assumed that the variance of the errors was constant, i.e. that E(εεᵀ) = M = σ²I (the homoscedasticity assumption). This can be included in the expression to be minimized, as follows:

S = (y - Xβ)ᵀM⁻¹(y - Xβ)
but drops out in the OLS case, as can be seen (M⁻¹ = I/σ² is then a constant scalar factor). However, if the errors are not homoscedastic the matrix M will have non-constant entries on the diagonal and hence will not drop out during the minimization process. Likewise, if the observations exhibit correlation (typically autocorrelation) the off-diagonal elements of M will also be non-zero. In linear models it is still possible to obtain a closed form estimate of the parameters in this case, and following the same procedure for minimization as in OLS, the resulting estimates are obtained from the expression:

β̂ = (XᵀM⁻¹X)⁻¹XᵀM⁻¹y
Clearly this result requires that both M and XᵀM⁻¹X are invertible matrices. If M is diagonal (there are no correlations between the observations) its inverse is simply the matrix of reciprocals of the diagonal entries, and assuming these are all non-zero (and not extremely close to 0) then M⁻¹ will be well behaved. If the estimate above is re-written as:

β̂ = (XᵀWX)⁻¹XᵀWy,  where W = M⁻¹
where W is a weights matrix with the weights used being proportional to the inverse of the variances, the procedure is described as weighted least squares (WLS). A version of WLS that incorporates iterative updating of the parameter estimates is described below.
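For the simple linear model the weighted fit has a closed form, obtained by minimizing Σ wᵢ(yᵢ - a - bxᵢ)². The sketch below (plain Python; the data and weights are illustrative choices) solves the resulting 2×2 weighted normal equations:

```python
def wls_fit(x, y, w):
    """Weighted least squares for y = a + b*x, minimizing
    sum w_i * (y_i - a - b*x_i)^2, via the 2x2 weighted
    normal equations solved in closed form."""
    sw   = sum(w)
    swx  = sum(wi * xi for wi, xi in zip(w, x))
    swy  = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (swxy - swx * swy / sw) / (swxx - swx * swx / sw)
    a = (swy - b * swx) / sw   # weighted analogue of a = ybar - b*xbar
    return a, b

# With data lying exactly on y = 1 + 2x, any positive weights
# recover the same line
a, b = wls_fit([0, 1, 2, 3], [1, 3, 5, 7], [1.0, 0.5, 2.0, 0.1])
```

When the data fit a line exactly, the choice of (positive) weights makes no difference; with noisy data, observations with larger weights (smaller assumed variances) pull the fitted line towards themselves.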
IRLS is a numerical procedure used primarily (but not exclusively) in robust regression, i.e. to minimize the impact of outliers. It incorporates weights into the expression to be minimized, and these are updated at each iteration of the process in accordance with well defined rules. In broad terms the procedure commences with every observation assigned a weight of 1, so the first pass is a conventional least squares fit. The data item(s) that lie furthest from the values predicted by this least squares model are then assigned weights less than 1, and the process is repeated until it meets some pre-determined convergence criterion. This may result in some data items having very low weights (close to 0) whilst many others retain weights close to 1, thereby minimizing the effects of suspected outliers.
Although it uses elements of standard least squares procedures, the weightings are often based on the more general Lp norm, with p=1 or higher. In the case of p=1, the weights are defined so as to minimize the sum of absolute deviations rather than squared deviations. In this case, instead of the fixed solution for the parameters of a linear regression of the form derived above:

β̂ = (XᵀX)⁻¹Xᵀy
we have an iterative model:

β̂^(i+1) = (XᵀW^(i)X)⁻¹XᵀW^(i)y
i.e. the parameter estimate at iteration i+1 is determined using a weighted version of the standard least squares solution, in which the weighting matrix, W^(i), is a diagonal matrix whose elements are the absolute residuals from iteration i raised to the power (p-2), i.e. wⱼ = |eⱼ|^(p-2). If p=2 these weights are all 1, so have no effect. If p=1 the weights are the reciprocals of the absolute values of the residuals (in practice a small floor is applied to |eⱼ| to avoid division by zero).
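A minimal sketch of IRLS for the p=1 (least absolute deviations) case, for the simple linear model. The data, the residual floor delta and the iteration count are illustrative choices; the weighted fit is the closed-form solution of the weighted normal equations:

```python
def wls_fit(x, y, w):
    """Closed-form weighted least squares for y = a + b*x."""
    sw   = sum(w)
    swx  = sum(wi * xi for wi, xi in zip(w, x))
    swy  = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    b = (swxy - swx * swy / sw) / (swxx - swx * swx / sw)
    a = (swy - b * swx) / sw
    return a, b

def irls_l1(x, y, iters=200, delta=1e-6):
    """Iteratively re-weighted least squares for p = 1.

    Starts from unit weights (a plain OLS fit), then re-weights
    each observation by 1/|residual| -- the |e|^(p-2) rule with
    p = 1 -- flooring |residual| at delta to avoid division by zero.
    """
    w = [1.0] * len(x)
    for _ in range(iters):
        a, b = wls_fit(x, y, w)
        w = [1.0 / max(abs(yi - a - b * xi), delta)
             for xi, yi in zip(x, y)]
    return a, b

# Four points on y = 1 + 2x plus one gross outlier at x = 4;
# the re-weighting progressively discounts the outlier
a, b = irls_l1([0, 1, 2, 3, 4], [1, 3, 5, 7, 100])
```

An ordinary least squares fit to these data is pulled strongly towards the outlier (slope around 20), whereas the iterated re-weighting drives the outlier's weight towards 0 and recovers a line close to y = 1 + 2x.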