As we have seen in the introduction to this topic on regression, the aim of simple linear regression is to fit a 'best' straight line of the form
to a set of n data pairs {xi,yi} that are samples from a population of values that are finite, continuous and realvalued. The fitting process is achieved using the method of (ordinary) least squares (OLS) in order to minimize the sum of squared deviations from the line to the observed data points. In this model the variable x is taken to be the independent variable and y is taken to be the dependent variable. This assumption has implications, not only suggesting a particular pattern of dependency or association, but also in terms of the best fit line produced. In general, if the variables x and y are reversed the best fit line will be different. In neither case does the best fit line minimize the distance of all observed points from the line chosen, only the distance squared in the selected plane (e.g. x or ydirection). The parameters α and β are estimated from the data, where α is the value at which the straight line crosses the yaxis (at x=0), also known as the intercept, and β is the slope of the regression line, indicating the rate of change of y with x.
There is a close relationship between the product moment correlation coefficient and the slope parameter, β. We noted earlier that the product moment correlation coefficient was given by:
where Sxy is the covariance of x and y. The estimates for α and β are given by:
From the expressions for the slope and the correlation coefficient we see that swapping x and y will yield a different pair of parameters. We can also see that:
The derivation of the expression for the slope is provided in the least squares section of this Handbook. In the example provided in the introduction, b=1.217, thus the intercept estimate is a=6.888+1.217*4=11.756. For a fuller discussion of this procedure see any basic text on statistics (for example, Yule and Kendall, [YUL1]).
Linear regression can be carried out whether or not it is statistically meaningful to do so  it is simply a form of curve fitting to data. However, if the linearity assumption is valid, the xi are known, and the yi are Normally distributed with means α+βxi and constant variance, σ2, then the following distributional results apply (where x0 is a fixed value):
Significance tests for both the regression coefficient, r, and the regression slope, b, are often provided for simple linear regression problems under the assumption of Normality. Tests for the significance of the regression slope and evaluation of confidence or prediction intervals use the tdistribution (see further, below).
Example: Hardness and abrasion resistance of synthetic rubber
This example is from Davies (1961, section 7.2, [DAV1]) and relates to tests on synthetic rubber samples. A total of 30 samples were analyzed and their hardness and abrasion resistance measured in a standardized manner. The data recorded is shown below. In this instance the decision on which variable is to be regarded as the dependent variable (y) and which the independent or predictor variable (x) is not entirely clear. However, abrasion loss is much more complicated and expensive to measure than hardness, so the objective of the analysis was to determine whether measuring hardness would provide a model for abrasion loss, thereby avoiding the need to measure the latter in subsequent samples.
Synthetic rubber samples
Specimen 
Abrasion loss, y 
Hardness, x 
1 
372 
45 
2 
206 
55 
3 
175 
61 
4 
154 
66 
5 
136 
71 
6 
112 
71 
7 
55 
81 
8 
45 
86 
9 
221 
53 
10 
166 
60 
11 
164 
64 
12 
113 
68 
13 
82 
79 
14 
32 
81 
15 
228 
56 
16 
196 
68 
17 
128 
75 
18 
97 
83 
19 
64 
88 
20 
249 
59 
21 
219 
71 
22 
186 
80 
23 
155 
82 
24 
114 
89 
25 
341 
51 
26 
340 
59 
27 
283 
65 
28 
267 
74 
29 
215 
81 
30 
148 
86 
The first step in the analysis of the data is to produce a scatterplot of the data pairs, and then to consider the appropriate form of model for the data observed. We might also usefully compute the mean values for the two variables measured (which are 170.4 and 70.3 respectively). In this instance we will use Excel and the Data Analysis addin that provides basic regression analysis facilities to carry out the investigation. Fully worked stepbystep calculations and discussion of this analysis are provided in Davies' book and may be readily replicated in Excel, although the results will differ from the published values due to rounding errors in the original document. Initially we examine the scatterplot of the data, and as can be seen, there appears to be a broad linear trend from top left to bottom right, thus indicating a negative correlation between the variables  abrasion loss falls as hardness increases. Note that in this diagram the scales are linear but the hardness range has been started at 40 because this is below the minimum observed data range and this helps clarify the relationship visually.
Scatterplot: Hardness and abrasion loss
Having determined that a linear model appears reasonable, analysis can proceed using the Excel regression option. However, before doing so we can examine the calculations required as this helps explain the subsequent analysis produced. First we recall the definitions provided above. We noted earlier that the product moment correlation coefficient was given by:
The estimates for α and β are given by:
For the data in question, we find the sums of products and sums of squares components are:
Sxy=22,946.47, Syy=225,011.4, and Sxx=4,299.87.
Thus using the formula provided the correlation coefficient, R, is 0.73771. Likewise we find the regression slope is 5.33655 and the intercept is 550.4151. Using the coefficients that have been fitted, the linear model is therefore of the form:
abrasion loss (y) = 550.42  5.34 x hardness (x)
From this predictor equation we can estimate values for y from x, and then compute the sums of squared errors SSE=(y_predy_obs)2. This residual error sum of squares is: 102,556.3 which is much less than the total sum of squares, TSS=Syy=225,011.4. Thus the model (regression) sum of squares, MSS=TSSSSE=122,455. The model therefore explains MSS/TSS% of the total variation, i.e. around 54% in this example, and MSS/TSS is known as the coefficient of determination. But we observe that this figure is R2, thus R2=(1SSE/TSS). This provides a more general way of defining the correlation coefficient, R, i.e. as the square root of the explained variation, with the sign (+/) of the coefficient obtained by examining the sign of the slope. More particularly, with multiple predictor variables the "Multiple R" coefficient is defined using the square root of the coefficient of determination and thus has a range [0,1], and the sign of the correlation with respect to each predictor variable is defined by examining the sign of the slope for that variable.
In this instance we have chosen to use Excel's automatic regression procedure rather than performing all the computations manually or in the spreadsheet. This enables us to include a variety of different outputs from the analysis in order to highlight different aspects of the data. The first table below shows the core regression statistics, starting with the correlation coefficient (described by Excel here as "Multiple R") and its squared value (the coefficient of determination) which identifies the proportion of the overall variation explained by the regression (54%)  as expected. The correlation is quite strong, but the explained variation is perhaps not as high as one might have hoped if this were to be used for predictive modeling. The adjusted R2 value (which is always smaller in absolute size than R2) takes into account the degrees of freedom associated with the model, which in this instance is 1 for the regression model and 28 (n2) for the residuals (see further, below). The formula applied in the case of SPSS is:
where p is the number of variables in the model (1 in this instance), C in the number of cases and p* is the number of coefficients in the model. There are pros and cons of using the adjusted version, but the general view is that it is an improved (closer to unbiased) measure for sample data. The standard error figure in the table is obtained from the sum of squared prediction errors divided by the degrees of freedom, which is the sample size minus 2 for the two parameters that have been estimated, thus we have: SSE/(n2)=3662.726 and the square root of this figure is the standard error, 60.52. This figure is used below when we compute confidence intervals for the regression line and for the data items.
Regression Statistics 

Multiple R 
0.74 
R2 
0.54 
Adjusted R2 
0.53 
Standard Error, s 
60.52 
Observations, n 
30 
The next two tables provide an analysis of this variance explanation, by identifying the component of the sums of squares (SS) due to the regression (that we have called MSS above) and that remaining, i.e. the residual variation (SSE above), together with the degrees of freedom associated with each. This enables the mean squares (MS) to be computed (as SS/df) and the Ftest ratio to be produced. If the underlying assumptions for the model are acceptable, then the observed Fvalue is highly significant (very low probability of arising by chance). Likewise, ttests on the intercept and slope values (shown here as t stat) are also highly significant, and using the tdistribution upper and lower confidence intervals for these parameters have been computed, again on the basis that the underlying model assumptions are valid.
ANOVA

df 
SS 
MS 
F 
Significance F 
Regression 
1 
122,455.04 
122,455.04 
33.43 
<0.001 
Residual 
28 
102,556.33 
3,662.73 


Total 
29 
225,011.37 




Coefficients 
Standard Error 
t Stat 
Pvalue 
Lower 95% 
Upper 95% 
Lower 95.0% 
Upper 95.0% 
Intercept 
550.42 
65.79 
8.37 
0.00 
415.66 
685.17 
415.66 
685.17 
Hardness 
5.34 
0.92 
5.78 
0.00 
7.23 
3.45 
7.23 
3.45 
As noted earlier, we can the coefficients to define the linear model, which is of the form:
abrasion loss = 550.42  5.34 x hardness
Predicted values obtained from this model can be plotted alongside the original dataset. We also noted above that on the assumption of Normality we have:
that is, the regression is approximately Normally distributed with mean equal to the regression line equation and variance that can be estimated from the expression:
Note that the figure 3663 is the (rounded) residual mean sum of squares, and is the squared standard error reported in the initial summary table above. The figure 4300 is simply the sum of squared deviations of the xvalues from their mean, which we have calculated separately in the spreadsheet. This expression can then be used to generate 95% confidence intervals for the regression line, in this instance using the Normal distribution (95% zvalue is ±1.96 times the expression above) or the tdistribution (which would be ±2.048 times based on n2=28df, thus yielding very slightly wider confidence intervals). The results are plotted below. Note that the main confidence intervals (red and blue lines) are for the regression line, not the data, and are narrowest at the mean value of the two variables. Effectively this is highlighting the fact that the line can be rotated around the mean and this will alter its slope and intercept values, whilst remaining within the colored envelope shown in the diagram. The confidence limits for the data are shown with green dotted lines and are simply the regression line value +/ 1.96 time the standard error, s.
Best fit regression line of abrasion on hardness, with associated 95% confidence intervals
Generalizing simple linear regression
The simple linear regression described above can be readily generalized, but in doing so it is convenient to use matrix algebra extensively, so readers should familiarize themselves with basic matrix operations before continuing. A typical linear relationship involving more than one independent or explanatory variable might be of the form:
where β is a column vector of p parameters to be determined and x is a row vector of independent variables with a 1 in the first column (this enables the intercept parameter(s) to be included). It should be noted that the expression shown below is also described as linear since it remains linear in the coefficients, β, even if it is not linear in the predictor variables, x, for example:
In the above example there is more than one predictor or independent variable, and in this case the model is described as multiple linear regression. The functional form in such expressions can be quite general. Common functional forms are simple polynomials with a single predictor variable (polynomial regression), functions of two or more predictor variables (which are described as interaction terms), and trigonometric functions (e.g. Fourier series).
If y={yi} is a set of n observations or measurements of the response variable, with corresponding recording of matching values for the set of independent variables, then a series of linear equations of the type shown above can be constructed. However, typically the number of observations will be greater than the number of coefficients (the system is said to be overdetermined). One approach to this problem is to seek a best fit solution, where best fit is taken to mean a solution for the vector of parameters β that minimizes the difference between the fitted model and observed values at these data points. In practice the procedure most widely applied seeks to minimize the sum of squared differences, hence the term least squares. Each linear expression requires an additional "error" component to allow for the degree of fit of the expression to the particular data values. If we denote these error components by εi, the general form of the expression above now becomes:
where X is now a matrix with 1’s in column 1, and ε is a vector of errors that in conceptual terms is assumed to represent the effects of unobserved variables and measurement errors in the observations. The matrix X has the general form:
where f1(xi) to fp(xi) are a set of p functions applied to one or more predictor variables, x, as recorded for each of n observations. As noted above, where there are multiple predictor variables the regression is described as multiple linear regression. Typically f1(x)=1 which yields a constant intercept term, and fk(x) is a relatively simple multidimensional function of the predictor variables, for example linear, quadratic or pairwise interaction terms.
In general we assume that the expected values of the errors in the expression are zero, E(ε)=0, and the variance, E(εTε)=σ2I, is constant (homoscedastic). Here 0 is a column vector of zeros and I is the identity matrix. This set of n equations in p unknowns (n>p) is typically solved for the vector β by seeking to minimize the sum of the squared error terms, εTε (Ordinary Least Squares, or OLS). The solution for the coefficients using this approach is obtained from the matrix expression:
and
where the hat (^) symbol denotes concern with sample estimates of population parameters. The sample estimates of the population error (ε) terms are described as ‘residuals’. The variance for such models is usually estimated from the residuals of the fitted model, using the expression:
Prediction intervals for individual values
A linear regression model provides a form of prediction, an estimation of values for the dependent variable, y. Each observed value, yi, will deviate by some amount (greater than or equal to 0) from the predicted value, but this does not provide an estimate of the range of values one might expect for any given predicted mean value of y for a given x. The range of values, or prediction interval, for an ordinary least squares model (assuming Normality of errors) can be obtained from the tdistribution with a probability of 100(1α)% and n2 df, using the formula:
Multiple linear regression is the term used to describe a linear regression model in which there are multiple predictor variables. In this case, as we have seen above, the design matrix is of the form:
The simplest model is one which is a pure linear additive model  one for which f1(x)=1 and fk+1(x)=xk. In order to illustrate this type of model we will use example data provided with MATLab, shown below. This is a set of 20 measurements of the dependent variable, y, based on 5 predictor variables, x1 to x5.
Data input table
x1 
x2 
x3 
x4 
x5 
y 
1125 
232 
7160 
85.9 
8905 
1.5563 
920 
268 
8804 
86.5 
7388 
0.8976 
835 
271 
8108 
85.2 
5348 
0.7482 
1000 
237 
6370 
83.8 
8056 
0.716 
1150 
192 
6441 
82.1 
6960 
0.313 
990 
202 
5154 
79.2 
5690 
0.3617 
840 
184 
5896 
81.2 
6932 
0.1139 
650 
200 
5336 
80.6 
5400 
0.1139 
640 
180 
5041 
78.4 
3177 
0.2218 
583 
165 
5012 
79.3 
4461 
0.1549 
570 
151 
4825 
78.7 
3901 
0 
570 
171 
4391 
78 
5002 
0 
510 
243 
4320 
72.3 
4665 
0.0969 
555 
147 
3709 
74.9 
4642 
0.2218 
460 
286 
3969 
74.4 
4840 
0.3979 
275 
198 
3558 
72.5 
4479 
0.1549 
510 
196 
4361 
57.7 
4200 
0.2218 
165 
210 
3301 
71.8 
3410 
0.3979 
244 
327 
2964 
72.5 
3360 
0.5229 
79 
334 
2777 
71.9 
2599 
0.0458 
As a first step in data analysis and modeling a graphical representation of the data often helps. With multiple regression (and multivariate regression) plotting the data is clearly difficult. One option is to generate a set of pairwise plots, for example using the R pairs() function. A pairs() plot for this example dataset is shown below and as can be seen the plots of most of the predictor variables against the dependent variable y are roughly linear or fairly scattered.
The fitted model we are considering is of the form:
To solve the regression of y on x, we need to generate the design matrix X, which is simply the source data for X augmented by an initial column of 1s:
Design matrix X
x0 
x1 
x2 
x3 
x4 
x5 
1 
1125 
232 
7160 
85.9 
8905 
1 
920 
268 
8804 
86.5 
7388 
1 
835 
271 
8108 
85.2 
5348 
1 
1000 
237 
6370 
83.8 
8056 
1 
1150 
192 
6441 
82.1 
6960 
1 
990 
202 
5154 
79.2 
5690 
1 
840 
184 
5896 
81.2 
6932 
1 
650 
200 
5336 
80.6 
5400 
1 
640 
180 
5041 
78.4 
3177 
1 
583 
165 
5012 
79.3 
4461 
1 
570 
151 
4825 
78.7 
3901 
1 
570 
171 
4391 
78 
5002 
1 
510 
243 
4320 
72.3 
4665 
1 
555 
147 
3709 
74.9 
4642 
1 
460 
286 
3969 
74.4 
4840 
1 
275 
198 
3558 
72.5 
4479 
1 
510 
196 
4361 
57.7 
4200 
1 
165 
210 
3301 
71.8 
3410 
1 
244 
327 
2964 
72.5 
3360 
1 
79 
334 
2777 
71.9 
2599 
Finally we apply the MATLab regress() function to obtain the unknown (estimated) parameters using least squares, which gives the result:
Notice that the predictor variable (x1) is not included as its parameter estimate was approximately zero, and the third and fifth variables have parameters close to zero, so at first glance one might think that their contributions are negligible. However, they relate to variables that have large absolute values, so dismissing these without further investigation would be unwise, as we show below. The coefficient of determination (R2) for this model is 81%, and the Fstatistic (the ratio of the mean sum of squares due to the regression over the mean sum of squares due to the residuals) is 11.99, which is highly significant, both suggesting that we have a good model in terms of fit, but perhaps more complicated than necessary.
With models that have potentially a large number of explanatory variables, and unknown interaction effects, it can be helpful to use a computationally intensive technique known as a tree model, to identify key predictors and interactions. In R this facility is implemented as the package tree and the function tree(), in MATLab as classregtree, but different software packages support a range of functions of this type, and some referred to as tree functions are provided primarily for classification purposes. In each case the algorithm applied should be examined before their use. The R package uses a binary division procedure, selecting the explanatory or predictor variable whose partitioning maximally divides the response variable.
In the current example the predictor x3 is identified as the major predictor, splitting the response variable at a value x3=6133. Higher values of the response variable are not split additionally, indicating that there are no substantive interaction effects at this end of the scale, but lower values do appear to be affected by interactions to a limited extent. For example there may be a effect involving an x3x4 interaction, so including the interaction of these two predictors in the original model may result in an improved final model. For a full discussion of the operation and application of tree models, see Crawley, (2007, ch.21 [CRA1]).
It is common to tabulate or plot the residuals for all cases, as a means of identifying possible outliers, as illustrated below.
The confidence interval for the red residual (case 1) does not span the zero position, which is why it is highlighted as a possible outlier. However, at the 5% level one would expect 1 in 20 cases to appear as possible outliers, so this may simply be the result of randomness in the data  but still worth checking in case there is something unusual regarding this case (e.g. sample taken at a different time, location or under different circumstances/by a different person). If an outlier is suspected as a genuine cause for concern, the record could be removed or a form of robust regression employed. The latter involves assigning weights to each of the data records (cases) using the iteratively reweighted least squares (IRLS) technique, and is thus computationally intensive. When applied to this example dataset, using one of many possible weighting schemes (the MATLab default, bisquare), the first case is assigned a weight of 0.025 whilst all other records are assigned weights between 0.67 and 0.99. Robust regression weighting schemes for univariate and multivariate regression are described in most software packages, and in specialized statistics texts such as Huber (1981, [HUB1]). Different schemes may be applied where the suspected outlier(s) are observed in the response variable (the standard situation), the predictor variables, or both response and predictor variables.
If the above dataset is analyzed in an integrated software environment, such as SPSS, a set of tabular and graphical outputs is available without the need to create the appropriate scripts (unlike R or MATLab). The two tables below illustrate this using SPSS and the same data as above, initially showing the coefficients and their estimated significance under the usual set of assumptions regarding such linear models, followed by the ANOVA table for this data. This suggests that most of the predictor variables have a nonsignificant impact, with the majority of the explanatory power coming from the constant term (the intercept) and variables 5.
Coefficients 


Model 
Unstandardized Coefficients 
Standardized Coefficients 
t 
Sig. 

B 
Std. Error 
Beta 

1 
(Constant) 
2.156 
.913 

2.360 
.033 
x1 
9.012E6 
.001 
.005 
.017 
.986 

x2 
.001 
.001 
.138 
1.041 
.315 

x3 
.000 
.000 
.408 
1.662 
.119 

x4 
.008 
.014 
.102 
.564 
.582 

x5 
.000 
.000 
.467 
1.921 
.075 
ANOVA 


Model 
Sum of Squares 
df 
Mean Square 
F 
Sig. 

1 
Regression 
4.108 
5 
.822 
11.989 
.000a 
Residual 
.960 
14 
.069 



Total 
5.068 
19 




Predictors: (Constant), x5, x2, x4, x3, x1 Dependent Variable: y 
If this regression analysis is rerun using just variable 5 we find that the constant term and coefficient alter somewhat but as expected do produce a model that still accounts for 70% of the variance. However, this is a somewhat ad hoc approach to variable inclusion or exclusion. More systematic approaches are widely available, most commonly using a procedure known as stepwise regression. This procedure involves systematically including or excluding variables in the model based on their contribution to the overall Fstatistic. In the above example, a stepwise regression using SPSS produces the result shown below, i.e. it identifies variables 3 and 5 together with the constant as providing the best model, achieving an R2 for this model of 79%:
Model Summary 


Model 
R 
R Square 
Adjusted R Square 
Std. Error of the Estimate 
Change Statistics 
Durbin Watson 

R Square Change 
F Change 
df1 
df2 
Sig. F Change 

1 
.835 
.698 
.681 
.2918258 
.698 
41.509 
1 
18 
.000 

2 
.887 
.787 
.762 
.2518854 
.090 
7.161 
1 
17 
.016 
1.731 
1. Predictors: (Constant), x3 

2. Predictors: (Constant), x3, x5 
A similar analysis including a (scaled) x3x4 interaction produces a result that explains slightly more of the overall variance than model 2, with just the constant, interaction and x5 variables included. (note that the DurbinWatson statistic cited as standard by SPSS is a measure of possible autocorrelation amongst the residuals, although it is not particularly meaningful in many instances. Its value ranges from [1,4], with 2 being no autocorrelation).
References
[CRA1] Crawley M J (2007) The R Book. J Wiley & Son, Chichester, UK
[DAV1] Davies O L ed. (1961) Statistical Methods in Research and Production. 3rd ed., Oliver & Boyd, London
[HUB1] Huber P J (1981) Robust Statistics. John Wiley & Sons, New York
[YUL1] Yule G U, Kendall M G (1950) An Introduction to the Theory of Statistics. Griffin, London, 14th edition (first edition was published in 1911 under the sole authorship of Yule)