Simple and multiple linear regression


As we have seen in the introduction to this topic on regression, the aim of simple linear regression is to fit a 'best' straight line of the form

y = α + βx

to a set of n data pairs {xi,yi} that are samples from a population of values that are finite, continuous and real-valued. The fitting process uses the method of (ordinary) least squares (OLS), which minimizes the sum of squared deviations of the observed data points from the fitted line. In this model the variable x is taken to be the independent variable and y the dependent variable. This assumption has implications, not only in suggesting a particular pattern of dependency or association, but also for the best fit line produced: in general, if the roles of x and y are reversed the best fit line will be different. In neither case does the best fit line minimize the overall distance of the observed points from the line; it minimizes only the squared deviations measured in the selected direction (e.g. the x- or y-direction). The parameters α and β are estimated from the data, where α is the value at which the straight line crosses the y-axis (at x=0), also known as the intercept, and β is the slope of the regression line, indicating the rate of change of y with x.

There is a close relationship between the product moment correlation coefficient and the slope parameter, β. We noted earlier that the product moment correlation coefficient was given by:

r = Sxy / √(Sxx·Syy)

where Sxy is the sum of cross-products of the deviations of x and y from their means (proportional to their covariance), and Sxx and Syy are the corresponding sums of squared deviations. The estimates for α and β are given by:

b = Sxy / Sxx   and   a = ȳ − b·x̄

where x̄ and ȳ are the sample means of x and y. From the expressions for the slope and the correlation coefficient we see that swapping x and y will yield a different pair of parameters. We can also see that:

b = r·√(Syy/Sxx), or equivalently r = b·√(Sxx/Syy)

The derivation of the expression for the slope is provided in the least squares section of this Handbook. In the example provided in the introduction, b=-1.217, thus the intercept estimate is a=6.888+1.217*4=11.756. For a fuller discussion of this procedure see any basic text on statistics (for example, Yule and Kendall, [YUL1]).

Linear regression can be carried out whether or not it is statistically meaningful to do so - it is simply a form of curve fitting to data. However, if the linearity assumption is valid, the xi are known, and the yi are Normally distributed with means α+βxi and constant variance, σ², then the following distributional results apply (where x0 is a fixed value):

b ~ N(β, σ²/Sxx)   and   a + b·x0 ~ N(α + β·x0, σ²[1/n + (x0 − x̄)²/Sxx])

Significance tests for both the correlation coefficient, r, and the regression slope, b, are often provided for simple linear regression problems under the assumption of Normality. Tests for the significance of the regression slope and evaluation of confidence or prediction intervals use the t-distribution (see further, below).

Example: Hardness and abrasion resistance of synthetic rubber

This example is from Davies (1961, section 7.2, [DAV1]) and relates to tests on synthetic rubber samples. A total of 30 samples were analyzed and their hardness and abrasion resistance measured in a standardized manner. The data recorded is shown below. In this instance the decision on which variable is to be regarded as the dependent variable (y) and which the independent or predictor variable (x) is not entirely clear. However, abrasion loss is much more complicated and expensive to measure than hardness, so the objective of the analysis was to determine whether measuring hardness would provide a model for abrasion loss, thereby avoiding the need to measure the latter in subsequent samples.

Synthetic rubber samples

Specimen   Abrasion loss, y   Hardness, x
1          372                45
2          206                55
3          175                61
4          154                66
5          136                71
6          112                71
7          55                 81
8          45                 86
9          221                53
10         166                60
11         164                64
12         113                68
13         82                 79
14         32                 81
15         228                56
16         196                68
17         128                75
18         97                 83
19         64                 88
20         249                59
21         219                71
22         186                80
23         155                82
24         114                89
25         341                51
26         340                59
27         283                65
28         267                74
29         215                81
30         148                86

The first step in the analysis of the data is to produce a scatterplot of the data pairs, and then to consider the appropriate form of model for the data observed. We might also usefully compute the mean values for the two variables measured (175.4 for abrasion loss and 70.3 for hardness). In this instance we will use Excel and the Data Analysis add-in, which provides basic regression analysis facilities, to carry out the investigation. Fully worked step-by-step calculations and discussion of this analysis are provided in Davies' book and may be readily replicated in Excel, although the results will differ slightly from the published values due to rounding in the original document. Initially we examine the scatterplot of the data. As can be seen, there is a broad linear trend from top left to bottom right, indicating a negative correlation between the variables - abrasion loss falls as hardness increases. Note that in this diagram the scales are linear but the hardness axis starts at 40, just below the minimum observed value, which helps to clarify the relationship visually.

Scatterplot: Hardness and abrasion loss

Having determined that a linear model appears reasonable, analysis can proceed using the Excel regression option. However, before doing so we can examine the calculations required, as this helps explain the subsequent analysis produced. First we recall the definitions provided above. We noted earlier that the product moment correlation coefficient was given by:

r = Sxy / √(Sxx·Syy)

The estimates for α and β are given by:

b = Sxy / Sxx   and   a = ȳ − b·x̄

For the data in question, we find the sums of products and sums of squares components are:

Sxy=-22,946.47, Syy=225,011.4, and Sxx=4,299.87.

Thus using the formula provided the correlation coefficient, R, is -0.73771. Likewise we find the regression slope is -5.33655 and the intercept is 550.4151. Using the coefficients that have been fitted, the linear model is therefore of the form:

abrasion loss (y) = 550.42 - 5.34 x hardness (x)

From this predictor equation we can estimate values for y from x, and then compute the sum of squared errors, SSE = Σ(yobs − ypred)². This residual error sum of squares is 102,556.3, which is much less than the total sum of squares, TSS = Syy = 225,011.4. Thus the model (regression) sum of squares is MSS = TSS − SSE = 122,455. The model therefore explains 100·MSS/TSS percent of the total variation, i.e. around 54% in this example, and MSS/TSS is known as the coefficient of determination. But we observe that this figure is R², thus R² = MSS/TSS = 1 − SSE/TSS. This provides a more general way of defining the correlation coefficient, R, i.e. as the square root of the explained variation, with the sign (+/-) of the coefficient obtained by examining the sign of the slope. More particularly, with multiple predictor variables the "Multiple R" coefficient is defined as the square root of the coefficient of determination and thus has the range [0,1], and the sign of the correlation with respect to each predictor variable is determined by examining the sign of the slope for that variable.
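
These hand calculations are easy to reproduce in software. The following short R sketch (the vector names loss and hardness are our own, with values transcribed from the table above) computes the same quantities:

# Hand calculation of the simple regression of abrasion loss on hardness
loss     <- c(372, 206, 175, 154, 136, 112,  55,  45, 221, 166,
              164, 113,  82,  32, 228, 196, 128,  97,  64, 249,
              219, 186, 155, 114, 341, 340, 283, 267, 215, 148)
hardness <- c( 45,  55,  61,  66,  71,  71,  81,  86,  53,  60,
               64,  68,  79,  81,  56,  68,  75,  83,  88,  59,
               71,  80,  82,  89,  51,  59,  65,  74,  81,  86)

Sxy <- sum((hardness - mean(hardness)) * (loss - mean(loss)))   # approx -22,946.5
Sxx <- sum((hardness - mean(hardness))^2)                       # approx   4,299.9
Syy <- sum((loss - mean(loss))^2)                               # approx 225,011.4

b <- Sxy / Sxx                        # slope, approx -5.337
a <- mean(loss) - b * mean(hardness)  # intercept, approx 550.42
r <- Sxy / sqrt(Sxx * Syy)            # correlation, approx -0.738
r^2                                   # coefficient of determination, approx 0.54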

In this instance we have chosen to use Excel's automatic regression procedure rather than performing all the computations manually or in the spreadsheet. This enables us to include a variety of different outputs from the analysis in order to highlight different aspects of the data. The first table below shows the core regression statistics, starting with the correlation coefficient (described by Excel here as "Multiple R") and its squared value (the coefficient of determination), which identifies the proportion of the overall variation explained by the regression (54%) - as expected. The correlation is quite strong, but the explained variation is perhaps not as high as one might have hoped if this were to be used for predictive modeling. The adjusted R² value (which is always smaller in absolute size than R²) takes into account the degrees of freedom associated with the model, which in this instance is 1 for the regression model and 28 (n-2) for the residuals (see further, below). The formula applied in the case of SPSS is:

adjusted R² = 1 − (1 − R²)(C − 1)/(C − p*)

where p is the number of variables in the model (1 in this instance), C is the number of cases and p* is the number of coefficients in the model (p + 1, i.e. 2 here). There are pros and cons of using the adjusted version, but the general view is that it is an improved (closer to unbiased) measure for sample data. The standard error figure in the table is obtained from the sum of squared prediction errors divided by the degrees of freedom, which is the sample size minus 2 for the two parameters that have been estimated; thus we have SSE/(n-2) = 3662.726, and the square root of this figure is the standard error, 60.52. This figure is used below when we compute confidence intervals for the regression line and for the data items.
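
For comparison, the same summary statistics can be reproduced outside Excel. A minimal R sketch, reusing the loss and hardness vectors defined above, is:

fit <- lm(loss ~ hardness)   # ordinary least squares fit of abrasion loss on hardness
summary(fit)                 # coefficients, standard errors, t statistics, R-squared, F test
confint(fit, level = 0.95)   # 95% confidence intervals for the intercept and slope
anova(fit)                   # regression and residual sums of squares (the ANOVA table)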

Regression Statistics

Multiple R           0.74
R²                   0.54
Adjusted R²          0.53
Standard Error, s    60.52
Observations, n      30

The next two tables provide an analysis of this variance explanation, by identifying the component of the sums of squares (SS) due to the regression (which we have called MSS above) and that remaining, i.e. the residual variation (SSE above), together with the degrees of freedom associated with each. This enables the mean squares (MS) to be computed (as SS/df) and the F-ratio to be produced. If the underlying assumptions for the model are acceptable, then the observed F-value is highly significant (very low probability of arising by chance). Likewise, t-tests on the intercept and slope values (shown here as t Stat) are also highly significant, and upper and lower confidence intervals for these parameters have been computed using the t-distribution, again on the basis that the underlying model assumptions are valid.

ANOVA

              df    SS           MS           F       Significance F
Regression     1    122,455.04   122,455.04   33.43   <0.001
Residual      28    102,556.33     3,662.73
Total         29    225,011.37

             Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept    550.42         65.79             8.37    0.00      415.66      685.17
Hardness      -5.34          0.92            -5.78    0.00       -7.23       -3.45

As noted earlier, we can use the coefficients to define the linear model, which is of the form:

abrasion loss = 550.42 - 5.34 x hardness

Predicted values obtained from this model can be plotted alongside the original dataset. We also noted above that, on the assumption of Normality, we have:

a + b·x0 ~ N(α + β·x0, σ²[1/n + (x0 − x̄)²/Sxx])

that is, the fitted value at x0 is approximately Normally distributed with mean equal to the regression line equation and variance that can be estimated from the expression:

s²[1/n + (x0 − x̄)²/Sxx] = 3663[1/30 + (x0 − 70.3)²/4300]

Note that the figure 3663 is the (rounded) residual mean sum of squares, and is the squared standard error reported in the initial summary table above. The figure 4300 is simply the sum of squared deviations of the x-values from their mean, Sxx, which we have calculated separately in the spreadsheet. This expression can then be used to generate 95% confidence intervals for the regression line, by taking the square root of the expression and multiplying by the appropriate percentage point of the Normal distribution (the 95% z-value, ±1.96) or of the t-distribution (±2.048 based on n−2 = 28 df, which yields very slightly wider confidence intervals). The results are plotted below. Note that the main confidence intervals (red and blue lines) are for the regression line, not the data, and are narrowest at the mean values of the two variables. Effectively this highlights the fact that the line can be rotated around the mean, altering its slope and intercept values, whilst remaining within the colored envelope shown in the diagram. The confidence limits for the data are shown with green dotted lines and are simply the regression line value +/- 1.96 times the standard error, s.

Best fit regression line of abrasion on hardness, with associated 95% confidence intervals
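
A sketch of how these intervals can be generated in R, reusing the fit object from the earlier lm() call, is shown below (the plotting details are purely illustrative):

# 95% confidence band for the fitted line and 95% prediction band for new observations
new  <- data.frame(hardness = seq(45, 90, by = 1))
conf <- predict(fit, new, interval = "confidence", level = 0.95)  # band for the regression line
pred <- predict(fit, new, interval = "prediction", level = 0.95)  # band for individual data values

plot(hardness, loss, xlab = "Hardness, x", ylab = "Abrasion loss, y")
lines(new$hardness, conf[, "fit"])                          # fitted regression line
lines(new$hardness, conf[, "lwr"], col = "blue")            # lower confidence limit for the line
lines(new$hardness, conf[, "upr"], col = "red")             # upper confidence limit for the line
lines(new$hardness, pred[, "lwr"], lty = 3, col = "green")  # lower prediction limit for the data
lines(new$hardness, pred[, "upr"], lty = 3, col = "green")  # upper prediction limit for the data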

Generalizing simple linear regression

The simple linear regression described above can be readily generalized, but in doing so it is convenient to use matrix algebra extensively, so readers should familiarize themselves with basic matrix operations before continuing. A typical linear relationship involving more than one independent or explanatory variable might be of the form:

y = xβ = β1 + β2x2 + β3x3 + ... + βpxp

where β is a column vector of p parameters to be determined and x is a row vector of independent variables with a 1 in the first column (this enables the intercept parameter(s) to be included). It should be noted that an expression such as the one shown below is also described as linear, since it remains linear in the coefficients, β, even if it is not linear in the predictor variables, x, for example:

y = β1 + β2x2 + β3x2² + β4x2x3

In the above example there is more than one predictor or independent variable, and in this case the model is described as multiple linear regression. The functional form in such expressions can be quite general. Common functional forms are simple polynomials with a single predictor variable (polynomial regression), functions of two or more predictor variables (which are described as interaction terms), and trigonometric functions (e.g. Fourier series).

If y={yi} is a set of n observations or measurements of the response variable, with corresponding recorded values for the set of independent variables, then a series of linear equations of the type shown above can be constructed. However, typically the number of observations will be greater than the number of coefficients (the system is said to be over-determined). One approach to this problem is to seek a best fit solution, where best fit is taken to mean a solution for the vector of parameters β that minimizes the difference between the fitted model and the observed values at these data points. In practice the procedure most widely applied seeks to minimize the sum of squared differences, hence the term least squares. Each linear expression requires an additional "error" component to allow for the degree of fit of the expression to the particular data values. If we denote these error components by εi, the general form of the expression above now becomes:

y = Xβ + ε

where X is now a matrix with 1s in column 1, and ε is a vector of errors that in conceptual terms is assumed to represent the effects of unobserved variables and of measurement errors in the observations. The matrix X has the general form:

X = [ f1(x1)  f2(x1)  ...  fp(x1) ]
    [ f1(x2)  f2(x2)  ...  fp(x2) ]
    [  ...     ...    ...    ...  ]
    [ f1(xn)  f2(xn)  ...  fp(xn) ]

where f1(xi) to fp(xi) are a set of p functions applied to one or more predictor variables, x, as recorded for each of n observations. As noted above, where there are multiple predictor variables the regression is described as multiple linear regression. Typically f1(x)=1 which yields a constant intercept term, and fk(x) is a relatively simple multi-dimensional function of the predictor variables, for example linear, quadratic or pairwise interaction terms.

In general we assume that the expected values of the errors in the expression are zero, E(ε)=0, and that the error variance is constant (homoscedastic), with variance-covariance matrix E(εεᵀ)=σ²I. Here 0 is a column vector of zeros and I is the identity matrix. This set of n equations in p unknowns (n>p) is typically solved for the vector β by seeking to minimize the sum of the squared error terms, εᵀε (Ordinary Least Squares, or OLS). The solution for the coefficients using this approach is obtained from the matrix expression:

β̂ = (XᵀX)⁻¹Xᵀy

and

ŷ = Xβ̂

where the hat (^) symbol denotes concern with sample estimates of population parameters. The sample estimates of the population error (ε) terms are described as 'residuals'. The variance for such models is usually estimated from the residuals of the fitted model, using the expression:

s² = ε̂ᵀε̂/(n − p) = (y − Xβ̂)ᵀ(y − Xβ̂)/(n − p)
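
As an illustration of these matrix expressions, the following R sketch (object names are our own) reproduces the simple regression coefficients for the rubber data by solving the normal equations directly:

# OLS via the normal equations: beta-hat = (X'X)^(-1) X'y
# (solve() is used here for clarity rather than numerical robustness)
X <- cbind(1, hardness)                         # design matrix with a leading column of 1s
y <- loss
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y    # approx (550.42, -5.34)
e_hat    <- y - X %*% beta_hat                  # residuals
s2       <- sum(e_hat^2) / (nrow(X) - ncol(X))  # residual variance estimate, approx 3663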

Prediction intervals for individual values

A linear regression model provides a form of prediction: an estimate of values for the dependent variable, y. Each observed value, yi, will deviate by some amount (greater than or equal to 0) from the predicted value, but this alone does not provide an estimate of the range of values one might expect for a new observation of y at a given x. This range of values, or prediction interval, for an ordinary least squares model (assuming Normality of errors) can be obtained from the t-distribution with a probability of 100(1-α)% and n-2 df, using the formula:

ŷ0 ± t(α/2, n−2)·s·√(1 + 1/n + (x0 − x̄)²/Sxx)

Multiple linear regression

Multiple linear regression is the term used to describe a linear regression model in which there are multiple predictor variables. In this case, as we have seen above, the design matrix is of the form:

X = [ f1(x1)  f2(x1)  ...  fp(x1) ]
    [ f1(x2)  f2(x2)  ...  fp(x2) ]
    [  ...     ...    ...    ...  ]
    [ f1(xn)  f2(xn)  ...  fp(xn) ]

The simplest model is one which is a pure linear additive model - one for which f1(x)=1 and fk+1(x)=xk. In order to illustrate this type of model we will use example data provided with MATLab, shown below. This is a set of 20 measurements of the dependent variable, y, based on 5 predictor variables, x1 to x5.

Data input table

x1     x2    x3     x4     x5     y
1125   232   7160   85.9   8905   1.5563
920    268   8804   86.5   7388   0.8976
835    271   8108   85.2   5348   0.7482
1000   237   6370   83.8   8056   0.716
1150   192   6441   82.1   6960   0.313
990    202   5154   79.2   5690   0.3617
840    184   5896   81.2   6932   0.1139
650    200   5336   80.6   5400   0.1139
640    180   5041   78.4   3177   -0.2218
583    165   5012   79.3   4461   -0.1549
570    151   4825   78.7   3901   0
570    171   4391   78     5002   0
510    243   4320   72.3   4665   -0.0969
555    147   3709   74.9   4642   -0.2218
460    286   3969   74.4   4840   -0.3979
275    198   3558   72.5   4479   -0.1549
510    196   4361   57.7   4200   -0.2218
165    210   3301   71.8   3410   -0.3979
244    327   2964   72.5   3360   -0.5229
79     334   2777   71.9   2599   -0.0458

As a first step in data analysis and modeling a graphical representation of the data often helps. With multiple regression (and multivariate regression) plotting the data is clearly difficult. One option is to generate a set of pairwise plots, for example using the R pairs() function. A pairs() plot for this example dataset is shown below and as can be seen the plots of most of the predictor variables against the dependent variable y are roughly linear or fairly scattered.

Pairwise scatterplots (pairs plot) of the example dataset

The fitted model we are considering is of the form:

y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5 + ε

To solve the regression of y on x, we need to generate the design matrix X, which is simply the source data for X augmented by an initial column of 1s:

Design matrix X

x0   x1     x2    x3     x4     x5
1    1125   232   7160   85.9   8905
1    920    268   8804   86.5   7388
1    835    271   8108   85.2   5348
1    1000   237   6370   83.8   8056
1    1150   192   6441   82.1   6960
1    990    202   5154   79.2   5690
1    840    184   5896   81.2   6932
1    650    200   5336   80.6   5400
1    640    180   5041   78.4   3177
1    583    165   5012   79.3   4461
1    570    151   4825   78.7   3901
1    570    171   4391   78     5002
1    510    243   4320   72.3   4665
1    555    147   3709   74.9   4642
1    460    286   3969   74.4   4840
1    275    198   3558   72.5   4479
1    510    196   4361   57.7   4200
1    165    210   3301   71.8   3410
1    244    327   2964   72.5   3360
1    79     334   2777   71.9   2599

Finally we apply the MATLab regress() function to obtain the unknown (estimated) parameters using least squares. The resulting coefficient estimates match those shown in the SPSS Coefficients table later in this section.

Notice that the first predictor variable (x1) has a parameter estimate of approximately zero, and the third and fifth variables have parameters close to zero, so at first glance one might think that their contributions are negligible. However, they relate to variables that have large absolute values, so dismissing them without further investigation would be unwise, as we show below. The coefficient of determination (R²) for this model is 81%, and the F-statistic (the ratio of the mean sum of squares due to the regression to the mean sum of squares due to the residuals) is 11.99, which is highly significant. Both suggest that we have a good model in terms of fit, though perhaps one that is more complicated than necessary.
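
The same multiple regression model is straightforward to fit in R. A minimal sketch, assuming the 20 rows tabulated above have been read into a data frame called dat with columns x1 to x5 and y, is:

# Multiple linear regression of y on all five predictors
fit_all <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
summary(fit_all)   # coefficients, R-squared (about 0.81) and overall F statistic (about 12)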

With models that potentially have a large number of explanatory variables, and unknown interaction effects, it can be helpful to use a computationally intensive technique known as a tree model to identify key predictors and interactions. In R this facility is implemented in the package tree via the function tree(), and in MATLab as classregtree, but different software packages support a range of functions of this type, and some of those referred to as tree functions are provided primarily for classification purposes. In each case the algorithm applied should be examined before use. The R function uses a binary division procedure, selecting at each step the explanatory or predictor variable whose partitioning maximally divides the response variable; a sketch of its use is shown below.
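
This sketch assumes the tree package is installed and reuses the dat data frame defined above:

# Regression tree used to screen for influential predictors and possible interactions
library(tree)
fit_tree <- tree(y ~ x1 + x2 + x3 + x4 + x5, data = dat)
print(fit_tree)                  # text listing of the binary splits
plot(fit_tree); text(fit_tree)   # tree diagram with split labels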

In the current example the predictor x3 is identified as the major predictor, splitting the data at a value x3=6133. Cases with higher values of this predictor are not split further, indicating that there are no substantive interaction effects at this end of the scale, but lower values do appear to be affected by interactions to a limited extent. For example, there may be an effect involving an x3x4 interaction, so including the interaction of these two predictors in the original model may result in an improved final model. For a full discussion of the operation and application of tree models, see Crawley (2007, ch. 21 [CRA1]).

It is common to tabulate or plot the residuals for all cases, as a means of identifying possible outliers, as illustrated below.

Residuals for each case, with associated confidence intervals (case 1 highlighted)

The confidence interval for the red residual (case 1) does not span the zero position, which is why it is highlighted as a possible outlier. However, at the 5% level one would expect 1 in 20 cases to appear as possible outliers, so this may simply be the result of randomness in the data - but still worth checking in case there is something unusual regarding this case (e.g. sample taken at a different time, location or under different circumstances/by a different person). If an outlier is suspected as a genuine cause for concern, the record could be removed or a form of robust regression employed. The latter involves assigning weights to each of the data records (cases) using the iteratively re-weighted least squares (IRLS) technique, and is thus computationally intensive. When applied to this example dataset, using one of many possible weighting schemes (the MATLab default, bisquare), the first case is assigned a weight of 0.025 whilst all other records are assigned weights between 0.67 and 0.99. Robust regression weighting schemes for univariate and multivariate regression are described in most software packages, and in specialized statistics texts such as Huber (1981, [HUB1]). Different schemes may be applied where the suspected outlier(s) are observed in the response variable (the standard situation), the predictor variables, or both response and predictor variables.
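
The computations described here use MATLab's default bisquare weighting. A broadly comparable sketch in R, assuming the MASS package is available and the dat data frame defined earlier, is:

# Robust regression via iteratively re-weighted least squares (IRLS), bisquare weights
library(MASS)
fit_rob <- rlm(y ~ x1 + x2 + x3 + x4 + x5, data = dat, psi = psi.bisquare)
summary(fit_rob)   # robust coefficient estimates and standard errors
fit_rob$w          # final case weights; a suspect outlier receives a weight close to zero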

If the above dataset is analyzed in an integrated software environment, such as SPSS, a set of tabular and graphical outputs is available without the need to create the appropriate scripts (unlike R or MATLab). The two tables below illustrate this using SPSS and the same data as above, initially showing the coefficients and their estimated significance under the usual set of assumptions regarding such linear models, followed by the ANOVA table for this data. This suggests that most of the predictor variables have a non-significant impact, with the majority of the explanatory power coming from the constant term (the intercept) and variable x5.

Coefficients

Model           Unstandardized Coefficients      Standardized Coefficients      t        Sig.
                B            Std. Error          Beta
1  (Constant)   -2.156       .913                                               -2.360   .033
   x1           -9.012E-6    .001                -.005                          -.017    .986
   x2           .001         .001                .138                           1.041    .315
   x3           .000         .000                .408                           1.662    .119
   x4           .008         .014                .102                           .564     .582
   x5           .000         .000                .467                           1.921    .075

ANOVA

Model           Sum of Squares   df   Mean Square   F        Sig.
1  Regression   4.108            5    .822          11.989   .000a
   Residual     .960             14   .069
   Total        5.068            19

a. Predictors: (Constant), x5, x2, x4, x3, x1.  Dependent Variable: y

If this regression analysis is re-run using just variable x5 we find that the constant term and coefficient alter somewhat but, as expected, the model still accounts for around 70% of the variance. However, this is a somewhat ad hoc approach to variable inclusion or exclusion. More systematic approaches are widely available, most commonly using a procedure known as stepwise regression. This procedure involves systematically including or excluding variables in the model based on their contribution to the overall F-statistic (a broadly similar automated procedure is sketched in R after the table below). In the above example, a stepwise regression using SPSS produces the result shown below, i.e. it identifies variables 3 and 5, together with the constant, as providing the best model, achieving an R² of 79%:

Model Summary

Model   R      R Square   Adjusted   Std. Error of    Change Statistics                                       Durbin-Watson
                           R Square   the Estimate    R Square Change   F Change   df1   df2   Sig. F Change
1       .835   .698        .681       .2918258        .698              41.509     1     18    .000
2       .887   .787        .762       .2518854        .090               7.161     1     17    .016           1.731

1. Predictors: (Constant), x3

2. Predictors: (Constant), x3, x5
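
R's built-in step() function provides a broadly similar automated selection, although it is based on the AIC criterion rather than the F-tests used by SPSS, so the chosen model may not match exactly. A sketch using the fit_all object defined earlier:

# Stepwise selection starting from the full model, adding or dropping terms in both directions
fit_step <- step(fit_all, direction = "both")
summary(fit_step)   # the retained predictors; here we would expect x3 and x5 to feature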

A similar analysis including a (scaled) x3x4 interaction produces a result that explains slightly more of the overall variance than model 2, with just the constant, the interaction term and x5 included; a sketch of such a model is shown below. (Note that the Durbin-Watson statistic reported as standard by SPSS is a measure of possible autocorrelation amongst the residuals, although it is not particularly meaningful in many instances. Its value ranges from 0 to 4, with 2 indicating no autocorrelation.)
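
A sketch of how such an interaction model, and the Durbin-Watson statistic itself, might be computed in R (the scaling applied to the interaction term here is purely illustrative):

# Model with a scaled x3*x4 interaction term plus x5
fit_int <- lm(y ~ I(scale(x3 * x4)) + x5, data = dat)
summary(fit_int)             # compare R-squared with the stepwise model above

# Durbin-Watson statistic computed directly from the residuals
e <- residuals(fit_int)
sum(diff(e)^2) / sum(e^2)    # values near 2 indicate little autocorrelation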

References

[CRA1] Crawley M J (2007) The R Book. John Wiley & Sons, Chichester, UK

[DAV1] Davies O L ed. (1961) Statistical Methods in Research and Production. 3rd ed., Oliver & Boyd, London

[HUB1] Huber P J (1981) Robust Statistics. John Wiley & Sons, New York

[YUL1] Yule G U, Kendall M G (1950) An Introduction to the Theory of Statistics. Griffin, London, 14th edition (first edition was published in 1911 under the sole authorship of Yule)