Logistic regression for proportion data


In many instances response data are expressed in the form of proportions rather than absolute values. For example, the proportion of people who experience a particular side effect when taking a course of drugs, or the proportion of devices that fail after a particular stress test. Proportions, p, in these and many other examples are obtained by dividing the number of events that have a particular characteristic (often referred to as successes) by the total number of events in the sample, n. With a series of more than two samples we have a series of proportions for which a simple test of a proportion (or comparison of two proportions) is not adequate.

The Binomial distribution is the natural choice when analyzing proportion data, but as we have seen earlier in this Handbook, this distribution describes data with a finite range (proportions lie in the interval [0,1]) and is only approximated by the Normal distribution for large n. Furthermore, the variance of the count of successes is npq, so for a fixed value of n the variance changes with p - small values of p or q give low variances (approaching 0), whilst when p=q=0.5 the variance reaches its maximum. The usual requirements of linear regression modeling are therefore not met and an alternative approach is required. The standard approach in this case is to use a logit transform. We have seen earlier that the logistic function has an S-shaped form, asymptotic to 0 for large negative values of its argument and to 1 for large positive values, and has a (non-linear) equation of the form:

p = e^z/(1+e^z) = 1/(1+e^-z)

thus whatever value z takes, the response p is a continuous variable in the range (0,1), which is exactly the kind of result we need for data that are proportions. A simple log transformation of the odds, known as the logit transform, produces the result:

z = ln(p/(1-p)) = ln(p/q)

In terms of generalized linear modeling (GLIM) the logit provides the link function, and it is generalized linear modeling functionality in statistical software that is nowadays used to provide logistic regression.
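
As a brief illustration of this transform pair, the following minimal sketch uses only base R (it is illustrative and not drawn from any particular package's documentation): the logit of a sample proportion is computed directly, and the logistic function maps it back.

p <- 0.25                        # an example proportion
z <- log(p / (1 - p))            # logit transform: z = ln(p/q)
p.back <- exp(z) / (1 + exp(z))  # logistic (inverse logit): maps z back to (0,1)
# base R also provides these functions directly as qlogis(p) and plogis(z)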

In many research projects there are a number of possible explanatory variables, x1, x2, ..., xk, and a simple linear combination of these would be of the form:

z = b0 + b1x1 + b2x2 + ... + bkxk

Using this observation we see that we can apply the ideas of multiple linear regression to logit-transformed proportion data. However, in sample data it is not uncommon for some counts to be 0 (e.g. no-one was infected after using vaccine V) or n (everyone was infected after being exposed to bacterium B). In such cases the odds, p/q, will be either 0 or infinite, whilst in other instances the odds may be extremely large or extremely small. This must be handled by an adjustment to the data. A simple approach is to add 0.5 to the count of successes and optionally to add 1 to the total, n, assuming that n is not small (e.g. n>10). Some software packages will make such adjustments automatically (and may object if fractionally adjusted counts are provided), whilst with others it will be up to the user to apply or specify such adjustments. Another issue arises where the sample sizes vary substantially, in which case weighting of the regression by sample size may be used to allow for this.
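
A minimal sketch of this adjustment in R is shown below; the counts used here are invented purely for illustration and include the problem cases of 0 and n successes.

successes <- c(0, 3, 10)              # includes the problem cases 0 and n
n <- c(10, 12, 10)
p.adj <- (successes + 0.5) / (n + 1)  # add 0.5 to the successes and 1 to the total
z.adj <- log(p.adj / (1 - p.adj))     # the adjusted logits are now all finite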

The net effect of these variations in GLIM implementations is that obtaining comparable results from different software packages may be difficult. Examining the model fit (e.g. by using the plot(model) facility in R) and plotting the resulting curve on the scatterplot is often the only practical means of evaluating the different results. A worked example using R and MATLAB illustrates this.

Example: Vehicle fuel efficiency

A set of data on the performance of cars in terms of fuel efficiency is provided with MATLAB. In summary form this provides, for car models falling into 12 weight bands, the total count of models tested and the count showing poor performance. The source data are as follows:

weight  total  poor
  2100     48     1
  2300     42     2
  2500     31     0
  2700     34     3
  2900     31     8
  3100     21     8
  3300     23    14
  3500     21    17
  3700     21    19
  3900     16    15
  4100     17    17
  4300     21    21

The model for z in this case is simply a two-parameter linear expression of the form:

z = b0 + b1x

where x is the vehicle weight. Analyzing these data in MATLAB using the glmfit() procedure yields parameter estimates b0=-13.38 and b1=0.0042, whilst with R the parameters are estimated as b0=-11.99 and b1=0.0039, i.e. similar but not the same. Both results produce a very good fit to the data, with the R fit being shown in the graph below. Note that with R the input data should be in the form successes/failures, or in this example "poor"/"not poor", so the "not poor" variable must be computed from the total unless the input data are amended. As expected, the analysis confirms that heavier cars are less fuel efficient and that this relationship can be effectively modeled in this way.

Integrated statistical packages vary in their implementations of logistic regression, some providing this via a general purpose GLIM facility (e.g. SPSS) whilst others provide a range of options - for example SAS/STAT provides logistic regression via the generalized linear modeling procedure (GENMOD) and via a specifically logistic procedure (LOGISTIC). For more details on modeling using proportion data (e.g. with categorical data and mixed data types), refer to the manuals for the individual package being used; R users should see Crawley (2007, ch. 16, [CRA1]).
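
An indicative R sketch for this fit is given below. It is a minimal illustration rather than the exact script used to produce the results quoted above; depending on whether the 0.5/1 adjustment described earlier is applied to the counts, the coefficient estimates obtained may differ slightly from those quoted.

# source data: 12 weight bands, total models tested and number performing poorly
weight <- seq(2100, 4300, by = 200)
total  <- c(48, 42, 31, 34, 31, 21, 23, 21, 21, 16, 17, 21)
poor   <- c(1, 2, 0, 3, 8, 8, 14, 17, 19, 15, 17, 21)

# binomial GLM with logit link; successes ("poor") and failures ("not poor")
# are supplied as a two-column matrix on the left-hand side
model <- glm(cbind(poor, total - poor) ~ weight, family = binomial(link = "logit"))
summary(model)                    # parameter estimates b0 and b1

plot(weight, poor / total)        # observed proportions by weight band
lines(weight, fitted(model))      # fitted logistic curve
plot(model)                       # standard diagnostic plots, as noted earlier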

[Figure: logistic regression fit (R) - proportion of poor-performance car models by weight band]

References

[CRA1] Crawley M J (2007) The R Book. John Wiley & Sons, Chichester, UK