Poisson regression for count data


Poisson regression applies where the response variable is a count (e.g. crime incidents, cases of a disease) rather than a continuous variable. The model may also be applied to standardized counts or “rates”, such as disease incidence per capita or the number of tree species per square kilometer. It assumes the response variable has a Poisson distribution whose expected value (mean) depends on one or more predictor variables. Typically the log of the expected value is assumed to have a linear relationship with the predictor variables.
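Written out, the model described above can be sketched as follows. The symbols are notation introduced here for illustration and are not taken from the source: $y_i$ is the observed count, $\mu_i$ its expected value, $x_{ij}$ the predictors, and $t_i$ an exposure (population, area or similar) used when modeling rates.

$$
y_i \sim \mathrm{Poisson}(\mu_i), \qquad \mathrm{E}(y_i) = \mathrm{Var}(y_i) = \mu_i, \qquad \ln \mu_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
$$

$$
\ln\!\left(\frac{\mu_i}{t_i}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
\quad\Longleftrightarrow\quad
\ln \mu_i = \ln t_i + \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
$$

The second line shows why rates can be handled within the same framework: the log of the exposure enters the linear predictor as an offset, that is, a term whose coefficient is fixed at 1.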

As Crawley (2007, p527, [CRA1]) notes, linear regression is not appropriate for such data since:

the linear model might lead to the prediction of negative counts

the variance of the response variable is likely to increase with the mean

the errors will not be Normally distributed, and

zeros are difficult to handle in transformations
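The first of these points can be illustrated with a minimal Python sketch. The six (x, y) pairs below are hypothetical toy counts invented purely for this illustration, and the helper name `ols_fit` is likewise an assumption; the point is only that a straight line fitted by ordinary least squares can predict a negative count.

```python
# Hypothetical toy data (not from the source): small counts that rise with x.
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 1, 3, 8]


def ols_fit(x, y):
    """Return (intercept, slope) of the ordinary least squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = ols_fit(x, y)
prediction_at_zero = intercept + slope * 0

# The fitted line dips below zero at the low end of x,
# which is impossible for a count.
print(f"slope={slope:.3f}  intercept={intercept:.3f}  "
      f"prediction at x=0: {prediction_at_zero:.3f}")
```

A log-link model avoids this because the fitted mean is the exponential of the linear predictor and is therefore always positive.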

Furthermore, an excess of zeros in many sample datasets may present problems when attempting to apply this form of regression, and special variants of GLIM regression have been devised to handle such situations (e.g. using a Negative Binomial model rather than a Poisson model). Most software packages provide support for count data regression using some form of GLIM or similar facility, e.g. glm() in R, GENMOD in SAS/STAT, Generalized Linear Modeling (Count data) in SPSS. Although the response variable is in the form of counts, the predictors can be continuous variables or nominal or ordinal values. For more details, with specific reference to its use in R including worked examples, see Zeileis et al. (2008, [ZEI1]), from whom the table below is derived. The zero-augmented models listed treat the zeros separately from the remaining counts: a hurdle model combines a simple binary model for zero versus non-zero with a zero-truncated distribution (e.g. Poisson or NB) for the 1+ counts, whereas a zero-inflated model mixes a point mass at zero with an ordinary count distribution, so that zeros can arise from either component.

Overview of count regression models in R (after Zeileis et al., 2008)

| Type | Distribution | Method | Description | R Package |
|---|---|---|---|---|
| GLM | Poisson | ML | Poisson regression: classical GLM, estimated by maximum likelihood (ML) | stats |
| GLM | Poisson | Quasi | "quasi-Poisson regression": same mean function, estimated by quasi-ML (QML) or equivalently generalized estimating equations (GEE), inference adjustment via estimated dispersion parameter; useful for modeling when over-dispersion is encountered | stats |
| GLM | Poisson | Adjusted | "adjusted Poisson regression": same mean function, estimated by QML/GEE, inference adjustment via sandwich covariances | stats |
| GLM | Negative Binomial (NB) | ML | NB regression: extended GLM, estimated by ML including additional shape parameter | MASS |
| Zero augmented | Poisson | ML | zero-inflated Poisson (ZIP), hurdle Poisson | pscl |
| Zero augmented | Negative Binomial (NB) | ML | zero-inflated NB (ZINB), hurdle NB | pscl |
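To make the distinction between the two zero-augmented Poisson variants in the table concrete, the following Python sketch writes out their probability mass functions. The function names and the parameter values (`lam`, `pi_zero`, `p_zero`) are illustrative assumptions, not part of any R package; in practice these models would be fitted with dedicated software rather than evaluated by hand.

```python
import math


def poisson_pmf(k, lam):
    """Ordinary Poisson probability of observing count k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k, lam, pi_zero):
    """Zero-inflated Poisson: a point mass at zero (weight pi_zero)
    mixed with an ordinary Poisson(lam), so zeros come from both parts."""
    base = (1.0 - pi_zero) * poisson_pmf(k, lam)
    return pi_zero + base if k == 0 else base


def hurdle_pmf(k, lam, p_zero):
    """Hurdle Poisson: P(0) = p_zero, and the positive counts follow a
    zero-truncated Poisson(lam) scaled by 1 - p_zero."""
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(k, lam) / (1.0 - math.exp(-lam))


# Illustrative parameter values only.
lam, pi_zero, p_zero = 2.0, 0.3, 0.4
for k in range(4):
    print(k, round(poisson_pmf(k, lam), 4),
          round(zip_pmf(k, lam, pi_zero), 4),
          round(hurdle_pmf(k, lam, p_zero), 4))
```

In both variants the probability of a zero can exceed what a plain Poisson distribution with the same `lam` allows, which is the feature that makes them useful when zeros are over-represented.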

Count data is often provided in the form of contingency tables, for which all variables are count or nominal, and when modeled using Poisson regression the approach is referred to as log-linear modeling.

Example: Car insurance claims modeling

To illustrate this we show below a simple table of insurance data from Aitkin et al. (1989, [AIT1]), referred to in the SAS/STAT documentation, although here we use SPSS to conduct the regression analysis. The table below shows the number of policyholders (N), the number of claims (C), the type of car (in three groups: small, medium and large) and the age of driver (in two groups: younger and older). The objective is to model the claims level from the information on car type and driver age, bearing in mind the different numbers of policyholders involved. The data could represent actual records from 6 different insurance companies, although those provided below are synthetic. Likewise, in a real example the data would probably extend to more insurers, a more detailed classification of car types, a finer classification of driver ages, and other predictor variables, such as sex of driver, occupational group of driver etc. However, for our purposes the example provided will suffice.

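| i | N | C | Car | Age |
|---|---|---|---|---|
| 1 | 500 | 42 | s | 1 |
| 2 | 1200 | 37 | m | 1 |
| 3 | 100 | 1 | l | 1 |
| 4 | 400 | 101 | s | 2 |
| 5 | 500 | 73 | m | 2 |
| 6 | 300 | 14 | l | 2 |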

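With Poisson regression we assume the count data are approximately Poisson distributed (so the mean is approximately equal to the variance), and the expected count is modeled on the log scale (a log link) rather than the observations themselves being log transformed. The model we shall apply in this example is of the form:

$$
\ln(\mu_i) = \ln(N_i) + \beta_0 + \beta_1\,\mathrm{car}_i(l) + \beta_2\,\mathrm{car}_i(m) + \beta_3\,\mathrm{car}_i(s) + \beta_4\,\mathrm{age}_i(1) + \beta_5\,\mathrm{age}_i(2)
$$

where $\mu_i$ is the expected number of claims in row i of the table and $\ln(N_i)$ is an offset, i.e. a term whose coefficient is fixed at 1. In this expression the variables car and age are indicators, i.e. take the values 0 or 1 as appropriate. To model this data in SPSS a new variable, offset=ln(N), is created, and then the generalized linear models option is selected for count data with a Poisson loglinear model (see below):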

[Figure: SPSS Generalized Linear Models dialog with the Poisson loglinear model selected]

The response variable is C and the predictors are the type of Car and the Age of the driver, with offset specified as our variable offset. The Model tab specifies the main effects, which in this case are simply the variables Car and Age, and no other effects (e.g. interactions) have been specified. The output from the analysis produces a variety of statistics, including goodness of fit measures, significance of parameter estimates and details of parameter estimates with confidence intervals. Here we focus on the last of these, as the model appears to provide a good fit to the (limited) data available. Below we show the parameter estimates obtained, which encouragingly are almost identical to those produced by SAS/STAT. Note that two predictors (car=small and age=2) have a zero as the parameter estimate since these were found to be redundant: with an intercept in the model, one level of each factor acts as the reference category, and the remaining estimates are differences from it on the log scale. Also note the relatively wide confidence intervals for some of the parameters:

Parameter Estimates

| Parameter | B | Std. Error | 95% Wald CI Lower | 95% Wald CI Upper | Wald Chi-Square | df | Sig. |
|---|---|---|---|---|---|---|---|
| (Intercept) | -1.317 | .0903 | -1.494 | -1.140 | 212.732 | 1 | .000 |
| [car=l] | -1.764 | .2724 | -2.298 | -1.230 | 41.959 | 1 | .000 |
| [car=m] | -.693 | .1282 | -.944 | -.441 | 29.180 | 1 | .000 |
| [car=s] | 0 (a) | . | . | . | . | . | . |
| [age=1] | -1.320 | .1359 | -1.586 | -1.054 | 94.339 | 1 | .000 |
| [age=2] | 0 (a) | . | . | . | . | . | . |
| (Scale) | 1 (b) | | | | | | |

Dependent Variable: C

Model: (Intercept), car, age, offset = offset

a. Set to zero because this parameter is redundant.

b. Fixed at the displayed value.
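The fit itself can be reproduced outside SPSS. The sketch below, in plain Python with no external libraries, fits the same log-link Poisson model with offset ln(N) by iteratively reweighted least squares (IRLS), the standard algorithm for generalized linear models. It is a minimal illustration under stated assumptions, not the procedure SPSS or SAS/STAT actually runs: the function names (`solve`, `weighted_cross_products`, `fit_poisson`), the starting values and the fixed iteration count are all choices made here.

```python
import math

# Insurance data from the example: (policyholders N, claims C, car type, age group).
rows = [
    (500, 42, "s", 1), (1200, 37, "m", 1), (100, 1, "l", 1),
    (400, 101, "s", 2), (500, 73, "m", 2), (300, 14, "l", 2),
]

# Design matrix columns: intercept, car=l, car=m, age=1.
# car=s and age=2 are the reference levels (the zeroed rows in the SPSS output).
names = ["(Intercept)", "car=l", "car=m", "age=1"]
X = [[1.0, float(car == "l"), float(car == "m"), float(age == 1)]
     for _, _, car, age in rows]
y = [float(c) for _, c, _, _ in rows]
offset = [math.log(n) for n, _, _, _ in rows]  # offset = ln(N)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_cross_products(X, w, z):
    """Return X'WX and X'Wz for diagonal weights w."""
    p = len(X[0])
    xtwx = [[sum(wi * xi[j] * xi[k] for wi, xi in zip(w, X)) for k in range(p)]
            for j in range(p)]
    xtwz = [sum(wi * xi[j] * zi for wi, xi, zi in zip(w, X, z)) for j in range(p)]
    return xtwx, xtwz


def fit_poisson(X, y, offset, iterations=50):
    """Fit a log-link Poisson GLM with an offset by IRLS; return (beta, std_errors)."""
    p = len(X[0])
    # Start from the overall claim rate so the first fitted means are sensible.
    beta = [math.log(sum(y) / sum(math.exp(o) for o in offset))] + [0.0] * (p - 1)
    for _ in range(iterations):
        eta = [sum(b * xj for b, xj in zip(beta, xi)) for xi in X]
        mu = [math.exp(o + e) for o, e in zip(offset, eta)]
        # Working response with the offset removed; for the log link the weights equal mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtwx, xtwz = weighted_cross_products(X, mu, z)
        beta = solve(xtwx, xtwz)
    # Standard errors: square roots of the diagonal of (X'WX)^-1 at the final estimate.
    mu = [math.exp(o + sum(b * xj for b, xj in zip(beta, xi)))
          for o, xi in zip(offset, X)]
    xtwx, _ = weighted_cross_products(X, mu, [0.0] * len(y))
    se = []
    for j in range(p):
        unit = [1.0 if k == j else 0.0 for k in range(p)]
        se.append(math.sqrt(solve(xtwx, unit)[j]))
    return beta, se


beta, se = fit_poisson(X, y, offset)
for name, b, s in zip(names, beta, se):
    print(f"{name:12s} {b:8.3f} {s:8.4f}")
```

The printed coefficients and standard errors can be compared directly with the B and Std. Error columns of the SPSS output. A fixed number of iterations is used for brevity; a production implementation would normally stop when the change in the deviance or in the coefficients falls below a tolerance.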

If the predicted values of the linear predictor are saved (these are in logarithmic form) and then exponentiated, the predicted numbers of claims are obtained, as shown below:

| Observed claims | Predicted claims |
|---|---|
| 42 | 35.80 |
| 37 | 42.97 |
| 1 | 1.23 |
| 101 | 107.20 |
| 73 | 67.03 |
| 14 | 13.77 |
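The predicted claims can be recovered by hand from the parameter estimates, which is a useful check on how the offset and the indicator variables combine. The Python sketch below uses the rounded coefficients from the SPSS output, so the results agree with the tabulated predictions only to within rounding; the variable and function names are illustrative choices made here.

```python
import math

# Parameter estimates copied from the SPSS output; reference levels are 0.
intercept = -1.317
car_effect = {"s": 0.0, "m": -0.693, "l": -1.764}
age_effect = {1: -1.320, 2: 0.0}

# (policyholders N, observed claims C, car type, age group) for the six rows.
cells = [
    (500, 42, "s", 1), (1200, 37, "m", 1), (100, 1, "l", 1),
    (400, 101, "s", 2), (500, 73, "m", 2), (300, 14, "l", 2),
]


def predicted_claims(n, car, age):
    """Expected claims: exp(ln(N) + intercept + car effect + age effect)."""
    log_mean = math.log(n) + intercept + car_effect[car] + age_effect[age]
    return math.exp(log_mean)


predictions = [predicted_claims(n, car, age) for n, _, car, age in cells]
for (n, observed, car, age), fitted in zip(cells, predictions):
    print(f"car={car} age={age}  observed={observed:4d}  predicted={fitted:7.2f}")

# Exponentiated coefficients are rate ratios relative to the reference level.
rate_ratios = {
    "car=m": math.exp(car_effect["m"]),
    "car=l": math.exp(car_effect["l"]),
    "age=1": math.exp(age_effect[1]),
}
print({name: round(value, 3) for name, value in rate_ratios.items()})
```

Because the model is linear on the log scale, each exponentiated coefficient is a multiplicative effect on the claim rate: for example, exp(-1.320) is roughly 0.27, so the estimated claim rate for age group 1 is about 27% of that for age group 2 at any given car type.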

This model is a very simple example of the use of Poisson regression. For more details and examples, readers are referred to Cameron and Trivedi (1998, [CAM1]) and/or the seminal work on generalized linear modeling by McCullagh and Nelder (1989, [MCC1]). An SPSS online tutorial provides worked examples of Poisson regression for a similar insurance example to that described above, and for a model of shipping incidents (response) with predictors including the type of ship, year of construction and period in service.

Many examples of this kind of modeling are available for the R project software, and in many ways this is a more transparent environment for exploring and developing such models. The reason for this is that rarely are models as simply stated as in the above example, nor are the results as clear cut. Often a series of experiments needs to be carried out, dealing with issues such as over-dispersion (where the residual deviance is much greater than the degrees of freedom), where there are many more zeros than would be expected for a Poisson distribution, or where there may be some factors or effects not adequately modeled. This is an iterative process, as described earlier in our initial discussion of statistical modeling, and R is well-suited to such investigations.
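The over-dispersion check mentioned above (residual deviance compared with the residual degrees of freedom) can be sketched in a few lines of Python for the insurance example, using the observed and predicted claims tabulated earlier. The function names are illustrative, and the Pearson-based dispersion estimate is included as the quantity a quasi-Poisson fit would typically use; because the predictions are rounded to two decimals, the statistics are approximate.

```python
import math

# Observed and predicted claims from the insurance example.
observed = [42, 37, 1, 101, 73, 14]
fitted = [35.80, 42.97, 1.23, 107.20, 67.03, 13.77]
n_parameters = 4  # intercept, car=l, car=m, age=1


def poisson_deviance(y, mu):
    """Residual deviance: 2 * sum(y * ln(y / mu) - (y - mu))."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # The term y * ln(y / mu) is taken as 0 when y is 0.
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total


def pearson_chi_square(y, mu):
    """Pearson statistic: sum((y - mu)^2 / mu), using Var(y) = mu."""
    return sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))


df = len(observed) - n_parameters
deviance = poisson_deviance(observed, fitted)
pearson = pearson_chi_square(observed, fitted)
dispersion = pearson / df  # values well above 1 suggest over-dispersion

print(f"df={df}  deviance={deviance:.2f}  pearson={pearson:.2f}  "
      f"dispersion={dispersion:.2f}")
```

With only two residual degrees of freedom this check carries little weight for the insurance data, but the same calculation applied to a larger dataset is often the first indication that a quasi-Poisson or Negative Binomial model should be tried.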

References

[AIT1] Aitkin M, Anderson D, Francis B, Hinde J (1989) Statistical Modelling in GLIM. Oxford Science Publications, Oxford, UK

[CAM1] Cameron A, Trivedi P (1998) Regression Analysis of Count Data. Cambridge University Press, Cambridge, UK

[CRA1] Crawley M J (2007) The R Book. J Wiley & Sons, Chichester, UK

[MCC1] McCullagh P, Nelder J A (1989) Generalized Linear Models. 2nd ed. Chapman & Hall, London, UK

[ZEI1] Zeileis A, Kleiber C, Jackman S (2008) Regression Models for Count Data in R. J of Statistical Software, 27(8), available from: http://www.jstatsoft.org/v27/i08/