Poisson regression for count data
Poisson regression applies where the response variable is a count (e.g. crime incidents, cases of a disease) rather than a continuous variable. This model may also be applied to standardized counts or “rates”, such as disease incidence per capita or species of tree per square kilometer. It assumes the response variable has a Poisson distribution whose expected value (mean) depends on one or more predictor variables. Typically the log of the expected value is assumed to have a linear relationship with the predictor variables.
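In symbols, the model just described can be sketched as follows. The notation is ours, introduced for illustration: Y_i is the observed count, \mu_i its expected value, x_{ij} the predictors, \beta_j the coefficients, and N_i an exposure (e.g. population or area) used when rates rather than raw counts are modeled.

```latex
\begin{aligned}
Y_i &\sim \mathrm{Poisson}(\mu_i) \\
\ln \mu_i &= \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} \\
\ln\!\left(\frac{\mu_i}{N_i}\right) &= \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
\quad\Longleftrightarrow\quad
\ln \mu_i = \ln N_i + \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
\end{aligned}
```

The last line shows why rates are usually handled by adding \ln N_i as an offset, i.e. a term whose coefficient is fixed at 1.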
As Crawley (2007, p527, [CRA1]) notes, linear regression is not appropriate for such data since:
•the linear model might lead to the prediction of negative counts
•the variance of the response variable is likely to increase with the mean
•the errors will not be Normally distributed, and
•zeros are difficult to handle in transformations
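The first of these points is easy to demonstrate. The sketch below fits an ordinary least-squares line to a small set of counts and inspects the fitted values; the data are hypothetical, chosen only to illustrate the problem, and are not taken from Crawley.

```python
# Hypothetical count data (not from the text): counts rise quickly with x.
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 3, 8, 20]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Ordinary least-squares slope and intercept for a straight line.
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
sxx = sum((xi - x_mean) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_mean - slope * x_mean

# Fitted values from the straight line; some fall below zero.
fitted = [intercept + slope * xi for xi in x]
negative_fits = [f for f in fitted if f < 0]

print("fitted values:", [round(f, 2) for f in fitted])
print("number of negative fitted counts:", len(negative_fits))
```

With these numbers the fitted line dips below zero at the low end of x, which is not a meaningful prediction for a count.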
Furthermore, an excess of zeros in many sample datasets may present problems when attempting to apply this form of regression, and special variants of GLIM regression have been devised to handle such situations (e.g. using a Negative Binomial model rather than a Poisson model). Most software packages provide support for count data regression using some form of GLIM or similar facility, e.g. glm() in R, GENMOD in SAS/STAT, Generalized Linear Modeling (Count data) in SPSS. Although the response variable is in the form of counts, the predictors can be continuous variables or nominal or ordinal values. For more details, with specific reference to its use in R including worked examples, see Zeileis et al. (2008, [ZEI1]), from whom the table below is derived. The zero-augmented models listed combine a simple model for the zeros with a count distribution model (e.g. Poisson or NB): hurdle models use a zero-truncated distribution for the 1+ counts, whereas zero-inflated models mix additional zeros with an untruncated count distribution.
Overview of count regression models in R (after Zeileis et al., 2008)

| Type | Distribution | Method | Description | R Package |
|---|---|---|---|---|
| GLM | Poisson | ML | Poisson regression: classical GLM, estimated by maximum likelihood (ML) | stats |
| GLM | Poisson | Quasi | "quasi-Poisson regression": same mean function, estimated by quasi-ML (QML) or equivalently generalized estimating equations (GEE), inference adjustment via estimated dispersion parameter; useful for modeling when overdispersion is encountered | stats |
| GLM | Poisson | Adjusted | "adjusted Poisson regression": same mean function, estimated by QML/GEE, inference adjustment via sandwich covariances | stats |
| GLM | Negative Binomial (NB) | ML | NB regression: extended GLM, estimated by ML including additional shape parameter | MASS |
| Zero augmented | Poisson | ML | zero-inflated Poisson (ZIP), hurdle Poisson | pscl |
| Zero augmented | Negative Binomial (NB) | ML | zero-inflated NB (ZINB), hurdle NB | pscl |
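Before turning to a zero-augmented model it is worth checking whether the data really contain more zeros than a Poisson distribution would produce. The sketch below is one simple way to do this, assuming a plain comparison of the observed share of zeros with the Poisson probability of a zero, exp(-mean); the sample is hypothetical and the comparison ignores any predictors.

```python
import math

# Hypothetical sample of counts (not from the text), with many zeros.
counts = [0, 0, 0, 0, 0, 0, 1, 2, 3, 6]

mean = sum(counts) / len(counts)
observed_zero_share = counts.count(0) / len(counts)

# Under a Poisson distribution with this mean, P(Y = 0) = exp(-mean).
expected_zero_share = math.exp(-mean)

print(f"sample mean:         {mean:.2f}")
print(f"observed zero share: {observed_zero_share:.2f}")
print(f"Poisson zero share:  {expected_zero_share:.2f}")
```

A markedly larger observed share, as in this made-up sample, is a hint (not a formal test) that a zero-inflated or hurdle model may be worth trying.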
Count data is often provided in the form of contingency tables, for which all variables are count or nominal, and when modeled using Poisson regression the approach is referred to as log-linear modeling.
Example: Car insurance claims modeling
To illustrate this we show below a simple table of insurance data from Aitkin et al. (1989, [AIT1]), referred to in the SAS/STAT documentation, but here we use SPSS to conduct the regression analysis. The table below shows the number of policyholders (N), the number of claims (C), the type of car (in three groups: small, medium and large) and the age of driver (in two groups: younger and older). The objective is to model the claims level from the information on car type and driver age, bearing in mind the different numbers of policyholders involved. The data could represent actual records from 6 different insurance companies, although those provided below are synthetic. Likewise, in a real example the data would probably extend to more insurers, a more detailed classification of car types, a finer classification of driver ages, and other predictor variables, such as sex of driver, occupational group of driver etc. However, for our purposes the example provided will suffice.
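| i | N | C | Car | Age |
|---|---|---|---|---|
| 1 | 500 | 42 | s | 1 |
| 2 | 1200 | 37 | m | 1 |
| 3 | 100 | 1 | l | 1 |
| 4 | 400 | 101 | s | 2 |
| 5 | 500 | 73 | m | 2 |
| 6 | 300 | 14 | l | 2 |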
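With Poisson regression we assume the count data is approximately Poisson distributed (so the mean is approximately equal to the variance), and the log of the expected count is modeled as a linear function of the predictors. The model we shall apply in this example is of the form:
ln(μ) = ln(N) + β0 + β1·car(l) + β2·car(m) + β3·car(s) + β4·age(1) + β5·age(2)
where μ is the expected number of claims and N the number of policyholders.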
In this expression the variables car and age are indicators, i.e. take the values 0 or 1 as appropriate. To model this data in SPSS a new variable, offset=ln(N), is created, and then the generalized linear models option is selected for count data with a Poisson log-linear model.
The response variable is C and the predictors are the type of Car and the Age of the driver, with offset specified as our variable offset. The Model tab specifies the main effects, which in this case are simply the variables Car and Age, and no other effects (e.g. interactions) have been specified. The output from the analysis produces a variety of statistics, including goodness of fit measures, significance of parameter estimates and details of parameter estimates with confidence intervals. Here we focus on the last of these, as the model appears to provide a good fit to the (limited) data available. Below we show the parameter estimates obtained, which encouragingly are almost identical to those produced by SAS/STAT. Note that two predictors (car=small and age=2) have a zero as the parameter estimate since these were found to be redundant. Also note the relatively wide confidence intervals for some of the parameters:
Parameter Estimates

| Parameter | B | Std. Error | 95% Wald CI Lower | 95% Wald CI Upper | Wald Chi-Square | df | Sig. |
|---|---|---|---|---|---|---|---|
| (Intercept) | -1.317 | .0903 | -1.494 | -1.140 | 212.732 | 1 | .000 |
| car=l | -1.764 | .2724 | -2.298 | -1.230 | 41.959 | 1 | .000 |
| car=m | -.693 | .1282 | -.944 | -.441 | 29.180 | 1 | .000 |
| car=s | 0 (a) | . | . | . | . | . | . |
| age=1 | -1.320 | .1359 | -1.586 | -1.054 | 94.339 | 1 | .000 |
| age=2 | 0 (a) | . | . | . | . | . | . |
| (Scale) | 1 (b) | | | | | | |

Dependent Variable: C. Model: (Intercept), car, age, offset = offset
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.
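As a cross-check outside SPSS, the same model can be fitted with a few lines of standard-library Python. This is a minimal sketch, assuming iteratively reweighted least squares (a common way of fitting GLMs; the text does not say which algorithm SPSS uses). The helper solve() and all variable names are ours, and car=s and age=2 are taken as the baseline categories, matching the redundant parameters in the table.

```python
import math

# Data from the insurance table: policyholders N, claims C, car type, age group.
N = [500, 1200, 100, 400, 500, 300]
C = [42, 37, 1, 101, 73, 14]
car = ["s", "m", "l", "s", "m", "l"]
age = [1, 1, 1, 2, 2, 2]

# Design matrix columns: intercept, car=l, car=m, age=1.
X = [[1.0, float(c == "l"), float(c == "m"), float(a == 1)] for c, a in zip(car, age)]
offset = [math.log(n) for n in N]
p = 4


def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            factor = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= factor * M[col][j]
    x = [0.0] * k
    for r in reversed(range(k)):
        x[r] = (M[r][k] - sum(M[r][j] * x[j] for j in range(r + 1, k))) / M[r][r]
    return x


# Start from the overall claim rate, then iterate reweighted least squares.
beta = [math.log(sum(C) / sum(N)), 0.0, 0.0, 0.0]
for _ in range(50):
    eta = [o + sum(bj * xj for bj, xj in zip(beta, row)) for o, row in zip(offset, X)]
    mu = [math.exp(e) for e in eta]
    # Working response with the offset removed; the weights equal the Poisson mean.
    z = [e - o + (y - m) / m for e, o, y, m in zip(eta, offset, C, mu)]
    A = [[sum(m * row[i] * row[j] for m, row in zip(mu, X)) for j in range(p)] for i in range(p)]
    rhs = [sum(m * row[i] * zi for m, row, zi in zip(mu, X, z)) for i in range(p)]
    new_beta = solve(A, rhs)
    converged = max(abs(nb - ob) for nb, ob in zip(new_beta, beta)) < 1e-10
    beta = new_beta
    if converged:
        break

for name, value in zip(["(Intercept)", "car=l", "car=m", "age=1"], beta):
    print(f"{name:12s} {value:8.3f}")
```

The estimates should agree with column B of the SPSS output to about three decimal places, with negative signs throughout: larger cars and younger drivers have lower modeled claim rates than the baseline of small cars and older drivers.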
Opting to save the predicted values (which are in logarithmic form) and taking exponentials of these results, the predicted levels of claims can then be obtained, as shown below:
| Observed claims | Predicted claims |
|---|---|
| 42 | 35.80 |
| 37 | 42.97 |
| 1 | 1.23 |
| 101 | 107.20 |
| 73 | 67.03 |
| 14 | 13.77 |
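The predicted claims can be reproduced by hand from the parameter estimates: add the offset ln(N) to the relevant coefficients and exponentiate. The sketch below assumes the estimates are negative (intercept -1.317, car=l -1.764, car=m -0.693, age=1 -1.320), which is what the predicted values require; the variable names are ours.

```python
import math

# Parameter estimates (column B), taken as negative; baselines car=s and age=2 are zero.
b_intercept = -1.317
b_car = {"l": -1.764, "m": -0.693, "s": 0.0}
b_age = {1: -1.320, 2: 0.0}

# (N, car, age) for the six rows of the insurance table.
rows = [(500, "s", 1), (1200, "m", 1), (100, "l", 1),
        (400, "s", 2), (500, "m", 2), (300, "l", 2)]

predicted = []
for n, car, age in rows:
    # Linear predictor on the log scale, including the offset ln(N).
    log_mu = math.log(n) + b_intercept + b_car[car] + b_age[age]
    predicted.append(math.exp(log_mu))

for (n, car, age), mu in zip(rows, predicted):
    print(f"N={n:5d} car={car} age={age} predicted claims={mu:7.2f}")
```

Because the coefficients are rounded to three decimals, the results match the tabulated predictions only to within a few hundredths.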
This model is a very simple example of the use of Poisson regression. For more details and examples the reader is recommended to refer to Cameron and Trivedi (1998, [CAM1]) and/or the seminal work on generalized linear modeling by McCullagh and Nelder (1989, [MCC1]). An SPSS online tutorial provides worked examples of Poisson regression for a similar insurance example to that described above, and for a model of shipping incidents (response) with predictors including the type of ship, year of construction and period in service.
Many examples of this kind of modeling are available for the R project software, and in many ways this is a more transparent environment for exploring and developing such models. The reason for this is that rarely are models as simply stated as in the above example, nor are the results as clear cut. Often a series of experiments needs to be carried out, dealing with issues such as overdispersion (where the residual deviance is much greater than the degrees of freedom), where there are many more zeros than would be expected for a Poisson distribution, or where there may be some factors or effects not adequately modeled. This is an iterative process, as described earlier in our initial discussion of statistical modeling, and R is well-suited to such investigations.
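The overdispersion check mentioned above can be sketched for the insurance example using only the observed and predicted claims. This is an illustrative calculation, assuming the standard Poisson deviance and Pearson chi-square formulas and four estimated parameters (intercept, two car effects, one age effect); it is not output from SPSS or R.

```python
import math

# Observed and predicted claims from the insurance example.
observed = [42, 37, 1, 101, 73, 14]
predicted = [35.80, 42.97, 1.23, 107.20, 67.03, 13.77]
n_parameters = 4  # intercept, car=l, car=m, age=1

# Poisson deviance: 2 * sum(y * ln(y / mu) - (y - mu)); the log term is 0 when y = 0.
deviance = 2 * sum(
    (y * math.log(y / m) if y > 0 else 0.0) - (y - m)
    for y, m in zip(observed, predicted)
)

# Pearson chi-square: sum((y - mu)^2 / mu).
pearson = sum((y - m) ** 2 / m for y, m in zip(observed, predicted))

df = len(observed) - n_parameters

print(f"residual deviance:  {deviance:.2f} on {df} df")
print(f"Pearson chi-square: {pearson:.2f} on {df} df")
print(f"deviance / df:      {deviance / df:.2f}")
```

For this small dataset the deviance comes out at roughly 2.8 on 2 degrees of freedom, so the ratio gives no strong indication of overdispersion; a ratio far above 1 would suggest trying a quasi-Poisson or Negative Binomial model.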
References
[AIT1] Aitkin M, Anderson D, Francis B, Hinde J (1989) Statistical Modelling in GLIM, Oxford Science Publications, Oxford, UK
[CAM1] Cameron A, Trivedi P (1998) Regression analysis of count data. Cambridge University Press, Cambridge, UK
[CRA1] Crawley M J (2007) The R Book. J Wiley & Sons, Chichester, UK, 2nd ed 2015
[MCC1] McCullagh P, Nelder J A (1989) Generalized Linear Models. 2nd ed. Chapman & Hall, London
[ZEI1] Zeileis A, Kleiber C, Jackman S (2008) Regression Models for Count Data in R. J of Statistical Software, 27(8), available from: http://www.jstatsoft.org/v27/i08/