Statistical modeling

Statistical (or stochastic) modeling is the process of finding a suitable mathematical model that can be used to describe or 'fit' an observed dataset, where the observations are subject to uncertainty or randomness and can be regarded as having been drawn from one or more probability distributions. For this purpose a mathematical model is treated as a set of one or more equations in one or more independent variables and one (or occasionally more) dependent variables. The dependent variables are sometimes referred to as response variables or endogenous variables. The independent variables are likewise sometimes described as explanatory, predictor or exogenous variables. It is often convenient to denote the dependent variable by the letter y and the independent variable(s) by the letter x. If there is a set of observations with a single dependent variable and a single independent variable, {y_i, x_i}, this can be conveniently represented as a vector pair, {y, x}. With a single dependent variable and multiple independent variables the data can be represented as a vector and a matrix, {y, X}, and with multiple dependent variables, as a matrix pair, {Y, X}.

Statistical models typically involve modeling the dependent variable(s) using some form of linear, or occasionally non-linear, combination of the independent variables based on the observed data, i.e. y=f(X) or Y=f(X). In practice the chosen model will not fit the observed dataset exactly, so there will be some unexplained component or residual error. This is typically denoted by the Greek symbol ε or, where an error matrix is required, by the Latin symbol U.
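As a minimal sketch of the idea of a fitted model plus a residual error, the following Python fragment fits a straight line to five made-up observations by ordinary least squares and then computes the residuals. The data values and the function name `fit_line` are illustrative assumptions and do not come from the text.

```python
# Illustrative data only: x is a single explanatory variable, y the response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x + error (closed form)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


b0, b1 = fit_line(x, y)

# Fitted values f(x) and the unexplained component (residuals).
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
```

For a least squares line with an intercept, the residuals sum to zero up to rounding error, although the individual residuals are generally non-zero.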

Statistical models may involve discrete-valued (or categorical) data and/or continuous-valued data. Models in which all variables are continuous and the dependent variable is a vector are known as regression models. With discrete models the population being studied is meaningfully divided into distinct groups or categories, which may have one or more levels (e.g. high and low). In many cases discrete models are analyzed using analysis of variance (ANOVA) methods.
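A one-way analysis of variance can be sketched in a few lines of Python: each observation is treated as a grand mean plus a group effect plus a residual, and the variation between groups is compared with the variation within groups. The group labels, the data values and the function name `one_way_anova` are hypothetical and chosen only for illustration.

```python
# Illustrative measurements for three levels of a single categorical factor.
groups = {
    "low": [4.1, 3.8, 4.4, 4.0],
    "medium": [5.2, 5.5, 4.9, 5.3],
    "high": [6.8, 7.1, 6.5, 7.0],
}


def one_way_anova(groups):
    """Return grand mean, group effects, sums of squares and the F statistic."""
    all_values = [v for values in groups.values() for v in values]
    n_total = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n_total
    group_means = {g: sum(v) / len(v) for g, v in groups.items()}
    # Group effect = group mean minus grand mean.
    effects = {g: m - grand_mean for g, m in group_means.items()}
    ss_between = sum(len(groups[g]) * effects[g] ** 2 for g in groups)
    ss_within = sum(
        (v - group_means[g]) ** 2 for g, values in groups.items() for v in values
    )
    # Ratio of between-group to within-group mean squares.
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return grand_mean, effects, ss_between, ss_within, f_stat


grand_mean, effects, ss_between, ss_within, f_stat = one_way_anova(groups)
```

A large F statistic suggests that the group means differ by more than the within-group scatter would explain; judging significance would additionally require the F distribution with (k - 1, n - k) degrees of freedom, which is omitted here.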

Linear models are linear in their parameters, not necessarily in the independent variables. Thus the model:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

in which β represents a vector of parameters to be determined, is a simple example of a linear statistical model (a regression model in this case). The expression shown below is also linear, since it remains linear in the coefficients, β, even though it is not linear in the predictor variable, x:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$

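The distinction between linearity in the parameters and linearity in the predictors can be illustrated with a short Python sketch. A quadratic in x is fitted by ordinary least squares simply by adding a column of squared values to the design matrix; the fitting step itself is the same linear procedure. The data are generated without noise from assumed coefficients (1, 2, 3) so that the result can be checked, and the helper names `solve` and `fit_linear_model` are illustrative.

```python
def solve(a, b):
    """Solve the square system a * beta = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta


def fit_linear_model(design, y):
    """Least squares estimate from the normal equations (X'X) beta = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Noise-free response from assumed coefficients, for checking only.
y = [1.0 + 2.0 * xi + 3.0 * xi ** 2 for xi in x]

# Columns: constant, x, x squared. Non-linear in x, linear in beta.
design = [[1.0, xi, xi ** 2] for xi in x]
beta = fit_linear_model(design, y)
```

With real, noisy data the same code would return estimates rather than the exact generating coefficients. Solving the normal equations directly is adequate for a small illustration, although numerically more stable methods are generally preferred in practice.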
A typical example of a linear discrete model would be of the form:

$$y_{ij} = \mu + T_i + e_{ij}$$

which states that the observed or measured value y (observation j in group i) is the sum of an overall mean value, μ, a treatment or group effect, T, and some unexplained random variation, or error, e. The most general form of the linear model (the general linear model or GLM) can be represented by the matrix equation:

$$Y = XB + U$$

The matrices Y, X and U are as defined earlier, but in this case B is a matrix of parameters rather than just a vector, β. This model embraces many of the main forms of statistical model, including univariate and multivariate linear and polynomial regression, and analysis of variance and covariance.

Somewhat confusingly, there is a further set of models known as generalized linear models, perhaps best abbreviated to GLIM (rather than GLM), although this is actually the name of a software package designed in the 1970s (and no longer available) to handle this generalization. GLIM was developed as a form of unified statistical model, incorporating GLMs but also models that deal with count data, proportion data and binary data. It is particularly useful where the errors are not Normally distributed (e.g. the distribution of errors is very skewed and/or peaked) and/or where the variance is not constant (as is the case with the Poisson and Gamma distributions, whose variance changes with the mean). Adding to the confusion, the R function that provides the GLIM functionality is glm(); see the table of R functions below. In broad terms the GLIM approach consists of three components: a probability distribution (also known as the error structure) from the exponential family (which includes the Normal, Poisson and Binomial); a linear predictor (i.e. some form of linear model of the type described above, without the error component); and a link function, which relates the mean of the distribution to the linear predictor.
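The three components of a generalized linear model can be made concrete with a small Python sketch for count data: a Poisson error structure, a linear predictor b0 + b1*x, and a log link. The fit uses iteratively reweighted least squares, a standard approach for such models that is assumed here rather than taken from the text. The counts and the function name `fit_poisson_glm` are illustrative.

```python
import math


def fit_poisson_glm(x, y, iterations=25):
    """Fit log(mean) = b0 + b1*x with Poisson errors by iteratively
    reweighted least squares (single predictor plus intercept)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the constant model
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]      # linear predictor
        mu = [math.exp(e) for e in eta]       # inverse of the log link
        # Working response; for the Poisson the variance equals the mean,
        # so the working weights are simply mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        sw = sum(w)
        x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
        z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sw
        sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
        sxz = sum(wi * (xi - x_bar) * (zi - z_bar)
                  for wi, xi, zi in zip(w, x, z))
        # Weighted least squares update of the coefficients.
        b1 = sxz / sxx
        b0 = z_bar - b1 * x_bar
    return b0, b1


# Illustrative count data that grow roughly exponentially with x.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [1, 2, 4, 7, 12, 20]

b0, b1 = fit_poisson_glm(x, y)
fitted_means = [math.exp(b0 + b1 * xi) for xi in x]
```

At convergence the fitted means satisfy the Poisson likelihood equations, so the fitted means sum to the observed total count. A fixed number of iterations is used for brevity; a fuller implementation would test for convergence.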

Crawley (2007, Ch9 and p349, [CRA1]) provides a useful summary of the core statistical models included within R, which we show below. This reflects the kinds of facility to be found in all the main statistical software packages, although in addition to these there are many other forms of statistical modeling, such as time series analysis, spatial autoregressive modeling and simulation modeling.

Summary of core statistical models in R (after Crawley, 2007)

| Model function | Description |
|---|---|
| lm | fits a linear regression model with Normal errors and constant variance; generally this is used for regression analysis using continuous explanatory variables |
| aov | fits analysis of variance models with Normal errors, constant variances and the identity link; generally used for categorical explanatory variables or for a mix of categorical and continuous explanatory variables (analysis of covariance) |
| glm | fits generalized linear models (GLIM, not to be confused with GLM) to data using categorical or continuous explanatory variables by specifying one of a family of error structures (e.g. Poisson for count data, Binomial for proportion data) and a particular link function |
| gam | fits generalized additive models to data with one of a family of error structures (as per glm, above) in which the explanatory variables can (optionally) be fitted as arbitrary smoothed functions using non-parametric smoothers rather than specific parametric functions |
| lme, lmer | fit linear mixed effects models with specified mixtures of fixed effects and random effects and allow for the specification of a correlation structure among the explanatory variables and autocorrelation of the response variable (e.g. time series effects with repeated measures). lmer allows for non-Normal errors and non-constant variance with the same error families as glm |
| nls | fits non-linear regression models by non-linear least squares |
| nlme | fits a specified non-linear function in a mixed-effects model where the parameters of the non-linear function are assumed to be random effects; allows for the specification of correlation structure among the explanatory variables and autocorrelation of the response variable (as per the lmer function) |
| loess (lowess) | fits a local regression model with one or more continuous explanatory variables using non-parametric techniques to produce a smoothed curve or model surface |
| tree | fits a regression tree model using binary recursive partitioning whereby the data are successively split along coordinate axes of the explanatory variables so that at any node, the split is chosen that maximally distinguishes the response variable in the left and right branches. With a categorical response variable the tree is called a classification tree and the model used for classification assumes that the response variable follows a multinomial distribution |
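The splitting step at the heart of binary recursive partitioning can be sketched in Python for a single continuous explanatory variable. The criterion assumed here is the reduction in the residual sum of squares, which is one common choice; the text does not specify the criterion. The data and the function names `sse` and `best_split` are illustrative. A full tree would apply the same search recursively to each branch and across all explanatory variables.

```python
def sse(values):
    """Sum of squared deviations of the values from their mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the threshold on x that minimises the combined sum of squares
    of the response in the left and right branches."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Illustrative data with an obvious step in the response between x=4 and x=10.
x = [1, 2, 3, 4, 10, 11, 12, 13]
y = [5.0, 5.2, 4.9, 5.1, 9.0, 9.2, 8.8, 9.1]

threshold, cost = best_split(x, y)
```

In this example the chosen threshold falls midway between the two clusters of x values, since that split leaves the least unexplained variation in each branch.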


[CRA1] Crawley M J (2007) The R Book. J Wiley & Sons, Chichester, UK. 2nd ed. 2015