Measures of Complexity and Model selection


The statistical notion of complexity differs somewhat from broader, popular notions, although in some situations statistical methods do examine how 'complex' a particular statistical model is. Typically, complexity measures are based on Shannon's information statistic, which derives from Shannon's analysis of the information content (or lack of it) in a message, where the message is treated as a coded series of independent and identically distributed (iid) random variables. He showed that the average length of the shortest possible encoded message is obtained as Shannon's information statistic (or Shannon's Entropy statistic) divided by the logarithm of the number of symbols used in the coding scheme. This measure gives rise to the so-called Diversity statistic. In the context of statistical thermodynamics the information statistic, I, provides a measure of the overall state of a system. Typically this is expressed using the Gibbs Entropy formula, which is of the form κI, where κ is a constant (known as the Boltzmann constant).

In the context of model complexity, a variety of measures have been proposed, with various forms of Information Criterion statistics being widely adopted. Most frequently encountered are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). However, recently a further information-theoretic approach, known as Minimum Description Length (or MDL), has gained popularity. The central idea behind MDL is that there is a relationship between the underlying regularity in a data set (and hence one's ability to model such regularity) and one's ability to compress the data. A dataset that is highly compressible is one that exhibits greater regularity than a dataset that is random. Many software packages provide AIC and BIC/SBC measures, though few also offer the MDL option. Typically these measures are provided for comparing alternative regression models, time series models and clustering algorithms. Some packages, such as SAS/STAT, provide a broader range of such measures, described as parsimony indicators, i.e. indicators of how parsimonious a model is in its use of parameters for a given level of model fit.

Information statistic (Entropy), I (Shannon’s I)

A measure of the amount of pattern, disorder or information in a set {xi}, where pi is the proportion of events or values occurring in the ith class or range:

I = -Σ pi log2(pi), summed over i = 1, 2, ..., k

Note that if pi = 0 then pi log2(pi) is taken as 0. I takes values in the range [0, log2(k)]. The lower limit arises when all data fall into one category, whilst the upper limit arises when the data are spread evenly across all k categories. This is simplest to see with an example, as in the table below. Here there are 10 classes (i = 1, 2, ..., 10). In the first example almost all observations are in the first class, i = 1, and the product p log2(p) is close to 0 for every class, so I ≈ 0. The second example shows the opposite extreme, with all classes having the same probability; the statistic yields a value of 3.32, which is the maximum value for 10 classes and is simply log2(10).

 i   |  Example 1                           |  Example 2
     |  p            log2(p)   -p·log2(p)  |  p      log2(p)   -p·log2(p)
 1   |  0.9999900..    -0.00      0.00     |  0.10    -3.32      0.33
 2   |  0.0000011..   -19.78      0.00     |  0.10    -3.32      0.33
 3   |  0.0000011..   -19.78      0.00     |  0.10    -3.32      0.33
 4   |  0.0000011..   -19.78      0.00     |  0.10    -3.32      0.33
 5   |  0.0000011..   -19.78      0.00     |  0.10    -3.32      0.33
 6   |  0.0000011..   -19.78      0.00     |  0.10    -3.32      0.33
 7   |  0.0000011..   -19.78      0.00     |  0.10    -3.32      0.33
 8   |  0.0000011..   -19.78      0.00     |  0.10    -3.32      0.33
 9   |  0.0000011..   -19.78      0.00     |  0.10    -3.32      0.33
 10  |  0.0000011..   -19.78      0.00     |  0.10    -3.32      0.33
 Sum |  1.00                      0.00     |  1.00               3.32
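The two worked examples above are easy to verify with a short script. Below is a minimal Python sketch of Shannon's I (the helper name shannon_entropy and the epsilon value used to mimic Example 1 are our own, chosen to match the table):

```python
import math

def shannon_entropy(probs):
    """Shannon's information statistic I = -sum(p * log2(p)), taking 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example 1: almost all observations in the first class (eps chosen to match 0.0000011..)
eps = 1.1e-6
example1 = [1 - 9 * eps] + [eps] * 9

# Example 2: observations spread evenly across all 10 classes
example2 = [0.1] * 10

print(round(shannon_entropy(example1), 2))  # 0.0  - close to the lower limit
print(round(shannon_entropy(example2), 2))  # 3.32 - the maximum, log2(10)
```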

Information statistic (Diversity), Div

If Shannon’s entropy statistic (see above) is standardized by its maximum possible value for k classes, log2(k), it provides a measure known as the Diversity, with a range of values from 0 to 1:

Div = I / log2(k)

In the examples illustrated in the table above, the first case has a diversity value of (close to) 0, and the second a value of 1.
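A minimal Python sketch of the diversity measure, reusing the entropy calculation from above (function names are our own):

```python
import math

def shannon_entropy(probs):
    """Shannon's information statistic I = -sum(p * log2(p)), taking 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def diversity(probs):
    """Div = I / log2(k): entropy standardized by its maximum for k classes."""
    k = len(probs)
    return shannon_entropy(probs) / math.log2(k)

print(round(diversity([0.1] * 10), 6))  # 1.0 for the uniform (Example 2) case
```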

Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC)

The AIC and BIC measures are used to evaluate how effectively alternative models describe a dataset. Complex models often achieve an improved fit to data by their use of a large number of parameters. However, as the number of parameters increases it can be argued that the model provides a diminishing level of information about the data, until the point where the number of parameters equals the number of data items, and the model provides no additional information (or insight) into the data. Models with too many parameters are sometimes described as over-fitted.

Suppose a particular model, M, is fitted to a dataset comprising n observations, using k parameters. Each observation is either fully explained by the model or has some difference (or residual) in its value from that predicted by the model. The sum of squared residuals, RSS, is a measure of how well the model fits the data. The AIC statistic for model M, in its simplest form, is computed as:

AIC = n ln(RSS/n) + 2k

or more generally as:

AIC = qk - 2ln(L)

where L is the maximized value of the likelihood function and q (which is effectively a penalty factor weighting the number of parameters in the model) is usually taken as 2. A corrected version of the AIC formula (known as AICc) is often applied where n/k is small (less than 40), with the 2k element being replaced by 2kC, where C = n/(n-k-1). If a set of alternative models is fitted to the data and the AICc is computed for each model, the model with the lowest AICc is regarded as that providing the best compromise between the quality of fit and the model complexity.
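A minimal Python sketch of the least-squares form of the AIC and its small-sample correction; the RSS values and model sizes below are hypothetical, purely for illustration:

```python
import math

def aic(n, k, rss):
    """AIC in its simplest (least squares) form: n*ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k

def aicc(n, k, rss):
    """Small-sample corrected AIC: the 2k term is scaled by C = n/(n - k - 1)."""
    return n * math.log(rss / n) + 2 * k * n / (n - k - 1)

# Hypothetical fits: model B buys a slightly lower RSS with five extra parameters
a = aicc(n=40, k=3, rss=120.0)
b = aicc(n=40, k=8, rss=110.0)
print(a < b)  # True: the simpler model gives the better fit/complexity compromise here
```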

The BIC measure (also known as the Schwarz Criterion or SBC) is similar to the AIC but is stricter, in the sense that additional parameters are penalized to a greater degree, with the effect that models of lower complexity tend to be selected. If the data are Normally distributed the BIC can be written as:

BIC = n ln(σe²) + k ln(n)

where σe² is the error variance. More generally this is written, as per the AIC, as:

BIC = k ln(n) - 2ln(L)

Note that the AIC and BIC formulas are the same if q = ln(n). A variant of the above is also often provided, suggested by Bozdogan (1987) [BOZ1] and known as the Consistent AIC (CAIC), which is computed as:

CAIC = -2ln(L) + k(ln(n) + 1)
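A minimal Python sketch of the general (likelihood-based) criteria, showing how the same formula yields AIC or BIC depending on the penalty q; the log-likelihood, parameter count and sample size are hypothetical:

```python
import math

def ic(log_l, k, q=2.0):
    """Generic information criterion q*k - 2*ln(L), where log_l = ln(L).
    q = 2 gives the AIC; q = ln(n) gives the BIC."""
    return q * k - 2 * log_l

def caic(log_l, k, n):
    """Bozdogan's Consistent AIC: -2*ln(L) + k*(ln(n) + 1)."""
    return -2 * log_l + k * (math.log(n) + 1)

# Hypothetical fitted model: ln(L) = -150, k = 4 parameters, n = 100 observations
log_l, k, n = -150.0, 4, 100
print(ic(log_l, k))                 # AIC:  308.0
print(ic(log_l, k, q=math.log(n)))  # BIC:  larger, since ln(100) > 2
print(caic(log_l, k, n))            # CAIC: larger still - each parameter costs ln(n)+1
```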

The R software environment includes the function AIC(object, ..., k = 2), which, by varying the penalty parameter k, yields both the AIC and BIC measures (a separate BIC() function is also provided). In the default example for this function a linear multiple regression model is fitted to a dataset comprising a single dependent variable, fertility, and five predictor variables, each defined for 47 districts of Switzerland (the swiss dataset). The computed values for this model are AIC: 326.07 and BIC: 339.02 - note that the absolute level of these measures is not meaningful. If the same analysis is carried out with only 3 of the independent variables (denoted Education, Catholic and Agriculture) the AIC is slightly higher, at 331. The advantage of using fewer explanatory variables may nevertheless make the simpler model preferable, especially as it has little impact on the AIC measure. A somewhat different example is provided in de Smith et al. (2009) [DES1], where three different types of regression model are applied to educational attainment data from Georgia, USA. The AIC score in this example was lowest for the model with the highest number of (effective) parameters, which suggests that the model with the lowest score (geographically weighted regression, or GWR) has improved explanatory capability despite its greater apparent complexity - i.e. the improved fit of the model more than outweighed the need to fit additional parameters.

Minimum Description Length (MDL)

As noted above, there is a relationship between the underlying regularity in a data set and one's ability to compress the data. A dataset that is highly compressible is one that exhibits greater regularity than a dataset that is random. In our discussion on probability theory we noted that for finite sets of M possible outcomes (events), typical event probabilities are of the order of 2^(-M) or less.

Now consider a related question, which arises in information theory: that of representing a binary sequence by another, shorter, binary sequence in a loss-less manner. This representation of a longer string by one or more shorter strings is a form of modeling, and the reduction in the amount of data to be stored is known as compression. The simplest binary sequence is of length 1, and is either {0} or {1}, i.e. two possible outcomes. The next simplest is the set {0,0},{0,1},{1,0},{1,1}, i.e. four possible outcomes. In general, with M bits there are 2^M possible binary strings (arrangements of the 0/1 bits in the string). Of these strings, one is of the form {0,0,...,0}, i.e. all zeros, and one is of the form {1,1,...,1}, i.e. all ones. Thus if we take a binary string, C, and use it as a code to describe the possible binary strings, a code of length 1 bit can only describe 2 of the 2^M strings. A 2-bit code could describe 4 of the outcomes, and more generally a k-bit code could describe 2^k of the strings. It follows that at most 2^(M-k) of the 2^M strings can be assigned codes of length M-k bits, i.e. a fraction 2^(M-k)/2^M = 2^(-k). Thus only a very small fraction of the 2^M possible strings can be substantially compressed - most require codes of close to their full length. There is a connection between this idea of coding strings of possible outcomes and that of probability distributions - in broad terms, code lengths can be regarded as probabilities. More specifically, if P(x) is some probability distribution, C is a coding scheme, and LC(x) is the length of the code string used for the set {x}, then:

LC(x) = -log2 P(x)

Hence if P(x) = 1/8 the binary code length is -log2(1/8) = 3. This relationship is two-way, in the sense that if a particular probability distribution is used to describe some observed dataset, then there is a set of codes corresponding to that distribution, with code lengths that are long for small probabilities and short for large probabilities. Distributions that fit the data well have shorter corresponding code lengths, and this relationship enables one to use code lengths as a means of selecting the 'best' from a set of alternative models or distributions. In practice the MDL approach is very similar to the BIC approach, but includes an extra term. Where BIC comprises two terms, an error term and a model complexity term, MDL adds a third element, which is sometimes described as a 'curvature' term. In fact, the BIC and MDL approaches are indistinguishable under certain conditions relating to the selection of the Bayesian priors.
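The probability/code-length correspondence is easy to illustrate. A minimal Python sketch (the function name is our own):

```python
import math

def code_length_bits(p):
    """Ideal code length LC(x) = -log2 P(x), in bits: likely outcomes get short codes."""
    return -math.log2(p)

print(code_length_bits(1 / 8))  # 3.0 - the worked example in the text
print(code_length_bits(1 / 2))  # 1.0 - a highly probable outcome needs only one bit
```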

References

[BOZ1] Bozdogan H (1987) Model Selection and Akaike's Information Criterion (AIC): The General Theory and Its Analytical Extensions. Psychometrika, 52, 345-370

[DES1] de Smith M J, Goodchild M F, Longley P A (2009) Geospatial Analysis: A Comprehensive Guide. 3rd ed., Troubador, Leicester, UK

[GRU1] Grunwald P (2008) MDL Tutorial Video lecture. Available from http://videolectures.net/icml08_grunwald_mld/

[HAN1] Hansen M H, Yu B (2001) Model Selection and the Principle of Minimum Description Length. J of the American Statistical Association, 96, 454, 746-774

[SCH1] Schwarz G (1978) Estimating the Dimension of a Model. Annals of Statistics, 6, 461-464

[SHA1] Shannon C, Weaver W (1949) The Mathematical Theory of Communication. Univ. Illinois Press, Urbana, IL, USA

MDL Research Organization: http://www.mdl-research.org/