## Measures of spread

The simplest measure of the spread of a distribution is the range, that is, the difference between the largest and smallest values recorded. However, this provides only very limited information regarding the pattern of spread, and several other measures are used in conjunction with, or in preference to, the range. Amongst these are the so-called five number summary values (the minimum and maximum values, the median or middle value, and the upper and lower quartile values) and the variance (the mean squared deviation of observations from the mean).

If a sample dataset is arranged in size order, from smallest to largest, then the five number summary values are often computed and displayed graphically using box plots (see further, below). Box plots (or box-whisker plots) are a form of exploratory data analysis (EDA) provided in many data analysis and graphing packages (e.g. Minitab, SPSS, STATA, Grapher, Matplotlib, Mondrian, WinBUGS, GeoXP). Together with distribution plots and scatter plots they provide one of the three main ways in which statistical data are examined graphically.

Because box plots are less familiar to many, and of particular use in examining outliers, they are described in some detail here (see figure, below). The box plots in this diagram are for a set of radioactivity observations made at 1008 sites in Germany on one particular day in 2004, with some minor modifications for the purposes of this plotting exercise. The plot on the left (Column 1) is a summary representation of readings made at 200 of the sites. The plot on the right (Column 2) shows data from a further 808 locations and their readings. Side-by-side box plots provide a quick way of comparing the pattern of spread of two or more distributions.

A box plot consists of a number of distinct elements. The example in the diagram above was generated using the MATLAB Statistics Toolbox and we provide definitions below that apply to this particular implementation:

• The lower and upper lines of the "box" in the center of the plot window are the 25th and 75th percentiles of the sample (the lower quartile and the upper quartile). The distance between the top and bottom of the box is the inter-quartile range (IQR).

• The line in the middle of the box is the sample median. If the median is not centered in the box it is an indication of skewness.

• The whiskers are lines extending above and below the box. They show the extent of the rest of the sample (unless there are outliers). Assuming no outliers, the maximum of the sample is the top of the upper whisker and the minimum of the sample is the bottom of the lower whisker (as in the Column 2 plot illustrated). By default, an outlier is a value that is more than 1.5 times the IQR away from the top or bottom of the box (a hinge value of 1.5), so with outliers the whiskers and hinge line show a form of trimmed range, i.e. excluding the outliers (n.b. the term hinge is also used in statistics to refer to locations within the main data range, in some instances matching the upper and lower quartile values).

• A symbol, e.g. a small circle, at the top and/or bottom of the plot is an indication of an outlier in the data. This point may be the result of a data entry error, a poor measurement or perhaps a highly significant observation.

• The notches in the box are a graphic confidence interval about the median of a sample. A side-by-side comparison of two notched box plots is sometimes described as the graphical equivalent of a t-test. Box plots do not have notches by default.
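The box plot elements listed above can be computed directly from a sample. The following is a minimal sketch in Python (standard library only), not the MATLAB implementation: the function name `box_plot_stats`, the `hinge` parameter name and the readings are illustrative assumptions, and the quartiles use linear interpolation, so other packages may give slightly different box edges.

```python
import statistics

def box_plot_stats(values, hinge=1.5):
    """Return the main elements of a box plot for a list of numbers (hypothetical helper)."""
    data = sorted(values)
    # Quartiles by linear interpolation between order statistics.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Values beyond these fences are treated as outliers.
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Whiskers end at the most extreme values that are not outliers.
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up readings with one unusually large value.
readings = [52, 55, 58, 60, 61, 63, 64, 66, 70, 120]
stats = box_plot_stats(readings)
print(stats)
```

With these made-up readings the value 120 lies more than 1.5 times the IQR above the box, so the upper whisker stops at 70 and 120 is reported separately as an outlier.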

The subsections which follow provide brief descriptions of each of the main measures of univariate distribution spread used within the statistics literature. Many software packages provide such measures as individual functions and as composite (or summary) functions, whenever a suitable dataset is analyzed, i.e. as a basic form of Exploratory Data Analysis (EDA). For example, in the case of the "R" environment, the "quantile()" and "summary()" functions provide many measures of spread, as described below:

R: quantile(x), where x is an input vector of data values and the output is a vector containing the minimum (min), lower quartile (LQ), median, upper quartile (UQ) and maximum (max) values. For example, taking the swiss dataset, Agriculture column (a column vector of 47 items), we have:

| 0% | 25% | 50% | 75% | 100% |
|---|---|---|---|---|
| 1.20 | 35.90 | 54.10 | 67.65 | 89.70 |
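For readers working outside R, a five number summary can be sketched in Python with the standard library. This is an illustration only: `five_number_summary` is a hypothetical helper name, the data are made up, and the `"inclusive"` method is assumed here to be the closest match to the linear interpolation R uses by default.

```python
import statistics

def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max) for a list of numbers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Made-up sample, deliberately unsorted.
summary = five_number_summary([7, 1, 3, 9, 5])
print(summary)  # (1, 3.0, 5.0, 7.0, 9)
```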

An R function that provides the same information for all data columns, plus additional information, is summary(). This operates on the type of object specified in the argument, in a manner that makes sense for that object (e.g. a multi-column vector object, or matrix; or a model that has been run). For the swiss dataset, which is effectively a matrix, this yields:

| | Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|---|
| Min. | 35.00 | 1.20 | 3.00 | 1.00 | 2.150 | 10.80 |
| 1st Qu. | 64.70 | 35.90 | 12.00 | 6.00 | 5.195 | 18.15 |
| Median | 70.40 | 54.10 | 16.00 | 8.00 | 15.140 | 20.00 |
| Mean | 70.14 | 50.66 | 16.49 | 10.98 | 41.144 | 19.94 |
| 3rd Qu. | 78.45 | 67.65 | 22.00 | 12.00 | 93.125 | 21.70 |
| Max. | 92.50 | 89.70 | 37.00 | 53.00 | 100.000 | 26.60 |

The Range is simply the difference between the maximum and minimum values of a set. Thus, for a set of n values arranged in size order, $\text{Range}\{x_i\}=x_n-x_1$. With a sample of size n, the mean range is simply Range/n.

R: range(x)

Lower quartile (25%), LQ or Q1

In an ordered set, 25% of data items are less than or equal to the upper bound of this range. For a continuous distribution the LQ or Q1 is the set of values from 0% to 25% (0.25) obtained from the cumulative distribution of the values or function. The treatment of cases where n is even or odd, and where i runs from 1 to n or from 0 to n, varies between implementations. $LQ=\{x_1,\ldots,x_{(n+1)/4}\}$. If Q2 is the median of a set of data items, Q1 is the median of the values from the minimum up to and including Q2. The R function quantile(x) provides the minimum, maximum, median and lower and upper quartiles.

Upper quartile (75%), UQ or Q3

In an ordered set, 75% of data items are less than or equal to the lower bound of this range. For a continuous distribution the UQ is the set of values from 75% (0.75) to 100% obtained from the cumulative distribution of the values or function. The treatment of cases where n is even or odd, and where i runs from 1 to n or from 0 to n, varies between implementations. $UQ=\{x_{3(n+1)/4},\ldots,x_n\}$. If Q2 is the median of a set of data items, Q3 is the median of the values from the maximum down to and including Q2.

Inter-quartile range, IQR, Q3-Q1

The difference between the lower and upper quartile values, hence covering the middle 50% of the distribution. The inter-quartile range can be obtained by taking the median of the dataset, then finding the median of the upper and lower halves of the set. The IQR is then the difference between these two secondary medians. IQR=UQ-LQ=Q3-Q1. The IQR is a robust measure of spread as it is unaffected by outliers in the upper and lower tails of a sample.
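Because quartile conventions differ, the same data can yield different IQR values. The sketch below compares three conventions in Python. The helper `quartiles_median_of_halves` is a hypothetical name that follows the "median of each half, including Q2" description given above, the two `statistics.quantiles` methods are standard library options, and the data are made up.

```python
import statistics

def quartiles_median_of_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves, each including the median item."""
    data = sorted(values)
    n = len(data)
    lower = data[:(n + 1) // 2]   # minimum up to and including Q2 (when n is odd)
    upper = data[n // 2:]         # Q2 (when n is odd) up to the maximum
    return statistics.median(lower), statistics.median(upper)

data = [1, 3, 4, 6, 8, 9, 12, 15, 20]  # made-up ordered sample, n = 9

q1_h, q3_h = quartiles_median_of_halves(data)
q1_i, _, q3_i = statistics.quantiles(data, n=4, method="inclusive")
q1_e, _, q3_e = statistics.quantiles(data, n=4, method="exclusive")

iqr_halves = q3_h - q1_h
iqr_inclusive = q3_i - q1_i
iqr_exclusive = q3_e - q1_e
print(iqr_halves, iqr_inclusive, iqr_exclusive)  # 8 8.0 10.0
```

For this sample two of the conventions agree and the third gives a wider IQR, which is why reported quartiles should state the method used.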

Trimmed range, TR

The range computed with a specified percentage (proportion), t/2, of the highest and of the lowest values removed, in order to eliminate outliers and extreme values. For small samples a specific number of observations (e.g. 1), rather than a percentage, may be ignored. In general an equal number, k, of high and low values are removed (if possible). $TR_t=x_{n(1-t/2)}-x_{nt/2},\ t\in[0,1]$; $TR_{50\%}=IQR$.
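As a small illustration of trimming by a fixed count, the following Python sketch drops the k lowest and k highest observations before taking the range. The function name `trimmed_range` and the sample values are assumptions made for this example only.

```python
def trimmed_range(values, k=1):
    """Range after removing the k lowest and k highest observations."""
    data = sorted(values)
    if 2 * k >= len(data):
        raise ValueError("k is too large for this sample")
    return data[-1 - k] - data[k]

values = [3, 7, 8, 9, 10, 11, 12, 48]  # made-up sample with one extreme value

full_range = trimmed_range(values, k=0)
trimmed = trimmed_range(values, k=1)
print(full_range, trimmed)  # 45 5
```

Removing a single observation from each end reduces the range from 45 to 5 here, showing how strongly one extreme value can dominate the untrimmed range.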

Variance, Var

The variance is the average squared difference of values in a dataset from their population mean, μ, or from the sample mean (in which case it is known as the sample variance, the data being a sample from a larger population). For angular measure see circular variance, below.

This measure of spread is one of the most widely used and is often described as the mean squared deviation (MSD). Differences are squared to remove the effect of negative values (the summation would otherwise be 0). The squaring process has other results, however. These include greatly increasing the weight given to large (positive or negative) values, with the result that the variance can be large as a result of the contribution from a few outliers — to this extent it is not regarded as being a robust statistic. The size of the variance is also out of scale with the original data and is typically adjusted by taking the square root in order to rescale the measure to the data, giving the root mean squared deviation (RMSD) or standard deviation.

The population variance is often denoted by the symbol $\mu_2$ or $\sigma^2$. The estimated population variance (sample variance) is often denoted by $s^2$ or by $\hat{\sigma}^2$. With two variables, x and y, that vary together, the term co-variance is used and the notation $S_{xy}$ indicates the co-variance of x and y. With this notation $S_{xx}$ then means the co-variance of x with itself, which is, of course, the variance. The basic formula for variance for a discrete dataset is:

$$\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2$$

or with data grouped into frequency classes, with the $x_i$ defined as the mid-value of each frequency class, $f_i$ the frequency of class i and $n=\sum_i f_i$:

$$\sigma^2=\frac{1}{n}\sum_{i}f_i(x_i-\bar{x})^2$$

If the term in brackets is squared and simplified, this is equivalent to:

$$\sigma^2=\frac{1}{n}\sum_{i}f_i x_i^2-\bar{x}^2$$

The continuous distribution equivalent to this, for a probability density function f(x) with mean μ, is

$$\sigma^2=\int (x-\mu)^2 f(x)\,dx$$

where the integral is over the full range of possible x-values. This formula can be expanded as illustrated below, in the same manner as for the discrete case, which demonstrates that the variance can also be computed as the 2nd crude moment minus the square of the first crude moment (the mean):

$$\sigma^2=\int x^2 f(x)\,dx-2\mu\int x f(x)\,dx+\mu^2\int f(x)\,dx=\int x^2 f(x)\,dx-\mu^2$$

The variance is non-negative and equals 0 only if all values are identical (there is no variation). If a and b are constants, $Var(aX+b)=Var(aX)+Var(b)=a^2Var(X)$, with the constant term b dropping out as its variance is 0.
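Both the "second moment minus squared mean" form and the scaling rule for Var(aX+b) can be checked numerically. The Python sketch below uses exact rational arithmetic so that the equalities hold without rounding. The function names and the integer data are illustrative assumptions, and the divisor is n (population form).

```python
from fractions import Fraction

def population_variance(values):
    """Mean squared deviation from the mean (divisor n); expects integer data."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((x - mean) ** 2 for x in values) / n

def variance_from_moments(values):
    """Second crude moment minus the square of the mean; expects integer data."""
    n = len(values)
    mean = Fraction(sum(values), n)
    second_moment = Fraction(sum(x * x for x in values), n)
    return second_moment - mean ** 2

x = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data with mean 5
a, b = 3, 10                   # arbitrary constants
transformed = [a * v + b for v in x]

print(population_variance(x))            # 4
print(variance_from_moments(x))          # 4
print(population_variance(transformed))  # 36, i.e. a**2 * Var(x)
```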

If the variance is computed from a sample of size n using the divisor n, the result will typically underestimate the population variance, because deviations are measured from the sample mean, which is itself fitted to the sample, rather than from the population mean. To correct for this underestimation the sum of squared deviations is divided by n-1 rather than n, and this produces an unbiased estimate of the population variance. The adjusted divisor, n-1, which is derived below, is also referred to as the degrees of freedom, since the computation involves the sample mean, which effectively pre-determines the choice of 1 data item, with the remaining n-1 available to be freely selected.

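The proof of this result is not often presented, but is as follows: Let μ be the population mean and $\sigma^2$ the population variance, with $x_i$ as a single observation from a sample of size n from the population, and $\bar{x}$ the sample mean. Then we have:

$$\sum_{i=1}^{n}(x_i-\mu)^2=\sum_{i=1}^{n}\left[(x_i-\bar{x})+(\bar{x}-\mu)\right]^2=\sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2+2(\bar{x}-\mu)\sum_{i=1}^{n}(x_i-\bar{x})$$

but by definition, the sum of the deviations of the $x_i$ from their mean is 0, so the last term drops out and this expression reduces to:

$$\sum_{i=1}^{n}(x_i-\mu)^2=\sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2$$

For a large number of samples, the mean value of the left hand side of this expression will tend to $n\sigma^2$ by the definition of $\sigma^2$, whilst at the same time the mean value of the second term on the right hand side will tend to $n(\sigma^2/n)$ (since the variance of the sample mean is $\sigma^2/n$), giving

$$E\left[\sum_{i=1}^{n}(x_i-\bar{x})^2\right]=n\sigma^2-\sigma^2=(n-1)\sigma^2$$

so that dividing the sum of squared deviations by n-1 yields an unbiased estimate of $\sigma^2$.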

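If there are k unweighted samples, with $n_j$ observations $x_{ij}$ and sample mean $\bar{x}_j$ in sample j, the combined variance can be computed using the formula:

$$s^2=\frac{1}{n-k}\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij}-\bar{x}_j)^2$$

where n is the total number of observations across all samples and n-k represents the degrees of freedom of the estimate (one degree of freedom is used for each of the k mean values).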
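The combined (pooled) variance can be sketched in a few lines of Python. This assumes the usual pooled form in which squared deviations are taken about each sample's own mean and summed over all samples. `pooled_variance` is a hypothetical helper name and the two samples are made up.

```python
import statistics

def pooled_variance(samples):
    """Combined variance of k samples with n - k degrees of freedom."""
    k = len(samples)
    n = sum(len(s) for s in samples)
    total_ss = 0.0
    for s in samples:
        m = statistics.fmean(s)                      # mean of this sample only
        total_ss += sum((x - m) ** 2 for x in s)     # squared deviations about it
    return total_ss / (n - k)

samples = [[1, 2, 3], [2, 4, 6, 8]]  # made-up samples, n = 7, k = 2
combined = pooled_variance(samples)
print(combined)
```

For these samples the within-sample sums of squares are 2 and 20, so the combined variance is 22/5 = 4.4.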

R: var(x) — sample variance
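The effect of the n-1 divisor can be verified exactly on a small case. The Python sketch below is an illustration under stated assumptions: the population is the six faces of a die, and every ordered sample of size 3 drawn with replacement is treated as equally likely. It averages the variance estimates for both divisors using exact fractions.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # assumed population: faces of a fair die
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance, 35/12

def sum_sq_dev(sample):
    """Sum of squared deviations of a sample from its own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)

n = 3
# Every ordered sample of size n drawn with replacement (6**3 = 216 samples).
all_samples = list(product(population, repeat=n))

mean_div_n = sum(sum_sq_dev(s) / n for s in all_samples) / len(all_samples)
mean_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in all_samples) / len(all_samples)

print(sigma2, mean_div_n, mean_div_n_minus_1)  # 35/12 35/18 35/12
```

Averaged over all possible samples, the divisor n gives (n-1)/n of the population variance, while the divisor n-1 recovers it exactly.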

Circular variance

Data that represent angular measure, or circular data, must be analyzed in a different manner from conventional real-valued datasets. For example, the circular mean is computed by treating the directional observations, $\{\theta_i\}$, as lying on a unit circle, hence regarding the observations as vectors. These are transformed using cos() and sin() functions to obtain a dataset from which the direction of the mean or resultant vector, r, can be calculated. The circular variance is calculated from the length of the mean vector through the simple formula $var=1-|r|$ and the standard deviation is then calculated as:

$$s=\sqrt{-2\ln|r|}$$

The result is in radians, and may be converted to degrees by multiplying by 180/π.
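A short Python sketch of these circular calculations follows. It assumes input angles in degrees and the standard deviation formula $\sqrt{-2\ln|r|}$. The function name `circular_summary` and the three bearings are illustrative only.

```python
import math

def circular_summary(angles_deg):
    """Return (mean direction in degrees, circular variance, circular SD in degrees)."""
    angles = [math.radians(a) for a in angles_deg]
    n = len(angles)
    c = sum(math.cos(t) for t in angles) / n   # mean x-component
    s = sum(math.sin(t) for t in angles) / n   # mean y-component
    r = math.hypot(c, s)                       # length of the mean vector
    mean_dir = math.degrees(math.atan2(s, c)) % 360
    variance = 1 - r
    # Undefined when r == 0 (observations cancel out completely).
    sd_deg = math.degrees(math.sqrt(-2 * math.log(r)))
    return mean_dir, variance, sd_deg

bearings = [350, 10, 30]   # made-up directions straddling north
mean_dir, variance, sd_deg = circular_summary(bearings)
print(mean_dir, variance, sd_deg)
```

The arithmetic mean of these bearings is 130 degrees, which points in entirely the wrong direction. The circular mean is 10 degrees, as expected for observations spread symmetrically about that bearing.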

Standard deviation, SD, s or RMSD

The square root of the variance, hence it is the Root Mean Squared Deviation (RMSD). The population standard deviation is often denoted by the symbol σ. The estimated population standard deviation is typically denoted by $\hat{\sigma}$ or by s.

Standard error of the mean, SE

The estimated standard deviation of the means of samples of size n drawn from the same population. It is simply the sample standard deviation, s, reduced by a factor equal to the square root of the sample size, n≥1: $SE=s/\sqrt{n}$.
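As a brief illustration, the standard error can be computed in Python from the sample standard deviation (divisor n-1). `standard_error` is a hypothetical helper name and the sample is made up.

```python
import math
import statistics

def standard_error(values):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(values) / math.sqrt(len(values))

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample, n = 8
se = standard_error(sample)
print(round(se, 4))  # 0.7559
```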

Root mean squared error (RMSE) and mean squared error (MSE)

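The square root of the mean squared deviation of samples, $x_i$, from a known set of true values, $z_i$; the mean squared error (MSE) is the same quantity before the square root is taken. If the set $z_i$ is estimated by the mean of the sampled values then RMSE is equivalent to the RMSD (i.e. the standard deviation):

$$RMSE=\sqrt{MSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-z_i)^2}$$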

Mean deviation/error, MD or ME

The mean deviation of samples, $x_i$, from the known set of true values, $z_i$:

$$ME=\frac{1}{n}\sum_{i=1}^{n}(x_i-z_i)$$

Generally this measure is provided as an indicator of error, and more typically absolute deviations are taken to form the mean absolute deviation or mean absolute error, as described below.

Mean absolute deviation/error, MAD or MAE

The mean absolute deviation is a measure of the deviation of sample data, $x_i$, from a known set of true values, $z_i$:

$$MAE=\frac{1}{n}\sum_{i=1}^{n}|x_i-z_i|$$

Often the set of true values is replaced by a measure of centrality estimated from the sample data, m(x), such as the mean or median of the samples:

$$MAD=\frac{1}{n}\sum_{i=1}^{n}|x_i-m(x)|$$

When m(x) is the median of the sample the measure is sometimes referred to as the median absolute deviation. Although less mathematically convenient than the standard deviation it is more robust, since values are not squared. Unlike the standard deviation it is always finite.
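The error measures described above differ only in how each deviation is treated before averaging. The Python sketch below shows them side by side. All function names and the observed and true values are assumptions made for illustration.

```python
import math
import statistics

def mean_error(x, z):
    """Mean of signed deviations from the true values."""
    return sum(xi - zi for xi, zi in zip(x, z)) / len(x)

def mean_absolute_error(x, z):
    """Mean of absolute deviations from the true values."""
    return sum(abs(xi - zi) for xi, zi in zip(x, z)) / len(x)

def rmse(x, z):
    """Square root of the mean squared deviation from the true values."""
    return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)) / len(x))

def mean_absolute_deviation(x, centre=statistics.fmean):
    """Mean absolute deviation about a measure of centrality, m(x)."""
    m = centre(x)
    return sum(abs(xi - m) for xi in x) / len(x)

observed = [2.0, 4.0, 6.0, 8.0]       # made-up sample values
true_values = [1.0, 5.0, 6.0, 10.0]   # made-up "known" values

print(mean_error(observed, true_values))           # -0.5
print(mean_absolute_error(observed, true_values))  # 1.0
print(rmse(observed, true_values))
print(mean_absolute_deviation(observed))           # 2.0
```

Positive and negative errors partly cancel in the mean error, which is why the absolute and squared forms are generally preferred as indicators of error size.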

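Covariance, Cov

Literally the pattern of common (or co-) variation observed in a collection of two (or more) datasets, or partitions of a single dataset. For paired observations $x_i$ and $y_i$ with means $\bar{x}$ and $\bar{y}$:

$$Cov(x,y)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$$

Note that if the two sets are the same the covariance is the same as the variance: $Cov(x,x)=Var(x)$. This is sometimes written as $S_{xx}$, with the standard deviation written as $S_x$ or $SD_x$.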

Correlation coefficient, r

The product moment correlation or Pearson’s correlation coefficient, r, is based on the computation of covariance. It is a measure of the similarity between two (or more) paired datasets and is the ratio of the covariance to the product of the standard deviations. If the two datasets are the same or perfectly matched this will give a result r=1, where $r=Cov(x,y)/(SD_x SD_y)$ or:

$$r=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
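The relationship between covariance, variance and the correlation coefficient can be sketched in Python as follows. The function names and the paired data are illustrative assumptions, and the divisor n is used throughout (it cancels in the ratio that defines r).

```python
import math

def covariance(x, y):
    """Mean product of paired deviations from the respective means (divisor n)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1, 2, 3, 4, 5]   # made-up paired data
y = [2, 4, 5, 4, 5]

print(covariance(x, x))   # 2.0, the variance of x
print(covariance(x, y))
print(pearson_r(x, y))
print(pearson_r(x, x))    # 1.0 for identical datasets
```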

Coefficient of variation, CV

The ratio of the standard deviation to the mean, sometimes computed as a percentage. If this ratio is close to 1, and the distribution is strongly right (positively) skewed, with most values concentrated at the low end, it may suggest the underlying distribution is Exponential. Note, mean values close to 0 may produce unstable results.

CV=100σ/μ

Variance mean ratio, VMR

The ratio of the variance to the mean, sometimes computed as a percentage. If this ratio is close to 1, and the distribution is unimodal and relates to count data, it may suggest the underlying distribution is Poisson. Note, mean values close to 0 may produce unstable results.

VMR=σ²/μ
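Both ratios are straightforward to compute. The Python sketch below uses the population forms of the standard deviation and variance (divisor n). The function names and the counts are assumptions for illustration, and no special handling is included for means at or near 0.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return 100 * statistics.pstdev(values) / statistics.fmean(values)

def variance_mean_ratio(values):
    """Variance divided by the mean."""
    return statistics.pvariance(values) / statistics.fmean(values)

counts = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up count data: mean 5, variance 4

cv = coefficient_of_variation(counts)
vmr = variance_mean_ratio(counts)
print(cv, vmr)
```

Here the CV is 40% and the VMR is 0.8.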