## Measures of central tendency

Amongst the most basic statistical measures are the various forms of average, or mean value. Because these statistics attempt to summarize a large number of values with a single figure, they represent the dataset through central rather than extreme values, and are therefore referred to as measures of central tendency. In the sections that follow the data are defined as a set of observed values $\{x_i\}$, and each observation may have an associated weight, frequency or probability, $\{f_i\}$. In such cases the total weight or frequency, $N$, is the sum of the individual weights or frequency values:

$$N=\sum_{i} f_i$$

If the individual frequencies are divided by their total, $N$, they are referred to as relative frequencies, or in some cases probabilities, $\{p_i\}$, and by definition sum to 1:

$$p_i=\frac{f_i}{N},\qquad \sum_{i} p_i=1$$

Whilst the arithmetic mean is the most widely used measure in statistics, its principal application is to untransformed continuous-valued variables. With other types of variable alternative measures are more appropriate. For nominal data the mode is useful. For data that are themselves ratios (such as speeds or price to earnings ratios) the harmonic mean is often more appropriate, whilst for percentage growth and rate data the geometric mean may be required. Several of these measures are special instances of the power mean.

The arithmetic mean or average of a set of data values, $\{x_i\}$, is known as the sample mean where the data are a sample from a larger population. If the set $\{f_i\}$ are regarded as weights rather than frequencies the result is known as the weighted mean. The population mean is often denoted by the symbol $\mu$. In many instances the sample mean is the best (unbiased) estimate of the population mean, and it is sometimes denoted by $\hat{\mu}$ (μ with a ^ symbol above it) or by a variable with a bar above it, such as $\bar{x}$. A number of similar formulas are used to compute the arithmetic mean, depending on whether the individual values are weighted by frequencies, weights or probabilities. The simplest form, with no weighting, is:

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_i$$

The frequency-weighted version (in which each group of identical x-values is assigned a frequency value, $f_i$) is:

$$\bar{x}=\frac{\sum_{i} f_i x_i}{\sum_{i} f_i}=\frac{1}{N}\sum_{i} f_i x_i$$
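The unweighted and frequency-weighted forms of the arithmetic mean can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the small dataset are assumptions made for the example, and the same weighted function applies whether the second argument holds frequencies, weights or probabilities.

```python
def arithmetic_mean(values):
    """Unweighted arithmetic mean: sum of the values divided by their count."""
    values = list(values)
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted mean: sum(f_i * x_i) / sum(f_i)."""
    values, weights = list(values), list(weights)
    total = sum(weights)  # N, the total weight or frequency
    return sum(f * x for x, f in zip(values, weights)) / total


# Illustrative data only: the value 2 occurs three times, 5 once and 8 twice.
x = [2, 5, 8]
f = [3, 1, 2]
print(weighted_mean(x, f))                  # 4.5
print(arithmetic_mean([2, 2, 2, 5, 8, 8]))  # 4.5, the same data written out in full
```

Dividing each frequency by the total before calling `weighted_mean` gives relative frequencies that sum to 1 and leaves the result unchanged.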

If $X$ is a continuous random variable with probability density function $f(x)$ then the arithmetic mean is the expected value of $X$, written $E(X)$:

$$E(X)=\int_{-\infty}^{\infty} x\,f(x)\,dx$$

In statistical analysis the arithmetic mean is the most common measure of central tendency applied, but it is by no means always the most appropriate, as the following sections explain. Many of the variants of mean value are particular examples of the so-called Power Mean.

The expected value, $E(X)$, characterizes a density function in terms of its arithmetic mean. However, this provides a very limited characterization, and other measures of distribution spread and shape are often needed to obtain a better picture of the form of a distribution, and to assist in fitting model distributions to sample data. The generalization of the expected value of $X$ to the expected value of $X^r$, $r=1,2,\ldots$, is an important approach to determining such additional characterizations. These expected values of powers of $X$ are known as moments.
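For a sample, the expected value of the r-th power of X can be estimated by averaging the r-th powers of the observations. The sketch below assumes equally weighted observations; the function name `raw_moment` and the data are illustrative only.

```python
def raw_moment(values, r):
    """Sample estimate of E(X^r): the mean of the r-th powers of the values."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    return sum(v ** r for v in values) / len(values)


sample = [1, 2, 3, 4]          # illustrative data only
print(raw_moment(sample, 1))   # 2.5, the arithmetic mean
print(raw_moment(sample, 2))   # 7.5, the mean of the squared values
```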

R: mean(x) computes a single arithmetic mean. Additional functions are available to compute the column-wise or row-wise means of a matrix. For example, for the swiss dataset (47 rows, 6 variables), one of the test datasets included at the end of this handbook, the colMeans function gives the following results:

R: colMeans(swiss)

| Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|
| 70.14255 | 50.65957 | 16.48936 | 10.97872 | 41.14383 | 19.94255 |

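The harmonic mean, $H$, is obtained by taking the mean of the reciprocals of the data values and then taking the reciprocal of the result:

$$H=\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i}\right)^{-1}=\frac{n}{\sum_{i=1}^{n} 1/x_i}$$

For positive data values the harmonic mean is less than or equal to the geometric mean, $G$, which in turn is less than or equal to the arithmetic mean, $\bar{x}$:

$$H\le G\le \bar{x}$$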

At first sight this seems a rather strange procedure, but it is typically applied where the data, $\{x_i\}$, are some form of ratio measure. For example, a vehicle traveling at 40kph for 1 hour and then at 60kph for the next hour will cover a total of 100kms in 2 hours, an average of 50kph, i.e. (40+60)/2. However, if the vehicle traveled 40kms at 40kph and then 40kms at 60kph, i.e. the same distance at two different speeds or rates of travel, it will have traveled only 80kms in total and the time taken will be less than 2 hours. In fact the time taken would be 1 hour plus 40 minutes, so 100 minutes, giving an average speed of 80*60/100=48kph. Using the harmonic mean calculation we can compute this result directly, as:

$$H=\frac{2}{\frac{1}{40}+\frac{1}{60}}=48\text{ kph}$$

A more pragmatic reason for using harmonic means rather than arithmetic means arises where the denominator in a ratio can be quite small, resulting in extremely large ratio values. Suppose we are given a set of three ratio values, 90, 50 and 10, and we know that each ratio is computed with a broadly similar numerator, e.g. around 100. This implies that the denominators are approximately 1.11, 2 and 10 respectively. The arithmetic mean of the ratio values is 150/3=50. However, the sum of the denominators is 13.11 and the sum of the numerators is 300, so the overall mean would be expected to be around 23 (300/13.11≈22.9), which is the value given by the harmonic mean of the three ratios. This illustrates the excessive weighting that is given to small denominator values in arithmetic averages. For this reason some application areas, such as financial services, may prefer to use the harmonic mean or weighted harmonic mean. A common application is the reporting of price to earnings ratios (P/E ratios): for a single company over a number of years, for a company using the current market price and future (forecast) earnings, or for summarizing the average P/E ratio of an entire portfolio of stock holdings (which would typically be weighted to reflect the varying size of these holdings).
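The arithmetic in this example can be reproduced as follows. The sketch assumes, as the text does, a common numerator of exactly 100 for each ratio; the variable names are illustrative.

```python
ratios = [90, 50, 10]
numerator = 100.0   # assumed common numerator for every ratio

# Denominators implied by each ratio: roughly 1.11, 2 and 10.
denominators = [numerator / r for r in ratios]

arithmetic = sum(ratios) / len(ratios)
pooled = (numerator * len(ratios)) / sum(denominators)   # total numerator / total denominator
harmonic = len(ratios) / sum(1.0 / r for r in ratios)

print(arithmetic)          # 50.0
print(round(pooled, 2))    # 22.88
print(round(harmonic, 2))  # 22.88
```

With equal numerators the pooled ratio and the harmonic mean coincide; with unequal numerators a weighted harmonic mean would be needed to recover the pooled value.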

The geometric mean, $G$, is defined by taking the product of the data values (which must all be positive) and then taking the $n$th root of the result. The geometric mean is greater than or equal to the harmonic mean and less than or equal to the arithmetic mean:

$$G=\left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

hence

$$\log G=\frac{1}{n}\sum_{i=1}^{n}\log x_i$$

The logarithm of the geometric mean is thus simply the arithmetic mean of the logs of the variable in question. The geometric mean is most often applied when the data are percentage growth values, such as annual interest rates. An example would be an investment of $100 at an annual interest rate of 10%. At the end of years 1, 2, 3 and 4 the capital would have grown to 110, 121, 133.1 and finally 146.41. The total rise is 46.41% over 4 years, so the arithmetic mean annual rise would be 11.6%, which is clearly incorrect. The geometric mean instead takes the 4th root of the overall growth factor, 1.1x1.1x1.1x1.1=1.4641, which gives 1.1, i.e. growth of 10% per year. Another example application would be to estimate the population of a town in 2005, say, from census data for 2000 and 2010. Given census figures of 30,000 (for the year 2000) and 40,000 (for 2010), we would estimate the population in 2005 as (30000x40000)^(1/2) = 34,641 rather than (30000+40000)/2 = 35,000.
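Both worked examples can be reproduced by computing the geometric mean through logarithms, which avoids overflow when many values are multiplied together. This is a minimal sketch assuming strictly positive inputs; the function name is illustrative.

```python
import math


def geometric_mean(values):
    """n-th root of the product, computed as exp(mean(log x))."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("the geometric mean needs at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


growth_factors = [1.1, 1.1, 1.1, 1.1]             # 10% growth in each of 4 years
print(round(geometric_mean(growth_factors), 6))   # 1.1

census = [30000, 40000]                           # populations in 2000 and 2010
print(round(geometric_mean(census)))              # 34641
```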

The Power Mean (or generalized mean), $M_p$, can be regarded as the general (limit) expression for many mean values:

$$M_p=\left(\frac{1}{n}\sum_{i=1}^{n} x_i^{\,p}\right)^{1/p}$$

Values for p give the following means: p=1 arithmetic; p=2 root mean square; p=‑1 harmonic. Limit values for p (i.e. as p tends to these values) give the following means: p=0 geometric; p=‑∞ minimum; p=∞ maximum
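The special cases listed above can be checked with a short function. The sketch assumes positive data and handles the three limiting cases (p=0, p=-∞ and p=∞) explicitly rather than numerically; the function name `power_mean` is an illustrative choice.

```python
import math


def power_mean(values, p):
    """Power (generalized) mean of positive values for exponent p."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("the power mean here needs at least one value, all positive")
    if p == 0:                       # limit p -> 0: geometric mean
        return math.exp(sum(math.log(v) for v in values) / len(values))
    if p == math.inf:                # limit p -> +infinity: maximum
        return max(values)
    if p == -math.inf:               # limit p -> -infinity: minimum
        return min(values)
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


data = [1, 2, 4]                         # illustrative data only
print(round(power_mean(data, 1), 4))     # 2.3333 (arithmetic)
print(round(power_mean(data, 2), 4))     # 2.6458 (root mean square)
print(round(power_mean(data, -1), 4))    # 1.7143 (harmonic)
print(round(power_mean(data, 0), 4))     # 2.0 (geometric)
```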

Trim mean/Olympic mean/truncated mean

The trim or truncated mean is the arithmetic mean computed after a specified proportion, $t/2$, of the values has been removed from each tail, so as to exclude the highest and lowest outliers and extreme values. For small samples a specific number of observations (e.g. 1), rather than a percentage, may be ignored. In general an equal number, $k$, of high and low values should be removed, and the number of observations summed should equal $n(1-t)$ expressed as an integer. This variant is sometimes described as the Olympic mean, as it was used in the past when scoring Olympic ice-skating and gymnastics, for example (removing the highest and lowest marks given by the judges). With the observations sorted into ascending order, $x_{(1)}\le x_{(2)}\le\ldots\le x_{(n)}$:

$$TM_t=\frac{1}{n(1-t)}\sum_{i=nt/2+1}^{n(1-t/2)} x_{(i)},\qquad t\in[0,1)$$

A typical example would be a set of 10 measurements or scores in which the highest and lowest are discarded and the mean is computed from the remaining 8 values. Setting $t$=20%, so that $t/2$=10%, the expression above becomes:

$$TM_{0.2}=\frac{1}{8}\sum_{i=2}^{9} x_{(i)}$$

When t/2=25% the trim mean is the mean of the central two quartiles of the distribution and is sometimes called the mid-mean.
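A trimmed mean can be sketched as below. The number of values dropped from each tail is taken as n·t/2 rounded down, which is an assumption of this example (other rounding conventions exist), and the scores are invented for illustration.

```python
def trimmed_mean(values, t):
    """Mean after removing a proportion t/2 of the values from each tail."""
    if not 0 <= t < 1:
        raise ValueError("t must satisfy 0 <= t < 1")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one value is required")
    k = int(n * t / 2)           # values dropped per tail, rounded down (assumption)
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


# Ten illustrative judges' scores, including one very low mark.
scores = [9.5, 8.0, 8.5, 9.0, 2.0, 8.5, 9.0, 8.0, 10.0, 8.5]
print(trimmed_mean(scores, 0.0))   # 8.1, the ordinary arithmetic mean
print(trimmed_mean(scores, 0.2))   # 8.625, highest and lowest score removed
```

Setting `t=0.5` removes a quarter of the values from each tail and so gives the mid-mean.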

The Winsorized mean (named after the biostatistician C P Winsor) is similar to the trimmed mean, but instead of being discarded the extreme values are replaced by a value derived from the remainder of the data. Examples include replacing values that lie beyond the 5th and 95th percentiles with the values at the 5th and 95th percentiles and then calculating the mean, or replacing extreme values with the next higher or lower value in the ordered set of data.
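The second variant described above (replacing extremes with the next value inward in the ordered data) can be sketched as follows. The parameter `k`, the number of values replaced in each tail, and the example scores are assumptions made for this illustration.

```python
def winsorized_mean(values, k=1):
    """Mean after replacing the k lowest and k highest values with their
    nearest retained neighbours in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must satisfy 0 <= 2k < n")
    low, high = ordered[k], ordered[n - k - 1]
    clipped = [min(max(v, low), high) for v in ordered]
    return sum(clipped) / n


scores = [9.5, 8.0, 8.5, 9.0, 2.0, 8.5, 9.0, 8.0, 10.0, 8.5]   # illustrative data
print(winsorized_mean(scores, k=1))   # 8.65: the 2.0 becomes 8.0 and the 10.0 becomes 9.5
```

Unlike the trimmed mean, the divisor remains n because no observations are discarded.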

Circular data, such as wind direction records, are not directly suitable for use in calculating mean values. This is clear when one considers the mean direction associated with two measurements, 350 degrees and 10 degrees. The mean direction is clearly 0 degrees, but the arithmetic mean gives the result 180 degrees. To resolve this difficulty we take the set of directions $\{\theta_i\}$, measured with respect to a predefined direction (e.g. due North or 0 degrees), and transform them from polar coordinates (assuming the observations to lie on a unit circle) to Cartesian coordinates, calculate the mean of the transformed data and then back-transform the result. Angular measure can be in degrees or radians, but radians are generally used because the trigonometric functions in most software accept arguments in radians only. So, to obtain the mean we compute two vector components (Northing and Easting):

$$V_n=\frac{1}{n}\sum_{i=1}^{n}\cos\theta_i,\qquad V_e=\frac{1}{n}\sum_{i=1}^{n}\sin\theta_i$$

The resultant vector, $r$, has length $|r|=\sqrt{V_n^2+V_e^2}$ and a mean or preferred direction:

$$\theta_r=\tan^{-1}\left(\frac{V_e}{V_n}\right)$$

with the quadrant of $\theta_r$ determined by the signs of $V_e$ and $V_n$.

For example, with three vectors at ‑80°, 90° and 90° from horizontal the resultant mean direction is +80.3°; with the two vectors 350° and 10° the resultant is 0° (due North). Note that in this example all observation vectors have unit length and the resultant vector, r, has length |r| that is in the range [0,1]. Values close to 1 indicate that the circular variance is close to 0 (as this is simply 1-|r|), i.e. the data are tightly clustered around the mean.
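The two examples can be reproduced with a short function that averages the cosine and sine components and back-transforms with a two-argument arctangent, so the quadrant is resolved automatically. The function name and the choice to return the direction in the range (-180°, 180°] are assumptions of this sketch.

```python
import math


def circular_mean(degrees):
    """Return (mean direction in degrees, resultant length |r|) for unit vectors."""
    angles = [math.radians(d) for d in degrees]
    n = len(angles)
    if n == 0:
        raise ValueError("at least one direction is required")
    northing = sum(math.cos(a) for a in angles) / n
    easting = sum(math.sin(a) for a in angles) / n
    direction = math.degrees(math.atan2(easting, northing))
    length = math.hypot(northing, easting)
    return direction, length


direction, length = circular_mean([-80, 90, 90])
print(round(direction, 1))   # 80.3

direction, length = circular_mean([350, 10])
print(direction)             # close to 0 (due North), up to floating-point error
print(1 - length)            # circular variance, small for tightly clustered data
```

When the resultant length is close to 0 the mean direction is poorly defined, so in practice both returned values should be inspected together.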

The mode is the most common or frequently occurring value in a set. This set, whilst usually numeric, can simply be nominal (e.g. a set of individuals with black, brown, fair or red hair). Where a set has one dominant value or range of values it is said to be unimodal; if there are several commonly occurring values or ranges it is described as multi-modal. Note that (mean‑mode)≈3x(mean‑median) for many unimodal distributions. The mode is a robust statistic, in that it is generally not especially sensitive to outliers, although it can be in some instances. Furthermore, as noted, the mode is not necessarily unique (e.g. multi-modal and uniform distributions) and there are some so-called pathological distributions for which no mode exists.

For strictly continuous distributions the mode is not well defined, since each value occurs only once. There are various solutions to this difficulty. One option is to separate the continuous distribution into equal interval divisions of x, and look for the most frequently occurring division or divisions. A difficulty here is that the selection of the interval width may affect the results. A second approach is to compute the slope of the continuous curve along its length and identify those locations (values of x) for which the slope is 0, and for which the slope transitions from positive to negative in the immediate neighborhood (identifying a local maximum).
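The mode of nominal data, and the equal-interval approach for continuous data, can be illustrated as follows. The sketch uses `statistics.multimode` from the Python standard library (Python 3.8 or later); the helper `binned_modes`, the interval widths and the data are assumptions made for the example.

```python
import math
from collections import Counter
from statistics import multimode


def binned_modes(values, width):
    """Lower edges of the most frequent intervals [k*width, (k+1)*width)."""
    counts = Counter(math.floor(v / width) for v in values)
    top = max(counts.values())
    return sorted(k * width for k, c in counts.items() if c == top)


hair = ["brown", "black", "brown", "fair", "red", "brown", "black"]
print(multimode(hair))             # ['brown']

readings = [1.1, 1.9, 2.2, 2.4, 2.8, 3.5, 7.0]   # illustrative data only
print(binned_modes(readings, 1))   # [2]
print(binned_modes(readings, 4))   # [0]
```

The last two lines show the difficulty noted in the text: changing the interval width from 1 to 4 moves the modal interval from [2, 3) to [0, 4).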

The median is the middle value in an ordered set of data if the set contains an odd number of values, or the average of the two middle values (if this is meaningful) where the set contains an even number of values.

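$$\mathrm{Med}\{x_i\}=x_{((n+1)/2)},\qquad n\ \text{odd}$$

$$\mathrm{Med}\{x_i\}=\frac{x_{(n/2)}+x_{(n/2+1)}}{2},\qquad n\ \text{even}$$

where $x_{(k)}$ denotes the $k$th value in the ordered set.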
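The odd and even cases translate directly into code. This is a minimal sketch with an illustrative function name; note that Python lists are indexed from 0, so the k-th ordered value is `ordered[k - 1]`.

```python
def median(values):
    """Middle ordered value (n odd) or mean of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("at least one value is required")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # the ((n+1)/2)-th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of the (n/2)-th and (n/2+1)-th


print(median([7, 1, 3]))      # 3
print(median([7, 1, 3, 5]))   # 4.0
```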

For a continuous distribution the median is the x value corresponding to the 50% point (p=0.5) obtained from the cumulative distribution of the values or function, as illustrated below — in this example, for a Normal distribution with mean 0 and standard deviation 1:

Mathematically the median for continuous probability distributions, $f(x)$, is the value, $m$, for which:

$$\int_{-\infty}^{m} f(x)\,dx=\frac{1}{2}$$

If the cumulative distribution function of a discrete probability distribution is plotted, the median value will generally fall between two discrete values. In the example below, where the Poisson distribution has been plotted, with mean value, m=25, the median lies between 24 and 25.

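The median is of particular use where the distribution is asymmetric and/or where the middle value is sought as a representative central value. The median is also the value that minimizes the mean absolute deviation, i.e. the average of the absolute differences between the data values and a chosen central value.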

R: median(x)
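The minimizing property of the median can be checked numerically: the average absolute deviation measured from the median is no larger than that measured from any other point, including the arithmetic mean. The data and the helper name below are illustrative assumptions.

```python
from statistics import mean, median


def mean_abs_deviation(values, centre):
    """Average absolute difference between the values and a chosen centre."""
    return sum(abs(v - centre) for v in values) / len(values)


data = [1, 2, 3, 10, 50]   # illustrative, strongly skewed data
print(mean_abs_deviation(data, median(data)))          # 11.4, measured from the median (3)
print(round(mean_abs_deviation(data, mean(data)), 2))  # 14.72, measured from the mean (13.2)
```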

The mid-range is simply the middle value of the range:

$$MR\{x_i\}=\frac{x_{\min}+x_{\max}}{2}$$

and hence is non-robust in most common situations as it is highly sensitive to outliers (it only takes into account the two most extreme values from a sample). The mid-range is the point for which the maximum deviation of a set of values is minimized (hence it is a form of minimax statistic).
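Taking the mid-range as the midpoint between the smallest and largest values, its sensitivity to a single outlier is easy to demonstrate. The function name and data are illustrative only.

```python
def mid_range(values):
    """Midpoint of the smallest and largest values."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    return (min(values) + max(values)) / 2


data = [2, 3, 5, 9]              # illustrative data only
print(mid_range(data))           # 5.5
print(mid_range(data + [100]))   # 51.0, a single outlier moves the result substantially

# Minimax property: the largest deviation from the mid-range is half the range.
centre = mid_range(data)
print(max(abs(v - centre) for v in data))   # 3.5
```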