
# Statistical data


Statistics (plural) is the field of science that involves the collection, analysis and reporting of information that has been sampled from the world around us. The term sampled is important here. In most instances the data we analyze is a sample (a carefully selected representative subset) from a much larger population. In a production process, for example, the population might be the set of integrated circuit devices produced by a specific production line on a given day (perhaps 10,000 devices) and a sample would be a selection of a much smaller number of devices from this population (e.g. a sample of 100, to be tested for reliability). In general this sample should be arranged in such a way as to ensure that every chip from the population has an equal chance of being selected. Typically this is achieved by deciding on the number of items to sample, and then using equi-probable random numbers to choose the particular devices to be tested from the labeled population members. The details of this sampling process, and the sample size required, are discussed in the section Sampling and sample size.
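The selection procedure just described can be sketched in a few lines of Python. This is a minimal illustration only, assuming the 10,000 devices carry integer labels 1 to 10,000; the function name `simple_random_sample` and the seed value are hypothetical choices made for the example.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw labels without replacement so that every member of the
    population has the same chance of selection (assumed labels 1..N)."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    labels = range(1, population_size + 1)
    return sorted(rng.sample(labels, sample_size))

# 100 devices chosen from a day's production of 10,000
chosen = simple_random_sample(10_000, 100, seed=2024)
print(len(chosen), chosen[:5])
```

Sampling without replacement is used here because a device can only be tested once; recording the seed allows the same selection to be reproduced later.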

The term statistic (singular) refers to a value or quantity, such as the mean value, maximum or total, calculated from a sample. Such values may be used to estimate the (presumed) population value of that statistic. Such population values, particularly key values such as the mean and variance, are known as parameters of the pattern or distribution of population values.

In many instances the question of what constitutes the population is not as clear as suggested above. When undertaking surveys of householders, the total population is rarely known, although an estimate of the population size may be available. Likewise, when undertaking field research, taking measurements of soil contaminants or air pollutants, or using remote sensing data, the population being investigated is often not so well-defined and may be infinite. When examining a particular natural or man-made process, the set of outcomes of the process is effectively the population.

Since statistics involves the analysis of data, and the process of obtaining data involves some kind of measurement process, a good understanding of measurement is important. In the subsections that follow, we discuss the question of measurement and measurement scales, and how measured data can be grouped into simple classes to produce data distributions. Finally we introduce two issues that serve to disguise or alter the results of measurement in somewhat unexpected ways. The first of these is the so-called statistical grouping effect, whereby grouped data produce results that differ from ungrouped data in a non-obvious manner. The second of these is a spatial effect, whereby selection of a particular arrangement of spatial groupings (such as census districts) can radically alter the results one obtains.

Perhaps one of the most hotly debated topics in recent years has been the rise of so-called "Big Data". In an article "Big Data: Are we making a big mistake?" in the Financial Times, March 2014, Tim Harford highlights some of the less obvious issues posed by Big Data. Perhaps primary amongst these is the bias that is found in many such datasets. Such biases may be subtle, difficult to identify and, in some cases, impossible to manage. For example, almost all Internet-related Big Data is intrinsically biased in favor of those who have access to and utilize the Internet most, with demographic and geographic bias built in. The same applies to specific services, such as Google, Twitter, Facebook, mobile phone networks, opt-in online surveys and opt-in emails; the examples are many and varied, but the problems are much the same as those familiar to statisticians for over a century. Big Data does not imply good data or unbiased data, and it presents other problems: it is all too easy to focus on data exploration and pattern discovery, identifying correlations that may well be spurious, a result of the sheer volume of data and the number of events and variables measured. With enough data and enough comparisons, statistically significant findings are inevitable, but they do not necessarily provide real insights, understanding, or identification of causal relationships. Of course there are many important and interesting datasets where the collection and storage are far more systematic and less subject to bias, recording variables in a direct manner, with 'complete' and 'clean' records. Such data are stored and managed well and tend to be those collected by agencies that supplement the data with metadata (data about data) and quality assurance information.

## Measurement

In principle the process of measurement should seek to ensure that results obtained are consistent, accurate (a term that requires separate discussion), representative, and if necessary independently reproducible. Some factors of particular importance include:

framework: the process of producing measurements is both a technical and, to an extent, philosophical exercise. The technical framework involves the set of tools and procedures used to obtain and store numerical data regarding the entities being measured. Different technical frameworks may produce different data of varying quality from the same set of entities. In many instances measurement is made relative to some internationally agreed standard, such as the meter (for length) or the kilogram (for mass). The philosophical framework involves the notion that a meaningful numerical value or set of values can be assigned (using some technical framework) to attributes of the entities. This is a model or representation of these entity attributes in the form of numerical data; a person's height is an attribute that we can observe visually, describe in words, or assign a number to based on an agreed procedure relative to a standard (in meters, which in turn is based on the agreed measurement of the speed of light in a vacuum)

observer effects: in both social and pure science research, observer effects can be significant. As a simple example, if we are interested in measuring the temperature and air quality in a process clean room, the presence of a person taking such measurements would inevitably have some effect on the readings. Similarly, in social research many programs can display the so-called Hawthorne Effect, in which changes (often improvements) in performance are partially or wholly the result of behavioral changes in the presence of the observer (reflecting greater interest in the individuals being observed)

metrics: when measuring distance in the plane using Euclidean measure the results are invariant under translation, reflection and rotation. So if we use Euclidean measure we can safely make measurements of distances over relatively small areas and not worry about the location or orientation at which we took the measurements and made the calculation. However, over larger areas and/or using a different metric, measurements may fail the invariance test. For example, distances computed using the so-called City block or Manhattan metric are not invariant under rotation. Likewise, Euclidean distance measurements give incorrect results over larger distances on the Earth's surface (e.g. >20 kilometers). When making other forms of measurement similar issues apply (e.g. the effect of the local gravitational field on weight, the local magnetic field on magnetic flux, etc.)

temporal effects: measurements made at different times of the day, days of the year and in different years will inevitably differ. If the differences are simply random fluctuations in a broadly constant process (results are unaffected by temporal translation of the data) the process is described as being stationary. If a trend exists (which could be linear, cyclical or some other pattern) the process is said to be non-stationary. All too often consideration of the temporal aspect of measurement is omitted, e.g. a person's height will be measured as shorter in the evening as compared with the morning, and a person's academic or sporting achievement can be significantly affected by when they were born (see Gladwell, 2008, for an extensive discussion of this issue, [GLA1]). The issue is always present even if it is not of direct concern. Frequently the sequence of event measurement is important, especially where humans are doing the measurements or recordings, since issues such as concentration become important over time; event sequences may also be explicitly monitored, as in control charts, time series analysis and neural network learning

spatial effects: measurements made at different locations will typically exhibit spatial variation. If all locations provided identical data the results would be a spatially uniform distribution. If the results are similar in all directions at all locations, then the process is described as isotropic (i.e. results are rotationally invariant). If the results are similar at all locations (i.e. the results are translationally invariant) then the process can be described as stationary. In practice most spatial datasets are non-stationary
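The invariance property mentioned under metrics above can be checked numerically. The following is a small sketch, assuming two arbitrary points in the plane and a 45 degree rotation about the origin; the helper names `euclidean`, `manhattan` and `rotate` are hypothetical and chosen only for this illustration.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points in the plane."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def manhattan(p, q):
    """City block distance: sum of absolute coordinate differences."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

def rotate(p, degrees):
    """Rotate a point about the origin by the given angle."""
    a = math.radians(degrees)
    return (p[0] * math.cos(a) - p[1] * math.sin(a),
            p[0] * math.sin(a) + p[1] * math.cos(a))

p, q = (0.0, 0.0), (3.0, 4.0)
p_rot, q_rot = rotate(p, 45), rotate(q, 45)

# Euclidean distance is unchanged by the rotation (up to rounding)
print(euclidean(p, q), euclidean(p_rot, q_rot))
# Manhattan distance changes when the axes are rotated
print(manhattan(p, q), manhattan(p_rot, q_rot))
```

The same pair of points therefore has a single Euclidean separation whatever the orientation of the coordinate axes, whereas the City block value depends on how the axes are aligned.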

## Measurement scales

Measurement gives rise to values, such as counts, sets of decimal values, binary responses (yes/no, presence/absence) etc., which may be of different types (scales). The principal scales encountered are:

Nominal (or Categorical): data is really just assignment to named classes, such as Red, Blue, Green — or Utah, Nevada, New York. An attribute is nominal if it successfully distinguishes between groups, but without any implied ranking or potential for arithmetic. For example, a telephone number can be a useful attribute of a place, but the number itself generally has no numeric meaning. It would make no sense to add or divide telephone numbers, and there is no sense in which the number 9680244 is more or better than the number 8938049. Likewise, assigning arbitrary numerical values to classes of land type, e.g. 1=arable, 2=woodland, 3=marsh, 4=other is simply a convenient form of naming (the values are still nominal)

Ordinal: this term refers to data values that involve a concept of order, from least to greatest, and may include negative numbers and 0. A set of apparently ordered categories such as 1=low, 2=medium, 3=high, 4="don't know" does not form an ordinal scale, because the final category has no place in the ordering. An attribute is ordinal if it implies a ranking, in the sense that Class 1 may be better than Class 2, but as with nominal attributes arithmetic operations do not make sense, and there is no implication that Class 3 is worse than Class 2 by the precise amount by which Class 2 is worse than Class 1. An example of an ordinal scale might be preferred locations for residences: an individual may prefer some areas of a city to others, but such differences between areas may be barely noticeable or quite profound. Analysis of nominal and ordinal data is often qualitative, or uses visualization techniques to highlight interesting patterns, and may use non-parametric statistical methods, especially when count data are available

Interval: numeric data that exhibits order, plus the ability to measure the interval (distance) between any pair of objects on the scale (e.g. 2x-x = 3x-2x). Data are of interval type if differences make sense, as they do for example with measurements of temperature on the Celsius or Fahrenheit scales

Ratio: interval plus a natural origin, e.g. temperature in kelvins or weights of people (so that x=2y is meaningful). Interval or ratio scales are required for most forms of (parametric) statistical analysis. Data are ratio scaled if it makes sense to divide one measurement by another. For example, it makes sense to say that one person weighs twice as much as another person, but it makes no sense to say that a temperature of 20 Celsius is twice as warm as a temperature of 10 Celsius, because while weight has a meaningful absolute zero, Celsius temperature does not (but on an absolute scale of temperature, such as the Kelvin scale, 200 kelvins can indeed be said to be twice as warm as 100 kelvins). It follows that negative values cannot exist on a ratio scale.

Cyclic: modulo data — like angles and clock time. Measurements of attributes that represent directions or cyclic phenomena have the awkward property that two distinct points on the scale can be equal — for example, 0 and 360 degrees. Directional data are cyclic (see the sample wind rose diagram below) as are calendar dates. Arithmetic operations are problematic with cyclic data, and special techniques are needed to handle them. For example, it makes no sense to average 1° and 359° to get 180°, since the average of two directions close to north clearly is not south. Mardia and Jupp (1999, [MAR1]) provide a comprehensive review of the analysis of directional or cyclic data

Cyclic data — Wind direction and speed, single location
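The difficulty with averaging directions noted under Cyclic above can be illustrated with a short sketch. One common approach, assumed here for illustration, is to treat each direction as a unit vector, sum the vectors, and take the direction of the resultant; the function name `circular_mean_deg` is hypothetical.

```python
import math

def circular_mean_deg(angles_deg):
    """Mean direction in degrees, computed from summed unit vectors."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    if math.hypot(s, c) < 1e-12:
        # e.g. 0 and 180 degrees cancel, so no mean direction exists
        raise ValueError("mean direction is undefined for these angles")
    return math.degrees(math.atan2(s, c)) % 360.0

angles = [1.0, 359.0]
print(sum(angles) / len(angles))   # arithmetic mean: 180.0, i.e. south
print(circular_mean_deg(angles))   # a value at, or within rounding of, north
```

Because 0 and 360 degrees are the same direction, the circular result for this pair may print as a value very close to either 0 or 360.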

## Bar charts, Histograms and Frequency distributions

Bar chart: The process of measurement may produce data that are recorded as counts and assigned to purely nominal classes, for example counts of different bird species in a woodland. In this instance a simple bar chart may be produced to illustrate the different relative frequencies of each species. Each class is assigned an individual vertical or horizontal bar, and typically each bar is the same width (so height indicates relative frequency). Bars are separated by distinct gaps and the order in which the bars are placed on the horizontal or vertical axis is of no importance. The example below (upper diagram) shows the results of the UK parliamentary election in May 2010. The bar chart indicates the seats won in the "first past the post" system used currently in the UK, with a geographic map of the spread of these results. The lower diagram shows the same data but with the geography amended to minimize the visual distortion caused by constituencies having very different areas. For color versions of these charts see the web or PDF editions of this Handbook.

BBC UK Election 2010 results

source: http://news.bbc.co.uk/1/shared/election2010/results/. Note that the first diagram is familiar but misleading as it does not reflect the density of voters, suggesting that the seats obtained by parties in large rural areas are somehow more significant than those in densely populated urban areas. This view can be corrected in various ways, most simply by adjusting the areas to reflect the populations in each. The result is a proportional map, with areas distorted but still recognizable, as illustrated in the second diagram.

Histogram: If measurements yield numerical values on an interval or ratio scale, these can be grouped into classes and the counts (or frequencies) in each class plotted as a bar chart in which the order on the horizontal axis (or x-axis) is important. A bar chart of this type is called a histogram and should be plotted without spaces between the vertical bars, reflecting the continuous nature of the scale (see example of light bulb life data, below). The term histogram was introduced by Karl Pearson in the late 19th century to describe any chart of this type, especially charts in which the horizontal axis represented time. He liked the idea that the Greek word histos, which means anything placed vertically, like a ship's mast, is similar to the word historical, giving the idea of a frequency chart with a time-based x-axis.

Frequency distribution: A frequency distribution is a tabulated set of sample data, showing the number of occurrences of events or observations that fall into distinct classes or that have particular values. As such, it can be seen as a convenient way of avoiding the need to list every data item observed separately. However, frequency distributions can often provide greater insight into the pattern of sample values, and enable these patterns to be compared with well-understood standard distributions, such as the Binomial (discrete) and Normal (continuous) distributions. A simple example is shown in the table below together with the chart (or histogram) of the data. In this table there are 17 equal interval classes, for each of which the number of light bulbs in a sample of N=150 that fail within that interval is listed.

Length of life of electric light bulbs — tabulated and histogram

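| Life (hours) | Frequency |
|---|---|
| 0-200 | 1 |
| 200-400 | 3 |
| 400-600 | 2 |
| 600-800 | 10 |
| 800-1000 | 19 |
| 1000-1200 | 21 |
| 1200-1400 | 23 |
| 1400-1600 | 18 |
| 1600-1800 | 17 |
| 1800-2000 | 10 |
| 2000-2200 | 8 |
| 2200-2400 | 5 |
| 2400-2600 | 5 |
| 2600-2800 | 4 |
| 2800-3000 | 2 |
| 3000-3200 | 1 |
| 3200-3400 | 1 |
| Total | 150 |

[Figure: histogram of frequencies]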

after Pearson E S (1933, [PEA1])

Several features of this frequency distribution should be noted:

(i) it has a single category or class containing the most frequent bulb life (1200-1400 hours); this category is called the mode, and because there is a single mode, the distribution is said to be unimodal

(ii) the classes in the tabulated list are not correctly defined: the boundaries are indeterminate, and should be specified as [0,199.9], [200,399.9], etc. (or similar) or, better still, as [0,<200], [200,<400], etc. (in Pearson's paper, which was primarily concerned with production control and sampling, he supplied only the frequency diagram, not the tabulated data). Precise definition of the class boundaries avoids the problem of deciding how to assign values that lie on a boundary (e.g. a bulb with a measured lifespan of exactly 200 hours)

(iii) each class is the same width (duration) and every data value is allocated to a unique class; however, when performing certain calculations, such as computing the mean value, a decision has to be made as to whether to use the recorded frequencies in the various classes or bins, or the source data (if available). If the frequencies have to be used, it is necessary to define a representative value for each interval, which is usually taken to be the mid-interval value. Note that this assumption hides the within-class variation in values which may create some errors in computations, especially if the class widths are large. The question of bin selection is discussed later in this section

(iv) the width (duration) of each class is somewhat arbitrary and this choice significantly affects the form of the frequency distribution. If the class width were very small (1 hour say) most classes would have a frequency of 0, and a few would contain just 1 failure. At the opposite extreme, if the class width were 3400 hours all the results would be in just the one class. In both these examples very little information would be gained from inspecting the pattern of frequencies. Selecting the class boundaries and number of classes is an important operation: it should ensure that as little information as possible is lost, whilst also ensuring that the distribution communicates useful and relevant information. Many authors recommend the use of an odd number of classes, and there are many rules of thumb for choosing the number of classes and class boundaries (see Class intervals, below)

(v) all the data fit into the classes (in this example). This is often not possible to achieve with equal interval classes, especially at the upper and lower ends of the distribution. Indeed, frequency distributions with very long tails are common, and often the final category is taken as 3000+ for example

(vi) the data being analyzed in this example can be regarded as a continuous variable (lifespan of the bulb) and is a single variable (i.e. univariate data)

There are several extensions and variations that can be applied to the above model. The first is to rescale the vertical axis by dividing each class value by the total sample size (N=150), in which case the data are described as relative frequencies, and in examples such as this, the values can be considered as estimated probabilities.
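As a worked illustration of relative frequencies, and of the mid-interval convention described in point (iii), the light bulb table can be processed as follows. This is a sketch only; the variable names are hypothetical, and the frequencies are those tabulated above.

```python
# Class lower bounds (hours) and frequencies from the light bulb table
lower_bounds = list(range(0, 3400, 200))
freqs = [1, 3, 2, 10, 19, 21, 23, 18, 17, 10, 8, 5, 5, 4, 2, 1, 1]
width = 200

n = sum(freqs)                               # total sample size, N
rel_freqs = [f / n for f in freqs]           # estimated probabilities
midpoints = [lo + width / 2 for lo in lower_bounds]

# Grouped mean: every bulb is treated as failing at its class midpoint
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

modal_index = freqs.index(max(freqs))
print(n, grouped_mean)
print(lower_bounds[modal_index], rel_freqs[modal_index])
```

The grouped mean is only an approximation to the mean of the raw lifetimes, since it ignores the variation of values within each 200 hour class.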

A second important variant is the extension of the frequency table and chart to multivariate and multi-dimensional cases. In the bivariate case the data may simply be separate measures applied to the same classes, or they may be joint measures. For example, suppose that our classes show the heights of individuals in a large representative sample. The first column of a bivariate frequency tabulation might show the frequency distribution for men over 18 years, whilst the second column shows the same data but for women. However, if the mix of those sampled included fathers and sons, one could construct a two-way or joint frequency distribution (or cross-tabulation) of the men with classes "Tall" and "Short", where Tall is taken as over some agreed height. The table below illustrates such a cross-tabulation, based on a study of families carried out by Karl Pearson and Dr Alice Lee from 1893 onwards:

Cross-tabulation of father-son height data

|  | Father short | Father tall | Total (sons) |
|---|---|---|---|
| Son short | 250 | 89 | 339 |
| Son tall | 215 | 446 | 661 |
| Total (fathers) | 465 | 535 | 1000 |

simplified, after K Pearson and A Lee (1903, Table XXII [PEA2]; the overall sample size of 1000 families and the cell entries are simply a proportional reduction from the 1078 cases in the original data).

In this example each part of the frequency distribution is divided into just 2 classes, but each could readily have been separated into 3 or more height bands. Indeed, the original table is divided into 20 rows and 17 columns (illustrated in full in the Probability section of this Handbook), but inevitably many of the table entries are blank. Row and column totals have been provided, and these are sometimes referred to as marginal frequencies or marginal distributions. They are essentially the univariate frequency distributions for the rows and columns taken separately.

As with the univariate frequency data, this table could be converted to relative frequencies by dividing through by 1000, but it also affords another perspective on the data. We can consider questions such as: "what is the probability that a father and his son are both tall?" If the data are truly representative of the population of fathers and sons, then the estimated probability is 446/1000 or 44.6%. (The related conditional question, "what is the probability that a tall son has a tall father?", uses only the tall sons: 446/661, or about 67.5%.) When we examine the table, we find that there are far more tall fathers and tall sons than short fathers and short sons. We could then ask "does this estimate of probability suggest that tall fathers have tall sons, i.e. some genetic or other relationship factor?". Overall we can see from the totals that 53.5% of our sample fathers are tall and 66.1% of the sons are tall, and if these two attributes were completely independent we might reasonably expect 53.5% x 66.1% of the father-son pairs to be both tall (applying the rule of multiplication for independent probabilities). This product is only 35.4%, so the observed 44.6% does suggest a relationship, but whether it is significant (i.e. highly unlikely to be a chance result) requires more careful analysis using a particular statistical technique, contingency table analysis.

Cross-classifications of this kind do not require numeric classes or classes derived from numeric values as in this example. In many instances the rows contain classes such as "Success, Failure" or "Survived, Died" and the columns might contain "Treatment A, Treatment B, Placebo, No treatment", with the table entries providing a count of the number of plants, patients etc. recorded in that combination of classes. In general such multivariate classification tables are restricted to 2-way, and occasionally 3-way, analysis, and the number of classes in each dimension of the classification is rarely large if analyzed in this manner; often it is 5 or fewer.
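The comparison between observed and expected proportions can be made explicit with a short calculation. The sketch below assumes the simplified 2x2 counts from the cross-tabulation (rows are sons, columns are fathers) and computes the counts expected under independence, together with the usual chi-square statistic as one possible summary of the discrepancy. Since the counts are a proportional reduction of the original data, the statistic is illustrative only, and the variable names are hypothetical.

```python
# Rows: son short, son tall; columns: father short, father tall
observed = [[250, 89],
            [215, 446]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]            # sons: 339, 661
col_totals = [sum(col) for col in zip(*observed)]      # fathers: 465, 535

# Under independence each expected count is (row total x column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

print("observed both tall:", observed[1][1] / n)
print("expected both tall:", expected[1][1] / n)
print("chi-square statistic:", chi_square)
```

A large value of the statistic relative to its reference distribution (one degree of freedom for a 2x2 table) would indicate that the departure from independence is unlikely to be a chance result.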

Frequency distributions can also be multi-dimensional. For example, the distribution of cases of a particular disease around a point source of contamination might be measured in distance bands and radial sectors around this location. This pattern might then be compared with a known bivariate frequency distribution, such as the bivariate Normal distribution. In three dimensions one could be looking at the distribution of bacteria in cheese, or the distribution of stars in a region of space.

## Class intervals, bins and univariate classification

If sampled data are measurements of a continuous variable, x, such as the light bulb lifespans described above, then the standard procedure in frequency chart (or histogram) production is to create a set of equal width class intervals (or bins) and count the frequencies occurring in each interval. The values at which the bins are separated are often referred to as cut-points. The number of intervals to be used is a matter for the researcher to determine, depending on the problem requirements. In interactive software packages this choice is often helped by viewing a display of the resulting histogram as different options are selected. For visualization purposes it is desirable to limit the number of classes to between 5 and 9, as large numbers of classes (20+) can be difficult to display and interpret with clarity, and an odd number of intervals will ensure there is a central class. On the other hand, with a large set of observations that exhibit considerable spread across the range, a larger number of classes may be more helpful and will avoid the problem of having much of the real variation hidden by large class intervals.
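The counting procedure described above can be sketched as follows. This is a minimal illustration that assumes half-open classes of the form [lower, upper), so that a value lying exactly on a cut-point is assigned to the class that starts there; the function name `frequency_table` and the sample values are hypothetical.

```python
def frequency_table(values, lower, width, n_classes):
    """Count values in equal-width, half-open classes [a, a + width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - lower) // width)   # class number, starting at 0
        if not 0 <= index < n_classes:
            raise ValueError(f"value {v} lies outside the class range")
        counts[index] += 1
    return [((lower + i * width, lower + (i + 1) * width), c)
            for i, c in enumerate(counts)]

# Hypothetical lifetimes (hours); 200 lies exactly on a cut-point
lifetimes = [150, 200, 399.9, 400, 1250, 1310]
for (a, b), count in frequency_table(lifetimes, lower=0, width=200, n_classes=17):
    if count:
        print(f"[{a}, {b}): {count}")
```

Raising an error for out-of-range values is a design choice made here so that no observation is silently dropped; an open-ended final class would be an alternative.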

There are several rules of thumb for determining the ideal number of bins and/or the width for fixed-width bins for real-valued continuous data. These include the following (n is the number of observations or data items to be grouped, k is the number of classes, h is the bin width, s is the standard deviation of the sample data, a standardized measure of average spread):

$$k = \left\lceil \frac{\max(x) - \min(x)}{h} \right\rceil \quad \text{or} \quad h = \frac{\max(x) - \min(x)}{k}$$

These options use the range and a pre-selected bin width to define the number of bins, k, or alternatively the number of bins is specified and the range used to determine the bin width, h. Note that if the distribution has a very long tail, e.g. a few data items that are very much larger or smaller than all the others, these formulas will produce excessively wide bins.

The next formula is due to Scott (1979, [SCO1]) and uses the standard deviation of the dataset, s, rather than the range to determine bin width:

$$h = \frac{3.5\,s}{n^{1/3}}$$

Thus for 1000 data items with a standard deviation of 25, h≈9. The number of bins still remains to be chosen, and this will again be a matter of choice, but in most cases one could safely use the range calculation for k, above. Scott's model is built on an analysis of the optimal properties of a binning arrangement with constant bin widths and an examination of the ideas of so-called kernel density estimation (KDE) techniques. The latter use all the data points to create a smooth estimated probability distribution (or probability density function), which has been shown to produce excellent results but may require a considerable amount of data processing.
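The rules of thumb discussed here are simple enough to express directly in code. The sketch below assumes the range-based rules take the form k = ceil(range/h) and h = range/k, and that Scott's rule is used in its commonly quoted form h = 3.5s/n^(1/3); the function names are hypothetical.

```python
import math

def number_of_bins(x_min, x_max, h):
    """Number of fixed-width bins needed to cover the range at width h."""
    return math.ceil((x_max - x_min) / h)

def bin_width(x_min, x_max, k):
    """Bin width obtained by dividing the range into k equal bins."""
    return (x_max - x_min) / k

def scott_bin_width(n, s):
    """Scott's rule (assumed constant 3.5): width from n and the SD, s."""
    return 3.5 * s / n ** (1 / 3)

h = scott_bin_width(n=1000, s=25)
print(round(h, 2))                     # close to 8.75, i.e. roughly 9
print(number_of_bins(0, 3400, 200))    # light bulb classes at 200 hours
print(bin_width(0, 3400, 17))
```

Scott's rule supplies only the width; the number of bins then follows from the range, which is why the first function is still needed.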

As mentioned earlier, if the frequencies are to be used in computations it is necessary to define a representative value for each interval, which is usually taken to be the mid-interval value. Thus if the bin width is h, and the mid-interval value is $x_i$, the interval has a range from $x_i - h/2$ to $x_i + h/2$. This assumption hides the within-interval variation in values, which may create some errors in computations, especially if the class widths are large. The so-called Sheppard's correction, named after its author William Sheppard (1897), is an adjustment to estimates of the variance when (Normally distributed) data are grouped into fixed width bins. Without correction the computations tend to over-estimate the variance since they effectively treat all values in a range as the same as the mid-value. Sheppard's correction to the variance is $-h^2/12$, an amount that is the variance of the Uniform distribution defined over an interval of width h.
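A grouped variance with and without Sheppard's correction can be computed as in the following sketch. It assumes equal-width classes represented by their midpoints and uses the n-1 divisor for the sample variance, which is an assumption of this example rather than something specified in the text; the function name and the small data set are hypothetical.

```python
def grouped_variance(midpoints, freqs, h, sheppard=True):
    """Variance from grouped data; optionally subtract h^2/12."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    # Every observation is treated as lying at its class midpoint
    variance = sum(f * (m - mean) ** 2
                   for m, f in zip(midpoints, freqs)) / (n - 1)
    return variance - h ** 2 / 12 if sheppard else variance

# Hypothetical grouped data: three classes of width 200
mids, counts, width = [100, 300, 500], [1, 2, 1], 200
raw = grouped_variance(mids, counts, width, sheppard=False)
corrected = grouped_variance(mids, counts, width)
print(raw, corrected, raw - corrected)
```

The difference between the two results is always h^2/12, so the correction matters most when the class width is large relative to the spread of the data.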

The table below provides details of a number of univariate classification schemes together with comments on their use. Such schemes are essentially a generalization of fixed-width binning. Many statistical software packages provide classification options of the types listed, although some (such as the box, Jenks and percentile methods) are only available in a limited number of software tools.

The scheme described in the table as Natural breaks or Jenks' method is an automated procedure utilizing the following algorithm:

Step 1: The user selects the attribute, x, to be classified and specifies the number of classes required, k

Step 2: A set of k‑1 random or uniform values is generated in the range [min{x},max{x}]. These are used as initial class boundaries or 'cut points'

Step 3: The mean values for each initial class are computed and the sum of squared deviations of class members from the mean values is computed. The total sum of squared deviations (TSSD) is recorded

Step 4: Individual values in each class are then systematically assigned to adjacent classes by adjusting the class boundaries to see if the TSSD can be reduced. This is an iterative process, which ends when improvement in TSSD falls below a threshold level, i.e. when the within class variance is as small as possible and between class variance is as large as possible. True optimization is not assured. The entire process can be optionally repeated from Step 1 or 2 and TSSD values compared
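The four steps above can be sketched as a simple local search. This is an illustrative reading of the procedure, not the implementation used by any particular package: it assumes uniform initial cut points, moves one value at a time across a class boundary, and accepts a move only if the TSSD falls. As the text notes, the result may be a local rather than a global optimum. The function names are hypothetical.

```python
from bisect import bisect_left

def tssd(values, breaks):
    """Total sum of squared deviations of sorted values from class means."""
    edges = [0] + list(breaks) + [len(values)]
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        members = values[lo:hi]
        mean = sum(members) / len(members)
        total += sum((v - mean) ** 2 for v in members)
    return total

def natural_breaks(data, k):
    """Return (classes, TSSD) for k classes found by boundary adjustment."""
    values = sorted(data)
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and the number of values")
    lo, hi = values[0], values[-1]
    # Step 2: uniform cut points, converted to positions in the sorted list
    breaks = [bisect_left(values, lo + (hi - lo) * i / k) for i in range(1, k)]
    # Nudge positions so that every class holds at least one value
    for i in range(len(breaks)):
        breaks[i] = max(breaks[i], (breaks[i - 1] if i else 0) + 1)
    for i in reversed(range(len(breaks))):
        upper = breaks[i + 1] if i < len(breaks) - 1 else n
        breaks[i] = min(breaks[i], upper - 1)
    best = tssd(values, breaks)                     # Step 3
    improved = True
    while improved:                                 # Step 4
        improved = False
        for i in range(len(breaks)):
            for step in (-1, 1):
                trial = breaks.copy()
                trial[i] += step
                lower = trial[i - 1] if i else 0
                upper = trial[i + 1] if i < len(trial) - 1 else n
                if not lower < trial[i] < upper:
                    continue                        # would empty a class
                score = tssd(values, trial)
                if score < best - 1e-12:
                    breaks, best, improved = trial, score, True
    edges = [0] + breaks + [n]
    classes = [values[a:b] for a, b in zip(edges[:-1], edges[1:])]
    return classes, best

classes, score = natural_breaks([0, 4, 5, 6, 7, 8, 9, 10], k=2)
print(classes, score)
```

Repeating the search from several different starting cut points and keeping the lowest TSSD, as suggested in Step 4, would reduce the risk of stopping at a poor local optimum.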

Univariate binning/classification schemes

Classification scheme

Description/application

Unique values

Each value is treated separately — this is effectively a nominal data classification model

Manual classification

The analyst specifies the boundaries between classes/bins as a list, or specifies a lower bound and interval or lower and upper bound plus number of intervals required. This approach is widely used in statistical software packages

Equal interval

The attribute values are divided into n classes with each interval having the same width=range/n

Exponential interval

Intervals are selected so that the number of observations in each successive interval increases (or decreases) exponentially

Equal count or quantile

Intervals are selected so that the number of observations in each interval is the same. If each interval contains 25% of the observations the result is known as a quartile classification. Ideally the procedure should indicate the exact numbers assigned to each class, since they will rarely be exactly equal

Percentile

In the standard version equal percentages (percentiles) are included in each class, e.g. 20% in each class. In some implementation of percentile plots (specifically designed for exploratory data analysis, EDA) unequal numbers are assigned to provide classes that, for example, contain 6 intervals: <=1%, >1% to <10%, 10% to <50%, 50% to <90%, 90% to <99% and >=99%

Natural breaks/Jenks

Used within some software packages, these are forms of variance-minimization classification. Breaks are typically uneven, and are selected to separate values where large changes in value occur. May be significantly affected by the number of classes selected and tends to have unusual class boundaries. Typically the method applied is due to Jenks, as described in Jenks and Caspall (1971, [JEN1]), which in turn follows Fisher (1958, [FIS1]). Very useful for visualization work, but unsuitable for comparisons

Standard deviation (SD)

The mean and standard deviation of the data are calculated, and values classified according to their deviation from the mean (z-transform). The transformed values are then grouped into classes, usually at intervals of 1.0 or 0.5 standard deviations. Note that this often results in no central class, only classes either side of the mean and the number of classes is then even. SD classifications in which there is a central class (defined as the mean value +/-0.5SD) with additional classes at +/- 1SD intervals beyond this central class, are also used

Box

A variant of quartile classification designed to highlight outliers, due to Tukey (1977, Section 2C, [TUK1]). Typically six classes are defined, these being the four quartile classes plus two further classes for outliers. These outliers are defined as data items (if any) that lie more than 1.5 times the inter-quartile range (IQR) beyond the lower or upper quartile. An even more restrictive set is defined by 3.0 times the IQR. A slightly different formulation is sometimes used to determine these box ends or hinge values
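As a rough illustration of several of the schemes listed above (equal interval, quantile, standard deviation and box), the following Python sketch uses only the standard library. The function names and the sample data are hypothetical, chosen purely for illustration. The box classes assume the common Tukey formulation in which the 1.5 x IQR fences are measured from the lower and upper quartiles, and software packages may compute the quartiles themselves slightly differently.

```python
import statistics

def equal_interval_breaks(values, n):
    """Upper class boundaries for n classes of equal width = range / n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [lo + width * k for k in range(1, n + 1)]

def quantile_breaks(values, n):
    """The n - 1 interior cut points giving roughly equal counts per class."""
    return statistics.quantiles(values, n=n, method="inclusive")

def sd_class(value, mean, sd, step=1.0):
    """Signed class index from the z-score: ..., -2, -1, +1, +2, ...

    There is no central class: values at or above the mean are positive.
    """
    z = (value - mean) / sd
    k = int(abs(z) // step) + 1
    return k if z >= 0 else -k

def box_class(value, q1, q2, q3, k=1.5):
    """One of six classes: four quarters plus lower and upper outliers."""
    iqr = q3 - q1
    if value < q1 - k * iqr:
        return "lower outlier"
    if value > q3 + k * iqr:
        return "upper outlier"
    if value <= q1:
        return "quarter 1"
    if value <= q2:
        return "quarter 2"
    if value <= q3:
        return "quarter 3"
    return "quarter 4"

# Hypothetical attribute values with one unusually large observation
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 42]

print(equal_interval_breaks(data, 4))   # [12.0, 22.0, 32.0, 42.0]
quartiles = quantile_breaks(data, 4)
print(quartiles)                        # [4.25, 7.5, 10.5]
print([box_class(v, *quartiles) for v in data])

mean, sd = statistics.mean(data), statistics.stdev(data)
print([sd_class(v, mean, sd) for v in data])
```

With these ten values the single large observation (42) lies above the upper fence, 10.5 + 1.5 x 6.25 = 19.875, so it is assigned to the upper outlier class. The equal interval scheme, by contrast, places nine of the ten values in the first class and leaves the two middle classes empty, which shows how sensitive that scheme is to extreme values.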

Supervised binning and classification

Some statistical software packages differentiate between unsupervised and supervised schemes. These terms have different meanings within different packages and application areas, which can be confusing. In broad terms an unsupervised method utilizes the data directly, whereas a supervised method cross-refers the sample data to some other dataset that is already divided into a number of distinct classes or categories. It then uses this other dataset to guide (or supervise) the classification process.

In SPSS, for example, supervised (or optimal) binning refers to a procedure in which the source data is divided into bins using cut-points that seek to minimize the mix of a separate, but linked, nominal variable in each bin. For example, the variable to be binned might be household income in $000s p.a., and the supervisor or control variable might be the level of education achieved by the main earner of the household. The principal technique used, known as MDLP (minimum description length principle), starts by placing every (sorted) data item (observation) into a single large bin. The bin is then divided using cut-points, and the mix of the linked nominal variable in each bin is examined (using an Entropy or Diversity statistic). If every entry in the bin has the same linked nominal category then the Entropy measure will be 0, which is regarded as optimal with respect to the nominal variable. On the other hand, if a large mix of nominal categories is represented, in roughly equal numbers, the bin will have a higher Entropy score. The algorithm adjusts the cut points and increases the number of cut points (and hence bins) to achieve an improvement in the total Entropy of the binning process.
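The entropy idea behind this kind of supervised binning can be sketched in a few lines of Python. This is not the SPSS implementation and it omits the MDLP stopping rule: it only shows how the Shannon entropy of the linked nominal variable is computed for a bin, and how a single cut-point might be chosen to minimize the size-weighted entropy of the two resulting bins. The function names and the income and education data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the nominal categories within one bin."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def weighted_entropy(values, labels, cut):
    """Size-weighted entropy of the two bins formed by splitting at `cut`."""
    left = [lab for v, lab in zip(values, labels) if v <= cut]
    right = [lab for v, lab in zip(values, labels) if v > cut]
    n = len(labels)
    return sum(len(b) / n * entropy(b) for b in (left, right) if b)

def best_cut(values, labels):
    """Try midpoints between adjacent distinct values; keep the lowest score."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda c: weighted_entropy(values, labels, c))

# Hypothetical data: household income ($000s p.a.) and education of main earner
income = [18, 22, 25, 31, 48, 55, 62, 70]
education = ["school"] * 4 + ["degree"] * 4

print(entropy(education))           # 1.0 (two categories in equal numbers)
print(entropy(education[:4]))       # 0.0 (a single category: a 'pure' bin)
print(best_cut(income, education))  # 39.5
```

In this contrived example a single cut at 39.5 separates the two education categories completely, so both bins have an entropy of 0. With real data the categories overlap, and a full procedure would split bins recursively and decide when further cut-points are no longer worthwhile.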

In remote-sensing applications (for example, multi-spectral satellite imagery) the task is to classify individual image pixels into groups, which may be pre-defined (e.g. land use categories, such as Forest, Grasslands, Buildings, Water etc) or derived from the data. Unsupervised classification in this instance refers to the use of wholly automated procedures, such as K-means clustering, in order to group similar pixels. Supervised classification refers to a multi-stage process, in which the dataset is compared to a reference dataset that has already been classified, and the similarity between pixels in the dataset to be classified and the reference set is used as a means for achieving the 'best' classification. Clearly procedures such as this, which arise in a number of disciplines, essentially belong in the realm of multivariate data classification, which may or may not use statistical techniques and measures as part of that process.
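To make the unsupervised case more concrete, here is a minimal sketch of K-means clustering applied to a handful of single-band pixel values. Real imagery is multi-band and would normally be processed with a specialist library; the pixel values, the starting centres and the function name `kmeans_1d` are all assumptions made for illustration.

```python
def kmeans_1d(values, centres, iterations=20):
    """Basic K-means (Lloyd's algorithm) on one-dimensional data.

    Each value is assigned to its nearest centre, then every centre is
    moved to the mean of the values assigned to it, and this is repeated.
    """
    centres = list(centres)
    groups = [[] for _ in centres]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # An empty group keeps its previous centre
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres, groups

# Hypothetical brightness values for six pixels, with two rough starting centres
pixels = [10, 12, 11, 50, 52, 51]
centres, groups = kmeans_1d(pixels, centres=[0, 100])
print(centres)  # [11.0, 51.0]
print(groups)   # [[10, 12, 11], [50, 52, 51]]
```

The two groups found here have no names attached: deciding that one group is, say, Water and the other Forest is a separate interpretive step, which is where a supervised approach would instead rely on an already classified reference dataset.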

Scale and arrangement

In the preceding subsections we have seen that determining the ideal number and size of bins can be quite a complicated exercise. It was noted that with too many bins only frequencies of 1 and 0 would be recorded, whereas with very few bins, almost all the variation in the data would be hidden within the bin, or class, with little or no variation detectable between classes. This is often the exact opposite of the ideal classification or grouping scheme, where the aim is generally to minimize within-class variance as compared to between-class variance, making sure that classes or groupings are as homogeneous as possible. Two additional, and somewhat unexpected, factors come into play when such groupings are made. These are known as the statistical effect and the arrangement effect.

To understand the statistical effect (which is a scale or grouping effect) look at the regional employment statistics shown in the Table below (after de Smith et al., 2018, [DES1]). Areas A and B each contain a total of 100,000 people, who are classified as either employed or not. In Area A 10% of both Europeans and Asians are unemployed (i.e. equal proportions), and likewise in Area B we have equal proportions (this time 20% unemployed). So we might expect that combining Areas A and B will give us 200,000 people, with an equal proportion of Europeans and Asians unemployed (we would guess this to be 15%), but this is not the case: 13.6% of Europeans and 18.3% of Asians are seen to be unemployed! The reason for this unexpected result is that in Area A there are many more Europeans than Asians, so the combined rate for each group is a weighted average of the two area rates with different weights. The European rate is weighted towards low-unemployment Area A, whereas most of the Asian population is in high-unemployment Area B.

Regional employment data — grouping effects

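| Area | Group | Employed (000s) | Unemployed (000s) | Total (000s) | Unemployed (%) |
|---|---|---|---|---|---|
| Area A | European | 81 | 9 | 90 | 10% |
| | Asian | 9 | 1 | 10 | 10% |
| | Total | 90 | 10 | 100 | 10% |
| Area B | European | 40 | 10 | 50 | 20% |
| | Asian | 40 | 10 | 50 | 20% |
| | Total | 80 | 20 | 100 | 20% |
| Areas A and B | European | 121 | 19 | 140 | 13.6% |
| | Asian | 49 | 11 | 60 | 18.3% |
| | Total | 170 | 30 | 200 | 15% |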
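The arithmetic behind the combined figures can be checked with a short Python sketch. The counts are those given in the regional employment table; the dictionary layout and function names are simply one convenient (assumed) way of arranging them.

```python
# (employed, unemployed) in thousands, from the regional employment table
areas = {
    "A": {"European": (81, 9), "Asian": (9, 1)},
    "B": {"European": (40, 10), "Asian": (40, 10)},
}

def unemployment_rate(employed, unemployed):
    """Unemployed as a percentage of the group total."""
    return 100 * unemployed / (employed + unemployed)

def combined(areas, group):
    """Sum the employed and unemployed counts for one group over all areas."""
    employed = sum(a[group][0] for a in areas.values())
    unemployed = sum(a[group][1] for a in areas.values())
    return employed, unemployed

for group in ("European", "Asian"):
    per_area = [unemployment_rate(*areas[name][group]) for name in areas]
    overall = unemployment_rate(*combined(areas, group))
    print(group, per_area, round(overall, 1))
# European [10.0, 20.0] 13.6
# Asian [10.0, 20.0] 18.3
```

Within each area the two groups have identical rates, yet the combined rates differ because they are population-weighted averages: for Europeans (90 x 10% + 50 x 20%) / 140 = 13.6%, and for Asians (10 x 10% + 50 x 20%) / 60 = 18.3%.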

There is a further, less well-known problem, which has particular importance in the process of elections and census data collection but also has much wider implications. This is due to the way in which voting and census areas are defined. Their shape, and the way in which they are aggregated, affects the results and can even change which party is elected. The Grouping Data diagram below illustrates this issue for an idealized region consisting of 9 small voting districts. The individual zone, row, column and overall total number of voters are shown in diagram A, with a total of 1420 voters of whom roughly 56% (800) will vote for the Red party (R) and 44% (620) for the Blue party (B). With 9 voting districts we expect roughly 5 to be won by the Reds and 4 by the Blues, as is indeed the case in this example. However, if these zones are not the voting districts themselves, and combinations of the zones are instead used to define the voting areas, then the results may be quite different. As diagrams B to F show, with a voting system of “first past the post” (majority in a voting district wins the district) the outcome could range from one in which every district is won by the Reds (Case C) to one in which 75% of the districts are won by the Blues (Case F). So it is not just the process of grouping that generates confusing results, but also the pattern of grouping. We are rarely informed of the latter problem, although it is one that is of great interest to those responsible for defining and revising electoral and census district boundaries.

Grouping Data — Zone arrangement effects on voting results
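The arrangement effect can be reproduced with a very small Python sketch. The vote counts below are hypothetical and are not the figures from the diagram: they describe a 3 x 3 grid of zones that is grouped into three districts, first by rows and then by columns. The function name `district_winners` is likewise an assumption made for illustration.

```python
# Hypothetical (red, blue) votes for a 3 x 3 grid of zones (not the diagram's data)
zones = [
    [(90, 10), (90, 10), (90, 10)],
    [(40, 60), (40, 60), (40, 60)],
    [(40, 60), (40, 60), (40, 60)],
]

def district_winners(districts):
    """First-past-the-post winner of each district (a district is a list of zones)."""
    winners = []
    for district in districts:
        red = sum(r for r, _ in district)
        blue = sum(b for _, b in district)
        # Ties do not occur in this example; they would be recorded as "B" here
        winners.append("R" if red > blue else "B")
    return winners

by_rows = [list(row) for row in zones]
by_columns = [list(col) for col in zip(*zones)]

print(district_winners(by_rows))     # ['R', 'B', 'B']
print(district_winners(by_columns))  # ['R', 'R', 'R']
```

The same 900 votes (510 Red, 390 Blue) give the Reds one district out of three when zones are grouped by rows, but all three when they are grouped by columns. Nothing about the voters has changed, only the pattern of aggregation.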

This is not just a problem confined to voting patterns and census data. For example, suppose the information being gathered relates to the average levels of lead and zinc in the soil within each field. Samples based on different field boundaries would show that in some arrangements the average proportion of lead in the soil exceeded that of zinc, whilst other arrangements would show the opposite results.

References

[DES1] de Smith M J, Goodchild M F, Longley P A (2018) Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools. 6th edition, The Winchelsea Press, UK. Available from: https://www.spatialanalysisonline.com/

[FIS1] Fisher W D (1958) On grouping for maximal homogeneity. J. of the American Statistical Association, 53, 789-98

[GLA1] Gladwell M (2008) Outliers: the story of success. Allen Lane/Penguin, London

[JEN1] Jenks G F, Caspall F C (1971) Error on choroplethic maps: Definition, measurement, reduction. Annals of the Association of American Geographers, 61, 217-44

[MAR1] Mardia K V, Jupp P E (1999) Directional statistics. 2nd ed., John Wiley, Chichester

[PEA1] Pearson E S (1933) A Survey of the Uses of Statistical Method in the Control and Standardization of the Quality of Manufactured Products. J. Royal Stat. Soc., 96,1, 21-75

[PEA2] Pearson K, Lee A (1903) On the Laws of Inheritance in Man: I. Inheritance of Physical Characters. Biometrika, 2(3), 357-462

[SCO1] Scott D W (1979) On optimal and data-based histograms. Biometrika 66,3, 605–610