Exploratory Data Analysis


In the late 1960s and early 1970s computers began to have more friendly user interfaces: better input, in terms of keyboards and user-controlled data entry stations, and better output in the form of print and early visual displays. Obtaining visually usable output was still difficult, as Anscombe (1973, [ANS1]) wrote:

The user is not showered with graphical displays. He can get them only with trouble, cunning and a fighting spirit!

One result of changes in accessibility to computers and improvements to input/output devices was the growth in statistical software that could take advantage of these developments. In particular, it became much easier to obtain multiple analyses of the same dataset, with output in multiple formats - tables, graphs, summary statistics and so forth. Statisticians started to investigate data in a more experimental manner, analyzing larger datasets involving multiple variables. This in turn led them to question whether the well-established approach of formal experimental design and hypothesis testing was always the right approach - in many instances a more experimental, investigative, exploratory stance might be extremely beneficial. And central to this new approach were improvements in graphical output options.

In the sections below we start with Anscombe's paper, in which he presented a convincing argument for the use of graphics as an essential part of statistical analysis (see also, Graphics and Visualization topic). We then address the broader question of exploratory versus confirmatory analysis, brought to the fore by J W Tukey, one of the leading statisticians of the post-War era (and who, coincidentally, was married to Anscombe's sister-in-law).

Anscombe's quartet

In 1973 the statistician Frank Anscombe published an article in The American Statistician, which became widely known through his Yale colleague Edward Tufte's book [TUF1]. Anscombe's paper highlighted the need to combine basic exploratory procedures, notably simple graphs, with numerical measures such as means and variances. He gave the example of four datasets (sometimes referred to as the Anscombe quartet) which appear essentially identical when summarized by their means, variances and correlation coefficients, but which are clearly very different when graphed. The datasets are shown below, followed by graphs of each set of data pairs with a best fit line through each:

Anscombe data sets (1973) - data

Datasets:

                      x1       y1       x2       y2       x3       y3       x4       y4
                  10.000    8.040   10.000    9.140   10.000    7.460    8.000    6.580
                   8.000    6.950    8.000    8.140    8.000    6.770    8.000    5.760
                  13.000    7.580   13.000    8.740   13.000   12.740    8.000    7.710
                   9.000    8.810    9.000    8.770    9.000    7.110    8.000    8.840
                  11.000    8.330   11.000    9.260   11.000    7.810    8.000    8.470
                  14.000    9.960   14.000    8.100   14.000    8.840    8.000    7.040
                   6.000    7.240    6.000    6.130    6.000    6.080    8.000    5.250
                   4.000    4.260    4.000    3.100    4.000    5.390   19.000   12.500
                  12.000   10.840   12.000    9.130   12.000    8.150    8.000    5.560
                   7.000    4.820    7.000    7.260    7.000    6.420    8.000    7.910
                   5.000    5.680    5.000    4.740    5.000    5.730    8.000    6.890

Mean               9.000    7.501    9.000    7.501    9.000    7.500    9.000    7.501
Variance          11.000    4.127   11.000    4.128   11.000    4.123   11.000    4.123
Correlation, r         0.816             0.816             0.816             0.817

Anscombe data sets (1973) - graphs

[Figure: scatter plots of the four datasets, each with its best fit regression line]

The graphs show examples where: (a) the data lie approximately on a straight line; (b) the data lie on a smooth (polynomial) curve, so the linear model assumption appears incorrect; (c) the data lie on a straight line apart from a single distinct outlier that warrants further investigation - without this outlier a different straight line fit would be obtained; and (d) the entire slope of the regression line is accounted for by a single extreme value. A linear regression on any of the four datasets yields essentially the same result, with a best fit line of approximately y = 3.0 + 0.5x.
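As a quick check of this numerical near-identity, the short sketch below (ours, not Anscombe's; it assumes NumPy is available) recomputes the summary statistics and the least-squares fit for each dataset:

```python
# A minimal sketch (not Anscombe's original code): recompute the summary
# statistics and the least-squares fit for each of the four datasets.
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x values shared by sets 1-3
datasets = {
    "1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in datasets.items():
    x, y = np.asarray(x, float), np.asarray(y, float)
    slope, intercept = np.polyfit(x, y, 1)       # least-squares best fit line
    r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation
    print(f"Set {name}: mean y = {y.mean():.3f}, var y = {y.var(ddof=1):.3f}, "
          f"r = {r:.3f}, fit: y = {intercept:.2f} + {slope:.3f}x")
```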

Anscombe suggests that this simple example demonstrates why graphing data is an essential part of statistical investigation. He also recommends examination of the residuals, i.e. the differences between the observed values of y and the values predicted by the best fit line. A graph of the residuals, as in the first set of graphs below, helps identify whether there is a broadly random spread of residuals (as in the first graph) or some apparent pattern that may or may not be obvious from the source data (as in the subsequent graphs). Essentially these graphs remove the regression equation from the data, leaving a set of values spread about 0 that is easier to interpret.
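A minimal sketch of this residual check for the first dataset follows (an assumed workflow, not code from the Handbook or from Anscombe's paper):

```python
# A minimal sketch of the residual check for the first dataset (an assumed
# workflow): subtract the fitted line from the observed values.
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

slope, intercept = np.polyfit(x, y, 1)      # least-squares fit
residuals = y - (intercept + slope * x)     # observed minus fitted values
print(np.round(residuals, 2))               # spread roughly at random about 0
```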

Anscombe data - residuals vs x values

[Figure: residuals plotted against x for each of the four datasets]

Furthermore, because linear regression assumes that the residuals are random, Normally distributed errors, confidence intervals can be computed for each residual and plotted. The width of these intervals depends on the level of confidence chosen - in theory a 100% confidence interval would be of infinite extent, a 95% interval would be quite wide, and a 10% interval would be very narrow. In the diagram below we have plotted the 90% intervals (or prediction intervals) and residual values for each of the 11 observations (cases) in the first of Anscombe's datasets. As can be seen, almost all of the residuals (identified by the circles in the diagram) lie reasonably close to the center line (0 residual), and the intervals (identified by the vertical bars) in all but two instances straddle the 0 line. The two red lines (cases 3 and 8) just fail to straddle this line, so these cases might possibly be unusual, but with a 95% interval, which is wider, the intervals do straddle the center line, suggesting that the residuals could reasonably be put down to random variation.

[Figure: residuals and 90% intervals for the 11 observations in the first of Anscombe's datasets]
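The Handbook does not state the exact interval formula behind the figure above; the hedged sketch below shows one common construction, a t-based 90% interval built from each residual's standard error s*sqrt(1 - h_ii), where h_ii is the leverage of observation i (SciPy is assumed to be available, and the result may differ from the plotted figure):

```python
# A hedged sketch of one way to attach intervals to residuals; the exact
# construction used in the Handbook's figure is not stated, so the t-based
# interval below (half-width = t * s * sqrt(1 - h_ii)) is an assumption.
import numpy as np
from scipy import stats

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], float)
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

n = len(x)
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
s = np.sqrt(resid @ resid / (n - 2))                           # residual std error
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)  # leverages
half_width = stats.t.ppf(0.95, n - 2) * s * np.sqrt(1 - h)     # 90% half-widths

for i, (r, w) in enumerate(zip(resid, half_width), start=1):
    note = "" if abs(r) <= w else "  <- interval does not straddle 0"
    print(f"case {i:2d}: residual {r:6.2f}, interval +/- {w:4.2f}{note}")
```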

Tools such as simple graphs of datasets and residuals assist in identifying outliers, detecting unusual patterns, selecting an appropriate model, revising hypotheses, and formulating sounder interpretations of possible causal relationships. They might suggest data transformation or the adoption of a completely different approach to analysis, or highlight the need for additional data collection. However, these kinds of graphics have their limitations. Anscombe recognized the desirability of finding graphical means for handling multivariate data (i.e. beyond repeated pairwise graphics), but he also appreciated the technical difficulties of addressing such questions at the time. He suggests the use of what he calls a triple scatterplot, or TSCP (described in Tukey's 1970-71 publication, [TUK1]), which is essentially what we now call a bubble plot - a scatter plot with point sizes varying to reflect the magnitude of a third variable. However, for more complex datasets such approaches have their limitations, and even today visualization of high-dimensional datasets presents considerable difficulties.
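For illustration, a bubble plot of this kind can be produced with an ordinary scatter plot whose marker sizes encode the third variable; the data and size scaling below are arbitrary (matplotlib is assumed to be available):

```python
# A minimal sketch of a bubble plot: a scatter plot whose point sizes encode a
# third variable. The data and scaling here are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)
z = rng.uniform(1, 10, size=50)            # third variable, shown as point size

plt.scatter(x, y, s=30 * z, alpha=0.5)     # marker area scaled by z
plt.xlabel("x"); plt.ylabel("y"); plt.title("Bubble plot: point size ~ z")
plt.show()
```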

Exploration versus Confirmation

Tukey (1977, [TUK1]) suggested that much of statistics should commence with exploratory data analysis (EDA), with statisticians cast in the role of data detectives. Instead of commencing with a formal statement of the problem to be solved, followed by an experimental design that would enable the question(s) to be addressed, one would start with a general idea of the problem and an open and skeptical mind, and then iteratively explore the problem and any data available (or initial samples thereof), developing the problem definition and design as understanding grows. Those readers who have studied the section entitled The Statistical Method earlier in this Handbook will immediately recognize Tukey's central contribution to the modern approach to statistical analysis. Part of this emphasis on exploration is a recognition that many techniques in classical statistics are not particularly robust - they often make assumptions about the underlying distribution of the data or errors that are rarely met, or that can only be met after the data has been examined and 'cleaned', outliers detected and handled (e.g. corrected, removed or analyzed separately), missing values catered for, transformations undertaken, model specifications improved and so forth.

So, for Tukey, EDA was an approach or attitude rather than a set of tools. However, he did introduce a number of tools, including the box plot and the stem-and-leaf plot (now little used), as graphical techniques designed to support such detective work. As computing power has increased and visual interfaces have improved, new families of approaches to EDA have been developed. These include: new graphical techniques, including multiple (matrix or lattice) plots, tree-structured graphs, parallel coordinate plots, 3D visualizations and interactive graphics; dimensionality reduction techniques, such as factor analysis and principal components analysis, designed to simplify large and complex multivariate datasets so that their essential components can be identified; cluster detection procedures, such as K-means and hierarchical clustering, designed to locate and consolidate clusters of multidimensional data; and robust statistical techniques, designed to cater for departures from underlying assumptions and/or to be less affected by small changes to the dataset. These various procedures can now be augmented with purely computational methods, whereby the computer environment is used as a form of experimental laboratory, enabling a wide variety of statistical and non-statistical simulations to be carried out. As such, EDA can be seen as forming part of a much larger field of scientific investigation, which may or may not involve statistical analysis as an intrinsic part of the research and modeling process.
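As a brief illustration of two of these families, the sketch below (illustrative only; scikit-learn and a synthetic dataset are assumed) applies principal components analysis followed by K-means clustering:

```python
# A minimal sketch of dimensionality reduction (PCA) followed by cluster
# detection (K-means) on a synthetic multivariate dataset.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 5))   # three loose clusters
               for c in (0.0, 2.0, 4.0)])

scores = PCA(n_components=2).fit_transform(X)    # project onto 2 components
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(scores)
print(np.bincount(labels))                       # sizes of the detected clusters
```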

In addition to emphasizing visualization as a valid and important form of data analysis, Tukey highlighted the importance of using robust (or resilient) summary measures, such as the median and quartiles, rather than the mean and standard deviation, and the need to focus attention on outliers and residuals obtained following initial data examination. As an example, Tukey proposed a robust best fit line procedure in which "the data points are divided into three groups according to smallest, middle, or largest x values, and the line with an equal number of points above and below it in each of the outer groups is fitted. The resulting parameter estimates are resistant to the effects of data points extreme in y, or x, or both" [JON1].
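The sketch below implements a simplified, non-iterative version of this idea, often called the median-median (or three-group) line; the full resistant line procedure described in [JON1] iterates until the outer groups are balanced about the line, so this is only a first pass:

```python
# A hedged sketch of a simple three-group "median-median" line, a non-iterative
# first pass at the resistant line Tukey describes.
import numpy as np

def median_median_line(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    third = n // 3
    left, right = slice(0, third), slice(n - third, n)   # outer thirds by x
    xl, xr = np.median(x[left]), np.median(x[right])
    yl, yr = np.median(y[left]), np.median(y[right])
    slope = (yr - yl) / (xr - xl)
    intercept = np.median(y - slope * x)   # level the line on all residuals
    return intercept, slope

# Anscombe's third dataset: the resistant fit is far less affected by the
# single outlier than the least-squares fit is.
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
print(median_median_line(x3, y3))   # compare with the least-squares fit above
```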

Finally, Tukey highlighted the value of data transformation (or re-expression) as a means of clarifying features that might otherwise remain hidden (see also, Freeman-Tukey transforms). For example, datasets whose distributions contain a wide range of frequencies can be difficult to decipher, especially when graphed. He introduced a variant of the standard histogram, which he calls the hanging rootogram, in which the frequency scale is transformed by taking the square root of the frequencies, and the plotted bars are adjusted so that their tops align with the distribution to which the data is being fitted, rather than with the x-axis (see illustration below).

Rootogram of sample data and fitted Poisson distribution

[Figure: hanging rootogram of sample data against a fitted Poisson distribution]

Source: based on the rootogram function in the R library vcd (visualizing categorical data), after M Friendly (2000, [FRI1])
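The arithmetic behind a hanging rootogram is straightforward; the sketch below (in Python rather than the vcd package in R cited above) hangs square-root observed counts from the square-root expected counts of a Poisson distribution fitted by its mean:

```python
# A minimal sketch of the hanging rootogram idea: bars of sqrt(observed counts)
# are hung from sqrt(expected counts) of a fitted Poisson, so lack of fit shows
# as bar bottoms that depart from the zero line.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.poisson(3.0, size=200)                    # sample data to be fitted
values, observed = np.unique(data, return_counts=True)

lam = data.mean()                                    # Poisson fitted by its mean
expected = len(data) * stats.poisson.pmf(values, lam)

hang_bottom = np.sqrt(expected) - np.sqrt(observed)  # where each hanging bar ends
for v, top, bottom in zip(values, np.sqrt(expected), hang_bottom):
    print(f"count {v}: bar from {top:.2f} down to {bottom:.2f}")
```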

Until recently, EDA and data mining tools have not offered support for spatial or spatio-temporal EDA (ESDA, ESTDA). However, tools are now available, such as those providing network spatialization and visualization (e.g. Gephi) and neural network mapping (e.g. the SOM Toolbox). The simplest form of ESDA involves the computation of statistical summaries of spatial attribute tables and grid values. Useful references on ESDA include the GeoDa workbook and the GeoViz toolkit documentation. Graphical analysis of such data tends to rely on histograms, pie charts, box plots and/or scatter plots, all of which have been described in preceding topics in this Handbook. None of these provides an explicitly spatial perspective on the data. However, where such facilities are dynamically linked to mapped and tabular views of the data, they can provide a powerful toolset for ESDA purposes. The selection of objects through such linking may be programmatically defined (e.g. all values lying more than 2 standard deviations from the mean) or user defined, often by graphical selection. The latter is known as brushing, and generally involves selecting a number of objects (e.g. points) from a graphical or mapped representation. Selected features are automatically highlighted (linked) in the other views of the dataset.
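As a small illustration of a programmatically defined selection of the kind mentioned above, the sketch below (illustrative data only) flags attribute values lying more than 2 standard deviations from the mean, which a linked-view system could then highlight in its map, table and chart views:

```python
# A minimal sketch of a programmatically defined selection: flag values lying
# more than 2 standard deviations from the mean for linked highlighting.
import numpy as np

values = np.random.default_rng(3).normal(50, 10, size=500)  # illustrative attribute
z = (values - values.mean()) / values.std(ddof=1)
selected = np.abs(z) > 2                                     # selection mask
print(f"{selected.sum()} of {len(values)} features selected for highlighting")
```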

References

[AND1] Andrienko N, Andrienko G (2005) Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach. Springer-Verlag

[ANS1] Anscombe F J (1973) Graphs in Statistical Analysis. American Statistician, 27, 17-21

[FRI1] Friendly M (2000) Visualizing Categorical Data. SAS Institute, Cary, NC.

[JON1] Johnstone I M, Velleman P F (1985) The Resistant Line and Related Regression Methods. J Amer. Statistical Association, 80, 392, 1041-1054

[NIST] NIST eHandbook: EDA: http://www.itl.nist.gov/div898/handbook/eda/section3/eda.htm

[TUF1] Tufte E R (2001) The Visual Display of Quantitative Information. 2nd ed., Graphics Press, Cheshire, Connecticut

[TUK1] Tukey J W (1977) Exploratory Data Analysis. Addison-Wesley (originally published in a limited edition version in 1970-71)