The use of graphs and diagrams has long been an integral part of statistics. Their misuse has an equally long tradition, as has previously been discussed (see Misuse, misinterpretation and bias). Visualization is a broader term, being an extension of conventional graphics to 3D, interactive and video forms of communication. Collectively, visualization tools provide facilities both for exploratory data analysis (EDA, [TUK1]) and for communication of findings to others [CHA1, FRI1]. In this topic we look briefly at the main forms of visualization and how these can be used for each of these two distinct purposes. There are a number of excellent online galleries that provide examples of good visualization, in many instances with associated open-source code (e.g. in Python or R). Links to a selection of these galleries are provided at the end of this topic.
In the context of communicating to others, the work of Edward Tufte has been particularly influential in the last 30 years, notably his self-published book "The visual display of quantitative information" [TUF1]. The importance of clarity in the presentation of graphics, consistency and labeling of scales, and provision of titles, legends and associated explanations cannot be overstated. Tufte (2001, 2nd ed.) commences with a summary of what we should look for in order to achieve excellence in statistical graphics:
"graphical displays should:
•show the data
•induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else
•avoid distorting what the data have to say
•present many numbers in a small space
•make large data sets coherent
•encourage the eye to compare different pieces of data
•reveal the data at several levels of detail, from a broad overview to the fine structure
•serve a reasonably clear purpose: description, exploration, tabulation, or decoration
•be closely integrated with the statistical and verbal description of the data set"
These bullet points, originally stated some 30 years ago, are of continued relevance today. Of particular interest is the observation that graphics can present a large volume of data in a very compact and digestible form. This is of increasing relevance today, as the range, complexity and sheer volume of data risk overwhelming the would-be statistical analyst. As a general rule, all graphs and diagrams should include a heading and clear labeling of the constituent parts (axes, displayed objects etc.), and where appropriate should state the source of the data and any special aspects of the data (e.g. geographical or temporal, missing data, non-linear axis scaling and units, number of cases plotted). Graphs with gaps at the start of, or within, the range of either axis should be avoided wherever possible; where such gaps are unavoidable, they should be clearly indicated graphically and/or explained. For example, there is little value in providing a scatterplot of data whose y-values range from 100,000 to 100,010 with a y-axis commencing at 0.
Some would argue that we are now in a post-Tufte graphics world, with the development of an extensive range of interactive graphical toolsets: from those which respond with facilities such as zoom, pan, highlighting and drill-down, to interactive whiteboards and smartboards with draggable elements, and immersive interfaces that allow users to modify parameters and variables. Amongst the latter, UCLA's Statistics Online Resource Center (SOCR, formerly in Java, now in HTML5) and http://www.distributome.org/ are good examples in an educational context, whilst JFreeChart is an example of one of the many free Java-based toolkits for interactive chart and graph creation (see also https://public.tableau.com/s/). Note that many web browsers no longer support Java plug-ins as standard owing to security risks. AMCharts, which was Flash-based and is now JavaScript-based, is another interesting interactive graphics package. Tufte, though now largely retired from academic life, has remained in touch with these developments. The work of Hans Rosling in bringing vibrant, interactive graphics to the forefront of modern communications has been of enormous significance; see for example the brief presentation by the late Prof Rosling on YouTube: https://www.youtube.com/watch?v=jbkSRLYSojo
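The closing example can be checked with simple arithmetic. The sketch below is illustrative only: the helper names `occupied_fraction` and `padded_limits`, the 5% margin and the sample values are assumptions, not something prescribed in this Handbook.

```python
def occupied_fraction(values, axis_min, axis_max):
    """Fraction of the axis length actually spanned by the data."""
    return (max(values) - min(values)) / (axis_max - axis_min)


def padded_limits(values, pad=0.05):
    """Axis limits fitted to the data, with a margin of pad * data range."""
    lo, hi = min(values), max(values)
    margin = (hi - lo) * pad
    return lo - margin, hi + margin


y = [100_000, 100_003, 100_007, 100_010]   # hypothetical y-values

# With the y-axis starting at 0 the data span about 0.01% of the axis.
frac_from_zero = occupied_fraction(y, 0, 100_010)

# With limits fitted to the data they span roughly 91% of it.
limits = padded_limits(y)
frac_fitted = occupied_fraction(y, *limits)
```

An axis fitted in this way no longer starts at zero, so the labeling and explanation recommended above become essential.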
In many instances statistical charting facilities are used in essentially static situations (i.e. the data content is static), but there are applications (many of a non-statistical nature) which require dynamic capabilities — for example in real-time financial systems, in command and control environments, and in process and quality control. Typically these are implemented through graphical toolsets and libraries, and may include automatic visual warnings of unusual events or exceptions.
There is a huge variety of graphical forms available to present statistical information. The choice will be in part determined by the nature of the data, but also by the purpose to which the data will be put and the type of audience the information is targeted at. The principal types are listed below, followed by links to online galleries of example charts that illustrate each type. In addition, examples of 2D and 3D statistical graphics are to be found throughout this Handbook.
The terms graph, plot, chart, figure and diagram are often used interchangeably, although graph and plot are usually reserved for line and point data, whereas chart, figure and diagram cover a broader range, including the use of elements (such as images) that may serve to highlight the arrangement or meaning of data, rather than the data themselves. Different authors and different software packages use varying terminology and definitions for some types of chart, so readers are recommended to refer to the detail of the description and/or documentation before utilizing any of the more specialized facilities. In the descriptions provided below we have opted to use the definitions given by the US National Institute of Standards and Technology (NIST), and have included links to their online pages where appropriate.
The following is a brief summary of the main types of 2D statistical graphs and charts provided in many software packages:
•Scatter diagrams (also called scatter plots or scattergrams) — these show (x,y) pairs as points and are often used to plot the variation of a dependent variable, y, against an independent variable, x. In the example shown below data pairs are plotted as circles and a best fit line (a linear regression line of y on x) has been included on the graph (see the R code samples section for code and data). This is a very basic example of the use of scatter or point diagrams. The data show how countries with higher income inequality, such as the UK, Portugal and the USA, also experience the highest levels of health problems and social problems (as measured by the Index selected by the authors [WIL1]). This contrasts with countries like Japan and the Scandinavian nations, whose income inequalities are much lower and whose health and social problems are also computed as being much lower. Example variants of scatter plots include multiple plots (see for example correlation matrix plots), the use of different symbology (shapes, colors, labeling) to identify distinct groups of points within the overall plot, and the use of interactive facilities in modern software and web environments (discussed further below). In general, points of a scatter plot should not be linked by linear segments unless there is a clear sequence (e.g. a time sequence) that warrants this form of presentation
Example scatter plot: Income inequality vs Index of Health and Social Problems (after Wilkinson and Pickett, 2009)
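The best fit line mentioned in the scatter diagram description is an ordinary least-squares regression of y on x. A minimal sketch of that calculation follows, using made-up data pairs (not the Wilkinson and Pickett data) and a hypothetical helper name, `fit_line`.

```python
def fit_line(xs, ys):
    """Least-squares regression of y on x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical (x, y) pairs, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

intercept, slope = fit_line(x, y)

# End points of the fitted line, ready to be drawn over the plotted points.
line = [(min(x), intercept + slope * min(x)),
        (max(x), intercept + slope * max(x))]
```

The line is drawn only across the observed range of x, which avoids implying a relationship beyond the data.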
•Probability plots - a probability plot is a diagram that plots the cumulative distribution of observed data against the cumulative distribution points of the target distribution, most typically a Normal distribution. Traditionally, probability plots were drawn on probability paper, although nowadays an equivalent graphical display is used. An example, for data on Radon measurements, is shown below. The plots provide a visual indication of the goodness of fit of the observed data to the target distribution after a range of transformations, and are usually accompanied by a measure of goodness of fit, such as the Anderson-Darling test. Depending on the way in which these plots are produced, the intercept and slope of the straight line that best fits the data may provide information on the parameters of the fitted distribution.
Example goodness-of-fit plots — Radon data
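One way of producing the coordinates for a Normal probability plot is sketched below. It assumes the plotting positions (i - 0.5)/n, which is only one of several conventions in use, and the sample values are made up rather than taken from the Radon data.

```python
from statistics import NormalDist


def normal_probability_points(data):
    """Pair each ordered observation with the standard Normal quantile
    at plotting position (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), value)
            for i, value in enumerate(ordered, start=1)]


# Hypothetical measurements, for illustration only.
sample = [4.1, 5.3, 3.8, 6.0, 4.9, 5.5, 4.4]
points = normal_probability_points(sample)
```

With the theoretical quantile on the horizontal axis and the observed value on the vertical axis, approximately Normal data fall close to a straight line whose intercept and slope approximate the mean and standard deviation.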
•Line plots - there are many variants of line plots, from simple graphs showing a sequence of y values as x increases, to best fit lines through scatterplots, and to more complex multi-line graphs. The latter include time series and autocorrelation graphs, control charts, financial charts (with open-high-low-close or 'OHLC' indicators), correlograms and many others. Line charts are amongst the clearest and most widely understood forms of chart. When preparing such charts it is important that the linear segments connecting sequential points are meaningful over the range in question; if not, other forms of display such as bar charts or scatter plots may be more appropriate.
•Bar charts and histograms - although we have grouped these two diagram types under the same heading, it is often useful to distinguish between them. Bar charts typically apply to categorical data, and show the frequencies in each category. The bars have equal width and a distinct gap between them to make this as clear as possible. Histograms are similar, but show the frequency patterns for data measured on a single variable, which may be discrete or continuous. The vertical bars of histograms should be adjacent or very close to adjacent, thus indicating that the frequencies plotted apply to the entire class or range in question, and then continue to the next higher class or range. Histograms for which the class intervals (widths) vary should have bars of correspondingly variable width, as it is the area of the histogram bar that represents the frequency, with its height indicating the density of the data over that interval. Variable width histograms are rarely used in practice, with every effort being made to provide equal intervals of the data. For upper and lower classes with undefined limits (e.g. <0, >100) there is no agreed procedure for displaying such data, but common practice (which is clearly inadequate) is to assign the same fixed width to such ranges.
There are many variants of histograms and bar charts: single, multiple, 3D, stacked, split, cumulative and more specialized versions, such as the bi-histogram illustrated below (MATLAB code by Mauro Gerber). This shows two histograms back to back, in this case representing two batches of ceramic test data from the NIST. It provides an effective visual means for comparing two datasets, highlighting the differences in the mean values and the broad similarity of the distribution shapes.
Bi-histogram of NIST ceramic strength dataset
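The point made above about histograms with unequal class intervals can be made concrete: if bar height is set to count / (n × width), the area of each bar equals the proportion of observations in that class. The sketch below uses made-up data and class boundaries, and `histogram_density` is a hypothetical helper name.

```python
def histogram_density(data, edges):
    """Return (left, right, count, density) per class for possibly unequal widths.

    density = count / (n * width), so each bar's area equals the proportion
    of observations in its class. Classes are [left, right), except the last,
    which also includes its right edge.
    """
    n = len(data)
    bars = []
    for k, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
        last = k == len(edges) - 2
        count = sum(1 for v in data
                    if left <= v < right or (last and v == right))
        bars.append((left, right, count, count / (n * (right - left))))
    return bars


# Hypothetical data and class boundaries (widths 10, 10, 30 and 50).
values = [3, 7, 8, 12, 15, 18, 19, 25, 31, 44, 62, 95]
bars = histogram_density(values, [0, 10, 20, 50, 100])

# The bar areas sum to 1 when every observation falls inside the classes.
total_area = sum((right - left) * density for left, right, _, density in bars)
```

Plotting raw counts against these unequal widths would exaggerate the wide classes; plotting the density keeps the visual impression proportional to the data.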
•Quantile-Quantile or QQ plots - the QQ plot is a means of comparing two sample datasets. The plot is essentially the same as a probability plot, but applied to two datasets. The cumulative distribution of each dataset is constructed and quantiles (often percentiles) are identified. Ideally each quantile slot will contain roughly the same proportion of the data observations (arranged in ascending order), with the largest data value in each slot marking the upper limit of that quantile; hence QQ plots are best applied when there is a reasonable amount of data, e.g. >100 observations. The units plotted on the two axes are the units in which the data are measured, not the quantiles themselves (see also the informative QQ plots article on Wikipedia). QQ plots are also used to compare samples of the same size from separate batches, a and b say. In this case each sample is arranged in size order, {A1,A2,...,An} and {B1,B2,...,Bn}, and the points on the graph are the pairs (Ai,Bi); if the batches match, the point pairs will lie on a diagonal line. Probability plots can be seen as a special case of a QQ plot in which the second distribution is a theoretical one rather than an observed dataset.
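Both constructions described for QQ plots can be sketched in a few lines. The linear-interpolation quantile used for unequal sample sizes is an assumption (packages differ in their quantile definitions), and the batches and function names are hypothetical.

```python
def qq_pairs_equal(a, b):
    """Equal-sized samples: pair the i-th smallest of a with the i-th smallest of b."""
    if len(a) != len(b):
        raise ValueError("samples must be the same size")
    return list(zip(sorted(a), sorted(b)))


def quantile(ordered, p):
    """Quantile of an already sorted list by linear interpolation (0 <= p <= 1)."""
    position = p * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def qq_pairs(a, b, probs):
    """Samples of any size: compare the two datasets at the same probabilities."""
    sa, sb = sorted(a), sorted(b)
    return [(quantile(sa, p), quantile(sb, p)) for p in probs]


# Hypothetical batches a and b.
batch_a = [12, 15, 11, 19, 14]
batch_b = [22, 25, 21, 29, 24]
pairs = qq_pairs_equal(batch_a, batch_b)
quartiles = qq_pairs(batch_a, batch_b, [0.25, 0.5, 0.75])
```

In this made-up example every value in batch b is 10 larger than its counterpart in batch a, so the pairs fall on a line parallel to the diagonal: the shapes of the two batches match but their locations differ.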
•Box plots — Box plots are a convenient way of summarizing key features of an observed dataset and for comparing this with one or more other similar datasets — we have previously used the example of a box plot when discussing measures of distribution spread and outlier detection, and have included the graphic below. Here it was shown how box plots provide: the median (the middle value); the upper quartile and lower quartile values (which form the frame of the central box); the whiskers and hinges (multiples of the inter-quartile range) and the explicit identification of any outliers (defined as data items lying beyond the hinge values)
Box plot
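The values displayed in a box plot can be computed directly. The following is a minimal sketch, assuming the common convention that the outlier limits (called fences in the code) lie 1.5 inter-quartile ranges beyond the quartiles and that the whiskers extend to the most extreme observations inside those limits. The function name and data are illustrative, and quartile definitions vary between packages.

```python
from statistics import median, quantiles


def box_plot_summary(data, k=1.5):
    """Median, quartiles, fences, whisker ends and outliers for one dataset.

    Assumes fences at k * IQR beyond the quartiles and whiskers drawn to
    the most extreme values lying inside the fences.
    """
    ordered = sorted(data)
    q1, _, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    return {
        "median": median(ordered),
        "q1": q1,
        "q3": q3,
        "fences": (lower_fence, upper_fence),
        "whiskers": (inside[0], inside[-1]),
        "outliers": [v for v in ordered if v < lower_fence or v > upper_fence],
    }


# Hypothetical observations with one unusually large value.
summary = box_plot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
```

Computing the summary for each dataset in turn gives the side-by-side comparison for which box plots are typically used.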
•Vector plots — some multivariate datasets include direction and magnitude information, particularly time series at single locations and measurements made across multiple locations. Examples include wind direction, magnetic flux, fluid flow, gradient etc. Graphical representation of such datasets for single locations is usually achieved using rose diagrams (see below), whilst with multiple locations vectors are drawn at each sample or grid location representing single measurements. Some software packages refer to diagrams of this type as quiver charts or feather charts, depending on the particular characteristics of the display required
•Pie charts - this is a very popular form of visual representation for data that is grouped or categorical, with values recorded in proportions or percentages. By varying the radius of the pie chart the relative size of different datasets can be represented (with the radius increasing as the square root of the dataset total, so that the area of the pie is proportional to that total), and particular segments can be highlighted using 'exploded' versions of the standard arrangement. Pie charts are not particularly good at displaying data where there are a large number of groups and/or groups that contribute a very small proportion of the whole. Pie charts can be used in multiples, e.g. on maps, but Tufte (2001, [TUF1]) argues that this is a very poor and confusing approach to use.
•Polar plots and rose diagrams - data that include an implicit or explicit directional or temporal component can often be displayed most effectively using a plot that uses polar coordinates rather than Cartesian coordinates. Examples include the wind speed and direction rose diagrams and the Bivariate Normal plot provided earlier in this Handbook, and Florence Nightingale's mortality diagram, illustrated on the back inside cover of this Handbook. A recent use of this kind of diagram, showing the impact of UK Covid deaths in 2020/2021, is available on the website of Oxford University's Centre for Evidence-Based Medicine, and is shown below:
Polar plot: Florence Nightingale Diagrams of Deaths in England & Wales
•Bubble plots — bubble plots are essentially scatterplots where the data items have an additional value or weighting (a third dimension) to be displayed. This is achieved by varying the size and/or color of the points. Note that the first two dimensions may be variables or spatial coordinates, with the third dimension being the additional variable. Some software tools display the bubbles as pseudo-3D objects, i.e. looking like bubbles or shaded balls. In the diagram below a bubble plot has been created using polar data, with random location and size, and colors determined by the angle of rotation. Bubble transparency has been set to 75% (from MatPlotLib, Python language examples). See also the Hans Rosling video mentioned earlier, where he uses bubble plots to great effect.
Bubble plot on polar chart
•Forest plots - Forest plots, as illustrated earlier in this reference work (meta analysis section), are yet another form of specialized chart. In this instance a stack of trials or experimental results relating to the same subject area are placed one below another, with horizontal lines indicating the confidence interval for the trial as a whole (smaller trials typically having wider confidence intervals) and mean values indicated with variable 'blob' sizes reflecting the size of the trial or experiment. Forest plots have become a major tool in the arsenal of meta-analysis, as they help to visualize multiple sets of data, both separately and when combined, in a clear manner, in order to aid informed decision-making.
•Ternary diagrams — these are charts with three axes rather than two. The axes are usually arranged as an equilateral triangle. In many instances the diagram has a scale of 0-100% on each axis, and the data for a tri-variate observation is plotted against these three percentage axes — this is particularly common in Earth Science applications (e.g. the points are soil samples from a number of locations, and the axes are the proportion of sand, clay and humus, or of trace elements, in each sample). An example ternary chart is shown below, with each point identifying the proportion of Copper (Cu), Tin (Sn) and Zinc (Zn) in samples of a bronze-like metal alloy:
Ternary plot of 15 metal alloy samples
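Plotting a point on a ternary diagram amounts to converting its three proportions into ordinary (x, y) coordinates inside an equilateral triangle. The sketch below assumes one particular arrangement of the corners, and the alloy compositions are made up for illustration. Software packages may orient the triangle differently.

```python
from math import sqrt


def ternary_to_xy(a, b, c):
    """Map a three-part composition onto an equilateral triangle of side 1.

    Corners: 100% a at (0, 0), 100% b at (1, 0), 100% c at (0.5, sqrt(3)/2).
    The parts are rescaled to sum to 1, so percentages or raw amounts both work.
    """
    total = a + b + c
    if total <= 0:
        raise ValueError("composition must have a positive total")
    b, c = b / total, c / total
    return b + 0.5 * c, (sqrt(3) / 2) * c


# Hypothetical alloy compositions as (Cu, Sn, Zn) percentages.
samples = [(80, 15, 5), (70, 10, 20), (100, 0, 0), (0, 0, 100)]
xy = [ternary_to_xy(cu, sn, zn) for cu, sn, zn in samples]
```

Because the three proportions sum to a constant, only two of them are free, which is why a three-axis composition can be shown without distortion on a flat triangle.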
•Parallel coordinate plots (PCP) — This is a relatively recent form of graphical display designed to show multivariate data in an interactive environment. In the example illustrated below five variables from census data relating to distinct census tracts are included: house values, income levels, crimes recorded (residential burglaries and vehicle thefts), open space and the percentage of housing with deficient plumbing. Each variable is shown with a [min, max] vertical scale and a linking line that corresponds to the case (census tract). Lines are colored according to a user-chosen single variable and classification rule. By selecting a single line, such as that shown, the variable values are displayed and other visualization windows, such as various forms of maps and graphs, show the selected object highlighted. This particular example was created using GeoViz, an interactive tool that is no longer available.
Parallel coordinate plot
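The [min, max] scaling of each axis in a parallel coordinate plot can be sketched as follows. This is not the GeoViz implementation: the function name, variable names and tract values are hypothetical, and the treatment of a variable with no variation (placed mid-axis) is an arbitrary choice.

```python
def parallel_coordinates(records, variables):
    """Return one polyline per record: a list of (axis_index, scaled_value).

    Each variable is scaled to its own [min, max] range, so 0 is the bottom
    of that variable's axis and 1 is the top.
    """
    ranges = {}
    for name in variables:
        column = [rec[name] for rec in records]
        ranges[name] = (min(column), max(column))

    lines = []
    for rec in records:
        line = []
        for axis, name in enumerate(variables):
            lo, hi = ranges[name]
            scaled = 0.5 if hi == lo else (rec[name] - lo) / (hi - lo)
            line.append((axis, scaled))
        lines.append(line)
    return lines


# Hypothetical census tracts; variable names are illustrative only.
tracts = [
    {"house_value": 80.5, "income": 19.5, "crime": 15.7},
    {"house_value": 44.6, "income": 21.2, "crime": 18.8},
    {"house_value": 26.4, "income": 16.0, "crime": 30.6},
]
polylines = parallel_coordinates(tracts, ["house_value", "income", "crime"])
```

Each polyline corresponds to one case, so coloring or highlighting a line, as described above, only requires knowing which record it came from.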
•Star, radar and spider diagrams — In this type of chart each variable is plotted in a separate direction, with the length of the arm being proportional to the variable magnitude. It is typically used for comparing the levels of between 3 and 7 variables on a single chart. A simple example is illustrated below (as with the PCP, above, this is an interactive graphic tool when displayed within the relevant software environment).
For spatial datasets each mapped region will have an individual star plot, which can be mapped as an overlay onto a classified base map to provide a star plot map that displays multiple variables visually and simultaneously. As with the PCP plot, the example below was created using GeoViz and displays six variables this time: house values, income levels, crimes recorded, open space, the percentage of housing with deficient plumbing, and a neighborhood measure
Radar/Star plot 2
•Hybrids — almost every type of chart can be plotted in multiples and/or with other chart types overlaid. For example, a chart might display a histogram with a line graph overlaid, or a series of pie charts might be plotted on a scatter diagram layout
•Maps — maps provide a widely used form of graphical presentation that has particular benefits for displaying statistical data whose spatial distribution is important. Map-based systems (geospatial information systems) are now one of the most prevalent and effective forms of communication with a global audience, particularly with the rise of web-based mapping tools, Google maps and associated databases of economic, social and health data
•Tree diagrams - tree diagrams, or dendrograms, are a particular form of acyclic network diagram in which two or more lines or 'arcs' extend from a single vertex and each arc terminates in a node that may then form part of a subsequent section of the tree. They have an important role in statistics in the context of decision analysis. Applications include simple multi-stage probability diagrams, multivariate classification, and statistical model building. Tree diagrams are widely used in non-statistical application areas such as organization charts, family trees and decision charts.
Most 2D graphics have 3D visualization equivalents. These 3D variants are typically used to enable an additional variable to be plotted. For example, a bar chart in 3D might show the same information as its 2D equivalent, but with the third dimension enabling data from a number of years, or a number of different groups, to be plotted on a single diagram. Presentation of 2D information with a 3D visualization is unnecessary and can be confusing, so is not recommended.
In addition to charting data on 3 variables, there are specialized 3D charts that are designed to show data for which the x and y values represent spatial coordinates and the z variable provides the data item. In general only a single z-item is displayed at a time, and may be a conventional graphic, such as a point or bar, or a surface representation such as a contour map or surface model (e.g. a wire-frame with or without shading). Below we provide a few examples of these variants, noting always that the range of possible graphics is very large.
3D scattergrams - although point-based data can be plotted as a 3D diagram, the results are not very satisfactory, as it is difficult to determine on a 2D format where the points actually are located. There are various solutions to this problem, one of which is to plot the projections of the 3D points onto the 3 orthogonal planes x-y, x-z and y-z, as illustrated below. Another option is to provide a vertical line from the x-y plane with height z.
3D scattergram — projected data
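Both remedies described for 3D scattergrams are simple coordinate manipulations. The sketch below uses hypothetical points and helper names, and assumes the x-y plane sits at z = 0.

```python
def project_to_planes(points):
    """Split 3D points into their projections on the x-y, x-z and y-z planes."""
    return {
        "xy": [(x, y) for x, y, z in points],
        "xz": [(x, z) for x, y, z in points],
        "yz": [(y, z) for x, y, z in points],
    }


def drop_lines(points):
    """Vertical segments from the x-y plane (z = 0) up to each point."""
    return [((x, y, 0), (x, y, z)) for x, y, z in points]


# Hypothetical 3D observations.
cloud = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
shadows = project_to_planes(cloud)
stems = drop_lines(cloud)
```

Each projection is an ordinary 2D scatter diagram, so the three together can also be shown as separate panels if the combined 3D view is too cluttered.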
3D bar and frequency plots — It can be helpful to plot multiple datasets using a 3D form of bar chart or histogram, although the clarity of presentation can be reduced if the presentation is too crowded. Often the bars are shown as solid blocks, but in the example below (a Python example from the MatPlotLib gallery) flat bars are used to improve clarity. Here the data is showing the relative frequency distributions for 4 datasets, at values y=0,10,20 and 30 — labeling of the axes could clearly be improved.
3D bar chart
Contour plots — 3D continuous surface datasets can be plotted using 2D contour map representations, with the contour intervals filled or not, depending on the user preferences. The implication of such a representation is that every location in the x-y plane has a single z-value associated with it. In the example illustrated below, the source data is the level of zinc in parts per million (ppm) in a set of 98 soil samples taken near the River Meuse, in the Netherlands. A (geo)statistical interpolation technique, known as Kriging, was used to create estimated values for the level of zinc over the entire region (interpolation to a fine grid) and then the grid was used to produce the filled contour diagram by interpolating contour lines between grid intersections.
Contour plot: Predicted zinc levels (in ppm) in the neighborhood of the River Meuse
Surface and wireframe plots — The preceding diagram could also have been displayed as a surface, using a simple wire-frame model, again with or without infilling and 'lighting' of the surface. Frequently such models clarify the form of a surface, but unless an interactive facility is provided (e.g. rotation, tilting, scaling) some areas may be indistinct or hidden.
Similar issues arise for synthetic surfaces, such as the Gaussian surface shown below. In this example a positive and negative Gaussian function (essentially a Normal distribution) has been applied centered at two locations, a fine grid generated and then rendered to provide the visual impression of a smoothly varying synthetic surface. The two blue colors identify grid cells that are visible when looking across this landscape in an arc from the top of the Gaussian hill — dark blue areas are those that are visible. Likewise, the red and yellow markings show the visible and invisible cells when a visual transect in the directions of the three green lines is taken from an offset above the hill top (see the web or PDF versions for color images).
Gaussian surface model
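A synthetic surface of this kind can be generated by evaluating a positive and a negative Gaussian function on a fine grid. The sketch below is not the code used for the figure: the centers, heights, spreads and grid spacing are all illustrative assumptions.

```python
from math import exp


def gaussian_bump(x, y, cx, cy, height, spread):
    """Bell-shaped bump (or hollow, if height < 0) centered at (cx, cy)."""
    return height * exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * spread ** 2))


def surface_grid(size, step):
    """z-values on a square grid: one positive and one negative Gaussian."""
    coords = [i * step for i in range(size)]
    return [[gaussian_bump(x, y, 3.0, 3.0, 1.0, 1.0)      # hill
             + gaussian_bump(x, y, 7.0, 7.0, -1.0, 1.0)   # hollow
             for x in coords]
            for y in coords]


# Grid covering 0..10 in both directions at a spacing of 0.1.
grid = surface_grid(size=101, step=0.1)
hill_top = grid[30][30]      # value at approximately (3, 3)
hollow_floor = grid[70][70]  # value at approximately (7, 7)
```

The resulting grid of z-values is the input that a contouring or surface-rendering routine would then draw.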
Dynamic and Interactive graphics
A major change has occurred in the display of graphics in the last few years. Whereas almost all graphs and charts since the mid-1700s had been static entities, designed for reproduction on paper or static display devices, the rise in computer power and the growth of the Internet has changed almost the entire rule book. The basic principles laid out by Tufte still apply, but it is now possible to interact with the graphics in order to clarify the information they seek to display. This can be achieved in many different ways, some simple, others more complex. Many of these procedures have been developed in order to facilitate exploratory data analysis (EDA), others have communication as their primary function. Below we describe a small sample of these in order to provide an indication of some of the key facilities available.
Many software packages now facilitate multiple views of a dataset: for example, a display of a data table and simultaneous display of a statistical chart, such as a scattergram of two or more variables (columns) from the dataset. Where such facilities are dynamically linked they can provide a powerful toolset for EDA purposes. By dynamically linked we mean that selection of data items (records) in one view (e.g. the tabular view) highlights those records in another view (e.g. the relevant points in the scattergram). This concept is known as linking. The selection of records of interest may be by simple item selection from table rows, but may also be by some form of sweep across a number of records, or displayed objects in a window; this facility is known as brushing. The diagram below illustrates this process in action using crime and income data for zones in Columbus, Ohio. By brushing a part of the scatterplot (top left) we have selected a group of high crime-low income records, and these are then automatically highlighted (linked) in the other views of the dataset (histogram, map and attribute table in the example illustrated). This particular example was generated using the GeoDa package.
Brushing and linking: Columbus, Ohio crime data
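The logic behind brushing and linking is straightforward: a brush yields a set of record identifiers, and every view consults that same set when deciding what to highlight. The sketch below illustrates the idea only. It is not how GeoDa is implemented, and the zone records, field names and function names are hypothetical.

```python
def brush(records, x, y, x_range, y_range):
    """Return the ids of records whose (x, y) point lies inside the brushed rectangle."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return {rec["id"] for rec in records
            if x_lo <= rec[x] <= x_hi and y_lo <= rec[y] <= y_hi}


def linked_view(records, selected_ids):
    """Flag each record for highlighting; every view draws from the same flags."""
    return [dict(rec, highlighted=rec["id"] in selected_ids) for rec in records]


# Hypothetical zones; field names and values are illustrative only.
zones = [
    {"id": 1, "income": 9.0, "crime": 61.3},
    {"id": 2, "income": 21.2, "crime": 18.8},
    {"id": 3, "income": 8.4, "crime": 54.8},
    {"id": 4, "income": 16.0, "crime": 30.6},
]

# Brush the low-income, high-crime corner of the scatterplot.
selected = brush(zones, "income", "crime", x_range=(0, 10), y_range=(50, 100))

# The same flags would drive the table, the map and the histogram.
table_rows = linked_view(zones, selected)
```

Keeping the selection as a set of identifiers, rather than as screen coordinates, is what allows quite different views of the same records to stay synchronized.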
Brushing and linking is one example of recent advances in the application of graphical techniques to EDA. Another major development has been in labeling and data content. In a static diagram or map everything one wishes to communicate must be pre-defined and on display. If there is too much information the graphic can be unusable or multiple graphics may be required. With interactive graphics the situation changes because each component of the graphic can be defined to provide additional information when the user interacts with it. The simplest way to use such facilities is to provide additional labeling information that is displayed when an item is clicked or sees a 'mouse over' event, but so-called drill-down facilities provide a much greater range of facilities.
A clickable object may be a higher level grouping of more detailed data, and by selecting the object a new set of data becomes accessible. A good example of such facilities can be seen at the Amcharts website: https://www.amcharts.com/demos/. Facilities such as these enable large volumes of statistics (data) to be navigated and investigated, through a graphical interface, without the need for the creation of large numbers of complex statistical graphs and charts. However, it is important to note that such facilities are only as good as the underlying datasets and interface implementation, and are essentially communication and EDA tools rather than analytical in function.
References
[CHA1] Chambers J, Cleveland W, Kleiner B, Tukey P (1983) Graphical Methods for Data Analysis, Wadsworth.
[FRI1] Friendly M (2000) Visualizing Categorical Data. SAS Institute, Cary, NC.
[TUF1] Tufte E R (2001) The Visual Display of Quantitative Information. 2nd ed., Graphics Press, Cheshire, Connecticut.
[TUK1] Tukey J W (1977) Exploratory Data Analysis. Addison Wesley, Reading, MA.
[WIL1] Wilkinson R, Pickett K (2009) The Spirit Level: Why Equality is Better for Everyone. Allen Lane, London (Penguin edition, with updates, 2010).
AMCharts: JavaScript-based interactive charting tool: https://www.amcharts.com/demos/
GeoDa: Geospatial analysis for lattice data: https://spatial.uchicago.edu/software
Grapher: Product examples: https://www.goldensoftware.com/products/grapher/grapher.shtml
NIST eHandbook: Graphical techniques, with worked examples and explanations: https://www.itl.nist.gov/div898/handbook/eda/section3/eda33.htm
MatPlotLib: Gallery: https://matplotlib.org/2.0.2/gallery.html
UCLA's Statistics Online Resource Center (SOCR Charts project, formerly in Java): https://socr.umich.edu/HTML5/
Wikipedia: Charts: https://en.wikipedia.org/wiki/Chart