Data preparation and cleaning

<< Click to Display Table of Contents >>

Navigation:  Statistical data >

Data preparation and cleaning

Careful data preparation is an essential part of statistical analysis. This step assumes the data have been collected, coded and recorded, and now reside in some form of data store. This will often be in the form of a simple table (sometimes referred to as a data matrix) or SQL-compatible database. Depending on how the data were coded and stored, variables may or may not have suitable descriptions, coding or data types assigned, and if not, this is one of the first tasks to carry out. At this stage it will become apparent if any data items are incompatible with the data type assignments, for example text coded in numeric fields, or missing data that requires entry of a suitable 'missing data' code. Data provided from third parties, whether governmental or commercial, should include a metadata document that describes all aspects of the data in a standardized and thorough manner [UNI1].

Analysis of the dataset for duplicates is often the next step to undertake. There may be many reasons for duplicates existing in datasets, these include: genuine duplicates on one or more variables; data entry errors; multiple returns for the same case; duplicates on a subset of variables (as opposed to entirely duplicate records); and duplicates representing deliberate coding to the same reference. Depending on the nature and validity of the duplicates, decisions have to made on how they are to be treated. In some instances data will need to be de-duplicated, in others data will be retained unchanged, whilst in some instances additional data will be added to records to ensure the duplicates are separately identifiable. When analyzing duplicates using Exploratory Data Analysis (EDA) tools, duplicates may be hidden — for example, point maps of crime incidents frequently under-represent the concentration of crimes in certain locations as these are often recorded as co-located. Identification of duplicates may, of itself, be a form of EDA, identifying genuine co-incident results, or perhaps highlighting data coding protocols (such as assigning particular disease incidence to doctors surgeries or hospitals rather than the home address of the individual).

Zero and null (missing data) occurrences form a special group of duplicates that apply to one or more variables being studied. In many datasets the number of zeros recorded may be very large and their inclusion may totally distort analysis of the variables in question. Software tools may provide the option to mask out (i.e. hide) zeros from subsequent analysis. For example, a data collection device might record a value of an environmental variable, such as wind speed and direction, every 10 minutes. For perhaps 50% of all data items logged the speed might be below the threshold for measurement with the result that directional information also has no real meaning. Analysis of the dataset might choose to exclude the zero values from some EDA visualizations and statistical analyses, as these would overwhelm the results — this is not to say that such data be ignored, but that it should be separated for some parts of the analysis.

EDA methods will also tend to highlight exceptional data values, anomalies and outliers (specific measurements or entire cases) that require separate examination and/or removal from subsequent analysis. Note that this analysis is taking place on the source data, not post-processed information, although the measurement and recording process itself may have effectively pre-determined some of the possible characteristics of the source data (e.g. the coding applied, the resolution of measurement and recording equipment, any systematic data filtering applied during measurement or recording etc.).

In the case of outliers, there are several options of how they should be dealt with, and these will depend on the particular problem and form of analysis being considered. If the outlier is known to be an error (e.g. a mis-coding, by placing a decimal point in the wrong place) it can be corrected or removed. It may be an event of great interest, in which case it warrants separate examination and analysis — this again may result in the item(s) being removed from the rest of the dataset. It can also be altered in a systematic manner, for example: changing the value to be 3 standard deviations from the mean; "Winsorizing" the value, whereby it is amended up or down to the adjacent value in a sorted series; or effectively excluded by computing statistics based on forms of trimmed measures, such as the trimmed mean.

Once a dataset has undergone preliminary inspection and cleaning, further amendments may be made in order to support subsequent analyses and the use of specific statistical models. It may be desirable for such amendments to result in the creation of a new data table or data layer, thereby ensuring that the source data remains untouched and available for re-inspection and analysis. In some instances (very large datasets) it is preferable to extract a representative sample of records and then apply modifications to this extracted set. Data in this new or modified layer may be subject to re-coding, grouping into new groups or classes, and/or apply some form of data transformation (for example applying a transformation to a continuous variable to improve the fit to the Normal distribution). A very large number of transformations are possible, many of these being supported in standard statistical analysis packages. For certain data types (such as temporal and spatial datasets) a specialized set of transformations are used, which reflect the serial and neighborhood aspects of such data. For example, with temporal data, various forms of temporal averaging, seasonal adjustments and filtering may be applied, whilst in spatial analysis such changes may be based on local, focal or zonal computations (see further, de Smith et al., 2018, [DES1]).


[DES1] de Smith M J, Goodchild M F, Longley P A (2018) Geospatial Analysis: A Comprehensive Guide to Principles, Techniques and Software Tools. 6th edition, The Winchelsea Press, UK. Available from:

[UNI1] United Nations Statistical Commission and Economic Commission for Europe of the United Nations (UNECE) (1995) Guidelines for the Modelling of Statistical Data and Metadata. Conference of European Statisticians, Methodological material, United Nations, Geneva: