Missing data and data errors

Missing data is a wide-ranging term for the absence of expected data from a recorded sample, which may occur for many reasons and in many ways. It can be a small, easily managed problem, or a larger problem that raises questions about the usability of the dataset. Missing data can involve all of the data for a record or set of records, or just some of the values from the set of measurements relating to individual records (incomplete records). Each situation requires separate consideration, both as to the seriousness of the problem and as to the means by which such difficulties are to be addressed. In the following paragraphs we discuss some examples of the key issues and approaches to resolving them. We then look in more detail at some of the techniques and tools provided within statistical software packages that are designed to assist in these situations.

In sample surveys the most common reasons for missing data are non-response, partial responses (only some questions answered) and spoiled responses. Surveys may be structured so that sample sizes are increased to a level at which the target response is achieved after allowing for non-response and unusable responses, although this can be difficult when the survey involves quotas (e.g. the study must include responses from 50 women between the ages of 40 and 60, 50 between 60 and 80, and so forth). In general it is very difficult to avoid the problem that sample surveys or trials will yield incomplete response data, particularly where there are many questions or variables being examined. Such problems can lead to biased results and need to be addressed as early as possible in the overall data collection design and implementation phase of a project.

Missing data may also be encountered when an experiment or trial is undertaken and unforeseen circumstances make some of the data unusable or impossible to obtain. For example, the present author conducted a controlled trial of three different types of multi-lingual keyboard in the European Commission headquarters in Brussels. A total of 96 staff were recruited (48 for week 1, 48 for week 2) to undertake computer-controlled typing tests in a variety of real and synthetic languages. All test sessions were completed with no data losses until one morning the building was picketed and attacked by French farmers protesting against proposed changes to the Common Agricultural Policy. They entered the building and were only removed after tear gas was used, which in turn shut down the lower levels of the building, and our morning's data session was lost! Fortunately the exercise included enough replicates, including a complete replicate of the entire experiment in week 2, so the analysis of the results was only marginally affected (see Evans, 1988 [EVA1]). In trials of medical procedures one or more of those involved in the trial may be unable to complete it due to illness or other, unrelated factors, or perhaps one sample becomes contaminated or is not of the correct strength, so the data has to be discarded. This may be all the data relating to one or more participants in the trial, but more commonly it is one or more data items relating to a case - for example, a missing test result. Datasets that rely on remote-sensing equipment frequently exhibit missing data when equipment fails or has to be taken out of use temporarily for servicing or other reasons.

For certain experiments and analyses, such as some randomized block designs, loss of even a single data item is important because it upsets the balanced nature of the design and its subsequent analysis. Where observations are unavailable for a single unit this may be partially overcome by estimating or imputing the missing data from the remaining information, thereby turning the unbalanced design back to a balanced form. For example, in a randomized block design (see further, Cox, 1958 [COX1]) with k blocks and t treatments a simple estimate for a single missing value is given by (kB+tT-G)/((k-1)(t-1)), where B is the total of the remaining observations in the block containing the missing observation, T is the total of the remaining observations on the treatment containing the missing observation, and G is the grand total of all the remaining observations. This provides a simple form of averaging for a single missing value - least squares techniques or simple iterative estimation can be used to extend the concept to more than one missing value. Analysis then continues as if the estimated value or values were genuine, but with the residual degrees of freedom reduced by 1 for each missing value and a correction for bias applied to the sums of squares in the Analysis of Variance computation. Likewise, in time series analysis, complete time series are almost always required, making analysis of incomplete temporal datasets very problematic. Where such data is missing at the start or end of a series, it may be sufficient to simply ignore this problem and analyze the data that is available, assuming that it can be regarded as representative of the entire period. However, if embedded values (i.e. within the series) are missing, some form of estimation is often the only option.
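
As an illustration, the short Python sketch below applies this formula to a single missing cell in a k x t table of block-by-treatment observations; the function name and the small example layout are invented for the purpose of illustration.

```python
import numpy as np

def estimate_missing_rbd(data):
    """Estimate a single missing value (np.nan) in a randomized block design.

    data is a k x t array with k blocks as rows and t treatments as columns,
    containing exactly one np.nan. Returns (kB + tT - G) / ((k-1)(t-1)), where
    B and T are the totals of the remaining observations in the affected block
    and treatment, and G is the grand total of the remaining observations.
    """
    data = np.asarray(data, dtype=float)
    k, t = data.shape
    rows, cols = np.where(np.isnan(data))
    if len(rows) != 1:
        raise ValueError("exactly one missing value expected")
    i, j = rows[0], cols[0]
    B = np.nansum(data[i, :])      # remaining total in the affected block
    T = np.nansum(data[:, j])      # remaining total for the affected treatment
    G = np.nansum(data)            # grand total of the remaining observations
    return (k * B + t * T - G) / ((k - 1) * (t - 1))

# Hypothetical 4-block x 3-treatment layout with one lost observation
example = [[12.1, 14.3, 13.2],
           [11.8, np.nan, 12.9],
           [13.0, 15.1, 14.0],
           [12.5, 14.8, 13.6]]
print(estimate_missing_rbd(example))
```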

Another common reason for missing data is incorrect data recording, coding or subsequent processing. The precise reason for such errors and the scale of the problem are important to determine. Incorrect data coding by researchers and data preparation staff can often be checked through systematic verification, for example by taking a sample of each block of survey returns and having these independently recoded and compared with the original coding. Incorrect interpretation of survey questions, or incorrect recording of data by surveyed individuals, needs to be identified through inspection and validation techniques, so that the scale and nature of any problems can be established and changes or corrections made to the data gathering and/or subsequent processing of the data. The widely publicized issue of data quality associated with the Climatic Research Unit (CRU) at the University of East Anglia in the UK (see http://www.cru.uea.ac.uk/cru/data/availability/ and the IPCC dataset site: http://www.ipcc-data.org/ ) provides a vivid insight into some of the issues associated with collating and cleaning datasets from multiple sources on a range of variables over a prolonged period of time.

Minor errors and occasional items of missing data can often be handled programmatically, but in some cases such approaches are not sufficient and the data and project will have to be reviewed in the light of the data limitations. Many software packages include facilities to handle problems with data completeness. The most common arrangement is for missing values to be identified by a distinct code, such as a blank " ", * or -999, depending on the data type and range being recorded. When an entry of this kind is encountered, the software package will apply one or more rules in order to determine what action to take. For example, in computing basic statistics for quantitative datasets it may simply ignore missing values (as opposed to deleting records with missing values) and carry out the computation on the available subset, with a reduced count of items. This raises the question as to whether such estimates are biased.
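
As a minimal sketch of this kind of handling (the column names and the use of -999 as the missing-value code are assumptions for illustration), the following Python/pandas fragment recodes the missing entries and computes statistics on the available subset only:

```python
import numpy as np
import pandas as pd

# Hypothetical survey extract in which -999 codes a missing response
df = pd.DataFrame({"age":   [34, 51, -999, 42, 29],
                   "score": [7.2, -999, 5.9, 6.4, -999]})

# Recode -999 as NaN so that it is treated as missing rather than as data
df = df.replace(-999, np.nan)

# Summary statistics are then computed on the available values only,
# with a reduced count of items, rather than by deleting whole records
print(df.mean())      # missing values are skipped by default
print(df.count())     # number of non-missing values per column
```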

It is not merely the scale of missing items that must be considered, but also whether there is any pattern to the missing data. If the missing values occur completely at random (MCAR) and the proportion of missing values is not large (<5%), statistics such as mean values, variances, correlations etc. can be produced ignoring these missing values and the results will tend to be unbiased. However, if the missing values are not randomly distributed throughout the data, bias will be apparent. It is possible that the non-randomness of missing values is partial, in the sense that within groups the missing values occur randomly but between groups there are substantial differences. This might be observed in cases where one group is more likely to respond to a question or to perform a task than another group. If the data are missing at random (MAR) within groups but do not conform to the MCAR requirement, it is still possible to produce unbiased statistics within these groups. It may also be possible/acceptable to fill in missing values with estimated values that are derived from the remaining data in the entire study or subsets of the study - this applies principally to quantitative data in univariate and multivariate data sets, and to temporal and spatial datasets.

In order to determine whether MCAR, MAR or neither applies, the dataset can be partitioned and subjected to various forms of simple pattern analysis and statistical comparison. For example, all records could be divided into those with and without a data value on a given variable, and a comparison statistic (such as Little's chi-squared test, [LIT1, LIT2]) computed to try to detect any significant differences between the two subsets. Having identified the scale and nature of the missing data problem, the question then arises as to what action to take. If the sample size is large enough and the proportion of records with missing data is small, it may be acceptable to either ignore the missing values (especially if the MCAR or MAR tests indicate that this is very safe to do), or to delete/ignore entire records with missing values (generally an unsafe practice, as this tends to introduce additional bias), or to impute the missing data (see further, below) from the remaining records. In this latter case missing data are essentially 'invented' by reference to other data in the sample. Typically results are then reported with and without the imputed data, or with a clear explanation of the impact of imputation on the results.
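
The fragment below is not Little's test itself, but a simple illustration of the partitioning idea: records are split according to whether a value is present on one variable, and a fully observed variable is compared between the two subsets (the dataset and variable names are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical dataset: 'income' is sometimes missing, 'age' is always observed
df = pd.DataFrame({"age":    [23, 45, 31, 52, 38, 61, 27, 44, 55, 36],
                   "income": [21.0, np.nan, 30.5, np.nan, 28.0,
                              np.nan, 24.5, 33.0, np.nan, 27.5]})

# Partition the records into those with and without a value on 'income'
missing = df["income"].isna()
age_with    = df.loc[~missing, "age"]
age_without = df.loc[missing, "age"]

# Compare a fully observed variable between the two subsets; a clearly
# significant difference would suggest the values are not missing at random
t_stat, p_value = stats.ttest_ind(age_with, age_without, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```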

There are many techniques for such imputation, notably maximum likelihood and a variety of regression methods. Pure multiple regression methods tend to underestimate the true variance of the imputed data values, so some form of variance inflation may be added to overcome this limitation. So-called multiple imputation (MI) methods are now favored by some researchers since these appear to provide more representative and robust results (see further, Pickles, 2005 [PIC1]). MI methods involve some form of conditional simulation, producing several imputations (typically 5-10) and then using the mean of the results as the estimates for the missing values. Alternatively the entire analytical procedure can be carried out on each of the versions of the dataset, and the results from each analysis averaged or compared. Typically such methods will compute the mean and variance of the variable across records for which complete data is available and then sample random values from the Normal distribution matching these parameters to obtain sample values for the missing item. Some packages, such as SAS/STAT, perform MI using samples obtained via MCMC methods (essentially this involves using the remaining data as a model distribution for the missing data and randomly sampling from this model distribution). For categorical data samples are taken from a Multinomial distribution. Another approach, which is sometimes usable with categorical data, is to create a new category that contains those records that include missing data.
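
The following sketch illustrates the simple form of MI described above - drawing replacement values from a Normal distribution matched to the observed data, repeating the process m times and averaging the resulting estimates. It is a minimal illustration only, not the MCMC-based procedure used by SAS/STAT, and the data values are invented:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical univariate sample with three missing observations
x = np.array([5.1, np.nan, 4.8, 6.2, np.nan, 5.6, 5.9, np.nan, 4.7, 5.3])
observed = x[~np.isnan(x)]
mu, sigma = observed.mean(), observed.std(ddof=1)

m = 5                                   # number of imputations (typically 5-10)
estimates = []
for _ in range(m):
    filled = x.copy()
    # draw replacement values from a Normal matched to the observed data
    filled[np.isnan(filled)] = rng.normal(mu, sigma, size=np.isnan(x).sum())
    estimates.append(filled.mean())     # the 'analysis' step: here, just the mean

# the pooled estimate is the average over the m imputed datasets
print(np.mean(estimates))
```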

Similar concepts have been applied in temporal and spatial analysis, both as a form of missing data analysis and as a form of prediction or estimation for unsampled times and locations. For example, the use of conditional simulation is now a preferred form of prediction in geospatial engineering applications, such as oil and mineral prospecting. If there are very few missing data points in dense temporal or spatial datasets it is usual for these to be estimated using deterministic procedures, such as linear, bi-linear or spline interpolation from their immediate or near-neighboring data items, or simple mean or median values in the local neighborhood. The quality of imputed results can be evaluated by comparison with the entire dataset (e.g. convergence of parameters), by internal consistency checks (e.g. jackknifing and bootstrapping techniques) and/or by reference to external datasets and samples (e.g. so-called 'ground truth' comparisons).
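
As a simple illustration of local deterministic estimation in the spatial case, the sketch below fills a missing cell in a small grid of readings using the mean of its immediate neighbors (the grid values and function name are hypothetical):

```python
import numpy as np

# Hypothetical 4 x 4 grid of sensor readings with one missing cell
grid = np.array([[2.1, 2.3, 2.4, 2.6],
                 [2.2, 2.4, np.nan, 2.7],
                 [2.3, 2.5, 2.6, 2.8],
                 [2.4, 2.6, 2.7, 2.9]])

def fill_by_neighbor_mean(g):
    """Replace each missing cell by the mean of its available 4-neighbors."""
    g = g.copy()
    for i, j in zip(*np.where(np.isnan(g))):
        values = []
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ni, nj = i + di, j + dj
            if 0 <= ni < g.shape[0] and 0 <= nj < g.shape[1] and not np.isnan(g[ni, nj]):
                values.append(g[ni, nj])
        g[i, j] = np.mean(values)
    return g

print(fill_by_neighbor_mean(grid))
```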

Handling missing values - techniques and tools

This section provides a brief summary of the main approaches for handling missing values. In most instances these are procedures offered within software packages, but it remains the responsibility of the researcher to select the method used and to document and justify such selection when reporting the results of analysis. In many cases estimating missing values will apply to real-valued measurements, but some procedures may apply to binary or categorical data.

Ignoring entire records

This is the most commonly available approach for handling missing data. As noted above, it is only acceptable if the number of records is relatively large, the number of records with missing data is relatively small (<5%), and the missing records can be shown to occur completely at random (MCAR) or to be missing at random (MAR) within well-defined subsets of the data. In general this approach cannot be used in small sample balanced trials nor for time series.

Setting missing values to a fixed value

Many packages allow missing values to be replaced with a fixed value (e.g. 0) or a user-provided value for each instance. The problems of adopting these approaches are obvious: unless the substituted value happens to be close to the true value, means, totals and variance estimates will all be distorted.

Single estimation procedures

A very common approach to missing values is to use some form of estimation based on the characteristics of the data that has been successfully collected. For example, the SPSS Transform operation, Missing Values option, offers the following options for estimating such values: (i) use of the mean or median value of nearby points (by which it means neighboring records, with the number of such records used selectable by the researcher); (ii) use of the overall series mean, i.e. the mean based on all records; (iii) linear interpolation, which applies to data in series and uses the two adjacent non-missing values to fill in the gap or gaps; (iv) linear regression, which is similar to linear interpolation but uses a larger number of neighboring points and calculates a best-fit line through these as its estimator for intermediate missing values. Other software packages may provide additional options - for example, a variety of model-based interpolation options are available in the SAS/ETS (Economic and Time Series) software. Similar procedures are provided in some other packages, but often it remains the researcher's responsibility to provide or compute estimates for missing values as a part of the data cleaning and preparation stage.
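
The following pandas sketch mimics options (i)-(iv) for a short series with a single missing value; it is not SPSS, merely an illustration of the same ideas, and the series values are invented:

```python
import numpy as np
import pandas as pd

# Hypothetical series in which one observation failed to record
s = pd.Series([3.1, 3.4, 3.3, np.nan, 3.9, 4.1, 4.0])

# (i) mean of nearby points: a centered window of two points either side
nearby = s.fillna(s.rolling(5, center=True, min_periods=1).mean())

# (ii) overall series mean
series_mean = s.fillna(s.mean())

# (iii) linear interpolation between the two adjacent non-missing values
interpolated = s.interpolate(method="linear")

# (iv) regression: a best-fit line through all non-missing points, by position
pos = np.arange(len(s))
mask = s.notna().to_numpy()
slope, intercept = np.polyfit(pos[mask], s[mask].to_numpy(), 1)
trend = s.fillna(pd.Series(slope * pos + intercept, index=s.index))

print(pd.DataFrame({"nearby_mean": nearby, "series_mean": series_mean,
                    "interpolated": interpolated, "regression": trend}))
```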

Multiple imputation (MI)

Multiple imputation (MI) methods vary depending on the type of data that is missing and the software tools used to estimate (impute) the missing values. In this subsection we describe the approaches adopted by the SAS/STAT and SPSS software, which are largely based on the published work of Rubin (1976, 1987, 1996 [RUB1], [RUB2], [RUB3]).

Essentially there are 3 stages to MI:

the missing data are filled in m times to create m complete datasets (m is typically 5)
the m complete datasets are analyzed separately, in the usual manner
the results from the multiple analyses are combined in order to provide statistical inferences regarding the data

Depending on the type and pattern of missing data, SAS/STAT and SPSS will generate estimates for the missing values using some form of regression analysis of the valid data (single, multiple or logistic regression), or MCMC methods under an assumption of multivariate Normality, for more general missing values. The latter approach is of the general form: (a) initialize estimates for the missing values for all records and variables by drawing random values from a Normal distribution with mean and variance that match the non-missing data (or use a multinomial distribution for categorical data, with proportions in each class defined by the proportions in the non-missing data); (b) using all the data, except for missing data on the jth variable, use a univariate method (e.g. regression) to impute the missing values in that variable; (c) iterate across all variables and track the convergence of both the mean and variance of the imputed missing values.
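
A minimal sketch of this general scheme is given below, assuming multivariate quantitative data with missing entries marked as NaN; it follows steps (a)-(c) using ordinary least-squares regression for the univariate imputation step, and is an illustration rather than the SAS/STAT or SPSS implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def iterative_impute(X, n_iter=10):
    """A sketch of regression-based iterative imputation.

    X is an n x p array with np.nan marking missing entries. Missing cells
    are first initialized with draws from a Normal distribution matched to
    each column's observed mean and standard deviation, then repeatedly
    re-estimated by regressing each column on all of the others.
    """
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    # (a) initialize from a Normal matching the non-missing data in each column
    for j in range(X.shape[1]):
        obs = X[~miss[:, j], j]
        X[miss[:, j], j] = rng.normal(obs.mean(), obs.std(ddof=1), miss[:, j].sum())
    # (b)/(c) iterate a univariate regression imputation across the columns
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])   # design matrix
            obs_rows = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(A[obs_rows], X[obs_rows, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ beta           # updated estimates
    return X

# Hypothetical 3-variable dataset with scattered missing values
data = np.array([[1.0,    2.1,    3.2],
                 [2.0, np.nan,    5.1],
                 [3.0,    6.2, np.nan],
                 [4.0,    8.1,    9.3],
                 [np.nan, 10.0,  11.2],
                 [6.0,   12.1,   13.0]])
print(iterative_impute(data))
```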

When the m datasets are analyzed, the results are combined to produce a single set of inferences together with the between- and within-imputation covariances. As Rubin (1996, p. 476 [RUB3]) explains, the posterior distribution of the data obtained following multiple imputation is simply the average of the individual imputations, the mean values are the means of the imputations, and the variances are the sum of the average of the individual within-imputation variances obtained from the MI process and the between-imputation variance of the mean values (the latter inflated by a factor of (1+1/m)).
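
A sketch of this combining step, using the standard rules set out by Rubin (1987 [RUB2]), is shown below; the estimates and variances in the example are invented:

```python
import numpy as np

def pool_estimates(estimates, variances):
    """Combine results from m imputed datasets using Rubin's rules: the pooled
    estimate is the mean of the individual estimates, and the total variance is
    the average within-imputation variance plus (1 + 1/m) times the
    between-imputation variance of the estimates."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    w_bar = variances.mean()            # average within-imputation variance
    b = estimates.var(ddof=1)           # between-imputation variance
    return q_bar, w_bar + (1 + 1 / m) * b

# Hypothetical point estimates and variances from m = 5 imputed datasets
q, t = pool_estimates([5.2, 5.4, 5.1, 5.3, 5.5], [0.20, 0.22, 0.19, 0.21, 0.20])
print(f"pooled estimate = {q:.2f}, total variance = {t:.3f}")
```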

References

[COX1] Cox D R (1958) Planning of experiments. John Wiley & Sons, New York

[EVA1] Evans S (1988) The statistical aspects of a study to help in the design of a multi-lingual keyboard. MSc. thesis, University of Kent, UK

[LIT1] Little R J A, Rubin D B (1987) Statistical analysis with missing data. John Wiley & Sons, New York

[LIT2] Little R J A (1988) A test of missing completely at random for multivariate data with missing values. J of the American Statistical Association, 83, 1198-1202

[PIC1] Pickles A (2005) Missing data, problems and solutions. pp. 689-694 in Kempf-Leonard K, ed., Encyclopedia of social measurement. Elsevier, Amsterdam

[RUB1] Rubin D B (1976) Inference and missing data. Biometrika, 63, 581-592

[RUB2] Rubin D B (1987) Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York

[RUB3] Rubin D B (1996) Multiple imputation after 18+ years. J of the American Statistical Association, 91, 473-489