In many instances, before sample data can be analyzed in detail, one or more tests must be carried out to determine whether the sample appears to be drawn from a particular probability distribution, e.g. the Normal distribution or the Poisson distribution. For parametric methods that rely on an underlying distribution assumption, such as t-tests, regression analysis and analysis of variance, such checks are essential. For non-parametric methods the population distribution is of little or no importance, so goodness-of-fit tests are not required in these instances. Goodness-of-fit tests are themselves generally non-parametric, in that they typically make no assumptions that require fitting of parameters. Such tests are often best accompanied by interactive graphics and data management tools. In many instances prior exploratory data analysis will help to identify data errors and outliers, or the need for a data transformation, and dealing with these is a prerequisite for achieving a good fit to a target distribution.
It is common, especially with larger datasets, to experiment with a variety of data transforms to identify which gives the best approximation to the Normal distribution. A convenient method for doing this is to plot the cumulative distribution function of the data (raw and transformed variants) on a Normal probability graph, whose vertical scale is chosen so that Normally distributed data fall on a straight line. The closer the plotted points lie to a straight line, the better the fit to the Normal. In the figure below we show a series of transformations of 301 Radon level measurements made in the area around Castleisland in South-West Ireland. These graphs (probability plots) and the associated test statistics were generated using the Minitab package. Three very large-valued outliers were removed from the original dataset prior to distribution analysis. The raw data plotted in the first graph (a) are clearly non-Normal, diverging substantially from the straight line that indicates a Normal distribution. In this case a simple log transformation of the data (b) improves the fit to the Normal, but the plot still diverges from the line, even when corrections are made for background radiation (c). In each of these three cases the Anderson-Darling test rejects the hypothesis that the data are Normally distributed, whereas for the Box-Cox transform (d) with optimized parameter, λ, the test does not reject Normality, and in this instance analysis proceeded using this specific transform.
Source: F O’Sullivan (2005) Factors affecting Radon levels in houses: A GIS study around Castleisland, County Kerry, South-West Ireland. Unpub. MSc Thesis, Dept of Geomatic Engineering, University College London.
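The comparison of transforms described above can be sketched in Python. This is a minimal illustration only, assuming NumPy and SciPy are available; it is not the Minitab analysis used for the figure. The data are synthetic and deliberately right-skewed (they are not the Radon measurements), and the helper name `normal_fit_summary` and the generating formula are hypothetical choices made for the example. For each variant it reports the correlation of the Normal probability plot (values near 1 indicate a nearly straight line) and the Anderson-Darling statistic against SciPy's 5% critical value.

```python
import numpy as np
from scipy import stats


def normal_fit_summary(values):
    """Summarise fit to the Normal for one variant of the data."""
    values = np.asarray(values, dtype=float)
    # probplot pairs the ordered data with theoretical Normal quantiles and
    # fits a line; r close to 1 means the plot is close to straight.
    _, (_, _, r) = stats.probplot(values, dist="norm")
    ad = stats.anderson(values, dist="norm")
    # SciPy tabulates critical values at 15, 10, 5, 2.5 and 1 percent,
    # so index 2 is the 5% level.
    critical_5pct = ad.critical_values[2]
    return {
        "r": float(r),
        "A2": float(ad.statistic),
        "critical_5pct": float(critical_5pct),
        "reject_normal": bool(ad.statistic > critical_5pct),
    }


# Synthetic stand-in data (an assumption for illustration): evenly spaced
# Normal quantiles, shifted, scaled and raised to the 4th power, giving
# 301 positive, right-skewed values.
n = 301
z = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
x = (3.0 + 0.5 * z) ** 4

# With no lambda supplied, boxcox estimates it by maximum likelihood and
# returns the transformed data together with the estimate.
boxcox_x, lam = stats.boxcox(x)

summaries = {
    "raw": normal_fit_summary(x),
    "log": normal_fit_summary(np.log(x)),
    "box-cox": normal_fit_summary(boxcox_x),
}

print(f"estimated Box-Cox lambda: {lam:.3f}")
for name, s in summaries.items():
    print(f"{name:8s} r={s['r']:.4f}  A2={s['A2']:.3f}  "
          f"5% critical={s['critical_5pct']:.3f}  reject={s['reject_normal']}")
```

Box-Cox requires strictly positive data, so measurements containing zeros or negative values would need to be shifted before the transform is applied. Because λ is estimated from the same data that are then tested, the Anderson-Darling result for the Box-Cox variant is best read as a descriptive check rather than a strict significance test.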