Data transformation and standardization

In statistical analysis the term data transformation is usually reserved for functions that convert observed data values that are not Normally distributed into values that are approximately Normally distributed. Modification of datasets for other reasons, for example re-scaling data to a [0,1] range or altering data to remove 0 values, generally falls into the broader field of data preparation and cleaning.

Different data types require different transforms in order to achieve a more Normal distribution in the transformed variable. Many transforms apply only to positive data, although a dataset {x} containing zero or negative values may be made positive by adding a fixed number greater than |min{x}| to each data item before transformation. For heavily right-skewed (positively skewed) data distributions, log transforms and Box-Cox transforms are widely used. For simple count data Freeman-Tukey (FT) square root transforms are often used, and for proportions an arcsine transform may be most appropriate. For binary data (e.g. Presence/Absence or Yes/No response data) logit transforms are normally used. After transformation and analysis of the transformed variable, it is useful to back-transform the results so they are presented in the same units as the original data. Thus if a dataset is transformed using natural logarithms and the mean value, m, computed, the back transform of this mean value would be e^m or exp(m), which is the geometric mean of the original data rather than its arithmetic mean.
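The transforms named above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function names and the margin parameter are hypothetical, the arcsine function shown is the plain arcsine square root form (not the Freeman-Tukey double arcsine variant), and the sample data are invented for the example.

```python
import math
from statistics import mean

def shift_to_positive(xs, margin=1.0):
    """Add a constant so that every value is strictly positive.

    Hypothetical helper: the offset is |min{x}| + margin when the
    minimum is zero or negative, otherwise no shift is applied.
    """
    lowest = min(xs)
    offset = (-lowest + margin) if lowest <= 0 else 0.0
    return [x + offset for x in xs], offset

def log_transform(xs):
    """Natural log transform for strictly positive, right-skewed data."""
    return [math.log(x) for x in xs]

def freeman_tukey_sqrt(counts):
    """Freeman-Tukey square root transform for counts: sqrt(x) + sqrt(x + 1)."""
    return [math.sqrt(c) + math.sqrt(c + 1) for c in counts]

def arcsine_sqrt(props):
    """Arcsine square root transform for proportions in [0, 1]."""
    return [math.asin(math.sqrt(p)) for p in props]

def logit(props):
    """Logit transform for proportions strictly between 0 and 1."""
    return [math.log(p / (1.0 - p)) for p in props]

# Back-transforming the mean of log-transformed data.
data = [1.0, 2.0, 4.0, 8.0, 16.0]
m = mean(log_transform(data))
back_transformed = math.exp(m)  # approximately 4.0, the geometric mean
print(back_transformed, mean(data))
```

Note that the back-transformed value differs from the arithmetic mean of the original data (6.2 here), which is why results should be labelled clearly when they are reported on the original scale.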

Several transforms are variance stabilizing, at least for the kinds of dataset for which they were designed. By this is meant that the variance of the transformed variable tends to a stable or constant value, which is important if multiple samples that exhibit non-homogeneous variances are being compared. The so-called Normal- or z-transform, which subtracts the mean from each value and divides by the standard deviation, is in reality a form of standardization, although for the mean values of samples drawn from the same or similar distributions it does produce datasets that approximate the Normal distribution with mean 0 and standard deviation 1, i.e. ~N(0,1).
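As a small illustration of why the z-transform is a standardization and not a Normalizing transform, the sketch below rescales a sample to mean 0 and standard deviation 1. It assumes the sample standard deviation (divisor n-1) is wanted; the function name and the data are hypothetical.

```python
from statistics import mean, stdev

def z_standardize(xs):
    """Subtract the sample mean and divide by the sample standard deviation."""
    mu = mean(xs)
    sd = stdev(xs)  # sample standard deviation, divisor n - 1
    return [(x - mu) / sd for x in xs]

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = z_standardize(sample)
print(mean(z), stdev(z))
```

The rescaling is linear, so the shape of the distribution (including any skew) is unchanged; only its location and scale are altered.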

Details of a range of widely used data transformations are provided in the following subsections. Most statistical software packages provide a range of such functions, with or without some form of optimization of the transformation parameters, and usually with graphical inspection of the results as an option. For example, SAS/STAT includes the transformations BoxCox, Exp and Log, Logit, and Power, where Power is a variation on the standard Box-Cox model. The Freeman-Tukey arcsine transform is typically applied to proportion data prior to analysis, so is often not provided as a separate function within statistical software packages.
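To make the idea of optimizing a transformation parameter concrete, the following sketch applies the standard one-parameter Box-Cox transform and chooses the parameter lambda by maximizing the profile log-likelihood over a simple grid. This is an illustrative assumption about how such optimization can be done, not a description of what SAS/STAT or any other package does internally; the function names, grid limits and sample data are all hypothetical.

```python
import math
from statistics import pvariance

def box_cox(xs, lam):
    """Box-Cox transform: (x^lam - 1) / lam, or ln(x) when lam is 0."""
    if abs(lam) < 1e-12:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1.0) / lam for x in xs]

def box_cox_loglik(xs, lam):
    """Profile log-likelihood of lam, up to an additive constant.

    Uses the maximum likelihood variance (divisor n) of the transformed
    data plus the Jacobian term (lam - 1) * sum(ln x).
    """
    n = len(xs)
    ys = box_cox(xs, lam)
    jacobian = (lam - 1.0) * sum(math.log(x) for x in xs)
    return -0.5 * n * math.log(pvariance(ys)) + jacobian

def best_lambda(xs, lo=-2.0, hi=2.0, steps=401):
    """Grid search for the lam that maximizes the profile log-likelihood."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda lam: box_cox_loglik(xs, lam))

# Data whose logarithms are symmetric about 0, so lam near 0 (a log
# transform) is expected to be selected.
xs = [math.exp(v) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
print(best_lambda(xs))
```

A grid search is used here only for simplicity; a numerical optimizer would normally be preferred, and the data must be strictly positive for the transform to be defined.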
