The Pearson or Product Moment correlation coefficient, rxy, is essentially a measure of linear association between two paired variables, x and y. It is frequently computed as part of a data analysis exercise that includes plotting the pair of variables against one another to visually determine the form of the relationship, combined with the production of a best-fit or regression line through the data points. For example, the graph below (a scatterplot) shows the relationship between law school admission test scores (LSAT) and the subsequent grade point averages (GPA) for 15 law schools in the USA. It appears that there is a broadly linear relationship with individual observations scattered around the best fit line, with most cases being reasonably close to this line.
US Law School Data (after Efron and Tibshirani, 1993 [EFR1])
The correlation coefficient, rxy, is a measure of the strength of this relationship and is computed as the ratio of the covariance to the product of the standard deviations. If the two datasets are the same or perfectly matched this will give a result=+1. If there is no (linear) association the result will be 0, and if the data pairs are perfectly correlated but in an inverse or negative manner, the coefficient will be -1 (and the slope of the best fit line will be negative).
The value of the correlation coefficient squared, r2, is often computed and called the coefficient of determination. It provides an estimate of the proportion of the variance that can be explained by the pattern of association between the two variables, or more specifically, that part of the variance 'explained' by the regression line. So, for example, the correlation coefficient for the data above is 0.7764 and this provides an indication that roughly 60% of the pairwise variation may be accounted for by their linear pattern of joint variation, and 40% remains unexplained/the result of random variations in the data (residual variation). We return to this example a little later in this section.
The calculation of rxy is based on measuring the pattern of common (or co-) variation observed in a collection of two (or more) datasets, or partitions of a single dataset, and then standardizing this measure by the product of the standard deviations of each set. The sample covariance computation is simply:
Using this notation we can obtain the sample variances
The statistic rxy is then calculated as the ratio:
Since the terms involving n drop out of this ratio we can also write this as:
The population correlation coefficient is usually denoted, ρ, and can be written as:
For large n the sample statistic r is approximately distributed ~N(ρ,1/(n-2)), but for typical samples the significance of a correlation value must be tested using the t-distribution or some form of permutation test, such as bootstrapping (see further, below). In the case of the t-distribution approach the following transform is calculated and then compared with critical values of the t-distribution with n-2 degrees of freedom:
Fisher (1921, [FIS2]) showed that the transform:
produced an approximately Normally distributed variable with mean tanh-1(ρ) and variance 1/(n-3).
All software packages, including the standard R implementation, SPSS and many others, use the Student's t transform and test to compute probability values, but some may facilitate additional approaches. The basic R command for correlation computations is shown below:
R: cor(x,y) — correlation between the vectors x and y. The correlation method used is Pearson’s by default, but can be specified as “Kendall” or “Spearman” to obtain the results from these Rank-based methods. For example, using the swiss dataset, Fertility and Catholic columns, this statistic yield results of 0.40, 0.21 and 0.29 (rounded) respectively for the three correlation methods. Various options exist to handle missing data. Note that R: cov(x,y) computes the covariance between the vectors x and y, which is not strictly valid for methods other than Pearson’s.
In an earlier topic we have described some examples of scale, grouping and arrangement effects on data. In this subsection we consider a related issue, which can be difficult to identify using simple statistical methods, even when aided by graphical visualizations. The graph below shows a scattergram plot of the number of species of mammals (y-axis) found in forests of differing levels of productivity (x-axis). This example is from Crawley (2007, [CRA1]) and illustrates a widely reported statistical phenomenon known variously as the Yule-Simpson effect or Simpson's Paradox. It would appear that there is a very strong positive correlation between the two variables, and indeed a Spearman's rank correlation calculation gives a coefficient of +0.75. However, you will see that we have colored the data points according to a grouping parameter that Crawley has defined. Essentially this divides the point set into groups based on a scaling in both variables — small species numbers and low productivity, through to larger species numbers and higher productivity. Within these groupings the correlation is strong and negative, as shown clearly in the composite graph that follows.
Mammal species x forest productivity — scale dependency
Mammal species x forest productivity — scale dependency — disaggregated scatterplots
The 15 data pairs plotted in the US Law School graph shown earlier can be labeled by row, 1,2...15 and column 1,2. We could select 15 random integers between 1 and 15 to produce a new dataset from the labeled set, which would contain 15 data pairs (like the original dataset) but typically some pairs would be missing whilst other pairs would occur more than once (because this procedure, known as bootstrapping, uses sampling with replacement). The correlation coefficient for this new dataset could then be computed and the process repeated many times in order to obtain a distribution of r values based on the original data (which we noted earlier has an r value of 0.7764). In the histogram below we have carried out this operation (using MATLab's Statistics Toolbox) 1000 times. The result shows that most values with this data are in the range 0.4-1.0 and very rarely will values less than 0.4 be encountered, suggesting that the data (LSAT and GPA values) are indeed strongly correlated, without the need for making any distributional assumptions about the underlying population.
Bootstrap frequency distribution for correlation coefficient, r
This bootstrap approach provides one way of obtaining 'experimental' confidence limits for a given correlation value. In this instance (based on this specific bootstrap run of 1000 samples) the 95% confidence limits are 0.4726 and 0.9565 (i.e. quite wide). However, using a transform due to Fisher (1921, [FIS2]), the variable:
is approximately Normally distributed ~N(0,1), so may be used to compute confidence intervals on the stricter assumptions of bivariate Normality (assuming this is valid for this data). This is the usual manner in which such confidence intervals for r are provided in software packages. For the example above, these limits are 0.43851 and 0.92196, so slightly tighter than the bootstrap results we obtained.
If the dataset to be analyzed is comprised of measurements on multiple variables, for example with M rows representing individual cases and N columns representing measurements on a series of variables, each pair of variables can be correlated pairwise with every other variable. This will produce a symmetric correlation matrix, with dimensions NxN, and diagonal containing all values=1. Off-diagonal entries will represent the correlation coefficients that apply between the ith and jth variables. With two variables the correlation matrix will have dimension 2x2 and will simply contain one correlation coefficient (repeated). With more variables the matrix will be larger. The R package includes a function, pairs(), which enables a matrix of scatterplots to be created with the diagonal elements showing the variable in question and each plot corresponding to the pair of variables for that column and row. In the example shown below, using sample data included within the R distribution set, 12 variables are displayed each of which contains 43 point pairs.
This provides an immediate visualization showing which pairs appear to display a linear relationship, and which are more tightly following such a relationship (and will therefore have a higher r-value). The data here are a set of lawyers ratings on 12 variables (such as integrity, speed of decision-making, preparedness for trials etc) of State judges in the US Supreme Court. Other statistical software packages, such as SAS/STAT, XLStat, MATLab and others provide similar functionality.
The correlation matrix corresponding to these graphs is provided below.
1.000 -0.133 -0.154 0.012 0.137 0.087 0.011 -0.026 -0.012 -0.044 0.054 -0.034
-0.133 1.000 0.965 0.872 0.814 0.803 0.878 0.869 0.911 0.909 0.742 0.937
-0.154 0.965 1.000 0.837 0.813 0.804 0.856 0.841 0.907 0.893 0.789 0.944
0.012 0.872 0.837 1.000 0.959 0.956 0.979 0.957 0.954 0.959 0.813 0.930
0.137 0.814 0.813 0.959 1.000 0.981 0.958 0.935 0.951 0.942 0.879 0.927
0.087 0.803 0.804 0.956 0.981 1.000 0.957 0.943 0.948 0.946 0.872 0.925
0.011 0.878 0.856 0.979 0.958 0.957 1.000 0.990 0.983 0.987 0.849 0.950
-0.026 0.869 0.841 0.957 0.935 0.943 0.990 1.000 0.981 0.991 0.844 0.942
-0.012 0.911 0.907 0.954 0.951 0.948 0.983 0.981 1.000 0.993 0.891 0.982
-0.044 0.909 0.893 0.959 0.942 0.946 0.987 0.991 0.993 1.000 0.856 0.968
0.054 0.742 0.789 0.813 0.879 0.872 0.849 0.844 0.891 0.856 1.000 0.907
-0.034 0.937 0.944 0.930 0.927 0.925 0.950 0.942 0.982 0.968 0.907 1.000
As can be seen, the diagonal is always 1, as expected, and the matrix is symmetric. The first column and row contain relatively low correlation values, reflecting the lack of any clear association in the scatterplots in the diagram above, whereas all the other entries show fairly high positive values, reflecting the close, linear, positive slope relationships seen in the remaining scatterplots. The R package corrgrams provides a range of enhancements to the correlation matrix presentation above.
In the discussions above all the examples have assumed that the source data is in a form that is immediately suitable for such analysis. In many instances some form of prior data transformation may be advised, especially if analysis based on the Normality assumption are to be followed. Most software packages also support the option to compute correlations based on weighted data, whereby a separate weight vector or matrix is used prior to the computation to alter each data value, for one or more variables. For example, the data might be a set of measurements of hospital performance for a number of hospitals, with the weights being a measure of hospital size (e.g. in terms of bed capacity, staffing or funding). The weighted formulation is:
where wi are the weights applied and the mean values for x and y are calculated as weighted means rather than unweighted means.
Finally, it should be noted that real-world problems are complex, involving the interactions between many correlated variables. It is often useful to be able to examine the relationship between two variables, whilst holding a third (or several others) constant. One way of doing this would be to group the third variable into broadly homogeneous classes (e.g. high, medium and low) and then calculate the correlation on the two variables of interest within each of these classes, and then take the average of the results (weighted by the number of cases in each of the separate classes). Such a calculation would yield a partial correlation coefficient, ryx.z — partial with respect to z, because we have held z constant whilst computing the correlation between x and y. Rather than grouping on the third variable, there is a simple adjustment to the standard correlation coefficient which provides this adjustment automatically. For three variables the formula is:
Correlation coefficients of this type are called first-order partial correlations. This approach can be extended in a natural manner to larger numbers of variables. For example, with an additional variable, w, we have:
This form of expression is described as a second-order partial correlation, since it is computed using first-order partial correlations. As with the standard product moment correlation coefficient, significance levels are obtained using the t-transform and then compared with critical values of the t-distribution with n-k-2 degrees of freedom, where k is the number of variables being controlled for or partialled out:
Confidence limits for partial correlation coefficients are generally obtained using Fisher's transform, as per standard product moment correlation.
The term correlogram refers to a diagram that plots the variation in the level of correlation against an ordered variable, such as time (autocorrelation plots against time lags, which are also known as auto-correlograms), or against distance arranged in bands (widely used in geostatistical analysis). The R package corrgrams, sometimes referred to as correlograms, is in fact a form of exploratory data analysis and visualization tool that displays correlation matrices in novel ways.
Wikipedia, Resampling: http://en.wikipedia.org/wiki/Resampling_%28statistics%29#Permutation_tests