The correlation methods described in the previous subsections deal with data analyzed in pairs. The data for each member of a pair can be seen as a vector of observations or measurements, with the vector elements being the individual observations. Canonical correlation, introduced by Hotelling (1936 [HOT1]), addresses the situation where there are two groups of vectors, and the objective is to obtain a new pair of vectors, each formed as a linear combination of the source variables in one group, that have the maximum possible correlation. The correlations are computed as product moment correlations, and the procedure as a whole is similar to principal components analysis and factor analysis. Unlike those procedures, however, canonical correlation is concerned with two distinct sets of data, not one (even if that one set relates to many variables). Neither set is treated as dependent or independent. The observations must match across all variables (the same n cases are measured on every variable), but the number of variables in X=Set1 need not equal the number in Y=Set2.

More formally, let the first set, X, contain n observations on each of p variables and the second set, Y, contain n observations on each of q variables. Canonical correlation seeks two new vectors, u and v, that are linear combinations of the variables in X and in Y respectively, such that u and v have the highest possible correlation. Thus we seek the sets of coefficients a and b (the canonical coefficients) in the two linear expressions below that achieve this maximization objective:

$$u_1 = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1p}x_p$$
$$v_1 = b_{11}y_1 + b_{12}y_2 + \cdots + b_{1q}y_q$$

or, using vector notation, equations of the form u=a'X and v=b'Y; the extra subscript 1 in the expressions above indicates that this is the first such pair produced. In principle one could then seek a second pair of vectors, u2 and v2, formed in the same manner but uncorrelated with the first pair. This process could continue until the min(p,q)th pair of vectors has been computed.
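As a minimal sketch of what the linear expressions mean in practice, the following forms u and v from trial coefficient vectors and measures their product moment correlation. The data values, the coefficients and the helper names `linear_combination` and `pearson` are hypothetical, chosen only for illustration; the coefficients here are arbitrary guesses, not the maximizing canonical coefficients.

```python
from math import sqrt

def linear_combination(rows, coeffs):
    """Return one score per case: the weighted sum of that case's variables."""
    return [sum(c * x for c, x in zip(coeffs, row)) for row in rows]

def pearson(u, v):
    """Product moment correlation between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)

# Hypothetical data: n = 5 cases, p = 2 variables in X, q = 3 variables in Y
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
Y = [[2.0, 0.5, 1.0], [1.0, 1.5, 0.0], [4.0, 2.0, 3.0],
     [3.0, 2.5, 1.0], [6.0, 3.0, 2.0]]

a = [0.5, 0.5]        # trial coefficients for the X variables
b = [1.0, -1.0, 0.5]  # trial coefficients for the Y variables

u = linear_combination(X, a)  # one value of u per case
v = linear_combination(Y, b)  # one value of v per case
r = pearson(u, v)             # the quantity canonical correlation maximizes
```

Rescaling a or b by a positive constant leaves r unchanged, so the coefficients are only determined up to scale. This is why implementations typically impose a normalization such as unit variance for u and v.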

The correlation coefficient to be maximized can be written as a generalized version of the standard product moment correlation:

$$\rho = \frac{a' S_{XY} b}{\sqrt{(a' S_{XX} a)(b' S_{YY} b)}}$$

where SXY is the p by q covariance matrix whose entries are the covariances of the ith x-vector and the jth y-vector, SXY={Cov(xi,yj)}, and SXX and SYY are the corresponding within-set variance-covariance matrices. The solution to this optimization problem is of the form:

$$a = S_{XX}^{-1/2} c, \qquad b = S_{YY}^{-1/2} d$$

where the vectors c and d are obtained from matrix operations on the variance-covariance matrices for the pairs XX, XY, YX and YY.
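One common way to carry out these matrix operations is sketched below with NumPy. It assumes the standard whitening formulation, in which c and d are the left and right singular vectors of the matrix formed by pre- and post-multiplying the cross-covariance matrix by the inverse square roots of the two within-set covariance matrices. The function names and the simulated data are hypothetical, and this is not necessarily the algorithm used by any particular statistics package.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def canonical_correlations(X, Y):
    """Return (rho, A, B): canonical correlations and coefficient matrices.

    Column k of A and column k of B hold the coefficients of the kth pair.
    """
    X = X - X.mean(axis=0)  # centre each column
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1)  # p x p covariance within X
    Syy = Y.T @ Y / (n - 1)  # q x q covariance within Y
    Sxy = X.T @ Y / (n - 1)  # p x q cross-covariance
    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    # The singular values of the whitened cross-covariance are the correlations
    C, rho, Dt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is, full_matrices=False)
    A = Sxx_is @ C     # a_k = Sxx^(-1/2) c_k
    B = Syy_is @ Dt.T  # b_k = Syy^(-1/2) d_k
    return rho, A, B

# Simulated data: n = 50 cases, p = 2 and q = 3 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = rng.normal(size=(50, 3))
Y[:, 0] += X[:, 0]  # induce some association between the two sets

rho, A, B = canonical_correlations(X, Y)
u1, v1 = X @ A[:, 0], Y @ B[:, 0]  # first pair of canonical variates
```

The function returns min(p,q) correlations in decreasing order. With this normalization each canonical variate has unit variance, and successive variates within a set are uncorrelated with one another.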

Canonical correlation has been applied in a variety of fields, but has not gained widespread use, mainly because the results of the analysis are difficult to interpret and then to use. Clark (1975 [CLA1]) provides a detailed walk-through of the technique as applied to problems in spatial analysis, where the cases are a set of n regions, the set X includes variables relating to employment levels in year 1, and the set Y includes a different set of variables measured in year 2. Despite the difficulties in interpretation, he regards the technique as a valuable addition to the arsenal of tools available for multivariate data analysis, especially where a broad scale synthesis is sought. In the example included as part of the test data in R (cancor function), data from 50 countries are analyzed, where Set1 comprises two variables giving population percentages at different ages, and Set2 comprises three savings-related variables. By default, the mean of each column vector is subtracted before the coefficients are computed.

References

[BAR1] Bartlett M S (1941) The statistical significance of canonical correlations. Biometrika, 32, 29-38

[CLA1] Clark D (1975) Understanding Canonical Correlation. CATMOG 3. Available from: http://qmrg.org.uk/catmog/

[HOT1] Hotelling H (1936) Relations between two sets of variates. Biometrika, 28, 321-377

[LAW1] Lawley D N (1959) Tests of significance in canonical analysis. Biometrika, 46, 59-66

SAS/STAT 9.2 (2009) User Guide, 2nd edition. SAS Institute Inc., Cary, NC, USA