Canonical Corelations in R
Canonical correlation analysis (CCA) is a multidimensional exploratory statistical method which operates on the same principle as the principal component analysis. The main purpose of the canonical correlation approach is the exploration of sample correlations between two sets of quantitative variables observed on the same experimental units. On the other hand PCA method deals with one data set only and it tries to to reduce the overall dimensionality of the dataset using some linear combination of the initial variables.
In the R statistical software there is function cancor()
available under the standard instalation. Some additional packages can be downloaded and installed as well (mainly the R packages ‘CCA’ and ‘vegan’).
In the following we will consider the same dataset we have already worked with before. The data consists of two datasets where each dataset represents measurements in some specific (64-65) river localities in the Czech republic. The first dataset contains measurements of biologial metrics for each locality (17 different metrics and taxons) and the secon dataset contains measurements on chemical concentrations and values at the same localities (7 covariates recored).
The idea is to somehow relate both dataset in order to explain what bio metrics can correlate which which chemical concentrations. One way of answering this is to apply the canonical correlation approach explained below.
rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)
chemData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/chemData.csv", header = T)
head(bioData)
## Kod_Canoco SaprInd Lital RETI EPTAbu Marg Metaritr JepAbu
## 1 BecvChor 2.57620 31.96916 0.37453 26.90136 7.10911 16.25647 14.89330
## 2 BecvOsek 2.43686 28.12037 0.27790 23.32888 5.80443 14.14831 11.06343
## 3 BecvTrou 2.10235 21.30691 0.31422 36.49575 6.14622 11.21903 30.23011
## 4 BelaBosk 1.53297 37.12914 0.48181 38.49802 9.36742 25.32146 10.11182
## 5 BilyPoto 1.79499 27.21495 0.37100 23.26918 8.62000 17.88813 5.95577
## 6 BlatTova 2.90209 16.04854 0.30351 8.64187 5.51661 7.98030 4.71039
## Epiritral Hyporitral Ntaxon Nceled Bindex EPTTax PosAbu Spasaci
## 1 7.68869 22.52242 18 40 0.64865 23 0.33939 28.29352
## 2 6.96659 19.74992 12 25 0.37838 16 0.00000 20.84029
## 3 5.92213 18.63735 9 25 0.16216 12 1.47017 20.88415
## 4 20.37168 18.17237 16 40 0.72000 30 8.07504 28.62651
## 5 12.08359 19.09771 21 36 0.54167 21 0.54567 22.02324
## 6 5.23533 14.60371 15 27 0.09091 5 0.00000 16.78539
## MDS_1 MDS_2
## 1 -0.69831500 -0.7040517
## 2 -0.61323430 -0.3796887
## 3 -0.08536265 1.6051440
## 4 0.55758620 -0.4424091
## 5 0.40888510 -0.2172988
## 6 -0.76699300 0.5244099
head(chemData)
## Kod_Canoco Tepl_max X.O2 BSK5 Kond N.NH4 N.NO3 Pcelk
## 1 BecvChor 20.9 107 1.3 33.7 0.09 1.55 0.077
## 2 BecvOsek 21.8 105 1.2 37.0 0.06 2.32 0.063
## 3 BecvTrou 22.1 106 1.4 43.9 0.07 2.49 0.080
## 4 BelaBosk 16.4 104 1.1 22.1 0.02 2.05 0.029
## 5 BilyPoto 18.9 104 1.9 30.4 0.10 4.48 0.131
## 6 BrezJaro 18.4 84 2.5 83.8 0.30 3.63 0.237
Again, there is one locality which is missing the dataset of chemical measurements. We identify this locality and we do not consider it for further analysis. Both datasets are alligned with respect to he locality names - the first column in each dataset.
ind <- match(chemData[,1], bioData[,1])
data <- data.frame(bioData[ind, ], chemData[, 2:8])
X <- data[,2:9]
Y <- data[,19:25]
A nice graphical tool to view both datasets with respect to their overall (within and between) correlation structure is avialable within the R package ‘CCA’ (Canonical Correlation Analysis). The packages needs to be firstly installed (install.packages("CCA")
) and later one can use functions matcor()
and img.matcor()
to graphically display the corration matrix
\[ cor(\mathcal{X}, \mathcal{Y}) = \left(\begin{align}\Sigma_{XX} & \Sigma_{XY}\\\Sigma_{YX} & \Sigma_{YY}\end{align}\right)\].
library("CCA")
correl <- matcor(X, Y )
img.matcor(correl, type = 2)

Next we can directly use cancor()
function to obtain canonical correlations for the datasets \(\mathcal{X}\) and \(\mathcal{Y}\). Another option is to use function cc()
(from the R package ‘CCA’). Note, the the results are equivalent although not identical.
cc1 <- cancor(X, Y) ### function from standard R instalation
cc2 <- cc(X, Y) ### function for the R package 'CCA'
The canonical correlations are the same for both approaches (R commands used)
cc1$cor ### function from standard R instalation
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277
cc2$cor ### function for the R package 'CCA'
## [1] 0.87496109 0.70496586 0.64579693 0.48750752 0.24205299 0.11563830
## [7] 0.04481277
which can be also visualized by an appropriate picture (e.g.):
par(mfrow = c(1,2))
barplot(cc1$cor, main = "Canonical correlations for 'cancor()'", col = "gray")
barplot(cc1$cor, main = "Canonical correlations for 'cancor()'", col = "gray")

However, the actual estimated linear combinations of the original covariates in both datasets \(\mathcal{X}\) and \(\mathcal{Y}\) are different. Compare the following:
cc1$xcoef ### function from standard R instalation
## [,1] [,2] [,3] [,4] [,5]
## SaprInd 0.162796345 0.089714234 -0.161737589 -0.491927540 -0.186875530
## Lital -0.003800701 0.001797172 -0.030965801 -0.007214761 0.017517637
## RETI 0.324209881 -0.055178843 0.693084411 -1.283570658 -2.507634813
## EPTAbu -0.002149837 -0.005077822 -0.009027554 0.008345945 0.005867073
## Marg -0.011912887 -0.037901448 -0.009666364 -0.065884327 -0.025639954
## Metaritr -0.012221662 0.001551030 0.043003790 0.012178243 -0.003013795
## JepAbu 0.005168862 0.006308244 0.004048909 -0.009547646 -0.019794148
## Epiritral 0.010432924 0.023391927 -0.013976612 -0.021306313 0.018722542
## [,6] [,7] [,8]
## SaprInd -0.098864028 0.06897063 -0.355850803
## Lital 0.004465308 -0.01327305 0.008943729
## RETI 3.047758363 1.14371548 0.285767761
## EPTAbu -0.005401486 0.01277878 -0.026098676
## Marg -0.029969261 0.01102501 -0.002459819
## Metaritr -0.027620185 -0.04963561 -0.056954743
## JepAbu -0.005346595 -0.00253416 0.030139906
## Epiritral -0.022760565 0.04740075 0.064087475
cc2$xcoef ### function for the R package 'CCA'
## [,1] [,2] [,3] [,4] [,5]
## SaprInd -1.29215593 -0.71208465 -1.28375231 -3.90455380 -1.48327854
## Lital 0.03016713 -0.01426461 -0.24578343 -0.05726539 0.13904193
## RETI -2.57333615 0.43796849 5.50118696 -10.18802625 -19.90373428
## EPTAbu 0.01706380 0.04030396 -0.07165399 0.06624389 0.04656845
## Marg 0.09455561 0.30083342 -0.07672439 -0.52294063 -0.20351082
## Metaritr 0.09700644 -0.01231092 0.34133200 0.09666181 -0.02392126
## JepAbu -0.04102657 -0.05007014 0.03213722 -0.07578209 -0.15711118
## Epiritral -0.08280877 -0.18566767 -0.11093592 -0.16911362 0.14860557
## [,6] [,7]
## SaprInd -0.78470889 0.54743742
## Lital 0.03544229 -0.10535158
## RETI 24.19083206 9.07796019
## EPTAbu -0.04287297 0.10142843
## Marg -0.23787363 0.08750829
## Metaritr -0.21922842 -0.39397046
## JepAbu -0.04243728 -0.02011427
## Epiritral -0.18065638 0.37623175
cc1$ycoef
## [,1] [,2] [,3] [,4] [,5]
## Tepl_max 0.002728344 -0.031587883 -0.022171030 0.015570969 -0.026678176
## X.O2 0.001145406 -0.004145546 0.005890339 -0.012452124 -0.009542703
## BSK5 0.015042715 0.054146231 -0.027996459 0.124516411 -0.026809781
## Kond 0.002545019 0.001425917 -0.004281998 0.001343352 0.001807199
## N.NH4 0.307216652 0.432469799 1.131659010 0.087843642 -0.818023849
## N.NO3 -0.010468525 -0.030934549 0.053453456 0.031622860 0.022842863
## Pcelk 0.412696634 -0.979367658 0.196048060 -3.071986907 0.807813337
## [,6] [,7]
## Tepl_max 0.024294508 0.0072354037
## X.O2 -0.021080517 -0.0001162876
## BSK5 -0.122653042 -0.1252271666
## Kond -0.003536074 0.0038599055
## N.NH4 0.123847014 0.5657683896
## N.NO3 -0.011571285 0.0045994691
## Pcelk 1.447735216 -1.1452167044
cc2$ycoef
## [,1] [,2] [,3] [,4] [,5]
## Tepl_max -0.021655562 0.25072104 -0.17597710 0.12359074 -0.21175146
## X.O2 -0.009091381 0.03290426 0.04675311 -0.09883567 -0.07574285
## BSK5 -0.119397849 -0.42977239 -0.22221500 0.98831837 -0.21279604
## Kond -0.020200461 -0.01131786 -0.03398730 0.01066252 0.01434420
## N.NH4 -2.438456583 -3.43262261 8.98226493 0.69723729 -6.49286301
## N.NO3 0.083091342 0.24553537 0.42427365 0.25099867 0.18130960
## Pcelk -3.275677981 7.77348979 1.55608323 -24.38314016 6.41181959
## [,6] [,7]
## Tepl_max 0.19283168 0.0574292361
## X.O2 -0.16732142 -0.0009230045
## BSK5 -0.97352834 -0.9939598209
## Kond -0.02806672 0.0306370503
## N.NH4 0.98300520 4.4906473759
## N.NO3 -0.09184423 0.0365071542
## Pcelk 11.49104203 -9.0898757913
Using the ‘CCA’ package one can also take an advantage of nice graphical tools which are available for the canonical correlation analysis. The idea is to display maximized correlations between transformed variables of the dataset \(\mathcal{X}\) and the dataset \(\mathcal{Y}\).
plt.cc(cc2, var.label = TRUE, ind.names = data[,1])

Questions
-
How would you interpret the left hand side of the figure?
-
What is the relationship between the subfigure on the left hand side and the subfigure on the right hand side?
-
What would be your conclussion about the two available dataset and their mutual relationship?
A usefull extension to the classical canonical correlation approach is the regularized version - so called regularized canonical correlation analysis. It is mean for scenarios where the number of parameters \(p \in \mathbb{N}\) is greater than the sample size \(n \in \mathbb{n}\) (usually \(p \gg n\)). There is R function rcc()
available in the R package ‘CCA’.
Another option how to perform canonical correlation in R is to install an additional R package - package ‘vegan’ (use install.packages("vegan")
for instalation) and to apply function cca()
. Use the help session in R for more details about the function itself or the corresponding package respectively.
library(vegan)
cc3 <- cca(X, Y)
And the corresponding graphical summary given by plot()
function.
plot(cc3, scaling = 1)

To Do
-
Use this markdown and consider (maybe) the whole dataset on biologial measurements. Compare how much the results changes.
-
Consider a lower dimensional profile and try to refit the whole canonical correlation analysis on a lower dimensional datasets (e.g. two covariates for bio data and two covariates for chemical data).
-
Try to manually reconstruct the results from the above by using the correponding canonical correlation theory form the lecture.