We will consider here a broad class of methods used to study variation in multivariate data. Most of these methods work by performing some sort of dimension reduction using a matrix factorization. The canonical method of this type is Principal Components Analysis (PCA), but there are many other useful techniques.
Principal Components Analysis
PCA is used to understand how vector-valued (multivariate) data vary with respect to their mean. Suppose we observe a collection of vectors $x_i \in {\cal R}^d$, where $i=1,\ldots, n$. The most basic form of PCA is the single-factor model:

$$ x_i \approx \bar{x} + \lambda_i \eta. $$

Here, $\bar{x} \in {\cal R}^d$ is the centroid of the observed data, $\eta \in {\cal R}^d$ is a vector of Principal Component loadings, and the $\lambda_i$ are Principal Component scores. We can also write this relationship explicitly in terms of the deviations from the mean:

$$ x_i - \bar{x} \approx \lambda_i \eta. $$
This model represents the deviations from the mean as all lying along a common direction vector $\eta$. This is a very strong model that won’t describe most data sets very well, so we will generalize it below. The score $\lambda_i$ defines how far in the direction $\eta$ a data value $x_i$ is located. An observation with $\lambda_i \approx 0$ is close to the centroid, while values with $\lambda_i = \pm c$ are displaced by distance $c$ in opposite directions from the centroid.
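For a concrete illustration with made-up numbers, take $d=2$, $\bar{x} = (1, 1)^\prime$, and $\eta = (0.6, 0.8)^\prime$; a score of $\lambda_i = 2$ places $x_i$ at approximately $\bar{x} + 2\eta = (2.2, 2.6)^\prime$, while $\lambda_i = -2$ places it at $(-0.2, -0.6)^\prime$, the same distance on the opposite side of the centroid.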
In order for $\bar{x}$ to actually be the mean of the data in this representation, we need the mean of the scores, $\bar{\lambda} = n^{-1}\sum_i \lambda_i$, to be equal to zero. Thus, there is always some symmetry in the sense that some observations will be displaced from the mean in the direction $\eta$ and other observations will be displaced from the mean in the direction $-\eta$, with the signed displacements averaging out to zero.
Note that since we can multiply $\eta$ by any constant $c$ and then multiply all the $\lambda_i$ by $1/c$ without changing the representation, the length of $\eta$ is not identified, and therefore without loss of generality is usually taken to be $1$. Similarly, we can multiply $\eta$ and all the $\lambda_i$ by $-1$ without changing the representation. There is no standard way to resolve this ambiguity, so we must accept that different software packages will report either $\eta$ or $-\eta$ as the principal component loading vector (and the scores will be reflected accordingly).
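As a concrete illustration, here is a minimal sketch (Python/NumPy, on simulated toy data, with illustrative variable names) of the single-component fit: $\eta$ is taken as the leading eigenvector of the sample covariance matrix, and each score $\lambda_i$ is the projection of the centered observation onto $\eta$.

```python
import numpy as np

# Minimal sketch of the single-component model on simulated data (toy example).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.5])   # n=200 observations in R^3

xbar = X.mean(axis=0)                       # centroid of the data
C = np.cov(X, rowvar=False)                 # d x d sample covariance matrix
evals, evecs = np.linalg.eigh(C)            # eigenvalues in increasing order
eta = evecs[:, -1]                          # leading eigenvector: unit-length loading vector
scores = (X - xbar) @ eta                   # lambda_i: projections onto eta

# Rank-one reconstruction x_i ~ xbar + lambda_i * eta; the sign of eta
# (and hence of the scores) is arbitrary, as discussed above.
X_hat = xbar + np.outer(scores, eta)
print(scores.mean())                        # approximately zero
```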
The above simplified form of PCA only has one component. We can generalize this representation as follows:

$$ x_i - \bar{x} \approx \lambda_i^{(1)}\eta^{(1)} + \cdots + \lambda_i^{(q)}\eta^{(q)}. $$

This is a $q$-component decomposition of the data. In order to make this representation unique, we usually require the loading vectors to be orthonormal in the Euclidean sense, $\langle \eta^{(j)}, \eta^{(k)}\rangle = \delta_{jk}$, and orthogonal (but not normalized) with respect to $C = {\rm cov}(X)$, where $C$ is the $d\times d$ covariance matrix of the variables. This second type of orthogonality can be expressed explicitly as $\eta^{(j)T} \cdot C \cdot \eta^{(k)} \propto \delta_{jk}$. We also require $\lambda^{(k)}_1 + \cdots + \lambda^{(k)}_n = 0$ for each $k$, and ${\rm var}(\lambda_1^{(k)}, \ldots, \lambda_n^{(k)})$ to be non-increasing in $k$. This last condition implies that each successive component $\eta^{(k)}$ explains no more variation in the $x_i$ than the preceding one.
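The following sketch (again Python/NumPy on simulated data, with illustrative names) computes a $q$-component decomposition from the eigendecomposition of the sample covariance matrix and checks the conditions stated above.

```python
import numpy as np

# Sketch of a q-component PCA decomposition on simulated data (toy example).
rng = np.random.default_rng(0)
n, d, q = 300, 5, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))   # correlated toy data

xbar = X.mean(axis=0)
C = np.cov(X, rowvar=False)                 # d x d covariance matrix
evals, evecs = np.linalg.eigh(C)
order = np.argsort(evals)[::-1]             # sort components by decreasing eigenvalue
eta = evecs[:, order[:q]]                   # columns are eta^(1), ..., eta^(q)
scores = (X - xbar) @ eta                   # column k holds the scores lambda_i^(k)

print(np.round(eta.T @ eta, 6))             # identity matrix: Euclidean orthonormality
print(np.round(eta.T @ C @ eta, 6))         # diagonal matrix: orthogonality w.r.t. C
print(np.round(scores.sum(axis=0), 6))      # zero column sums
print(scores.var(axis=0))                   # non-increasing score variances
```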
As noted above, the goal of PCA is to help us understand the variation in a set of data around their mean value. It is an open-ended and exploratory procedure, in the sense that it often needs to be adapted to be useful in a particular setting. For example, just like in a regression analysis, we have the option to do any type of transformation or basis expansion of the data before conducting the PCA.
Another challenge is interpreting and communicating the findings of a PCA. There are many approaches that can be pursued to accomplish this, including:
- Interpreting the components of the component loading vectors $\eta^{(j)}$: if $|\eta^{(j)}_k|$ is large, then the $k^{\rm th}$ variable is a major contributor to the variation captured by component $j$. Two elements of $\eta^{(j)}$ having the same sign tend to move in unison according to the variation captured by the $j^{\rm th}$ component. Two elements of $\eta^{(j)}$ having opposite signs tend to move in opposite directions according to the variation captured by the $j^{\rm th}$ component.
- Interpreting the scores $\lambda^{(j)}_i$: as noted above, these values show how $x_i$ is displaced from the centroid in the direction $\eta^{(j)}$. The scores may be plotted against other data to better understand their meaning.
- Interpreting the eigenvalues and effective dimensionality: a PCA is conducted using an eigen-decomposition of a covariance matrix. If these eigenvalues are well-separated, it may be possible to argue that most of the variation in the $x_i$ lies in a low-dimensional subspace.
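As a sketch of the last point (Python/NumPy, simulated data, illustrative names), the eigenvalues can be converted to proportions of variance explained; if the first few proportions account for most of the total, the data are effectively low-dimensional.

```python
import numpy as np

# Sketch: eigenvalues of the covariance matrix as proportions of variance explained.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6)) @ np.diag([4.0, 2.0, 1.0, 0.5, 0.25, 0.1])

C = np.cov(X, rowvar=False)
evals = np.sort(np.linalg.eigvalsh(C))[::-1]    # eigenvalues, largest first
pve = evals / evals.sum()                       # proportion of variance per component
print(np.round(pve, 3))                         # individual proportions
print(np.round(np.cumsum(pve), 3))              # cumulative proportions
```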
Canonical Correlation Analysis (CCA)
Canonical Correlation Analysis is a way to understand a collection of pairs of observed vectors, $x_i\in {\cal R}^p$ and $y_i\in{\cal R}^q$. It is very important to emphasize that $x_i$ and $y_i$ are paired through the index $i$. The focus in CCA is on the relationship between each $x_i$ and its paired $y_i$, rather than variation among the $x_i$ or among the $y_i$.

The “dominant layer” of a CCA is defined through two unit vectors $\theta\in{\cal R}^p$ and $\eta\in{\cal R}^q$ that maximize ${\rm Cor}(\theta^\prime x_i, \eta^\prime y_i)$, with the correlation taken over $i$. That is, we are seeking linear dimension reductions of $x$ and $y$ that are maximally correlated.

CCA considers $x_i$ and $y_i$ symmetrically, which makes it somewhat different from regression approaches that look at the conditional distribution of one variable given another. The vectors $x_i$ and $y_i$ can have different dimensions (i.e. $p\ne q$ is allowed). If both are 1-dimensional, CCA is equivalent to the usual Pearson correlation coefficient.
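Here is a minimal sketch of how the dominant layer can be computed (Python/NumPy, simulated paired data, illustrative names): after whitening each block by a Cholesky factor of its sample covariance, the leading singular vectors of the whitened cross-covariance give the weights $\theta$ and $\eta$, and the leading singular value is the maximized correlation. The weights below are scaled so the variates have unit sample variance; rescaling them does not change the correlation.

```python
import numpy as np

# Sketch: dominant CCA layer via whitening and an SVD (simulated paired data).
rng = np.random.default_rng(0)
n, p, q = 500, 4, 3
z = rng.normal(size=(n, 1))                      # shared signal induces x/y correlation
X = z + rng.normal(size=(n, p))
Y = z + rng.normal(size=(n, q))
X, Y = X - X.mean(0), Y - Y.mean(0)              # center both blocks

Sx, Sy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
Lx, Ly = np.linalg.cholesky(Sx), np.linalg.cholesky(Sy)

M = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T   # whitened cross-covariance
U, s, Vt = np.linalg.svd(M)

theta = np.linalg.solve(Lx.T, U[:, 0])           # x-side weights for the dominant layer
eta = np.linalg.solve(Ly.T, Vt[0, :])            # y-side weights for the dominant layer
print(s[0], np.corrcoef(X @ theta, Y @ eta)[0, 1])   # both equal the first canonical correlation
```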
The dominant layer of CCA is defined as stated above; subsequent layers are defined sequentially. Writing the dominant layer as $\theta^{(1)}$, $\eta^{(1)}$, the second layer $\theta^{(2)}$, $\eta^{(2)}$ is defined to maximize ${\rm Cor}(\theta^{(2)\prime} x_i, \eta^{(2)\prime} y_i)$, subject to the constraints $\theta^{(2)\prime}\cdot C_x\cdot\theta^{(1)} = 0$ and $\eta^{(2)\prime}\cdot C_y\cdot \eta^{(1)} = 0$, where $C_x = {\rm Cov}(x)$ and $C_y = {\rm Cov}(y)$. Subsequent layers are defined analogously.
CCA can be seen as identifying linear summary statistics (dimension reductions) of the two data vectors ($x$ and $y$) that are maximally correlated. The orthogonality constraints guarantee that two distinct summary statistics are uncorrelated, i.e. ${\rm Cor}(\theta^{(j)\prime} x, \theta^{(k)\prime} x) = 0$ when $j \ne k$. They also guarantee that all correlation between $x$ and $y$ is “within layers”, in that ${\rm Cor}(\theta^{(j)\prime}x, \eta^{(k)\prime}y) = 0$ if $j \ne k$.
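Continuing the earlier sketch (Python/NumPy, simulated data, illustrative names), the full set of layers comes from all of the singular vectors of the whitened cross-covariance; empirically, the correlation matrix between the $x$-side and $y$-side variates is then approximately diagonal, with the canonical correlations on the diagonal.

```python
import numpy as np

# Sketch: all CCA layers, checking that correlation is concentrated "within layers".
rng = np.random.default_rng(1)
n, p, q = 1000, 4, 3
z = rng.normal(size=(n, 2))                      # two shared signals
X = z @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))
Y = z @ rng.normal(size=(2, q)) + rng.normal(size=(n, q))
X, Y = X - X.mean(0), Y - Y.mean(0)

Sx, Sy, Sxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
Lx, Ly = np.linalg.cholesky(Sx), np.linalg.cholesky(Sy)
U, s, Vt = np.linalg.svd(np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T)

k = min(p, q)
Theta = np.linalg.solve(Lx.T, U[:, :k])          # columns are theta^(1), ..., theta^(k)
Eta = np.linalg.solve(Ly.T, Vt[:k, :].T)         # columns are eta^(1), ..., eta^(k)

# Cross-correlations between x-variates and y-variates: off-diagonal entries
# are approximately zero, diagonal entries match the canonical correlations s.
R = np.corrcoef(np.c_[X @ Theta, Y @ Eta], rowvar=False)[:k, k:]
print(np.round(R, 3))
print(np.round(s[:k], 3))
```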