Matrix factorization and dimension reduction

2019/09/22

We will consider here a broad class of methods used to study variation in multivariate data. Most of these methods work by performing some sort of dimension reduction using a matrix factorization. The canonical method of this type is Principal Components Analysis (PCA), but there are many other useful techniques.

Principal Components Analysis

PCA is used to understand how vector-valued (multivariate) data vary with respect to their mean. Suppose we observe a collection of vectors $x_i \in {\cal R}^d$, where $i=1,\ldots, n$. The most basic form of PCA is the single-factor model:

$$ x_i \approx \bar{x} + \lambda_i \eta. $$

Here, $\bar{x} \in {\cal R}^d$ is the centroid of the observed data, $\eta \in {\cal R}^d$ is a vector of Principal Component loadings, and the $\lambda_i$ are Principal Component scores. We can also write this relationship explicitly in terms of the deviations from the mean:

$$ x_i - \bar{x} \approx \lambda_i \eta. $$

This model represents the deviations from the mean as all lying along a common direction vector $\eta$. This is a very strong model that won’t describe most data sets very well, so we will generalize it below. The score $\lambda_i$ defines how far in the direction $\eta$ a data value $x_i$ is located. An observation with $\lambda_i \approx 0$ is close to the centroid, while values with $\lambda_i = \pm c$ are displaced by distance $c$ in opposite directions from the centroid.
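As a concrete illustration, the one-component fit can be obtained from the singular value decomposition of the centered data matrix. The sketch below uses numpy on simulated data; the data matrix `X`, its dimensions, and the random data-generating step are assumptions made purely for illustration.

```python
import numpy as np

# Simulated data purely for illustration; in practice X would be the observed
# n x d data matrix.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d)) + 3.0

xbar = X.mean(axis=0)                       # centroid, xbar
U, S, Vt = np.linalg.svd(X - xbar, full_matrices=False)

eta = Vt[0]                                 # unit-length loading vector eta
lam = U[:, 0] * S[0]                        # scores lambda_i (mean ~ 0)

Xhat = xbar + np.outer(lam, eta)            # one-component approximation of x_i
```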

In order for $\bar{x}$ to actually be the mean of the data in this representation, we need the average score $\bar{\lambda} = n^{-1}\sum_i \lambda_i$ to be equal to zero. Thus, there is always some symmetry, in the sense that some observations will be displaced from the mean in the direction $\eta$ and other observations will be displaced from the mean in the direction $-\eta$, with the signed displacements averaging out to zero.

Note that since we can multiply $\eta$ by any constant $c$ and then multiply all the $\lambda_i$ by $1/c$ without changing the representation, the length of $\eta$ is not identified, and therefore without loss of generality is usually taken to be $1$. Similarly, we can multiply $\eta$ and all the $\lambda_i$ by $-1$ without changing the representation. There is no standard way to resolve this ambiguity, so we must accept that different software packages will report either $\eta$ or $-\eta$ as the principal component loading vector (and the scores will be reflected accordingly).

The above simplified form of PCA only has one component. We can generalize this representation as follows:

$$ x_i - \bar{x} \approx \lambda_i^{(1)}\eta^{(1)} + \cdots + \lambda_i^{(q)}\eta^{(q)} $$

This is a $q$-component decomposition of the data. In order to make this representation unique, we usually require the loading vectors to be orthonormal in the Euclidean sense, $\langle \eta^{(j)}, \eta^{(k)}\rangle = \delta_{jk}$, and orthogonal (but not normalized) with respect to $C = {\rm cov}(X)$, where $C$ is the $d\times d$ covariance matrix of the variables. This second type of orthogonality can be expressed explicitly as $\eta^{(j)\prime} \cdot C \cdot \eta^{(k)} \propto \delta_{jk}$. We also require $\lambda^{(k)}_1 + \cdots + \lambda^{(k)}_n = 0$ for each $k$, and ${\rm var}(\lambda_1^{(k)}, \ldots, \lambda_n^{(k)})$ to be non-increasing in $k$. This last condition implies that each successive component $\eta^{(k)}$ explains no more variation in the $x_i$ than the preceding one.
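The $q$-component decomposition and its constraints can also be computed from the SVD of the centered data. The sketch below, on simulated data, checks the Euclidean orthonormality of the loadings, their orthogonality with respect to $C$, and the mean-zero and non-increasing-variance properties of the scores; the helper function and data are illustrative assumptions.

```python
import numpy as np

def pca_components(X, q):
    # Return the centroid, the first q loading vectors, and the scores.
    xbar = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - xbar, full_matrices=False)
    loadings = Vt[:q]                  # rows are eta^(1), ..., eta^(q)
    scores = U[:, :q] * S[:q]          # columns are lambda^(1), ..., lambda^(q)
    return xbar, loadings, scores

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
xbar, loadings, scores = pca_components(X, q=3)

# Euclidean orthonormality of the loadings: close to the 3 x 3 identity.
print(loadings @ loadings.T)

# Orthogonality with respect to C = cov(X): off-diagonal entries near zero.
C = np.cov(X, rowvar=False)
print(loadings @ C @ loadings.T)

# Each score vector sums to ~0, and the score variances are non-increasing.
print(scores.sum(axis=0))
print(scores.var(axis=0))
```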

As noted above, the goal of PCA is to help us understand the variation in a set of data around their mean value. It is a somewhat open-ended and exploratory procedure in the sense that it often needs to be adapted somewhat to be useful in a particular setting. For example, just like in a regression analysis, we have the option to do any type of transformation or basis expansion of the data before conducting the PCA.
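As one possible illustration of pre-processing, the sketch below log-transforms skewed positive data and standardizes the columns before the PCA; both choices are assumptions made here for illustration, not prescriptions.

```python
import numpy as np

# Simulated positive, right-skewed data; the log transform and column
# standardization below are just two common pre-processing choices.
rng = np.random.default_rng(2)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 4))

Xlog = np.log(X)                                    # variance-stabilizing transform
Z = (Xlog - Xlog.mean(axis=0)) / Xlog.std(axis=0)   # standardize the columns

# PCA of the transformed data (equivalent to a PCA based on the correlation
# matrix of Xlog).
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
loadings = Vt
scores = U * S
```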

Another challenge is interpreting and communicating the findings of a PCA. There are many approaches that can be pursued to accomplish this.

Canonical Correlation Analysis (CCA)

Canonical Correlation analysis is a way to understand a collection of pairs of observed vectors, $x_i\in {\cal R}^p$ and $y_i\in{\cal R}^q$. It is very important to emphasize that $x_i$ and $y_i$ are paired through the index $i$. The focus in CCA is on the relationship between each $x_i$ and its paired $y_i$, rather than variation among the $x_i$ or among the $y_i$.

The “dominant layer” of a CCA is defined through two unit vectors $\theta\in{\cal R}^p$ and $\eta\in{\cal R}^q$ that maximize ${\rm Cor}(\theta^\prime x_i, \eta^\prime y_i)$, with the correlation taken over $i$. That is, we are seeking linear dimension reductions of $x$ and $y$ that are maximally correlated.

CCA considers $x_i$ and $y_i$ symmetrically, which makes it somewhat different from regression approaches that look at the conditional distribution of one variable given another. The vectors $x_i$ and $y_i$ can have different dimensions (i.e. $p\ne q$ is allowed). If both are 1-dimensional, CCA is equivalent to the usual Pearson correlation coefficient.
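A minimal sketch of the dominant layer is shown below, using scikit-learn's CCA on simulated paired data with $p=3$ and $q=2$; the data-generating model and the choice of library are assumptions made for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Simulated paired data: x_i in R^3 and y_i in R^2 share a common signal z_i.
rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=n)
X = np.outer(z, [1.0, -0.5, 0.2]) + rng.normal(size=(n, 3))
Y = np.outer(z, [0.7, 1.0]) + rng.normal(size=(n, 2))

cca = CCA(n_components=1)
xs, ys = cca.fit_transform(X, Y)          # theta'x_i and eta'y_i

# The dominant canonical correlation, Cor(theta'x, eta'y) taken over i.
print(np.corrcoef(xs[:, 0], ys[:, 0])[0, 1])
```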

The dominant layer of CCA is defined as stated above; subsequent layers are defined sequentially. Writing the dominant layer as $\theta^{(1)}$, $\eta^{(1)}$, the second layer $\theta^{(2)}$, $\eta^{(2)}$ is defined to maximize ${\rm Cor}(\theta^{(2)\prime} x_i, \eta^{(2)\prime} y_i)$, subject to the constraints $\theta^{(2)\prime}\cdot C_x\cdot\theta^{(1)} = 0$ and $\eta^{(2)\prime}\cdot C_y\cdot \eta^{(1)} = 0$, where $C_x = {\rm Cov}(x)$ and $C_y = {\rm Cov}(y)$. Subsequent layers are defined analogously.

CCA can be seen as identifying linear summary statistics (dimension reductions) of the two data vectors ($x$ and $y$) that are maximally correlated. The orthogonality constraints guarantee that two distinct summary statistics are uncorrelated, i.e. ${\rm Cor}(\theta^{(j)\prime} x, \theta^{(k)\prime} x) = 0$ when $j \ne k$. They also guarantee that all correlation between $x$ and $y$ is “within layers”, in that ${\rm Cor}(\theta^{(j)\prime}x, \eta^{(k)\prime}y) = 0$ if $j \ne k$.
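The “within layers” structure can be checked empirically with a two-component fit, as in the sketch below; the two-signal data-generating model is an assumption, and how closely the cross-layer correlations approach zero depends on the estimation details of the particular implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Simulated paired data with two shared signals (an assumption for this sketch).
rng = np.random.default_rng(4)
n = 500
z = rng.normal(size=(n, 2))
X = z @ rng.normal(size=(2, 4)) + rng.normal(size=(n, 4))
Y = z @ rng.normal(size=(2, 3)) + rng.normal(size=(n, 3))

xs, ys = CCA(n_components=2).fit_transform(X, Y)

# Within-layer correlations (j == k) should be substantial; cross-layer
# correlations (j != k) should be near zero, up to estimation error.
for j in range(2):
    for k in range(2):
        print(j, k, round(np.corrcoef(xs[:, j], ys[:, k])[0, 1], 3))
```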