Talk Title:

Correlation screening in high dimension

Speaker: Al Hero
UM Department of EECS and Department of Statistics

Abstract

Discovery of variables having high sample correlations in a multivariate sample arises in many applications of machine learning, such as data mining, sensor networks, financial time series, and bioinformatics. However, to date there has been little mathematical theory to guide the analyst in quantifying the statistical reliability of putative discoveries. Is an observed high sample correlation due to chance alone, and what is the statistical significance of such a discovery? In this talk we present a comprehensive theory for answering these questions. In particular, we show that correlation screening suffers from a phase transition phenomenon: as the correlation threshold decreases, the number of discoveries increases abruptly. We obtain asymptotic expressions for the mean number of discoveries and for the phase transition thresholds as functions of the number of samples, the number of variables, and the joint sample distribution. Interestingly, the phase transition threshold is determined by a local measure of mutual information between pairs of variables, defined by a cross-sectional Bhattacharyya affinity.

We also show that, under a weak dependency condition, the number of discoveries is dominated by a Poisson random variable, leading to an asymptotic expression for the false positive rate. The correlation screening approach pays substantial dividends in the type and strength of the asymptotic results that can be obtained. It also overcomes some of the major hurdles faced by existing methods in the literature, as correlation screening is immediately scalable. Numerical results strongly validate the theoretical results. We illustrate the application of the correlation screening methodology on a large-scale gene-expression dataset.
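
As a rough illustration of the screening step and of how the discovery count grows as the threshold is lowered, the following Python sketch may help. It is not the speakers' implementation. It assumes that a "discovery" is a variable whose largest absolute sample correlation with any other variable meets the threshold, and it uses independent Gaussian data, so every discovery here is a false positive. The function name count_discoveries and the values of n, p and the thresholds are hypothetical choices made for the example.

```python
import numpy as np


def count_discoveries(X, rho):
    """Count variables whose largest absolute sample correlation
    with any other variable is at least rho.

    X has shape (n_samples, n_variables). The definition of a
    "discovery" used here is an assumption made for illustration.
    """
    R = np.corrcoef(X, rowvar=False)  # p x p sample correlation matrix
    np.fill_diagonal(R, 0.0)          # ignore each variable's self-correlation
    return int(np.sum(np.max(np.abs(R), axis=1) >= rho))


# Hypothetical setting: few samples, many mutually independent variables.
rng = np.random.default_rng(0)
n, p = 20, 500
X = rng.standard_normal((n, p))

thresholds = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60]
counts = [count_discoveries(X, rho) for rho in thresholds]

for rho, c in zip(thresholds, counts):
    print(f"threshold {rho:.2f}: {c} of {p} variables flagged")
```

With few samples and many variables, the printed counts typically rise from near zero to most of the variables over a narrow range of thresholds, which is the qualitative behavior the abstract describes as a phase transition.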

This is joint work with Bala Rajaratnam at Stanford University.