Conference on Statistical Learning and Data Mining
Rackham Graduate School, University of Michigan, Ann Arbor, MI
June 5-7, 2012

June 5 (Tuesday)

8:30-10:00am Parallel Sessions

East Room: Flexible regression and clustering (organized by Yufeng Liu and chaired by Lingsong Zhang)

Jia Li (Pennsylvania State University and NSF)
D2-clustering and hypothetical local mapping

In this talk, I will present a novel clustering algorithm called D2-Clustering for objects each represented by a discrete distribution with finite but non-fixed support. This mathematical representation often arises in multimedia content analysis and can be viewed as a sparse representation of histogram-type (a.k.a. frequency-type) feature vectors. The Kantorovich-Wasserstein metric is used as the distance between objects, which differs profoundly from the Euclidean distance based on frequencies. The D2-Clustering algorithm groups objects by minimizing the total within-cluster squared distances, a criterion in the same spirit as k-means but applied to a significantly more complicated data type. The computational obstacle is overcome by a double-loop iterative linear programming algorithm. Inspired by manifold learning, I also propose Hypothetical Local Mapping (HLM) to construct a mixture density function for the generic metric space. The combination of D2-Clustering and HLM is applied to real-time image annotation.

Guang Cheng (Purdue University)
Joint asymptotics and inferences for semi-nonparametric models

In this talk, we consider joint asymptotics and inferences for semi-nonparametric models in which the Euclidean parameter and an infinite-dimensional parameter are both of interest. To illustrate our idea, we consider the generalized partly linear framework. Specifically, we derive the joint limit distribution for the two parameters, which are rescaled according to their different convergence rates. The marginal limit distribution/semiparametric efficiency for the Euclidean estimate coincides with that derived in the semiparametric literature. To construct the joint confidence region/band, we propose a likelihood ratio testing approach that effectively avoids estimating the asymptotic covariance. The employed regularization tool is the smoothing spline, and undersmoothing of the smoothing spline estimate is required to obtain valid joint inferences. The key technical tool is a concentration inequality. A by-product is the marginal asymptotics for the infinite-dimensional parameter, which are new even in the nonparametric literature.

Annie Qu (University of Illinois at Urbana-Champaign)
Variable selection in high-dimensional varying-coefficient models with global optimality

The varying coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially in high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting, adopt difference convex programming to approximate the L0 penalty, and investigate the global optimality properties of the varying coefficient estimator. The challenge of variable selection here is that the nonparametric components of the varying coefficient model are infinite-dimensional, on top of the high-dimensional linear covariates. We show that the proposed varying coefficient estimator is consistent, enjoys the oracle property, and achieves an optimal convergence rate for the nonzero nonparametric components in high-dimensional data.
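One standard way to carry out such a difference convex (DC) approximation of the L0 penalty, offered as an editorial illustration rather than the talk's exact construction, is the truncated L1 penalty, which splits into a difference of two convex functions:

$$\min\!\left(\frac{|\beta_j|}{\tau},\,1\right) \;=\; \frac{|\beta_j|}{\tau}\;-\;\max\!\left(\frac{|\beta_j|}{\tau}-1,\,0\right).$$

Each DC iteration then replaces the subtracted convex term by its affine minorant, leaving a convex, weighted-lasso-type subproblem.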
Our simulations and numerical examples indicate that the difference convex algorithm, implemented via coordinate descent, is efficient and selects the true model more frequently than the LASSO, the adaptive LASSO, and the SCAD approaches.

West Room: Inference in networks and multivariate analysis (organized and chaired by Guilherme Rocha)

Karl Rohe (University of Wisconsin, Madison)
Characterizing the asymmetric flow of information in directed networks

Relationships in social networks are often asymmetric and can represent the flow of information from one person to another; email messages are one example. Using the directed stochastic blockmodel, this talk will (i) present a way to characterize the asymmetric flow of information in a directed network and (ii) demonstrate how this can identify "bottleneck nodes" in a citation network.

Michael Trosset (Indiana University)
Inference on random graphs with classified edge attributes

We test simple hypotheses about a random graph with a fixed set of vertices and random edges, each of which possesses one of K mutually exclusive attributes. The edge attributes are inferred by means of a fallible classifier. Suppose that E and F are the confusion matrices of two such classifiers. Using results from statistical decision theory, we demonstrate that, if there exists a K x K stochastic matrix R such that ER = F, then most powerful (MP) tests based on E are necessarily more powerful than MP tests based on F. By means of an example, we also demonstrate that entry-wise superiority of E over F does not guarantee that an MP test based on E is more powerful than an MP test based on F.

Vincent Vu (Carnegie Mellon University)
Sparse principal subspaces

Sparse Principal Subspaces (SPS) is an unsupervised statistical technique that simultaneously performs variable selection and dimension reduction for multivariate data. It builds on the main idea of PCA by imposing the following constraint: the principal subspace of variation is spanned mostly by a smaller number of variables. SPS avoids problems such as non-orthogonality and ill-posedness that can plague ad hoc extensions of one-dimensional sparse PCA methods to d dimensions. It does this by estimating an orthonormal basis for the principal subspace all at once, directly constraining the sparsity of the estimated subspace rather than the entries of individual eigenvectors. In this talk I will present theoretical results, including non-asymptotic minimax lower and upper bounds for the estimation of the principal subspace, and show that SPS is nearly rate optimal. In addition, I will discuss algorithmic issues, examples in data visualization, and concrete applications in the analysis of neuroimaging data.

Amphitheatre: Developments in variable screening and selection (organized and chaired by Helen Zhang)

Qing Mai and Hui Zou (University of Minnesota)
The Kolmogorov filter for variable screening in high-dimensional binary classification

Variable screening techniques have been proposed to mitigate the impact of high dimensionality in classification problems, including t-test marginal screening (Fan & Fan, 2008) and maximum marginal likelihood screening (Fan & Song, 2010). However, these methods rely on strong modeling assumptions that are easily violated in real applications. To circumvent these rigid assumptions, we propose a new variable screening technique for binary classification based on the Kolmogorov-Smirnov statistic, hence the name Kolmogorov filter.
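A minimal editorial sketch of this kind of marginal screening, assuming scipy's two-sample Kolmogorov-Smirnov statistic as the marginal utility (not the authors' implementation):

```python
# Sketch: rank features by the two-sample KS statistic between classes.
import numpy as np
from scipy.stats import ks_2samp

def kolmogorov_filter(X, y, d):
    """X: (n, p) feature matrix; y: binary labels in {0, 1};
    d: number of features to retain. Returns indices of the top-d features."""
    ks = np.array([ks_2samp(X[y == 0, j], X[y == 1, j]).statistic
                   for j in range(X.shape[1])])
    return np.argsort(ks)[::-1][:d]
```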
We prove that the Kolmogorov filter enjoys the sure screening property under much weaker model assumptions, and we supplement the theoretical study with simulations. This is joint work with Prof. Hui Zou.

Sijian Wang (University of Wisconsin, Madison)
Group and within-group variable selection via convex penalty

In many scientific and engineering applications, predictors are naturally grouped; for example, in biological applications, assayed genes or proteins can be grouped by biological roles or pathways. When group structures are available among predictors, one is usually interested in identifying both important groups and important variables within the selected groups. Among existing group variable selection methods, some fail to conduct within-group selection; others conduct both group and within-group selection but have non-convex objective functions, which may require extra numerical effort. In this talk, we will present a convex penalty for both group and within-group variable selection. We develop an efficient group-level coordinate descent algorithm for solving the corresponding optimization problem. We also study the non-asymptotic properties of the estimates in the high-dimensional setting, where the number of predictors can be much larger than the sample size. Numerical results indicate that the proposed method works well in terms of both variable selection and prediction accuracy. We also apply the proposed method to the American Cancer Society Breast Cancer Survivor dataset. This is joint work with Zhigeng Geng and Grace Wahba.

Yichao Wu (North Carolina State University)
Two-dimensional solution surface for weighted support vector machines

The support vector machine (SVM) is a popular method for binary classification, and its natural extension is the weighted SVM (WSVM), in which observations from different classes are assigned different weights. Consequently there are two parameters associated with the WSVM: the regularization parameter and the weight parameter (a generic form of the two-parameter objective is sketched below). Although the marginal behavior of the WSVM solution has already been explored, it is largely unknown how the WSVM solution changes with respect to these two parameters jointly. We establish the joint piecewise linearity of the WSVM solution in terms of the regularization parameter and the weight parameter. By taking advantage of this joint piecewise linearity, an efficient algorithm is proposed to obtain the entire two-dimensional solution surface. Our contribution is two-fold. First, we provide a theoretical understanding of how the WSVM solution changes when the two parameters vary. Second, this addresses assorted issues in practical applications, as the WSVM solution becomes readily available for any pair of regularization and weight parameters. We demonstrate its usefulness by applying the proposed two-dimensional solution surface algorithm to tune the regularization parameter for the probability estimation method originally developed by Wang et al. (2008). This is joint work with Seung Jun Shin and Hao Helen Zhang.

10:30-11:30am Plenary Talk (chaired by Xiaotong Shen)

Amphitheatre

Art Owen (Stanford University)
Bootstrapping r-fold tensor data

This work is joint with Dean Eckles, Facebook. The famous Netflix data is a sparsely sampled table with rows for customers and columns for movies (or vice versa). Both customers and movies are naturally modeled as random effects.
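For the weighted SVM abstract above, a generic form of the two-parameter objective (an editorial sketch with assumed notation: weight parameter $\pi$, regularization parameter $\lambda$, and $[u]_+ = \max(u, 0)$) is

$$\min_{f}\;\frac{\lambda}{2}\|f\|^2 \;+\; \pi\sum_{i:\,y_i=+1}\big[1-f(x_i)\big]_+ \;+\; (1-\pi)\sum_{i:\,y_i=-1}\big[1+f(x_i)\big]_+,$$

whose minimizer, per the abstract, is jointly piecewise linear in the pair $(\lambda, \pi)$.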
Bootstrapping such data is problematic: no proper bootstrap can exist, according to a theorem of Peter McCullagh. Resampling rows and columns independently is effective, though slightly conservative. Computerized data gathering frequently produces data sets with three-way or even higher-order data tables, and we present a bootstrap for such tensor-valued data. Our version uses independent weights instead of multinomial ones (a two-way sketch of the weighting scheme appears below) and remains mildly conservative. Poisson weights are close to the original bootstrap, but binary weights have computational and statistical advantages. Under certain conditions a single bootstrap replicate suffices to give a variance estimate. We apply our method to show that Facebook users in the UK make longer comments than those in the US when commenting from a phone; the opposite pattern holds for comments made from a computer.

1:00-2:30pm Parallel Sessions

East Room: Graphical modeling for high-dimensional problems (organized by Daniela Witten and chaired by David Mease)

Hyonho Chun (Purdue University)
A sparse conditional graphical model for non-independent measurements

For the purpose of inferring a network, we consider a sparse Gaussian graphical model (SGGM) in the presence of a population structure, which often occurs in genetic studies with model organisms. In these studies, data are obtained by combining multiple lines of inbred organisms or by using outbred animals. Ignoring such population structures would produce false connections in a graph structure, but most research on graph inference has focused on independent cases. On the other hand, in regression settings, the linear mixed effect model has been widely used to account for correlations among observations. Besides its effectiveness, the linear mixed effect model is general: it can be stated within a penalized least squares framework, which makes it flexible enough for use in settings other than regression. Here, we incorporate a random effect model into an SGGM. Our formulation fits into the recently developed conditional Gaussian graphical model, in which the population structures are modeled as predictors and the graph is determined by a conditional precision matrix. The proposed approach is applied to the network inference problem of two datasets: the heterogeneous stock (HS) and heterogeneous mice diversity panel (HMDP) datasets.

Pradeep Ravikumar (University of Texas at Austin)
On the use of greedy methods for learning discrete and Gaussian graphical models

Undirected graphical models, also known as Markov random fields (MRFs), are widely used in a variety of domains, including biostatistics, natural language processing, and image analysis. They compactly represent distributions over a large number of variables using undirected graphs that encode conditional independence assumptions among the variables. Recovering this underlying graph structure is thus important for many applications of MRFs, especially in constrained settings where the number of variables is large and the samples are limited. In this talk, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting.
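A minimal editorial sketch of the independent-weight bootstrap from the plenary abstract above, for a two-way table (the r-fold tensor case is analogous; the Poisson weights and the mean statistic here are illustrative assumptions):

```python
# Sketch: product-weight bootstrap for sparsely sampled (row, column) data.
import numpy as np

rng = np.random.default_rng(0)

def replicate_mean(values, rows, cols, n_rows, n_cols):
    """One bootstrap replicate of a weighted mean: every row entity and
    column entity draws an independent Poisson(1) weight, and each
    observation is weighted by the product of its entities' weights."""
    w = rng.poisson(1.0, n_rows)[rows] * rng.poisson(1.0, n_cols)[cols]
    return np.sum(w * values) / max(np.sum(w), 1)

# Variance estimate across B replicates:
# np.var([replicate_mean(v, r, c, nr, nc) for _ in range(B)])
```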
Classical approaches to this problem have ranged over various heuristics, greedy methods among them, but a line of recent research has proposed principled convex-regularization-based M-estimators that have been shown to be not only computationally practical but also to enjoy strong statistical guarantees. In this talk, we revisit a classical and simple greedy algorithm that just iteratively adds and deletes edges. Surprisingly, we show that when these forward and backward steps are performed appropriately, we obtain a state-of-the-art method for recovering graphical model structure. Indeed, not only does it enjoy obvious computational advantages over regularized convex optimization approaches, but we also show that it is sparsistent, i.e., consistent in sparsity pattern recovery, under weaker conditions and with a smaller sample complexity. Joint work with Ali Jalali and Christopher Johnson.

Daniela Witten (University of Washington)
Fast graphical model estimation in very high dimensions

The graphical lasso, recently proposed for Gaussian graphical modeling in high dimensions, involves estimating an inverse covariance matrix under a multivariate normal model by maximizing the L1-penalized log likelihood. I will begin by presenting a very simple but previously unknown necessary and sufficient condition that can be used to identify the connected components in the graphical lasso solution (a minimal sketch of this screening step appears below). This condition can be used to achieve massive computational gains: computing the graphical lasso solution with 20,000 features now takes minutes on a standard desktop machine, whereas previously the computations were prohibitive. This opens new doors for rigorous network analysis of high-dimensional biological data. As a specific example, I will discuss estimation of graphical models under distinct biological conditions, in which we expect some, but not all, aspects of the networks to differ between conditions. An extension of the necessary and sufficient condition developed for the graphical lasso allows extremely fast network estimation in this setting. Parts of this work are joint with Jerry Friedman, Noah Simon, Pei Wang, and Patrick Danaher.

West Room: Variable/feature selection via regularization (organized by Ming Yuan and chaired by Peter Qian)

Yang Feng (Columbia University)
A nonparametric optimal decision rule in high dimensional space

A Nonparametric Optimal Decision Rule (NOD) is proposed to tackle the high-dimensional binary classification problem. We assume that the log-likelihood of the class conditional densities can be written as a linear combination of log-likelihoods of marginal class conditional densities. After plugging in estimates of the marginal class conditional densities, solving the classification problem is equivalent to finding the best linear classification rule in a nonparametrically transformed feature space. Oracle inequalities are developed for the classification error. Extensive simulations and real data analyses demonstrate the outstanding performance of NOD.

Yiyuan She (Florida State University)
Joint variable and rank selection for parsimonious estimation of high dimensional matrices

The talk discusses joint variable and rank selection for supervised dimension reduction in predictive learning. When the number of responses and/or predictors exceeds the sample size, one has to consider shrinkage methods for estimation and prediction.
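Returning to the graphical lasso abstract above: a minimal editorial sketch of the connected-component screening, assuming the condition amounts to thresholding the absolute sample covariance at the penalty level:

```python
# Sketch: split a graphical lasso problem into independent blocks by
# finding connected components of the thresholded sample covariance.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def glasso_blocks(S, lam):
    """S: (p, p) sample covariance; lam: L1 penalty level.
    Returns (number of blocks, block label for each feature)."""
    A = (np.abs(S) > lam).astype(np.int8)
    np.fill_diagonal(A, 0)
    return connected_components(csr_matrix(A), directed=False)
```

Each block can then be passed to a standard graphical lasso solver separately, which is the source of the reported computational gains.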
We propose to apply sparsity and reduced-rank techniques jointly to attain simultaneous feature selection and feature extraction. A class of estimators is introduced, based on novel penalties that impose both row and rank restrictions on the coefficient matrix. We show that these estimators adapt to the unknown matrix sparsity. A computational algorithm is developed and applied to real-world problems in machine learning, cognitive neuroscience, and macroeconomic forecasting. This is joint work with Florentina Bunea and Marten Wegkamp.

Yixin Fang (New York University)
Penalized principal components of heritability

In family studies with multiple continuous phenotypes, we are interested in finding linear combinations of the phenotypes with large heritabilities, which can be considered new phenotypes for genetic analysis. Here we examine the principal-components approach based on heritability (PCH) for situations where the number of phenotypes is large, and propose a penalized PCH approach for such situations. We evaluate the performance of the new approach using simulation and two family studies.

3:00-4:30pm Parallel Sessions

East Room: Structural sparsity pursuit in high-dimensional modeling (organized and chaired by Yiyuan She)

Jiahua Chen (University of British Columbia)
A thresholding algorithm for order selection in finite mixture models

Order selection is an important step in the application of finite mixture models. Classical methods such as AIC and BIC discourage complex models with a penalty directly proportional to the number of mixture components. In contrast, Chen and Khalili (2008) propose to link the penalty to two types of over-fitting. In particular, they introduce a regularization penalty to merge similar subpopulations in a mixture model, seamlessly employing the shrinkage idea of regularized regression. However, the new method requires an effective and efficient algorithm. When the popular EM algorithm is used, we need to maximize a non-smooth and non-concave objective function in the M-step, which is computationally challenging. Here, we show that such an objective function can be transformed into a sum of univariate auxiliary functions. We then design an iterative thresholding descent (ITD) algorithm to efficiently solve the associated optimization problem (a generic thresholding update is sketched below). Unlike many existing numerical approaches, the new algorithm leads to sparse solutions and thereby avoids undesirable ad hoc steps. We establish the convergence of ITD and assess its empirical performance using both simulations and real-data examples.

Jiashun Jin (Carnegie Mellon University)
Optimality of graphlet screening in high dimensional variable selection

Consider a linear model $Y = X\beta + z$, where $X$ has $n$ rows and $p$ columns and $z \sim N(0, I_n)$. We assume both $p$ and $n$ are large, including the case $p \gg n$. The unknown signal vector $\beta$ is assumed to be sparse in the sense that only a small fraction of its components is nonzero. The goal is to identify these nonzero coordinates (i.e., variable selection). We are primarily interested in the regime where signals are both rare and weak, so that successful variable selection is challenging but still possible. Research on rare and weak signals to date has focused on the unstructured case, where the Gram matrix $G = X^TX$ is nearly orthogonal. In this paper, $G$ is only assumed to be sparse in the sense that each row of $G$ has relatively few large coordinates (the diagonal of $G$ is normalized to 1).
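For the thresholding algorithm in Jiahua Chen's abstract above, the univariate auxiliary problems are of a familiar penalized form. As an editorial illustration (the talk's penalty is more elaborate), the generic L1 version has the closed-form soft-threshold solution

$$\hat{\theta} \;=\; \arg\min_{\theta}\,\tfrac{1}{2}(\theta - z)^2 + \lambda|\theta| \;=\; \mathrm{sgn}(z)\,(|z| - \lambda)_+,$$

and sparsity-inducing updates of this kind are what allow ITD to merge subpopulations exactly rather than relying on ad hoc steps.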
The sparsity of $G$ naturally induces the sparsity of the so-called graph of strong dependence (GOSD). The key insight is an interesting interplay between signal sparsity and graph sparsity: in a broad context, the signals decompose into many small-size components of the GOSD that are disconnected from each other. We propose Graphlet Screening (GS) for variable selection. This is a two-step Screen and Clean procedure: in the first step, we screen subgraphs of the GOSD with sequential $\chi^2$-tests, and in the second step, we clean with penalized MLE. The main methodological innovation is to use the GOSD to guide both the screening and cleaning processes. For any variable selection procedure $\hat{\beta}$, we measure its performance with the Hamming distance between the sign vectors of $\hat{\beta}$ and $\beta$, and assess optimality by the convergence rate of the Hamming distance. Compared with more stringent criteria such as exact support recovery or the oracle property, which demand strong signals, the Hamming distance criterion is more appropriate for weak signals since it naturally allows a small fraction of errors. We show that in a broad class of situations, Graphlet Screening achieves the optimal rate of convergence in terms of the Hamming distance. Well-known procedures such as the L0- and L1-penalization methods do not utilize graph structure for variable selection, so they generally do not achieve the optimal rate of convergence, even in very simple settings and even when the tuning parameters are ideally set.

Elizaveta Levina (University of Michigan)
Link prediction for partially observed networks

Link prediction is one of the fundamental problems in network analysis. In many applications, notably in genetics, a partially observed network may not contain any negative examples of absent edges, which creates a difficulty for many existing supervised learning approaches. We present a new method which treats the observed network as a sample of the true network, with different sampling rates for positive and negative examples, where the sampling rate for negative examples is allowed to be 0. The sampling rate itself cannot be estimated in this setting, but a relative ranking of potential links by their probabilities can be, which is sufficient in many applications. To obtain these rankings, we utilize information on node covariates as well as network topology, and set up the problem as a loss function measuring the fit plus a penalty measuring similarity between nodes. Empirically, the method performs well under many settings, including when the observed network is sparse and when the sampling rate is low even for positive examples. We illustrate the performance of our method and compare it to others in an application to a protein-protein interaction network. Joint work with Yunpeng Zhao and Ji Zhu.

West Room: Statistical methods for next generation sequencing data analysis in bioinformatics (organized and chaired by Michael Zhu)

Ming Hu (Harvard University)
Bayesian inference of three-dimensional chromosomal organization

Knowledge of the spatial organization of the genome is critical for the study of transcription regulation and other nuclear processes in the cell.
Recently, chromosome conformation capture (3C) based technologies such as Hi-C and TCC have been developed to provide a genome-wide, three-dimensional (3D) view of chromatin organization, but analysis algorithms for inferring chromosomal structures from such experiments are still under-developed. In the Hi-C experiment, the spatial proximity of any two genomic loci is represented by a count matrix. The goal of our study is to translate the count matrix into the 3D structure of local genomic domains. To achieve this, we devise a novel Bayesian statistical model linking the observed read counts spanning a pair of loci to the spatial distance between them. Using advanced Monte Carlo computational techniques such as sequential Monte Carlo and hybrid Monte Carlo, we are able to reconstruct the spatial arrangement of local genomic domains in 3D space. Applying our method to a real Hi-C dataset, we can visualize the 3D shape of the local genomic domains, and the predicted spatial distances are consistent with the gold-standard fluorescent in situ hybridization (FISH) data. With the explosive accumulation of high-throughput genome-wide chromatin interaction data, the proposed method will have immediate and far-reaching impact in the broader area of biomedical research.

Steve Qin (Emory University)
Inference of correlated hidden Markov models with application to genome-wide studies

Hidden Markov models (HMMs) have been widely used to analyze large-scale genomic data because of their ability to incorporate spatial correlation, in areas such as genome-wide mapping of binding sites for regulatory proteins. With advances in sequencing technology, it is now common to analyze data for multiple proteins jointly, but existing algorithms analyze data for each protein separately. Extending the HMM framework to a multivariate form is not trivial, however, because the hidden state space grows quickly with the number of experiments, so it is of interest to develop a powerful yet computationally tractable algorithm. Here we present a fast inferential method for correlated hidden Markov models for multiple sequential data (series). Instead of jointly inferring hidden states for all series, we adjust the transition kernel of each series with the hidden state configuration of the other correlated series and keep the hidden state inference marginal in each. Through simulation, we show that the new scheme achieves sensitivity comparable to a fully coupled HMM fit at a computational cost as low as fitting an independent HMM for each series separately. The method was applied to the analysis of histone modification data in human T cells.

Han Wu (Purdue University)
Using finite Poisson mixture models for RNA-Seq data analysis and transcript expression level quantification

RNA-Seq has emerged as a powerful technique for transcriptome study. Along with improved sensitivity and coverage, RNA-Seq also brings challenges for data analysis: the massive amount of sequence read data, excessive variability, uncertainty, and bias and noise stemming from multiple sources all make the analysis of RNA-Seq data difficult. Despite much progress, RNA-Seq data analysis still has much room for improvement, especially in the quantification of transcript/gene expression levels. We propose to use finite Poisson mixture models to characterize base-pair-level RNA-Seq data and further quantify transcript/gene expression levels.
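In generic form (an editorial sketch; the talk considers three specific variants), a K-component Poisson mixture models a base-pair-level count $y$ as

$$p(y) \;=\; \sum_{k=1}^{K} \pi_k\, \frac{\lambda_k^{y} e^{-\lambda_k}}{y!}, \qquad \sum_{k=1}^{K} \pi_k = 1,$$

with $K$ chosen by a BIC of the form $-2\log L + (2K-1)\log n$, since a K-component model has $K$ rates and $K-1$ free mixing weights.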
Finite Poisson mixture models combine the strength of fully parametric models with the flexibility of fully nonparametric models, and are well suited to modeling heterogeneous count data such as those observed in RNA-Seq experiments. In particular, we consider three types of Poisson mixture models and propose a BIC-based model selection procedure to adapt the models to individual transcripts. A unified quantification method based on the Poisson mixture models is developed to measure transcript/gene expression levels. The Poisson mixture models and the proposed quantification method were applied to two RNA-Seq data sets and demonstrated excellent performance in comparison with other existing methods. Our approach resulted in better characterization of the data and more accurate measurement of transcript expression levels. We believe that finite Poisson mixture models provide a flexible framework for modeling RNA-Seq data, and methods developed within this framework have the potential to become powerful tools for RNA-Seq data analysis.
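A compact editorial sketch of fitting such a mixture by EM, together with the BIC above (the function and its initialization are illustrative assumptions, not the authors' software):

```python
# Sketch: EM for a K-component Poisson mixture on count data, plus BIC.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import poisson

def poisson_mixture_em(y, K, n_iter=200):
    """y: 1-D array of counts; K: number of components.
    Returns (mixing weights pi, rates lam, BIC)."""
    y = np.asarray(y)
    n = len(y)
    lam = np.quantile(y, np.linspace(0.1, 0.9, K)) + 1e-3  # spread-out starts
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: log responsibilities, normalized stably per observation
        log_r = np.log(pi) + poisson.logpmf(y[:, None], lam[None, :])
        r = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
        # M-step: update mixing proportions and component rates
        nk = r.sum(axis=0)
        pi = nk / n
        lam = (r * y[:, None]).sum(axis=0) / np.maximum(nk, 1e-12)
    loglik = logsumexp(np.log(pi) + poisson.logpmf(y[:, None], lam), axis=1).sum()
    return pi, lam, -2.0 * loglik + (2 * K - 1) * np.log(n)

# BIC-based choice of K, in the spirit of the selection procedure above:
# best_bic, best_K = min((poisson_mixture_em(y, K)[2], K) for K in range(1, 6))
```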