University of Michigan logo Conference on Statistical Learning and Data Mining
Rackham Graduate School, University of Michigan, Ann Arbor, MI
June 5 - 7, 2012

June 5 (Tuesday)

9:00-10:00am Plenary Talk (chaired by Harrison Zhou)
Amphitheatre Peter Bartlett (University of California, Berkeley)
Large scale model selection and computational oracle inequalities

In many large-scale, high-dimensional prediction problems, performance is limited by computational resources rather than sample size. In this setting, we consider the problem of model selection under computational constraints: given a particular computational budget, is it better to gather more data and estimate a simpler model, or gather less data and estimate a more complex model? The talk will first review classical results for the performance of model selection methods based on complexity penalization. These methods aim to choose a model to optimally trade off estimation and approximation errors by minimizing the sum of an empirical risk term and a complexity penalty. We evaluate these methods via oracle inequalities, which show that the predictive accuracy is almost as good as the best bound that would have been achieved by any model in the hierarchy. We then focus on model selection with computational constraints, motivated by large scale problems. We introduce general model selection methods. In contrast to classical oracle inequalities, which show a near-optimal trade-off between approximation error and estimation error for a given sample size, we give computational oracle inequalities, which show that our methods give a near-optimal trade-off for a given amount of computation, that is, devoting all of our computational budget to the best model would not have led to a significant performance improvement.

10:30am-12:00pm Parallel Sessions
East Room Sensor networks, boosting and CER (organized by Tim Hesterberg and chaired by George Michailidis)

Xiaoming Huo (Georgia Institute of Technology)
Trajectory identification in binary proximity sensor networks

In office buildings, hospitals, and clinics, anonymous binary motion detectors can be installed. A key issue in mining this huge amount of sensor data is to build predictive models, so that particular applications can be facilitated. Trajectory identification (TI) is to recover object tracks, out of anonymous binary sensor data; it is a core step in mining this type of data. We present a general approach that is based on optimization, specifically on the mixed integer programming. We consider how to incorporate false information and missing data, using maximum likelihood estimation. It turns out the proposed method is extremely efficient in numerical implementation. We found that the efficiency is due to the special structure of this type of problems, which makes the current state-of-the-art optimizer run extremely efficient. Some sufficient conditions are derived to guarantee the numerical efficiency. Robustness and reliability of the solutions will be analyzed too. It has applications in building smart environments.

David Mease (Google)
Evidence contrary to the statistical view of boosting

The statistical perspective on boosting algorithms focuses on optimization, drawing parallels with maximum likelihood estimation for logistic regression. In this talk we present empirical evidence that raises questions about this view. Although the statistical perspective provides a theoretical framework within which it is possible to derive theorems and create new algorithms for general contexts, we show that there remain many unanswered important questions. Furthermore, we provide examples that reveal crucial flaws in the many practical suggestions and new algorithms that are derived from the statistical view. We examine experiments using simple simulation models to illustrate some of these flaws and their practical consequences. This is joint work with Abraham Wyner at the University of Pennsylvania.

Robert Obenchain (Risk Benefit Statistics)
Help wanted: systematic sensitivity analyses in CER

Global, parametric models typically make strong assumptions that can be wrong or misleading. Due to numerous sources of bias in observational data, such models tend to grossly underestimate the true uncertainty in Comparative Effectiveness Research (CER) using massive datasets of medical records or administrative claims. Local Control (LC) is a form of nonparametric preprocessing that makes robust head-to-head treatment comparisons. By dividing patients into many subgroups (blocks) that are relatively well-matched on pre-treatment X-covariates, LC produces a full distribution of local effect-size estimates. Systematic Sensitivity Analyses (SSA) of these distributions can provide an objective basis for individualized medicine by quantifying heterogeneous patient response to treatment. Because the LC approach is nonparametric (i.e. makes many fewer and weaker assumptions than parametric models), I will argue that computational requirements for SSA of LC distributions are sufficiently simplified to make them practically feasible. Alternative "choices" made in LC are limited to: which patient X-covariates are used, how many patient subgroups are formed, and which method of patient matching / clustering is used. Members of the SLDM section of ASA may well be uniquely qualified to pioneer use of SSA in LC and, thereby, make future CER findings much more objective and credible.

West Room Subgroup identification (organized and chaired by Ming-Long Lam)

Chakib Battioui (Eli Lilly)
A bootstrap-based ensemble tree method to identify patient subgroups with enhanced treatment effects

A rigorous method to identify patient subgroups with enhanced treatment effects in clinical trials can be valuable for efforts to develop tailored therapeutics. However, such a method must address the challenge of multiplicity, given the large number of potential subgroups formed by even a modest number of predictors, and be able to distinguish real difference from noise. In this presentation, I will describe a bootstrap-based ensemble tree method we have implemented to identify subgroups of patients for whom the treatment effect is significantly larger than for the overall patient population. The method starts by applying the tree procedure in SAS Enterprise Miner to multiple bootstrap samples to extract an ensemble of preliminary subgroups with better response. This ensemble is then reduced by aggregating over similarly defined subgroups. Out-of-Bag data is used to control type I errors and, for each candidate subgroup, correct the bias in the estimated treatment effect.

Richard Zink (JMP)
Considerations for subgroup identification of patients with enhanced treatment response in clinical trials

In contrast to the "one-size-fits-all" approach of traditional drug development, the need to locate subjects with an enhanced treatment effect is a critical component for tailored therapeutics or personalized medicine. Typically, the goal is to identify patients receiving additional benefit from the treatment in terms of an efficacy response. Alternatively, finding subgroups based on important safety endpoints could be considered to determine those individuals experiencing a reduced risk of key adverse events, or to identify subjects for whom the new therapy may be inappropriate. The analysis of subgroups in clinical trials is a controversial topic. There is often insufficient power to observe statistically significant treatment effects among numerous subgroups. Furthermore, a common practice is to identify many potential subgroups post hoc and to perform testing without the appropriate adjustment for multiple comparisons. This often leads to numerous false-positive findings. One common recommendation is to pre-specify a small number of clinically-meaningful subgroups for analysis and to interpret the findings conservatively. In many instances, however, the number of covariates may be so numerous or the biological phenomenon may be poorly understood so that pre-specifying subgroups is difficult at best. To address these concerns, we describe a "prune-as-you-go" method for subgroup identification. We address the following topics: i) computational complexity, ii) accounting for multiplicity, iii) minimizing confounding, iv) interpreting and communicating findings, and v) potential regulatory concerns. Finally, we describe the impact CDISC standards will likely have on subgroup-finding methodologies.

Jared Foster (University of Michigan)
Subgroup identification using random forests and regression trees

We consider the problem of identifying a subgroup of patients who may have an enhanced treatment effect in a randomized clinical trial. In such cases, there are often a moderate to large number of covariates, and it is desirable that any identified subgroup be defined by a limited number of covariates. For this problem, the development of a standard, pre-determined strategy may help to avoid the well-known dangers of subgroup analysis. We present one such strategy, which we refer to as "Virtual Twins." This method involves predicting response probabilities for treatment and control "twins" for each subject using random forests. In order to identify a subset of enhanced treatment effect which is defined by a limited number of covariates, the difference in these probabilities is then used as the outcome in a classification or regression tree. We define a measure Q(A) to be the difference between the treatment effect in estimated subgroup A and the marginal treatment effect. We present several methods developed to obtain an estimate of Q(A), including estimation of Q(A) using estimated probabilities in the original data and using estimated probabilities in newly simulated data. With this method, there is the potential for over-fitting, so we discuss a number of ways to combat this, including a bootstrap-based bias correction for estimates of Q(A).

Amphitheatre Sparsity in high dimensional problems (organized and chaired by Shuheng Zhou)

Roman Vershynin (University of Michigan)
New ways of dimension reduction? How to cut data sets into small pieces

The classical and simplest method of dimension reduction for a data set K in R^N is to project K onto a random low-dimensional subspace. Johnson-Lindenstrauss lemma gives theoretical guarantees for this method. I will describe a new geometric way of dimension reduction, which is to *cut*, rather than project, the data set K by random independent hyperplanes. Compared to Johnson-Lindenstrauss Lemma, one obtains an embedding of K into a *discrete* metric space, namely the Hamming cube. To compute the embedding, one simply records the orientation of a point with respect to all hyperplanes. Moreover, one can algorithmically and accurately recover a point from its image in the Hamming cube by solving a convex program. I will discuss possible connections and applications of this "cutting" method for compressed sensing and for sparse binomial regression. This is a joint work with Yaniv Plan (UM, Mathematics).

Cun-Hui Zhang (Rutgers University)
A general theory of concave regularization for high dimensional sparse estimation problems

Concave regularization methods provide natural procedures for sparse recovery. However, they are difficult to analyze in the high dimensional setting. Only recently a few sparse recovery results have been established for some specific local solutions obtained via specialized numerical procedures. Still, the fundamental relationship between these solutions such as whether they are identical or their relationship to the global minimizer of the underlying nonconvex formulation is unknown. The current paper fills this conceptual gap by presenting a general theoretical framework showing that under appropriate conditions, the global solution of nonconvex regularization leads to desirable recovery performance; moreover, under suitable conditions, the global solution corresponds to the unique sparse local solution, which can be obtained via different numerical procedures. Under this unified framework, we present an overview of existing results and discuss their connections. The unified view of this work leads to a more satisfactory treatment of concave high dimensional sparse estimation procedures, and serves as guideline for developing further numerical procedures for concave regularization. This is joint work with Tong Zhang.

Harrison Zhou (Yale University)
Optimalities in estimation of latent variable graphical models

In this talk we will consider estimation of latent variable graphical models. The precision matrix can be decomposed as the sum of a sparse matrix and a low rank matrix. Rate-optimal estimation will be studied to recover both sparse and low rank components.

1:30-3:00pm Parallel Sessions
East Room Complex data in medicine (organized by David Banks and chaired by Kerby Shedden)

David Banks (Duke University)
Statistical issues in metabolomics metrology

Metabolomics measurement is a complex process, with several sources of uncertainty and systematic error. This talk describes an integrated Bayesian approach that includes warping, modeling of library functions, and incorporation of calibration standards. The result is a significant improvement in the quality of overall estimates of metabolite abundance, and generalizable template for analyses on multiple platforms. The performance is assessed using data from the Beecher Laboratory at the University of Michigan.

Hongyuan Cao (University of Chicago)
Multiple testing under dependence

High-throughput screening has become an important mainstay for contemporary biomedical research. A standard approach is to first rank p-values and then adjust for multiplicity. In certain applications, for example, imaging analysis, the signals are sparse but clustered. By taking into account of the neighbouring information of spatially structured data, we propose a new multiple comparison procedure to accommodate the spatial information of neighbouring p-values. Theoretical properties of the new testing procedure are investigated. We demonstrate that our approach has improved detection ability as opposed to methods that ignore the sparse spatial structure through numerical studies. The computational simplicity and detection effectiveness of our procedure are illustrated through a breast cancer imaging dataset.

Eric Laber (North Carolina State University)
Statistical inference for dynamic treatment regimes

Dynamic treatment regimes, also known as treatment policies, are increasingly being used to operationalize sequential clinical decision making associated with patient care. Common approaches to constructing a dynamic treatment regime from data, such as Q-learning, employ non-smooth functionals of the data. Therefore, simple inferential tasks such as constructing a confidence interval for the parameters in the Q-function are complicated by nonregular asymptotics under certain commonly-encountered generative models. Methods that ignore this nonregularity can suffer from poor performance in small samples. We construct confidence intervals for the parameters in the Q-function by first constructing smooth, data-dependent, upper and lower bounds on these parameters and then applying the bootstrap. The confidence interval is adaptive in that although it is conservative for nonregular generative models, it achieves asymptotically exact coverage elsewhere. The small sample performance of the method is evaluated on a series of examples and compares favorably to previously published competitors. Finally, we illustrate the method using data from the Adaptive Interventions for Children with ADHD study (Pelham and Fabiano 2008).

West Room Methods for modeling complex dependence structure in high dimension setting (organized by Hongzhe Lee and chaired by Xiaoming Huo)

Genevera Allen (Rice University)
Sparse higher-order principal components analysis

High-dimensional tensors or multi-way data are becoming prevalent in areas such as biomedical imaging, chemometrics, networking and bibliometrics. Traditional approaches to finding lower dimensional representations of tensor data include fattening the data and applying matrix factorizations such as principal components analysis (PCA) or employing tensor decompositions such as the CANDECOMP/PARAFAC (CP) and Tucker decompositions. The former can lose important structure in the data, while the latter Higher-Order PCA (HOPCA) methods can be problematic in high-dimensions with many irrelevant features. We introduce frameworks for sparse tensor factorizations or Sparse HOPCA based on heuristic algorithmic approaches and by solving penalized optimization problems related to the CP decomposition. Extensions of these approaches lead to methods for general regularized tensor factorizations, multi-way Functional HOPCA and generalizations of HOPCA for structured data. We illustrate the utility of our methods for dimension reduction, feature selection, and signal recovery on simulated data and multi-dimensional microarrays and functional MRIs.

Jichun Xie (Temple University)
Joint estimation of multiple precision matrices

Precision matrices reveal conditional dependency structures between multivariate normal random variables. Some heterogeneous data share similar dependency structures among different groups. One of the examples is the multiple-tissue gene expression data. Evidence has shown that the gene regulatory networks share similarities across different tissues and yet have tissue-specific interaction mechanism. To improve the estimation efficiency, we develop a joint estimation method to infer precision matrices. Theoretical and Numerical studies show that our method has improved support recovery results compared with separate estimation methods as well as other competing joint estimation methods. We further apply our method to the mouse gene expression data and obtain the gene regulatory networks for multiple mouse tissues.

Lifeng Wang (Michigan State University)
Efficient regularization path algorithms for high dimensional additive models

Least Angle Regression (LARS) proves to be an efficient algorithm that constructs piecewise linear solution path for high-dimensional linear regression, particularly for the problem with p>>n. However, piecewise linear path algorithms have yet to be developed for nonparametric additive models. Motivated by the geometric interpretation of LARS, we propose a functional LARS algorithm to perform nonparametric regression and feature selection simultaneously for high-dimensional additive models. The proposed algorithm efficiently constructs the whole regularization path, which allows for adaptive tuning and efficient model selection. We investigate its asymptotic property and its connection to the Boosting and Lasso algorithms for additive models, and illustrate its finite-sample performance via simulations and applications.

3:30-5:00pm Parallel Sessions
East Room Statistical network analysis (organized by Tyler McCormick and chaired by Elizaveta Levina)

Tyler McCormick (University of Washington)
Latent space models for networks using aggregated relational data

The latent space model for networks (see Hoff, Raftery and Handcock (2002), for example) assumes that the actors in a network form ties independently given their (latent) position in some unobservable 'social space.' A multidimensional geometric space then parsimoniously represents the complicated dependence structure in the network. In many cases---particularly in the social sciences---collecting complete network data, required for the latent space model, is logistically and financially challenging. In contrast, Aggregated Relational Data (ARD) measure network structure indirectly by asking respondents how many connections they have with members of a certain subpopulation (e.g. How many individuals with HIV/AIDS do you know?). These data require no special sampling procedure and are easily incorporated into existing surveys. This research develops a latent space model for ARD. The key distinction from the complete network case is that instead of conditioning on the (latent) distance between two members of the network, the latent space model for ARD conditions on the expected distance between a survey respondent and the center of a subpopulation in the latent space. A spherical latent space facilitates tractable computation of this expectation. This model estimates relative homogeneity between groups in the population and variation in the propensity for interaction between respondents and group members. The model also estimates features of groups which are difficult to reach using standard surveys (the homeless, for example).

Simon Lunagomez (Harvard University)
Bayesian inference from non-ignorable network sampling designs

Consider a population where subjects are susceptible to a disease (e.g. AIDS). The objective is to perform inferences on a population quantity (like the incidence of HIV on a high-risk subpopulation, e.g. intra-venous drug abusers) via sampling mechanisms based on a social network (link-tracing designs, RDS). We phrase this problem in terms of the framework proposed by Rubin (1976). A new notion of ignorability (graph-ignorability) is proposed for this context and it is proved that RDS is not graph-ignorable. We develop a general framework for making Bayesian inference on the population quantity that: models the uncertainty in the underlying social network using a random graph model, incorporates dependence among the individual responses according to the social network via a Markov Random Field, models the uncertainty regarding the sampling on the social network, and deals with the non-ignorability of the sampling design. The proposed framework is general in the sense that it allows a wide range of different specifications for the components of the model we just mentioned. Samples from the posterior distribution are obtained via Bayesian model averaging. Our model is compared with state of the art methods in simulation studies to show its superiority. This is joint work with Edoardo Airoldi.

Andrew Thomas (Carnegie Mellon University)
Discriminating among distance functions for network-correlated outcomes

A key promise of social networks is the ability to detect and model the correlation of personal attributes along the structure of the network, in either static or dynamic settings. The basis for most of these models, the Markov Random Field on a lattice, has several assumptions that may not be reflected in real network data, namely the assumptions that the process is stationary on the lattice, and that the ties in the model are correctly specified. Additionally, it is less than clear how correlation over longer distances on networks can be adequately specified under the lattice mechanism, given the assumption of a stationary process at work. Based on concepts from generalized additive models and spatial/geostatistical methods, I introduce a class of models that is more robust to the failure of these assumptions, more flexible to different definitions of network distance, and more generally applicable to large-scale studies of network phenomena. In this talk I will address the challenge of selecting an appropriate subset of distance measures from the possible range of choices, including the interaction of multiple classes of network ties.

West Room Topics in multivariate statistics (organized and chaired by Adam Rothman)

Gareth James (University of Southern California)
Functional additive regression

We suggest a new method, called "Functional Additive Regression", or FAR, for efficiently performing high dimensional functional regression. FAR extends the usual linear regression model involving a functional predictor, X(t), and a scalar response, Y , in two key respects. First, FAR uses a penalized least squares optimization approach to efficiently deal with high dimensional problems involving a large number of different functional predictors. Second, FAR extends beyond the standard linear regression setting to fit general non-linear additive models. FAR can be implemented with a wide range of penalty functions using a highly efficient coordinate descent algorithm. FAR can also be extended to include both functional responses and predictors. We demonstrate our method by predicting the first 10 weeks of Hollywood movie box office revenues using an online virtual movie stock market.

Wenxuan Zhong (University of Illinois at Urbana-Champaign)
Variable selection procedures for the index model

In this talk, I will present three variable selection methods developed under the sufficient dimension reduction framework. Unlike the conventional stepwise method, the proposed methods do not impose a special form of relationship between the response variable and the predictor variables. They select variables that maximize the correlation between the transformed response and linear combinations of the predictors. Various asymptotic properties of the methods will be discussed, and in particular, the variable selection performance under the diverging number of predictors and sample size will be discussed.

Yunzhang Zhu (University of Minnesota)
Estimation over multiple undirected graphs

Graphical models are useful to analyze and visualize conditional independence relationships between interacting units, in addition to their structural implications. For instance, in dynamic network analysis, a structural change is often a result of certain events or experimental conditions. In this talk, we will discuss methods for estimating multiple precision matrices over Gaussian graphical models, describing dependencies among interacting units. Of particular interest is the clustering structure over these matrices as well as the sparseness structure over and within matrices, which is characterized by homogeneous subgroups of elements and zero-elements of the matrices. Numerical as well as theoretical results will be presented to support use of the methods. This work is joint with Xiaotong Shen and Wei Pan.

June 6 (Wednesday)

8:30-10:00am Parallel Sessions
East Room Complex data structures and learning (organized and chaired by Eric Laber)

Natesh Pillai (Harvard University)
On a class of shrinkage priors for covariance matrix estimation

We propose a flexible class of models based on scale mixture of uniform distributions to construct shrinkage priors for covariance matrix estimation. This new class of priors enjoys a number of advantages over the traditional scale mixture of normal priors, including its simplicity and flexibility in characterizing the prior density. We also exhibit a simple, easy to implement Gibbs sampler for posterior simulation which leads to efficient estimation in high dimensional problems. We first discuss the theory and computational details of this new approach and then extend the basic model to a new class of multivariate conditional autoregressive models for analyzing multivariate areal data. The proposed spatial model flexibly characterizes both the spatial and the outcome correlation structures at an appealing computational cost. Examples consisting of both synthetic and real-world data show the utility of this new framework in terms of robust estimation as well as improved predictive performance.

Kerby Shedden (University of Michigan)
Learning curves under increasing model complexity

The accuracy of a trained classifier is expected to increase as the training set sample size grows. The classification accuracy rate expressed as a function of the training set sample size has been termed the "learning curve." The learning curve has previously been seen to depend in a natural way on the number of predictor variables, and on the relationship between the training and testing covariate distributions. These results were obtained in a framework where a fixed model structure is used as the sample size grows. I will present our recent efforts to characterize learning curve behavior in a setting where increasingly complex models are fit as the sample size grows. We find that a wider range of learning curve patterns arises, and the shape of the learning curve becomes more difficult to predict than in the case of a static model structure.

Hua Zhou (North Carolina State University)
Tensor regression with applications in neuroimaging data analysis

Classical regression methods treat covariates as a vector and estimate a corresponding vector of regression coefficients. Modern applications in medical imaging generate covariates of more complex form such as multidimensional arrays (tensors). Traditional statistical and computational methods are proving insufficient for analysis of these high-throughput data due to their ultrahigh dimensionality as well as complex structure. We propose a new family of tensor regression models that efficiently exploit the special structure of tensor covariates. Under this framework, ultrahigh dimensionality is reduced to a manageable level, resulting in efficient estimation and prediction. A fast and highly scalable estimation algorithm is proposed for maximum likelihood estimation and the asymptotics are studied. Effectiveness of the new methods is demonstrated on both synthetic and real MRI imaging data. This is based on joint work with Lexin Li and Hongtu Zhu.

West Room Statistical ranking (organized and chaired by Yoonkyung Lee)

Shivani Agarwal (Indian Institute of Science)
On the statistical consistency of ranking algorithms

Ranking problems arise in an increasing number of applications, including for example information retrieval, recommendation systems, computational biology, drug discovery, and a variety of industrial prioritization tasks. In recent years, there has been much interest in developing machine learning algorithms for ranking problems, and in understanding the statistical properties of such algorithms. This talk will survey some of these results, focusing on the statistical consistency properties of ranking algorithms.

Lek-Heng Lim (University of Chicago)
Online ranking on random graphs

Suppose a large number of voters have each rated or compared a small subset of a large number of alternatives, how could we rank the alternatives based on these data? The rank aggregation problem is fraught with famous difficulties --- Arrow's impossibility, Saari's chaos, NP-hardness of Kemeny optima. To complicate matters further, let's say the ratings do not come all at once but trickles in on a daily basis and we would like to regularly update our ranking. Let's say we also want a measure of reliability or quality of our ranking. We will discuss a method based on Hodge decomposition and online learning that meets all these requirements. This is joint work with Yuan Yao.

Cynthia Rudin (MIT)
How to reverse-engineer product quality rankings

A good or bad product quality rating can make or break an organization. However, the notion of "quality" is often defined by an independent rating company that does not make the formula for ranking products public. We provide a supervised ranking approach for "reverse-engineering" a rating company's proprietary model as closely as possible. More generally, we present novel mixed integer programming (MIP) formulations for optimizing a large class of rank statistics, including the Area Under the Curve (AUC), the partial or local AUC, the Discounted Cumulative Gain (DCG), the P-Norm Push, and quality metrics relevant to reverse-engineering product rankings. We demonstrate the merit of our approach on a decade's worth of quality rating data from a major quality ratings company. This is joint work with Allison Chang from MIT, and Mike Cavaretta, Bob Thomas, and Gloria Chou from Ford Motors.

Amphitheatre Learning dependent structures (organized by Xiaotong Shen and chaired by Genevera Allen)

Adam Rothman (University of Minnesota)
Positive definite estimators of large covariance matrices

Using convex optimization, we construct a sparse estimator of the covariance matrix that is positive definite and performs well in high-dimensional settings. A lasso-type penalty is used to encourage sparsity and a logarithmic barrier function is used to enforce positive definiteness. Consistency and convergence rate bounds are established as both the number of variables and sample size diverge. An efficient computational algorithm is developed and the merits of the approach are illustrated with simulations and a speech signal classification example.

Peng Wang (Bowling Green State University)
High-dimensional inference function for random effects model with unspecified distribution

In longitudinal studies, mixed-effects models are important for addressing subject-specific effects. However, most existing approaches assume a normal distribution for the random effects, and this could affect the bias and efficiency of the fixed-effects estimator. Even in cases where the estimation of the fixed effects is robust with a misspecified distribution of the random effects, the estimation of the random effects could be invalid. We propose a new approach to estimate fixed and random effects using conditional quadratic inference functions. The new approach does not require the specification of likelihood functions or a normality assumption for random effects. It can also accommodate serial correlation between observations within the same cluster, in addition to mixed-effects modeling. Other advantages include not requiring the estimation of the unknown variance components associated with the random effects, or the nuisance parameters associated with the working correlations. We establish asymptotic results for the fixed-effect parameter estimators which do not rely on the consistency of the random-effect estimators. Real data examples and simulations are used to compare the new approach with the penalized quasi-likelihood approach, and SAS GLIMMIX and nonlinear mixed effects model (NLMIXED) procedures.

Lingsong Zhang (Purdue University)
Robust regularized singular value decomposition for two-way functional data

We develop a robust regularized singular value decomposition (RobRSVD) method for analyzing two-way functional data. The research is primarily motivated by the application of modeling human mortality as a smooth two-way function of age group and year. The RobRSVD is formulated as a penalized minimization problem where a robust loss function is used to measure the reconstruction error of a low-rank matrix approximation of the data, and an appropriately defined two-way roughness penalty function is used to ensure smoothness along two functional domains. By viewing the minimization problem as conditional regularized robust regressions, we develop a fast iterative reweighted least squares algorithm to implement the method. Furthermore, our formulation allows rigorous derivation of leave-one-row/column-out cross-validation and generalized cross-validation criteria, which enable computationally efficient data-driven penalty parameter selection. The advantages of the developed method over non-robust ones are shown via the mortality rate application and extensive simulation studies. This is a joint work with Haipeng Shen and Jianhua Huang.

10:30-11:30am Plenary Talk (chaired by David Banks)
Amphitheatre Susan Murphy (University of Michigan)
Adaptive confidence intervals for nonregular parameters

A commonly occurring non-regular parameter arises when scientific interest centers on a non-smooth function of regular parameters. These non-regular parameters occur in the assessment of a classifier's performance, in the estimation of multistage decision making policies and in many economic models. Indeed methods that use thresholding to provide interpretable results yet improve prediction performance produce non-regular parameters of this type. This is because these methods can be interpreted at estimating sparse versions of an underlying high dimensional parameter. If confidence intervals are considered at all, most research assumes potentially implausible "margin-like" conditions in order to justify the proposed confidence interval method. We describe a different approach based on constructing smooth upper and lower bounds on the parameter and then basing the confidence interval on the smooth upper and lower bounds. In particular two settings will be discussed and contrasted, that of a confidence interval for the misclassification rate and a confidence intervals for parameters in multistage decision making policies.

1:00-2:30pm Parallel Sessions
East Room Learning latent structures (organized and chaired by Long Nguyen)

Ryan Adams (Harvard University)
Tree-structured stick breaking processes for hierarchical data

The Dirichlet process and related distributions over infinite-dimensional random measures have had a large impact on Bayesian nonparametric modeling, but they can be limited by the lack of structure they place over their random partitions. In this work, we use a recursive stick breaking process to construct a Bayesian nonparametric prior for inferring hierarchies in data, yielding an exchangeable urn scheme on trees of unbounded width and depth. We use a novel slice sampling approach based on an auxiliary variable model to update the structure of the tree, as well as Gibbs moves that take advantage of invariance under size-biased permutation. We use this MCMC sampling scheme to perform hierarchical clustering on a set of 50,000 color images, and also to discover structure from natural language in a corpus of scientific documents. This is joint work with Zoubin Ghahramani (Cambridge) and Michael Jordan (Berkeley).

Junming Yin (Carnegie Mellon University)
Group sparse additive models

We consider the problem of sparse variable selection in nonparametric additive models, with the prior knowledge of the structure among the covariates to encourage those variables within a group to be selected jointly. Previous works either study the group sparsity in the parametric setting (e.g., group lasso), or address the variable selection problem in the nonparametric setting without exploiting the structural information (e.g., sparse additive models (SpAM)). In this paper, we present a new method, called group sparse additive models (GroupSpAM), which can handle group sparsity in nonparametric additive models. We generalize the l1/l2 norm to Hilbert spaces as the sparsity-inducing penalty in GroupSpAM. Moreover, we derive a novel thresholding condition for identifying the functional sparsity at the group level, and propose an efficient block coordinate descent algorithm for constructing the estimate. We demonstrate by simulation that GroupSpAM substantially outperforms competing methods in terms of support recovery and prediction accuracy in additive models, and also conduct a comparative experiment on a real breast cancer dataset.

Abel Rodriguez (UC Santa Cruz)
Modeling collections of networks using nonparametric mixtures

Network data, where the observations corresponds to measurements of the interactions among a collection of subjects, arises in application areas as diverse as biology, sociology, epidemiology, and finance. Array-valued extensions of non-parametric mixture models have recently been used to generate flexible extensions of the traditional class of blockmodels, which are often used for the purpose of community identification. This talk will discuss extensions of these infinite-dimensional blockmodels to problems where multiple networks are observed. In particular, we discuss both models for time series of networks, with a particular emphasis on modeling sequences of financial trading networks. This is a joint work with Naomi Boyd (University of West Virginia) and Brenda Betancourt (University of California, Santa Cruz).

West Room New techniques for correlated high-dimensional data (organized and chaired by Yichao Wu)

Wonyul Lee and Yufeng Liu (University of North Carolina at Chapel Hill)
Joint statistical modeling of multiple high dimensional data

With the abundance of high dimensional data, shrinkage techniques are very popular for simultaneous variable selection and estimation. In this talk, I will present some new shrinkage techniques for joint analysis of multiple high dimensional data. Applications on cancer gene expression data and micro-RNA data will be presented.

George Michailidis (University of Michigan)
Inferring gene regulatory networks by integrating perturbation screens and steady state gene expression profiles

Reconstructing a transcriptional regulatory network is an important task in functional genomics. Data obtained from experiments that perturb genes by knock-outs or RNA interference contain useful information for addressing the reconstruction problem. However, such data can be limited in size and/or expensive to acquire. On the other hand, observational data of the organism in steady state are more readily available, but their informational content inadequate for the task at hand. We develop a computational approach to appropriately utilize both data sources for estimating a regulatory network. The proposed method offers significant advantages over existing techniques. We develop a three-step algorithm to estimate the underlying directed acyclic regulatory network that uses as input both perturbation screens and steady state gene expression data. In the first step, the algorithm determines causal orderings of the genes that are consistent with the perturbation data. In the second step, for each ordering, a regulatory network is estimated using a penalized likelihood based method, while in the third step a consensus network is constructed from the highest scored ones. The algorithm offers two options for determining all possible causal orderings: an exhaustive search that becomes prohibitive for larger scale problems and a fast heuristic that couples a Monte Carlo technique with a fast search algorithm. Further, it is established that the algorithm produces a consistent estimate of the regulatory network. Numerical results show that the algorithm performs well in uncovering the underlying network and clearly outperforms competing approaches that rely only on a single data source.

Ning Hao and Helen Zhang (University of Arizona)
Selection of interaction effects for ultra high-dimensional data

For the ultra-high dimensional data, the identification of important interaction effects is an extremely challenging task, in terms of computation, practical implementation, and theoretical investigation. We propose new methods and computational algorithms to tackle these issues. The new methods are featured with efficient computation, desired theoretical properties, and promising numerical results. Various examples are presented to illustrate the new proposals.

3:00-4:30pm Parallel Sessions
East Room Applications of statistical learning to brain imaging (organized by Hernando Ombao and chaired by Veronica Berrocal)

Rossi Luo (Brown University)
Sparse inverse covariance estimation with applications in recovering brain networks

This talk presents a new method for estimating sparse precision matrices in the high dimensional setting. This procedure applies a Sparse Column-wise Inverse Operator (SCIO) to modified sample covariance matrices. We establish the convergence rates of this procedure under various matrix norms. In particular, we prove theoretical guarantees for using cross validation to pick tuning parameters in this procedure. Another advantage is its efficient computation for large-scale problems. This method is applied to recover gene networks of HIV brain tissues and resting-state fMRI networks of ADHD children.

Tingting Zhang (University of Virginia)
A semi-parametric model for hemodynamic response with latency differences

We introduce a semi-parametric model of hemodynamic response function (HRF) within the context of the General Linear Model (GLM) for multi-subject fMRI data analysis. In this new model, the HRFs for the given stimulus and fixed brain region are assumed to share the same but unknown functional form across subjects, while they differ in height, time to peak, and width. We use a nonparametric spline-smoothing method to evaluate this common functional form, based on which we estimate the subject-specific characteristics of the HRFs. This approach brings several advantages, including: (1) it is flexible in describing various brain hemodynamic activities across different regions and stimuli; (2) it explicitly characterizes the shared common properties across subjects, (3) the temporal differentiability of the employed spline basis enables precise evaluation of latency and width differences in hemodynamic responses. Through both simulations and real data analysis, we show that our method can out-perform existing methods for HRF estimation.

Timothy Johnson (University of Michigan)
Classifying multiple sclerosis subtypes: a Bayesian spatial point process approach

Multiple Sclerosis (MS) is a disease in which the myelin sheaths surrounding the axons of the brain and spinal cord are damaged. This leads to dymelination and scarring of the white matter tracks in the brain. There are five main subtypes of MS: 1) clinically isolated syndrom 2) relapsing/remitting, 3) secondary progressive, 4) progressive relapsing and 5) primary progressive. Researchers are interested in subtype prediction via MRI imaging. We use a model based approach to prediction, which can lead to substantial increases in prediction accuracy versus non-model based approaches. The data are lesion locations from T1-weighted MRI images. Prediction is performed within the Bayesian framework and we adopt a spatial Cox process model. In particular, we estimate the intensity function on the 3D image both parametrically, with a log-Gaussian random field, and non-parametrically, with an hierarchical Poisson/gamma random field. The resulting estimates are used to predict MS subtype in a new patient. We show, via importance sampling, that we achieve excellent cross-validated prediction results compared to a naive Bayesian classifier.

West Room Tuning parameter selection, matrix factorization and decomposition (organized and chaired by Annie Qu)

Junhui Wang (University of Illinois at Chicago)
Consistent selection of tuning parameters in high-dimensional penalized regression

Penalized regression models are popularly used in high-dimensional data analysis to conduct variable selection and model fitting simultaneously. Although success has been widely reported in literature, their performances largely depend on the tuning parameters that balance the trade-of between model fitting and model sparsity. Existing tuning criteria mainly follow the route of minimizing the estimated prediction error or maximizing the posterior model probability, such as cross-validation, AIC and BIC. In this talk, I will present a general tuning parameter selection criterion based on a novel concept of variable selection stability. The key idea is to select the tuning parameters so that the resultant penalized regression model is stable in variable selection. Numerical examples will be provided to demonstrate the effectiveness of the proposed selection criterion. Asymptotic selection consistency of the proposed criterion will also be discussed.

Mu Zhu (University of Waterloo)
Content-boosted matrix factorization for recommender systems

Recommender systems are quickly becoming ubiquitous marketing tools. Recommendation algorithms can be either based on content or driven by collaborative filtering. The Netflix prize has rejuvenated a widespread interest in the matrix factorization approach of collaborative filtering. We have been studying ways to incorporate content information directly into this approach. In this talk, I will present two methods with slightly different flavors. One uses a regression constraint, and the other uses an extra penalty that has a selective shrinkage effect. Our empirical evidence shows that these content-boosted matrix factorization algorithms not only improve recommendation accuracy, but also provide useful insights about the contents themselves that are otherwise unavailable, as well as make recommendations more easily interpretable. (Based on joint work with Peter Forbes and Jennifer Nguyen.)

Xiaotong Shen (University of Minnesota)
Simultaneous pursuit of sparseness and rank structures through matrix decomposition

In multi-response regression, pursuit of two different types of structures is essential to battle the "curse of dimensionality". In this paper, we decompose a parameter matrix into a sum of sparse and low rank matrices for structure pursuit. On this basis, we propose a constrained method subject to two nonconvex constraints, respectively for sparseness and low-rank properties. Computationally, obtaining an exact global solution is NP-difficult. To overcome the difficulty, we utilize an alternating directions method solving a low-rank problem and a sparseness problem alternatively, where we derive an exact solution to the low-rank problem, as well as an approximation solution for the sparse problem through a surrogate of the L0 constraint and difference convex programming. Theoretical properties of the proposed method will be discussed in addition to numerical examples.

June 7 (Thursday)

8:30-10:00am Parallel Sessions
East Room Flexible regression and clustering (organized by Yufeng Liu and chaired by Lingsong Zhang)

Jia Li (Pennsylvania State University and NSF)
D2-clustering and hypothetical local mapping

In this talk, I will present a novel clustering algorithm called D2-Clustering for objects each represented by a discrete distribution with finite but non-fixed support. This mathematical representation often arises in multimedia content analysis and can be viewed as a sparse representation of histogram-type (aka, frequency-type) feature vectors. The Kantorovich-Wasserstein metric is used as the distance between objects, which differs profoundly from the Euclidean distance based on frequencies. The D2-Clustering algorithm groups objects by minimizing the total within-cluster squared distances, a criterion in the same spirit as k-means but applied to a significantly more complicated data type. The computational obstacle is overcome by a double loop iterative linear programming algorithm. Inspired by manifold learning, I also propose Hypothetical Local Mapping (HLM) to construct a mixture density function for the generic metric space. The combination of D2-Clustering and HLM is applied to real-time image annotation.

Guang Cheng (Purdue University)
Joint asymptotics and inferences for semi-nonparametric models

In this talk, we consider the joint asymptotics and inferences for the semi-nonparametric models where the Euclidean parameter and an infinite dimensional parameter are both of interest. To illustrate our idea, we consider the generalized partly linear framework. Specifically, we derive the joint limit distribution for the two parameters which are rescaled according to different convergence rates. The marginal limit distribution/semiparametric efficiency for the Euclidean estimate coincides with that derived in the semiparametric literature. To construct the joint confidence region/band, we propose the likelihood ratio testing approach that can effectively avoid estimating the asymptotic covariance. The employed regularization tool is the smoothing spline. The undersmoothing of the smoothing spline estimate is required for obtaining the valid joint inferences. The key technical tool is a concentration inequality. A by-product result is the marginal asymptotics for the infinite dimensional parameter that are new even in the nonparametric literature.

Annie Qu (University of Illinois at Urbana-Champaign)
Variable selection in high-dimensional varying-coefficient models with global optimality

The varying coefficient model is flexible and powerful for modeling the dynamic changes of regression coefficients. It is important to identify significant covariates associated with response variables, especially for high-dimensional settings where the number of covariates can be larger than the sample size. We consider model selection in the high-dimensional setting and adopt difference convex programming to approximate the L0 penalty, and we investigate the global optimality properties of the varying coefficient estimator. The challenge of the variable selection problem here is that the dimension of the nonparametric form for the varying coefficient modeling could be infinite, in addition to dealing with the high-dimensional linear covariates. We show that the proposed varying coefficient estimator is consistent, enjoys the oracle property and achieves an optimal convergence rate for the non-zero nonparametric components for high-dimensional data. Our simulations and numerical examples indicate that the difference convex algorithm is efficient using the coordinate decent algorithm, and is able to select the true model at a higher frequency than the LASSO, the adaptive LASSO and the SCAD approaches.

West Room Inference in networks and multivariate analysis (organized and chaired by Guilherme Rocha)

Karl Rohe (University of Wisconsin, Madison)
Characterizing the asymmetric flow of information in directed networks

Relationships in social networks are often asymmetric and can represent the flow of information from one person to another. Email messages are one example. Using the directed stochastic blockmodel, this talk will (i) present a way to characterize the asymmetric flow of information in a directed network and (ii) demonstrate how this can identify "bottleneck nodes" in a citation network.

Michael Trosset (Indiana University)
Inference on random graphs with classified edge attributes

We test simple hypotheses about a random graph with a fixed set of vertices and random edges, each of which possesses one of K mutually exclusive attributes. The edge attributes are inferred by means of a fallible classifier. Suppose that E and F are the confusion matrices of two such classifiers. Using results from statistical decision theory, we demonstrate that, if there exists a K x K stochastic matrix R such that ER = F, then most powerful (MP) tests based on E are necessarily more powerful than MP tests based on F. By means of an example, we also demonstrate that entry-wise superiority of E to F does not guarantee that an MP test based on E is more powerful than an MP test based on F.

Vincent Vu (Carnegi Mellon University)
Sparse principal subspaces

Sparse Principal Subspaces (SPS) is an unsupervised statistical technique that simultaneously performs variable selection and dimension reduction for multivariate data. It builds on the main idea of PCA by imposing the following constraint: the principal subspace of variation is spanned mostly by a smaller number of variables. SPS avoids problems such as non-orthogonality and ill-posedness that can plague adhoc extensions of 1-dimensional sparse PCA methods to d-dimensions. It does this by estimating an orthonormal basis for the principal subspace all at once, and directly constraining the sparsity of the estimated subspace rather than entries of individual eigenvectors. In this talk I will present theoretical results including non-asymptotic minimax lower and upper bounds for the estimation of the principal subspace, and show that SPS is nearly rate optimal. In addition, I will discuss algorithmic issues, examples in data visualization, and concrete applications in the analysis of neuroimaging data.

Amphitheatre Development in variable screening and selection (organized and chaired by Helen Zhang)

Qing Mai and Hui Zou (University of Minnesota)
The Kolmogorov filter for variable screening in high-dimensional binary classification

Variable screening techniques have been proposed to mitigate the impact of high dimensionality in classification problems, including the t-test marginal screening (Fan & Fan, 2008) and the maximum marginal likelihood screening (Fan & Song, 2010). However, these methods rely on strong modeling assumptions that are easily violated in real applications. To circumvent the rigid assumptions, we propose a new variable screening technique for binary classification based on the Kolmogorov-Smirnov statistic, and hence the name Kolmogorov filter. We prove that the Kolmogorov filter enjoys the sure screening property under much weakened model assumptions. We supplement our theoretical study by a simulation study. This is joint work with Prof. Hui Zou.

Sijian Wang (University of Wisconsin, Madison)
Group and within group variable selection via convex penalty

In many scientific and engineering applications, predictors are naturally grouped, for example, in biological applications where assayed genes or proteins can be grouped by biological roles or biological pathways. When the group structures are available among predictors, people are usually interested in identifying both important groups and important variables within the selected groups. Among existing successful group variable selection methods, some methods fail to conduct the within group selection. Some methods are able to conduct both group and within group selection, but the corresponding objective function is non-convex, which may require extra numerical effort. In this talk, we will present a convex penalty for both group and within group variable selection. We develop an efficient group-level coordinate descent algorithm for solving the corresponding optimization problem. We also study the non-asymptotic properties of the estimates in the high-dimensional setting, where the number of predictors can be much larger than the sample size. Numerical results indicate that the proposed method works well in terms of both variable selection and prediction accuracy. We also apply the proposed method to American Cancer Society Breast Cancer Survivor dataset. This is joint work with Zhigeng Geng and Grace Wahba.

Yichao Wu (North Carolina State University)
Two-dimensional solution surface for weighted support vector machines

The support vector machine (SVM) is a popular method for binary classification and its natural extension is the weighted SVM (WSVM) in which observations from different classes are imposed with different weights. Consequently there are two parameters associated with the WSVM: one is the regularization parameter and the other is the weight parameter. Although the marginal behavior of the WSVM solution have already been explored, it is largely unknown how the WSVM solution changes with respect to these two parameters jointly. We establish in this article the joint piecewise-linearity of the WSVM solution in terms of the regularization parameter and the weight parameter. By taking advantage of the joint piecewise-linearity, an efficient algorithm is then proposed to obtain the entire two-dimensional solution surface. Our contribution is two-fold. First we provide a theoretical understanding on how the WSVM solution changes when the two parameters vary. Second this can address assorted potential issues in practical applications as the WSVM solution is readily available for any pair of regularization parameter and weight parameter. We demonstrate its usefulness by applying the proposed two-dimensional solution surface algorithm to tune the regularization parameter for the probability estimation method originally developed by Wang et al. (2008). A joint work with Seung Jun Shin and Hao Helen Zhang.

10:30-11:30am Plenary Talk (chaired by Xiaotong Shen)
Amphitheatre Art Owen (Stanford University)
Bootstrapping r-fold tensor data

This work is joint with Dean Eckles, Facebook. The famous Netflix data is a sparsely sampled table with rows for customers and columns for movies (or vice versa). Both movies and rows are naturally modeled as random effects. Bootstrapping such data is problematic: no proper bootstrap can exist, according to a theorem of Peter McCullagh. Resampling rows and columns independently is effective though slightly conservative. Computerized data gathering frequently produces data sets with three-way or even higher order data tables. We present a bootstrap for such tensor valued data. Our version uses independent weights instead of multinomial ones. It remains mildly conservative. Poisson weights are close to the original bootstrap, but binary weights have computational and statistical advantages. Under certain conditions a single bootstrap replicate suffices to give a variance estimate. We apply our method to show that Facebook users in the UK make longer comments than those in the US when using their phone. The opposite pattern holds for comments made via computer.

1:00-2:30pm Parallel Sessions
East Room Graphical modeling for high-dimensional problems (organized by Daniela Witten and chaired by David Mease)

Hyonho Chun (Purdue University)
A sparse conditional graphical model for non-independent measurements

For the purpose of inferring a network, we consider a sparse Gaussian graphical model (SGGM) under the presence of a population structure, which often occurs in genetic studies with model organisms. In these studies, data are obtained by combining multiple lines of inbred organisms or by using outbred animals. Ignoring such population structures would produce false connections in a graph structure, but most research in graph inference has been focused on independent cases. On the other hand, in regression settings, a linear mixed effect model has been widely used in order to account for correlations among observations. Besides its effectiveness, the linear mixed effect model has a generality; the model can be stated within a framework of penalized least squares. This generality makes it very flexible for utilization in settings other than regression. In this proposal, we adopt a random effect model to an SGGM. Our formulation fits into the recently developed conditional Gaussian graphical model in which the population structures are modeled as predictors and the graph is determined by a conditional precision matrix. The proposed approach is applied to the network inference problem of two datasets: heterogeneous stock (HS) and heterogeneous mice diversity panel (HMDP) datasets.

Pradeep Ravikumar (University of Texas at Austin)
One the use of greedy methods for learning discrete and Gaussian graphical models

Undirected graphical models, also known as Markov random fields, are widely used in a variety of domains, including biostatistics, natural language processing and image analysis among others. They compactly represent distributions over a large number of variables using undirected graphs which encodes conditional independence assumptions among the variables. Recovering this underlying graph structure is thus important for many of these applications of MRFs, especially under constrained settings where the number of variables is large, and the samples are limited. In this talk, we address the problem of learning the structure of a pairwise graphical model from samples in a high-dimensional setting. Classical approaches to this problem have ranged over various heuristic approaches, greedy methods being one class of them, but a line of recent research have proposed principled convex regularization based M-estimators, that have been shown to be not only computationally practical, but to also enjoy strong statistical guarantees. In this talk, we revisit a classical and simple greedy algorithm, that just iteratively adds and deletes edges. Surprisingly, we show that when these forward and backward steps are performed appropriately, we obtain a state of the art method for recovering graphical model structure. Indeed, not only does it enjoy obvious computational advantages over regularized convex optimization based approaches, but we also show that it is sparsistent, or consistent in sparsity pattern recovery, under weaker conditions, and with a smaller sample complexity. Joint work with Ali Jalali, Christopher Johnson.

Daniela Witten (University of Washington)
Fast graphical model estimation in very high dimensions

The graphical lasso, recently proposed for Gaussian graphical modeling in high dimensions, involves estimating an inverse covariance matrix under a multivariate normal model by maximizing the L1-penalized log likelihood. I will begin by presenting a very simple but previously unknown necessary and sufficient condition that can be used to identify the connected components in the graphical lasso solution. This condition can be used to achieve massive computational gains: computing the graphical lasso solution with 20,000 features now takes minutes on a standard desktop machine, whereas previously the computations were prohibitive. This opens up new doors for rigorous network analysis of high-dimensional biological data. As a specific example, I will discuss estimation of graphical models under distinct biological conditions, in which we expect some, but not all, aspects of the networks to differ between conditions. An extension of the necessary and sufficient condition developed for the graphical lasso allows for extremely fast network estimation in this setting. Parts of this work are joint with Jerry Friedman, Noah Simon, Pei Wang, and Patrick Danaher.

West Room Variable/feature selection via regularization (organized by Ming Yuan and chaired by Peter Qian)

Yang Feng (Columbia University)
A nonparametric optimal decision rule in high dimensional space

Nonparametric Optimal Decision Rule (NOD) is proposed to tackle the high-dimensional binary classification problem. We assume that the log-likelihood of class conditional densities can be written as a linear combination of log-likelihoods of marginal class conditional densities. After we plug-in estimates of marginal class conditional densities, to solve the classification problem is equivalent to find the best linear classification rule in a nonparametrically transformed feature space. Oracle inequalities are developed regarding the classification error. A vast array of simulation and real data analysis demonstrate the outstanding performance of NOD.

Yiyuan She (Florida State University)
Joint variable and rank selection for parsimonious estimation of high dimensional matrices

The talk discusses joint variable and rank selection for supervised dimension reduction in predictive learning. When the number of responses and/or that of the predictors exceed the sample size, one has to consider shrinkage methods for estimation and prediction. We propose to apply sparsity and reduced rank techniques jointly to attain simultaneous feature selection and feature extraction. A class of estimators are introduced are based on novel penalties that impose both row and rank restrictions on the coefficient matrix. We show that these estimators adapt to the unknown matrix sparsity. A computation algorithm is developed and applied to real world applications in machine learning, cognitive neuroscience and macroeconometrics forecasting. This is joint work with Florentina Bunea and Marten Wegkamp.

Yixin Fang (New York University)
Penalized principal components of heritability

In family studies with multiple continuous phenotypes, we are interested in finding linear combinations of the phenotypes with large heritabilities, which can be considered as new phenotypes for genetic analysis. Here we examine the principal-components approach based on heritability (PCH) for situations where the number of phenotypes is large, and propose the penalized PCH approach for such situations. We evaluate the performance of the new approach using simulation and two family studies.

3:00-4:30pm Parallel Sessions
East Room Structural sparsity pursuit in high-dimensional modeling (organized and chaired by Yiyuan She)

Jiahua Chen (University of British Columbia)
A thresholding algorithm for order selection in finite mixture models

Order selection is an important step in the application of finite mixture models. Classical methods such as AIC and BIC discourage complex models with a penalty directly proportional to the number of mixing components. In contrast, Chen and Khalili(2008) propose to link the penalty to two types of over-fitting. In particular, they introduce a regularization penalty to merge similar subpopulations in a mixture model, where the shrinkage idea of regularized regression is seamlessly employed. However, the new method requires an effective and efficient algorithm. When the popular EM-algorithm is used, we need to maximize a non-smooth and non-concave objective function in the M-step, which is computationally challenging. In this article, we show that such an objective function can be transformed into a sum of univariate auxiliary functions. We then design an iterative thresholding descent algorithm (ITD) to efficiently solve the associated optimization problem. Unlike many existing numerical approaches, the new algorithm leads to sparse solutions and thereby avoids undesirable ad hoc steps. We establish the convergence of the ITD and further assess its empirical performance using both simulations and real-data examples.

Jiashun Jin (Carnegi Mellon University)
Optimality of graphlet screening in high dimensional variable selection

Consider a linear model $Y=X\beta+z$, where X has n rows and p columns and z ~ N(0; In). We assume both p and n are large, including the case of p >> n. The unknown signal vector $\beta$ is assumed to be sparse in the sense that only a small fraction of its components is nonzero. The goal is to identify such nonzero coordinates (i.e., variable selection). We are primarily interested in the regime where signals are both rare and weak so that successful variable selection is challenging but is still possible. Researches on rare and weak signals to date have been focused on the unstructured case, where the Gram matrix $G=X^TX$ is nearly orthogonal. In this paper, G is only assumed to be sparse in the sense that each row of G has relatively few large coordinates (diagonals of G are normalized to 1). The sparsity of G naturally induces the sparsity of the so-called graph of strong dependence (GOSD). The key insight is that there is an interesting interplay between the signal sparsity and graph sparsity: in a broad context, the signals decompose into many small-size components of GOSD that are disconnected to each other. We propose Graphlet Screening (GS) for variable selection. This is a two-step Screen and Clean procedure, where in the first step, we screen subgraphs of GOSD with sequential $\chi^2$-tests, and in the second step, we clean with penalized MLE. The main methodological innovation is to use GOSD to guide both the screening and cleaning processes. For any variable selection procedure $\hat{\beta}$ we measure its performance with the Hamming distance between the sign vectors of $\hat{\beta}$ and $\beta$, and assess the optimality by the convergence rate of the Hamming distance. Compared with more stringent criterions such as exact support recovery or oracle property, which demand strong signals, the Hamming distance criterion is more appropriate for weak signals since it naturally allows a small fraction of errors. We show that in a broad class of situations, Graphlet Screening achieves the optimal rate of convergence in terms of the Hamming distance. Well-known procedures such as the L0-penalization method and the L1-penalization methods do not utilize graph structure for variable selection, so they generally do not achieve the optimal rate of convergence, even in very simple settings and even when the tuning parameters are ideally set.

Elizaveta Levina (University of Michigan)
Link prediction for partially observed networks

Link prediction is one of the fundamental problems in network analysis. In many applications, notably in genetics, a partially observed network may not contain any negative examples of absent edges, which creates a difficulty for many existing supervised learning approaches. We present a new method which treats the observed network as a sample of the true network with different sampling rates for positive and negative examples, where the sampling rate for negative examples is allowed to be 0. The sampling rate itself cannot be estimated in this setting, but a relative ranking of potential links by their probabilities can be, which is sufficient in many applications. To obtain these rankings, we utilize information on node covariates as well as on network topology, and set up the problem as a loss function measuring the fit plus a penalty measuring similarity between nodes. Empirically, the method performs well under many settings, including when the observed network is sparse and when the sampling rate is low even for positive examples. We illustrate the performance of our method and compare to others on an application to a protein-protein interaction network. Joint work with Yunpeng Zhao and Ji Zhu.

West Room Statistical methods for next generation sequencing data analysis in bioinformatics (organized anc chaired by Michael Zhu)

Ming Hu (Harvard University)
Bayesian inference of three-dimensional chromosomal organization

Knowledge of spatial organization of the genome is critical for the study of transcription regulation and other nuclear processes in the cell. Recently, chromosome conformation capture (3C) based technologies such as Hi-C and TCC have been developed to provide a genome-wide, three-dimensional (3D) view of chromatin organization, but analysis algorithms for inferring chromosomal structures from such experiments are still under-developed. In the Hi-C experiment, the spatial proximity of any two genomic loci is represented by a count matrix. The goal of our study is to translate the count matrix into three-dimensional (3D) structure of local genomic domains. To achieve this, we devise a novel Bayesian statistical model linking the observed read counts spanning a pair of loci to the spatial distance between them. Using advanced Monte Carlo computational techniques such as sequential Monte Carlo and hybrid Monte Carlo, we are able to reconstruct the spatial arrangement of local genomic domains in 3D space. When applying our method to a real Hi-C dataset, we can visualize the 3D shape of the local genomic domains, and the predicted spatial distances are consistent with the gold standard florescent in situ hybridization (FISH) data. With the explosive accumulation of high throughput genome-wide chromatin interaction data, the proposed method will have immediate and far-reaching impact in the broader area of biomedical research.

Steve Qin (Emory University)
Inference of correlated hidden Markov models with application to genome-wide studies

Hidden Markov models (HMM) have been widely used to analyze large-scale genomic data because of its ability to incorporate spatial correlation, in areas such as genome-wide mapping of binding sites for regulatory proteins. With advances in sequencing technology, it is now common to analyze data for multiple proteins jointly, but existing algorithms analyze data for each protein separately. However, extending the HMM framework to a multivariate form is not trivial because hidden state space increases quickly with the number of experiments and it is therefore of interest to develop a powerful yet computationally tractable algorithm. Here we present a fast inferential method for correlated hidden Markov models for multiple sequential data (series). Instead of jointly inferring hidden states for all series, we adjust the transition kernel of each series with the hidden state configuration in other correlated series and keep the hidden state inference marginal in each. Through simulation, we show that the new scheme achieves sensitivity comparable to the fully coupled HMM fit at a computational cost as low as fitting an independent HMM for each series separately. The method was applied to the analysis of histone modification data in human T cells.

Han Wu (Purdue University)
Using finite Poisson mixture models for RNA-Seq data analysis and transcript expression level quantification

RNA-Seq has emerged as a powerful technique for transcriptome study. As much as the improved sensitivity and coverage, RNA-Seq also brings challenges for data analysis. The massive amount of sequence reads data, excessive variability, uncertainties, and bias and noises stemming from multiple sources all make the analysis of RAN-Seq data difficult. Despite much progress, RNA-Seq data analysis still has much room for improvement, especially on the quantification of transcript/gene expression levels. In this article, we propose to use finite Poisson mixture models to characterize base pair-level RNA-Seq data and further quantify transcript/gene expression levels. Finite Poisson mixture models combine the strength of fully parametric models with the flexibility of fully nonparametric models, and are extremely suitable for modeling heterogeneous count data such as what we observed from RNA-Seq experiments. In particular, we consider three types of Poisson mixture models and propose to use a BIC-based model selection procedure to adapt the models to individual transcripts. A unified quantification method based on the Poisson mixture models is developed to measure transcript/gene expression levels. The Poisson mixture models and the proposed quantification method were applied to analyze two RNA-Seq data sets and demonstrated excellent performances in comparison with other existing methods. Our approach resulted in better characterization of the data and more accurate measurements of transcript expression levels. We believe that finite Poisson mixture models provide a flexible framework to model RNA-Seq data, and methods developed based on this framework have the potential to become powerful tools for RNA-Seq data analysis.