Introduction
Regression analysis is a very large branch of statistics. In this course we make use of a variety of methods for regression modeling. Below, we first define some concepts that can be used to understand the major distinctions between various approaches to regression. Then we review some specific regression methods along with their key properties.
Before proceeding, note that regression itself is somewhat difficult to define in a way that differentiates it from the rest of statistics. In most cases, regression focuses on a conditional distribution, e.g. the conditional distribution of a variable $y$ given another variable $x$. Any analysis focusing on a conditional distribution can be seen as a form of regression analysis.
Major concepts
Single index models: a single index model is any regression model that is expressed in terms of one “linear predictor” $b_1x_1 + \cdots + b_px_p$, where the $x_j$ are observed covariates (data) and the $b_j$ are unknown coefficients (parameters).
Mean regression: this term refers to any regression analysis where the population target is the conditional mean function $E[y|x]$.
Linear model: Depending on the context, this can mean any of the following: (i) the expected value is linear in the covariates, (ii) the expected value is linear in the parameters, or (iii) the fitted values and/or parameter estimates are linear in the data.
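To make sense (ii) concrete, here is a minimal sketch using simulated data and the statsmodels library (the variable names and parameter values are invented for illustration). The fitted mean function is nonlinear in the covariate but linear in the parameters, so it is still a “linear model” in that sense.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(size=200)

# The design matrix contains x and x^2, so E[y|x] is nonlinear in x
# but linear in the parameters, i.e. a "linear model" in sense (ii).
X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # estimates of the three coefficients
```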
Regression for independent observations: Most of the basic regression methods are suitable for samples of independent observations. More advanced regression methods can be used when the observations are known to be dependent.
Heteroscedasticity: If the conditional variance ${\rm Var}[y|x]$ is constant (i.e. does not depend on $x$), then the conditional distribution of $y$ given $x$ is homoscedastic; otherwise it is heteroscedastic (alternative terminology is “constant variance” or “nonconstant variance”). Heteroscedasticity can be accommodated by some regression procedures (e.g. Poisson regression works best if the mean and variance are equal). Some regression procedures, like ordinary least squares, work best when the population is homoscedastic, but can still give meaningful results (with some loss of power) if the population is heteroscedastic.
Mean/variance relationship: If the variance of a distribution is a function of the mean, there is a mean/variance relationship. For example, in the Gaussian distribution the variance is unrelated to the mean, in the Poisson distribution the variance is equal to the mean, and in the negative binomial distribution the variance has the form $\mu + \alpha\mu^2$, where $\alpha \ge 0$ is a “shape parameter”.
Overdispersion/underdispersion: If the conditional variance of the data is greater than the conditional variance implied by the population model being fit to the data, there is overdispersion. If the conditional variance of the data is less than that implied by the population model, there is underdispersion.
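A short simulation illustrates these mean/variance relationships. The sketch below uses only numpy, and the values of $\mu$ and $\alpha$ are arbitrary choices: the Poisson sample has variance close to its mean, while the negative binomial sample with the same mean is overdispersed.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, alpha, nobs = 4.0, 0.5, 200_000

pois = rng.poisson(mu, size=nobs)

# NB2 parameterization: size n = 1/alpha and prob p = n / (n + mu)
# give mean mu and variance mu + alpha * mu**2.
n = 1.0 / alpha
nb = rng.negative_binomial(n, n / (n + mu), size=nobs)

print(pois.mean(), pois.var())  # both close to mu = 4
print(nb.mean(), nb.var())      # mean near 4, variance near mu + alpha*mu^2 = 12
```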
Repeated measures: This is one reason that data may be non-independent. Repeated measures (or “clustering”) refers to any setting in which the data fall into groups, and the observations in one group are more similar to each other than to observations in other groups (perhaps due to unobserved covariates that are stable within a group).
Marginal regression: This is a form of regression analysis where the estimation target is the marginal regression function $E[y|x]$, even though the data may be clustered or otherwise dependent. The marginal regression function remains an object of interest when the data are dependent, even though it does not capture the relationship between the independent and dependent variables in full. Some methods for marginal regression also give insight into the marginal variance function ${\rm Var}[y|x]$ and marginal covariances ${\rm Cov}[y_1, y_2|x_1, x_2]$.
Multilevel regression: This is an alternative term for “random effects modeling” that is preferred by some people. It emphasizes the fact that in many data sets there are complex inter-relationships between the observations that are not explained by the covariates. These inter-relationships allow us to speak in terms of unobserved “random effects” that are added to the linear predictors of one or more observations. This gives rise to dependence, and also, in nonlinear models, gives rise to different ways of defining a “regression effect”. Multilevel models can also be viewed as a way to model variances and covariances, although these are modeled through random effects, rather than directly.
Conditional/marginal effect (in multilevel regression): In a multilevel model, a “marginal effect” is usually defined as the change in $E[y|x]$ corresponding to a one unit change in a specific covariate $x_k$. A “conditional effect” is usually defined as the change in $E[y|x,u]$ for a one unit change in $x_k$, where $u$ is an unobserved random effect. For linear models, conditional and marginal effects are the same, but in nonlinear models the two types of effects differ. Methods for nonlinear regression target either the marginal effects or the conditional effects, but usually not both. In most cases the conditional effect will be numerically larger than the marginal effect. Note that the word “effect”, while widely used, conveys a causal interpretation that may not be warranted.
Conditional/marginal effect (in single-level regression): Another use of the term “marginal effect” arises in single-level regression models. In this case the marginal effect is the change in $E[y|x_k]$ corresponding to a one unit change in $x_k$, while the conditional effect is the change in $E[y|x_1, \ldots, x_p]$ corresponding to a one unit change in $x_k$, with the other variables $x_j$ for $j \ne k$ held fixed. When referring to this type of marginal effect, the marginal and conditional effects differ even in a linear model (see the numerical sketch at the end of this section).
Parametric/nonparametric regression: This terminology is used inconsistently to refer to different things in different settings. It is often not useful to think of a specific regression model or estimator as being “parametric” or “nonparametric”, since many methods, like linear least squares, can be seen as both: least squares is parametric if placed in a Gaussian setting, and nonparametric if viewed through moments and geometric projections. Another sense of this term is “flexible”, in that certain parts of the regression model (e.g. the mean function) can adapt to increasingly complex population structures as more data become available.
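Here is the promised numerical sketch of the single-level marginal/conditional distinction, using simulated data and statsmodels (the degree of correlation between the covariates is an arbitrary choice). Because $x_1$ and $x_2$ are correlated, the coefficient on $x_1$ differs between the two fits even though both models are linear.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
nobs = 5000
x1 = rng.normal(size=nobs)
x2 = 0.7 * x1 + rng.normal(size=nobs)  # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=nobs)

marginal = sm.OLS(y, sm.add_constant(x1)).fit()
conditional = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(marginal.params[1])     # near 2 + 3*0.7 = 4.1: the marginal effect of x1
print(conditional.params[1])  # near 2: the conditional effect of x1, holding x2 fixed
```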
Models, fitting procedures, and algorithms
Another important distinction to make is between the various regression model structures (e.g. different model parameterizations), and the different ways of fitting a regression model structure to data. For example, the linear mean model is one prominent structural model for regression, in which the conditional mean function $E[y|x]$ is expressed as a linear function of the predictors in $x$. There are many “fitting procedures” that enable one to fit this linear model to data, including least squares, penalized least squares, and many variations of robust regression, maximum likelihood regression, and Bayesian regression. However, all of these fitting procedures fit the same class of models to the data.
In other words, least squares is a fitting procedure that can be used to fit a model to data. The least squares fitting procedure has statistical properties (e.g. it is known to be efficient, consistent, etc. in some settings). A different (e.g. Bayesian or penalized) procedure for fitting the same class of models will have its own, potentially different properties (e.g. it may be consistent in some settings where least squares is not, and vice versa).
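To make this distinction concrete, the sketch below (statsmodels, with simulated data; the penalty weight is an arbitrary choice) fits the same linear mean model by two procedures, ordinary least squares and ridge-penalized least squares, yielding different estimates of the same parameters.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(100, 3)))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=100)

# Same model class (linear mean), two different fitting procedures:
ols_fit = sm.OLS(y, X).fit()
ridge_fit = sm.OLS(y, X).fit_regularized(alpha=0.1, L1_wt=0.0)  # L1_wt=0 gives a ridge penalty

print(ols_fit.params)
print(ridge_fit.params)  # shrunk toward zero relative to the OLS estimates
```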
Algorithms are specific numerical procedures used to implement fitting procedures so that they can be used to fit models to data sets. In some cases, e.g. least squares, the algorithm is essentially exact, and therefore does not impact the statistical properties of the analysis. In a few settings, e.g. regression trees or deep neural networks, “the algorithm is the model”, and it is difficult to distinguish the model structure itself from the estimation approach used to fit the model to data.
Some specific regression analysis methods
Least squares: ordinary least squares (OLS) is the most basic type of curve fitting. It is optimally used when the conditional mean function is linear in the covariates, and the conditional variance is constant. Both of these restrictions can be worked around, however. Nonlinearity of the mean function can be accommodated using basis functions, and heteroscedasticity can be accommodated using inverse variance weights (in which case we are doing “weighted least squares”, or WLS).
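Here is a minimal WLS sketch (statsmodels, simulated data; for simplicity the variance function is taken as known, which is rarely the case in practice):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(1, 5, size=300)
sd = 0.5 * x  # the standard deviation grows with x (heteroscedasticity)
y = 1.0 + 2.0 * x + sd * rng.normal(size=300)

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
wls_fit = sm.WLS(y, X, weights=1.0 / sd**2).fit()  # inverse variance weights

print(ols_fit.params, ols_fit.bse)  # OLS remains consistent, but is less efficient
print(wls_fit.params, wls_fit.bse)  # WLS standard errors are typically smaller
```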
Generalized Linear Models (GLM): GLMs are an extension of linear models that introduce link functions and mean/variance relationships. The link function allows the expected value of the response variable to be expressed as a known transformation of the linear predictor. The mean/variance relationship expresses how the conditional variance of the response given the predictors relates to the conditional mean of the response given the predictors. GLMs are often (but not always) a better alternative to using least squares with a transformed response (e.g. instead of regressing $\log y$ on $x$ using a linear model, regress $y$ on $x$ using a GLM with a log link function). Some people like to emphasize the fact that many GLMs imply a limited domain for the dependent variable of the regression (e.g. the sample space may be limited to non-negative integers, to the positive real line, or to a finite set of values). This is technically true, but may not be the most salient feature of a GLM, and in fact GLMs generally work well even if the domain restriction is violated. Finally, note that while some GLMs are likelihood-based, others are not (e.g. quasi-Poisson, quasi-negative binomial). This leads to the so-called “quasi-likelihood” approach to fitting generalized linear models.
Generalized Estimating Equations (GEE): GEE is an extension of GLM that allows for certain types of statistical dependence between the observations. A GEE is determined by specifying the GLM that it is derived from, and a “working model” for the correlation structure. The fitting and inference in a GEE are robust in the sense that the working dependence model can be misspecified and the estimates and inferences will still be valid (this can be stated in more precise terms, but we will not do that here). GEE estimates the “marginal mean structure”. In the linear case, GEE is closely related to the more basic technique of “generalized least squares” (GLS).
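The sketch below fits both a GLM and a GEE to simulated clustered count data using statsmodels (the group structure and coefficient values are invented for illustration). Both use the Poisson family with its default log link; the GEE adds an exchangeable working correlation to account for the clustering.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n_groups, per_group = 100, 5
groups = np.repeat(np.arange(n_groups), per_group)
x = rng.normal(size=n_groups * per_group)
u = np.repeat(rng.normal(scale=0.5, size=n_groups), per_group)  # shared group effect
y = rng.poisson(np.exp(0.2 + 0.4 * x + u))

X = sm.add_constant(x)

# GLM: Poisson family with the default log link, ignoring the clustering.
glm_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# GEE: the same marginal mean structure, with an exchangeable working correlation.
gee_fit = sm.GEE(y, X, groups=groups, family=sm.families.Poisson(),
                 cov_struct=sm.cov_struct.Exchangeable()).fit()

print(glm_fit.params, glm_fit.bse)
print(gee_fit.params, gee_fit.bse)  # similar estimates, cluster-robust standard errors
```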
Multilevel linear models: multilevel (or mixed) linear models are an extension of the basic linear model in which there are (usually) one or more covariates, and also “random effects” which describe how the observations are correlated with each other. These unobserved random effects can be viewed as missing information that reflects additional structure in the population not captured through the covariates. There is essentially a 1-1 correspondence between mixed linear models and GLS/GEE models, in that both estimate the same population target (the conditional mean function), but using different estimators. The mixed linear model will in most cases give better estimates of variance parameters than GLS/GEE, but may be less robust to misspecification of the dependence structure. It is a very rich framework that can be used to account for a variety of structures in the population that are difficult to model in other ways, including clustering, multilevel (nested) clustering, crossed clustering, and heterogeneous partial associations (e.g. the coefficient for a covariate differs across many known subpopulations).
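A minimal random-intercept sketch using MixedLM from statsmodels (simulated data; a real analysis might require richer random effects structures, e.g. random slopes or nested clusters):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n_groups, per_group = 50, 10
groups = np.repeat(np.arange(n_groups), per_group)
x = rng.normal(size=n_groups * per_group)
u = np.repeat(rng.normal(size=n_groups), per_group)  # group-level random intercepts
y = 1.0 + 2.0 * x + u + rng.normal(size=n_groups * per_group)

# Random-intercept model: y = b0 + b1*x + u_group + error.
mixed_fit = sm.MixedLM(y, sm.add_constant(x), groups=groups).fit()
print(mixed_fit.summary())  # fixed effects plus the estimated random intercept variance
```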
Multilevel GLMs: these are among the most challenging classes of regression models, especially from a computational perspective. Structurally, they are very similar to linear mixed models, and in practice they can be interpreted in a similar way, except for the important distinction that in a multilevel GLM the marginal and conditional mean structures differ (which is not the case for a multilevel linear model).
Other forms of regression:
Survival regression – this is a large set of techniques for handling censored data, most commonly censored event times.
Quantile regression – this refers to any method that linearly relates a specified conditional quantile of the response (often the median) to the covariates; a brief sketch appears after this list.
Conditional regression – this is a useful but narrowly applicable “trick” in which by conditioning on certain statistics, a multilevel model is essentially converted into a single-level model. The most familiar forms of this technique are single-level conditional logistic and Poisson regression. In both cases, we can have clustered data (which would more often be handled using mixed effects or GEE), but by conditioning on the observed total of the outcome values within each group, the observations become conditionally independent, and can be rigorously fit using a single-level likelihood approach.
Variance regression – this is a class of approaches that parametrically model the variance along with the mean, e.g. log(variance) is modeled as a linear function of the covariates.
Local regression – this is a very flexible approach to capturing nonlinear regression relationships. It is an example of a regression method that is not fitting a single-index model; it is generally seen as being limited by the “curse of dimensionality”, so that it cannot be applied with more than a handful of covariates, unless the sample size is very large.
Additive regression – this is a way to restrict the general kernel regression technique to avoid the curse of dimensionality. The conditional mean function $E[y|x]$ is modeled as $g_1(x_1) + \cdots + g_p(x_p)$, where the $g_j(\cdot)$ are unknown univariate functions. The model is additive over the covariates, which is a strong restriction, but it generalizes classical linear models by allowing each covariate to be transformed in an arbitrary way.
Dimension reduction regression – this is a distinct class of regression approaches that posit a multi-index structure and an unknown link function. Specifically, $E[y|x]$ is modeled as having the form $g(b_1^\prime x, \ldots, b_k^\prime x)$, where the $b_j$ are vectors of regression coefficients and $g$ is an unknown link function. The focus is on estimating the regression “directions” $b_j$, not on the link function.
Generalized method of moments (GMM) – this is a technique for efficiently estimating the parameters of nonlinear models using only their moments. It is mainly used when it is important to estimate regression effects without specifying a model for the full conditional distribution.
Multivariate regression – these are techniques for regressing a vector of dependent variables on a vector of independent variables. There is some “sharing of information” in these models that allows the regressions to be fit more accurately as a collection, rather than performing several standard regressions, one for each component of the vector of dependent variables.
Machine learning/algorithmic regression – this is a broad and loosely-defined collection of methods for regression analysis in which complex formal representations of regression functions, e.g. trees, ensembles of trees, or neural networks, are fit to data. The distinction between “machine learning” and “statistical regression” is generally artificial and not useful to make, but some techniques, especially neural networks, are often viewed as being part of machine learning.
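As noted in the quantile regression item above, here is a brief sketch using QuantReg from statsmodels on simulated data (the quantile levels and the form of the heteroscedasticity are arbitrary choices). Because the conditional spread of $y$ grows with $x$, the fitted slope differs across quantiles.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 5, size=1000)
y = 1.0 + 2.0 * x + (0.5 + 0.5 * x) * rng.normal(size=1000)  # spread grows with x

X = sm.add_constant(x)
for q in (0.25, 0.5, 0.75):
    fit = sm.QuantReg(y, X).fit(q=q)
    print(q, fit.params)  # the slope increases with q under this heteroscedasticity
```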