Overview of basic mixed regression

2019/10/26

Introduction

Linear mixed modeling is one of the most widely-applicable analysis frameworks in modern applied statistics. Here we focus on one common use-case for this methodology, which is to rigorously conduct a linear regression analysis in a setting where there are known sources of correlation among the observed values. While standard linear regression methods (e.g. ordinary least squares) can sometimes be used when the observations are statistically dependent, doing this will generally result in incorrect standard errors, and the coefficient estimates will be less precise than they could be. Linear mixed modeling is widely seen as an appropriate and rigorous way to address these issues (although there are good alternatives including GLS and GEE).

It is helpful to think about why two values in a data set may be correlated. One common setting is that the two values are “repeated measures” taken on the same sampled unit. For example, you may have measurements of a person’s eyesight in their left and right eyes. If the person has poor vision (e.g. due to a genetic condition), in most cases both eyes will have poor vision. If the person has good vision, both eyes will tend to have good vision. While there can be situations, such as injury or disease, that lead to a person having good vision in one eye and poor vision in the other, in general eyesight will be similar in both of a person’s eyes. As a result, there is a correlation between left-eye and right-eye vision quality when grouping the measurements by person.

A common source of correlation in many data sets is the effects of covariates that we do not observe. For example, suppose we observe cancer death rates by census tract, and the census tracts are nested in counties. Cancer is much more common in older people compared to younger people, and census tracts differ in their age compositions (e.g. some census tracts have a larger fraction of older people than others). Two census tracts in the same county are more likely to have similar age compositions than two census tracts in different counties. Therefore, census tracts in the same county will be more similar in terms of cancer rates (i.e. they are correlated).

Note that in many cases correlation will change after controlling for relevant covariates. In the cancer death rates example, the correlation might be reduced if we had access to the age composition of the census tracts and included this information as covariates in the regression model.

A final and very common setting where we have correlation between observations in a dataset is if the dataset is longitudinal. This means that we collect data on the same set of units (e.g. people) at repeated points in time. For example, if we have a sample of children and measure each child’s height every year, then we would have longitudinal data. A child who is taller than average one year will tend to be taller than average the next year. This is a form of temporal (“auto”) correlation.

Note that many data sets may exhibit correlation for more than one of the reasons discussed above. For example, we may have longitudinal data measured on units that are also clustered.

Terminology and alternative approaches

Several terms are in common use when referring to mixed linear models. These models are also called “random effects models”, “varying coefficients models”, “mixed effects models”, “multilevel models”, and “hierarchical models”. The term “variance components models” is also sometimes encountered. To a large degree, these terms all refer to the same approach.

While linear mixed modeling is probably the most commonly-used way to work with correlated data in a regression framework, it is not the only valid approach for doing so. Generalized Least Squares (GLS), or Generalized Estimating Equations (GEE) are popular alternatives, and there are many other options.

In addition to the linear models that we focus on here, there are also many settings where we may wish to conduct a nonlinear regression with correlated data. Popular approaches for this type of analysis include generalized linear mixed effects modeling, nonlinear mixed effects modeling, and the Generalized Estimating Equations (GEE) approach mentioned earlier.

Basic example (random intercepts)

A basic task that can be handled with this framework would arise if we wanted to estimate the overall mean of a population using a clustered sample. The model we will use here is often referred to as a “random intercepts model”, with the random intercepts reflecting the net effect of all unobserved covariates influencing all observed values in a particular group. A problem with this terminology is that the term “random” may imply that the intercepts are completely unpredictable. In fact, the use of the term “random” here only refers to a technical aspect of the way that the intercepts are modeled. An alternative term is “varying intercepts model”, which emphasizes that the intercepts differ across the groups, rather than emphasizing that they are modeled as random quantities.

Marginal model

One way to view this is through the marginal correlations. Suppose for illustration that we have two groups and two observations per group. Our dataset thus contains four values, and the correlation matrix among these four values might look like what is shown below (the data here are denoted $y_{ij}$, for the jth value observed in the ith group).

$$ \begin{array}{l|cccc} & y_{11} & y_{12} & y_{21} & y_{22}\\\hline y_{11} & 1 & r & 0 & 0\\ y_{12} & r & 1 & 0 & 0\\ y_{21} & 0 & 0 & 1 & r\\ y_{22} & 0 & 0 & r & 1 \end{array} $$

The parameter $r$ appearing in the matrix above is the unknown correlation between two observations in the same group. If $r$ is large, this means that the factors causing observations within a group to be more similar than the observations in different groups are strong. If $r$ is close to zero, then these factors are weak. While it is mathematically possible for $r$ to be negative, we usually take $r$ to be non-negative in this type of model.

In addition to the correlation structure shown above, to define a complete model we also need to specify the marginal variance structure and the marginal mean structure. Here we take the data to be homoscedastic, so there is a single variance parameter $\sigma^2$, and the marginal covariance matrix is obtained by taking the matrix above and multiplying all elements by $\sigma^2$:

$$ \begin{array}{l|cccc} & y_{11} & y_{12} & y_{21} & y_{22} \\\hline y_{11} & \sigma^2 & r\sigma^2 & 0 & 0\\ y_{12} & r\sigma^2 & \sigma^2 & 0 & 0\\ y_{21} & 0 & 0 & \sigma^2 & r\sigma^2\\ y_{22} & 0 & 0 & r\sigma^2 & \sigma^2 \end{array} $$

As stated above, here we are interested in estimating the population mean, so the marginal mean structure can be written $E[y_{ij}] = \mu$. Overall, this model has three parameters: $\mu, r, \sigma^2$.
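To make the marginal specification concrete, the sketch below (in Python with numpy; the parameter values are arbitrary illustrations, not estimates) constructs this 4×4 marginal covariance matrix from chosen values of $r$ and $\sigma^2$.

```python
import numpy as np

# Illustrative (assumed) parameter values
r = 0.5        # within-group correlation
sigma2 = 2.0   # common marginal variance

# Correlation matrix for two groups with two observations each
R = np.array([
    [1, r, 0, 0],
    [r, 1, 0, 0],
    [0, 0, 1, r],
    [0, 0, r, 1],
], dtype=float)

# Marginal covariance matrix: every element scaled by sigma^2
V = sigma2 * R
print(V)
```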

Data generating model

A linear mixed model can always be defined through the marginal mean and marginal variance/covariance structure, as defined above. It is also possible to define a linear mixed model through a “generating model”. The generating model that is equivalent to the model expressed in marginal form above is given by:

$$ y_{ij} = \mu + \theta_i + \epsilon_{ij} $$

where $\theta_i$ and $\epsilon_{ij}$ are “random effects”, or “latent variables”. These are random values that we do not observe. The distributions of these random effects are defined using “structural parameters”. In this particular example, the $\theta_i$ values have variance $\tau^2$ and the $\epsilon_{ij}$ values have variance $\sigma^2$. In addition, the $\theta_i$ and $\epsilon_{ij}$ values are mutually independent and have mean zero. The structural parameters of this variance/covariance structure are $\tau^2$ and $\sigma^2$.
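The generating form lends itself directly to simulation. Below is a minimal sketch (numpy, with illustrative parameter values and variable names) that draws a clustered data set from this model, one row per observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative structural parameters: mu, tau^2, sigma^2
mu, tau2, sigma2 = 10.0, 1.0, 2.0
n_groups, group_size = 200, 4

# One random effect per group, one residual per observation
theta = rng.normal(0, np.sqrt(tau2), size=n_groups)
eps = rng.normal(0, np.sqrt(sigma2), size=(n_groups, group_size))

# y_ij = mu + theta_i + eps_ij, stored in long format with a group label per row
y = (mu + theta[:, None] + eps).ravel()
group = np.repeat(np.arange(n_groups), group_size)
```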

Equivalence between marginal and data generating models

Every model specified in marginal form can be expressed through a data generating model, and vice versa. These are simply two different ways of conveying the same information. If we want to find the marginal model for this data generating model, we can calculate the marginal moments directly. For example, when $j\ne k$ we have the following covariance:

$$ \begin{eqnarray*} {\rm cov}(y_{ij}, y_{ik}) &=& {\rm cov}(\mu+\theta_i+\epsilon_{ij}, \mu+\theta_i+\epsilon_{ik})\\ &=& {\rm cov}(\theta_i + \epsilon_{ij}, \theta_i + \epsilon_{ik})\\ &=& {\rm cov}(\theta_i, \theta_i)\\ &=& {\rm var}(\theta_i)\\ &=& \tau^2. \end{eqnarray*} $$

We can similarly calculate that ${\rm var}(y_{ij}) = \tau^2 + \sigma^2$. Thus the correlation parameter $r$ is

$$ r = \frac{{\rm cov}(y_{ij}, y_{ik})}{{\rm sd}(y_{ij}){\rm sd}(y_{ik})} = \frac{{\rm cov}(y_{ij}, y_{ik})}{{\rm var}(y_{ij})} = \frac{\tau^2}{\tau^2+\sigma^2}. $$

The value $r$ defined above is known as the “intraclass correlation coefficient”, or “ICC”. It is a measure of how strongly the group membership associates with the value of the response variable $y_{ij}$. The ICC falls between zero and one. If the ICC is equal to zero, the group membership is totally unrelated to the response value. If the ICC is equal to one, the response variable is entirely determined by the group membership.
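As a sanity check on this calculation, the short simulation below (numpy, with illustrative parameter values) compares the theoretical ICC $\tau^2/(\tau^2+\sigma^2)$ to the sample correlation between pairs of observations drawn from the same group.

```python
import numpy as np

rng = np.random.default_rng(1)
tau2, sigma2, mu = 1.0, 2.0, 10.0     # illustrative structural parameters
n_groups = 100000

# Two observations per group, sharing the same group effect theta_i
theta = rng.normal(0, np.sqrt(tau2), size=n_groups)
y1 = mu + theta + rng.normal(0, np.sqrt(sigma2), size=n_groups)
y2 = mu + theta + rng.normal(0, np.sqrt(sigma2), size=n_groups)

icc_theory = tau2 / (tau2 + sigma2)          # 1/3 for these values
icc_empirical = np.corrcoef(y1, y2)[0, 1]    # sample correlation of within-group pairs
print(icc_theory, icc_empirical)             # should agree closely
```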

Random slopes example

The random intercepts example discussed above is only scratching the surface of what can be done with a mixed model. Here we develop an example with both random intercepts and random slopes, and discuss in what contexts it may be useful.

Starting with the data generating form of the model, a simple random slopes model would have this form:

$$ y_{ij} = \alpha + \beta x_{ij} + \theta_i + \gamma_i x_{ij} + \epsilon_{ij} $$

In this model, the data are partitioned into groups, indexed by i. There is a common intercept ($\alpha$), a covariate ($x_{ij}$), and a common slope $\beta$ for this covariate. Each group has its own intercept $\theta_i$ and slope $\gamma_i$. If we look within a specific group, we would see a linear trend with intercept $\alpha + \theta_i$ and slope $\beta + \gamma_i$.

As discussed above, although the group-specific intercepts and slopes are technically random (they are modeled as random variables), it might be better to describe them as “varying” to convey the more important fact that they differ among the groups.

The structural parameters for the random slopes model are $\alpha$, $\beta$, $\tau_\theta^2 = {\rm var}(\theta_i)$, $\tau_\gamma^2 = {\rm var}(\gamma_i)$, and $\sigma^2 = {\rm var}(\epsilon_{ij})$. In addition, there may be a parameter capturing a correlation between $\theta_i$ and $\gamma_i$, although this is often fixed at zero.

The marginal covariance structure for two groups, with two observations per group, can be expressed as

$$ \left(\begin{array}{cccc} \tau_\theta^2 + x_{11}^2\tau_\gamma^2 + \sigma^2 & \tau_\theta^2 + x_{11}x_{12}\tau_\gamma^2 & 0 & 0\\ \tau_\theta^2 + x_{11}x_{12}\tau_\gamma^2 & \tau_\theta^2 + x_{12}^2\tau_\gamma^2 + \sigma^2 & 0 & 0\\ 0 & 0 & \tau_\theta^2 + x_{21}^2\tau_\gamma^2 + \sigma^2 & \tau_\theta^2 + x_{21}x_{22}\tau_\gamma^2\\ 0 & 0 & \tau_\theta^2 + x_{21}x_{22}\tau_\gamma^2 & \tau_\theta^2 + x_{22}^2\tau_\gamma^2 + \sigma^2 \end{array}\right) $$

We see that unlike the basic random intercepts model, the random slopes model exhibits heteroscedasticity (the diagonal elements of the above covariance matrix are not all the same). The variances depend on the covariate values, as do the covariances between values in the same group.
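One way to see this structure is to build the within-group covariance matrix directly. The helper below is a sketch (the function name and parameter values are illustrative) that computes the marginal covariance for one group from its covariate values, assuming the random intercepts and slopes are independent.

```python
import numpy as np

def group_cov(x, tau_theta2, tau_gamma2, sigma2):
    """Marginal covariance of one group's observations in the random slopes model
    (random intercept and slope assumed independent)."""
    x = np.asarray(x, dtype=float)
    V = tau_theta2 + tau_gamma2 * np.outer(x, x)   # shared intercept and slope contributions
    V += sigma2 * np.eye(len(x))                   # residual variance on the diagonal
    return V

# Illustrative covariate values for one group with two observations
print(group_cov([1.0, 3.0], tau_theta2=1.0, tau_gamma2=0.5, sigma2=2.0))
# The diagonal entries differ whenever the x values differ (heteroscedasticity)
```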

Mixed effects models

The term “mixed” in a “mixed effects model” refers to the presence of random effects, like the random intercepts and slopes studied above, in a model together with “fixed effects”, usually meaning regression slopes that do not vary from group to group. A model with no such fixed parameters is sometimes referred to as a “variance components model”.

Predicting random effects (BLUPs)

Mixed effects models are estimated in marginal form, usually via maximum likelihood estimation. The likelihood of the model is a function of the structural parameters, defined above, so it is only these parameters that need to be considered in the model fitting. A random effects model that looks somewhat complex in data generating form can involve surprisingly few structural parameters. For example, in the basic random effects model above, there may be many groups, each with its own random effect $\theta_i$, but the model only has three structural parameters ($\mu, \tau^2, \sigma^2$). For this reason mixed models are very useful for accommodating very complex data sets without introducing a huge number of parameters.

Although the random effects are not present in the likelihood, and hence are not directly estimated during model fitting, after a mixed model has been fit we can predict the random effects. These predictions are called “Best Linear Unbiased Predictors” (BLUPs). This terminology is widely used, although it is debatable as to whether the predictors are actually “best”, “unbiased”, or “linear”. Note that we use the term “predict” here, not “estimate”, because the $\theta_i$ are random variables.

The BLUPs for the simple variance components model described above are quite illuminating about how mixed models work. The BLUP for predicting the random effect for group i is:

$$ \hat{\theta}_i = \frac{n_i \hat{\tau}^2}{\hat{\sigma}^2 + n_i\hat{\tau}^2} \cdot (\bar{y}_{i\cdot} - \hat{\mu}). $$

where $n_i$ is the number of observations in group i.
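As a sketch, the BLUP formula can be written as a small function (the names below are illustrative, and in practice the parameter estimates would come from a fitted model).

```python
def blup(ybar_i, n_i, mu_hat, tau2_hat, sigma2_hat):
    """Predicted random effect for group i in the random intercepts model."""
    weight = n_i * tau2_hat / (sigma2_hat + n_i * tau2_hat)  # shrinkage weight in [0, 1)
    return weight * (ybar_i - mu_hat)

# A group of 5 observations whose mean is 1 unit above the estimated overall mean
print(blup(ybar_i=11.0, n_i=5, mu_hat=10.0, tau2_hat=1.0, sigma2_hat=2.0))  # about 0.71
```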

The expression shown above for group i illustrates how mixed models can be viewed as carrying out a form of “partial pooling”. At one extreme, if $\tau^2 = 0$, we have $\hat{\theta}_i = 0$, which is the “fully pooled” prediction of $\theta_i$. “Full pooling” in this setting means that we combine all the data equally, giving no special preference for the data values that belong to group i when predicting the random effect for group i. At the other extreme, when $\sigma^2=0$ we have $\hat{\theta}_i = \bar{y}_{i\cdot} - \hat{\mu}$, which is the unpooled prediction of $\theta_i$. This is an unpooled estimate because the predicted value for $\theta_i$ only depends on the data for group i.

The ability of mixed models to produce partially pooled estimates, and to do this automatically and adaptively based on the data, gives them better performance than other approaches that are only able to either fully pool or not pool at all. When there is evidence in the data set as a whole that the groups are different, then $\tau^2$ is large relative to $\sigma^2$ and the BLUPs are closer to the unpooled predictions – the observed data for group i are primarily used to predict the random effect for group i, because the groups are quite different from each other. But when $\tau^2$ is small relative to $\sigma^2$, the groups are quite similar, and we gain power by using a heavily pooled prediction.

Mixed model BLUPs are also adaptive in terms of group sizes. A “balanced” data set in this setting is one in which all the group sizes ($n_i$) are the same, while an unbalanced data set is one in which the group sizes are quite different. Unpooled analyses can perform poorly with highly unbalanced data, because the unpooled predictions of the random effects $\theta_i$ have very different precisions (when $n_i$ is large, the precision is high, and vice versa). Partially pooled predictions have more uniform precisions across a range of group sizes. The BLUP of $\theta_i$ for a larger group is less pooled than the BLUP of $\theta_i$ for a smaller group. It is advantageous to use more pooling for a smaller group, because there is less information available for predicting its random effect solely from the data in that group.
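The role of the group size is easy to see by computing the shrinkage weight $n_i\hat{\tau}^2/(\hat{\sigma}^2 + n_i\hat{\tau}^2)$ for a range of group sizes (illustrative values below).

```python
tau2_hat, sigma2_hat = 1.0, 2.0   # illustrative variance estimates
for n_i in [1, 2, 5, 20, 100]:
    weight = n_i * tau2_hat / (sigma2_hat + n_i * tau2_hat)
    print(n_i, round(weight, 3))  # the weight approaches 1 (no pooling) as the group grows
```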

Partial pooling also happens in the random slopes model, but in a more complex manner. The BLUP $\hat{\gamma}_i$ for the slope random effect is not simply the slope obtained by analyzing the data from group i in isolation (or the difference between this estimate and the overall slope estimated by $\hat{\beta}$). Instead, the predicted slope is shrunk toward zero, with greater shrinkage for smaller groups and less shrinkage for larger groups.

Fitting mixed models

The immediate goal in fitting a linear mixed model is to estimate the structural parameters defining the means, variances, and covariances. In practice this is usually accomplished using maximum likelihood estimation, taking the random effects to have Gaussian distributions. Note that all the parameters of these models, and all the quantities required to construct the BLUPs, are expressed as first and second moments. As a result, linear mixed model estimates should not be strongly sensitive to whether the random effects are actually normally distributed. This is important, since we never observe the random effects directly, so it would be unconvincing to use a modeling framework that is sensitive to the random effects following a distribution which cannot be checked from the data.
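As an illustration of how such a model might be fit in practice, the sketch below simulates a random intercepts data set and fits it with the MixedLM class in statsmodels (the data, column names, and parameter values are all illustrative). The fitted result exposes the estimated fixed effects, variance components, and BLUPs.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a small random intercepts data set (illustrative values)
rng = np.random.default_rng(2)
n_groups, group_size = 100, 5
theta = rng.normal(0, 1.0, size=n_groups)                  # group random effects
group = np.repeat(np.arange(n_groups), group_size)
x = rng.normal(size=n_groups * group_size)
y = 10.0 + 0.5 * x + theta[group] + rng.normal(0, 1.5, size=n_groups * group_size)
df = pd.DataFrame({"y": y, "x": x, "group": group})

# Fit a linear mixed model with a random intercept for each group
# (REML by default; pass reml=False to fit by maximum likelihood)
result = smf.mixedlm("y ~ x", data=df, groups=df["group"]).fit()

print(result.summary())        # fixed effects and variance components
print(result.cov_re)           # estimated variance of the random intercepts (tau^2)
print(result.scale)            # estimated residual variance (sigma^2)
blups = result.random_effects  # dict mapping each group label to its predicted random effect
```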

As with most forms of statistical analysis, inference in mixed models is challenging when the sample size is small. While it is always possible to use an asymptotically-motivated standard error (such as what can be obtained using the Fisher information matrix), the fact that these are justified through an asymptotic argument makes them suspect in smaller samples. This is particularly true for the structural variance parameters of a mixed model, whose sampling distributions are known to converge more slowly to the limiting Gaussian approximation than those of the mean structure parameters. A further complication is that the variance parameters are domain-restricted, since a variance cannot be negative. As a result, some standard inference procedures, such as the log likelihood ratio test, cannot be applied in their usual form when testing whether a variance parameter is zero, a value on the boundary of the parameter space.

Many effective uses for mixed models in practice do not require us to do traditional inferential analysis on variance parameters. For example, it is rarely of great interest to test $\tau^2=0$ in the basic random intercepts model discussed above. When the focus is more on the mean structure parameters, the variance parameters are being used to adaptively select the appropriate level of pooling, so there is no need to formally test the variance structure. Even when the variance structure is of primary interest, it is generally enough to assess its structure using goodness of fit or other model evaluation procedures, rather than thinking in terms of formal tests of sharp null hypotheses.