Generalized Linear Models

2019/10/01

Generalized Linear Models (GLMs) are a class of single-index regression models that substantially extend the range of analyses that can be carried out relative to linear models. A single-index model expresses the conditional mean function $E[Y|X=x]$ through a single linear predictor (a linear function of the covariates):

$$ \beta_0 + \beta_1x_1 + \cdots + \beta_p x_p. $$

A linear model is a specific type of GLM that models the conditional mean function directly as the linear predictor:

$$ E[Y|X=x] = \beta_0 + \beta_1x_1 + \cdots + \beta_p x_p. $$

In general, a GLM relates the conditional mean to the linear predictor via a link function $g$:

$$ g(E[Y|X=x]) = \beta_0 + \beta_1x_1 + \cdots + \beta_p x_p. $$

The link function should be invertible, so we can also write

$$ E[Y|X=x] = g^{-1}(\beta_0 + \beta_1x_1 + \cdots + \beta_p x_p). $$
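
A minimal sketch of these pieces in Python with statsmodels, using simulated data (the covariate, coefficient values, and variable names are all illustrative):

```python
import numpy as np
import statsmodels.api as sm

# Simulate a Poisson response with a log link:
# log E[Y|X=x] = 0.5 + 1.0*x, so E[Y|X=x] = exp(0.5 + 1.0*x).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 500)
y = rng.poisson(np.exp(0.5 + 1.0 * x))

X = sm.add_constant(x)  # adds the intercept column
result = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Fitted values are on the mean scale: applying the inverse link
# g^{-1} = exp to the linear predictor recovers them exactly.
lin_pred = X @ result.params
assert np.allclose(result.fittedvalues, np.exp(lin_pred))
```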

Relationship to linear models with a transformed response variable: In most cases the link function will not be linear, and for a nonlinear function $g$, $g(E[Y|X=x])$ is not equal to $E[g(Y)|X=x]$. So while there is some superficial similarity between fitting a GLM and fitting a linear model with a transformed dependent variable $g(Y)$, these approaches to regression can perform quite differently.
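
The difference is easy to see numerically; below is a small check (the Gamma distribution is just an example of a skewed positive variable):

```python
import numpy as np

# For a nonlinear g such as the logarithm, g(E[Y]) != E[g(Y)]
# (Jensen's inequality), so a log-link GLM and OLS on log(y)
# estimate different targets.
rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=3.0, size=100_000)  # E[Y] = 6
print(np.log(y.mean()))  # log E[Y]  ~ 1.79
print(np.log(y).mean())  # E[log Y] ~ 1.52, systematically smaller
```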

Variance functions: GLMs specify a variance function as well as a mean function. For a basic GLM, the variance function has the form

$$ {\rm Var}[Y|X=x] = \phi \cdot V(E[Y|X=x]). $$

The variance is thus expressed through a “mean/variance relationship”: the variance is a function of the mean, up to a multiplicative “scale parameter” $\phi \in \mathbb{R}^+$.

GLMs and parametric probability models: Most basic GLMs can also be seen as being equivalent to maximum likelihood analysis for a particular parametric distribution. However, as discussed below, there is an alternative “quasi-likelihood” way to understand GLMs that does not emphasize likelihoods or parameterized probability distributions.

Some examples of GLMs that are equivalent to parametric models are:

- linear regression (Gaussian family with the identity link);
- Poisson regression (Poisson family with the log link);
- logistic regression (binomial family with the logit link);
- Gamma regression (Gamma family, usually with the log or inverse link).

When viewed in terms of parameterized probability distributions, a GLM can be seen as indexing a (usually) infinite family of distributions through the covariate vector $x$. That is, in a Poisson GLM, $P(Y|X=x)$ is Poisson for every $x$, but with a different mean value.

Note that a GLM describes a collection of conditional distributions, $P(Y|X=x)$, for different values of $x$. A GLM (like any regression procedure) says very little about the marginal distribution $P(Y)$. In a Poisson GLM, repeated observations of $Y$ at the same value of $X$ will follow a Poisson distribution, but the marginal distribution of $Y$, which describes the pooled set of all $Y$ values (taken at different values of $X$) will not be Poisson. Similarly, in a Gaussian linear model, $Y$ values taken at the same $X$ are Gaussian, but the marginal distribution of $Y$ is usually not Gaussian.
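
A quick simulation makes this concrete (the uniform covariate and log-linear mean are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
# At each fixed x the response is exactly Poisson...
x = rng.uniform(0, 3, 100_000)
y = rng.poisson(np.exp(x))
# ...but after pooling over x, the marginal variance far exceeds
# the marginal mean, so the pooled Y values cannot be Poisson.
print(y.mean(), y.var())  # roughly 6.4 and 33 here
```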

GLMs via quasi-likelihood

Just as most of the favorable properties of linear regression do not depend on Gaussianity of the data, most of the favorable properties of GLMs do not depend on the data following the specific family used to define the model. For example, Poisson regression can be used even if the data do not follow a Poisson distribution. The key requirements for attaining good performance with a GLM are that the mean structure and variance structure are correct. The other properties of the distribution (other than the conditional mean and variance) are much less important. This applies as well to domain constraints, so, for example, a Poisson GLM can be fit to data that includes non-integer values, as long as the mean and variance models hold.

One useful example of a GLM fit using quasi-likelihood is “quasi-Poisson” regression, which results from using Poisson regression but allowing the scale parameter $\phi$ to take on values other than 1. In a Poisson distribution, the mean must be equal to the variance. In a quasi-Poisson regression, the variance is equal to $\phi$ times the mean. There is no tractable distribution (i.e., with a simple closed-form PMF) that has this mean and variance structure except when $\phi=1$. Nevertheless, there are certainly intractable probability distributions that have this property, and there is no reason to doubt that a given data set could follow such a distribution. Quasi-Poisson regression is an appropriate way to estimate the mean and variance structure parameters in this setting.
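
In statsmodels, a quasi-Poisson analysis can be obtained by fitting the Poisson family while estimating the scale from the Pearson chi-square statistic (`scale="X2"`); a sketch on simulated overdispersed counts:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, 500)
mu = np.exp(0.5 + 1.0 * x)
# Poisson/Gamma mixture: the latent rate has mean mu and variance
# mu, so the counts satisfy Var[Y|X=x] = 2 * E[Y|X=x] (phi = 2).
y = rng.poisson(rng.gamma(shape=mu, scale=1.0))

X = sm.add_constant(x)
pois = sm.GLM(y, X, family=sm.families.Poisson()).fit()
quasi = sm.GLM(y, X, family=sm.families.Poisson()).fit(scale="X2")
print(quasi.scale)          # estimated phi, close to 2 here
print(pois.bse, quasi.bse)  # quasi SEs are inflated by sqrt(phi)
```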

An important robustness property of quasi-GLMs is that the mean parameters can be estimated consistently even when the variance is misspecified. Incorrect specification of the variance leads to a loss of efficiency, but does not completely invalidate the analysis. For example, we can still justify using Poisson regression to estimate the mean structure of an overdispersed population, as long as we believe that the mean structure (e.g. based on the log link function) is correct. Although this type of robustness gives us consistency of the mean parameter estimates, in general it does not allow us to conduct inference on these estimates. However, we can achieve this goal as well by using quasi-GLM analysis, which in the case of Poisson regression means that we treat the scale parameter $\phi$ as a parameter to be estimated from the data, even though in an actual Poisson distribution it would be fixed at 1.

Mean/variance relationships and “overdispersion”

For data sets that involve counting the number of times that a rare event happens, it is natural to think in terms of a Poisson model. For example, if we have data on the number of car accidents that occur at each roadway intersection in a city during one year, we might imagine that these counts are Poisson-distributed. Recall that a Poisson distribution should describe the number of times that an event occurs, if the event has a constant probability of happening in each small time interval, and if these occurrences are independent. If there is a fixed, small probability of a car accident occurring, say, on each day, then the number of accidents per year might follow a Poisson distribution, as long as the accidents are independent.

If either the independence or homogeneity conditions stated above do not hold, then the resulting counts may not be Poisson-distributed. For example, if there are differences in risk for different days of the week, or during different seasons, then the total number of events in one year will generally not follow a Poisson distribution.

As noted above, to apply Poisson regression, we actually do not need the data to be exactly Poisson-distributed. We only need the specified mean and variance structures to hold. A key property of a Poisson distribution is that the mean is equal to the variance. “Overdispersion” occurs when the variance is greater than the mean (underdispersion is also possible but we will not consider that here). Overdispersion typically results from “heterogeneity” – the total count being modeled is a sum of counts over temporal or spatial sub-units with different event rates.
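
A sketch of this mechanism: if each unit carries its own latent event rate (Gamma-distributed here, an arbitrary choice), the observed counts are overdispersed even though each count is conditionally Poisson:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Homogeneous case: every unit has the same rate.
homog = rng.poisson(5.0, size=n)

# Heterogeneous case: rates vary across units
# (Gamma with mean 5 and variance 10).
rates = rng.gamma(shape=2.5, scale=2.0, size=n)
heter = rng.poisson(rates)

print(homog.mean(), homog.var())  # ~5, ~5: consistent with Poisson
print(heter.mean(), heter.var())  # ~5, ~15: variance > mean
```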

One challenge to understanding overdispersion is that, as noted above, a GLM defines a probability model for each value of the covariate $x$. That is, we observe a value $y$ drawn from the probability model defined by the GLM at $X=x$. Unless there are ties, we only observe one $y$ for each $x$. With a sample of size 1 (at each $x$), it is clearly not straightforward to assess the shape of the distribution or even its variance.

Fortunately, there is a way to estimate the scale parameter $\phi$ even when there are no replicates. We will not provide the details of this procedure here. The important thing to note is that it is possible to estimate the scale parameter from the data. In a Poisson regression, $\phi$ must be equal to 1, but if the data-driven estimate of $\phi$ is much greater than $1$, the data do not fit the Poisson model very well. Nevertheless, we can use the resulting “quasi-Poisson” model with whatever value of $\phi$ is selected by the data. This is one way to address the presence of overdispersion. The main limitation of this approach is that it takes the variance to be proportional to the mean. If the variance is related to the mean in some other way, even the quasi-Poisson model is incorrect.
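
Although we omit the derivation, one standard moment-based estimator (the one used above for the quasi-Poisson fit) divides the Pearson chi-square statistic by the residual degrees of freedom; a self-contained sketch:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 2, 500)
mu = np.exp(0.5 + 1.0 * x)
y = rng.poisson(rng.gamma(shape=mu, scale=1.0))  # Var = 2 * mean

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Sum of squared Pearson residuals over residual df; for the
# Poisson family the variance function is V(mu) = mu.
mu_hat = fit.fittedvalues
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / fit.df_resid
print(phi_hat)                          # ~2 here, far from 1
print(fit.pearson_chi2 / fit.df_resid)  # the same quantity
```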

If we have enough data, it is possible to assess the goodness of fit of the Poisson regression model more directly. We can fit a preliminary Poisson regression, then stratify the data into bins based on the fitted values $\hat{y}$ from this model. We then take the empirical mean and variance of the data within each bin. The relationship between these estimated mean and variance values can be used to infer the mean/variance relationship.
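
A sketch of this diagnostic (the pandas qcut binning and the simulated mean/variance structure are our choices):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(0, 2, 5000)
y = rng.poisson(rng.gamma(shape=np.exp(x), scale=1.0))  # Var = 2*mean

# Preliminary Poisson fit, then bin the data on the fitted values.
X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
bins = pd.qcut(fit.fittedvalues, 10)

# Empirical mean and variance of y within each bin; plotting one
# against the other reveals the mean/variance relationship (here
# the points should fall near the line variance = 2 * mean).
df = pd.DataFrame({"y": y, "bin": bins})
print(df.groupby("bin", observed=True)["y"].agg(["mean", "var"]))
```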

It is not necessary to insist on a Poisson model (with $\phi=1$) whenever we have count data in hand – it is sometimes better to allow for overdispersion, as long as we correctly estimate the variance function and report it as such. However, it is usually desirable to explain as much of the variation in the response variable of a regression as possible, rather than leaving it as “unexplained variation”. It is often the case that including additional covariates in such a model will lead to smaller values of the dispersion parameter, and with enough informative covariates, a Poisson relationship can sometimes be achieved.

Overview of different GLMs

The Poisson and Gaussian GLMs are very widely used, but there are many other useful GLMs that can be specified through different choices of the family, link function, and variance function. In fact, there are infinitely many possible GLMs. We will discuss a few of the most prominent ones here. Note that all of the approaches discussed below are suitable for non-negative response variables. The main GLM family that is used with data that can take on both positive and negative values is the Gaussian family.

All of the GLMs discussed here most commonly use the logarithm as the link function, although alternative link functions are possible.

The negative binomial GLM can be seen as an extension of the Poisson GLM. Its mean/variance relationship involves a shape parameter $\alpha$, such that

$${\rm Var}[Y|X=x] = E[Y|X=x] + \alpha \cdot E[Y|X=x]^2.$$

If $\alpha=0$, the mean/variance relationship is the same as in the Poisson model. If $\alpha > 0$, the variance increases faster with the mean compared to the Poisson GLM. The quasi negative binomial model introduces a scale parameter to the mean/variance relationship, which becomes

$$ {\rm Var}[Y|X=x] = \phi\cdot (E[Y|X=x] + \alpha \cdot E[Y|X=x]^2). $$
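
A sketch of fitting a negative binomial GLM in statsmodels; note that this family treats the shape parameter $\alpha$ as known, so in practice it must be supplied or estimated separately:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(0, 2, 1000)
mu = np.exp(0.5 + 1.0 * x)

# Poisson/Gamma mixture giving Var = mu + alpha * mu^2, alpha = 0.5.
alpha = 0.5
lam = rng.gamma(shape=1 / alpha, scale=alpha * mu)
y = rng.poisson(lam)

# The default link for the NegativeBinomial family is the log link.
X = sm.add_constant(x)
result = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=alpha)).fit()
print(result.params)  # close to (0.5, 1.0)
```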

The Tweedie GLM interpolates the mean/variance relationship between that of a Poisson GLM and that of a Gamma GLM. It has the form

$$ {\rm Var}[Y|X=x] = \phi\cdot E[Y|X=x]^p $$

where $1 \le p \le 2$ is a “power parameter”; $p=1$ recovers the (quasi-)Poisson mean/variance relationship and $p=2$ the Gamma one.

A Gamma GLM has a mean/variance relationship that is

$$ {\rm Var}[Y|X=x] = \phi \cdot E[Y|X=x]^2. $$

Note that this implies that the coefficient of variation is

$$ {\rm CV}[Y|X=x] = {\rm SD}[Y|X=x]/E[Y|X=x] = \sqrt{\phi}, $$

which is constant.

An inverse Gaussian GLM has

$$ {\rm Var}[Y|X=x] = \phi \cdot E[Y|X=x]^3 $$

as its mean/variance relationship. The variance grows very fast with the mean.
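
Each statsmodels family object carries its variance function $V(\mu)$ (the model variance being $\phi \cdot V(\mu)$), so the power laws above can be checked directly:

```python
import numpy as np
import statsmodels.api as sm

mu = np.array([1.0, 2.0, 4.0])
print(sm.families.Poisson().variance(mu))               # mu
print(sm.families.Tweedie(var_power=1.5).variance(mu))  # mu**1.5
print(sm.families.Gamma().variance(mu))                 # mu**2
print(sm.families.InverseGaussian().variance(mu))       # mu**3
```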

Generalized estimating equations

As discussed above, GLM analysis is not fully model-based, in the sense that meaningful results can be obtained even when the probability model used to estimate the regression parameters does not exactly match the population from which the data were sampled. For example, we can use Poisson regression even if the response values are not distributed as Poisson random variables (conditioned on the covariates). In order for GLM parameter estimates to be meaningful, the mean structure has to be correctly specified. In order for the uncertainty analysis of the estimated mean structure to be meaningful, the variance structure has to be correctly specified. In neither case does the full probability model need to be correctly specified. When taking advantage of these “robustness” properties of a GLM, we are doing “quasi-likelihood” analysis rather than the more basic “likelihood analysis”.

Quasi-likelihood analysis is helpful if we specify the marginal mean and variance structures correctly, but the full distributional form of the response given the predictors is unknown, or is specified incorrectly in a working model. However, the quasi-likelihood approach does not help us if the variance structure is misspecified, or if there are correlations between the observations. A more advanced technique called “Generalized Estimating Equations” (GEE) can be used in these situations.

In GEE regression, we specify a “working correlation model” (which does not need to be correct, although the estimated regression parameters are more accurate if it is approximately correct). GEE uses a technique involving a “robust covariance estimator” to quantify the uncertainty in the regression parameter estimates even when the working model is misspecified.
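
A sketch of a GEE fit with an exchangeable working correlation in statsmodels, on simulated clustered data (the shared cluster effect and all parameter values are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
n_groups, m = 300, 5
groups = np.repeat(np.arange(n_groups), m)

# A shared latent effect within each cluster induces
# within-cluster correlation in the counts.
u = np.repeat(rng.normal(0, 0.5, n_groups), m)
x = rng.uniform(0, 1, n_groups * m)
y = rng.poisson(np.exp(0.5 + 1.0 * x + u))

X = sm.add_constant(x)
result = sm.GEE(y, X, groups=groups,
                family=sm.families.Poisson(),
                cov_struct=sm.cov_struct.Exchangeable()).fit()
print(result.params)                # mean structure estimates
print(result.bse)                   # robust standard errors
print(result.cov_struct.summary())  # estimated working correlation
```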

Working correlation structures for GEE

The GEE working correlation structure can sometimes provide insight into the population under study that complements what we learn through the mean structure parameters. One of the most common settings where GEE can be used is when we have “clustered data”. In this situation, the data can be partitioned into groups such that there may be correlations between the observations within a group, but there are no correlations between observations in different groups. These situations arise very often in practice, for example, where we sample people in various communities, and there may be correlations between people who live in the same community, but people in different communities are taken to be independent.

One approach to modeling clustered data is to include the cluster variable (which is categorical) as a covariate in the model. This is called “fixed effects analysis”. It can be quite effective if the number of clusters is small or moderate. If there are many clusters, however, it may not be possible to include so many dummy variables in the model as regressors. In this case, we can omit the cluster variable, which now means that there will be unmodeled residual covariance within, but not between, the clusters.

A simple model for such clustered data describes the dependence through a single parameter called the “ICC” - the “intraclass correlation coefficient”. The ICC represents the correlation we would see if we sampled one pair of units from each cluster, and calculated their Pearson correlation coefficient. The ICC ranges from 0 to 1 (it cannot be negative).

The ICC is informative about the possible presence of “stable confounders” in the population under study. A stable confounder is an unobserved covariate that is constant for all units within a cluster. The higher the ICC, the greater the influence of stable confounders. We will not go into a complete assessment of the causality properties of GEE here, but we note that GEE can mitigate certain types of stable confounding, but not others. GEE is an improvement over GLM, but there is no statistical technique that allows unqualified causal conclusions to be drawn from arbitrary sources of observational data.

The ICC is initially estimated from the residuals of the GLM, using the method of moments.

Once a preliminary estimate of the ICC is available, a set of “estimating equations” is used to update the estimates of the regression parameters. These estimating equations are similar to the quasi-likelihood used to fit a GLM, but are modified to account for the working correlation. This approach is typically iterated – after updating the regression parameters, the ICC is estimated again, then the regression parameters are re-estimated, and so on. There is no guarantee that this process will converge, and there is no need for it to converge for the results to be statistically valid. Nevertheless, in most cases people run GEE to convergence when it is practical to do so.