Analysis of data with censored values #

Many values of interest are observed in a censored form. This means that we may observe only a bound on the value of interest, not its exact value. The most common form of bound that we encounter is a lower bound, which is illustrated through several examples below. It is also possible to have an upper bound, or an interval bound (i.e. the value is known only to fall between numbers \(a\) and \(b\) ), but here we only consider the case of the lower bound, which corresponds to right censoring.

A common situation where right censoring arises is when the value of interest is a duration (i.e. of time). A duration is defined to be the length of time between a “time origin” and the time when an event of interest occurs. An example from medical science would be the duration of time from the point when someone is diagnosed with a serious disease, e.g. cancer, until the time when that person dies. Alternatively, we may consider people who have been treated successfully for cancer and have gone into remission (i.e. there is no detectable cancer), and then study the time from remission until the disease recurs.

Although strongly associated with clinical research, this framework has applications in many other areas. In quality engineering, we can consider the duration of time from the point when a product is purchased until it breaks or wears out. Or, in studies of consumer behavior, we could consider the duration of time from the point when people purchase a product until they return it. In social science, we could consider the duration of time starting from the point when someone looses their job until they find another job.

A key point to emphasize about all these examples is that the “origin” is marked by a well-defined event (e.g. cancer remission, loosing a job), and the event of interest is also defined clearly and explicitly (e.g. cancer recurrence or finding a new job).

While sometimes durations are fully observed (e.g. in a retrospective study looking at events that happened long ago), when duration data are collected prospectively or from the recent past, it is usually not practical to wait until every unit of analysis (i.e. every case in the sample) has experienced the event. In fact, some units may never experience the event (e.g. some people who have gone into remission for cancer never have a recurrence, and some products are purchased and never returned). However, if we “follow up” a unit for a given amount of time, say \(F\) time units, and then do not follow it further, then we only know that the event time for that unit is greater than \(F\) . In other words, it was right censored at time \(F\) . Thus, we have partial but not complete information about the unit’s event time.

In a setting where we have partial information about some or all analysis units, it is tempting to simply omit the cases with partial information. However not only does this loose efficiency, since we are not taking advantage of useful information in the censored cases, but worse, it causes bias. The intuitive reason for the bias is that the units where the true event time is large are more likely to be censored. Therefore if we exclude the censored cases, we bias the sample toward units with shorter event times.

Due to this bias and loss of information, it is almost never appropriate to analyze data affected by censoring using standard techniques that do not account for censoring. In the last fifty years, many broadly useful tools for working with censored data have been developed. Below we focus on two of them – methods for estimating the marginal survival function, and methods for conducting regression analysis with censored data. Before we discuss those techniques, we present the notation and definitions more completely, and discuss the concept of censoring in more detail.

More on censoring #

Let \(T\) denote the duration from the origin until the event of interest occurs. This is called the event time. In addition, each unit also has a follow-up time which we denote \(F\) . This is the maximum amount of time that this unit would be observed if the event does not happen sooner. The data that we actually observe consists of two values: \(Y = \textrm{min}(T, F)\) and the censoring status \(C = I(T = \textrm{min}(T, F))\) . By convention the censoring status is equal to 1 if the event is observed (i.e. the observation is not censored), and the censoring status is equal to 0 if the observation is censored.

A key condition behind most of the basic methods for survival analysis is called independent censoring, which is that \(T\) and \(F\) must be independent random variables. This is almost always an untestable assumption, because we never observe \(T\) and \(F\) for the same unit. We can only argue that independent censoring is likely to hold in a particular setting based on the mechanisms of censoring that are potentially relevant. Some common censoring mechanisms are:

Administrative censoring – if the data are collected retrospectively, and are current as of some fixed date (i.e. everyone is censored on that date if they have not already experienced the event), then we have administrative censoring. It is often plausible to argue that administrative censoring is a form of independent censoring, as long as the subjects whose time origin occurred earlier are comparable to those whose time origin occurred later.
Censoring from competing risks – if a unit becomes censored once they experience some other event that is not the event of interest, this could be considered to be a form of censoring. Although this is often treated as independent censoring, it may fail to be so. Alternative methods for “competing risks” survival analysis are a better choice in this setting.
Loss to follow-up – if individuals move, drop out, or stop responding to requests for information, this is called “loss to follow-up”. It may approximately be a form of independent censoring, but it easily could be dependent as well (e.g. if certain types of people are more likely to be lost to follow-up than others).

Survival function estimation #

If we have a homogeneous population, the survival function (or, for emphasis, the marginal survival function) is defined to be \(S(t) \equiv P(T > t)\) . It is equal to 1 minus the cumulative distribution function (CDF) of \(T\) .

The product-limit method of Kaplan and Meier is generally used to estimate the survival function \(S(t)\) . This is a non-parametric method that yields an estimate that takes the form of a step function, with steps at the event times.

If the data are grouped, then it is common to compare two or more groups using the null hypothesis that all groups have the same survival function. This can be accomplished using the log-rank test.

Proportional hazards regression #

Suppose we have censored duration data \((Y, C)\) on a sample of units, and we also observe explanatory variables \(X\) on each of them. In this setting, it would be natural to ask if the conditional distribution of true durations \(T|X\) varies with \(X\) . Since we do not observe \(T\) , but rather observe only \((Y, C)\) , this cannot be accomplished simply by applying a conventional form of regression analysis.

There are many approaches for conducting regression analysis with censored duration data. By far the most widely used approach is the proportional hazards regression approach due to David Cox. This approach is centered on the hazard function, which is defined to be

\(\lambda(t; x) = \textrm{lim}_{\delta\rightarrow 0} P(T\in (t, t+\delta]|T>t, X=x)/\delta.\)

In words, the hazard function is the “instantaneous rate” at which events occur, given that it is possible for an event to occur.

The proportional hazards regression approach models the hazard function as \(\lambda(t; x) = \lambda_0(t)\cdot \exp(\beta^\prime x)\) . This model imposes the constraint that the hazard functions for any two units have the same form \(\lambda_0(t)\) , scaled by a constant factor \(\exp(\beta^\prime x)\) that depends on the covariates. Proportional hazards is an assumption that may or may not hold for any given population.

The covariates enter the proportional hazards model through a linear predictor \(\beta^\prime x\) which can be interpreted like a linear predictor in other regression approaches. In the context of this model, the role of the j’th covariate is captured through its log hazard ratio \(\beta_j\) and its hazard ratio \(\exp(\beta_j)\) . Specifically, the hazard ratio is the factor by which the hazard increases corresponding to a 1-unit increase in \(x_j\) , when the values of all the other covariates are held fixed.

It turns out that the regression parameters \(\beta\) in a proportional hazards regression function can be fit without estimating the baseline hazard function \(\lambda_0(\cdot)\) . In this sense, the proportional hazards approach is “semi-parametric”. It is possible to estimate the baseline hazard function as a separate step after estimating the coefficients \(\beta\) , but many people do not focus on the baseline hazard function, since the research focus is on the role of the covariates.

Entry times #

In a basic survival analysis, every subject is observed from the time origin up to the point when they either have the event, or are censored. This is appropriate if the data are collected prospectively, and if every subject is monitored starting from the time origin. Often in a survival analysis, we have a partially retrospective design in which the selection of subjects takes place after the time origin.

Suppose for example that we are considering a medical treatment such as implantation of a cardiac pacemaker. We are interested in quantifying how this pacemaker may reduce mortality. Imagine a design in which we are able to identify all people currently living in a community who have received a pacemaker in the past. By design, if we enroll a person into our study at a given time \(E\) , that person could not have died (experienced the event) before time \(E\) . This phenomenon is known as left truncation. It is important to account for this in the analysis. To do so, we define \(E\) to be the entry time for this subject. Subjects who are monitored from the time origin will have entry times of zero, but subjects who are recruited later will have positive entry times. We will not discuss the technical details here of how this is handled, except to say that all standard survival analysis software has the capability to handle left truncation, and entry times should always be provided when appropriate.

Time-varying covariates #

In a basic proportional hazards regression analysis, the covariates are defined at the time origin and cannot change as time progresses. In many settings the covariates of interest are time-varying and will continue to evolve between the time origin, and the point where the event or censoring occurs. A standard proportional hazards regression model cannot accommodate covariates of this type.

To accommodate time-varying covariates in proportional hazards regression analysis, we first consider covariates whose temporal variation follows a step function, i.e. the covariate are piece-wise constant functions of time. If we have such a covariate, we treat each constant interval as an independent subject in the analysis, using entry times as discussed above to define the time offset where each interval begins. Here is an example:

Id	Entry	Time	X	Status
0	0	14	0	0
0	14	21	1	0
0	21	26	2	1
1	0	7	2	0
1	7	19	4	0

The subject with id 0 had values of covariate \(X\) that began at 0, changed to 1 at time 14, and changed to 2 at time 21. This subject ultimately experienced the event at time 26. The subject with id 1 had covariate values that began at 2, then changed to 4 at time 7. Subject 1 never was observed to have the event, and was censored at time 19. In this representation of the data, each subject will have 1 or more rows, every row except the last row for each subject will have status=0. The last row will have status = 1 for subjects who have the event, and status=0 for subjects who are censored.

Notably, this representation makes it seem as if we have repeated observations per subject, and this normally means that we would need to account for the repeated measures somehow. However, in this particular setting, the likelihood factors into distinct terms with respect to the rows of data, so no such adjustment is needed. The subject id variable is not used when fitting the model, and is only presented above to illustrate how the data were structured for this analysis.

Many time-varying covariates evolve continuously rather than in discrete steps. Unfortunately, there is no practical way to handle such variables directly in a proportional hazards regression analysis. Instead, it is common to create a piecewise constant version of each such variable, and approach the modeling as discussed above. The choice of interval width for approximating the continuously-changing variable as a piecewise constant variable is a parameter that must be chosen.

Non-proportional hazards #

The key structural constraint for a proportional hazards regression analysis is that the hazard functions are proportional between any two units of analysis. This may not hold in practice, and is somewhat difficult to diagnose. One way to identify non-proportional hazards is to stratify the sample according to some of the key variables, and plot the estimated baseline hazard functions for the different strata on the same axes. This works well if the non-proportionality is related to the variable chosen for the stratification, and if there is a sufficient sample size within each stratum.

Another approach for diagnosing non-proportional hazards is to add interactions between time and other covariates to the model. Time is normally not included as a covariate in a proportional hazards regression, since the role of time is explained by the baseline hazard function. If the interaction between time and a covariate seems to have a substantial impact, this likely indicates non-proportionality. Note that when including this interaction, we do not also include the main effect of time (but the main effect of the covariate is generally included). Since time is a continuously changing time-varying variable, this requires use of the techniques discussed in the previous two subsections.