NHANES Logistic Regression

Contents:

Background
The code
See also
Downloads

Background

Logistic regression can be used to understand the relationship between one or more predictor variables and a binary outcome. It is often used to assess the relationship between one predictor vairable and a binary outcome, while holding the values of other variables fixed.

Here we demonstrate a number of techniques related to logistic regression using a dichotomized definition of high blood pressure (systolic blood pressure being greater than 140 units), in the NHANES data set.

The code

We have many import statements, and some code to allow us to use more covenient variable names.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.discrete.discrete_model import Logit
from matplotlib.backends.backend_pdf import PdfPages
from nhanes_read_data_pandas import Z,VNH
from patsy import dmatrices

"""
Use logistic regression to predict high versus normal systolic blood
pressure in the NHANES data set.
"""

Next we create a data set containing only our variables of interest, and drop people younger than 18 years old and anyone with missing data. We also drop cases that have zero values in certain variables where a zero value is impossible. Then we create the dichotomous outcome, and modify the variable names.

## Put a data set together for regression analysis
VN = [VNH[x] for x in "Age", "Gender", "BMI", "Weight", "Height", "Ethnicity"]
VNA = VN + [VNH["SYBP"],]
D = Z.loc[:,VNA].dropna()
ii = D.loc[:,VNH["Age"]] >= 18
D = D.loc[ii,:]

## Drop zeros from variables where they don't belong
for vn in "Height","SYBP","Weight","BMI":
    ii = (D.loc[:,VNH[vn]] > 0)
    D = D.loc[ii,:]

## Create a dichotomous blood pressure variable, cutting at 140.
D.loc[:,VNH["SYBP"]] = 1*(D.loc[:,VNH["SYBP"]] >= 140)

## Eliminate colons from the variable names because it
## confuses the formula system
D.columns = [x.split(":")[1] for x in D.columns]
for k in VNH:
    VNH[k] = VNH[k].split(":")[1]

Here we make a dot plot to visualize the marginal relationships between predictors and the outcome.

pdf = PdfPages("nhanes_logistic_regression_plots.pdf")

## Make dot plots showing the predictor means stratified by outcome
VN = [VNH[x] for x in "Height", "Weight", "BMI", "Age", "Gender"]
i0 = np.flatnonzero(D.loc[:,VNH["SYBP"]] == 0)
i1 = np.flatnonzero(D.loc[:,VNH["SYBP"]] == 1)

plt.clf()
ax = plt.axes([0.2,0.15,0.6,0.8])
for loc, spine in ax.spines.items():
    if loc in ['bottom',]:
        spine.set_position(('outward',10))
    elif loc in ['right','top','left']:
        spine.set_color('none')

ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('none')
ax.set_yticks([])

yy = 20
for vname in VN:
    Y = D.loc[:,vname]
    Y = (Y - Y.mean()) / Y.std()
    mn0 = Y.iloc[i0].mean()
    mn1 = Y.iloc[i1].mean()
    plt.plot([-3,3], [yy,yy], '-', color='grey')
    a0, = plt.plot([mn0,], [yy-0.1,], 'o', ms=12, color='orange')
    a1, = plt.plot([mn1,], [yy+0.1,], 'o', ms=12, color='lime')
    plt.text(-3.1, yy, vname, ha="right", va="center")
    yy -= 1

plt.xlim(-3, 3)
plt.ylim(yy+0.5, 21)
leg = plt.figlegend((a0,a1), ("Normal", "High"), "center right",
                    numpoints=1, handletextpad=0.0001)
leg.draw_frame(False)
plt.xlabel("Z-score", size=17)
pdf.savefig()

We can fit a logistic regression model using a vector (of binary values) as the outcome, and a matrix of predictors.

## Fit a logistic regression model with high versus normal systolic
## blood pressure as the dependent variable

Y = D.loc[:,VNH["SYBP"]]
VN = [VNH[x] for x in "Height", "Weight", "Age", "Gender"]
X = D.loc[:,VN]
X = sm.add_constant(X)
md = Logit(Y, X).fit()
print md.summary() 
print "\n\n"

If we drop gender, the height effect nearly vanishes, due to the interplay between gender and height.

Y = D.loc[:,VNH["SYBP"]]
VN = [VNH[x] for x in "Height", "Weight", "Age"]
X = D.loc[:,VN]
X = sm.add_constant(X)
md = Logit(Y, X).fit()
print md.summary() 
print "\n\n"

We can fit the same model using formulas:

## Fit the same regression model using formulas
fml = "BPXSY1 ~ " + " + ".join([VNH[x] for x in ("Height", "Weight", "Age")])
md = smf.logit(formula=fml, data=D).fit()
print md.summary()
print "\n\n"

A variable will be interpreted as being categorical if it is enclosed in a C() expression in the formula (variables whose values are strings are automatically treated as categorical, but the NHANES categorical variables are all coded numerically).

## Include categorical variables
fml = "BPXSY1 ~ RIDAGEYR + RIAGENDR + C(RIDRETH1) + BMXBMI + RIDAGEYR*RIAGENDR"
md = smf.logit(formula=fml, data=D).fit()
print md.summary()
print "\n\n"

If the motivation for the logistic regression analysis is prediction it is important to assess the predictive performance of the model unbiasedly. The ideal way to do this is to fit the model and test the model on independent data sets. If this is not possible, we can use cross validation.

## Cross validation
Q = np.zeros(D.shape[0], dtype=np.float64)
y,X = dmatrices(fml, data=D)
for i in range(X.shape[0]):

    ## This is slow, so print progress information
    if i % 100 == 0:  
        print i

    ## The indices of the non held-out cases
    ii = range(X.shape[0])
    ii.remove(i)

    md = Logit(y[ii], X[ii,:])
    mdf = md.fit(disp=False)
    Q[i] = md.predict(mdf.params, exog=X[i,:], linear=True)

Once we have the cross-validated risk scores in Q, we can calculate the sensitivity and specificity as measures of predictive performance.

## Calculate sensitivity and specificity
S = np.linspace(min(Q), max(Q), 100)
Sens = np.zeros(len(S), dtype=np.float64)
Spec = np.zeros(len(S), dtype=np.float64)
i1 = np.flatnonzero(y == 1)
i0 = np.flatnonzero(y == 0)
for i,s in enumerate(S):
    Sens[i] = np.mean(Q[i1] > s)
    Spec[i] = np.mean(Q[i0] <= s)

Based on the sensitivity and specificity, we can use the area under the (ROC) curve as a summary measure of predictive performance.

## Calculate the AUC using the trapezoidal rule
auc = 0.
for i in range(1,len(Sens)):
    auc += (Spec[i]-Spec[i-1]) * (Sens[i] + Sens[i-1])/2

Finally, we can plot the receiver operating characteristics curve_ (ROC curve) to visualize the predictive performance.

## Plot the ROC curve
plt.clf()
plt.grid(True)
plt.plot([0,1], [0,1], '-', color='lightgrey', lw=5)
plt.plot(1-Spec, Sens, '-', color='orange', lw=5)
plt.xlabel("1 - Specificity", size=17)
plt.ylabel("Sensitivity", size=17)
plt.title("AUC=%.2f" % auc)
pdf.savefig()

pdf.close()

NHANES Logistic Regression

Background

The code

See also

Downloads