NHANES Linear Regression

Background

Linear regression can be used to assess how several independent variables are related to a dependent variable. In particular, it allows us to examine how two variables are related while controlling for the effects of other variables.

As an illustration, we consider here a linear regression analysis of the NHANES data in which systolic blood pressure is the dependent variable.
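
To make this concrete, here is a minimal sketch on simulated data (not NHANES) showing how the coefficient of one variable changes once a correlated variable is added to the model; all names here are illustrative:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8*x1 + rng.normal(size=n)  ## x2 is correlated with x1
y = 1 + 2*x1 - x2 + rng.normal(size=n)

## Regressing on x1 alone absorbs part of the x2 effect
print(sm.OLS(y, sm.add_constant(x1)).fit().params)

## Including x2 recovers the effect of x1 holding x2 fixed
X = np.column_stack([x1, x2])
print(sm.OLS(y, sm.add_constant(X)).fit().params)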

The code

We begin with several import statements, including one from a helper module that provides the data set Z and a dictionary VNH mapping short, easy-to-remember names to the NHANES variable names.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.regression import linear_model as lm
from matplotlib.backends.backend_pdf import PdfPages
from nhanes_read_data_pandas import Z, VNH

"""
Use linear regression to assess the relationships between several
variables and blood pressure in the NHANES data.
"""

We next construct a data set containing all the variables we will need for the regression analysis, and reduce it to include only people who are at least 18 years old.

## Put a data set together for regression analysis
VN = [VNH[x] for x in ("Age", "Gender", "BMI", "Weight", "Height", "Ethnicity")]
VNA = VN + [VNH["SYBP"]]
D = Z.loc[:,VNA].dropna()
ii = D.loc[:,VNH["Age"]] >= 18
D = D.loc[ii,:]

## Drop zeros from variables where they don't belong
for vn in "Height", "SYBP", "Weight":
    ii = (D.loc[:,VNH[vn]] > 0)
    D = D.loc[ii,:]

## Eliminate colons from the variable names because they
## confuse the formula system
D.columns = [x.split(":")[1] for x in D.columns]
for k in VNH:
    VNH[k] = VNH[k].split(":")[1]
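
At this point it can be helpful to confirm the filtering and renaming; a quick check (the exact dimensions depend on the NHANES files being used):

## Check the size and variable names of the analysis data set
print(D.shape)
print(D.columns.tolist())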

Next, we make a scatterplot of blood pressure against each continuous independent variable, and a box plot of blood pressure for each level of each categorical independent variable.

pdf = PdfPages("nhanes_linear_regression_plots.pdf")

## Make scatterplots between the continuous predictors and outcome
VN = [VNH[x] for x in ("Height", "Weight", "BMI", "Age")]
for vname in VN:
    plt.clf()
    plt.plot(D.loc[:,vname], D.loc[:,VNH["SYBP"]], 'o', ms=5, alpha=0.5)
    plt.xlabel(vname, size=17)
    plt.ylabel(VNH["SYBP"], size=17)
    pdf.savefig()

## Make box plots between categorical predictors and the outcome
VN = [VNH[x] for x in ("Gender", "Ethnicity")]
Codes = {VNH["Gender"]: {1: "Male", 2: "Female"},
         VNH["Ethnicity"]: {1: "Mexican American", 
                            2: "Other Hispanic",
                            3: "Non-Hispanic White", 
                            4: "Non-Hispanic Black",
                            5: "Other"}}
for vname in VN:
    X = D.loc[:,vname]
    S = sorted(set(X))
    Y = D.loc[:,VNH["SYBP"]]
    B = [Y[X==s] for s in S]
    plt.clf()
    plt.boxplot(B)
    codes = Codes[vname]
    plt.ylabel("Systolic blood pressure", size=17)
    if vname == VNH["Gender"]:
        plt.gca().set_xticklabels([codes[s] for s in S])
    else:
        plt.gca().set_xticklabels([])
        for j in range(len(S)):
            plt.text(j+1, 52 - 8*(j % 2), codes[S[j]], ha="center")
    pdf.savefig()

Here we fit a linear regression model using ordinary least squares (OLS), with systolic blood pressure as the dependent variable and height, weight, age, and gender as independent variables.

## Fit a regression model with systolic blood pressure as the
## dependent variable

Y = D.loc[:,VNH["SYBP"]]
VN = [VNH[x] for x in ("Height", "Weight", "Age", "Gender")]
X = D.loc[:,VN]
X = sm.add_constant(X)
md = lm.OLS(Y, X).fit()
print(md.summary())
print("\n\n")

If we drop gender, we get an interesting example of omitted variable bias. Note from the previous model that blood pressure is negatively associated with height when controlling for gender, and that male gender is associated with higher blood pressure. Since men are taller on average than women, when we omit gender these two effects approximately cancel, and very little effect remains for height.

Y = D.loc[:,VNH["SYBP"]]
VN = [VNH[x] for x in ("Height", "Weight", "Age")]
X = D.loc[:,VN]
X = sm.add_constant(X)
md = lm.OLS(Y, X).fit()
print(md.summary())
print("\n\n")

We can fit the same model using formulas (BPXSY1 is the NHANES name for systolic blood pressure, after the renaming above):

## Fit the same regression model using formulas
fml = "BPXSY1 ~ " + " + ".join([VNH[x] for x in ("Height", "Weight", "Age")])
md = smf.ols(formula=fml, data=D).fit()
print(md.summary())
print("\n\n")

The formulas can include variable transformations and interactions. Note that RIDAGEYR*RIAGENDR expands to the two main effects plus their interaction, so terms that also appear on their own in the formula are not duplicated:

## Include some transformations and interactions
fml = "BPXSY1 ~ RIDAGEYR + np.log(RIDAGEYR) + RIAGENDR + BMXBMI + RIDAGEYR*RIAGENDR"
md = smf.ols(formula=fml, data=D).fit()
print(md.summary())
print("\n\n")

A variable will be treated as categorical and coded with dummy variables if it is enclosed in C():

## Include categorical variables
fml = "BPXSY1 ~ RIDAGEYR + RIAGENDR + C(RIDRETH1) + BMXBMI + RIDAGEYR*RIAGENDR"
md = smf.ols(formula=fml, data=D).fit()
print(md.summary())
print("\n\n")

We can use the combinations generator in the itertools module to do best subsets regression, fitting a model for every subset of up to five of the predictors and ranking the fits by AIC:

## All subsets regression
from itertools import combinations
VN = [VNH[x] for x in ("Age", "Gender", "BMI", "Weight", "Height", "Ethnicity")]
VN[-1] = "C(" + VN[-1] + ")"
Q = []
for nvar in range(1, 6):
    for V in combinations(VN, nvar):
        fml = "BPXSY1 ~ " + " + ".join(V)
        md = smf.ols(formula=fml, data=D).fit()
        Q.append([md.aic, V])

## Sort the fitted models by AIC
AIC = [q[0] for q in Q]
ii = np.argsort(AIC)
Q = [Q[i] for i in ii]

## Print out the top 10 models
for q in Q[0:10]:
    print "%12.2f %s" % (q[0], " + ".join(q[1]))

pdf.close()
