NHANES Smoothing Analysis

Top page

Contents:

Background

Scatterplot smoothing can be used to estimate functional relationships between two continuously varying values. For example, in the NHANES data we may wish to estimate the relationship between age and systolic blood pressure. Specifically, we will estimate the conditional mean of a person's systolic blood pressure given his or her age.

A number of smoothing algorithms have been developed. The Python Statsmodels package implements the "lowess" smoothing procedure, which we will demonstrate here.

The code

We will use the data reading module nhanes_read_data_pandas.py based on Pandas. It would be straightforward to modify this script to use the Numpy-based data reading module (nhanes_read_data_numpy.py).

We begin with the import statements, and give a brief description of what this script will do.

import numpy as np
import pandas as pd
from nhanes_read_data_pandas import Z,VNH
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

## PdfPages allows us to create multi-page PDF documents
from matplotlib.backends.backend_pdf import PdfPages

"""
Estimate smooth regression relationships among several of the
quantitative variables in NHANES, also stratified by a few categorical
stratifying variables.
"""

The NHANES income data is coded using a set of unordered, partially overlapping categories. We will use a simpler ordered income variable, created here.

## Create a new income variable (see codebook)
F_income = pd.Series(np.nan, index=Z.index, dtype='O')
I = Z.loc[:,VNH["Income"]]
F_income[I <= 5] = "Low"
F_income[I == 13] = "Low"
F_income[(I > 5) & (I <= 8)] = "Middle"
F_income[(I == 9) | (I == 10) | (I == 14) | (I == 15)] = "High"
VNH["Income2"] = "DEMO_F:F_income"
Z[VNH["Income2"]] = F_income

Since we are analyzing the data as a function of age, we reorder the data so that is sorted by age.

## Order with respect to age.
ii = np.argsort(Z.loc[:,VNH["Age"]])
Z = Z.iloc[ii,:]

Generating the plots

Finally we generate the plots. We have a loop that runs over the different variables that will be the dependent variable Y in the regression. Within each loop, we do the regression and plot the results for three different stratifying variables (gender, income, and ethnicity).

## Start a multi-page PDF document.
pdf = PdfPages("nhanes_smoothing.pdf")

colors = ["orange", "purple", "lime", "cyan"]

## Smooth each of these variables against age.  Watch out for missing
## data.
for vname in [VNH[x] for x in ("Height", "Weight", "BMI", "SYBP", "DIBP")]:

    ## Drop the cases with missing data
    VN = [VNH[x] for x in ("Gender", "Age", "Ethnicity", "Income2")]
    VN += [vname,]
    D = Z.loc[:,VN]
    D = D.dropna()

    ## Drop cases with 0 for quantitative outcomes
    ii = (D.loc[:,vname] > 0)
    D = D.loc[ii,:]

    ## Give everything its own name
    Y = D.loc[:,vname]
    Age = D.loc[:,VNH["Age"]]
    Gender = D.loc[:,VNH["Gender"]]
    Income = D.loc[:,VNH["Income2"]]
    Ethnicity = D.loc[:,VNH["Ethnicity"]]

    ## Plot the fitted values for the regression stratifying on gender
    plt.clf()
    plt.figure(figsize=(8,5))
    plt.axes([0.1,0.1,0.57,0.8])
    A,N = [],[]
    for ix,gl in enumerate((2,1)):
        ii = (Gender == gl)
        L = lowess(Y.loc[ii].values, Age.loc[ii].values)
        a, = plt.plot(L[:,0], L[:,1], '-', lw=5, 
                      color=colors[ix], alpha=0.6)
        A.append(a)      
        N.append(len(ii))

    bb = plt.figlegend(A, ("Female (n=%d)" % N[0], "Male (n=%d)" % N[1]), 
                       "center right")
    bb.draw_frame(False)
    plt.ylabel(vname, size=17)
    plt.xlabel("Age", size=17)
    pdf.savefig()

    ## Plot the fitted values for the regression stratifying on income
    plt.clf()
    plt.axes([0.1,0.1,0.7,0.8])
    A = []
    for ix,il in enumerate(("Low","Middle","High")):
        ii = (Income == il)
        L = lowess(Y.loc[ii].values, Age.loc[ii].values)
        a, = plt.plot(L[:,0], L[:,1], '-', lw=4, 
                      color=colors[ix], alpha=0.6)
        A.append(a)

    bb = plt.figlegend(A, ("Low", "Middle", "High"), "center right",
                   title="Income")
    bb.draw_frame(False)
    bb.get_title().set_fontsize(17)
    plt.ylabel(vname, size=17)
    plt.xlabel("Age", size=17)
    pdf.savefig()

    ## Plot the fitted values for the regression stratifying on ethnicity
    plt.clf()
    plt.axes([0.1,0.2,0.55,0.65])
    A = []
    for ix,ie in enumerate((1,2,3,4)):
        ii = (Ethnicity == ie)
        L = lowess(Y.loc[ii].values, Age.loc[ii].values)
        a, = plt.plot(L[:,0], L[:,1], '-', lw=4, 
                      color=colors[ix], alpha=0.6)
        A.append(a)

    bb = plt.figlegend(A, ("Mexican American", "Other Hispanic",
                           "Non-Hispanic White", "Non-Hispanic Black"), 
                           "center right", title="Ethnicity")
    bb.draw_frame(False)
    bb.get_title().set_fontsize(17)
    plt.ylabel(vname, size=17)
    plt.xlabel("Age", size=17)
    pdf.savefig()

pdf.close()

See also

Downloads