NHANES Contingency Tables

Background

Contingency tables (also known as "crosstabs") describe the joint distribution of two or more variables, each of which takes on a relatively small number of distinct values (e.g. categorical, nominal, and certain ordinal types of data). Contingency tables are most often used to describe the relationship between two variables.
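
As a minimal sketch of the idea, the following builds a two-way table of counts with pd.crosstab. The data frame toy and its columns sex and smoker are made up for illustration and are not part of the NHANES data.

```python
import pandas as pd

# Hypothetical toy data: two categorical variables observed on six subjects.
toy = pd.DataFrame({
    "sex": ["M", "F", "F", "M", "F", "M"],
    "smoker": ["yes", "no", "no", "no", "yes", "no"],
})

# Each cell counts the subjects having that combination of values.
table = pd.crosstab(toy["sex"], toy["smoker"])
print(table)
```

The rows of the result are the distinct values of the first variable and the columns are the distinct values of the second.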

Contingency tables of NHANES demographic variables

We start with the usual import statements and a description of what the script does. This script uses the NHANES data management script nhanes_read_data_pandas.py, which is described on the NHANES data management page listed under "See also".

import pandas as pd
import numpy as np
from nhanes_read_data_pandas import Z
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

"""
Create some simple summary tables from the NHANES data.  Print the
tables to the screen, and produce dotplots of the tables.
"""

The variable names in the NHANES data are hard to remember, so we write a search function that allows us to use substrings to search for variable names. The search is case-insensitive and returns a list of all matching column names. We use the search function to identify the names for four variables of interest.

def searchcol(s, Z):
    """
    Convenience function to aid in locating variables of interest
    in the merged NHANES data frame.
    """

    s = s.lower()
    U = [x for x in Z.columns if s in x.lower()]
    return U


## Retrieve some variable names

gender = searchcol("riagendr", Z)
gender = gender[0]

ethnicity = searchcol("ridreth1", Z)
ethnicity = ethnicity[0]

military = searchcol("dmqmilit", Z)
military = military[0]

marital = searchcol("dmdmartl", Z)
marital = marital[0]
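
To illustrate how the substring search behaves, here is a small sketch on an empty data frame. The frame toy and its column list are assumptions made for the example; the columns of the real merged data frame Z may differ.

```python
import pandas as pd

def searchcol(s, Z):
    # Same logic as above: case-insensitive substring match on column names.
    s = s.lower()
    return [x for x in Z.columns if s in x.lower()]

# Hypothetical stand-in for the merged NHANES data frame (columns only).
toy = pd.DataFrame(columns=["SEQN", "RIAGENDR", "RIDRETH1", "RIDAGEYR"])

one_match = searchcol("riagendr", toy)   # exactly one column matches
two_matches = searchcol("rid", toy)      # a short substring can match several
no_match = searchcol("xyz", toy)         # an empty list when nothing matches

print(one_match, two_matches, no_match)
```

Because the function returns a list, taking element [0] as in the script assumes that the substring identifies a single column. It raises an IndexError if nothing matches and silently picks the first hit if several columns match.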

The categorical variables in the NHANES data set are coded with numerical labels. The text descriptions for these codes are given in the NHANES demographics documentation (the URL appears in the code comment below). We construct a dictionary, keyed by variable name, whose values map each numeric code to its text description.

## Dictionaries for converting the numeric labels to descriptive text.
## See:
## http://www.cdc.gov/nchs/nhanes/nhanes2009-2010/DEMO_F.htm
codes = {}
codes[gender] = {1: "Male", 2: "Female"} 
codes[ethnicity] = {1: "Mexican American", 2: "Other Hispanic", 3: "Non-Hispanic White",
                    4: "Non-Hispanic Black", 5: "Other race/multiracial"}
codes[military] = {1: "Yes", 2: "No"}
codes[marital] = {1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated",
                  5: "Never married", 6: "Living with partner"}
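
The effect of such a code dictionary can be sketched on a short series of made-up values. The names marital_codes and raw are hypothetical, and the value 77 simply stands in for any code that is absent from the dictionary.

```python
import pandas as pd

marital_codes = {1: "Married", 2: "Widowed", 3: "Divorced", 4: "Separated",
                 5: "Never married", 6: "Living with partner"}

# Made-up raw codes; 77 is not a key of the dictionary.
raw = pd.Series([1, 5, 77, 3])

# Series.map looks up each value in the dictionary; values with no
# entry become missing (NaN).
labeled = raw.map(marital_codes)
print(labeled)
```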

This function creates a bivariate contingency table, changes the row and column labels from numeric labels to text descriptions, and drops rows and columns corresponding to uninteresting categories.

def make_table(f1, f2, Z):
    """
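    Create a contingency table of the variables f1 and f2 in Z.
    Also replaces numeric codes with the descriptive text label for
    each category, and drops categories that have no label.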
    """

    ## The counts
    A = pd.crosstab(Z.loc[:,f1], Z.loc[:,f2])

    ## Change the numeric codes to text labels
    A.rename(index=codes[f1], columns=codes[f2], inplace=True)

    ## Drop the uncoded categories
    A = A.reindex(index=list(codes[f1].values()),
                  columns=list(codes[f2].values()))

    return A
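
The three steps (count, relabel, drop unlabeled categories) can be traced on a small made-up data set. The frame toy, the columns g and m, the dictionary labels, and the code 7 are all hypothetical and only serve this sketch.

```python
import pandas as pd

# Made-up coded data; 7 stands in for a category with no text label.
toy = pd.DataFrame({
    "g": [1, 2, 2, 1, 2, 1],
    "m": [1, 2, 1, 7, 2, 1],
})
labels = {"g": {1: "Male", 2: "Female"}, "m": {1: "Yes", 2: "No"}}

# Step 1: the counts, with numeric row and column labels.
counts = pd.crosstab(toy["g"], toy["m"])

# Step 2: replace the numeric labels with text; the unlabeled 7 is kept as-is.
counts = counts.rename(index=labels["g"], columns=labels["m"])

# Step 3: keep only the labeled rows and columns, in dictionary order.
counts = counts.reindex(index=list(labels["g"].values()),
                        columns=list(labels["m"].values()))
print(counts)
```

Note that reindex would insert a row or column of missing values for any labeled category that never occurs in the data.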

The dotplot function shown next makes a dot plot of the contingency tables. We won't get into the details of this code now.

def dotplot(AR, ax):
    """
    Generate a dot plot of the within-row proportions in the table AR.
    The graph is plotted on the axes ax.
    """

    for loc, spine in ax.spines.items():
        if loc in ['bottom',]:
            spine.set_position(('outward',10))
        elif loc in ['right','top','left']:
            spine.set_color('none')

    ax.xaxis.set_ticks_position('bottom')
    ax.yaxis.set_ticks_position('none')

    ## Vertical coordinate of the row set labels
    yl = []

    ## For the legend
    V = []

    yy = 100
    for i in range(AR.shape[0]):
        ty = []
        for j in range(AR.shape[1]):
            ax.plot((0,1), (yy,yy), '-', color='grey')
            x = AR.iloc[i,j]
            v, = ax.plot((x,), (yy,), 'o', ms=8, color='rgbcmyk'[j],
                         alpha=0.8)
            if i == 0:
                V.append(v)
            ty.append(yy)
            yy -= 1
        yl.append(np.mean(ty))
        yy -= 5

    leg = ax.legend(V, AR.columns, loc=2, numpoints=1,
                    handletextpad=0.0001, bbox_to_anchor=(1.05,0.9))
    leg.draw_frame(False)

    ax.set_xlim(0, 1)
    dy = 100 - (yy+5)
    ey = dy*0.02
    ax.set_ylim(yy+5-ey, 102+ey)
    ax.set_yticks(yl)
    ax.set_yticklabels(AR.index)
    ax.set_xlabel("Proportion", size=17)

This function generates a bivariate contingency table for two given factors, normalizes it by rows and by columns, and prints the table of counts together with the two normalized versions of the table. It also uses the dotplot function to produce a dot plot of the within-row proportions. Note that the figure is resized vertically according to the number of rows that will appear in the dot plot.

def print_tables(f1, f2, Z):
    """
    Prints a two-way contingency table of variables f1 and f2 of Z to
    the screen.  The table is printed in three forms: cell counts,
    column-wise proportions, and row-wise proportions.
    """

    A = make_table(f1, f2, Z)

    ## The within-column proportions
    AC = A.apply(lambda x: x / float(sum(x)), axis=0)

    ## The within-row proportions
    AR = A.apply(lambda x: x / float(sum(x)), axis=1)

    print(A)
    print(AC)
    print(AR)
    print()

    m = AR.shape[0] * AR.shape[1]
    h = 2+0.2*m
    plt.figure(figsize=(9,h))
    plt.clf()
    ax = plt.axes([0.25,1/h,0.4,0.9-1/h])
    dotplot(AR, ax)
    pdf.savefig()
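
The two normalizations used in print_tables can be checked by hand on a tiny table. The sketch below uses made-up counts and writes the same computation with DataFrame.div instead of apply; this is one alternative formulation, not the approach taken in the script.

```python
import pandas as pd

# Made-up 2 x 2 table of counts.
A = pd.DataFrame([[2, 6], [3, 1]], index=["r1", "r2"], columns=["c1", "c2"])

# Within-column proportions: divide each column by its total,
# so that every column sums to 1.
AC = A.div(A.sum(axis=0), axis=1)

# Within-row proportions: divide each row by its total,
# so that every row sums to 1.
AR = A.div(A.sum(axis=1), axis=0)

print(AC)
print(AR)
```

Here row r1 has total 8, so its within-row proportions are 0.25 and 0.75, while column c1 has total 5, so its within-column proportions are 0.4 and 0.6.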

Finally, we call the functions we developed above to produce contingency tables and dot plots for four pairs of variables.

pdf = PdfPages("nhanes_tables.pdf")

print_tables(gender, ethnicity, Z)

print_tables(gender, military, Z)

print_tables(gender, marital, Z)

print_tables(ethnicity, marital, Z)

pdf.close()
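
The multi-page PDF pattern used above can also be written with PdfPages as a context manager, which closes the file even if a plotting call raises an error. The following is a standalone sketch, assuming a non-interactive backend; the file name example_tables.pdf, the temporary directory, and the plotted lines are placeholders for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

# Hypothetical output location for this example only.
out_path = os.path.join(tempfile.mkdtemp(), "example_tables.pdf")

with PdfPages(out_path) as pdf:
    for k in range(2):
        fig, ax = plt.subplots(figsize=(4, 3))
        ax.plot([0, 1], [k, k + 1], "o-")
        pdf.savefig(fig)   # append this figure as a new page
        plt.close(fig)     # release the figure once it has been saved

print(os.path.getsize(out_path) > 0)
```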

See also

NHANES data management

NHANES linear regression

NHANES logistic regression

NHANES smoothing analysis

Downloads

nhanes_tables.py

nhanes_read_data_pandas.py

nhanes_tables.pdf