NHANES Data Management

Contents:

Background and goals
The NHANES data
Reading the data using Python and NumPy
Reading the data using Pandas
See also
Downloads

Background and goals

The National Health and Nutrition Examination Survey (NHANES) is one of the most widely used datasets describing the health and socioeconomic status of people residing in the US. Here we illustrate two ways to read a subset of the NHANES data for subsequent analysis. The first approach uses basic Python and NumPy functionality. The second approach uses the Pandas library.

The NHANES data

NHANES is a survey that has been conducted in waves over many years. We will work with the data from the 2009-2010 wave.

NHANES includes many different types of assessments, which are stored in separate data files. NHANES data is provided as SAS XPORT files. You can download a Python reader for SAS XPORT files here. If you don't have the ability to install this package on your system, you can just copy the xport.py file from the xport directory of the package source tree into your working directory.

We will use five of the data files. Here are links to the data, and descriptions of the variables in the data files:

File	Description	Data
Demographic measures	Description	Data
Body measurements	Description	Data
Blood pressure	Description	Data
Nutrient intake day 1	Description	Data
Nutrient intake day 2	Description	Data

Reading the data using Python and NumPy

We will first write a module that can be used by other analysis scripts to obtain the data from the raw files. All the raw XPT files should be downloaded from the links given above to the NHANES web site, and placed in the directory ../Data relative to the analysis scripts. The module for reading the data is called nhanes_read_data_numpy.py. We start with the usual import statements, and a brief description of what the module does.

import xport
import numpy as np
import os

"""
Read five NHANES data files and merge them into a 2d array.

These values can be exported:

Z : an array of the data, with rows corresponding to subjects, and
    columns corresponding to variables

VN : an array of variable names, in 1-1 correspondence with the
     columns of Z; the variable names consist of the file name,
     followed by a ":", followed by the variable name

KY : the sequence numbers (subject identifiers), in 1-1 correspondence 
     with the rows of Z
"""

## Data file names (the files are in ../Data)
FN = ["DEMO_F.XPT", "BMX_F.XPT", "BPX_F.XPT", "DR1TOT_F.XPT", "DR2TOT_F.XPT"]

Next we write a function that reads a single data file, and produces two results: a dictionary Z mapping sequence numbers (subject identifiers) to lists containing the data for the given subject, and a list H of variable names.

def get_data(fname):
    """ 
    Place all the data in the file `fname` into a dictionary indexed 
    by sequence number.

    Arguments:
    ----------
    fname : The file name of the data

    Returns:
    --------
    Z : A dictionary mapping the sequence numbers to lists of data values
    H : The names of the variables, in the order they appear in Z
    """

    ## The data, indexed by sequence number
    Z = {}

    ## The variable names, in the order they appear in values of Z.
    H = None

    with xport.XportReader(fname) as reader:
        for row in reader:

            ## Get the variable names from the first record
            if H is None:
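                ## sorted() returns a new list, so this also works when
                ## keys() returns a view rather than a list (Python 3)
                H = sorted(k for k in row.keys() if k != "SEQN")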

            Z[row["SEQN"]] = [row[k] for k in H]

    return Z,H
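To see what get_data produces without needing an XPT file, the same indexing logic can be sketched on a few in-memory records. The records below are made-up stand-ins for the rows yielded by the reader, and rows_to_dict is a hypothetical helper name used only for this illustration.

```python
# Hypothetical records standing in for the rows yielded by the XPORT reader;
# the values are invented for illustration.
rows = [
    {"SEQN": 51624.0, "RIDAGEYR": 34.0, "RIAGENDR": 1.0},
    {"SEQN": 51625.0, "RIDAGEYR": 4.0, "RIAGENDR": 1.0},
]

def rows_to_dict(rows):
    """Index the records by SEQN, mirroring the logic of get_data."""
    Z, H = {}, None
    for row in rows:
        # Take the variable names (minus SEQN) from the first record
        if H is None:
            H = sorted(k for k in row if k != "SEQN")
        Z[row["SEQN"]] = [row[k] for k in H]
    return Z, H

Z, H = rows_to_dict(rows)
print(H)
print(Z[51624.0])
```

Each value of Z is a plain list, so the position of a variable in H is the only link between a number and its meaning.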

Now we call the get_data function on each of the data files. We also prepend the file name to each variable name, since a few variable names are identical between files.

## Read all the data files
D,VN = [],[]
for fn in FN:
    fn_full = os.path.join("../Data/", fn)
    X,H = get_data(fn_full)
    s = fn.replace(".XPT", "")
    H = [s + ":" + x for x in H]
    D.append(X)
    VN += H

Next we merge the five files, using only the subjects who are present in all of them. First, we need to obtain the identifiers of these subjects.

## The sequence numbers that are in all data sets
KY = set(D[0].keys())
for d in D[1:]:
    KY &= set(d.keys())
KY = list(KY)
KY.sort()
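The intersection step can be checked on a small example. The three dictionaries below are toy stand-ins for the per-file dictionaries in D; only the keys matter here.

```python
# Toy stand-ins for the per-file dictionaries; keys play the role of SEQN.
D = [
    {1: ["a"], 2: ["b"], 3: ["c"]},
    {2: ["d"], 3: ["e"], 4: ["f"]},
    {3: ["g"], 2: ["h"]},
]

# Keep only the keys that occur in every dictionary, in sorted order
KY = set(D[0].keys())
for d in D[1:]:
    KY &= set(d.keys())
KY = sorted(KY)
print(KY)
```

Subjects 1 and 4 are dropped because each is missing from at least one of the dictionaries.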

Now we are ready to do the merge, and convert everything to numbers.

def to_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return float("nan")

## Merge the data
Z = []
for ky in KY:

    z = []

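    ## Concatenate this subject's values from all five files.  (An
    ## explicit loop is used because in Python 3 a bare map() call is
    ## lazy and would never call z.extend.)
    for d in D:
        z.extend(d[ky])

    ## Convert every value to a float; non-numeric values become NaN
    z = [to_float(a) for a in z]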

    Z.append(z)

Z = np.array(Z)
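Once Z, VN, and KY are available, an analysis script looks up a column by its prefixed name. The following is a minimal sketch on a toy array; the numbers are invented, and the two variable names are used purely as examples of the "file:variable" naming scheme.

```python
import numpy as np

# Toy versions of the three exported values (invented numbers)
VN = ["DEMO_F:RIDAGEYR", "BMX_F:BMXWT"]
KY = [1.0, 2.0, 3.0]
Z = np.array([[34.0, 80.1],
              [4.0, float("nan")],
              [16.0, 55.2]])

# Locate the column by name, then drop the missing values before averaging
j = VN.index("BMX_F:BMXWT")
wt = Z[:, j]
ok = ~np.isnan(wt)
mean_wt = wt[ok].mean()
print(mean_wt)
```

Because to_float maps unparseable entries to NaN, any summary computed from Z has to handle missing values explicitly, as the mask ok does here.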

Reading the data using Pandas

Pandas provides a simpler way to read the NHANES data. The script below still uses the xport.py module to parse the SAS XPORT files, as above (recent versions of Pandas can also read this format directly with pd.read_sas). Thus the top part of the script is unchanged, except that we now need to import Pandas. Conventionally it is imported under the name pd.

import xport
import numpy as np
import pandas as pd
import os

"""
Read five NHANES data files using Pandas and merge them into a data frame.

This value can be exported:

Z : a Pandas data frame containing all the data
"""

## Data file names (the files are in ../Data)
FN = ["DEMO_F.XPT", "BMX_F.XPT", "BPX_F.XPT", "DR1TOT_F.XPT", "DR2TOT_F.XPT"]

First we write a function that loads a single data file into a Pandas data frame.

def get_data(fname):
    """ 
    Load all the data in the file `fname` into a Pandas data frame
    indexed by sequence number.

    Arguments:
    ----------
    fname : The file name of the data

    Returns:
    --------
    A : A Pandas DataFrame containing all the data in the file
    """

    ## The data, consisting of a list of dictionaries mapping
    ## variable names to values.  Each element of the list
    ## contains the data for one subject
    with xport.XportReader(fname) as reader:
        Z = [row for row in reader]

    ## We want to use the sequence number as the index for the data frame,
    ## and we don't want it as a variable.  Using 'pop' accomplishes both
    ## of these things for us.
    Ix = [z.pop("SEQN") for z in Z]

    ## Create the data frame
    A = pd.DataFrame(Z, index=Ix)

    return A
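The effect of the pop idiom is easiest to see on a couple of records. The rows below are invented stand-ins for what the reader yields.

```python
import pandas as pd

# Invented records standing in for rows read from an XPT file
rows = [
    {"SEQN": 1.0, "BMXWT": 80.1, "BMXHT": 175.0},
    {"SEQN": 2.0, "BMXWT": 55.2, "BMXHT": 160.3},
]

# pop returns the sequence number and removes it from each record
Ix = [r.pop("SEQN") for r in rows]

# Each remaining key becomes a column; Ix labels the rows
A = pd.DataFrame(rows, index=Ix)
print(A)
```

Note that pop modifies the records in place, so the original list no longer contains the SEQN values after this step.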

Next we loop over all the files that we are interested in and read them. As above, we prepend the file name to each variable name, since a few of the variable names appear in several of the files.

## Read all the data files
D = []
for fn in FN:
    fn1 = os.path.join("../Data/", fn)
    X = get_data(fn1)
    s = fn.replace(".XPT", "")
    H = {x: s + ":" + x for x in X.columns}
    X.rename(columns=H, inplace=True)
    D.append(X)
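As an aside, the renaming step can also be written with the DataFrame method add_prefix, which returns a new data frame instead of modifying X in place. This is an alternative sketch on a toy data frame, not the approach used in the script above.

```python
import pandas as pd

# Toy data frame standing in for one file's data (invented values)
X = pd.DataFrame({"BMXWT": [80.1, 55.2], "BMXHT": [175.0, 160.3]},
                 index=[1.0, 2.0])

fn = "BMX_F.XPT"
s = fn.replace(".XPT", "")

# Prepend "BMX_F:" to every column name
X2 = X.add_prefix(s + ":")
print(list(X2.columns))
```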

Pandas tries to determine the data type of each column automatically, so we don't usually need to do any explicit conversions from strings. However, Pandas sometimes doesn't choose the data type that you would like, so it is worth checking the data types using D[0].dtypes (for the "DEMO_F.XPT" file in this case).
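If a column does come out with an unwanted type, one option is to convert it with pd.to_numeric. The sketch below uses an invented data frame in which one column was read as strings; errors="coerce" turns entries that cannot be parsed into NaN, which parallels the to_float helper of the NumPy version.

```python
import pandas as pd

# Invented example: the age column was read as strings
X = pd.DataFrame({
    "DEMO_F:RIDAGEYR": pd.Series(["34", "4", "unknown"], dtype=object),
    "DEMO_F:RIAGENDR": [1.0, 2.0, 1.0],
})
print(X.dtypes)

# Convert to numbers; unparseable entries become NaN
X["DEMO_F:RIDAGEYR"] = pd.to_numeric(X["DEMO_F:RIDAGEYR"], errors="coerce")
print(X.dtypes)
```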

Pandas handles all the work of concatenation, aligning the rows of the five data frames on their index (the sequence number).

## Merge all the data files
Z = pd.concat(D, axis=1)
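One difference from the NumPy version is worth noting: by default pd.concat keeps every subject that appears in any file and fills the gaps with NaN, whereas the NumPy script kept only subjects present in all files. Passing join="inner" reproduces the latter behavior. The two toy data frames below (invented values) illustrate the difference.

```python
import pandas as pd

# Toy data frames indexed by sequence number; subject 1.0 is missing from B
A = pd.DataFrame({"DEMO_F:RIDAGEYR": [34.0, 4.0, 16.0]}, index=[1.0, 2.0, 3.0])
B = pd.DataFrame({"BMX_F:BMXWT": [55.2, 80.1]}, index=[2.0, 3.0])

# Default (outer) join: all subjects kept, missing values filled with NaN
Z_outer = pd.concat([A, B], axis=1)

# Inner join: only subjects present in every data frame
Z_inner = pd.concat([A, B], axis=1, join="inner")

print(Z_outer.shape, Z_inner.shape)
```

Which behavior is preferable depends on the analysis; the outer join retains more subjects but leaves missing values to be handled later.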
