The National Health and Nutrition Examination Survey (NHANES) is one of the most widely used datasets describing the health and socioeconomic status of people residing in the US. Here we illustrate two ways to read a subset of the NHANES data for subsequent analysis. The first approach uses basic Python and NumPy functionality. The second approach uses the Pandas library.
NHANES is a survey that has been conducted in waves over many years. We will work with the data from the 2009-2010 wave.
NHANES includes many different types of assessments, which are stored in separate data files. NHANES data is provided as SAS XPORT files. You can download a Python reader for SAS XPORT files here. If you don't have the ability to install this package on your system, you can just copy the `xport.py` file from the xport directory of the package source tree into your working directory.
We will use five of the data files. Here are links to the data, and descriptions of the variables in the data files:
File | Description | Data |
---|---|---|
Demographic measures | Description | Data |
Body measurements | Description | Data |
Blood pressure | Description | Data |
Nutrient intake day 1 | Description | Data |
Nutrient intake day 2 | Description | Data |
We will first write a module that can be used by other analysis scripts to obtain the data from the raw files. All the raw XPT files should be downloaded from the links given above to the NHANES web site, and placed in the directory `../Data` relative to the analysis scripts. The module for reading the data is called `nhanes_read_data_numpy.py`.
We start with the usual import statements, and a brief description of
what the module does.
import xport
import numpy as np
import os

"""
Read five NHANES data files and merge them into a 2d array.

These values can be exported:

Z : an array of the data, with rows corresponding to subjects, and
    columns corresponding to variables
VN : a list of variable names, in 1-1 correspondence with the
     columns of Z; the variable names consist of the file name,
     followed by a ":", followed by the variable name
KY : a sorted list of the sequence numbers (subject identifiers),
     in 1-1 correspondence with the rows of Z
"""

## Data file names (the files are in ../Data)
FN = ["DEMO_F.XPT", "BMX_F.XPT", "BPX_F.XPT", "DR1TOT_F.XPT", "DR2TOT_F.XPT"]
Next we write a function that reads a single data file and produces two results: a dictionary `Z` mapping sequence numbers (subject identifiers) to lists containing the data for the given subject, and a list `H` of variable names.
def get_data(fname):
    """
    Place all the data in the file `fname` into a dictionary indexed
    by sequence number.

    Arguments:
    ----------
    fname : The file name of the data

    Returns:
    --------
    Z : A dictionary mapping the sequence numbers to lists of data values
    H : The names of the variables, in the order they appear in Z
    """

    ## The data, indexed by sequence number
    Z = {}

    ## The variable names, in the order they appear in values of Z.
    H = None

    with xport.XportReader(fname) as reader:
        for row in reader:

            ## Get the variable names from the first record.  sorted()
            ## works whether keys() returns a list or a view object.
            if H is None:
                H = sorted(k for k in row.keys() if k != "SEQN")

            Z[row["SEQN"]] = [row[k] for k in H]

    return Z, H
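
To see what `get_data` produces without needing an XPT file, the same indexing logic can be sketched on a few in-memory records. The helper name `rows_to_dict` and the record values below are made up for illustration; only the pattern (sorted variable names, one list of values per sequence number) follows the function above.

```python
def rows_to_dict(rows, key="SEQN"):
    """Index an iterable of dict records by `key` (hypothetical helper)."""
    Z = {}
    H = None
    for row in rows:
        if H is None:
            # Sorted variable names, excluding the identifier itself
            H = sorted(k for k in row.keys() if k != key)
        Z[row[key]] = [row[k] for k in H]
    return Z, H

# Made-up records standing in for rows read from an XPT file
rows = [
    {"SEQN": 51624.0, "BMXWT": 87.4, "BMXHT": 164.7},
    {"SEQN": 51625.0, "BMXWT": 17.0, "BMXHT": 105.4},
]

Z, H = rows_to_dict(rows)
print(H)           # ['BMXHT', 'BMXWT']
print(Z[51624.0])  # [164.7, 87.4]
```

Because the names in `H` are sorted once and reused for every record, the position of a value in each list always identifies its variable.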
Now we call the `get_data` function on each of the data files. We also modify the variable names so that they include the file name (since a few variable names are identical between files).
## Read all the data files
D, VN = [], []
for fn in FN:
    fn_full = os.path.join("../Data/", fn)
    X, H = get_data(fn_full)
    s = fn.replace(".XPT", "")
    H = [s + ":" + x for x in H]
    D.append(X)
    VN += H
Next we merge the five files, using only the subjects who are present in all of them. First, we need to obtain the identifiers of these subjects.
## The sequence numbers that are in all data sets
KY = set(D[0].keys())
for d in D[1:]:
    KY &= set(d.keys())
KY = list(KY)
KY.sort()
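
The intersection step can be checked on toy data. The sketch below uses three small dictionaries with made-up integer keys in place of the real per-file dictionaries, and computes the common keys in one expression with `set.intersection`, which should give the same result as the loop above.

```python
# Three toy dictionaries keyed by made-up subject identifiers
D = [
    {1: ["a"], 2: ["b"], 3: ["c"]},
    {2: ["d"], 3: ["e"], 4: ["f"]},
    {3: ["g"], 2: ["h"]},
]

# Keys present in every dictionary, in sorted order
KY = sorted(set.intersection(*(set(d) for d in D)))
print(KY)  # [2, 3]
```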
Now we are ready to do the merge, and convert everything to numbers.
def to_float(x):
    try:
        return float(x)
    ## TypeError covers values such as None
    except (ValueError, TypeError):
        return float("nan")

## Merge the data
Z = []
for ky in KY:
    z = []
    ## Use an explicit loop here: in Python 3, map() is lazy, so
    ## map(z.extend, ...) on its own would leave z empty.
    for d in D:
        z.extend(d[ky])
    z = [to_float(a) for a in z]
    ## equivalent to
    ##   z = list(map(to_float, z))
    Z.append(z)
Z = np.array(Z)
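
Once the module has run, an analysis script would typically pull out a column by its name and summarize it. The following is a minimal sketch using toy stand-ins for `Z`, `VN` and `KY`; the numbers are invented, the variable names are only illustrative, and `get_column` is a hypothetical helper rather than part of the module.

```python
import numpy as np

# Toy stand-ins for the module's exported values (all values are made up)
VN = ["DEMO_F:RIDAGEYR", "BMX_F:BMXBMI"]
KY = [1.0, 2.0, 3.0]
Z = np.array([[34.0, 27.1],
              [61.0, np.nan],
              [8.0, 16.3]])

def get_column(Z, VN, name):
    """Return the column of Z whose variable name is `name` (hypothetical helper)."""
    return Z[:, VN.index(name)]

bmi = get_column(Z, VN, "BMX_F:BMXBMI")

# nanmean skips the NaN entries that to_float produces for unparseable values
mean_bmi = np.nanmean(bmi)
n_observed = int(np.sum(~np.isnan(bmi)))
print(mean_bmi, n_observed)
```

Since missing entries are stored as NaN, the NaN-aware NumPy functions (`np.nanmean`, `np.nanstd`, and so on) are usually the safer choice for summaries.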
Pandas provides a simpler way to read the NHANES data. Here we still use the `xport.py` module as above to read the SAS XPORT files (recent versions of Pandas can also read them directly with `pd.read_sas`). Thus the top part of the script is unchanged, except that we now need to import Pandas. Conventionally this is renamed as `pd`.
import xport
import numpy as np
import pandas as pd
import os

"""
Read five NHANES data files using Pandas and merge them into a single
data frame.

This value can be exported:

Z : a Pandas data frame containing all the data
"""

## Data file names (the files are in ../Data)
FN = ["DEMO_F.XPT", "BMX_F.XPT", "BPX_F.XPT", "DR1TOT_F.XPT", "DR2TOT_F.XPT"]
First we write a function that loads a single data file into a Pandas data frame.
def get_data(fname):
    """
    Place all the data in the file `fname` into a Pandas data frame
    indexed by sequence number.

    Arguments:
    ----------
    fname : The file name of the data

    Returns:
    --------
    A : A Pandas DataFrame containing all the data in the file
    """

    ## The data, consisting of a list of dictionaries mapping
    ## variable names to values.  Each element of the list
    ## contains the data for one subject
    with xport.XportReader(fname) as reader:
        Z = [row for row in reader]

    ## We want to use the sequence number as the index for the data frame,
    ## and we don't want it as a variable.  Using 'pop' accomplishes both
    ## of these things for us.
    Ix = [z.pop("SEQN") for z in Z]

    ## Create the data frame
    A = pd.DataFrame(Z, index=Ix)

    return A
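
The `pop` pattern can be tried on in-memory records, as in the sketch below. The helper `frame_from_records` and the record values are made up for illustration. The second function, `get_data_read_sas`, is an alternative that is not part of the original script: it assumes a Pandas version that provides `pd.read_sas`, and it is defined here but not called, since it needs an actual XPT file.

```python
import pandas as pd

def frame_from_records(records, key="SEQN"):
    """Build a data frame indexed by `key` from a list of dicts (hypothetical helper)."""
    # Copy each record so the caller's dictionaries are not modified by pop
    records = [dict(r) for r in records]
    index = [r.pop(key) for r in records]
    return pd.DataFrame(records, index=index)

def get_data_read_sas(fname):
    """Alternative reader using pandas.read_sas instead of xport.py (an assumption,
    not the approach used in the script above)."""
    return pd.read_sas(fname, format="xport").set_index("SEQN")

# Made-up records standing in for rows read from an XPT file
records = [
    {"SEQN": 51624.0, "BMXWT": 87.4},
    {"SEQN": 51625.0, "BMXWT": 17.0},
]

A = frame_from_records(records)
print(A)
```

Note that `pop` modifies each dictionary in place, which is harmless in `get_data` because the list of rows is not used again, but is the reason the sketch copies the records first.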
Next we loop over all the files that we are interested in and read them. As above, we prepend the file name to each variable name, since a few of the variables are used in several of the files.
## Read all the data files
D = []
for fn in FN:
    fn1 = os.path.join("../Data/", fn)
    X = get_data(fn1)
    s = fn.replace(".XPT", "")
    H = {x: s + ":" + x for x in X.columns}
    X.rename(columns=H, inplace=True)
    D.append(X)
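
As a possible shortcut, the renaming dictionary can be replaced by `DataFrame.add_prefix`, which prepends a string to every column label. The sketch below applies it to a one-column toy frame with invented values; the column name is only illustrative.

```python
import pandas as pd

# Toy frame standing in for the result of get_data (values are made up)
X = pd.DataFrame({"WTDRD1": [1.0, 2.0]}, index=[51624.0, 51625.0])

fn = "DR1TOT_F.XPT"
s = fn.replace(".XPT", "")

# add_prefix returns a new frame with the prefix added to each column name
X = X.add_prefix(s + ":")
print(list(X.columns))  # ['DR1TOT_F:WTDRD1']
```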
Pandas tries to determine the data type automatically, so we don't usually need to do any explicit string conversions. However, sometimes Pandas doesn't use the data type that you would like, so you can check the data types using `D[0].dtypes` (for the "DEMO_F.XPT" file in this case).
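
When a column does come out with an unwanted type, one option is to convert it explicitly with `pd.to_numeric`. The sketch below uses a small invented frame in which one column holds strings; the column names and values are hypothetical and do not come from the NHANES files.

```python
import pandas as pd

# Invented data: "age" was read as strings, one of which is not a number
X = pd.DataFrame({"age": ["34", "61", "refused"],
                  "bmi": [27.1, 30.2, 16.3]})

# Inspect the inferred types; "age" is not numeric at this point
print(X.dtypes)

# errors="coerce" turns values that cannot be parsed into NaN
X["age"] = pd.to_numeric(X["age"], errors="coerce")
print(X.dtypes)
```

This mirrors what `to_float` does in the NumPy version: anything that cannot be parsed as a number becomes a missing value.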
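Pandas handles all the work of concatenation, aligning the rows of the data frames by their index (the sequence numbers). Note that by default `pd.concat` keeps every subject that appears in any of the files and fills in missing values, rather than keeping only the subjects present in all of the files.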
## Merge all the data files
Z = pd.concat(D, axis=1)
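
To reproduce the behavior of the NumPy version, which keeps only subjects present in every file, `pd.concat` accepts `join="inner"`. The following sketch contrasts the two settings on two toy frames with made-up values.

```python
import pandas as pd

# Two toy frames indexed by made-up subject identifiers
a = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[1, 2, 3])
b = pd.DataFrame({"y": [10.0, 20.0]}, index=[2, 3])

# Default (join="outer"): union of the indices, missing cells become NaN
outer = pd.concat([a, b], axis=1)

# join="inner": only the identifiers found in every frame
inner = pd.concat([a, b], axis=1, join="inner")

print(outer)
print(inner)
```

Which setting is appropriate depends on the analysis: the outer join keeps more subjects but introduces missing values that later steps must handle.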