Statistics 506, Fall 2016

R Dataframes and dplyr


Dataframes in R

An R dataframe (usually written data.frame in R code) is a means for storing a rectangular dataset in a single object. In such a dataset, the rows are observations and the columns are variables. Each variable should have a consistent data type over all the observations, but different variables can have different data types.

For example, the following data set could be represented as a dataframe with three variables. These variables have date, string, and numeric type, respectively (we don’t need to worry right now about how these types map onto concrete R types).

Birthdate   State        Balance
--------------------------------
9-13-1992   Maryland     5000
6-29-1986   Ohio         7520
12-9-1979   California   12739

A dataframe is an object whose class is data.frame and whose underlying data type is list. One common way to obtain a dataframe in R is by reading a dataset from a file using read.table or read.csv.

If you want to create a literal dataframe in your R script, you can do the following:

birthdate = c("9-13-1992", "6-29-1986", "12-9-1979")
state = c("Maryland", "Ohio", "California")
balance = c(5000, 7520, 12739)
# stringsAsFactors=FALSE keeps the string columns as character vectors
# rather than converting them to factors.
da = data.frame(Birthdate=birthdate, State=state, Balance=balance,
                stringsAsFactors=FALSE)

You can check that da has list data type and data.frame class:

typeof(da)   # "list"
class(da)    # "data.frame"

Since the basic type of a dataframe is a list, and the list elements are the variables, you can access the variable information the same way you would access elements of a list, e.g. to access the (entire) State variable use

da$State

or

da[["State"]]

Dataframes have a names attribute for column names and a row.names attribute for row names (analogous to the dimnames of a matrix). You can access individual elements of a dataframe as if you were indexing a two-dimensional array, e.g.

da[2, "State"]

will return “Ohio”. This is sometimes called label-based indexing.

Slicing also works. You can slice using integer ranges, e.g.

da[2:3, "State"]
da[2:3, 2:3]

but you cannot use row or column labels in a slice range.
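
You can, however, select multiple labeled columns by passing a character vector of names, e.g.

da[2:3, c("State", "Balance")]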

Unlike arrays, the columns of a dataframe can have different types (e.g. double or character). To learn the type of a particular column of a dataframe use typeof on the column, for example

typeof(da$State)

will be “character”.
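
To see the types of all columns at once, you can apply typeof over the columns:

sapply(da, typeof)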

You can add a variable to a dataframe simply by assigning it a value

da$Country = c("USA", "USA", "USA")

You can also use da$Country = "USA" to initialize the column with the same value for all rows.
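
Conversely, assigning NULL to a column removes it from the dataframe:

da$Country = NULL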

Row and column names in a dataframe should be distinct, but R cannot always enforce this. If you change the names, e.g. using

names(da) = c("a", "b", "b")

then you will have a dataframe with non-unique column names. In this case, indexing da[2, "b"] will return the value from the first (left-most) column named b. When loading a dataframe from a file or creating a dataframe from a list using as.data.frame, R will adjust the names to make them unique.

dplyr

dplyr is an R package for manipulating datasets, including data stored in dataframes as well as data in other storage formats. Since dplyr is not part of base R, you need to load it in every script that calls a dplyr function, using:

library(dplyr)

If you have never used dplyr you will first need to install it using:

install.packages("dplyr")

There are several on-line tutorials and other references for dplyr, including the package’s own documentation and vignettes.

We will illustrate some of the key functionality of dplyr here, using data from the 2009 Residential Energy Consumption Survey (RECS), which is available on-line from the U.S. Energy Information Administration.

library(dplyr)

# The argument to read.csv should be replaced with a path to the
# data file.  You can compress the data file using gzip if you
# want.
da = read.csv("/var/tmp/recs2009_public.csv.gz")

dplyr has a number of basic functions for manipulating datasets; we won’t review all of them here. One particularly useful function is filter, which selects rows from a dataset based on conditions (logical expressions) involving the variables. Here is an example in which we retain only the buildings that were constructed in 1980 or later (YEARMADE >= 1980), that are single family homes (TYPEHUQ == 2), and that are in a rural area (UR == "R").

da1 = filter(da, YEARMADE >= 1980, TYPEHUQ == 2, UR == "R")
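
Multiple conditions given to filter are combined with a logical “and”. For an “or” condition, write the expression explicitly with the | operator, e.g. to retain buildings that are either newer or rural (da2 is just an illustrative name):

da2 = filter(da, YEARMADE >= 1980 | UR == "R")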

Split-apply-combine

dplyr emphasizes a pattern called “split-apply-combine” for thinking about many operations that arise in data management and analysis. The basic idea is that you first split a dataset based on some condition, then you apply a function to each piece that results from the split, then you somehow combine these results.

A basic illustration of split-apply-combine is to split the survey data by region, then calculate the mean year of construction (YEARMADE) for buildings within each region (the census region codes are given in the RECS codebook):

gb = group_by(da, REGIONC)
summarize(gb, YEARMADE=mean(YEARMADE))

The summarize function creates a new dataframe with variables as defined through named arguments in the function call. This particular call to summarize creates a summary variable called YEARMADE which contains the mean value of the original YEARMADE variable within each level of REGIONC.
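
A single call to summarize can define several summary variables at once; the names mean_year and min_year below are arbitrary:

summarize(gb, mean_year=mean(YEARMADE), min_year=min(YEARMADE))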

A similar but slightly more complex analysis looks at the average size of each building in terms of the number of rooms, grouped based on year of construction. The construction year is cut into five-year intervals to make the results more stable and easier to read as a table.

da = mutate(da, YEARMADE5=cut(YEARMADE, seq(1920, 2015, by=5)))
gb = group_by(da, YEARMADE5)
summarize(gb, TOTROOMS=mean(TOTROOMS), n=n())

The n() function calculates the sample size within each level of the grouping variable (which is YEARMADE5 here) and creates a new variable (here called n) to hold these values.

Note that the mutate function is equivalent to creating a new variable with the usual assignment syntax. The advantage of using mutate is that the function arguments can refer to the variables by name without referencing the dataframe, e.g. below in cut we have YEARMADE not da$YEARMADE. The variable names inside mutate automatically refer to the dataframe provided in the first argument to mutate.

dax = mutate(da, YEARMADE5=cut(YEARMADE, seq(1920, 2015, by=5)))

# Equivalent approach using conventional assignment
day = data.frame(da)
day$YEARMADE5 = cut(day$YEARMADE, seq(1920, 2015, by=5))

# Confirm that the results are identical
stopifnot(isTRUE(all.equal(dax, day)))

It’s also possible to group the data based on two or more variables. Next we will cut the year of construction into two intervals, pre-1970 and post-1970, and then group the data based on the combined values of this variable and the climate region (a categorical variable with five levels, see the codebook for the level names).

da = mutate(da, YEARMADE50=cut(YEARMADE, c(1920, 1970, 2015)))
gb = group_by(da, YEARMADE50, Climate_Region_Pub)
summarize(gb, n=n())

It’s easier to understand the results of this analysis by calculating proportions within each level of YEARMADE50, which can be done as follows:

sf = summarize(gb, n=n())
mutate(sf, f=n/sum(n))

Note that sum(n) sums over the levels of the last grouping variable, Climate_Region_Pub, because summarize drops that variable from the grouping in its result. Thus the proportions are calculated within the groups defined by the remaining grouping variables, which in this case is only YEARMADE50.

Exercise

Produce a table containing the fraction of houses that have basements (CELLAR == 1) within each level of REPORTABLE_DOMAIN (which is basically a state or group of adjacent states). Restrict this analysis to single family detached homes (TYPEHUQ == 2). Sort your table from least to greatest, then from greatest to least. What are some of the states with the lowest and highest frequencies of basements?

Pipes

In the examples above we applied a series of dplyr functions, with each function operating on the output of the previous function. To make this work, we needed to name each returned value. Pipes allow us to do the same thing without naming any of the intermediate values (which we generally won’t use again anyway).

In R, a pipe is the operator %>% which takes the output of one function and passes it as the first argument of the next function. As a simple example, the following code

sum(1, 2) %>% sum(3)

is equivalent to

sum(sum(1, 2), 3)

which yields 6. We can use this to simplify what we did above:

mutate(da, YEARMADE50=cut(YEARMADE, c(1920, 1970, 2015))) %>%
   group_by(YEARMADE50, Climate_Region_Pub) %>%
   summarize(n=n()) %>%
   mutate(f=n/sum(n))

If you look up the climate codes you will see that construction in cold climates declined after 1970 and construction in hot, humid climates increased. Construction in the other climate regions is fairly constant as a proportion of the total.

Joining data sets with dplyr

Several libraries in R implement methods for joining data sets. Here we will look at the join methods in dplyr.

The basic idea of any join is to take two datasets and combine their information into a single new dataset. As a simple example, you may have dataframes da1 and da2, each containing data on human subjects. The variables (columns) in da1 and da2 may be different or may have partial overlap. The rows of the two datasets correspond to overlapping (not necessarily identical) sets of people. However the rows are generally in different orders in the two datasets, and some people may be present in only one of the two datasets. The join operation lines up the two dataframes so that the aligned rows correspond to the same person, and creates a new dataframe from this result.
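
As a small sketch of this idea, using invented data and the inner_join function discussed below:

d1 = data.frame(id=c(1, 2, 3), age=c(27, 22, 35))
d2 = data.frame(id=c(2, 3, 4), weight=c(60, 75, 80))
inner_join(d1, d2, by="id")   # keeps only id 2 and 3, with both age and weight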

To be concrete, we will work with a few of the many datasets from the U.S. government’s “National Health and Nutrition Examination Survey” (NHANES). To begin, download these two NHANES datafiles through the web, or through the shell using wget:

http://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT

http://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPX_F.XPT
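
Alternatively, you can download the files from within R using download.file; the destination paths below are just examples:

# mode="wb" ensures a binary-safe download on all platforms.
download.file("http://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT",
              "/var/tmp/DEMO_F.XPT", mode="wb")
download.file("http://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPX_F.XPT",
              "/var/tmp/BPX_F.XPT", mode="wb")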

You will find codebooks for these files on the NHANES website.

The NHANES data files are in an archaic format called “SAS XPORT”. You can read the files into R using the read.xport function in the foreign library.

library(foreign)

demo = read.xport("/var/tmp/DEMO_F.XPT")
bpx = read.xport("/var/tmp/BPX_F.XPT")

You can now use head to take a peek at the first few rows of each file:

head(demo)
head(bpx)

All joins use an “index” (or “key”) variable to line up rows in the datasets being joined. For the NHANES datasets we will join on the SEQN variable, which is a unique subject identifier. We can confirm that the SEQN values are distinct within each dataset, so we will not need to deal with “many-to-one” joins yet.

nrow(demo) == length(unique(demo$SEQN))
nrow(bpx) == length(unique(bpx$SEQN))

However, comparing the dimensions, it is clear that some people are missing from one or the other of the two datasets. The setdiff calls below show which SEQN values appear in only one of them:

setdiff(demo$SEQN, bpx$SEQN)
setdiff(bpx$SEQN, demo$SEQN)

Simple joins

Perhaps the most commonly used join operation is the inner_join, which joins based on rows that are present in both datasets.

da = inner_join(demo, bpx, by="SEQN")
dim(da)

Based on the dimensions, we see that this join only includes people who are present in both datasets, and it includes all columns from both datasets. Note that the by keyword argument defines the index variable. If the by argument is omitted, by default the join takes place on all variables that have the same name in both datasets. Also, if the index variable has different names in the two datasets, you can use the syntax

da = inner_join(demo, bpx, by=c("SEQN" = "SEQN"))

where, e.g., by=c("S1" = "S2") would use S1 as the index variable for the first dataset and S2 as the index variable for the second dataset.
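
For example, with the hypothetical index names S1 and S2 and invented data:

d1 = data.frame(S1=c(1, 2), x=c(10, 20))
d2 = data.frame(S2=c(2, 1), y=c(30, 40))
inner_join(d1, d2, by=c("S1" = "S2"))   # matches rows on S1 = S2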

The semi_join is similar to the inner_join, except that the columns of the second dataset (bpx here) are not included. In other words, it drops the rows from the first dataset that are not in the second dataset, but only includes the columns from the first dataset.

dx = semi_join(demo, bpx, by="SEQN")
dim(dx)

Sometimes it is useful to know which rows do not match in a join. The anti_join function returns the rows of the first dataset that have no match in the second:

fx = anti_join(demo, bpx, by="SEQN")

We can also join by forcing the result to have exactly the same rows as the first dataframe:

da = left_join(demo, bpx, by="SEQN")
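
Subjects present in demo but missing from bpx are retained, with NA values in the bpx columns. As a check, the result has exactly as many rows as demo:

stopifnot(nrow(da) == nrow(demo))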

Many-to-one joins and wide/long conversion

Many datasets have repeated measures for each subject. Datasets with repeated measures may be stored either in “long form” with one row per observation, or in “wide form”, with one row per subject.

Here is a simple dataset in wide form, containing data on subjects’ body mass index (BMI) values during two clinical visits:

Subject_id    Age       Visit_1     Visit_2
-------------------------------------------
1             27        33          31
2             22        18          22
3             35        21          24
4             41        25          27

Here is the same dataset in long form:

Subject_id    Visit     Age     BMI
-------------------------------------
1             1         27      33
1             2         27      31
2             1         22      18
2             2         22      22
3             1         35      21
3             2         35      24
4             1         41      25
4             2         41      27

The tidyr package is useful for converting between these formats. For example, the NHANES study records up to three systolic blood pressure measurements for each subject, named BPXSY1, BPXSY2, and BPXSY3 in the data file. Since there is one row per subject and the repeated blood pressure measurements are stored in different columns under different variable names, these data are in wide format. We can use tidyr to convert them to long format as follows:

library(tidyr)
da = gather(bpx, BPX_visit, BPXSY, BPXSY1, BPXSY2, BPXSY3)

If needed, we can also go the other way – long to wide format conversions can be performed using the tidyr spread function.
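
As a small sketch using the BMI example from above (note that the resulting Visit column holds the original column names, Visit_1 and Visit_2, rather than the integers 1 and 2):

wide = data.frame(Subject_id=c(1, 2, 3, 4), Age=c(27, 22, 35, 41),
                  Visit_1=c(33, 18, 21, 25), Visit_2=c(31, 22, 24, 27))
long = gather(wide, Visit, BMI, Visit_1, Visit_2)   # wide to long
back = spread(long, Visit, BMI)                     # long back to wide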

Now that we have the blood pressure data in long format we may wish to join it with the demographic data. First we can eliminate the columns from the blood pressure data that we will no longer use:

da = select(da, SEQN, BPX_visit, BPXSY)

Next we can convert the visit variable from strings (BPXSY1, BPXSY2, …) to integers (1, 2, …):

f = function(x){as.integer(sub("BPXSY", "", x))}
da = mutate(da, BPX_visit=sapply(BPX_visit, f))
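
Since sub is itself vectorized, the sapply is not strictly needed; the same conversion can equivalently be written in one line:

da = mutate(da, BPX_visit=as.integer(sub("BPXSY", "", BPX_visit)))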

Finally, it is easier to view the table if the records from individual subjects are contiguous:

da = arrange(da, SEQN, BPX_visit)

Now we are ready to do the join. Since there are up to three rows per subject in the long format blood pressure data, this will be a “many-to-one” join, which is exactly what the inner_join function performs (here it joins on SEQN, the only variable shared by the two dataframes):

da = inner_join(demo, da)
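
To check the result, we can view a few of the joined columns (RIDAGEYR, the subject’s age in years, is one of the demographic variables carried over from demo):

head(da[, c("SEQN", "RIDAGEYR", "BPX_visit", "BPXSY")])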