Dataframes in R
An R dataframe (usually written data.frame in R code) is a means for
storing a rectangular dataset in a single object. In such a dataset,
the rows are observations and the columns are variables. Each
variable should have a consistent data type over all the observations,
but different variables can have different data types.
For example, the following data set could be represented as a dataframe with three variables. These variables have date, string, and numeric type, respectively (we don’t need to worry right now about how these types map onto concrete R types).
Birthdate State Balance
--------------------------------
9-13-1992 Maryland 5000
6-29-1986 Ohio 7520
12-9-1979 California 12739
A dataframe is a class that inherits from the list data type. One
common way to obtain a dataframe in R is by reading in a dataset from
a file using read.table or read.csv.
If you want to create a literal dataframe in your R script, you can do the following:
birthdate = c("9-13-1992", "6-29-1986", "12-9-1979")
state = c("Maryland", "Ohio", "California")
balance = c(5000, 7520, 12739)
da = data.frame(Birthdate=birthdate, State=state, Balance=balance)
You can check that da has list data type and data.frame class:
typeof(da)
class(da)
Since the basic type of a dataframe is a list, and the list elements
are the variables, you can access the variable information the same
way you would access elements of a list, e.g. to access the (entire)
State variable use
da$State
or
da[["State"]]
Like arrays, dataframes can have a names attribute for column names
and a row.names attribute for row names. You can access individual
elements of a dataframe as if you were indexing a two-dimensional
array, e.g.
da[2, "State"]
will return “Ohio”. This is sometimes called label-based indexing.
Slicing also works. You can slice using integer ranges, e.g.
da[2:3, "State"]
da[2:3, 2:3]
but you cannot use row or column labels in a slice range.
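To make these access and slicing patterns concrete, here is a minimal sketch using the da dataframe built above (stringsAsFactors=FALSE keeps the string columns as character vectors on all R versions; since R 4.0 this is the default):

```r
da = data.frame(Birthdate=c("9-13-1992", "6-29-1986", "12-9-1979"),
                State=c("Maryland", "Ohio", "California"),
                Balance=c(5000, 7520, 12739),
                stringsAsFactors=FALSE)

da$State          # the whole State variable, via list-style access
da[["State"]]     # identical result
da[2, "State"]    # label-based indexing of one element: "Ohio"
da[2:3, "State"]  # integer slice of rows: "Ohio", "California"
da[2:3, 2:3]      # a 2x2 sub-dataframe (State and Balance, rows 2-3)
```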
Unlike arrays, the columns of a dataframe can have different types
(e.g. double or character). To learn the type of a particular column
of a dataframe use typeof on the column, for example
typeof(da$State)
will be “character”.
You can add a variable to a dataframe simply by assigning it a value:
da$Country = c("USA", "USA", "USA")
You can also use da$Country = "USA" to initialize the column with the
same value for all rows.
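A short sketch of this recycling behavior on toy data (not the example dataframe above):

```r
da = data.frame(State=c("Maryland", "Ohio", "California"))
da$Country = "USA"               # the scalar is recycled to every row
da$Abbrev = c("MD", "OH", "CA")  # a full-length vector also works
da
```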
Row and column names in a dataframe should be distinct, but R does not
always enforce this. If you change the names, e.g. using
names(da) = c("a", "b", "b")
then you will have a dataframe with non-unique column names. In this
case, indexing da[2, "b"] will return the value from the first
(left-most) column named b. When loading a dataframe from a file or
creating a dataframe from a list using as.data.frame, R will adjust
the names to make them unique.
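Both behaviors can be sketched as follows; the .1 suffix is what current versions of R produce when repairing names, though the exact suffix is an implementation detail:

```r
da = data.frame(a=1:3, b=4:6, c=7:9)
names(da) = c("a", "b", "b")   # R permits duplicate names here
da[2, "b"]                     # 5: taken from the left-most column named b

# as.data.frame repairs duplicate names coming from a list:
dx = as.data.frame(list(b=1:3, b=4:6))
names(dx)                      # the second column is renamed "b.1"
```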
dplyr
dplyr is an R package for manipulating datasets, including data stored in dataframes as well as data in other storage formats. Since dplyr is not part of base R, you need to load it in every script that calls a dplyr function, using:
library(dplyr)
If you have never used dplyr you will first need to install it using:
install.packages("dplyr")
There are several on-line tutorials and other references for dplyr, including this.
We will illustrate some of the key functionality of dplyr here, using data from the 2009 Residential Energy Consumption Survey (RECS), which is available here.
library(dplyr)
# The argument to read.csv should be replaced with a path to the
# data file. You can compress the data file using gzip if you
# want.
da = read.csv("/var/tmp/recs2009_public.csv.gz")
dplyr has a number of basic functions for manipulating datasets. We
won’t review the simpler functions here. One of these basic functions
that is particularly useful is filter, which selects rows from a
dataset based on conditions (boolean expressions) involving the
variables. Here is an example in which we retain only the buildings
that were constructed in 1980 or later (YEARMADE >= 1980), that are
single family homes (TYPEHUQ == 2), and that are in a rural area
(UR == "R").
da1 = filter(da, YEARMADE >= 1980, TYPEHUQ == 2, UR == "R")
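Here is the same kind of filter call on a small hypothetical dataframe; the column names mimic RECS but the values are made up:

```r
library(dplyr)

homes = data.frame(YEARMADE=c(1975, 1985, 1990, 2001),
                   TYPEHUQ=c(2, 2, 1, 2),
                   UR=c("R", "R", "R", "U"),
                   stringsAsFactors=FALSE)

# Keep only rows satisfying all three conditions (they are AND-ed together)
h1 = filter(homes, YEARMADE >= 1980, TYPEHUQ == 2, UR == "R")
h1   # only the 1985 home satisfies all three conditions
```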
Split-apply-combine
dplyr emphasizes a pattern called “split-apply-combine” for thinking about many operations that arise in data management and analysis. The basic idea is that you first split a dataset based on some condition, then you apply a function to each piece that results from the split, then you somehow combine these results.
A basic illustration of split-apply-combine is to split the survey
data by region, then calculate the mean year of construction
(YEARMADE) for buildings within each region (the census region codes
are given here):
gb = group_by(da, REGIONC)
summarize(gb, YEARMADE=mean(YEARMADE))
The summarize function creates a new dataframe with variables as
defined through keyword arguments in the function call. This
particular call to summarize creates a summary variable called
YEARMADE which contains the mean value of the original YEARMADE
variable within each level of REGIONC.
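A runnable sketch of the same pattern on made-up data:

```r
library(dplyr)

toy = data.frame(REGIONC=c(1, 1, 2, 2, 2),
                 YEARMADE=c(1980, 1990, 1960, 1980, 2000))
gb = group_by(toy, REGIONC)
summarize(gb, YEARMADE=mean(YEARMADE))
# region 1: mean 1985, region 2: mean 1980
```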
A similar but slightly more complex analysis looks at the average size of each building in terms of the number of rooms, grouped based on year of construction. The construction year is cut into five-year intervals to make the results more stable and easier to read as a table.
da = mutate(da, YEARMADE5=cut(YEARMADE, seq(1920, 2015, by=5)))
gb = group_by(da, YEARMADE5)
summarize(gb, TOTROOMS=mean(TOTROOMS), n=n())
The n() function calculates the sample size within each level of the
grouping variable (which is YEARMADE5 here) and creates a new
variable (here called n) to hold these values.
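The behavior of cut can be seen on a toy vector. By default the intervals are closed on the right, so a value equal to a break point falls in the interval ending at that break; the dig.lab argument below is only there to keep the interval labels from being printed in scientific notation:

```r
x = c(1972, 1975, 1988)
cx = cut(x, seq(1970, 1990, by=5), dig.lab=4)
as.character(cx)   # "(1970,1975]" "(1970,1975]" "(1985,1990]"
table(cx)          # counts per interval; empty intervals remain as levels
```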
Note that the mutate function is equivalent to creating a new
variable with the usual assignment syntax. The advantage of using
mutate is that the function arguments can refer to the variables by
name without referencing the dataframe, e.g. below in cut we have
YEARMADE not da$YEARMADE. The variable names inside mutate
automatically refer to the dataframe provided in the first argument
to mutate.
dax = mutate(da, YEARMADE5=cut(YEARMADE, seq(1920, 2015, by=5)))
# Equivalent approach using conventional assignment
day = data.frame(da)
day$YEARMADE5 = cut(day$YEARMADE, seq(1920, 2015, by=5))
# Confirm that the results are identical
stopifnot(all.equal(dax, day))
It’s also possible to group the data based on two or more variables. Next we will cut the year of construction into two intervals, pre-1970 and post-1970, and then group the data based on the combined values of this variable and the climate region (a categorical variable with five levels, see the codebook for the level names).
da = mutate(da, YEARMADE50=cut(YEARMADE, c(1920, 1970, 2015)))
gb = group_by(da, YEARMADE50, Climate_Region_Pub)
summarize(gb, n=n())
It’s easier to understand the results of this analysis by calculating
frequencies within each level of YEARMADE50, which can be done as
follows:
sf = summarize(gb, n=n())
mutate(sf, f=n/sum(n))
Note that sum(n) sums over the last level of the grouping, which is
the Climate_Region_Pub variable. Thus the proportions are calculated
within the groups defined by the remaining variables, which in this
case is only the variable YEARMADE50.
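This drop-one-group behavior can be seen on a toy dataset; summarize peels off the last grouping variable, so its result is still grouped by the first one, and sum(n) is taken within those remaining groups:

```r
library(dplyr)

toy = data.frame(era=c("old", "old", "old", "new", "new"),
                 clim=c("A", "A", "B", "A", "B"))
gb = group_by(toy, era, clim)
sf = summarize(gb, n=n())   # sf is still grouped by era
mutate(sf, f=n/sum(n))      # proportions sum to 1 within each era
```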
Exercise
Produce a table containing the fraction of houses that have basements
(CELLAR == 1) within each level of REPORTABLE_DOMAIN (which is
basically a state or group of adjacent states). Restrict this
analysis to single family detached homes (TYPEHUQ == 2). Sort your
table from least to greatest, then from greatest to least. What are
some of the states with the lowest and highest frequencies of
basements?
Pipes
In the examples above we applied a series of dplyr functions, with each function operating on the output of the previous function. To make this work, we needed to name each returned value. Pipes allow us to do the same thing without naming any of the intermediate values (which we generally won’t use again anyway).
In R, a pipe is the operator %>% which takes the output of one
function and passes it as the first argument of the next function.
As a simple example, the following code
sum(1, 2) %>% sum(3)
is equivalent to
sum(sum(1, 2), 3),
which yields 6. We can use this to simplify what we did above:
mutate(da, YEARMADE50=cut(YEARMADE, c(1920, 1970, 2015))) %>%
group_by(YEARMADE50, Climate_Region_Pub) %>%
summarize(n=n()) %>%
mutate(f=n/sum(n))
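The same pipeline shape, on runnable toy data:

```r
library(dplyr)

d = data.frame(g=c("a", "a", "b"), x=c(1, 3, 10))
d %>%
    group_by(g) %>%
    summarize(m=mean(x))
# g="a" gives m=2, g="b" gives m=10
```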
If you look up the climate codes you will see that construction in cold climates declined after 1970 and construction in hot, humid climates increased. Construction in the other climate regions is fairly constant as a proportion of the total.
Joining data sets with dplyr
Several libraries in R implement methods for joining data sets. Here we will look at the join methods in dplyr.
The basic idea of any join is to take two datasets and combine their
information into a single new dataset. As a simple example, you may
have dataframes da1 and da2, each containing data on human subjects.
The variables (columns) in da1 and da2 may be different or may have
partial overlap. The rows of the two datasets correspond to
overlapping (not necessarily identical) sets of people. However the
rows are generally in different orders in the two datasets, and some
people may be present in only one of the two datasets. The join
operation lines up the two dataframes so that the aligned rows
correspond to the same person, and creates a new dataframe from this
result.
To be concrete, we will work with a few of the many datasets from the
U.S. government’s “National Health and Nutrition Examination Survey”
(NHANES). To begin, download these two NHANES datafiles through the
web, or through the shell using wget:
http://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT
http://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPX_F.XPT
You will find codebooks for these files here.
The NHANES data files are in an archaic format called “SAS XPORT”.
You can read the files into R using the read.xport function in the
foreign library.
library(foreign)
demo = read.xport("/var/tmp/DEMO_F.XPT")
bpx = read.xport("/var/tmp/BPX_F.XPT")
You can now use head to take a peek at the first few rows of each
file:
head(demo)
head(bpx)
All joins use an “index” variable to line up rows in the datasets
being joined. For the NHANES datasets we will join on the SEQN
variable, which is a unique subject identifier. We can confirm that
the SEQN values are distinct, so we will not need to deal with
“many-to-one” joins.
dim(demo)[1] == length(unique(demo$SEQN))
dim(bpx)[1] == length(unique(bpx$SEQN))
However looking at the dimensions, it is clear that there are at least some people who are missing from one or the other of the two datasets.
setdiff(demo$SEQN, bpx$SEQN)
setdiff(bpx$SEQN, demo$SEQN)
Simple joins
Perhaps the most commonly used join operation is the inner_join,
which joins based on rows that are present in both datasets.
da = inner_join(demo, bpx, by="SEQN")
dim(da)
Based on the dimensions, we see that this join only includes people
who are present in both datasets, and it includes all columns from
both datasets. Note that the by keyword argument defines the index
variable. If the by argument is omitted, by default the join takes
place on any variables that have the same name in both datasets.
Also, if the index variable has different names in the two datasets,
you can use the syntax
da = inner_join(demo, bpx, by=c("SEQN" = "SEQN"))
where, e.g. by=c("S1" = "S2") would use S1 as the index variable for
the first dataset and S2 as the index variable for the second
dataset.
The semi_join is similar to the inner_join, except that the columns
of the second dataset (bpx here) are not included. In other words, it
drops the rows from the first dataset that are not in the second
dataset, but only includes the columns from the first dataset.
dx = semi_join(demo, bpx, by="SEQN")
dim(dx)
Sometimes it is useful to know which rows do not match in a join. We
can do this using the anti_join function:
fx = anti_join(demo, bpx, by="SEQN")
We can also join by forcing the result to have exactly the same rows as the first dataframe:
da = left_join(demo, bpx, by="SEQN")
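The four join types can be compared side by side on two tiny dataframes (hypothetical SEQN values):

```r
library(dplyr)

d1 = data.frame(SEQN=c(1, 2, 3), age=c(40, 50, 60))
d2 = data.frame(SEQN=c(2, 3, 4), bp=c(120, 130, 140))

inner_join(d1, d2, by="SEQN")  # SEQN 2 and 3; columns from both
left_join(d1, d2, by="SEQN")   # all rows of d1; bp is NA for SEQN 1
semi_join(d1, d2, by="SEQN")   # SEQN 2 and 3; columns of d1 only
anti_join(d1, d2, by="SEQN")   # SEQN 1: present in d1 but not in d2
```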
Many-to-one joins and wide/long conversion
Many datasets have repeated measures for each subject. Datasets with repeated measures may be stored either in “long form” with one row per observation, or in “wide form”, with one row per subject.
Here is a simple dataset in wide form, containing data on subjects’ body mass index (BMI) values during two clinical visits:
Subject_id Age Visit_1 Visit_2
-------------------------------------------
1 27 33 31
2 22 18 22
3 35 21 24
4 41 25 27
Here is the same dataset in long form:
Subject_id Visit Age BMI
-------------------------------------
1 1 27 33
1 2 27 31
2 1 22 18
2 2 22 22
3 1 35 21
3 2 35 24
4 1 41 25
4 2 41 27
The tidyr package is useful for converting among these formats. For
example, the NHANES study records up to three blood pressure
measurements for each subject. They are named BPXSY1, BPXSY2, and
BPXSY3 in the data file. Since there is one row per subject and the
repeated blood pressure measurements are stored in different columns
under different variable names, these data are in wide format. We can
use tidyr to convert them to long format as follows:
library(tidyr)
da = gather(bpx, BPX_visit, BPXSY, BPXSY1, BPXSY2, BPXSY3)
If needed, we can also go the other way – long to wide format
conversions can be performed using the tidyr spread function.
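Here are gather and spread on a simplified version of the BMI table from above (the Age column is omitted to keep the sketch short):

```r
library(tidyr)

wide = data.frame(Subject_id=c(1, 2), Visit_1=c(33, 18), Visit_2=c(31, 22))
long = gather(wide, Visit, BMI, Visit_1, Visit_2)
long                              # one row per subject/visit pair
wide2 = spread(long, Visit, BMI)  # back to one row per subject
```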
Now that we have the blood pressure data in long format we may wish to join it with the demographic data. First we can eliminate the columns from the blood pressure data that we will no longer use:
da = select(da, SEQN, BPX_visit, BPXSY)
Next we can convert the visit variable from strings (BPXSY1, BPXSY2,
…) to integers (1, 2, …):
f = function(x){as.integer(sub("BPXSY", "", x))}
da = mutate(da, BPX_visit=sapply(BPX_visit, f))
Finally, it is easier to view the table if the records from individual subjects are contiguous:
da = arrange(da, SEQN, BPX_visit)
Now we are ready to do the join. Since there are up to three rows per
subject in the long format blood pressure data, this will be a
“many-to-one” join, which is exactly what is done by the inner_join
function:
da = inner_join(demo, da)
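A sketch of the many-to-one behavior with toy data: each row of the one-row-per-subject table is replicated to match every row for that subject in the long table.

```r
library(dplyr)

demo_toy = data.frame(SEQN=c(1, 2), RIDAGEYR=c(40, 50))
bp_toy = data.frame(SEQN=c(1, 1, 2, 2, 2),
                    visit=c(1, 2, 1, 2, 3),
                    bpsy=c(120, 118, 130, 128, 126))
inner_join(demo_toy, bp_toy, by="SEQN")
# 5 rows: each subject's age is repeated once per visit
```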