Statistics 506, Fall 2016

Large data sets in R


Compressed files

A compressed file takes up far less room on the disk than the same file stored without compression. Not only does this save disk space, it often allows the file to be read more quickly, since the cost of decompression is more than offset by the reduced amount of data read from the disk (i.e. reading a large file is an “IO bound” task, not a “CPU bound” task).

There are several forms of compression. Some of the main ones you will encounter are:

  • gzip: file extension .gz (very common on linux/unix)

  • xz: file extension .xz

  • zip: file extension .zip (very common on Windows; used by WinZip and similar tools)

  • bzip2: file extension .bz2

The zip format is both an archiving and a compression format, where “archiving” means binding multiple files together into a single file. Archiving on Linux systems is done using tar, but we won’t consider archiving further here; we are only interested in compressing single files.

To compress a file at the command line using gzip just type:

gzip filename

where filename is the name of your file. Note that by default gzip replaces the original file with a compressed version named filename.gz.

To decompress a compressed gzip file use:

gunzip filename.gz

For .xz files, use

xz filename

and

unxz filename.xz

For most purposes, xz will give the greatest level of compression. xz is considerably slower than gzip when compressing, but the files produced by the two tools can be decompressed in roughly the same time. For very large files, say 100GB and above, even gzip is slow. You can sacrifice some compression in exchange for speed by using gzip --fast filename when compressing. Another useful compression format, which focuses on speed rather than compression ratio, is Snappy.
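Both gzip and xz also accept explicit compression levels from -1 (fastest, least compression) through -9 (slowest, greatest compression); gzip --fast is equivalent to gzip -1. For example,

xz -1 filename

compresses much faster than the default level (xz -6), at some cost in file size.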

When processing compressed files, it is usually desirable to process the file as a “stream”, avoiding the need to decompress the entire file to the disk. You can use pipes to decompress a file in memory and send the result to another program. For example, to count the lines in a gzip compressed file use

gzip -dc file.gz | wc -l

We can also use pipes to convert from one compression format to another without writing a decompressed copy to disk, e.g.

gzip -dc file.gz | xz > file.xz
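The same streaming idea is available within R itself: a gzfile connection decompresses the file as it is read, so the whole file is never decompressed on disk. A minimal sketch using base R (file.gz is a placeholder name):

con = gzfile("file.gz", "r")
head = readLines(con, n=10)    # read only the first 10 lines
close(con)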

While any type of file can be compressed, many special-purpose binary file formats contain little redundancy (modern Excel and pdf files, for example, are already internally compressed), so their size is not reduced much by further compression. Text files are generally highly compressible.

data.table

data.table is an R package that provides indexed rectangular data structures. “Indexing” means that specific columns are designated as keys, and the table is pre-sorted with respect to these columns. This allows very fast lookups along the indexed columns. data.table also provides facilities for aggregation and selection, similar in some ways to dplyr.

To demonstrate, here we create a small data table, which initially has no keys:

library(data.table)
da = rnorm(100)
dim(da) = c(25, 4)
colnames(da) = c("a", "b", "c", "d")
da = data.table(da)

Now we can designate column c as the index column:

setkey(da, c)

Note that after doing this, the data table becomes sorted with respect to column c. It now becomes very fast to retrieve a subset of rows by querying the indexed column, e.g.

da[c < 0,,]

It is also possible to define multiple columns as keys:

setkey(da, c, d)
da[(c < 0) & (d > 0),,]

If you want to check which variables are set as keys, you can use the key function:

key(da)

The data.table bracket operator [...] has up to three arguments. The first position selects rows, the second position selects columns and/or creates new columns, and the third position defines grouping actions. All of the arguments are optional, and additional commas can be omitted if there is no ambiguity. For example, da[x,,], da[x,] and da[x] are all equivalent.
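For example, using the small table da from above, all three positions can be used together: the following selects the rows with a > 0, computes the mean of b, and groups by the sign of d:

da[a > 0, mean(b), by=.(d > 0)]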

The data.table library has a function called fread for reading csv format data files. It returns a data.table object. For example, suppose we have a local copy of the 2014 GHCN weather data (obtained from the NOAA GHCN archive). We can read it using:

ghcn = fread('zcat 2014.csv.gz')

Note that the argument to fread can be a shell command that processes the data file as it is read off the disk. Shell commands differ among operating systems, so this code (using zcat) will not work on most Windows and some MacOS systems (where gzip -dc 2014.csv.gz may work instead), but it will work on most linux systems.

Since the datafile does not contain column names we can define them here:

setnames(ghcn, c("station", "date", "vtype", "value", "x1", "x2", "x3", "x4"))

Note that setnames is preferred over the colnames(da) = c(...) syntax because the latter may copy the data table.

Next we can set search keys

setkey(ghcn, vtype, station)

and begin querying the table:

tmax = ghcn[vtype=='TMAX',]

To query on both index fields, we pass the values that we want to select as a list:

dr = ghcn[.("AWDR", "USS0005N04S"),,]

Note that data.table has a special alias .( ) for the list function. The previous code is equivalent to

dr = ghcn[list("AWDR", "USS0005N04S"),,]

If you only want to select on one part of an index, you can pass NULL for index fields that you want to ignore:

dr = ghcn[list(NULL, "USS0005N04S"),,]

In general, the [] operator applied to a data.table takes up to three arguments, i.e. da[x, y, z], where x, y, z can contain expressions. Above we used the first position, x, to select rows from an indexed data.table object. The second position, denoted y here, can contain any expression that evaluates using the columns of the data.table. This could be a single column name, or an expression involving column names.

dr = ghcn[, mean(value)]

Several values can be calculated at once using a list of expressions:

dr = ghcn[, .(mean(value), sd(value))]

You can give these new variables names:

dr = ghcn[, .(m=mean(value), s=sd(value))]

If you do not give explicit names, default names (V1, etc.) are given.

It is possible to add a column in-place using the := operator, for example, we can add a “month” column to the dataset:

ghcn[, month:=as.integer(substr(date, 5, 6))]

Note that this operation has a return value, e.g. you can do:

x = ghcn[, month:=as.integer(substr(date, 5, 6))]

However you usually do not need to assign the result to a new variable, since the := expression is mutating (i.e. it changes the state of the data.table, which is ghcn above, in place).
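If you do want a modified copy while leaving the original unchanged, data.table provides the copy function. A sketch (ghcn_copy and month2 are just illustrative names):

ghcn_copy = copy(ghcn)            # an independent copy, not a reference
ghcn_copy[, month2:=month + 1]    # modifies the copy only; ghcn is unchanged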

If we want to do fast queries involving several variables, we need to update the indices:

setkey(ghcn, station, vtype, month)

The third component of the data.table [] operator, da[x, y, z], is used to support “group by” operations. The following defines a new data table containing only the TMAX values, and calculates the mean TMAX for each station/month combination:

ghcn1 = ghcn[vtype=='TMAX', mean(value), by=.(station, month)]

The second argument mean(value) is a reducing expression so it creates a new column by evaluating this expression over every group in the by statement. The two-element list .(station, month) indicates that the first level of grouping is over stations, and the second level is over months. The resulting table has one row for each station/month combination.

If needed, we can also ask data.table to include the size of each group using the .N symbol:

ghcn1 = ghcn[vtype=='TMAX', .(m=mean(value), n=.N), by=.(station, month)]

Another usage of the by term is to apply a non-reducing function to each group. Unlike the situation above, where we applied a reducing function to the grouped data, in this case the resulting data table has the same number of rows as the input data table. One example usage would be to center the data within each station:

ghcn2 = ghcn[vtype=='TMAX', rvalue:=value - mean(value), by=station]

We can confirm that the centering was successful (the sums below should all be zero, up to numerical error):

ghcn2[, sum(rvalue, na.rm=T), by=station]

data.table has a function called tables that displays all the current tables and their size in memory.
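For example, after the commands above,

tables()

prints the name, number of rows, approximate size in memory, column names, and key of each data.table in the session.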

There is significant overlap between the functionality of dplyr and data.table, and there are many online discussions comparing the two.

HDF5

The HDF5 (Hierarchical Data Format version 5) file format is a disk storage format for data. An HDF5 file can contain one or many potentially large datasets. The format allows these datasets to be accessed randomly (i.e. without reading the entire file from the beginning). Each HDF5 dataset is a homogeneous array with up to 32 dimensions. The datasets are stored in a directory-like hierarchy organized around groups. A group can contain datasets, and other groups.

HDF5 files are a very powerful way to work with large datasets while leaving most of the data on the disk. Note in contrast that data.table, while very efficient, places all the data in memory.

An important aspect of the HDF5 format is that a dataset can be stored as chunks. That is, a single dataset can be broken into multiple pieces. Each piece is stored contiguously, but the different pieces may not be contiguous with each other. This allows random access into the dataset, and also allows the dataset to be extended without rewriting the entire HDF5 file.
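In rhdf5 (introduced below), chunking is specified when a dataset is created. A sketch, with placeholder file and dataset names:

library(rhdf5)
h5createFile("ex.h5")
# a 1000000 x 4 dataset, stored in chunks of 10000 rows, compressed at level 7
h5createDataset("ex.h5", "mat", dims=c(1000000, 4), chunk=c(10000, 4), level=7)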

Another important aspect of the HDF5 format is that the datasets are “self documenting”, in that arbitrary metadata can be stored in the same HDF5 file.
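For example, using the low-level rhdf5 interface, we can attach a descriptive attribute to the dataset created in the sketch above (the attribute name and value here are arbitrary):

fid = H5Fopen("ex.h5")
did = H5Dopen(fid, "mat")
h5writeAttribute("tenths of a degree C", did, "units")
H5Dclose(did)
H5Fclose(fid)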

HDF5 is an open standard, and utilities for processing HDF5 files are available for many programming languages. Since the file format is the same regardless of what language is being used, the HDF5 format can serve as an interchange format between software systems.

R has several libraries for working with HDF5 files. In the examples below we will use the rhdf5 library. It is not part of CRAN; to install it, use the following:

source("https://bioconductor.org/biocLite.R")
biocLite("rhdf5")

You may need to replace https with http in the above code.

Note that there is a newer HDF5 library for R called h5 which seems to have a more elegant interface, and is part of CRAN. However it may not work on machines that lack an updated hdf5 system library.

Below is a complete program that takes the 2014 GHCN data, reads and organizes it using data.table, and writes it to an HDF5 file on the disk using the rhdf5 library. We organize the data by measurement type (e.g. TMAX, TMIN, etc.), and store a separate dataset for each station. This layout would be most appropriate if we subsequently planned to access data for individual stations. Other layouts would be more appropriate if we planned to query by other variables, say by date or location.

Note that writing a dataset to an HDF5 file involves two steps: first call h5createDataset to create space for the data, then call h5write to actually write the data to the physical file. The following script creates an HDF5 file using one year of GHCN data. The file is fairly large, so this script takes some time (over an hour) to run.

library(rhdf5)
library(data.table)

ghcn = fread("zcat /nfs/kshedden/GHCN/2014.csv.gz")
setnames(ghcn, c("station", "date", "vtype", "value", "x1", "x2", "x3", "x4"))

setkey(ghcn, vtype, station)

stations = unique(ghcn[, station])

H5close()
fname = "/nfs/kshedden/ghcn_2014.hdf5"
h5createFile(fname)

for (vt in c("TMIN", "TMAX", "PRCP")) {
    h5createGroup(fname, vt)
    for (j in 1:length(stations)) {
        st = stations[j]
        if (j %% 100 == 0) {
            print(j)
        }
        dd = ghcn[.(vt, st), .(date, value)]
        dd = as.matrix(dd)
        if (min(dim(dd)) > 0) {
            dname = sprintf("%s/%s", vt, st)
            h5createDataset(fname, dname, dim(dd), level=7)
            h5write(dd, fname, dname)
       }
    }
}
H5close()

The HDF5 format is often considered to be a “write once, read many times” format. For this reason, the underlying HDF5 C library makes it somewhat difficult to change an HDF5 file that has already been written to disk. To work around this, make sure you erase any local file called ghcn_2014.hdf5 before running the above program. Also note that the script calls the H5close function before and after writing, to ensure that we start and finish with a clean HDF5 state.
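One way to guarantee a clean start from within R is to remove any existing output file before writing:

if (file.exists("ghcn_2014.hdf5")) file.remove("ghcn_2014.hdf5")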

Now that we have created the HDF5 file, we can use it for some simple analyses. First, to obtain the contents of the file we can use h5ls:

contents = h5ls(fname)

Alternatively, working outside of the R session we can use command-line HDF5 tools that are installed on most Linux servers. To list the file contents on the command line use:

h5ls ghcn_2014.hdf5

If we know the group and dataset name of the data that we wish to obtain, we can extract it using h5read:

df1 = h5read(fname, "PRCP/US1MNWG0005")
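Because the data are stored in chunks, h5read can also retrieve part of a dataset without reading all of it. The index argument is a list with one component per dimension, with NULL meaning “everything”. For example, to read only the first 10 rows of the dataset above:

block = h5read(fname, "PRCP/US1MNWG0005", index=list(1:10, NULL))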

The following program calculates the mean February TMAX value for some of the stations:

contents = h5ls(fname)

# Keep only the rows describing datasets in the TMAX group; the "name"
# column then holds the station identifiers
contents = contents[contents$group == "/TMAX",]

stations = unique(contents$name)

dfr = data.frame()
for (j in 1:10) {

    dx = h5read(fname, sprintf("TMAX/%s", stations[j]))

    # the month is digits 5-6 of the yyyymmdd date value
    month = as.integer(substr(dx[,1], 5, 6))
    ii = which(month == 2)

    # GHCN temperatures are stored in tenths of a degree C
    dfr[stations[j], "tmax"] = mean(dx[ii, 2], na.rm=TRUE) / 10
}

HDF5 is powerful, but it is also complex and sometimes seen as quite “heavy”. It was also developed mainly for numeric data, although it is possible to store text in an HDF5 file as well. There has been a lot of recent interest in data containers, and it is possible that another format will become dominant in the future. In particular, several columnar data formats are becoming widely used. The Apache foundation sponsors the Parquet and Arrow container formats, and the latter has been adapted into the feather format, which is gaining popularity in the R and Python worlds.
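As a small illustration, the feather package (assuming it is installed) reads and writes R data frames in this format:

library(feather)
write_feather(ghcn1, "ghcn1.feather")
df = read_feather("ghcn1.feather")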