Statistics 506, Fall 2016

Concurrent programming in R


Most basic computer programs (including everything we have seen so far this term) run sequentially, meaning that only one statement executes at a time. In the simplest case, the statements run from top to bottom, one after another:

x = 3 + 4
y = x + 5

if (y < 3) {
    y = 2 * y
}

A more complex program uses looping and branching statements to manage the control flow, so that the statements do not simply execute from top to bottom:

f = function(x) { 3*x }

x = 5
for (k in 1:5) {
    x = f(x)
}

However, even in this example, only a single statement is being executed at any given point in time.

Modern computers are capable of managing multiple processes. There are three main ways that this can be achieved:

  • Multiple processes can run non-simultaneously on a single processing core; this is called multitasking (or time sharing).

  • Multiple processes can run simultaneously on different cores of a single computer. This is called multiprocessing.

  • Multiple processes can run simultaneously on distinct computers on a network. This is usually called cluster computing.

Note that in the first two approaches, the main and secondary memory are shared by all processes, but in the third approach, processes running on separate computers have their own main/secondary memory.
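Before exploiting multiple cores, it is useful to know how many your machine has. One way to check this (a sketch using the parallel package, which ships with base R) is:

```r
library(parallel)

# Number of logical cores available on this machine
detectCores()

# A more conservative count that ignores hyperthreading;
# may return NA on some platforms
detectCores(logical=FALSE)
```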

Gains in processing speed from one generation of processors to the next are slowing. Therefore, to accelerate complex computations, it is becoming more effective to use multiple cores and/or multiple computers, rather than to obtain computers with ultra-fast processors.

The terms parallelism and concurrency are used to refer to programming strategies that allow parts of a program to run simultaneously. The key challenge in any type of concurrent programming is to express through your code that certain values are not needed immediately, and therefore the execution of the program can continue while waiting for these values to be produced “asynchronously” (in the background). Most popular computing languages were not designed to natively support this type of logic, although many popular add-on frameworks allow concurrent programming to be done using standard languages like R.

Below is a concrete example showing how concurrent programming could be useful. In the code below, we do not need the values of b or d until line 6. Therefore, rather than “blocking” on line 2 while waiting for the function f to return, it would in principle be possible to compute line 2 in the background while continuing with line 3. We could then calculate line 4 in the background while computing line 5, since the result of line 4 is not needed until line 6. However upon reaching line 6, we must wait for all previous lines to be fully executed.

a = 32             # 1
b = f(a)           # 2
c = 64             # 3
d = f(c)           # 4
e = 2*a            # 5
g = b + d + e      # 6

How do we express this logic through our code? Any standard programming language, including R, will execute the above code sequentially, meaning that each line must complete before the following line is run. There is no way for R to know that lines 3, 4, and 5 can run before line 2 has completed, but then at line 6 we must wait for lines 1-5 to complete.

As a note on terminology, the distinction between “concurrent” and “parallel” is not that important for our purposes. But generally speaking, “concurrency” refers to a program that expresses that certain values can be calculated in the background while the execution path continues (and also expresses that certain values cannot be calculated until other values are known). “Concurrency” is usually taken to be agnostic to the particular way that the execution logic is implemented, for example it could even apply to a multi-tasking context where two statements never actually execute simultaneously. In contrast, “parallelism” refers to both the programming language and the underlying operating system, hardware and network that allow many operations to take place at the same time.

The “future” package and concurrent programming in R

future is an R package that provides an expressive syntax for concurrent programming operations. There are other R packages that enable concurrent programming, like snow and foreach, but I think that future is particularly useful for learning the principles of concurrency.

The basic idea of programming with the future package is that you can explicitly state that certain values are not needed immediately. This allows the calculations of these values to run in the background until they are needed, at which point the execution must block until the value is available. The future package is able to figure out when to block the execution, or you can signal this explicitly with the value function. We will only use the explicit approach here, but you can find more information about the implicit approach, which uses the %<-% operator, in the future package documentation.
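For reference, a minimal sketch of the implicit approach looks like the following (we will not use it further in these notes):

```r
library(future)

plan(multicore)

# %<-% creates a future for the right-hand side block and binds a
# promise to x; the block runs in the background.
x %<-% { Sys.sleep(0.5); 42 }

# Execution blocks here, when x is first touched, until the
# future has resolved.
y = x + 1
```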

The main responsibility of a programmer using the future package is to specify which statements can potentially be run concurrently. Note that this does not mean that they will definitely run concurrently. This decision is up to the future framework and depends on some user-settable parameters that we will discuss below.

Below is a very simple program using future. We use the future function to define two values that are not immediately needed. Only when the value function is called on these futures are their values required.

library(future)

plan(multicore)

fa = future({ 1 + 1 })        # 1
fb = future({ 2 + 2 })        # 2

v = c(value(fa), value(fb))   # 3

In the above code, lines 1 and 2 define code blocks whose results are not immediately needed. Therefore they can in principle be run in the background. Only when the value function is called in line 3 are the results of the two future expressions needed. Therefore, the execution must block at line 3 until the two futures are finished running.

By default, futures run in “eager” mode, which behaves essentially the same as a standard R program. For futures to actually run asynchronously, you must choose an asynchronous “plan”, as done above. On Linux (or macOS), plan(multicore) is generally the best choice. On Windows you should use plan(multisession) instead.
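If you want a single script to run on any platform, one way (a sketch, not the only option) is to select the plan based on the operating system:

```r
library(future)

# multicore uses forked processes, which are not available on
# Windows; multisession launches separate background R sessions
# instead.
if (.Platform$OS.type == "windows") {
    plan(multisession)
} else {
    plan(multicore)
}
```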

A future can only be evaluated once. Its result is memoized (cached) after it is executed and subsequent calls return the cached value:

library(future)

plan(multicore)

fa = future({ print("a"); 1 + 1; })
fb = future({ print("b"); 2 + 2; })

v = c(value(fa), value(fb), value(fa))
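Relatedly, the resolved function can be used to check whether a future has finished without blocking on its value. A small sketch:

```r
library(future)

plan(multicore)

fa = future({ Sys.sleep(2); 1 + 1 })

# resolved returns immediately with TRUE or FALSE;
# value would block until the result is available.
if (resolved(fa)) {
    print("finished")
} else {
    print("still running")
}

# Here we do block, and retrieve the result.
v = value(fa)
```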

Example: a simulation study

Simulation studies are a natural place to use concurrency. In most simulation studies, a single task is performed many times, and there is no need for communication until all the tasks are complete. Communication is always the main challenge in concurrent programming, so a program design that requires no communication should be easy to implement concurrently.

The code below uses futures in a simulation study looking at the actual versus theoretical standard errors in a large logistic regression (the theoretical standard errors are based on the Fisher information and therefore may deviate somewhat from the actual standard deviations of the parameter estimates).

library("future")

plan(multicore)

n = 100000
p = 100

# Be nice, limit concurrency to use this number of cores
maxcores = 5

# Number of simulation replications
nrep = 100

# Design matrix (covariate data)
x = rnorm(n*p)
dim(x) = c(n, p)

# P(Y=1 | X)
lp = rowSums(x) / 10
pr = 1 / (1 + exp(-lp))

# Fit the model to one data set, return coefficients and standard errors
f = function() {
    y = 1*(runif(n) < pr)
    lr = glm(y ~ x, family=binomial())
    se = sqrt(diag(vcov(lr)))
    return(list(b=coef(lr), s=se))
}

# Create a list of futures.  In asynchronous mode the calls to the
# future function below are not blocking (until we run out of cores).
# Note that we need to refer to all globals used in the future inside
# of the future code block.  This is a limitation of the future
# package that may eventually go away.
ii = 0
z = list()
while (ii < nrep) {
    ft = list()
    for (k in 1:maxcores) {
        ft[[k]] = future({ n; pr; x; f() })
    }

    for (k in 1:maxcores) {
        z[[ii + k]] = value(ft[[k]])
    }

    ii = ii + maxcores
}

# Some data management to extract the coefficients and
# standard errors into separate arrays
bl = list()
sl = list()
for (k in 1:nrep) {
    bl[[k]] = z[[k]]$b
    sl[[k]] = z[[k]]$s
}

# More data management
b = unlist(bl)
dim(b) = c(p+1, nrep)
b = t(b)

# More data management
s = unlist(sl)
dim(s) = c(p+1, nrep)
s = t(s)

# Things to inspect
apply(b, 2, mean)
apply(b, 2, sd)
apply(s, 2, mean)

To assess the timing of this code, you can use the following on the command line. For comparison, use plan(eager) to obtain the run time of the synchronous version of the calculations, or change the value of maxcores to reduce or increase the amount of concurrency.

date; R CMD BATCH tmp.R; date
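Alternatively, the elapsed time can be measured from within R itself. A sketch using proc.time (base R); the loop below is a stand-in for the code being timed:

```r
t0 = proc.time()

# ... the code being timed goes here ...
for (i in 1:3) Sys.sleep(0.1)

# The "elapsed" component is wall-clock time, which is what reflects
# any speedup from concurrency; "user" and "sys" count CPU time and
# can exceed the elapsed time when multiple cores are used.
print(proc.time() - t0)
```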

Example: asynchronous data processing

Another way to use concurrent programming is when processing large amounts of data, especially when the data can naturally be divided into independent “chunks” that are then processed concurrently. The Global Historical Climatology Network provides data files for weather stations containing measurements of temperature, rainfall, and other weather variables. Many of the stations have been active for 100 years or longer. Since there are thousands of stations, and each station has its own data file, it is natural to use concurrency to process them efficiently.

Below, we calculate the number of months for each station when the temperature range exceeded 10 degrees C.

library(future)
library(data.table)
library(readr)

dpath = "/nfs/kshedden/GHCN/ghcnd_gsn"

files = list.files(dpath)

plan(multicore)

maxcores = 5

# Field widths
w = c(11, 4, 2, 4)
p = c(5, 1, 1, 1)
for (j in 1:31) {
    w = c(w, p)
}

# Column names
cn = c("id", "year", "month", "element")
p = c("value", "mflag", "qflag", "sflag")
for (j in 1:31) {
    for (k in 1:4) {
        cn = c(cn, sprintf("%s%d", p[k], j))
    }
}

# Column specification for read_fwf
fw = fwf_widths(w, cn)

# Column names corresponding to the actual measurements
valnames = NULL
for (j in 1:31) {
    valnames[j] = sprintf("value%d", j)
}

# Type codes for the data columns
colcodes = paste(rep("dccc", 31), collapse="")
colcodes = paste(c("cidc", colcodes), collapse="")

# Count the number of months in which the within-month temperature
# range exceeds 10 degrees C.
f = function(fname) {
    fn = sprintf("%s/%s", dpath, fname)
    z = read_fwf(fn, fw, col_types=colcodes)
    z[z == -9999] = NA
    z = data.table(z)
    z = z[element=="TMAX"]
    x = z[, valnames, with=FALSE]
    x = x / 10
    r = apply(x, 1, max, na.rm=T) - apply(x, 1, min, na.rm=T)
    return(sum(r > 10))
}
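The per-row range computation inside f can be checked in isolation on a small example (hypothetical values, base R only):

```r
# Two months of daily maximum temperatures (3 days each, one month
# per row), in degrees C
x = matrix(c(10, 25,
             12, 14,
             30,  5), nrow=2)

# Range (max minus min) within each row
r = apply(x, 1, max, na.rm=TRUE) - apply(x, 1, min, na.rm=TRUE)
# r is c(20, 20): row 1 spans 10..30, row 2 spans 5..25
```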

# Create the futures
fl = list()
ii = 0
while (ii < length(files)) {
    ft = list()
    for (j in 1:maxcores) {
        if (ii + j > length(files)) {
            break
        }
        ft[[j]] = future({ f(files[ii + j]) })
    }

    for (j in 1:maxcores) {
        if (ii + j > length(files)) {
            break
        }
        fl[[ii + j]] = value(ft[[j]])
    }

    ii = ii + maxcores
}

# Combine the results into a single vector
fl = unlist(fl)