Parallel Processing in R
Serial versus Parallel processing
We’ve vaguely discussed this idea in passing, specifically in that *apply functions are faster than for loops (usually). Let’s be a little more formal. Consider that we have a series of functions to run: f1, f2, etc.
Serial processing means that f1 runs first, and until f1 completes, nothing else can run. Once f1 completes, f2 begins, and the process repeats.
Parallel processing (in the extreme) means that all the f# processes start simultaneously and run to completion on their own.
If we have a single computer at our disposal and have to run n models, each taking s seconds, the total running time will be n*s. If, however, we have k < n computers we can run our models on, the total running time will be n*s/k. In the old days this was how parallel code was run, and it still is on larger servers. However, modern computers have “multicore” processors, which can be equivalent to running multiple computers at a time. The equation is not quite as clean (other things are running on each processor, there is overhead in transferring data between processors, etc.), but in general we see the same gain.
The serial-parallel scale
A problem can range from “inherently serial” to “perfectly parallel”.
An “inherently serial” problem is one which cannot be parallelized at all: for example, if f2 depends on the output of f1 before it can begin, then even if we used multiple computers, we would gain no speed-up.
On the other hand, a “perfectly parallel” problem is one in which there is absolutely no dependency between iterations; most of the *apply calls or simulations we’ve discussed in this class fall into this category. In this case, any f* can start at any point and run to completion regardless of the status of any other f*.
Sometimes the dependency occurs at the end of each function: we could start running f1 and f2 simultaneously, but f2 would pause before finishing while it waits for f1 to finish. We can still see a speed gain here.
In general, if each f* is very fast, running them all in parallel may not be the most efficient approach (due to overhead). It may be better to run the first \(k\) in serial, then the next \(k\) in serial, etc.
Here we’re going to exclusively discuss the perfectly parallel situation; it is where most (though of course not all) statistical situations will land. The more complicated parallel situations arise generally in deeper computer science scenarios, for example handling many users of a service.
Terminology
Let’s just nail down some terminology.
- A core is a general term for either a single processor on your own computer (technically you only have one processor, but a modern processor like the i7 can have multiple cores - hence the term) or a single machine in a cluster network.
- A cluster is a collection of objects capable of hosting cores, either a network of machines or just the collection of cores on your personal computer.
- A process is a single running version of R (or more generally any program). Each core runs a single process.
The parallel package
There are a number of packages which can be used for parallel processing in R. Two of the earliest and strongest were multicore and snow. However, both were adopted into the base R installation and merged into the parallel package.
library(parallel)
You can easily check the number of cores you have access to with detectCores:
detectCores()
## [1] 4
The number of cores reported is not necessarily the number of physical processors you actually have, thanks to the concept of “logical CPUs” (e.g. hyperthreading). For the most part, you can treat this number as accurate. Trying to use more cores than you have available won’t provide any benefit.
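If you want to distinguish the two counts, detectCores takes a logical argument; a minimal sketch:

```r
library(parallel)

# Physical cores only (may be NA on some platforms)
detectCores(logical = FALSE)

# Logical CPUs, including hyperthreads (the default)
detectCores(logical = TRUE)
```

On a hyperthreaded machine the second number is typically twice the first.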
Methods of Parallelization
There are two main ways in which code can be parallelized, via sockets or via forking. These function slightly differently:
- The socket approach launches a new version of R on each core. Technically this connection is done via networking (e.g. the same as if you connected to a remote server), but the connection happens entirely on your own computer. I mention this because you may get a warning from your computer asking whether to allow R to accept incoming connections; you should allow it.
- The forking approach copies the entire current version of R and moves it to a new core.
There are various pros and cons to the two approaches:
Socket:
- Pro: Works on any system (including Windows).
- Pro: Each process on each node is unique so it can’t cross-contaminate.
- Con: Each process is unique, so it will be slower.
- Con: Things such as package loading need to be done in each process separately. Variables defined on your main version of R don’t exist on each core unless explicitly placed there.
- Con: More complicated to implement.
Forking:
- Con: Only works on POSIX systems (Mac, Linux, Unix, BSD) and not Windows.
- Con: Because processes are duplicates, it can cause issues, specifically with random number generation (which should usually be handled by parallel in the background) or when running in a GUI (such as RStudio). This doesn’t come up often, but if you get odd behavior, this may be the case.
- Pro: Faster than sockets.
- Pro: Because it copies the existing version of R, your entire workspace exists in each process.
- Pro: Trivially easy to implement.
In general, I’d recommend using forking if you’re not on Windows.
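As a sketch of how parallel handles the random number issue on a POSIX system: switching to the L’Ecuyer-CMRG generator gives each forked process its own independent, reproducible stream.

```r
library(parallel)

# L'Ecuyer-CMRG gives each forked child an independent random stream;
# mclapply's mc.set.seed = TRUE (the default) advances the streams for us
RNGkind("L'Ecuyer-CMRG")
set.seed(123)
res1 <- mclapply(1:4, function(i) rnorm(1), mc.cores = 2)

set.seed(123)
res2 <- mclapply(1:4, function(i) rnorm(1), mc.cores = 2)

identical(res1, res2)  # TRUE: results are reproducible across runs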
Note: These notes were compiled on OS X.
Forking with mclapply
The most straightforward way to enable parallel processing is by switching from lapply to mclapply. (Note I’m using system.time instead of profvis here because I only care about running time, not profiling.)
library(lme4)
## Loading required package: Matrix
f <- function(i) {
lmer(Petal.Width ~ . - Species + (1 | Species), data = iris)
}
system.time(save1 <- lapply(1:100, f))
## user system elapsed
## 2.048 0.019 2.084
system.time(save2 <- mclapply(1:100, f))
## user system elapsed
## 1.295 0.150 1.471
If you were to run this code on Windows, mclapply would simply call lapply, so the code works but sees no speed gain.
mclapply takes an argument, mc.cores. By default, mclapply will use all cores available to it. If you don’t want that (either because you’re on a shared system or you want to save processing power for other purposes), you can set mc.cores to a value lower than the number of cores you have. Setting it to 1 disables parallel processing, and setting it higher than the number of available cores has no effect.
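For example, using a toy function (hypothetical, for illustration) rather than the model fit above:

```r
library(parallel)

slow_square <- function(i) {
  Sys.sleep(0.01)  # stand-in for a slow computation
  i^2
}

# Restrict mclapply to two cores, leaving the rest free for other work
res <- mclapply(1:20, slow_square, mc.cores = 2)
unlist(res)
```

On Windows this still runs, but only because mc.cores greater than 1 falls back with an error unless it is set to 1.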
Using sockets with parLapply
As promised, the sockets approach to parallel processing is more complicated and a bit slower, but works on Windows systems. The general process we’ll follow is:
- Start a cluster with \(n\) nodes.
- Execute any pre-processing code necessary in each node (e.g. loading a package)
- Use par*apply as a replacement for *apply. Note that unlike mclapply, this is not a drop-in replacement.
- Destroy the cluster (not necessary, but best practice).
Starting a cluster
The function to start a cluster is makeCluster, which takes as an argument the number of cores:
numCores <- detectCores()
numCores
## [1] 4
cl <- makeCluster(numCores)
The function takes an argument type, which can be either “PSOCK” (the socket version) or “FORK” (the fork version). Generally, mclapply should be used for the forking approach, so there’s no need to change this.
If you were running this on a network of multiple computers as opposed to on your local machine, there are additional arguments you may wish to set, but generally the other defaults should be sufficient.
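A minimal sketch of the type argument (the FORK line assumes a POSIX system and will fail on Windows):

```r
library(parallel)

# Default socket cluster; works on all platforms, including Windows
cl_sock <- makeCluster(2, type = "PSOCK")
stopCluster(cl_sock)

# Fork-based cluster; POSIX systems only, and mclapply is usually simpler
cl_fork <- makeCluster(2, type = "FORK")
stopCluster(cl_fork)
```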
Pre-processing code
When using the socket approach to parallel processing, each process is started fresh, so things like loaded packages and any variables existing in your current session do not exist. We must instead move those into each process.
The most generic way to do this is the clusterEvalQ function, which takes a cluster and any expression, and executes the expression on each process.
clusterEvalQ(cl, 2 + 2)
## [[1]]
## [1] 4
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 4
##
## [[4]]
## [1] 4
Note the lack of inheritance:
x <- 1
clusterEvalQ(cl, x)
## Error in checkForRemoteErrors(lapply(cl, recvResult)): 4 nodes produced errors; first error: object 'x' not found
We could fix this by wrapping the assignment in a clusterEvalQ call:
clusterEvalQ(cl, y <- 1)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 1
clusterEvalQ(cl, y)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 1
y
## Error in eval(expr, envir, enclos): object 'y' not found
However, now y doesn’t exist in the main process. We can instead use clusterExport to pass objects to the processes:
clusterExport(cl, "x")
clusterEvalQ(cl, x)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 1
The second argument is a character vector naming the variables to pass.
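A self-contained sketch of exporting several variables at once (using a fresh cluster, cl2, so the example stands alone; the envir argument tells clusterExport where to look for the variables):

```r
library(parallel)
cl2 <- makeCluster(2)

a <- 10
b <- 20

# Export both variables in one call by naming them in a character vector
clusterExport(cl2, c("a", "b"), envir = environment())
res <- clusterEvalQ(cl2, a + b)

stopCluster(cl2)
unlist(res)  # 30 on every node
```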
Finally, we can use clusterEvalQ to load packages:
clusterEvalQ(cl, {
library(ggplot2)
library(stringr)
})
## [[1]]
## [1] "stringr" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[2]]
## [1] "stringr" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[3]]
## [1] "stringr" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[4]]
## [1] "stringr" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
Note that this helpfully returns a list of the packages loaded in each process.
Using par*apply
There are parallel versions of the three main apply statements: parApply, parLapply and parSapply, for apply, lapply and sapply respectively. They take an additional argument for the cluster to operate on.
parSapply(cl, Orange, mean, na.rm = TRUE)
## Tree age circumference
## NA 922.1429 115.8571
All the general advice and rules about the normal *apply functions apply to the par*apply versions as well.
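For completeness, a small sketch of parApply, which adds the MARGIN argument just like apply (again with its own cluster, cl3, so the example stands alone):

```r
library(parallel)
cl3 <- makeCluster(2)

m <- matrix(1:6, nrow = 2)
# Row sums computed across the cluster, analogous to apply(m, 1, sum)
row_sums <- parApply(cl3, m, 1, sum)

stopCluster(cl3)
row_sums  # 9 12
```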
Close the cluster
stopCluster(cl)
This is not strictly necessary, but is best practice. If not stopped, the processes continue to run in the background, consuming resources, and any new processes can be slowed or delayed. If you exit R, it should automatically close all processes as well. This does not delete the cl object, just the cluster it refers to in the background.
Keep in mind that closing a cluster is equivalent to quitting R in each; anything saved there is lost and packages will need to be re-loaded.
Continuing the example
cl <- makeCluster(detectCores())
clusterEvalQ(cl, library(lme4))
## [[1]]
## [1] "lme4" "Matrix" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[2]]
## [1] "lme4" "Matrix" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[3]]
## [1] "lme4" "Matrix" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[4]]
## [1] "lme4" "Matrix" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
system.time(save3 <- parLapply(cl, 1:100, f))
## user system elapsed
## 0.095 0.017 1.145
stopCluster(cl)
Timing this is tricky: if we just time the parLapply call, we’re not capturing the time to open and close the cluster, and if we time the whole thing, we’re including the call to library(lme4). To be completely fair, we need to include loading lme4 in all three cases. I do this outside of this Markdown file to ensure no added complications. The three pieces of code were, with a complete restart of R after each:
### lapply
library(parallel)
f <- function(i) {
lmer(Petal.Width ~ . - Species + (1 | Species), data = iris)
}
system.time({
library(lme4)
save1 <- lapply(1:100, f)
})
### mclapply
library(parallel)
f <- function(i) {
lmer(Petal.Width ~ . - Species + (1 | Species), data = iris)
}
system.time({
library(lme4)
save2 <- mclapply(1:100, f)
})
### parLapply
library(parallel)
f <- function(i) {
lmer(Petal.Width ~ . - Species + (1 | Species), data = iris)
}
system.time({
cl <- makeCluster(detectCores())
clusterEvalQ(cl, library(lme4))
save3 <- parLapply(cl, 1:100, f)
stopCluster(cl)
})
| lapply | mclapply | parLapply |
|---|---|---|
| 4.237 | 4.087 | 6.954 |
This shows the additional overhead that can occur with the socket approach: it can definitely be faster, but in this case the added overhead slows it down. The individual running time of the single parLapply call is still faster than lapply.