Profiling R code

Introduction

Slow code can put a serious damper on coding efforts and research in general. Optimizing code can offer major speed gains, often without changing the overall algorithm. Of course, if you can adjust the algorithm itself to speed things up, that's useful too!

Optimizing is always an iterative process: find a bottleneck, improve it if possible, find the next bottleneck, repeat. There's rarely a clear end-point; the goal is to reach "acceptable" performance. The definition of "acceptable" depends greatly on context; in some contexts anything slower than a few seconds might be intolerable, while in others a running time of hours rather than days would be great.

Most (but not all) R code is well-optimized, mostly because most (but not all) of it is written not in R, but in a lower-level language like C. In general, any code will be faster in a lower-level language. We will not cover C code in this class; however, if your work requires extreme optimization, you should look into it.1 There is one situation in which R code is far from optimal, which we'll discuss below.

Profiling

RStudio has built-in profiling capabilities which outshine R's default profiling. It builds on R's profiler; I'll briefly show below how to profile outside of RStudio, but for these notes we'll be focusing on RStudio.
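For reference, here is a minimal sketch of profiling outside of RStudio with base R's Rprof and summaryRprof; the model and the output file name are arbitrary choices for illustration.

Rprof("profile.out")                               # start writing profiling samples to a file
fit <- lm(price ~ carat, data = ggplot2::diamonds) # something worth profiling
Rprof(NULL)                                        # stop profiling
summaryRprof("profile.out")                        # summarize time spent per function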

The "diamonds" data set in ggplot2 has around 50,000 rows; large enough that fitting a model has a noticeable delay. The profvis package (and its function of the same name) lets us look at this in more detail. Profiling can also be run from RStudio's Profile menu, but the end result is identical.

profvis::profvis(summary(lm(price ~ clarity + color + cut + depth + table + carat, data = diamonds)))

At the very bottom, we see the final running time of around a quarter of a second.

The profile window has two tabs, Flame Graph and Data. The top half of the Flame Graph tab, which normally shows our source code, is blank because this is all internal R code; we'll see what it looks like later. The bottom half is the flame graph itself.

The x-axis is time (in milliseconds) of execution; at each point we see the call stack as it stands. Since this is an internal function, a lot of this may not be familiar. Each box represents a level in the call stack, its width being the time it took to run that line and all lines above it in the call stack. Here we can see that the three big slowdowns inside the lm call are eval (leading to model.frame), model.matrix, and lm.fit.

In the Data tab, we get much the same information as the flame graph, just in an aggregated format. If you were to fit several models (as we will later), the Data tab would collapse all the fits together. See below for details.

Example

Let’s look at a more useful example. Obviously, slower code is better for demonstration purposes as minor relative improvements can seem large. We’ve not talked about simulations yet, but let’s imagine a scenario where we want to test whether a given characteristic in our data is introducing bias in the coefficient estimates.

Say we have data from a survey in which interviewers ask subjects a series of questions. To protect anonymity, we are blinded to all identifying information, including which interviewer questioned which subject. (Of course, this is a terrible design. We'd like to know which interviewer performed each interview so we could adjust for interviewer effects, perhaps with a random effect.) After the data are collected, we hear that one of the interviewers realized he made a terrible mistake and almost always rounded salaries up substantially, from, for example, $77k to $80k. Unfortunately, we don't know which interviewer it was, nor can we identify the affected subjects by their rounded salaries (people tend to round their own salaries anyway), so we can't simply correct the data. What we can do is repeatedly create fake data with this flaw and fit a model to it to determine whether the estimated coefficient is biased away from the truth.

library(ggplot2)
library(profvis)
profvis({
  reps <- 10000
  n <- 100
  beta0 <- 2
  beta1 <- .7

  save <- vector()

  for (i in 1:reps) {
    x <- rnorm(n)
    y <- beta0 + beta1*x + rnorm(n)

    # Mimic the bad interviewer: round x up to the next integer for a random 25% of subjects.
    badinterviewer <- sample(1:n, .25*n, replace = FALSE)
    x[badinterviewer] <- ceiling(x[badinterviewer])

    coef <- lm(y ~ x)$coef[2]
    save <- c(save, coef)
  }
  save <- data.frame(save)
  ggplot(save, aes(x = save)) + geom_density() + geom_vline(xintercept = beta1)
})

Pretty slow, almost 15 seconds!

Optimization Step 1

Let’s start optimizing.

The flame graph is mostly useless here because we have so much going on. The source code pane above shows much more cleanly where the bottlenecks lie. The columns next to the code represent the memory regained during garbage collection (the negative value; not especially useful), the memory allocated at that call, and the time (in ms) spent in each line.

From this, we can clearly see that the lm call is the most wasteful. I mentioned before that built-in R code is generally well-optimized, but here's a situation where it works against us: lm does far more than we need here! All we want is the coefficient of a linear model; lm returns far more (look at the documentation). We've got two options.

First, we can write our own solver. We know that if the response is \(Y\), an \(n\times 1\) matrix, and the predictors are \(X\), an \(n\times (p + 1)\) matrix with the first column constant 1, then the coefficients are simply

\[ (X'X)^{-1}X'Y \]

We can compute this directly.
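As a minimal sketch (on made-up data, not the simulation from above), the direct computation looks like this:

n <- 100
X <- cbind(1, rnorm(n))            # design matrix with an intercept column
Y <- 2 + .7*X[, 2] + rnorm(n)      # response
solve(t(X) %*% X) %*% t(X) %*% Y   # (X'X)^{-1} X'Y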

Alternatively, there's an internal R function, .lm.fit, which takes in only a design matrix (with the intercept already added) and a response, and is designed to run much more quickly.

Rather than rerun our entire simulation (which is trivial in this case, but imagine if your code took days!), we would like to be able to test the speed of these three approaches on their own. However, each of these calls is fast enough that profvis will complain. Enter the microbenchmark package by Olaf Mersmann and others, whose sole purpose is to time very fast code!

microbenchmark takes a list of expressions (you can pass longer code by wrapping it in { and }, but that somewhat defeats the "micro" purpose…) and runs each repeatedly (100 times by default), returning runtime statistics.

# Copy the data generating code:
n <- 100
beta0 <- 2
beta1 <- .7
x <- rnorm(n)
y <- beta0 + beta1*x + rnorm(n)

badinterviewer <- sample(1:n, .25*n, replace = FALSE)
x[badinterviewer] <- ceiling(x[badinterviewer])

y <- matrix(y, ncol = 1)
x <- cbind(1, x)
library(microbenchmark)
(mbm <- microbenchmark(lm(y ~ x),
                       solve(t(x) %*% x) %*% (t(x) %*% y),
                       .lm.fit(x, y)))
## Unit: microseconds
##                                expr     min       lq      mean   median
##                           lm(y ~ x) 750.155 773.0215 827.34957 784.0270
##  solve(t(x) %*% x) %*% (t(x) %*% y)  25.158  30.8700  43.18425  42.8780
##                       .lm.fit(x, y)   6.262   7.2730  11.36469   8.9095
##        uq      max neval cld
##  808.6270 2510.707   100   b
##   45.4080  277.096   100  a
##   10.1845  226.120   100  a
boxplot(mbm)

We can see that although my manual solution is far better than lm, R's .lm.fit is better still.

profvis({
  reps <- 10000
  n <- 100
  beta0 <- 2
  beta1 <- .7

  save <- vector()

  for (i in 1:reps) {
    x <- rnorm(n)
    y <- beta0 + beta1*x + rnorm(n)

    # Mimic the bad interviewer: round x up to the next integer for a random 25% of subjects.
    badinterviewer <- sample(1:n, .25*n, replace = FALSE)
    x[badinterviewer] <- ceiling(x[badinterviewer])

    y <- matrix(y, ncol = 1)
    x <- cbind(1, x)

    coef <- .lm.fit(x, y)$coef
    save <- c(save, coef[2])
  }
  save <- data.frame(save)
  ggplot(save, aes(x = save)) + geom_density() + geom_vline(xintercept = beta1)
})

We're already running 10x faster than before! Notice also that the lines other than the model fit now account for a relatively larger share of the time. So we've taken a huge step towards optimizing this, but we can go further.

Since this is the first appearance of a useful flame graph, let me point out a few features.

  • You can zoom in and out of the flame graph, though this is somewhat unwieldy.
  • If you mouse over a yellow box, it highlights the corresponding code - and vice-versa.
  • The white boxes correspond to code that the profiler doesn’t have direct access to. There are ways around that, but generally you should be focusing efforts on pieces in the call stack you can directly control.
  • See the "Sample Interval: 10 ms"? You can control this with the interval argument to profvis (specified in seconds). The reason some lines seem to be skipped is that they finish between measurements; you could decrease the sampling interval (sample more often) to gain further insight, as in the example after this list. This is really only useful with already-fast code (if a line runs in under 10 ms, it's pretty good!).
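For example (a hedged illustration, not one of the profiles above), sampling every 5 ms instead of the default 10 ms:

profvis({
  for (i in 1:10000) {
    rnorm(100)
  }
}, interval = 0.005)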

Optimization Step 2

The next slowest line is save <- c(save, coef[2]). In R, anytime you grow an object this way (e.g. with c or rbind), R has to create a whole new copy of the object. For one-off operations this is trivial, but in a loop or simulation those small inefficiencies add up. There are two ways around this.

The first is to create an empty vector (or matrix as needed) once before we start looping, and then fill it in. The commands vector(length = 10) or matrix(nrow = x, ncol = y) are useful here.
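To see why this matters, here is a small illustration (not part of the original notes) comparing growing a vector with filling in a pre-allocated one; microbenchmark is loaded above. The gap typically widens quickly as the number of iterations grows.

grow <- function(reps) {
  out <- vector()
  for (i in 1:reps) out <- c(out, i)   # copies the whole vector each iteration
  out
}
prealloc <- function(reps) {
  out <- vector(length = reps)
  for (i in 1:reps) out[i] <- i        # fills in place
  out
}
microbenchmark(grow(10000), prealloc(10000), times = 10)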

profvis({
  reps <- 10000
  n <- 100
  beta0 <- 2
  beta1 <- .7

  save <- vector(length = reps)

  for (i in 1:reps) {
    x <- rnorm(n)
    y <- beta0 + beta1*x + rnorm(n)

    # Mimic the bad interviewer: round x up to the next integer for a random 25% of subjects.
    badinterviewer <- sample(1:n, .25*n, replace = FALSE)
    x[badinterviewer] <- ceiling(x[badinterviewer])

    y <- matrix(y, ncol = 1)
    x <- cbind(1, x)

    coef <- .lm.fit(x, y)$coef
    save[i] <- coef[2]
  }
  save <- data.frame(save)
  ggplot(save, aes(x = save)) + geom_density() + geom_vline(xintercept = beta1)
})

Alternatively, sapply should (in theory) be even faster.

profvis({
  reps <- 10000
  n <- 100
  beta0 <- 2
  beta1 <- .7

  save <- sapply(1:reps, function(i) {
    x <- rnorm(n)
    y <- beta0 + beta1*x + rnorm(n)

    # Mimic the bad interviewer: round x up to the next integer for a random 25% of subjects.
    badinterviewer <- sample(1:n, .25*n, replace = FALSE)
    x[badinterviewer] <- ceiling(x[badinterviewer])

    y <- matrix(y, ncol = 1)
    x <- cbind(1, x)

    coef <- .lm.fit(x, y)$coef
    coef[2]
  })
  save <- data.frame(save)
  ggplot(save, aes(x = save)) + geom_density() + geom_vline(xintercept = beta1)
})

We've gone from ~15 seconds to half a second.

Note that sapply isn't always faster than a loop, especially if the output from sapply isn't a clean vector/matrix and needs to be converted from a list (functions like Reduce are quite slow). If you're really optimizing for performance, you'll need to test both; here it doesn't appear to make much of a difference.
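One way to test this directly (a sketch, not from the original notes; the bad-interviewer rounding is omitted for brevity) is to wrap each version in a function and hand both to microbenchmark:

sim_loop <- function(reps = 1000, n = 100) {
  out <- vector(length = reps)
  for (i in 1:reps) {
    x <- cbind(1, rnorm(n))
    y <- 2 + .7*x[, 2] + rnorm(n)
    out[i] <- .lm.fit(x, y)$coef[2]
  }
  out
}
sim_sapply <- function(reps = 1000, n = 100) {
  sapply(1:reps, function(i) {
    x <- cbind(1, rnorm(n))
    y <- 2 + .7*x[, 2] + rnorm(n)
    .lm.fit(x, y)$coef[2]
  })
}
microbenchmark(sim_loop(), sim_sapply(), times = 10)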

In either case, the remaining bottlenecks now are the rnorm calls. I’ve not found a way to speed these up. We could try some of the other lines (in particular, I suspect there’s a better way to create the badinterviewer effect), but any gain there would be very minimal.

Tips for speeding up code

  • Vectorize your functions instead of looping. For an example of this, consider the movie question from the first assignment, or see the short sketch after this list.
  • Google for other solutions (I found .lm.fit for this class!). Try “R <function name> faster”.
  • Never dynamically expand objects - pre-allocate space.
  • If your code is fast enough that profvis won't help, run it repeatedly. The flame graph might not be useful, but the rest of the output will be.
profvis(rnorm(100))
## Error in parse_rprof(prof_output, expr_source): No parsing data available. Maybe your function was too fast?
profvis({
  for (i in 1:10000) {
    rnorm(100)
  }
})
  • Common wisdom isn't always right. For your specific case, the "slower" function may outperform the "faster" one.
  • Writing C or C++ code can be very rewarding, especially if you’re doing matrix operations.
  • Sometimes it's a better use of your time just to run the code on a more powerful computer.
  • Parallelization can help (which we’ll get to in a few weeks).
  • If you can, optimize (and debug) your code on a subset of your data to keep yourself sane!
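As a brief illustration of the vectorization point above (a sketch, not from the assignment), compare an explicit loop with the equivalent vectorized call:

x <- rnorm(1e5)
loop_version <- function(x) {
  out <- vector(length = length(x))  # pre-allocated, so we're timing only the loop
  for (i in seq_along(x)) {
    out[i] <- log(abs(x[i])) + 1
  }
  out
}
microbenchmark(loop_version(x),
               log(abs(x)) + 1,
               times = 20)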

Code-specific tips

  • mean is ridiculously slow for what it is. There are a few other situations like this, e.g. x*x is faster than x^2 (see the second comparison below).
x <- rnorm(100)
microbenchmark(mean(x),
               sum(x)/length(x))
## Unit: nanoseconds
##              expr  min     lq    mean median   uq   max neval cld
##           mean(x) 2472 2661.5 3377.96 2898.5 3155 23118   100   b
##  sum(x)/length(x)  391  428.5  617.11  464.5  542  8619   100  a
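A similar comparison (not in the original output) for the x*x versus x^2 example, reusing the x defined above:

microbenchmark(x * x,
               x^2)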

  1. Code that benefits most from a low-level language typically involves heavy arithmetic (especially matrix operations) or loops. Looping in R is generally slower than using vectorized R functions, which are in turn generally slower than loops in C.

Josh Errickson