Debugging R code

Note

RMarkdown is not suited for visualizing debugging tasks, because the R scripts it contains are not interactive. Therefore, none of the debugging output below is run live; it was manually copied from the R console. Hopefully this does not impact understanding, but input and output may occasionally become disconnected.

Introduction

Finding your bug is a process of confirming the many things you believe are true, until you find one which is not true. - Norman Matloff

Manual debugging

The most basic way of debugging, which I’m sure we’ve all done, is stepping through the code line-by-line to identify where things go wrong. With relatively short code, or for a particularly pernicious bug, this can be a fine strategy. However, we can do better.

Rather than stepping through the entire code, we can instead insert print() calls to observe certain key objects. For larger objects (e.g. models from lm or data.frames), try printing a summary which may be more informative than the full object.
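
As a minimal sketch of this approach, the hypothetical function below (scale_and_fit and its toy data are illustrations only, not from any package) prints compact summaries of its intermediate objects rather than the objects themselves:

```r
# Hypothetical example: summarize intermediate objects instead of printing them whole
scale_and_fit <- function(d) {
  d$x_scaled <- (d$x - mean(d$x)) / sd(d$x)
  # str() gives a compact overview of a data.frame's columns and types
  str(d)
  mod <- lm(y ~ x_scaled, data = d)
  # For a model, the coefficients are often enough to spot a problem
  print(coef(mod))
  mod
}

set.seed(1)
d <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))
mod <- scale_and_fit(d)
```

Once the bug is found, remember to remove the print() and str() calls again.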

Syntax highlighting and indenting

Most code editors (such as RStudio) offer syntax highlighting and automatic indentation. These can be used to identify simple bugs such as missed quotes or parentheses. For example, below is a chunk of code from lm:

    mt <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- as.vector(model.weights(mf))
    if (!is.null(w) && !is.numeric(w))
        stop('weights' must be a numeric vector")
    offset <- as.vector(model.offset(mf))
    if (!is.null(offset)) {
        if (length(offset) != NROW(y)) 
            stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
                length(offset), NROW(y)), domain = NA)
    }

RMarkdown’s syntax highlighting isn’t as good as RStudio’s, but it illustrates the point. Notice that halfway through the code, starting with the offset line, the text turns red, whereas before that, functions are bold and quotes are red. This is a good indication that there is a missed quotation somewhere; in this case there is no opening quote in the stop just prior to offset.

Become familiar with your syntax highlighting. If you notice the highlighting acting differently than expected, you likely have a bug.

Similar to this is automatic indentation. RStudio attempts to help you maintain proper code style by automatically indenting inside conditional expressions or if you break function arguments into multiple lines. Consider the following code:

if (num.controls == 0) {
  if (!exists("tmp")) {
    tmp <- .fullmatch(d.r, mnctl.r, mxctl.r, omf.r)
  }
  new.omit.fraction <- c(new.omit.fraction, omf.r)
}

There is proper indentation throughout. However, if we were to make a typo:

if (num.controls == 0) {
  if (!exists("tmp"))
    tmp <- .fullmatch(d.r, mnctl.r, mxctl.r, omf.r)
}
new.omit.fraction <- c(new.omit.fraction, omf.r)
}

Notice how the indenting breaks when we remove the opening { after the second if statement. It’s hard to visualize this in RMarkdown, but if you try it live, you’ll see what happens.

It can help to run Code -> Reindent Lines (after highlighting a chunk of code) and let RStudio try to automatically indent existing code - if the result looks odd, you may have an issue.

Note that these tools will only catch trivial bugs like a missed character, not more serious issues.
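
Outside of an editor, the same class of trivial bug can be caught by asking R to parse the code without running it. The sketch below assumes a hypothetical helper, check_syntax, and stores the broken stop() line from earlier as text:

```r
# The broken line from above, stored as text (note the missing opening quote)
bad_code <- "stop('weights' must be a numeric vector\")"
good_code <- "stop(\"'weights' must be a numeric vector\")"

# Hypothetical helper: parse() only checks syntax, it does not run the code
check_syntax <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  },
  error = function(e) FALSE)
}

check_syntax(bad_code)   # FALSE
check_syntax(good_code)  # TRUE
```

Like highlighting and indentation, this only detects code that is not valid R; it says nothing about whether valid code does the right thing.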

Formal Debugging

R does offer formal debugging capabilities. These will allow us to find the bugs more rapidly than manual debugging.

traceback

Errors in R can be somewhat opaque, especially if they occur in the more internal functions rather than in the user-facing functions. It can be handy to look at the call stack when the error occurs. For example, let’s say we have the following functions:

foo <- function(x) {
  x <- x + 1
  bar(x)
}

bar <- function(x) {
  a <- baz(x)
  return(a)
}

baz <- function(x) {
  sum(x, "a")
}
foo(3)
## Error in sum(x, "a"): invalid 'type' (character) of argument
> traceback()
3: baz(x) at #2
2: bar(x) at #3
1: foo(3)

The stack as it appears at the time of error is printed. Starting at the bottom, foo(3) was called (obviously). Inside foo is a call to bar (at line #3) so bar(x) gets added to the call stack. Then inside bar is a call to baz (at line #2) so baz(x) is added to the call stack. The error itself occurs directly inside baz, so baz(x) is the top entry of the call stack.

A quick note about the line numbers: They are relative inside each function. You can examine each function’s compiled source by typing edit(foo) which will give you the line numbers directly.

This is most useful when writing your own code and functions and of limited use debugging base R functions, as they tend to refer to internal functions very rapidly; e.g.

lm(y ~ x)
## Error in eval(expr, envir, enclos): object 'y' not found
> traceback()
7: eval(expr, envir, enclos)
6: eval(predvars, data, env)
5: model.frame.default(formula = y ~ x, drop.unused.levels = TRUE)
4: stats::model.frame(formula = y ~ x, drop.unused.levels = TRUE)
3: eval(expr, envir, enclos)
2: eval(mf, parent.frame())
1: lm(y ~ x)
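
traceback() is meant for interactive use. In a script, one possible approach (a sketch, not the only way) is to record the stack with sys.calls() from a calling handler, which runs before the stack is unwound. The variable names stack and msg are illustrative:

```r
foo <- function(x) {
  x <- x + 1
  bar(x)
}
bar <- function(x) {
  a <- baz(x)
  return(a)
}
baz <- function(x) {
  sum(x, "a")
}

stack <- NULL
msg <- tryCatch(
  withCallingHandlers(
    foo(3),
    # A calling handler runs while the frames for foo, bar and baz still exist
    error = function(e) {
      stack <<- vapply(sys.calls(), function(cl) deparse(cl)[1], character(1))
    }
  ),
  # The exiting handler then stops the error from halting the script
  error = function(e) conditionMessage(e)
)
msg
stack
```

The recorded stack also contains the tryCatch and handler machinery, so expect more entries than traceback() would show.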

browser

Of course, if we do not get an error but rather an unexpected output, traceback will not help. For that task, we need to step through the code to figure out where things go wrong. In more complex tasks, particularly when the call stack can get large, stepping into and out of each sub-function can be unnecessarily burdensome. The browser function is an automated way to step through code.

Accessing the browser

There are a few ways to invoke the browser:

  1. Add the call browser() anywhere in your function. If you’ve previously identified where the issue starts to arise (perhaps with traceback) you can start just prior to that, otherwise place it in the beginning of the function. Call the function normally and you’ll enter the browser at the place of your choosing.
  2. Enable the recover option via options(error = recover). If you encounter an error, you will have a choice of where to enter the call stack and start the browser. To stop this functionality, reset the option to NULL: options(error = NULL). You could also save the existing option and restore it:
oldopt <- options()
options(error = recover)
options(oldopt)
  3. Enable debugging on a specific function: debug(<funname>). Whenever this function is run (whether called directly or from elsewhere in the call stack), a browser will start automatically. Run undebug(<funname>) to disable.
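
As a small sketch of the debug()/undebug() option, isdebugged() reports whether a function is currently flagged. The function f here is a throwaway example:

```r
f <- function(x) x + 1

debug(f)                  # flag f: the next call to f would open the browser
flag_on <- isdebugged(f)
undebug(f)                # remove the flag again
flag_off <- isdebugged(f)

flag_on    # TRUE
flag_off   # FALSE
```

If you only want to step through a single call, debugonce(f) sets the flag for one call only, so there is nothing to undo afterwards.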

Using the browser

First, let’s see what the browser looks like:

foo <- function(x, y = 3) {
  browser()
  bar(x, y)
}
bar <- function(a, b) {
  out <- b
  for (i in 1:a) {
    out <- out + b
  }
  out
}
> foo(2)
Called from: foo(2)
Browse[1]> 

We see the text Browse[n] prepended to the typical > prompt. The number there represents the position in the call stack.

In addition to normal R commands, there are a few new commands. We’ll list them and then demonstrate:

  • Q: Terminate the browser, not running any further code.
  • c: Terminate the browser by running the remaining code to completion.
  • f: Run the current loop or function to completion but stay in browser.
  • s: Run the next line, stepping into any function calls.
  • n: Run the next line, not stepping into any function calls.
  • Blank: Run s or n, whichever was most recent.
  • where: Print the call stack.

Some examples:

> foo(2)
Called from: foo(2)
Browse[1]> n
debug at #3: bar(x, y)
Browse[2]> 
[1] 9

Here, we first run n to go to the next line, then enter a blank, which repeats n, finishing the function and producing output.

> foo(2)
Called from: foo(2)
Browse[1]> n
debug at #3: bar(x, y)
Browse[2]> s
debugging in: bar(x, y)
debug at #1: {
    out <- b
    for (i in 1:a) {
        out <- out + b
    }
    out
}
Browse[3]> f
[1] 9

This time we step into bar, but then immediately use f to finish bar. Since nothing more is left to run in foo, the output is produced.

> foo(2)
Called from: foo(2)
Browse[1]> c
[1] 9
> foo(2)
Called from: foo(2)
Browse[1]> Q
> 

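These two runs contrast c and Q for terminating a browser: c runs the remaining code and prints the result, whereas Q quits immediately without producing any output.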

Now, why is the browser useful? We can do the same sort of debugging as the manual case, but with the benefits of browser. All R code is supported in the browser, though I wouldn’t recommend getting too complicated (You can start a second browser nested inside the first if you wanted…)

> foo(2)
Called from: foo(2)
Browse[1]> 
debug at #3: bar(x, y)
Browse[2]> s
debugging in: bar(x, y)
debug at #1: {
    out <- b
    for (i in 1:a) {
        out <- out + b
    }
    out
}
Browse[3]> 
debug at #2: out <- b
Browse[3]> x
Error: object 'x' not found
Browse[3]> y
Error: object 'y' not found
Browse[3]> a
[1] 2
Browse[3]> b
[1] 3
Browse[3]> 
debug at #3: for (i in 1:a) {
    out <- out + b
}
Browse[3]> f
[1] 9

Here we print the value of certain objects at a few points, which could be useful for debugging purposes. Note specifically the scoping going on here: inside bar, only a and b, the arguments of bar, are visible.

Preventing errors

Of course, another approach is to prevent errors before they occur. We’ve briefly discussed checking user input. We’ll go into it a bit further here, and then discuss tryCatch.

Checking user input

R does support generic functions, in the sense that you can write functions which only accept a certain type of input. These arise during the creation of S3 or S4 classes, which we won’t cover. Methods for such functions have names like summary.lm, which implies it is a summary function whose first argument is an lm. You don’t need to call summary.lm(lm(...)); R is smart enough to figure out that summary(lm(...)) should pass the lm into summary.lm (known as “dispatching”).

You can see whether a given function is generic and what objects it supports using methods:

head(methods(summary))
## [1] "summary.aov"                   "summary.aovlist"
## [3] "summary.aspell"                "summary.check_packages_in_dir"
## [5] "summary.connection"            "summary.data.frame"
length(methods(summary))
## [1] 33

However, most functions are not generic:

methods(lm)
## Warning in .S3methods(generic.function, class, parent.frame()): function
## 'lm' appears not to be S3 generic; found functions that look like S3
## methods
## [1] lm.fit       lm.influence lm.wfit
## see '?methods' for accessing help and source code
methods(sum)
## no methods found

So for these functions, if you pass the wrong type of object, the function needs to detect this and raise a proper error (or else the user will get an opaque error later in the code). Additionally, further arguments are not generic (see footnote 1), so we’ll need to check them anyway.

Generally, this checking entails ensuring that arguments are of the proper type and have feasible values. For example, in lm:

if (!is.null(w) && !is.numeric(w))
  stop("'weights' must be a numeric vector")

Code style issues aside (missing { and }!) this will provide a useful error to the user. %in% can also be useful; say you have an argument that only has reasonable values of 1, 4, and 10:

if (!(argument %in% c(1, 4, 10))) {
  stop("Bad input in 'argument'!")
}

There is also the function stopifnot which takes in any number of expressions that return TRUE or FALSE and produces an error if untrue. This is much shorter than the conditional, but make sure the error it will produce is useful to the user:

stopifnot(1 == 1)
stopifnot(1 == 2)
## Error: 1 == 2 is not TRUE
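
One way to keep stopifnot short while still giving a useful message is to name its arguments; the name is then used as the error message. This requires R 4.0.0 or later, and the function half below is a hypothetical example:

```r
# Hypothetical function; named stopifnot() arguments need R 4.0.0 or later
half <- function(x) {
  stopifnot(
    "'x' must be numeric" = is.numeric(x),
    "'x' must be non-negative" = all(x >= 0)
  )
  x / 2
}

half(4)
# Capture the message instead of halting, just to show what the user would see
msg <- tryCatch(half("a"), error = function(e) conditionMessage(e))
msg
```

The expressions are checked in order and checking stops at the first failure, so the second test is never evaluated for a character input.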

tryCatch

Sometimes errors may occur either due to statistical issues (e.g. failure of a glm to converge) or due to a bug inside someone else’s function that you can’t modify. In these cases, it’s useful to wrap the potentially offending code in a tryCatch block to either produce a useful error message or to provide a graceful fallback. For example:

smartSum <- function(x, y) {
  tryCatch({
    sum(x, y)
  },
  error = function(e) {
    print(e)
    sum(as.numeric(x), as.numeric(y))
  })
}
smartSum(1,2)
## [1] 3
sum("1", 2)
## Error in sum("1", 2): invalid 'type' (character) of argument
smartSum("1", 2)
## <simpleError in sum(x, y): invalid 'type' (character) of argument>
## [1] 3

We can see that the sum(x, y) line is tried. When it runs without error, nothing else is done. When it does error, we capture the error in the error = function(e) line, print it (as ordinary output, not as a warning), and try converting to numeric.

Note that you need not use the error e (remove the print(e) line and the code works fine) but you do have to capture it as an argument to the function.
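
tryCatch can also handle warnings and run cleanup code. The following sketch assumes a hypothetical helper, to_number, which falls back to NA when as.numeric signals a warning:

```r
# Hypothetical helper: convert text to a number, falling back to NA
to_number <- function(x) {
  tryCatch({
    as.numeric(x)
  },
  warning = function(w) {
    # as.numeric("abc") signals a warning rather than an error
    message("Could not convert: ", conditionMessage(w))
    NA_real_
  },
  finally = {
    # Runs whether or not a condition was signalled
    message("Finished converting")
  })
}

to_number("3.5")
to_number("abc")
```

Be aware that a warning handler abandons the expression at the first warning, so this pattern fits cases where a warning really does mean the result is unusable.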

Some common issues

This list is by no means comprehensive; it is a small collection of common issues which I’ve come across.

Numeric precision and all.equal

a <- sqrt(2)^2
a
## [1] 2
a == 2
## [1] FALSE
isTRUE(all.equal(a, 2))
## [1] TRUE

Due to numeric precision issues, be careful checking equality whenever non-integer arithmetic is involved. all.equal allows a bit of flexibility by checking equality up to some tolerance (default is \(1.5 \times 10^{-8}\)); however, all.equal will NOT return FALSE if equality fails:

all.equal(1, 2)
## [1] "Mean relative difference: 1"

Instead, call isTRUE(all.equal(...)).
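
A few more cases, as a quick sketch, including the tolerance argument of all.equal:

```r
x <- 0.1 + 0.2
x == 0.3                                       # FALSE
isTRUE(all.equal(x, 0.3))                      # TRUE

# The tolerance is on the relative difference and can be loosened
isTRUE(all.equal(1, 1.001))                    # FALSE
isTRUE(all.equal(1, 1.001, tolerance = 0.01))  # TRUE
```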

[ vs [[

When working with a list, [ extracts a list of length 1, while [[ extracts the object at the position.

l <- list(a = 1, b = 2)
is(l[1])
## [1] "list"   "vector"
is(l[[1]])
## [1] "numeric" "vector"
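
The difference matters as soon as you try to compute with the result. A small sketch, reusing the same list:

```r
l <- list(a = 1, b = 2)

# [[ returns the element itself, so arithmetic works
l[["a"]] + 1

# [ returns a list, so the same arithmetic fails;
# the error message is captured here rather than halting
res <- tryCatch(l["a"] + 1, error = function(e) conditionMessage(e))
res
```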

& vs &&

R has two versions of logical “AND”, & and &&. These often operate identically, but differ in two key ways. (Replace “AND”/&/&& with “OR”/|/|| as necessary.)

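  1. The && version is not vectorized whereas & is. (The output below is from an older version of R, where && silently used only the first element of each side; since R 4.3.0, giving && an operand of length greater than one is an error.)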
c(TRUE, FALSE) && c(TRUE, TRUE)
## [1] TRUE
c(TRUE, FALSE) & c(TRUE, TRUE)
## [1]  TRUE FALSE
  2. The && version will only evaluate until the outcome is determined, whereas & evaluates all arguments.
f <- function() {
  print("In F")
  return(FALSE)
}
t <- function() {
  print("In T")
  return(TRUE)
}
f() & t()
## [1] "In F"
## [1] "In T"
## [1] FALSE
f() && t()
## [1] "In F"
## [1] FALSE

Aside from the obvious bugs that using the wrong version can cause, there are also some more subtle issues:

  • If subsetting a matrix/data.frame, using && subsets all (if TRUE) or nothing (if FALSE) instead of the subset which the vectorized & would evaluate to.
  • Say you have two conditions, condA && condB. If your test cases always have condA as FALSE, then condB will never be evaluated. Then in a real run, if you hit condB it could error.

That said, && is preferred over & in any case where you don’t need to vectorize, as it will be slightly faster. Just be sure to test all combinations of TRUE/FALSE!
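
Short-circuiting is also what makes && useful as a guard: a cheap test placed first can protect a later test that would otherwise fail. A minimal sketch, with is_positive as a hypothetical helper:

```r
# Hypothetical check: is x a single positive number?
is_positive <- function(x) {
  # Each test is only evaluated if the ones before it were TRUE,
  # so x > 0 is never reached for NULL or a non-numeric x
  is.numeric(x) && length(x) == 1 && x > 0
}

is_positive(3)      # TRUE
is_positive(NULL)   # FALSE
is_positive("a")    # FALSE
```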

Conditionals when NA/NULL is plausible

A comparison involving NA returns NA rather than TRUE or FALSE, so check for missingness first:

x <- NA
x == 2
## [1] NA
!is.na(x) & x == 2
## [1] FALSE
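
This matters most inside if, which stops with an error when its condition is NA or has length zero. One option, sketched below, is to wrap the comparison in isTRUE; the variable names are illustrative:

```r
x <- NA
# if (x == 2) stops with an error because the condition is NA;
# the error is caught here so the script can continue
res_na <- tryCatch(if (x == 2) "two" else "not two",
                   error = function(e) "error")
res_na

# isTRUE() turns NA and zero-length results into FALSE
y <- NULL
isTRUE(x == 2)   # FALSE
isTRUE(y == 2)   # FALSE, since NULL == 2 is logical(0)
```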

<- vs =

While it’s not incorrect to use = for assignment, the standard tends to be <-. However, make sure you do not use <- in a function call for an argument:

f <- function(x) {
    print(x)
}
f(x = 2)
## [1] 2
x
## [1] NA
f(x <- 2)
## [1] 2
x
## [1] 2

T and F

Never use T or F in place of TRUE or FALSE:

> T <- 5
> TRUE <- 5
Error in TRUE <- 5 : invalid (do_set) left-hand side to assignment

T and F can be accidentally re-defined.
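
A short sketch of how such a redefinition changes behavior, and how removing it restores the original value:

```r
T <- 0                        # allowed: T is an ordinary variable, not a reserved word
after_redefine <- isTRUE(T)   # FALSE, so code relying on T now misbehaves
rm(T)                         # remove our copy; T resolves to the base value again
after_rm <- isTRUE(T)         # TRUE

after_redefine
after_rm
```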

Matrix subsetting

This is a common issue I run into: Normally, subsetting matrices returns matrices. However, if you subset a single column or row, R converts it into a vector. To prevent this, add as a third argument drop = FALSE:

x <- matrix(1:4, nrow = 2)
x[, 1]
## [1] 1 2
x[, 1, drop = FALSE]
##      [,1]
## [1,]    1
## [2,]    2
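
This typically bites when the number of selected columns varies at run time. In the sketch below, col_means_of is a hypothetical helper that relies on its input staying a matrix:

```r
# Hypothetical helper that relies on the subset staying a matrix
col_means_of <- function(m, cols) {
  sub <- m[, cols, drop = FALSE]
  colMeans(sub)   # colMeans() needs an input with at least two dimensions
}

x <- matrix(1:4, nrow = 2)
col_means_of(x, 1:2)
col_means_of(x, 1)     # still works with a single column
```

Without drop = FALSE, the single-column call would hand colMeans a plain vector and fail.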

Function returns

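Without an explicit return call, a function returns the value of the last expression it evaluates. Note the difference when the last line contains an assignment: the value is still returned, but invisibly, so calling f1() on its own prints nothing: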

f1 <- function() {
    a <- 1
}
f1()
k <- f1()
k
## [1] 1
f2 <- function() {
    1
}
f2()
## [1] 1
k <- f2()
k
## [1] 1
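
To check whether a result would print, withVisible() reports both the value and its visibility. A brief sketch using the same f1:

```r
f1 <- function() {
  a <- 1
}

# withVisible() returns the value and whether it would auto-print
withVisible(f1())$visible   # FALSE
withVisible(f1())$value     # 1

# Wrapping the call in parentheses forces printing
(f1())
```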

  1. Some functions, such as `+`, can dispatch on the first two or several arguments, but nothing dispatches on all arguments.

Josh Errickson