Apply functions

Thank you for visiting my notes. I've noticed a large uptick in views lately which is terrific! However, these notes are quite old (2017) and some details may be outdated. I have a more recent class (2023) which covers much of the same information and is hopefully more up-to-date.

If you are using these notes, I'd love to hear from you and how you are using them! Please reach out to jerrick (at) umich (dot) edu.

Introduction

When we talk about “vectorizing” in R, this typically means using one of the apply family of functions. Loops are often inefficient, and the apply functions can be used in many situations where one would otherwise use a loop.

I found this nice summary on StackOverflow by user joran of when to use each version; here’s my slightly updated version:

  • apply: When you want to apply a function to the rows or columns of a matrix, data.frame or array. (Not recommended for data.frames.)
  • lapply: When you want to apply a function to each element of a list, data.frame or vector in turn and get a list back.
  • sapply: When you want to apply a function to each element of a list, data.frame or vector in turn, but you want a vector back, rather than a list.

apply

The easiest function is apply without any modifier. apply can be used with any matrix-ish object (a matrix, a data.frame, an array) and performs a function along a given dimension. For example, to get the column minimums, we could run:

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
apply(iris[,-5], 2, min)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##          4.3          2.0          1.0          0.1

This is equivalent to running

min(iris[,1])
## [1] 4.3
min(iris[,2])
## [1] 2
min(iris[,3])
## [1] 1
min(iris[,4])
## [1] 0.1

or

save <- numeric(4)  # preallocate a numeric vector for the results
for (i in 1:4) {
  save[i] <- min(iris[,i])
}
save
## [1] 4.3 2.0 1.0 0.1

Note that data.frames often have mixed types (strings or factors alongside numeric columns), and a given function may not be defined for all of them. Hence my sub-setting of iris to drop the Species column.
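To see why dropping Species matters, here is a small sketch (my own illustration, using only base R and the built-in iris data) of what happens when the factor column is left in. apply converts a data.frame to a matrix before doing anything else, and a matrix can hold only one type.

```r
# A matrix holds a single type, so converting the full iris data.frame
# (four numeric columns plus one factor) gives a character matrix.
m <- as.matrix(iris)
is.character(m)

# apply() therefore passes character vectors to the function, and min()
# compares strings rather than numbers.
mixed <- apply(iris, 2, min)
is.character(mixed)

# Dropping the factor column first keeps everything numeric.
numeric_only <- apply(iris[, -5], 2, min)
is.numeric(numeric_only)
```

All three checks return TRUE, which is why the numeric-only subset is the safer input.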

The second argument to apply is the dimension to work on: “1” is rows, “2” is columns. Arrays can have more than two dimensions (a matrix is just a 2-dimensional array), so for them you can use higher values.
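A quick sketch of the dimension argument on a toy matrix and a 3-dimensional array. The objects m and arr are made up for illustration.

```r
# A 2 x 3 matrix, filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # one value per row
row_sums
## [1]  9 12

col_sums <- apply(m, 2, sum)  # one value per column
col_sums
## [1]  3  7 11

# A 2 x 3 x 4 array: working along dimension 3 applies the function
# to each of the four 2 x 3 slices.
arr <- array(1:24, dim = c(2, 3, 4))
slice_sums <- apply(arr, 3, sum)
slice_sums
## [1]  21  57  93 129
```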

The third argument is a function. You can pass additional arguments to that function as additional arguments to apply:

apply(iris[, -5], 2, mean)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##     5.843333     3.057333     3.758000     1.199333
apply(iris[, -5], 2, mean, trim = .15)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##     5.803774     3.042453     3.789623     1.186792

Finally, you can write your own function directly in-line:

apply(iris[, -5], 2, function(x) {
  return(mean(x))
})
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##     5.843333     3.057333     3.758000     1.199333

Note the curly braces, and note that x represents each column of data.
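Two variations on the in-line function, sketched here as an illustration rather than a rule: a one-line function can skip the braces and the explicit return, and a function that returns several values makes apply return a matrix with one column per input column.

```r
# Single expression: braces and return() are optional.
spread <- apply(iris[, -5], 2, function(x) max(x) - min(x))
spread

# Returning a named vector of length 2 gives a 2 x 4 matrix,
# with rows "min" and "max" and one column per variable.
extremes <- apply(iris[, -5], 2, function(x) {
  c(min = min(x), max = max(x))
})
dim(extremes)
rownames(extremes)
```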

lapply

For non-array data structures (lists, vectors) we can instead use lapply. It works very similarly, except it has no 2nd argument (there’s only one dimension to a list or a vector). lapply can also operate on a data.frame, but only on columns (variables).

l <- list(1,2,3)
l
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
lapply(l, function(x) {
  return(x + 1)
})
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
lapply(1:3, `+`, 1)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
lapply(iris[, -5], min)
## $Sepal.Length
## [1] 4.3
##
## $Sepal.Width
## [1] 2
##
## $Petal.Length
## [1] 1
##
## $Petal.Width
## [1] 0.1

Recall that + is a function in R: `+`(1, 2) is identical to 1+2.

sapply

sapply is a wrapper around lapply that attempts to “simplify” the output - usually creating a vector where appropriate.

sapply(1:3, `+`, 1)
## [1] 2 3 4
sapply(iris[, -5], min)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##          4.3          2.0          1.0          0.1

Unless you explicitly want a list output, you’ll probably use sapply over lapply - but if that doesn’t work, fall back on lapply. Look into Reduce or do.call for converting lists into a more usable object.
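As a rough sketch of what Reduce and do.call can do with a list coming out of lapply (the object names here are invented for the example):

```r
# lapply returns a list of small data.frames, one per input value.
pieces <- lapply(1:3, function(i) {
  data.frame(id = i, square = i^2)
})

# do.call passes the list elements as the arguments of a single call,
# i.e. rbind(pieces[[1]], pieces[[2]], pieces[[3]]).
combined <- do.call(rbind, pieces)
dim(combined)
## [1] 3 2

# Reduce combines the elements pairwise, i.e. (1 + 2) + 3.
total <- Reduce(`+`, list(1, 2, 3))
total
## [1] 6
```

do.call makes one call with every element at once, while Reduce works through the list two elements at a time, so do.call is usually the better fit for functions like rbind that accept any number of arguments.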

Note that sapply is much faster than apply for operating on columns of data.frames, because apply first converts the data.frame to a matrix:

data(diamonds, package = "ggplot2")
# Drop off factors
diamonds <- data.frame(diamonds[, c(1, 5:10)])
dim(diamonds)
## [1] 53940     7
system.time(save <- apply(diamonds, 2, mean, na.rm = TRUE))
##    user  system elapsed
##   0.049   0.002   0.051
system.time(save2 <- sapply(diamonds, mean, na.rm = TRUE))
##    user  system elapsed
##   0.003   0.001   0.005
all.equal(save, save2)
## [1] TRUE
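If ggplot2 is not installed, a similar comparison can be sketched with simulated data. The data.frame big is made up for this example, and the timings will vary by machine, so none are shown. For column means specifically, base R also provides colMeans.

```r
set.seed(1)
# 100,000 rows by 7 numeric columns, named V1 to V7.
big <- as.data.frame(matrix(rnorm(1e5 * 7), ncol = 7))

# apply() converts the data.frame to a matrix before looping over columns.
system.time(via_apply <- apply(big, 2, mean))

# sapply() treats the data.frame as a list of columns, with no conversion.
system.time(via_sapply <- sapply(big, mean))

# Both approaches give the same named vector of column means.
all.equal(via_apply, via_sapply)

# colMeans() is a dedicated base R function for this particular task.
via_colmeans <- colMeans(big)
all.equal(via_sapply, via_colmeans)
```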

The others

There are a few others of more limited use:

  • vapply: It can be a bit faster than sapply because you explicitly define the output. I find it of limited use; the speed gains are rarely worth the coding overhead.
  • mapply: Allows you to pass in multiple objects to operate on. Has its arguments reversed; the first is the function, followed by any others. Try running mapply(sum, 1:4, 5:8). More useful if you have two data.frames!
  • tapply: Over a “ragged” array. Think of it more as a grouping function. If you want the sum of a variable by group, you could run something like tapply(variable, group, sum)
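A short sketch of all three on built-in data, as an illustration only:

```r
# vapply: like sapply, but you declare what each result must look like.
# Here each column must produce a single numeric value, or vapply errors.
mins <- vapply(iris[, -5], min, FUN.VALUE = numeric(1))
mins

# mapply: the function comes first, then the objects to step through in
# parallel. This computes 1+5, 2+6, 3+7, 4+8.
pair_sums <- mapply(sum, 1:4, 5:8)
pair_sums
## [1]  6  8 10 12

# tapply: sum of a variable within each level of a grouping variable.
by_species <- tapply(iris$Sepal.Length, iris$Species, sum)
by_species
```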

Josh Errickson