Apply functions
Introduction
When we talk about “vectorizing” in R, we typically mean using one of the apply family of functions. Explicit loops tend to be more verbose and are often slower, and the apply functions can be used in many situations where one would otherwise write a loop.
I found this nice summary on StackOverflow by user joran of when to use each version; here’s my slightly updated version:
apply
: When you want to apply a function to the rows or columns of a matrix, data.frame or array. (Not recommended for data.frames.)
lapply
: When you want to apply a function to each element of a list, data.frame or vector in turn and get a list back.
sapply
: When you want to apply a function to each element of a list, data.frame or vector in turn, but you want a vector back, rather than a list.
apply
The simplest function is apply itself, with no prefix. apply can be used with any matrix-ish object (a matrix, a data.frame, an array) and applies a function along a given dimension. For example, to get the column minimums, we could run:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
apply(iris[,-5], 2, min)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4.3 2.0 1.0 0.1
This is equivalent to running
min(iris[,1])
## [1] 4.3
min(iris[,2])
## [1] 2
min(iris[,3])
## [1] 1
min(iris[,4])
## [1] 0.1
or
# Pre-allocate a numeric vector, then fill it one column at a time
save <- numeric(4)
for (i in 1:4) {
  save[i] <- min(iris[, i])
}
save
## [1] 4.3 2.0 1.0 0.1
Note that data.frames often have mixed types (strings or factors alongside numeric columns), and a given function may not be defined for all of them. Hence my sub-setting of iris to drop the Species column.
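To make the mixed-type problem concrete, here is a small sketch using the built-in iris data (the names via_apply and via_sapply are just illustrative). apply converts a data.frame to a matrix before doing anything else, and a single factor column is enough to turn every column into character:

```r
# apply() coerces its input to a matrix; the Species factor forces
# every column to character, so no column looks numeric any more.
via_apply <- apply(iris, 2, is.numeric)
via_apply

# sapply() visits each column as-is, so the original types survive.
via_sapply <- sapply(iris, is.numeric)
via_sapply
```

This is one reason apply is not recommended for data.frames unless all the columns share a type.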
The second argument to apply is the dimension to work on: “1” is rows, “2” is columns. For arrays, which can have more than two dimensions (a matrix is a 2-dimensional array), you can use higher values.
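As a quick illustration of the dimension argument, the sketch below uses a small made-up matrix and a 3-dimensional array (m, a and the result names are not from the examples above):

```r
m <- matrix(1:6, nrow = 2)    # 2 rows, 3 columns
row_sums <- apply(m, 1, sum)  # one result per row
col_sums <- apply(m, 2, sum)  # one result per column

a <- array(1:24, dim = c(2, 3, 4))  # four 2 x 3 slices
slice_sums <- apply(a, 3, sum)      # one result per slice along dimension 3

row_sums
col_sums
slice_sums
```

With dimension 3, each call to sum receives an entire 2 x 3 slice rather than a single row or column.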
The third argument is a function. You can pass additional arguments to that function as additional arguments to apply:
apply(iris[, -5], 2, mean)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
apply(iris[, -5], 2, mean, trim = .15)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.803774 3.042453 3.789623 1.186792
Finally, you can write your own function directly in-line:
apply(iris[, -5], 2, function(x) {
  return(mean(x))
})
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5.843333 3.057333 3.758000 1.199333
Note the curly braces, and note that x represents each column of data.
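An in-line function earns its keep when no single built-in does the job. As a hedged sketch (the name spread is just illustrative), this computes the distance between the largest and smallest value in each column:

```r
# x is one column of the numeric part of iris on each call
spread <- apply(iris[, -5], 2, function(x) {
  max(x) - min(x)
})
spread
```

The value of the last expression in the function body is returned, so the explicit return is optional here.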
lapply
For non-array data structures (lists, vectors) we can instead use lapply. It works very similarly, except that it has no dimension argument (there’s only one dimension to a list or a vector). lapply can also operate on a data.frame, but only on columns (variables).
l <- list(1,2,3)
l
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
lapply(l, function(x) {
  return(x + 1)
})
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
lapply(1:3, `+`, 1)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
lapply(iris[, -5], min)
## $Sepal.Length
## [1] 4.3
##
## $Sepal.Width
## [1] 2
##
## $Petal.Length
## [1] 1
##
## $Petal.Width
## [1] 0.1
Recall that + is a function in R: `+`(1, 2) is identical to 1 + 2.
sapply
sapply is a wrapper around lapply that attempts to “simplify” the output - usually creating a vector where appropriate.
sapply(1:3, `+`, 1)
## [1] 2 3 4
sapply(iris[, -5], min)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 4.3 2.0 1.0 0.1
Unless you explicitly want a list output, you’ll probably use sapply over lapply; if the simplified result isn’t what you need, fall back on lapply. Look into Reduce or do.call for converting lists into a more usable object.
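Here is a rough sketch of both approaches on a small made-up list (pieces, stacked and total are illustrative names). do.call hands the list elements to a function as separate arguments, while Reduce combines them two at a time:

```r
# A list of three named vectors, as lapply() might return
pieces <- lapply(1:3, function(i) c(id = i, square = i^2))

# do.call() calls rbind(pieces[[1]], pieces[[2]], pieces[[3]]),
# stacking the vectors into a 3 x 2 matrix.
stacked <- do.call(rbind, pieces)
stacked

# Reduce() folds the list pairwise, here adding the vectors element-wise.
total <- Reduce(`+`, pieces)
total
```

do.call with rbind or cbind is the usual choice when each element should become a row or column; Reduce suits cases where the elements should be merged into one result.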
Note that sapply is much faster than apply for operating on columns of data.frames, because apply first converts the whole data.frame to a matrix:
data(diamonds, package = "ggplot2")
# Drop off factors
diamonds <- data.frame(diamonds[, c(1, 5:10)])
dim(diamonds)
## [1] 53940 7
system.time(save <- apply(diamonds, 2, mean, na.rm = TRUE))
## user system elapsed
## 0.049 0.002 0.051
system.time(save2 <- sapply(diamonds, mean, na.rm = TRUE))
## user system elapsed
## 0.003 0.001 0.005
all.equal(save, save2)
## [1] TRUE
The others
There are a few others of more limited use:
vapply
: It can be a bit faster than sapply because you explicitly define the output. I find it of limited use; the speed gains are rarely worth the coding overhead.
mapply
: Allows you to pass in multiple objects to operate on. Has its arguments reversed; the function comes first, followed by the objects. Try running mapply(sum, 1:4, 5:8). More useful if you have two data.frames!
tapply
: Over a “ragged” array. Think of it more as a grouping function. If you want the sum of a variable by group, you could run something like tapply(variable, group, sum).