Counting the number of times that a condition is satisfied
Don’t use loops to count things. If x is a vector,
sum(x > 5)
tells us the number of values in x that are greater than 5, because x > 5 is a logical vector and sum counts each TRUE as 1. To get the index positions of the values in x that are greater than 5, use
which(x > 5)
So
length(which(x > 5))
is a roundabout, less efficient way to get the number of values in x that are greater than 5, since it builds the index vector only to measure its length.
If you mix these two approaches incorrectly you get the wrong answer:
length(x > 5)
is just telling you the length of x, and
sum(which(x > 5))
is summing the indices, which is probably not what you want.
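The four expressions discussed so far can be compared side by side on a small vector. This is only an illustrative sketch: the values in x are made up for the example, and the variable names are hypothetical.

```r
# Hypothetical example vector, chosen only for illustration.
x <- c(2, 7, 9, 4, 6)

n_sum     <- sum(x > 5)            # 3: counts the TRUE values
idx       <- which(x > 5)          # 2 3 5: positions of the TRUE values
n_which   <- length(which(x > 5))  # 3: same count, via an extra index vector
wrong_len <- length(x > 5)         # 5: the length of x, not a count
wrong_sum <- sum(which(x > 5))     # 10: 2 + 3 + 5, a sum of positions
```

The two correct counts agree, while the two mixed-up forms return numbers that have nothing to do with how many values exceed 5.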
The following gives the correct result, but is very inefficient and not idiomatic at all for R:
count = 0
for (j in 1:length(x)) {
    if (x[j] > 5) {
        count = count + 1
    }
}
Apply
The first argument to apply should be 2-dimensional or higher. It is an error to use apply with a plain vector, which has no dim attribute.
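A minimal sketch of this behaviour, using a made-up vector v and matrix m (both names are hypothetical). The call on the vector is wrapped in tryCatch so the script keeps running after the error.

```r
# A plain vector has no dim attribute, so apply() cannot work with it.
v <- c(1, 2, 3)
res <- tryCatch(apply(v, 1, sum), error = function(e) "error")

# A matrix has a dim attribute, so apply() works as expected.
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)   # 9 12: one sum per row
```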
The third argument to apply should be a “reducing” function, like sum or max, that takes a vector as input and returns a scalar. It does not usually make sense to call apply with a “pointwise” function like abs, which returns an output that is the same shape as the input. You can use abs(x) to get the absolute value of every element of x; there is no need for apply.
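The difference between the two kinds of function can be seen on a small made-up matrix (the name m and its values are hypothetical). The sketch also shows one reason to avoid apply with a pointwise function over rows: the result comes back transposed relative to the input.

```r
# Hypothetical 2 x 3 matrix with mixed signs.
m <- matrix(c(-1, 2, -3, 4, -5, 6), nrow = 2)

col_max  <- apply(m, 2, max)   # reducing: one value per column
abs_m    <- abs(m)             # pointwise: called directly, same shape as m
abs_rows <- apply(m, 1, abs)   # works, but the result is 3 x 2, not 2 x 3
```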
The third argument passed to apply should be a function, so something like this doesn’t work, because median(x - median(x)) evaluates to a number rather than a function:
apply(x, 1, median(x - median(x)))
It can be useful to “wrap” a function, for example if you want to specify an optional argument. Below we wrap the mad function so that we get the raw MAD, not the MAD that is rescaled to be comparable to the standard deviation of a Gaussian distribution:
apply(x, 2, function(x) mad(x, constant = 1))
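As a rough check of what the wrapper computes, the sketch below compares the wrapped mad call with the raw median absolute deviation written out by hand. The matrix m and the argument name col are made up for illustration.

```r
# Hypothetical 5 x 2 matrix; the first column contains an outlier.
m <- matrix(c(1, 2, 3, 4, 100, 10, 20, 30, 40, 50), ncol = 2)

# Wrapping mad() to set the optional constant argument.
raw_mad <- apply(m, 2, function(col) mad(col, constant = 1))

# The same quantity written out: median absolute deviation from the median.
by_hand <- apply(m, 2, function(col) median(abs(col - median(col))))

# apply() also forwards extra named arguments to the function it calls.
forwarded <- apply(m, 2, mad, constant = 1)
```

All three results are the same here. Passing constant = 1 through apply is an alternative to the wrapper when the only goal is to set an optional argument.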
But you don’t need to wrap a function if you are just calling it in the default way. For example, this is needlessly complex:
apply(x, 2, function(x) median(x))
It is simpler to pass the function directly:
apply(x, 2, median)
Loops
Computations whose results do not change between loop iterations should be moved outside the loop. For example, this is inefficient:
# Computes the ratio between the proportion of elements in each row
# of an array that are positive and the proportion of elements in
# the first row of the array that are positive.
# Every row has the same length, so the ratio of counts equals the
# ratio of proportions.
v = numeric(nrow(x))
for (j in 1:nrow(x)) {
    v[j] = sum(x[j,] > 0) / sum(x[1,] > 0)
}
It is better to use the code below, so that we don’t repeatedly recalculate the result for the first row:
p1 = sum(x[1,] > 0)
v = numeric(nrow(x))
for (j in 1:nrow(x)) {
    v[j] = sum(x[j,] > 0) / p1
}
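The hoisted loop can be tried on a small made-up matrix. The sketch below also includes a loop-free variant using rowSums; that variant is not part of the advice above, only an illustration of how the same quantity might be computed without a loop. It assumes the first row contains at least one positive value, since otherwise the division is by zero.

```r
# Hypothetical 3 x 3 matrix; its rows contain 1, 2 and 3 positive values.
x <- matrix(c(1, -2, 3, -4, 5, 6, -7, 8, 9), nrow = 3)

# Compute the first-row count once, outside the loop.
p1 <- sum(x[1, ] > 0)
v <- numeric(nrow(x))
for (j in seq_len(nrow(x))) {
    v[j] <- sum(x[j, ] > 0) / p1
}

# Loop-free variant: count the positive values in every row at once.
v_vec <- rowSums(x > 0) / p1
```

Here seq_len(nrow(x)) plays the same role as 1:nrow(x), but it also yields an empty sequence when the matrix has no rows.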