Statistics 506, Fall 2016

R tips and common errors


Counting the number of times that a condition is satisfied

Don’t use loops to count things. If x is a vector,

sum(x > 5)

tells us the number of values in x that are greater than 5. To get the index positions of the values in x that are greater than 5, use

which(x > 5)

So

length(which(x > 5))

is an inefficient way to get the number of values in x that are greater than 5.

If you mix these two approaches incorrectly you get the wrong answer:

length(x > 5)

just returns the length of x, since x > 5 is a logical vector with one entry for every element of x, and

sum(which(x > 5))

is summing the indices, which is probably not what you want.
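As a quick check of the expressions above, here is a minimal sketch using a small made-up vector (the values of x are hypothetical and chosen only for illustration):

```r
# Hypothetical example vector, used only for illustration.
x = c(2, 7, 9, 4, 6)

n_big = sum(x > 5)             # number of values greater than 5
idx_big = which(x > 5)         # positions of those values
n_slow = length(which(x > 5))  # same count, with an unnecessary extra step

wrong_len = length(x > 5)      # length of x, not a count
wrong_sum = sum(which(x > 5))  # sum of the positions, not a count

print(c(n_big, n_slow, wrong_len, wrong_sum))
```

Here three values exceed 5, so the two correct expressions agree, while the two mixed-up forms return the length of x and the sum of the matching positions.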

The following gives the correct result, but is very inefficient and not idiomatic at all for R:

count = 0
for (j in seq_along(x)) {
    if (x[j] > 5) {
        count = count + 1
    }
}

Apply

The first argument to apply should be 2-dimensional or higher. It is an error to use apply with a plain vector, which has no dim attribute.

The third argument to apply should be a “reducing” function, like sum or max, that takes a vector as input and returns a scalar. It does not usually make sense to call apply with a “pointwise” function like abs, which returns an output that is the same shape as its input. You can use abs(x) to get the absolute value of every element of x; there is no need for apply.
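The following sketch illustrates these points on a small made-up matrix (m and its values are hypothetical). It uses tryCatch only so that the failing call on a vector does not stop the script:

```r
# Hypothetical 2 x 3 matrix, filled column by column.
m = matrix(c(-1, 2, -3, 4, -5, 6), nrow=2)

row_max = apply(m, 1, max)  # reducing function: one value per row
col_sum = apply(m, 2, sum)  # reducing function: one value per column

abs_m = abs(m)              # pointwise function: same shape as m, no apply

# A plain vector has no dim attribute, so apply stops with an error.
vec_result = tryCatch(apply(c(1, 2, 3), 1, sum),
                      error=function(e) "error")
```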

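The third argument passed to apply should be a function, not the result of calling one. R evaluates the expression in the third position to a number before apply can use it, so something like this doesn’t work: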

apply(x, 1, median(x - median(x)))
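One way to repair a call like this is to pass an anonymous function that is applied to each row. The sketch below assumes the intent was the median absolute deviation about each row's median; that intent, the added abs, and the matrix m are assumptions made for illustration, not something stated above.

```r
# Hypothetical 2 x 3 matrix with rows (1, 2, 3) and (10, 20, 60).
m = matrix(c(1, 2, 3, 10, 20, 60), nrow=2, byrow=TRUE)

# Passing a value instead of a function fails; tryCatch records the failure.
broken = tryCatch(apply(m, 1, median(m - median(m))),
                  error=function(e) "error")

# Passing a function works: r is one row of m at a time.
# The abs is an assumption about what was intended.
row_dev = apply(m, 1, function(r) median(abs(r - median(r))))
```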

It can be useful to “wrap” a function, for example if you want to specify an optional argument. Below we wrap the mad function so that we have the raw MAD, not the MAD that is adjusted to match the Gaussian distribution:

apply(x, 2, function(x)mad(x, constant=1))

But you don’t need to wrap a function if you are just calling it in the default way. For example, this is needlessly complex:

apply(x, 2, function(x)median(x))
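A short sketch comparing the wrapped and unwrapped forms on a made-up matrix (m is hypothetical). It also shows that apply forwards additional named arguments to the function, which can stand in for a wrapper when all you need is to set an optional argument:

```r
# Hypothetical 3 x 2 matrix with columns (1, 2, 3) and (10, 20, 60).
m = matrix(c(1, 2, 3, 10, 20, 60), nrow=3)

wrapped = apply(m, 2, function(x) median(x))  # needlessly complex
direct = apply(m, 2, median)                  # same result

# Two ways to get the raw MAD of each column.
mad_wrapped = apply(m, 2, function(x) mad(x, constant=1))
mad_dots = apply(m, 2, mad, constant=1)       # constant is passed on to mad
```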

Loops

Statements in a loop that do not change during the loop iterations should be moved outside the loop. For example, this is inefficient:

# Computes the ratio between the number of positive elements in each
# row of a matrix and the number of positive elements in its first
# row (equal to the ratio of the proportions, since all rows have
# the same length).
v = numeric(nrow(x))
for (j in 1:nrow(x)) {
    v[j] = sum(x[j,] > 0) / sum(x[1,] > 0)
}

It is better to use the code below, so that we don’t repeatedly recalculate the result for the first row:

p1 = sum(x[1,] > 0)
v = numeric(nrow(x))
for (j in 1:nrow(x)) {
    v[j] = sum(x[j,] > 0) / p1
}
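The same ratios can also be computed without an explicit loop. This is a minimal sketch, assuming x is a numeric matrix whose first row contains at least one positive value; the example matrix is made up, and the names v_loop and v_vec are hypothetical.

```r
# Hypothetical 3 x 3 matrix with rows (1, -2, 3), (-4, -5, 6), (7, 8, 9).
x = matrix(c(1, -2, 3, -4, -5, 6, 7, 8, 9), nrow=3, byrow=TRUE)

p1 = sum(x[1,] > 0)  # computed once, outside any loop

# Loop version, iterating over the rows of x.
v_loop = numeric(nrow(x))
for (j in seq_len(nrow(x))) {
    v_loop[j] = sum(x[j,] > 0) / p1
}

# Vectorized version: rowSums counts the positive entries in every row.
v_vec = rowSums(x > 0) / p1
```

Both versions give the same vector; the vectorized form is shorter and avoids the loop entirely.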