Statistics 506, Fall 2016

Vectorization in R


Vectorization is a programming technique used in many mathematical and statistical computing languages, such as R and Matlab. It provides a way to avoid explicit loops, which is important since loops generally execute very slowly in these languages. Vectorization also has the potential to produce more readable and transparent code, although excessively complicated vectorization can be quite unreadable and may not perform well either.

Simulation study for basic confidence intervals

As a basic illustration of vectorization, suppose we are interested in the coverage probabilities of nominal 95% confidence intervals when the data are drawn from an exponential distribution.

We generate data sets of sample size n, and since this is a Monte Carlo study we need to do this many times. The number of Monte Carlo replications is denoted below as mcrep. For each data set, we calculate the nominal 95% confidence interval for the population mean and check whether it covers the true parameter value.

mcrep = 10000                # Simulation replications
n = 30                       # Sample size we are studying
xmat = rexp(n * mcrep)       # Simulate standard exponential data
dim(xmat) = c(mcrep, n)      # Each row is a dataset
mn = apply(xmat, 1, mean)    # Sample mean of each row
std = apply(xmat, 1, sd)     # Sample SD of each row
se = std / sqrt(n)           # Standard error of the mean
lcb = mn - 2*se              # Lower confidence bound (2 approximates the 0.975 normal quantile 1.96)
ucb = mn + 2*se              # Upper confidence bound
target = 1                   # True mean of the standard exponential distribution
cvrg_prob = mean((lcb < target) & (ucb > target)) # Monte Carlo estimate of the coverage probability
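
As an aside, the apply calls above still execute a loop over the rows internally. The sample means and standard errors can instead be computed in a fully vectorized way, as in the sketch below; note that this shortcut formula for the variance can be numerically less stable than the centered form.

mn = rowMeans(xmat)                          # Vectorized row means
va = (rowSums(xmat^2) - n * mn^2) / (n - 1)  # Row-wise sample variances
se = sqrt(va / n)                            # Standard errors of the means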

Broadcasting

In general, when you apply arithmetic operations to two arrays, say by taking A + B, the two arrays should have the same shape. When this is the case, the arithmetic proceeds “pointwise”, meaning that (A+B)[i,j] = A[i,j] + B[i,j]. However, there is an exception to this called broadcasting, which can be very useful in certain situations.
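
For example, adding two matrices of the same shape sums the corresponding entries (a minimal sketch):

A = matrix(1:4, nrow=2)   # 2 x 2 matrix
B = matrix(5:8, nrow=2)   # Same shape
A + B                     # (A+B)[i,j] equals A[i,j] + B[i,j]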

Broadcasting can be used to apply an arithmetic operation to two arrays in which the flattened length of one array is a multiple of the flattened length of the other. R's own documentation calls this behavior recycling: the shorter operand is repeated, following R's column-major storage order, until it matches the length of the longer one. One application of this is subtracting a given sequence of values from the corresponding rows of a two-dimensional array. For example, if we define a two-dimensional array

xmat = c(1, 2, 3, 4, 5, 6)
dim(xmat) = c(3, 2)

so that xmat is

1    4
2    5
3    6

Then we can subtract the one-dimensional array [1, 2, 3] using

xmat - c(1, 2, 3)

which yields

0    3
0    3
0    3

Note that the 1 is subtracted from the first row of xmat, the 2 is subtracted from the second row of xmat, and so on. This works because the matrix is stored in column-major order, so the recycled vector lines up with the rows within each column.
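
By contrast, subtracting a vector from the corresponding columns cannot be done by recycling the vector directly, since the recycled values run down the columns. One standard base-R idiom for column-wise operations (a sketch) is sweep:

sweep(xmat, 2, c(1, 4), "-")   # Subtract 1 from column 1 and 4 from column 2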

Returning to rows, this broadcasting technique can be used to center a two-dimensional array row-wise:

xmat_c = xmat - rowMeans(xmat)

This in turn allows us to efficiently compute the row-wise variances. Since rowMeans divides by the number of columns, we multiply by n / (n - 1), written below as dim(xmat)[2] / (dim(xmat)[2] - 1), to obtain the usual unbiased sample variance:

rowvar = rowMeans((xmat - rowMeans(xmat))^2) * dim(xmat)[2] / (dim(xmat)[2] - 1)

We can test that this result agrees with the results calculated directly using the var function:

rowvar_alt = apply(xmat, 1, var)
stopifnot(all.equal(rowvar, rowvar_alt))

We can also compare the timings of the two approaches:

m = 100000
n = 30
xmat = rnorm(m*n)
dim(xmat) = c(m, n)

tm1 = proc.time()
rowvar1 = apply(xmat, 1, var)
tm1 = proc.time() - tm1

tm2 = proc.time()
rowvar2 = rowMeans((xmat - rowMeans(xmat))^2) * dim(xmat)[2] / (dim(xmat)[2] - 1)
tm2 = proc.time() - tm2
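
A tidier way to collect each timing is the base function system.time (a sketch). On most systems the broadcasting version is much faster, since it replaces one call to var per row with a few whole-matrix operations.

system.time(apply(xmat, 1, var))
system.time(rowMeans((xmat - rowMeans(xmat))^2) * dim(xmat)[2] / (dim(xmat)[2] - 1))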

Screening for mean differences

When working with large data sets it is often of interest to screen through the data in search of potentially interesting associations. For example, suppose we have two independent samples from two possibly different populations, and a very large number of attributes are measured on each unit. We may wish to compare the mean of each attribute between the populations.

Below is a vectorized version of this analysis. Note that the variance calculation uses broadcasting to center the rows.

m = 20000 # Number of attributes
n = 50    # Number of observations in each group

# Simulate group 1 data
x1 = rnorm(m*n)
dim(x1) = c(m, n)

# Simulate group 2 data
x2 = rnorm(m*n)
dim(x2) = c(m, n)

# Return Z-scores comparing means of corresponding rows of x1 and x2.
multi_zscore = function(x1, x2) {

    n = dim(x1)[2]   # Number of observations per group

    x1_mn = rowMeans(x1)
    x1_va = rowMeans((x1 - x1_mn)^2) * n / (n - 1)

    x2_mn = rowMeans(x2)
    x2_va = rowMeans((x2 - x2_mn)^2) * n / (n - 1)

    mean_diff = x1_mn - x2_mn
    z_score = mean_diff / sqrt(x1_va/n + x2_va/n)

    return(z_score)
}
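
As a usage sketch, one might flag attributes with large absolute Z-scores. The cutoff of 3 below is an arbitrary illustration; since both groups were simulated from the same distribution, only a small fraction of attributes should be flagged by chance.

z = multi_zscore(x1, x2)
flagged = which(abs(z) > 3)   # Attributes with |Z| > 3
length(flagged)               # Should be a small fraction of m under the null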

Correlation coefficients

Here is another example of vectorization, this time using linear algebra. We wish to estimate a large number of correlation coefficients efficiently. These correlation coefficients are to be calculated between each row of a matrix xmat, and a single vector yvec. Below we calculate the correlation coefficients four ways and time the results. The fourth approach, which is vectorized and uses broadcasting, is the fastest.

First we simulate some data to use for testing.

n = 30       # Sample size (length of yvec and of each row of xmat)
m = 10000    # Number of correlation coefficients to estimate
r = 0.4      # Target correlation between yvec and each row of xmat
yvec = rnorm(n)
xmat = outer(array(1, m), yvec)            # Each row of xmat starts as a copy of yvec
xmat = r*xmat + sqrt(1 - r^2)*rnorm(n * m) # Add noise so each row has correlation ~r with yvec

Now we can compare four approaches to calculating many correlation coefficients. The linear-algebra approaches below use the identity that the sample correlation between x and y equals sum(x_s * y_s) / (n - 1), where x_s and y_s are x and y centered and then scaled by their sample standard deviations.

# First approach, calculate as a loop
tm1 = proc.time()
r1 = double(m)   # Preallocate the result vector
for (i in 1:m) {
    r1[i] = cor(xmat[i, ], yvec)
}
tm1 = proc.time() - tm1

# Second approach, functional style with apply
tm2 = proc.time()
r2 = apply(xmat, 1, function(v) cor(v, yvec))
tm2 = proc.time() - tm2

# Third approach, use linear algebra
tm3 = proc.time()
rmn = apply(xmat, 1, mean)
xmat_c = xmat - outer(rmn, array(1, n))   # Expand row means to a full matrix, then center
rsd = apply(xmat, 1, sd)
xmat_s = xmat_c / outer(rsd, array(1, n)) # Scale each row by its standard deviation
yvec_s = (yvec - mean(yvec)) / sd(yvec)
r3 = xmat_s %*% yvec_s / (n - 1)
r3 = as.vector(r3)
tm3 = proc.time() - tm3

# Fourth approach, use linear algebra with broadcasting
tm4 = proc.time()
rmn = rowMeans(xmat)
xmat_c = xmat - rmn                      # Broadcasting: center each row
rvar = rowMeans(xmat_c^2) * dim(xmat)[2] / (dim(xmat)[2] - 1)
rsd = sqrt(rvar)
xmat_s = xmat_c / rsd                    # Broadcasting: scale each row
yvec_s = (yvec - mean(yvec)) / sd(yvec)
r4 = xmat_s %*% yvec_s / (n - 1)
r4 = as.vector(r4)
tm4 = proc.time() - tm4

# Confirm that the four approaches give equivalent results
stopifnot(all.equal(r1, r2))
stopifnot(all.equal(r1, r3))
stopifnot(all.equal(r1, r4))
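
To compare the four timings at a glance, the elapsed times can be gathered into one vector (a sketch):

c(loop = tm1["elapsed"], apply = tm2["elapsed"],
  outer = tm3["elapsed"], broadcast = tm4["elapsed"])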

Bootstrapping

Bootstrapping is a generic technique that can be applied in numerous ways. One useful application of bootstrapping is to construct confidence intervals for statistics that do not allow for easy derivation of analytic confidence intervals. One such setting is computing confidence intervals for the quantiles of a distribution.

To illustrate this, we use the non-parametric bootstrap to construct a confidence interval for a given quantile. It is not easy to fully vectorize this operation, but we can use apply to avoid writing an explicit loop.

# Return a bootstrap 95% confidence interval for the
# q^th population quantile based on an iid sample.
quant_ci = function(y, q=0.5, n_boot=1000) {
    m = length(y)
    mat = sample(y, n_boot*m, replace=TRUE)
    dim(mat) = c(n_boot, m)
    f = function(x) quantile(x, q)
    v = apply(mat, 1, f)
    lcb = quantile(v, 0.025)
    ucb = quantile(v, 0.975)
    return(c(lcb, ucb))
}
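
As a usage sketch (the seed below is arbitrary, chosen only for reproducibility), the true median of the standard exponential distribution is log(2), about 0.693, and the interval should usually cover it:

set.seed(42)          # Arbitrary seed, for reproducibility
y = rexp(1000)
quant_ci(y, q=0.5)    # Should usually cover log(2) ~= 0.693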

For comparison, here is a version with a loop:

# Return a bootstrap 95% confidence interval for the
# q^th population quantile based on an iid sample.
quant_ci_alt = function(y, q=0.5, n_boot=1000) {
    m = length(y)
    qvec = double(n_boot)    # Preallocate the result vector
    for (i in 1:n_boot) {
        dvec = sample(y, m, replace=TRUE)
        qvec[i] = quantile(dvec, q)
    }
    lcb = quantile(qvec, 0.025)
    ucb = quantile(qvec, 0.975)
    return(c(lcb, ucb))
}

We can compare the running times of the two versions:

tm1 = proc.time()
quant_ci(rexp(1000))
tm1 = proc.time() - tm1

tm2 = proc.time()
quant_ci_alt(rexp(1000))
tm2 = proc.time() - tm2

You will see that the version with the explicit loop actually runs slightly faster than the version using apply. This example does not exploit any array-based indexing or linear algebra, which is where vectorization usually yields its performance gains. The apply function hides the loop, but a loop must still be executed behind the scenes.
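
As an aside, a third common idiom for this kind of resampling is the base function replicate, which also hides the loop and typically performs comparably (a sketch):

quant_ci_rep = function(y, q=0.5, n_boot=1000) {
    qvec = replicate(n_boot, quantile(sample(y, replace=TRUE), q))
    return(unname(quantile(qvec, c(0.025, 0.975))))
}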