Statistics 506, Fall 2016

Distributed computing


Distributed computing is a term that describes the simultaneous use of many computing nodes to process very large datasets, or to perform very large-scale computations.

Computer clusters and grids

Typically, distributed computing involves computer clusters that are built out of standard off-the-shelf compute nodes (usually Linux servers). These nodes each have their own processors, primary memory (RAM), and stable storage (disk). The compute nodes communicate with each other using standard computer networking protocols such as Ethernet.

A typical compute cluster has a cluster manager and multiple (tens to thousands of) worker nodes. The role of the cluster manager is to manage the loads on the worker nodes. A distributed computation must request access to worker nodes from the cluster manager, and should only utilize worker nodes to which it has been granted access.

A grid computing system, like a cluster, consists of many computers working together. The term “grid” usually refers to a collection of computers with possibly different hardware and operating system configurations that are possibly not co-located, whereas a cluster usually refers to a homogeneous collection of nodes housed in a single location.

Programming models for distributed computing

Spark is a framework for cluster computing originally developed at UC Berkeley and currently managed by the Apache Software Foundation. Spark is one of the two most well-known cluster computing frameworks, the other being Hadoop. Spark is focused on calculations that are performed in-memory, whereas Hadoop makes much heavier use of disk storage.

The key idea of distributed computing with Spark or Hadoop is that a computation is split into pieces that can be run independently on multiple computing nodes. The driver program initiates the computation. It receives permission from the cluster manager to launch executor processes on worker nodes. Each executor process completes one part of the computation. The result of each executor process is then communicated back to the driver process, which integrates the results and makes them available to the user.

Cluster computing frameworks are built to accommodate node failures (e.g. hardware or network failures). If a node fails, the computation can continue, with other nodes taking the place of the failed node. This process is transparent to the application in most cases.

RDDs

The central idea in Spark is the Resilient Distributed Dataset, or “RDD”. This is a dataset that is stored in memory across multiple nodes, in a way that allows it to be recovered if any node goes down. RDDs are immutable (read-only), meaning that they cannot be changed after being created (this is important for performance reasons).

An RDD is an ordered collection of records that is split into partitions. The partitions may be stored on different machines. Since an RDD is read-only, there is no need to synchronize changes to the RDD within or across machines.

An RDD is created in three main ways (a short sketch of each follows the list):

  • By processing data in stable storage (i.e. a file system). This usually involves reading a file such that the rows of the file become the records of the RDD.

  • By parallelizing a “collection” in the language of your program (Python, R, Scala, …). For example, a Python list of lists [[1, 3], [2, 1], [5, 5]] could be parallelized into an RDD with three records, each of which is a list containing two values.

  • By operating on other RDDs using transformations. For example, a transformation could be to select the values in another RDD that satisfy a certain condition.
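
To make this concrete, here is a minimal sketch of each of the three approaches. It assumes that a SparkContext named sc has already been created (creating one is shown in the examples below) and that example.txt is a placeholder file name:

# From stable storage: each line of the file becomes a record of the RDD
rdd_file = sc.textFile("example.txt")

# By parallelizing a collection held by the driver program
rdd_coll = sc.parallelize([[1, 3], [2, 1], [5, 5]])

# By transforming another RDD
rdd_big = rdd_coll.filter(lambda x: x[0] > 1)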

Most RDDs will result from either a load from file storage, or the parallelization of an existing collection, followed by the application of multiple transformations. An RDD is formally defined through the steps that are used to create it. Thus an RDD can always be recreated through knowledge of these steps. RDDs are constructed lazily, meaning that the steps that define the RDD are not actually executed until the RDD is needed to perform a computation, and even then only the part of the RDD that is needed is created.

To make use of an RDD, the program applies an action to it. Most actions return a result to the driver application. An action could be something simple such as summing the numbers in an RDD, or counting the elements of an RDD that meet a certain condition. By combining multiple transformations and actions, complex tasks such as fitting a logistic regression model to data in an RDD can be performed.
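
As a small illustration (again assuming a SparkContext named sc), an action is what finally forces an RDD to be computed and returns a result to the driver:

nums = sc.parallelize([1, 2, 3, 4, 5])
total = nums.sum()                             # action: returns 15 to the driver
n_big = nums.filter(lambda x: x > 2).count()   # action: returns 3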

The Spark framework

The Spark framework consists of several components that are written largely in Scala and Java. The term “Spark” itself refers to the specification for the format and behavior of RDDs, and to a set of tools that have been implemented for manipulating RDDs. Programs using the Spark framework can be written in a variety of languages. There is an R interface to Spark, but the examples below use the Python interface (PySpark); however, they do not use any advanced Python language features. The PySpark API is modeled on the Scala Spark API, which is the native and most mature Spark API.

Example: counting lines in a file

Below we develop a PySpark program to count the lines in a GHCN file that are TMAX or TMIN records. We will develop this program line-by-line, then show the whole program.

A PySpark program must import SparkContext from the pyspark module; this is analogous to a library or require call in R.

from pyspark import SparkContext

Every Spark program must begin by creating a “context”. This is a single object that contains information about the program. The “local” argument means that we are running Spark on a single machine, and the second argument is a name for the application that appears in the logging output.

sc = SparkContext("local", "Count TMIN and TMAX lines")

To access a datafile, use the sc.textFile function. This creates an RDD called data. Since the RDD is created lazily, no data are actually processed at this point; that only happens when an action is taken.

data = sc.textFile("2014.csv")

We then apply the filter transformation to the RDD so that we only have lines containing either TMAX or TMIN. This creates a new RDD, and since we will not be using the original RDD we can just reuse its name. Note that Python uses lambda expressions to define anonymous functions.

data = data.filter(lambda x : 'TMAX' in x or 'TMIN' in x)

An RDD is created when it is required for an action. After it is used in an action, it may be discarded and then recreated if it is used in a subsequent action. If you are sure you are going to need an RDD for multiple actions, you can suggest that the system cache it so that it persists as long as your program is running (this is just a hint and may be ignored by the system).

data.cache()

Next we create two more RDDs, each containing only one type of line.

data_tmax = data.filter(lambda x: 'TMAX' in x)
data_tmin = data.filter(lambda x: 'TMIN' in x)

Now we can apply the count action to our RDDs. Since the RDDs are constructed lazily, the text file is not actually read until this point in the program. The Spark framework manages creation of these RDDs by successively applying the transformations defined above to the rows of the text file. An RDD may be spread across multiple nodes if it is too big to fit in the memory of one node. As noted above, the information needed to recreate an RDD in the event of a node failure (the steps that define it) is always retained, but the data itself may not be stored redundantly.

n_tmax = data_tmax.count()
n_tmin = data_tmin.count()

Finally we print the results:

print("%d lines contain 'TMAX'" % n_tmax)
print("%d lines contain 'TMIN'" % n_tmin)

Here is the complete program:

from pyspark import SparkContext

sc = SparkContext("local", "Count TMIN and TMAX lines")
data = sc.textFile("2014.csv")
data = data.filter(lambda x : 'TMAX' in x or 'TMIN' in x)
data.cache()

data_tmax = data.filter(lambda x: 'TMAX' in x)
data_tmin = data.filter(lambda x: 'TMIN' in x)

n_tmax = data_tmax.count()
n_tmin = data_tmin.count()

print("%d lines contain 'TMAX'" % n_tmax)
print("%d lines contain 'TMIN'" % n_tmin)

Example: average by group (SQL approach)

Spark has a SQL module that allows distributed calculations to be carried out using a SQL-like syntax. Below we demonstrate how to use this to average the TMAX values within each station.

First we walk through the script line-by-line. We first need to import the usual Spark modules, as well as some SQL-specific modules:

from pyspark import SparkContext
from pyspark.sql import functions as func
from pyspark.sql import SQLContext

Next we create the Spark context object, and also a SQL context object:

sc = SparkContext("local", "Average TMAX by station")
sqlc = SQLContext(sc)

To use SQL in Spark, we will need to create an object called a DataFrame. To create a SQL DataFrame in Spark, we first create an RDD containing the rows of the DataFrame. To do this, we will need a function that takes a line from the text datafile and turns it into a list containing the station identifier and the TMAX value.

def split(x):
    # Keep the station identifier and the value field, converting the value
    # from tenths of a degree C to degrees C.
    x = x.rstrip().split(",")
    v = [x[0], x[3]]
    try:
        v[1] = float(v[1]) / 10
    except:
        v[1] = None
    return v

The following lines create an RDD whose elements are the raw text rows of the data file containing TMAX records:

lines = sc.textFile("2014.csv")
lines = lines.filter(lambda x : "TMAX" in x)

Next we transform this into another RDD which contains only the station identifier and TMAX values. We convert the TMAX value to a number, and remove the rows with missing values:

data = lines.map(split)
data = data.filter(lambda x : x[1] is not None)

Now we are ready to create the DataFrame. The first argument below is the RDD containing the data from which the DataFrame is composed, and the second argument gives the column names.

df = sqlc.createDataFrame(data, ['station', 'tmax'])

Using the DataFrame, we can now use SQL-like syntax to conduct the calculations.

averages_rdd = df.groupby('station').agg(func.avg('tmax').alias('avg_tmax'))

The averages_rdd object is a distributed DataFrame (backed by an RDD), meaning that its rows may be spread across multiple compute nodes. The collect method transfers the data from the worker nodes and creates a single Python object from them on the driver node. Note that collect should only be applied to a result that is small enough to assemble on a single node.

averages = averages_rdd.collect()
print(averages)

Here is the complete program:

from pyspark import SparkContext
from pyspark.sql import functions as func
from pyspark.sql import SQLContext

sc = SparkContext("local", "Average tmax by station")
sqlc = SQLContext(sc)

def split(x):
    x = x.rstrip().split(",")
    v = [x[0], x[3]]
    try:
        v[1] = float(v[1]) / 10
    except:
        v[1] = None
    return v

lines = sc.textFile("2014.csv")
lines = lines.filter(lambda x : "TMAX" in x)
data = lines.map(split)
data = data.filter(lambda x : x[1] is not None)
df = sqlc.createDataFrame(data, ['station', 'tmax'])

averages_rdd = df.groupby('station').agg(func.avg('tmax').alias('avg_tmax'))
averages = averages_rdd.collect()
print(averages)

Example: average by group (without SQL)

We can also average by group using reduceByKey. This approach does not require SQL. Reduce is a basic concept from computer science in which a function is used to transform a list into a single value. The function must take two values as arguments and return one value. Suppose the function is f and our list has three elements [a, b, c]. Then reducing the list using f yields f(f(a, b), c). If f is associative and commutative (like sum or max), the result does not depend on the ordering of the list.
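
The same idea can be demonstrated in plain Python (outside of Spark) using the reduce function from the functools module:

from functools import reduce

values = [2, 7, 1, 8]
total = reduce(lambda a, b: a + b, values)   # ((2 + 7) + 1) + 8 = 18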

To compute the average of a collection of numbers, we need their sum, and the number of values in the collection. In Spark, we do this by creating an RDD whose values are pairs [x, 1]. When we reduce these pairs using sum, we get the pair containing the sum in the first position, and the number of values in the second position. We then divide the two elements in the pair to obtain the average.

from pyspark import SparkContext

sc = SparkContext("local", "Average by group")

data1 = sc.textFile("2014.csv")
data1 = data1.filter(lambda x : 'TMAX' in x)

def split(x):
    x = x.split(",")
    v = [x[0], None]
    try:
        v[1] = [float(x[3]) / 10, 1]
    except:
        v[1] = None
    return v

data2 = data1.map(split)
data2 = data2.filter(lambda x : x[1] is not None)

data3 = data2.reduceByKey(lambda x,y: [x[0]+y[0], x[1]+y[1]])
data4 = data3.map(lambda x : [x[0], x[1][0] / x[1][1]]).collect()

print(data4)

Here is an alternative version of the above program that uses a widely used Python library called NumPy. Using NumPy allows us to do arithmetic on arrays, so we can add the pairs [x, 1] in a single step.

from pyspark import SparkContext
import numpy as np

sc = SparkContext("local", "Sum by group")

data1 = sc.textFile("2014.csv")
data1 = data1.filter(lambda x : 'TMAX' in x)

def split(x):
    x = x.split(",")
    v = [x[0], None]
    try:
        v[1] = np.r_[float(x[3]) / 10, 1]
    except:
        v[1] = None
    return v

data2 = data1.map(split)
data2 = data2.filter(lambda x : x[1] is not None)

data3 = data2.reduceByKey(lambda x,y: x+y)
data4 = data3.map(lambda x : [x[0], x[1][0] / x[1][1]]).collect()

print(data4)

Calculating the variance with mapPartitions

Calculating the mean, or calculating the mean for each group (as in the examples above), only requires one pass through the data. Calculating the variance is less straightforward. There is a built-in Spark function for calculating the variance, shown below. But first we will implement the variance calculation using mapPartitions. This technique is very useful in situations where no efficient built-in Spark function exists.

There are one-pass approaches to calculating the variance, for example, by calculating E[X^2] and (EX)^2 in a single pass, then using the identity var(X) = E[X^2] - (EX)^2. But this is not numerically stable. The usual way to calculate the variance for a single in-memory dataset is to first calculate the mean m, then calculate the mean of the squared deviations (x[i] - m)^2. But this requires two passes through the data, which is expensive when using Spark.
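
The instability is easy to see on data whose mean is large relative to its spread. In the sketch below (made-up numbers), the one-pass formula is dominated by rounding error, while the two-pass calculation used by np.var gives the correct answer of about 0.67:

import numpy as np

x = 1e9 + np.array([1.0, 2.0, 3.0])

# One-pass formula: the difference of two nearly equal large numbers is
# dominated by rounding error (it can even come out as exactly zero here)
v1 = np.mean(x**2) - np.mean(x)**2

# Two-pass calculation: subtract the mean, then average the squared deviations
v2 = np.var(x)

print(v1, v2)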

An alternative approach is to exploit the fact that we can operate efficiently within a partition of an RDD. Thus, we can easily calculate the mean m[j] and variance v[j] within each partition, then use the “law of total variance” to obtain the overall variance from these quantities. The law of total variance states that the (marginal) variance is equal to the mean of the conditional variances plus the variance of the conditional means. Here “conditional” can be taken to mean “within a partition”. Since the partitions may contain unequal numbers of data values, these will need to be weighted means and variances.
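
Before turning to the Spark version, here is a small self-contained NumPy check of this identity on simulated data (the split points below are arbitrary):

import numpy as np

np.random.seed(0)
x = np.random.normal(size=1000)

# Split the data into "partitions" of unequal sizes
parts = np.split(x, [100, 350, 600])

n = np.array([len(p) for p in parts], dtype=float)
wgt = n / n.sum()
means = np.array([p.mean() for p in parts])
variances = np.array([p.var() for p in parts])

# Mean of the conditional variances plus variance of the conditional means
evar = np.average(variances, weights=wgt)
vare = np.average((means - np.average(means, weights=wgt))**2, weights=wgt)

print(x.var(), evar + vare)   # the two values agree

The Spark program below applies the same decomposition, using mapPartitions to compute the within-partition summaries: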

from pyspark import SparkContext
import numpy as np

sc = SparkContext("local", "Compute variance")

data1 = sc.textFile("2014.csv").filter(lambda x : "TMAX" in x).repartition(100).cache()

# Calculate the mean, variance, and sample size over the iterator.
def sumstats(iterator):
    data = []
    for row in iterator:
        row = row.rstrip().split(",")
        try:
            data.append(float(row[3]) / 10)
        except:
            continue

    data = np.asarray(data)
    # Return a single record (a one-element list) for this partition.
    return [[data.mean(), data.var(), len(data)]]

data2 = data1.mapPartitions(sumstats).collect()
data2 = np.asarray(data2)

# Remove partitions with undefined values
ii = np.isfinite(data2).all(1)
data2 = data2[ii, :]

# Remove empty partitions
ii = (data2[:, 2] > 0)
data2 = data2[ii, :]

# Weights derived from partition sizes
wgt = data2[:, -1]
wgt /= wgt.sum()

# Overall mean, and variance of the conditional (within-partition) means
mn = np.average(data2[:, 0], weights=wgt)
vare = np.average((data2[:, 0] - mn)**2, weights=wgt)

# Mean of the conditional variances
evar = np.average(data2[:, 1], weights=wgt)

var = vare + evar

print(var, vare, evar)

Next we can calculate the variance directly using Spark’s built-in variance method for RDDs:

from pyspark import SparkContext
import numpy as np

sc = SparkContext("local", "Compute variance")

data1 = sc.textFile("2014.csv").filter(lambda x : "TMAX" in x).repartition(100).cache()


def split(x):
    x = x.split(",")
    try:
        return float(x[3]) / 10
    except:
        return None


data1 = data1.map(split)
data1 = data1.filter(lambda x : x is not None)

print(data1.variance())

MapReduce

Mapping is a concept from functional programming that usually refers to applying a transforming function to each element of a collection. If we have a function f and a collection [x1, x2, ...], then mapping f over the collection yields [f(x1), f(x2), ...]. The transforming function f takes an element from some domain, and returns an element from some (possibly different) domain.

Reducing is also a concept from functional programming that refers to the repeated application of a reducing function to pairs of values in a collection. A reducing function takes a pair of values and returns a single value; e.g. the sum function on real numbers is a reducing function. If we have the collection [x1, x2, ...], then by reducing this collection with f we get f(f(f(x1, x2), x3), x4) (for a collection of length 4).
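
In plain Python, mapping and reducing look like this (a small sketch; reduce here is the same functools.reduce shown earlier):

from functools import reduce

values = [1, 2, 3, 4]
doubled = list(map(lambda x: 2 * x, values))   # [2, 4, 6, 8]
total = reduce(lambda a, b: a + b, values)     # ((1 + 2) + 3) + 4 = 10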

The MapReduce programming model was inspired by these ideas from functional programming. It has three steps:

  • Map each (k1, v1) pair to a new pair (k2, v2)

  • Shuffle the pairs so that those with a common k2 value are on the same node

  • Reduce the values with the same k2 value using a reducing function

For example, suppose we have data records of the form (name, (state, age)), consisting of information about people, specifically their name, the state in which they live, and their age (suppose that names are unique here). If we want to get the mean age within each state, we would do the following (a small PySpark sketch appears after the list):

  • Map (name, (state, age)) pairs to (state, age) pairs

  • Shuffle the values by state, so that all the records for people living in a single state are stored on the same node

  • Apply the reducing function f(a, b) = a + b to obtain the total age per state, then divide by the number of values for each state
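
Here is a small PySpark sketch of this calculation (assuming a SparkContext named sc; the names and data are made up). In Spark, reduceByKey carries out the shuffle and reduce steps:

people = sc.parallelize([("ann", ("MI", 32)), ("bob", ("MI", 40)),
                         ("cal", ("OH", 25)), ("dee", ("OH", 31))])

# Map: (name, (state, age)) -> (state, [age, 1])
pairs = people.map(lambda kv: (kv[1][0], [kv[1][1], 1.0]))

# Shuffle and reduce: total age and count within each state
totals = pairs.reduceByKey(lambda a, b: [a[0] + b[0], a[1] + b[1]])

# Divide to get the mean age per state
means = totals.map(lambda kv: (kv[0], kv[1][0] / kv[1][1])).collect()
print(means)   # e.g. [('MI', 36.0), ('OH', 28.0)]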

There are various ways that this model can be implemented, but the most common approach works as follows:

  • The initial (k1, v1) pairs are distributed in an arbitrary way across several computers. The mapping step can then take place concurrently, with each computer responsible for mapping the elements that it maintains.

  • The shuffle step can be expensive, as it requires a lot of communication; it can start before the mapping is completed

  • The reducing steps can run concurrently, and can start before the shuffling is completed if the reducing function is commutative and associative

A partitioning algorithm decides how the keys are distributed over the nodes in the shuffle phase of MapReduce (a partitioning algorithm may also be used to spread the initial key/value pairs over the nodes prior to the initial mapping stage). The default partitioning algorithm is simple and hash-based.

A hashing algorithm is a deterministic mapping from arbitrary inputs to integers. For example, if I use the default Python hashing algorithm to hash a long string of characters, I might get something like this:

>>> hash("zfhfljsdsdadfasdfasjahlxf")
7056554121927191821

A hashing algorithm is supposed to distribute most sets of inputs uniformly across its range, i.e. if you apply the hashing algorithm to some set of inputs k1, k2, etc., it should appear that you are getting random results (although the results are actually deterministic). To use a hashing algorithm for partitioning in the shuffle step of MapReduce, simply hash each key, and take the remainder modulo the number of nodes. For example, if there are 1000 nodes and a key hashes to 14344348029400123, then that record is sent to node 123.
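
In Python, this partitioning rule is a one-liner (a sketch; hash values vary across Python versions and hash seeds, so the node number you see may differ):

n_nodes = 1000
key = "zfhfljsdsdadfasdfasjahlxf"
node = hash(key) % n_nodes
# With the hash value shown above (7056554121927191821), this key
# would be sent to node 821.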

Distributed computing and statistical model fitting

Statistical models are fit to data using algorithms that make one or more passes through the data. Some statistical models can be fit using algorithms that are amenable to being run in a distributed manner. For example, least squares regression requires only the cross products between the covariates (X’*X), and the cross products between the covariates and the response (X’*Y). If the data are distributed across multiple nodes, these two quantities can be computed piece-wise, then sent back to the driver node for final processing.
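
Here is a small NumPy sketch of this idea on simulated data (not Spark code): each chunk plays the role of the data held on one node, contributing its local cross products, which are summed and solved on the driver.

import numpy as np

np.random.seed(1)

# Simulated data, split into chunks as if stored on different nodes
chunks = [(np.random.normal(size=(50, 3)), np.random.normal(size=50))
          for _ in range(4)]

# Each node computes the cross products for its own chunk of data
xtx = sum(np.dot(X.T, X) for X, y in chunks)
xty = sum(np.dot(X.T, y) for X, y in chunks)

# The driver combines the pieces and solves the normal equations X'X b = X'y
bhat = np.linalg.solve(xtx, xty)
print(bhat)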

Pig

The standard implementations of Hadoop and Spark are largely written in Java and Scala, so to use them “natively” you would write your analysis code in one of those languages. There are frameworks for many other languages (R, Python, etc.) that allow you to use Hadoop and Spark from within those languages.

A language called “Pig” was developed specifically for data processing on Hadoop-like systems. At the present time, Pig code can be executed on Hadoop, Spark, and several other related systems. In some ways, Pig is analogous to SQL, since both are purpose-designed languages for querying and processing data, but the two languages differ in many ways.

Resources