Distributed computing is a term that describes the simultaneous use of many computing nodes to process very large datasets, or to perform very large-scale computations.
Computer clusters and grids
Typically, distributed computing involves computer clusters that are built out of standard off-the-shelf compute nodes (usually Linux servers). These nodes each have their own processors, primary memory (RAM), and stable storage (disk). The compute nodes communicate with each other using standard computer networking protocols such as Ethernet.
A typical compute cluster has a cluster manager and multiple (tens to thousands of) worker nodes. The role of the cluster manager is to manage the loads on the worker nodes. A distributed computation must request access to worker nodes from the cluster manager, and should only utilize worker nodes to which it has been granted access.
A grid computing system, like a cluster, consists of many computers working together. The term “grid” usually refers to a collection of computers with possibly different hardware and operating system configurations that are possibly not co-located, whereas a cluster usually refers to a homogeneous collection of nodes housed in a single location.
Programming models for distributed computing
Spark is a framework for cluster computing originally developed at UC Berkeley and currently managed by the Apache Software Foundation. Spark is one of the two best-known cluster computing frameworks, the other being Hadoop. Spark is focused on calculations that are performed in-memory, whereas Hadoop makes much heavier use of disk storage.
The key idea of distributed computing with Spark or Hadoop is that a computation is split into pieces that can be run independently on multiple computing nodes. The driver program initiates the computation. It receives permission from the cluster manager to launch executor processes on worker nodes. Each executor process completes one part of the computation. The result of each executor process is then communicated back to the driver process, which integrates the results and makes them available to the user.
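The split-then-combine pattern described above can be imitated on a single machine. The following is only a sketch, not Spark: threads stand in for executor processes, and the names split_into_pieces, executor_task, and driver are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_pieces(values, n_pieces):
    """Split a list into n_pieces contiguous chunks of nearly equal size."""
    size, extra = divmod(len(values), n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        end = start + size + (1 if i < extra else 0)
        pieces.append(values[start:end])
        start = end
    return pieces

def executor_task(piece):
    """The work done by one stand-in 'executor': a partial sum of squares."""
    return sum(v * v for v in piece)

def driver(values, n_workers=4):
    """Hand one piece to each worker, then integrate the partial results."""
    pieces = split_into_pieces(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(executor_task, pieces))
    return sum(partial_results)

total = driver(list(range(1000)))
print(total)
```

Each piece is processed without reference to the others, which is what allows the pieces to be placed on different nodes in a real cluster.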
Cluster computing frameworks are built to accommodate node failures (e.g. hardware or network failures). If a node fails, the computation can continue, using other nodes to take the place of the failed node. This process will be transparent to the application in most cases.
RDDs
The central idea in Spark is the Resilient Distributed Dataset, or “RDD”. This is a dataset that is stored in memory across multiple nodes. The information needed to rebuild it is stored redundantly, so that if any node goes down, the RDD can be recovered. RDDs are immutable (or read-only), meaning that they cannot be changed after being created (this is important for performance reasons).
An RDD is an ordered collection of records which are split into partitions. The partitions may be stored on different machines. Since an RDD is read-only, there is no need to synchronize changes to the RDD within or across machines.
An RDD is created in three main ways:
By processing data in stable storage (i.e. a file system). This usually involves reading a file such that the rows of the file become the records of the RDD.
By parallelizing a “collection” in the language of your program (Python, R, Scala, …). For example, the Python list of lists [[1, 3], [2, 1], [5, 5]] could be parallelized into an RDD with three records, each of which is a list containing two values.
By operating on other RDDs using transformations. For example, a transformation could be to select the values in another RDD that satisfy a certain condition.
Most RDDs will result from either a load from file storage, or the parallelization of an existing collection, followed by application of multiple transformations. An RDD is formally defined through the steps that are used to create it. Thus an RDD can always be recreated through knowledge of these steps. RDDs are constructed lazily, meaning that the steps that define the RDD are not actually executed until the RDD is needed to perform a computation, and even then only the part of the RDD that is needed is created.
To make use of an RDD, the program applies an action to it. Most actions return a result to the driver application. An action could be something simple such as summing the numbers in an RDD, or counting the elements of an RDD that meet a certain condition. By combining multiple actions, complex tasks such as fitting a logistic regression model to data in an RDD can be performed.
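The distinction between lazy transformations and actions can be illustrated with a toy class in plain Python. This is a sketch under simplifying assumptions (one machine, no partitions, no caching), and the LazyDataset class is hypothetical rather than part of Spark. It only records the steps that define the dataset, and runs them when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores the recipe, not the data."""

    def __init__(self, source, steps=()):
        self._source = source        # zero-argument function returning an iterable
        self._steps = tuple(steps)   # recorded transformations

    # Transformations: return a new dataset, do no work yet
    def map(self, f):
        return LazyDataset(self._source, self._steps + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def _records(self):
        # Replay the recorded steps on freshly loaded data
        records = iter(self._source())
        for kind, f in self._steps:
            records = map(f, records) if kind == "map" else filter(f, records)
        return records

    # Actions: trigger the computation and return a result
    def count(self):
        return sum(1 for _ in self._records())

    def sum(self):
        return sum(self._records())

reads = []

def load():
    reads.append(1)   # record each time the source is actually read
    return range(10)

evens = LazyDataset(load).filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)
reads_before_action = len(reads)   # still 0: nothing has been loaded

total = squares.sum()   # first action
n = evens.count()       # second action reloads the source
print(reads_before_action, total, n, len(reads))
```

Because the dataset is defined by its steps, it can be rebuilt at any time by replaying them. In this toy version that happens on every action.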
The Spark framework
The Spark framework consists of several components that are written largely in Scala and Java. The term “Spark” itself refers to the specification for the format and behavior of RDDs, and to a set of tools that have been implemented for manipulating RDDs. Programs using the Spark framework can be written in a variety of languages. There is an R interface to Spark, but the examples below use the Python interface to Spark (PySpark). They do not use any advanced Python language features. The PySpark API is modeled on the Scala Spark API, which is the native and most mature Spark API.
Example: counting lines in a file
Below we develop a PySpark program to count the lines in a GHCN file that are TMAX or TMIN records. We will develop this program line-by-line, then show the whole program.
A PySpark program must import the SparkContext class; this is analogous to a library or require call in R.
from pyspark import SparkContext
Every Spark program must begin by creating a “context”. This is a single object that contains information about the program. The “local” argument means that we are running Spark on a single machine, and the second argument is a name for the application that appears in the logging output.
sc = SparkContext("local", "Count TMIN and TMAX lines")
To access a datafile, use the sc.textFile function. This creates an RDD called data (since the RDD is created lazily, no data will actually be processed at this point, as no action has been taken).
data = sc.textFile("2014.csv")
We then apply the filter transformation to the RDD so that we only have lines containing either TMAX or TMIN. This creates a new RDD, and since we will not be using the original RDD we can just reuse its name. Note that Python uses lambda expressions to define anonymous functions.
data = data.filter(lambda x : 'TMAX' in x or 'TMIN' in x)
An RDD is created when it is required for an action. After it is used in an action, it may be discarded and then recreated if it is used in a subsequent action. If you are sure you are going to need an RDD for multiple actions, you can suggest that the system cache it so that it persists as long as your program is running (this is just a hint and may be ignored by the system).
data.cache()
Next we create two more RDDs, each containing only one type of line.
data_tmax = data.filter(lambda x: 'TMAX' in x)
data_tmin = data.filter(lambda x: 'TMIN' in x)
Now we can apply the count action to our RDDs. Since the RDDs are constructed lazily, the text file is not actually read until this point in the program. The Spark framework will manage creation of each RDD by successively applying the transformations defined above to the rows of the text file. An RDD may be spread across multiple nodes if it is too big to fit in the memory of one node. As noted above, the abstraction that can be used to recover the RDD in the event of a node failure is always stored on multiple nodes, but the data itself may not be.
n_tmax = data_tmax.count()
n_tmin = data_tmin.count()
Finally we print the results:
print("%d lines contain 'TMAX'" % n_tmax)
print("%d lines contain 'TMIN'" % n_tmin)
Here is the complete program:
from pyspark import SparkContext
sc = SparkContext("local", "Count TMIN and TMAX lines")
data = sc.textFile("2014.csv")
data = data.filter(lambda x : 'TMAX' in x or 'TMIN' in x)
data.cache()
data_tmax = data.filter(lambda x: 'TMAX' in x)
data_tmin = data.filter(lambda x: 'TMIN' in x)
n_tmax = data_tmax.count()
n_tmin = data_tmin.count()
print("%d lines contain 'TMAX'" % n_tmax)
print("%d lines contain 'TMIN'" % n_tmin)
Example: average by group (SQL approach)
Spark has a SQL module that allows distributed calculations to be carried out using a SQL-like syntax. Below we demonstrate how to use this to average the TMAX values within each station.
We walk through the script line-by-line. We first need to import the usual Spark modules, as well as some SQL-specific modules:
from pyspark import SparkContext
from pyspark.sql import functions as func
from pyspark.sql import SQLContext
Next we create the Spark context object, and also a SQL context object:
sc = SparkContext("local", "Average TMAX by station")
sqlc = SQLContext(sc)
To use SQL in Spark, we will need to create an object called a DataFrame. To create a SQL DataFrame in Spark, we first create an RDD containing the rows of the DataFrame. To do this, we will need a function that takes a line from the text datafile and turns it into an array containing the station identifier and the TMAX value.
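def split(x):
    # Split a CSV row and keep the station identifier and the value field
    x = x.split(",")
    v = [x[0], x[3]]
    try:
        # Convert the value to a number
        v[1] = float(v[1]) / 10
    except ValueError:
        # Mark values that cannot be parsed as missing
        v[1] = None
    return v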
The following lines create an RDD whose elements are the raw text rows of the data file, keeping only the TMAX rows:
lines = sc.textFile("2014.csv")
lines = lines.filter(lambda x : "TMAX" in x)
Next we transform this into another RDD which contains only the station identifier and TMAX values. We convert the TMAX value to a number, and remove the rows with missing values:
data = lines.map(split)
data = data.filter(lambda x : x[1] is not None)
Now we are ready to create the DataFrame. The first argument below is the RDD containing the data from which the DataFrame is composed, and the second argument is the column names.
df = sqlc.createDataFrame(data, ['station', 'tmax'])
Using the DataFrame, we can now use SQL-like syntax to conduct the calculations.
averages_rdd = df.groupby('station').agg(func.avg('tmax').alias('avg_tmax'))
The averages_rdd object is a Spark DataFrame (despite its name, it is not an RDD). Like an RDD, it may be distributed across multiple compute nodes. The collect method transfers the data from the worker nodes and creates a single Python object (a list of rows) from them on the driver node. Note that collect should only be used on data that is small enough to assemble on a single node.
averages = averages_rdd.collect()
print(averages)
Here is the complete program:
from pyspark import SparkContext
from pyspark.sql import functions as func
from pyspark.sql import SQLContext
sc = SparkContext("local", "Average tmax by station")
sqlc = SQLContext(sc)
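def split(x):
    # Split a CSV row and keep the station identifier and the value field
    x = x.rstrip().split(",")
    v = [x[0], x[3]]
    try:
        v[1] = float(v[1]) / 10
    except ValueError:
        # Mark values that cannot be parsed as missing
        v[1] = None
    return v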
lines = sc.textFile("2014.csv")
lines = lines.filter(lambda x : "TMAX" in x)
data = lines.map(split)
data = data.filter(lambda x : x[1] is not None)
df = sqlc.createDataFrame(data, ['station', 'tmax'])
averages_rdd = df.groupby('station').agg(func.avg('tmax').alias('avg_tmax'))
averages = averages_rdd.collect()
print(averages)
Example: average by group (without SQL)
We can also average by group using reduceByKey. This approach does not require SQL. Reduce is a basic concept from computer science that uses a function to transform a list into a scalar. The function must take two values as arguments and return one value. Suppose the function is f and our list has three elements [a, b, c]. Then reducing the list using f yields f(f(a, b), c). If f is associative and commutative (like sum or max) the result does not depend on the ordering of the list.
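The reduce operation is available in the Python standard library as functools.reduce. The short sketch below records the calls made to f while reducing a three-element list, then compares an order-dependent function (subtraction) with an order-independent one (max). The calls list is only there for illustration.

```python
from functools import reduce

calls = []

def f(a, b):
    calls.append((a, b))   # remember each pair that f is applied to
    return a + b

result = reduce(f, [1, 2, 3])   # computed as f(f(1, 2), 3)
print(calls, result)

# Subtraction is neither associative nor commutative, so order matters
diff_forward = reduce(lambda a, b: a - b, [1, 2, 3])
diff_reverse = reduce(lambda a, b: a - b, [3, 2, 1])

# max is associative and commutative, so order does not matter
max_forward = reduce(max, [1, 2, 3])
max_reverse = reduce(max, [3, 2, 1])
print(diff_forward, diff_reverse, max_forward, max_reverse)
```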
To compute the average of a collection of numbers, we need their sum, and the number of values in the collection. In Spark, we do this by creating an RDD whose values are pairs [x, 1]. When we reduce these pairs using elementwise addition, we get the pair containing the sum in the first position, and the number of values in the second position. We then divide the two elements in the pair to obtain the average.
from pyspark import SparkContext
sc = SparkContext("local", "Average by group")
data1 = sc.textFile("2014.csv")
data1 = data1.filter(lambda x : 'TMAX' in x)
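def split(x):
    # Return [station, [value, 1]], or [station, None] if the value is unusable
    x = x.split(",")
    v = [x[0], None]
    try:
        v[1] = [float(x[3]) / 10, 1]
    except (ValueError, IndexError):
        v[1] = None
    return v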
data2 = data1.map(split)
data2 = data2.filter(lambda x : x[1] is not None)
data3 = data2.reduceByKey(lambda x,y: [x[0]+y[0], x[1]+y[1]])
data4 = data3.map(lambda x : [x[0], x[1][0] / x[1][1]]).collect()
print(data4)
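To see what the reduceByKey step is doing, it can help to mimic it on a small in-memory list. The sketch below is plain Python rather than Spark. The helper reduce_by_key and the station records are made up for this illustration.

```python
def reduce_by_key(pairs, f):
    """Combine the values that share a key using f (single-machine stand-in)."""
    combined = {}
    for key, value in pairs:
        combined[key] = f(combined[key], value) if key in combined else value
    return combined

# Hypothetical (station, tmax) records
records = [("A", 10.0), ("B", 4.0), ("A", 20.0), ("B", 6.0), ("A", 30.0)]

# Attach a count of 1 to each value, as in the [x, 1] pairs above
pairs = [(station, [tmax, 1]) for station, tmax in records]

# Elementwise addition gives [sum, count] for each station
totals = reduce_by_key(pairs, lambda x, y: [x[0] + y[0], x[1] + y[1]])

# Divide the sum by the count to get the average
averages = {station: s / n for station, (s, n) in totals.items()}
print(averages)
```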
Here is an alternative version of the above program that uses a widely used Python library called NumPy. Using NumPy allows us to do arithmetic on arrays, so we can add the pairs [x, 1] in a single step.
from pyspark import SparkContext
import numpy as np
sc = SparkContext("local", "Sum by group")
data1 = sc.textFile("2014.csv")
data1 = data1.filter(lambda x : 'TMAX' in x)
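def split(x):
    # Return [station, array([value, 1])], or [station, None] if the value is unusable
    x = x.split(",")
    v = [x[0], None]
    try:
        v[1] = np.r_[float(x[3]) / 10, 1]
    except (ValueError, IndexError):
        v[1] = None
    return v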
data2 = data1.map(split)
data2 = data2.filter(lambda x : x[1] is not None)
data3 = data2.reduceByKey(lambda x,y: x+y)
data4 = data3.map(lambda x : [x[0], x[1][0] / x[1][1]]).collect()
print(data4)
Calculating the variance with mapPartitions
Calculating the mean, or calculating the mean for each group (as in the examples above), only requires one pass through the data. Calculating the variance is less straightforward. There is a built-in Spark function for calculating the variance, shown below. But first we will implement the variance calculation using mapPartitions. This technique is very useful in situations where no efficient built-in Spark function exists.
There are one-pass approaches to calculating the variance, for example, by calculating E[X^2] and (EX)^2 in a single pass, then using the identity var(X) = E[X^2] - (EX)^2. But this is not numerically stable. The usual way to calculate the variance for a single in-memory dataset is to first calculate the mean m, then calculate the mean of the squared deviations (x[i] - m)^2. But this requires two passes through the data, which is expensive when using Spark.
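The instability of the one-pass identity shows up when the mean is large relative to the spread of the data. The sketch below uses a small made-up dataset (four values near one billion) whose variance is exactly 22.5. It computes the variance both ways in ordinary double precision arithmetic.

```python
# Four made-up values with a large mean and a small spread
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]
n = len(data)

# One-pass identity: E[X^2] - (EX)^2
# The two terms are each about 1e18, so their difference loses the small digits
one_pass = sum(x * x for x in data) / n - (sum(data) / n) ** 2

# Two-pass approach: compute the mean first, then the mean squared deviation
mean = sum(data) / n
two_pass = sum((x - mean) ** 2 for x in data) / n

print(one_pass, two_pass)
```

The two-pass result is 22.5, while the one-pass result is far from it. The squares are near 10^18, where adjacent double precision numbers are more than 100 apart, so a difference of 22.5 cannot be represented.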
An alternative approach is to exploit the fact that we can operate efficiently within a partition of an RDD. Thus, we can easily calculate the mean m[j] and variance v[j] within each partition, then use the “law of total variance” to obtain the overall variance from these quantities. The law of total variance states that the (marginal) variance is equal to the mean of the conditional variances plus the variance of the conditional means. Here “conditional” can be taken to mean “within a partition”. Since the partitions may contain unequal numbers of data values, these will need to be weighted means and variances.
from pyspark import SparkContext
import numpy as np
sc = SparkContext("local", "Compute variance")
data1 = sc.textFile("2014.csv").filter(lambda x : "TMAX" in x).repartition(100).cache()
# Calculate the mean, variance, and sample size over the iterator.
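def sumstats(iterator):
    data = []
    for row in iterator:
        row = row.rstrip().split(",")
        try:
            data.append(float(row[3]) / 10)
        except (ValueError, IndexError):
            # Skip rows with a missing or non-numeric value
            continue
    data = np.asarray(data)
    # One summary row per partition: mean, variance, sample size
    return [[data.mean(), data.var(), len(data)]]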
data2 = data1.mapPartitions(sumstats).collect()
data2 = np.asarray(data2)
# Remove partitions with undefined values
ii = np.isfinite(data2).all(1)
data2 = data2[ii, :]
# Remove empty partitions
ii = (data2[:, 2] > 0)
data2 = data2[ii, :]
# Weights derived from partition sizes
wgt = data2[:, -1]
wgt /= wgt.sum()
# Variance of the conditional means
mn = np.average(data2[:, 0], weights=wgt)
vare = np.average((data2[:, 0] - mn)**2, weights=wgt)
# Mean of the conditional variances
evar = np.average(data2[:, 1], weights=wgt)
var = vare + evar
print(np.sqrt(var), vare, evar)
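The weighted combination used in this program can be checked on a small in-memory example, without Spark. The sketch below uses only the Python standard library. The three partitions are made-up lists of unequal length, and the result is compared with the variance computed directly from the pooled data.

```python
from statistics import mean, pvariance

# Made-up partitions of unequal size
partitions = [[1.0, 2.0, 3.0, 4.0], [10.0, 12.0], [5.0, 5.0, 8.0]]

# Weights proportional to partition sizes
sizes = [len(p) for p in partitions]
weights = [s / sum(sizes) for s in sizes]

# Within-partition means and (population) variances
means = [mean(p) for p in partitions]
variances = [pvariance(p) for p in partitions]

# Variance of the conditional means
overall_mean = sum(w * m for w, m in zip(weights, means))
var_of_means = sum(w * (m - overall_mean) ** 2 for w, m in zip(weights, means))

# Mean of the conditional variances
mean_of_vars = sum(w * v for w, v in zip(weights, variances))

combined = var_of_means + mean_of_vars
direct = pvariance([x for p in partitions for x in p])
print(combined, direct)
```

The two values agree up to rounding error. Note that the within-partition variances must be population variances (dividing by the partition size), which is also what the NumPy var method computes by default.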
Next we can calculate the variance directly using Spark’s built-in variance method for RDDs:
from pyspark import SparkContext
import numpy as np
sc = SparkContext("local", "Compute variance")
data1 = sc.textFile("2014.csv").filter(lambda x : "TMAX" in x).repartition(100).cache()
def split(x):
    # Return the numeric value, or None if it is missing or non-numeric
    x = x.split(",")
    try:
        return float(x[3]) / 10
    except (ValueError, IndexError):
        return None
data1 = data1.map(split)
data1 = data1.filter(lambda x : x is not None)
print(data1.variance())
MapReduce
Mapping is a concept from functional programming that usually refers to applying a transforming function to each element of a collection. If we have a function f and a collection [x1, x2, ...], then mapping f over the collection yields [f(x1), f(x2), ...]. The transforming function f takes an element from some domain, and returns an element from some (possibly different) domain.
Reducing is also a concept from functional programming that refers to repeated application of a reducing function to pairs of values in a collection. A reducing function takes a pair of values and returns a single value, e.g. the sum function on real numbers is a reducing function. If we have the collection [x1, x2, ...], then by reducing this collection with f we get f(f(f(x1, x2), x3), x4) (for a collection of length 4).
The MapReduce programming model was inspired by these ideas from functional programming. It has three steps:
Map each (k1, v1) pair to a new pair (k2, v2)
Shuffle the pairs so that those with a common k2 value are on the same node
Reduce the values with the same k2 value using a reducing function
For example, suppose we have data records of the form (name, (state, age)), consisting of information about people, specifically their name, the state in which they live, and their age (suppose that names are unique here). If we want to get the mean age within each state, we would do the following:
Map (name, (state, age)) pairs to (state, age) pairs
Shuffle the values by state, so that all the records for people living in a single state are stored on the same node
Apply the reducing function f(a, b) = a + b to obtain the total age per state, then divide by the number of values for each state
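The three steps above can be sketched in plain Python on a single machine. The records are made up for illustration. In a real MapReduce system each step would run across many nodes, whereas here a dictionary plays the role of the shuffle.

```python
from collections import defaultdict
from functools import reduce

# Made-up (name, (state, age)) records
people = [
    ("Ann", ("MI", 30)),
    ("Bob", ("OH", 40)),
    ("Cy", ("MI", 50)),
    ("Di", ("OH", 20)),
    ("Ed", ("MI", 40)),
]

# Map: (name, (state, age)) -> (state, age)
mapped = [(state, age) for name, (state, age) in people]

# Shuffle: bring together all the ages that share a state
groups = defaultdict(list)
for state, age in mapped:
    groups[state].append(age)

# Reduce: total age per state with f(a, b) = a + b, then divide by the count
mean_age = {
    state: reduce(lambda a, b: a + b, ages) / len(ages)
    for state, ages in groups.items()
}
print(mean_age)
```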
There are various ways that this model can be implemented, but the most common approach is shown in this diagram:
The initial (k1, v1) pairs are distributed in an arbitrary way across several computers. The mapping step can then take place concurrently, with each computer responsible for mapping the elements that it maintains.
The shuffle step can be expensive, as it requires a lot of communication; it can start before the mapping is completed.
The reducing steps can run concurrently, and can start before the shuffling is completed if the reducing function is commutative and associative.
A partitioning algorithm decides how the keys are distributed over the nodes in the shuffle phase of MapReduce (a partitioning algorithm may also be used to spread the initial key/value pairs over the nodes prior to the initial mapping stage). The default partitioning algorithm is simple and hash based.
A hashing algorithm is a deterministic mapping from arbitrary inputs to integers. For example, if I use the default Python hashing algorithm to hash a long string of characters, I might get something like this:
>>> hash("zfhfljsdsdadfasdfasjahlxf")
7056554121927191821
A hashing algorithm is supposed to distribute most sets of inputs uniformly across its range, i.e. if you apply the hashing algorithm to some set of inputs k1, k2, etc., it should appear that you are getting random results (although the results are actually deterministic). To use a hashing algorithm for partitioning in the shuffle step of MapReduce, simply hash each key, and take the remainder modulo the number of nodes. For example, if there are 1000 nodes and a key hashes to 14344348029400123, then that record is sent to node 123.
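Hash partitioning can be sketched in a few lines of Python. One caveat: Python's built-in hash function gives different values for the same string in different processes unless the PYTHONHASHSEED environment variable is fixed. The sketch below therefore uses zlib.crc32 as a stand-in hash that is the same on every run. The function partition_for and the station keys are made up for this illustration, and this is not the hash that Spark or Hadoop actually uses.

```python
import zlib
from collections import Counter

def partition_for(key, n_nodes):
    """Hash the key deterministically, then reduce modulo the number of nodes."""
    return zlib.crc32(key.encode("utf-8")) % n_nodes

# The arithmetic from the example in the text
node = 14344348029400123 % 1000
print(node)

# Assign 1000 made-up keys to 4 nodes and count how many land on each
n_nodes = 4
keys = ["station%04d" % i for i in range(1000)]
counts = Counter(partition_for(k, n_nodes) for k in keys)
print(sorted(counts.items()))
```

Since the hash is deterministic, every record with a given key is sent to the same node, which is what the reduce step requires.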
Distributed computing and statistical model fitting
Statistical models are fit to data using algorithms that make one or more passes through the data. Some statistical models can be fit using algorithms that are amenable to being run in a distributed manner. For example, least squares regression requires only the cross products between covariates (X’*X), and the cross products between the covariates and the data (X’*Y). If the data are distributed across multiple nodes, these two quantities can be computed piece-wise, then sent back to the driver node for final processing.
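As a sketch of this idea, consider a regression with an intercept and a single covariate, so that each row of X is [1, x]. The entries of X’*X and X’*Y are then simple sums, which can be computed separately on each piece of the data and added together. The code below is plain Python with made-up data. The function names cross_products and fit are chosen for this illustration, and a real implementation would use a linear algebra library.

```python
def cross_products(piece):
    """Entries of X'X and X'Y for one piece of the data, with rows of X equal to [1, x]."""
    n = sx = sxx = sy = sxy = 0.0
    for x, y in piece:
        n += 1
        sx += x
        sxx += x * x
        sy += y
        sxy += x * y
    return (n, sx, sxx, sy, sxy)

def fit(pieces):
    """Add the per-piece cross products, then solve the 2x2 normal equations."""
    n, sx, sxx, sy, sxy = (sum(t) for t in zip(*map(cross_products, pieces)))
    # X'X = [[n, sx], [sx, sxx]] and X'Y = [sy, sxy]
    det = n * sxx - sx * sx
    intercept = (sxx * sy - sx * sxy) / det
    slope = (n * sxy - sx * sy) / det
    return intercept, slope

# Made-up data lying exactly on the line y = 2 + 3x, split into three pieces
pieces = [
    [(0.0, 2.0), (1.0, 5.0)],
    [(2.0, 8.0), (3.0, 11.0), (4.0, 14.0)],
    [(5.0, 17.0)],
]
intercept, slope = fit(pieces)
print(intercept, slope)
```

Only five numbers per piece need to be sent to the driver, regardless of how many rows each piece contains.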
Pig
The standard implementations of Hadoop and Spark are largely written in Java and Scala, so to use them “natively” you would write your analysis code in one of those languages as well. There are frameworks for many other languages (R, Python, etc.) that allow you to use Hadoop and Spark from within those languages.
A new language called “Pig” was developed specifically for data processing on Hadoop-like systems. At the present time, Pig code can be executed on Hadoop, Spark, and several other related systems. In some ways, Pig is analogous to SQL, in that it is purpose-designed for querying data. But the languages are different in many ways.
Resources