This document briefly reviews some of the Python concepts that are most useful and relevant when programming with data. It is not a comprehensive summary of the Python language (many such resources are available). It should be especially useful to people who are comfortable using R or Matlab, and want to be able to do some of their work in Python.

Python is a general purpose programming language. The core language is useful for many data management tasks. But it is the libraries that make Python a powerful language for statistical analysis and other forms of advanced numerical computing. The central library for this type of work is NumPy, which provides efficient storage and manipulation of numerical and other homogeneously-typed arrays. Many other libraries are also useful for numerical computing with Python. This document considers only the core Python language and NumPy.

Python is currently undergoing a transition from version 2 to version 3. Most people using Python for data and numerical programming are still using version 2. This document was developed and tested using Python version 2.7, but should be applicable from version 2.3 to version 2.7.

Python uses indentation (tabs or spaces) to define blocks of code. The line preceding each indented block terminates with a colon (:). Most other modern programming languages use braces to define code blocks.

Python has four numeric types (NumPy adds several more), but usually you will be using just two of them: integers (type `int`) and floats (type `float`). Literal numbers in Python that look like integers are treated as such. Hence, `x = 3` creates a variable named `x` that holds the integer value 3. If you want `x` to be a float, define it as `x = 3.`, `x = 3.0`, or `x = float(3)`.

In Python 2, division of integers returns an integer result. The value of `y` in the expression `y = 3 / 5` is zero.

The exponentiation operator in Python is `**`, not `^`.
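As a small illustration of why the distinction matters: `^` is a valid Python operator, but it computes the bitwise exclusive-or rather than a power, so using it by mistake gives a wrong answer instead of an error. The variable names below are only for illustration, and the snippet is written to run under both Python 2.7 and Python 3.

```python
power = 2 ** 3   # exponentiation: 2 * 2 * 2
xor = 2 ^ 3      # bitwise XOR of binary 10 and binary 11

print(power)     # 8
print(xor)       # 1
```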

Python has a subset of C's in-place arithmetic operators:

```
x += y ## Add y to x in place
x -= y ## Subtract y from x in place
x *= y ## Multiply x by y in place
x /= y ## Divide x by y in place
```

There are no autoincrement/decrement operators (`++` or `--`) in Python.

See here for more information about numeric values and arithmetic in Python.

Strings (of text) can be enclosed in single quotes or double quotes; the meaning is the same:

```
name = "Fred"
name = 'Fred' ## The same
```

You can concatenate strings using `+`:

```
place = "Ann Arbor" + ", " + "Michigan"
```

Strings can be replicated using `*`:

```
## v will hold the string 'abcabcabc'
v = "abc" * 3
```

The `in` operator tests whether a substring is contained within a string:

```
## This is True
"grapes" in "apples, grapes, oranges"
```

There are many useful string methods in Python; see here for details.

Python has three main built-in data structures, all of which are heavily used in nearly all types of programming.

A list holds a sequence of arbitrary data types. Literal lists are defined using square brackets:

```
L = ["a", "egg", 35]
```

Lists are not "flattened", so `[1, 2, [3, 4]]` is a list of length 3, and is not the same as `[1, 2, 3, 4]`, which is a list of length 4.
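The following short check, with illustrative variable names, confirms the lengths stated above and shows that the third element of the nested list is itself a list. It is written to run under both Python 2.7 and Python 3.

```python
nested = [1, 2, [3, 4]]
flat = [1, 2, 3, 4]

n_nested = len(nested)   # the inner list counts as one element
n_flat = len(flat)
inner = nested[2]        # the third element is itself a list

print(n_nested)          # 3
print(n_flat)            # 4
print(inner)             # [3, 4]
print(nested == flat)    # False
```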

All indexing in Python is zero-based, so the first element of `L` is `L[0]` (below we will discuss indexing in more detail).

There are many useful built-in list methods, such as `append`, `index`, `extend`, `remove`, and `pop`. A method is a function that is called using the notation `object.method(argument)`. For example:

```
A = [5, 3, 1, 2] ## Create a list called "A" with 4 elements
A.append(35) ## A now has 5 elements
A.index(3) ## Returns 1
```

The `in` operator tests whether something is an element of a list:

```
5 in [2, 3, 6, 5, 4] ## Returns True
"dog" in ["cat", 27, "elephant"] ## Returns False
```

The `range` function is used to build lists of regularly spaced integer values (i.e. arithmetic sequences). The NumPy analogue of `range` is `np.arange`.
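A brief sketch of the forms that `range` accepts may help here: one argument gives the stopping point, two give the start and stop, and a third gives the step. The stopping value itself is never included. The `list(...)` wrapper is redundant in Python 2, where `range` already returns a list, but it lets the same code run under Python 3, where `range` is lazy.

```python
a = list(range(5))           # stop only: starts at 0
b = list(range(2, 7))        # start and stop
c = list(range(2, 20, 3))    # start, stop, and step
d = list(range(10, 0, -2))   # a negative step counts down

print(a)   # [0, 1, 2, 3, 4]
print(b)   # [2, 3, 4, 5, 6]
print(c)   # [2, 5, 8, 11, 14, 17]
print(d)   # [10, 8, 6, 4, 2]
```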

Two lists can be concatenated end to end using the `+` operator:

```
In [53]: [1,2,3] + [4,5]
Out[53]: [1, 2, 3, 4, 5]
```

A variant of a list is a "tuple", which is defined using parentheses instead of square brackets:

```
L = ("a", "egg", 35)
```

Tuples behave similarly to lists in most ways, but a key difference is that they are "immutable". This means that once you create a tuple, it cannot be changed.
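To make immutability concrete, the sketch below reads from a tuple in the usual way and then attempts to assign to one of its elements, which raises a `TypeError`. The flag `changed` is only an illustrative name used to record what happened.

```python
T = ("a", "egg", 35)

second = T[1]        # reading works exactly as for a list
print(second)        # egg

try:
    T[1] = "ham"     # item assignment is not allowed on a tuple
    changed = True
except TypeError:
    changed = False

print(changed)       # False
print(T)             # ('a', 'egg', 35)
```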

A dictionary in Python is a map from keys to values. The keys can be any immutable type (usually a string or number), and the values can be any Python type. A dictionary in Python is more or less equivalent to what is called a "map", "hash table", or "associative array" in other languages.

You can create a literal dictionary using braces, with a colon separating each key from its value and commas separating the key/value pairs.

```
H = {"dog": 36, "cat": [1,4,5], "elephant": "mammal"}
```
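Once a dictionary exists, values are read and written by key using square brackets. The following sketch shows lookup, insertion, overwriting, and membership testing; the extra key `"mouse"` and the result names are made up for illustration. Note that `in` tests the keys of a dictionary, not its values.

```python
H = {"dog": 36, "cat": [1, 4, 5], "elephant": "mammal"}

dog_value = H["dog"]     # look up the value stored under a key
H["mouse"] = 0.02        # assigning to a new key inserts it
H["dog"] = 37            # assigning to an existing key overwrites it

has_cat = "cat" in H     # membership is tested against the keys...
has_37 = 37 in H         # ...so a value is not found this way

print(dog_value)         # 36
print(H["dog"])          # 37
print(has_cat)           # True
print(has_37)            # False
print(len(H))            # 4
```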

Since tuples are immutable, they can be dictionary keys:

```
H = {(1,2): "A", (5,4): "C"}
```

You can retrieve the keys and values of a dictionary using the `keys` and `values` methods:

```
In [54]: H = {1: "A", 2: "B"}
In [55]: H.keys()
Out[55]: [1, 2]
In [56]: H.values()
Out[56]: ['A', 'B']
```

Dictionaries and lists can be nested in arbitrary ways:

```
## A list containing two dictionaries
L = [{1: 1, 2: 4, 3: 9}, {1: 1, 2: 8, 3: 27}]
## A dictionary whose values are lists
H = {"cat": [1,2,3], "dog": [7,4,5]}
## A dictionary whose values are dictionaries
D = {"cat": {"name": "Socks", "color": "black"}, "dog": {"name": "daisy", "color": "brown"}}
```

A set in Python is an unordered collection of distinct values:

```
S = {"cat", "dog", "elephant"} ## Create a literal set containing three strings
S.add("cat") ## Nothing changes
"camel" in S ## Returns False
S.add("camel") ## Now camel is in the set
"camel" in S ## Returns True
S.remove("dog") ## Dog is no longer in the set
```

Values in sets must be immutable, so tuples can be elements of sets, but lists cannot.
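A minimal sketch of this restriction: adding a tuple to a set succeeds, while adding a list raises a `TypeError` because lists are mutable and therefore not hashable. The flag `added_list` is an illustrative name.

```python
S = set()
S.add((1, 2))          # a tuple is immutable, so this works

try:
    S.add([1, 2])      # a list is mutable, so this raises TypeError
    added_list = True
except TypeError:
    added_list = False

print(added_list)      # False
print((1, 2) in S)     # True
print(len(S))          # 1
```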

Standard set theoretic operations can be performed on sets:

```
In [57]: {1, 2, 3, 5} & {2, 3}
Out[57]: set([2, 3])
In [58]: {1, 2, 3, 5} - {2, 3}
Out[58]: set([1, 5])
In [59]: {1, 2, 3, 5} - {2, 4}
Out[59]: set([1, 3, 5])
In [60]: {1, 2, 3, 5} | {2, 4}
Out[60]: set([1, 2, 3, 4, 5])
```

All indexing in Python and in the standard Python libraries is zero-based. The first element of the list `X` is `X[0]`, not `X[1]`, and the upper left corner of the matrix `X` is `X[0,0]`. This applies to both standard Python data structures and NumPy data structures.

Negative indices take values from the end of the list, e.g. `X[-1]` is the last element in the list, `X[-2]` is the second to last element in the list, and so on.
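As a quick illustration with an arbitrary example list, a negative index `-k` refers to the same element as the index `len(X) - k`. Negative indices can also be used as slice endpoints.

```python
X = [3, 1, 6, 4, 7, 9, 8]

last = X[-1]                        # the last element
second_last = X[-2]                 # the second to last element
same = X[-1] == X[len(X) - 1]       # -1 and len(X)-1 name the same position
tail = X[-3:]                       # the last three elements

print(last)          # 8
print(second_last)   # 9
print(same)          # True
print(tail)          # [7, 9, 8]
```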

Slice indexing starts with the first value in the slice, and ends with the position after the last value in the slice. So if `X = [3,1,6,4,7,9,8]`, then `X[2:5]` is `[6,4,7]`.

You can leave out the first or last index of a slice, and the slice will start from the beginning or extend to the end of the list, respectively:

```
X = [3,5,7,2,9]
X[3:] ## Returns [2, 9]
X[:2] ## Returns [3, 5]
```

Slices can have a "stride", for example:

```
X = range(20,130) ## All integers between 20 and 129
print X[10:20:2] ## Every second element of X between indices 10 and 20
print X[:20:2] ## Every second element of X from the beginning to index 20
print X[20::2] ## Every second element of X from index 20 to the end
print X[::2] ## Every second element of X
```

If you want to reverse a list, index it using `[::-1]`, for example:

```
L = ["a", "egg", 35]
print L[::-1]
```

Code that is not part of the Python core belongs to "packages" or "modules". You can import packages and modules using the `import` statement. There are several forms of the import statement:

```
## Import all of numpy, but everything in the package must
## be qualified with "numpy". For example, to call the
## "sum" function in numpy, use "numpy.sum(...)".
import numpy
## Import all of numpy, and locally rename the package as "np"
## Any numpy function can now be called by prepending "np" to the
## function call, e.g. "np.sum(...)" calls the numpy sum function
import numpy as np
## Import two functions from numpy. These functions can be
## called directly, e.g. as "sum(...)" or "prod(...)". Other
## values in numpy cannot be accessed.
from numpy import sum,prod
## Import the sum function from numpy and rename it as npsum.
## It can be called directly as "npsum(...)". The rest of
## numpy is not accessible.
from numpy import sum as npsum
## Import all of numpy into the current namespace. These
## values can now be accessed directly, e.g., to call the
## "prod" function in numpy, just use "prod(...)".
from numpy import *
```

Most of the time people do not like to use the `from numpy import *` form, since it can lead to name clashes.

A NumPy array holds values of only one type. It can have one or more dimensions, but must be rectangular. NumPy arrays can be used with many arithmetic operations that are not defined for Python lists.

Here we create an array with 5 rows and 3 columns, initialized to hold all zeros, with each entry stored as an 8 byte floating point value:

```
X = np.zeros((5,3), dtype=np.float64)
```

We can also create an array with literal values:

```
In [58]: X = np.array([[3,2],[9,5],[4,4],[7,8],[2,0]], dtype=np.float64)
In [59]: X
Out[59]:
array([[ 3., 2.],
       [ 9., 5.],
       [ 4., 4.],
       [ 7., 8.],
       [ 2., 0.]])
```

You can slice rows and columns:

```
In [60]: X[2,:]
Out[60]: array([ 4., 4.])
In [61]: X[:,1]
Out[61]: array([ 2., 5., 4., 8., 0.])
```

You can slice submatrices:

```
In [62]: X[2:4,0:2]
Out[62]:
array([[ 4., 4.],
       [ 7., 8.]])
```

You can slice based on indices, but you should be careful about how you do this:

```
In [65]: i1 = [2,3]
In [66]: i2 = [0,1]
In [67]: X[i1,:][:,i2]
Out[67]:
array([[ 4., 4.],
       [ 7., 8.]])
```

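Indexing with both lists at once gives you something different. The indices are paired up, so this selects the two elements `X[2,0]` and `X[3,1]`: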

```
In [68]: X[i1,i2]
Out[68]: array([ 4., 8.])
```

Indexing and slicing can get quite complicated. For more details, see here.

Elementwise operations on NumPy arrays are normally done on arrays with the same shape. For example, if `X` and `Y` are both 5 x 2, then `X * Y` is also a 5 x 2 array, and the i,j element of `X * Y` is equal to the i,j element of `X` times the i,j element of `Y` (i.e. it is the pointwise or elementwise product, not the linear algebraic product, which is given by `np.dot`). This is called "vectorization", and usually will dramatically improve the speed of your code, compared to explicitly coding the operations using iteration.
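The sketch below, using two made-up 5 x 2 arrays, compares the vectorized elementwise product with the same computation written as an explicit double loop, and contrasts both with the matrix product from `np.dot`. It assumes NumPy is installed and is intended only to show that the two elementwise forms agree; it does not measure speed.

```python
import numpy as np

# Two illustrative 5 x 2 arrays
X = np.array([[1., 2.], [3., 4.], [5., 6.], [7., 8.], [9., 10.]])
Y = np.array([[2., 0.], [1., 1.], [0., 3.], [2., 2.], [1., 0.]])

# Vectorized elementwise product
Z = X * Y

# The same computation written as an explicit double loop
Z_loop = np.zeros(X.shape)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        Z_loop[i, j] = X[i, j] * Y[i, j]

print(np.array_equal(Z, Z_loop))   # True

# The linear algebraic product is a different operation:
# (2 x 5) times (5 x 2) gives a 2 x 2 array
G = np.dot(X.T, Y)
print(G.shape)                     # (2, 2)
```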

In some circumstances it is also possible to apply elementwise operations to arrays with different shapes or different numbers of dimensions. This is called "broadcasting". The general rules for broadcasting are a bit complex and are explained in detail here. At this point, we will just talk about broadcasting involving 1-dimensional and 2-dimensional arrays.

First we discuss the broadcasting rules for an operation involving one 2-dimensional array and one 1-dimensional array. If you multiply an m x n array `X` with an n-vector `Y`, you get the pointwise product of each row of `X` with `Y`, e.g.

```
In [26]: X = np.array([[1,3],[2,4],[3,2]])
In [27]: Y = np.array([2,3])
In [28]: X * Y
Out[28]:
array([[ 2,  9],
       [ 4, 12],
       [ 6,  6]])
In [29]: X * np.outer(np.ones(3), Y)
Out[29]:
array([[  2.,   9.],
       [  4.,  12.],
       [  6.,   6.]])
```

We would obtain the same answer using `Y * X`, and we could also do this with any binary arithmetic operation. One useful application of this is to center or standardize the columns of a matrix:

```
## X should have a floating point dtype for these in-place operations
X -= X.mean(0) ## Center the columns of X
X /= X.std(0) ## Also standardize the columns
```

Next we discuss the broadcasting rules for two 2-dimensional arrays. If we multiply an `m x n` matrix `X` by an `m x 1` matrix `Y`, we get the pointwise product of each column of `X` with `Y`:

```
In [46]: X = np.array([[1,4],[3,2],[5,2]])
In [47]: Y = np.array([[2,],[3,],[1,]])
In [48]: X * Y
Out[48]:
array([[2, 8],
       [9, 6],
       [5, 2]])
```

Again, we would get the same result by taking `Y * X`, and this can be done with any of the binary arithmetic operations.

Sometimes it is useful to quickly convert a vector to a matrix. This can be done as follows:

```
X = np.array([3,2,5,6,7]) ## A 1-d array
C = X[:,None] ## A 2-d array with one column
R = X[None,:] ## A 2-d array with one row
```

This allows us to easily standardize the rows of a matrix:

```
In [49]: X = np.array([[3,4],[2,1],[9,8]], dtype=np.float64)
In [50]: X -= X.mean(1)[:,None]
In [51]: X /= X.std(1)[:,None]
```
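As a check on row standardization, the sketch below repeats the computation on a small floating point array and verifies that every row ends up with mean 0 and standard deviation 1. It assumes NumPy is installed. The array is created with a floating point dtype because the in-place operators `-=` and `/=` store their results back into `X`, which cannot hold fractional values if it is an integer array.

```python
import numpy as np

X = np.array([[3, 4], [2, 1], [9, 8]], dtype=np.float64)

row_means = X.mean(1)[:, None]   # shape (3, 1): one mean per row
print(row_means.shape)           # (3, 1)

X -= row_means                   # broadcast across the columns
X /= X.std(1)[:, None]           # scale each row to unit standard deviation

print(np.allclose(X.mean(1), 0))   # True
print(np.allclose(X.std(1), 1))    # True
```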

Python variables hold references, not values, hence the underlying data is not copied when you assign to a variable, or pass an argument to a function.

```
X = [1, 4, 5] ## Create a list
Y = X ## Assign the list to another variable
Y[1] = 99 ## Changes both Y and X
```

If you want a copy of a list, use the `list` function:

```
X = [1, 4, 5] ## Create a list
Y = list(X) ## Create a copy of X and assign it to Y
Y[1] = 99 ## Changes Y, not X
```

But note that the `list` function only copies one level deep:

```
A = [1,2] ## Create a list
B = [3,4] ## Another list
X = [A, B, 5] ## Create a list that includes A and B as elements
Y = list(X) ## Copy the top level of X
Y[0][0] = 0 ## Changes Y[0], X[0], and A
Y[0] = [0,0] ## Changes only Y[0]
Y[2] = 99 ## Changes only Y[2], since the value is a scalar
```
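If a fully independent copy of a nested structure is needed, one option (not covered in the text above) is the `deepcopy` function from the standard library's `copy` module, which copies the nested lists as well as the top level. A minimal sketch:

```python
import copy

A = [1, 2]
B = [3, 4]
X = [A, B, 5]

Y = copy.deepcopy(X)   # copies X and the lists nested inside it
Y[0][0] = 0            # changes only the copy

print(Y[0])            # [0, 2]
print(X[0])            # [1, 2]
print(A)               # [1, 2]
```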

In NumPy, slices are references:

```
X = np.array([[3,4],[5,6],[7,8]]) ## Create a 3x2 array
Y = X[1,:] ## A reference to row 1 of X
Y[0] = 99 ## X also changes
```

If you want a copy in NumPy, call the `copy` method explicitly:

```
X = np.array([[3,4],[5,6],[7,8]]) ## Create a 3x2 array
Y = X[1,:].copy() ## A copy of row 1 of X
Y[0] = 99 ## X is unaffected by this
```

The operator "==" applied to basic Python lists, dictionaries, and sets compares whether the values held by two variables are identical. The following code prints "True", even though A and B have separate storage in memory, and can be independently changed.

```
A = [2,4,1]
B = [2,4,1]
print A == B
```

If you want to test whether two variables refer to the same underlying object, use "is" instead of "==":

```
A = [2,3,4]
B = A
C = A
print B is C
print B == C
```
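To see the two operators disagree, the sketch below compares a list with an alias of itself and with a separate copy made using `list`. The alias is both equal and identical to the original, while the copy is equal but not identical, and can be changed independently. The result names are illustrative.

```python
A = [2, 3, 4]
B = A            # another name for the same list object
C = list(A)      # a separate list holding the same values

alias_is = B is A      # same object
copy_eq = C == A       # same values
copy_is = C is A       # different objects

print(alias_is)   # True
print(copy_eq)    # True
print(copy_is)    # False

C.append(5)       # changing the copy leaves A alone
print(A)          # [2, 3, 4]
```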

Equality testing for NumPy arrays behaves differently. If you compare two NumPy arrays, you get the elementwise comparison:

```
A = np.array([1,3,4,5,7])
B = np.array([1,2,4,5,6])
print A == B ## Prints [ True False True True False]
```

Function definitions in Python are straightforward:

```
def myfunc(x):
    """Returns the argument incremented by 1."""
    return x+1
```

Functions are "first-class", so they can be passed as arguments and returned as values:

```
def do_twice(f):
    """Returns a function that composes f with itself."""
    def g(x):
        return f(f(x))
    return g

myfunc_twice = do_twice(myfunc)
myfunc_twice(3) ## Returns 5
```

A "method" is a function that is attached to a specific object. We're not covering the object system here, so we won't get into how to define methods, but you will need to call methods that are defined in packages.

```
X = np.array([1,4,3,5,6]) ## Define an array
print X.sum() ## Call the array's sum method
print sum(X) ## Use Python's built-in sum function
```

You can define an anonymous function using `lambda`:

```
double_twice = do_twice(lambda x: 2*x)
print double_twice(3) ## Prints 12
```

Iteration in Python is quite general. Iteration over sequences works as expected:

```
## Iterate over a list
for x in [1,3,2,4,5]:
    print x
## Iterate over a tuple
for x in (1,3,2,4,5):
    print x
## You can omit the parentheses
for x in 1,3,2,4,5:
    print x
```

A traditional "for" loop with an integer index looks like this:

```
for k in range(20):
    do_something()
```

You can use `enumerate` if you want to track both the value and its position:

```
for j,x in enumerate(("cat","dog","mouse")):
    print j,x
```

You can iterate over several lists in parallel using `zip`:

```
A = ["cat", "crocodile", "salamander"]
B = ["mammal", "reptile", "amphibian"]
for a,b in zip(A,B):
    print "A " + a + " is a " + b
```

If you want to get the grammar right:

```
for a,b in zip(A,B):
    print "A " + a + " is a" +\
        [" ", "n "][b[0] in "aeiou"] + b
```

You can iterate over any "iterator", which is an object that defines a method called "next" that returns the next element in a sequence of values. You can get an iterator for lists and tuples by calling the `__iter__` method:

```
S = [3,1,2,5,4] ## A list
I = S.__iter__() ## An iterator to S
print I.next() ## Get the first value
print I.next() ## then the second value
print I.next() ## and so on...
```
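The same steps can also be written with the built-in functions `iter` and `next`, which are available from Python 2.6 onward and also work in Python 3, where the iterator method is named `__next__` rather than `next`. The sketch below additionally shows what happens when the values run out: the iterator raises `StopIteration`, which is how a `for` loop knows when to stop. The flag `exhausted` is an illustrative name.

```python
S = [3, 1, 2, 5, 4]   # A list
I = iter(S)           # An iterator over S

first = next(I)       # Get the first value
second = next(I)      # then the second value
rest = list(I)        # consume whatever is left

print(first)          # 3
print(second)         # 1
print(rest)           # [2, 5, 4]

try:
    next(I)           # nothing is left, so this raises StopIteration
    exhausted = False
except StopIteration:
    exhausted = True

print(exhausted)      # True
```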

A generator looks like a function, but uses `yield` instead of `return`. It produces an iterator that can be used in a loop or other setting where iteration is taking place.

Here is a simple generator:

```
def integer_sequence(a, b):
    """Yields the integers from a up to, but not including, b."""
    for i in range(a,b):
        yield i
```
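A short usage sketch for this generator follows, with the function repeated so the block is self-contained. Two points are worth noticing: the upper limit `b` is not included, and a generator object can only be consumed once, so a second pass over the same object yields nothing.

```python
def integer_sequence(a, b):
    """Yields the integers from a up to, but not including, b."""
    for i in range(a, b):
        yield i

gen = integer_sequence(3, 7)
values = list(gen)      # collect everything the generator yields
leftover = list(gen)    # the generator is now exhausted

total = sum(integer_sequence(3, 7))   # a fresh generator can be summed directly

print(values)     # [3, 4, 5, 6]
print(leftover)   # []
print(total)      # 18
```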

Here is a more complicated and useful generator:

```
def primes(maxnum):
    """
    Yields the primes smaller than or equal to maxnum.
    """
    Primes = [] ## Keep track of the primes we have seen so far
    n = 2 ## Start here
    while n <= maxnum:
        isprime = True ## Is it a prime?
        for p in Primes:
            if p**2 > n:
                break
            if n % p == 0:
                isprime = False
                break
        if isprime:
            yield n
            Primes.append(n)
        n += 1
```

You can use this in a loop:

```
for k in primes(100):
    print k
```

You can also use it in other settings where a sequence of values is expected:

```
sum(primes(100))
```

A "list comprehension" creates a list in one line using an embedded for loop. This gives us a list containing the squares of the integers between 0 and 19:

```
## A list consisting of the squares of the non-negative integers smaller than 20.
X = [k**2 for k in range(20)]
```

You can do more complicated things, like this:

```
## A list consisting of the squares of the odd integers smaller than 20.
X = [k**2 for k in range(20) if k % 2 == 1]
```

A "dictionary comprehension" is a similar concept, but for dictionaries:

```
## A map from even integers smaller than 100 to their squares
X = {k: k**2 for k in range(100) if k % 2 == 0}
```

Finally, there is also a "set comprehension":

```
## A set consisting of the non-negative integers smaller than 100
## that are evenly divisible by 7.
X = {k for k in range(100) if k % 7 == 0}
```
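It can help to see that a comprehension is shorthand for an ordinary loop. The sketch below builds the list of squares of odd integers twice, once with a comprehension and once with an explicit `for` loop and `append`, and confirms that the results are equal. The name `Y` is introduced only for the comparison.

```python
# The comprehension form
X = [k**2 for k in range(20) if k % 2 == 1]

# The equivalent explicit loop
Y = []
for k in range(20):
    if k % 2 == 1:
        Y.append(k**2)

print(X == Y)    # True
print(X[:3])     # [1, 9, 25]
print(len(X))    # 10
```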

Python uses exceptions for error handling. In a lot of script writing, you don't need to understand much about exceptions. If an exception occurs, the program will stop and an error message will be printed. You would then try to fix the error and run the script again.

Occasionally you might want to handle exceptions yourself. For example, suppose you need to convert a large number of strings to numbers, and the data are not very clean. Then you could write a function like the following that attempts to convert each string to a number, and uses `nan` to mark values that could not be converted.

```
def getfloat(x):
    try:
        return float(x)
    except ValueError:
        return float("nan")
    except:
        ## Any other error, e.g. x is neither a string nor a number
        print "Cannot parse " + repr(x)
        return float("nan")
```

This would be an alternative to calling the built-in function `float` directly, since such code would stop executing if a non-parseable value is encountered.
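As a usage sketch, the block below applies a simplified variant of such a function to a small made-up list of strings and counts how many could not be parsed. This variant handles only `ValueError`, and it is written to run under both Python 2.7 and Python 3. Because `nan` is not equal to itself, the unparsed values are detected with `math.isnan` rather than with `==`.

```python
import math

def getfloat(x):
    """Convert x to a float, returning nan if the string cannot be parsed."""
    try:
        return float(x)
    except ValueError:
        return float("nan")

# Illustrative raw data: two of these strings are not valid numbers
raw = ["3.5", "7", "n/a", "", "1e3"]
values = [getfloat(s) for s in raw]

# Count the entries that could not be converted
n_bad = sum(1 for v in values if math.isnan(v))

print(values[0])   # 3.5
print(values[4])   # 1000.0
print(n_bad)       # 2
```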

When debugging, the code `1/0` can be used as a breakpoint, since this statement raises an exception and halts execution.

The type of an object is returned by the `type` function.

Function definitions should normally include a "docstring", e.g.

```
def f(x):
    """A function that returns 1 for all inputs."""
    return 1
```

The docstring can be accessed as `f.__doc__`, or in ipython by typing `help(f)`.

Docstrings for a method can be accessed using an instance of the class on which the method is defined. For example, if you want to access the docstring for the append method on lists, you can use `[].append.__doc__`, where `[]` is an empty list (any other list would work too).

You can get dictionaries mapping the names of all the currently defined local or global variables to their values by calling the functions `locals` or `globals`.

You can get a list of all the methods defined for a given object with the `dir` function. For example, to get a list of all the methods defined on a list, use `dir([])`. You can use `dir()` alone to see all the currently defined values.
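The introspection tools described above can be combined in a few lines. The sketch below reads a docstring, checks that `append` appears among the names reported by `dir` for a list, and uses `type` to inspect two values. The result names are illustrative.

```python
def f(x):
    """A function that returns 1 for all inputs."""
    return 1

doc = f.__doc__                      # the docstring, as an ordinary string
has_append = "append" in dir([])     # dir returns a list of names
is_list = type([1, 2]) is list       # type returns the object's type
is_float = type(3.0) is float

print(doc)          # A function that returns 1 for all inputs.
print(has_append)   # True
print(is_list)      # True
print(is_float)     # True
```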