Two-group comparison of gene expression data

Top page

Contents:

Goals

We will conduct a statistical comparison of gene expression values between two groups of biological samples. Gene expression is a measure of the activity of a gene, as reflected in the number of RNA copies of the gene that are present in cells. A microarray assay can be used to measure gene expression for thousands of genes simultaneously. Genes that have different expression patterns in two conditions are said to be "differentially expressed".

The data set we analyze here contains measurements of the expression levels of 22,283 genes in peripheral blood mononuclear cells (PBMCs). Data for 127 subjects are included, 26 of whom have ulcerative colitis, and 59 of whom have Crohn's disease (the remaining subjects are healthy and will not be considered here). The goal is to identify genes that have different mean expression levels in the two disease groups.

The raw data are available here as accession number GDS1615 from the NCBI's GEO (Gene Expression Omnibus) site.

Reading in the data

We will write a code module for reading the data, called read_gene_expression.py, and a separate code module discussed below for the analysis.

The gene expression data are stored as a gzipped SOFT format file. SOFT format is a text format used at NCBI for storing gene expression files. The SOFT file format is documented here, but it's easier to just take a look at the file and see what we need to extract.

We'll start by importing some libraries, and writing a brief description of what the script does

import gzip
import numpy as np

"""
Read in a SOFT format data file.  The following values can be exported:

GID : A list of gene identifiers of length d
SID : A list of sample identifiers of length n
STP : A list of sample desriptions of length d
X   : A dxn array of gene expression values
"""

## Path to the data file
fname = "../Data/GDS1615_full.soft.gz"

For the analysis, we need to know which of the three groups (Crohn's disease, ulcerative colitis, and the normal samples) each sample belongs to. We will represent this as a dictionary called SIF that maps each sample identifier to its group label. The sample identifiers have the form GSM#### (for some number ####), so, for example, SIF['GSM76044'] is equal to 'ulcerative colitis'.

The relevant information is in the top section of the data file, where the lines start with "!". The last such line reads "!dataset_begin_table", so we'll read up to that line, then stop. The information we need is contained in consecutive pairs of lines, one starting with "!subject_description", and the next starting with "!subset_sample_id". Since we will encounter the "!subject_description" line first, we store that in a variable (called subject_description), and when we encounter the following "!subset_sample_id" line, we parse it and create key/value pairs in the SIF dictionary.

## Open the data file directly as a gzip file
with gzip.open(fname) as fid:

    ## The header part of the file contains information about the
    ## samples.  Read that information first.
    SIF = {}
    for line in fid:

        if line.startswith("!dataset_table_begin"):
            break
        elif line.startswith("!subset_description"):
            subset_description = line.split("=")[1].strip()
        elif line.startswith("!subset_sample_id"):
            subset_ids = line.split("=")[1].split(",")
            subset_ids = [x.strip() for x in subset_ids]
            for k in subset_ids:
                SIF[k] = subset_description

Next, we extract the column headers for the main data table and store them in the list SID. Most of these column headers are sample identifiers. Since these are the only columns we want to keep, we determine the indices of these columns, then remove the other column labels from SID. Later on, we will also select only these columns from the rows containing gene expression data.

    ## Next line is the column headers (sample id's)
    SID = fid.next().split("\t")

    ## The column indices that contain gene expression data
    I = [i for i,x in enumerate(SID) if x.startswith("GSM")]

    ## Restrict the column headers to those that we keep
    SID = [SID[i] for i in I]

Now it is easy to get an array containing the sample group labels, which will be in 1:1 correspondence with the columns of the gene expression data matrix to be constructed below.

    ## Get a list of sample labels
    STP = [SIF[k] for k in SID]

Next we are ready to read the gene expression data as a Numpy array called X, and also we will save the gene identifiers in the Python list called GID.

    ## Read the gene expression data as a list of lists, also get the gene
    ## identifiers
    GID,X = [],[]
    for line in fid:

        ## This is what signals the end of the gene expression data
        ## section in the file
        if line.startswith("!dataset_table_end"):
            break

        V = line.split("\t")

        ## Extract the values that correspond to gene expression measures
        ## and convert the strings to numbers
        x = [float(V[i]) for i in I]

        X.append(x)
        GID.append(V[0] + ";" + V[1])

## Convert the Python list of lists to a Numpy array.
X = np.array(X)

Finally, we will need index vectors telling us which columns of X contain samples from each of the two groups of interest.

## The indices of samples for the ulcerative colitis group
UC = [i for i,x in enumerate(STP) if x == "ulcerative colitis"]

## The indices of samples for the Crohn's disease group
CD = [i for i,x in enumerate(STP) if x == "Crohn's disease"]

This is the end of the read_gene_expression.py file. We place this code in a separate file from the analysis code discussed below, since we may have multiple analysis scripts, all of which will need to use this script to access the data.

Comparison of mean gene expression levels

The code for statistically comparing gene expression levels between two groups of samples will be placed in a script called gene_expression_comparison.py. We begin by importing the usual libraries, and we import our own read_gene_expression.py module described above. Note that we only import the values from our own module that are intended to be exported.

import numpy as np
from read_gene_expression import X,GID,STP,SID,UC,CD

The comparison will be based on Z-scores for the difference of mean values in the two groups. We usually log transform gene expression data since it is skewed and varies over several orders of magnitude.

## Convert X to log scale
XL = np.log(X) / np.log(2)

Calculation of Z-scores

Here is the code that constructs all of the Z-scores. Everything is vectorized for speed and brevity.

## Z-test statistics
MUC = XL[:,UC].mean(1) ## Mean of ulcerative colitis samples
MCD = XL[:,CD].mean(1) ## Mean of Crohn's disease samples
VUC = XL[:,UC].var(1)  ## Variance of ulcerative colitis samples
VCD = XL[:,CD].var(1)  ## Variance of Crohn's disease samples
nUC = len(UC)          ## Number of ulcerative colitis samples
nCD = len(CD)          ## Number of Crohn's disease samples
Z = (MUC - MCD) / np.sqrt(VUC/nUC + VCD/nCD)

Use of Z-scores to assess differential expression

If a gene is not differentially expressed, it has the same expected value in the two groups of samples. In this case, the Z-score will be standardized (i.e. will have mean zero and unit standard deviation). This data set contains data for 22,283 genes. So if none of the genes are differentially expressed, our Z-scores should approximately have zero mean and unit variance. We can check this as follows.

In [17]: Z.mean()
Out[17]: 0.17405380200720103

In [18]: Z.std()
Out[18]: 1.5018131199839895

Since the standard deviation is much greater than 1, there appear to be multiple genes for which the mean expression levels in the ulcerative colitis and Crohn's disease samples differ. Further, since the mean Z-score is positive, it appears that the dominant pattern is for genes to be expressed at a higher level in the ulcerative colitis compared to the Crohn's disease samples.

The conventional threshold for statistical significance is a p-value smaller than 0.05, which corresponds to the Z-score being greater than 2 in magnitude. Many genes meet this condition, however as we will see below this does not carry much evidence that these genes are differentially expressed.

In [19]: np.mean(np.abs(Z) > 2)
Out[19]: 0.17547008930574878

In [20]: np.mean(Z > 2)
Out[20]: 0.10819907552842975

In [21]: np.mean(Z < -2)
Out[21]: 0.067271013777319033

Differential expression in subsets of genes

Genes with low expression level are harder to measure accurately, thus we expect that fewer of these genes will meet a given statistical threshold for differential expression. Similarly, genes with low variance are potentially less likely to be associated with biological differences. We can assess these trends in our data. First, we will determine which genes are in the lower and upper halves of all genes based on either mean expression or expression variation.

## Split the sample in half based on marginal standard deviation of
## gene expression
SD = X.std(1)
isdl = np.flatnonzero(SD < np.median(SD))
isdh = np.flatnonzero(SD > np.median(SD))

## Split the sample in half based on the marginal mean gene expression
SD = X.mean(1)
imnl = np.flatnonzero(SD < np.median(SD))
imnh = np.flatnonzero(SD > np.median(SD))

Now we can look at the proportion of genes within each of these strata that have Z-score magnitude greater than two.

In [31]: np.mean(np.abs(Z[isdl]) > 2)
Out[31]: 0.13571492684678216

In [32]: np.mean(np.abs(Z[isdh]) > 2)
Out[32]: 0.21524100170541244

In [33]: np.mean(np.abs(Z[imnl]) > 2)
Out[33]: 0.13876671752984471

In [34]: np.mean(np.abs(Z[imnh]) > 2)
Out[34]: 0.21218921102234989

Controlling family-wise error rates with the Bonferroni method

A conservative way to identify a set of genes that are all very likely to be truly differentially expressed is to carry out the Z-tests using a Bonferroni correction. This involves using a more stringent threshold than 2 to declare a gene as differentially expressed. This analysis gives us a "family-wise error rate" of 5%, which means that we can be confident at the 95% level that every gene meeting the Z-score threshold is differentially expressed (i.e. we can be 95% confidence that there are no false positives).

Here we carry out this analysis, and save the names and Z-scores for the genes that are deemed differentially expressed to a file. There turn out to be 83 such genes in this analysis.

## Gaussian distribution CDF, pdf, quantile function, etc.
from scipy.stats.distributions import norm

## The Z-score threshold under a Bonferroni correction
zst = -norm.ppf(0.025/Z.shape[0])

## Write out some information about the genes that meet this condition
## to a file
ii = np.flatnonzero(np.abs(Z) > zst)
with open("bonferroni_genes.csv", "w") as OUT:
    for i in ii:
        OUT.write("%.4f,%s\n" % (Z[i], GID[i]))

We can see that the Z-score was required to be much larger than 2 to be considered differentially expressed.

In [29]: zst
Out[29]: 4.730120778084812

Source files