Daily maximum temperature data from NCDC

Background

The National Climatic Data Center (NCDC) provides data sets for a variety of climate and weather measures. Here we will perform some simple analyses using their land-based temperature data. Specifically, we will be using data from their "Quality Controlled Local Climatological Data" (QCLCD) files, which are available at the URL given below.

The code

The QCLCD files are stored as compressed archives of text files at http://cdo.ncdc.noaa.gov/qclcd_ascii. In 2007, the archive format was switched from gzipped tar to zip, so we will need to write code that can manipulate both types of archive.
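Below is a minimal sketch (not the final script) of how the two archive types can be handled uniformly; the helper name open_daily is our own invention, and the "<index>daily.txt" member name anticipates the naming convention used in the full script below.

import tarfile
import zipfile
import StringIO

## Return a file-like object for the daily data file stored inside
## an archive; handles both gzipped tar and zip archives.
def open_daily(path, fix):
    if path.endswith(".tar.gz"):
        tf = tarfile.open(path)
        return tf.extractfile(str(fix) + "daily.txt")
    elif path.endswith(".zip"):
        zf = zipfile.ZipFile(path)
        return StringIO.StringIO(zf.read(str(fix) + "daily.txt"))
    raise ValueError("unrecognized archive type: " + path)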

Obtaining and manipulating the data files

We need to import quite a few libraries, and then connect to the web server to get a listing of the relevant file names (the script will download all files ending with .zip or .tar.gz).

import urllib
import os
import tarfile
import csv
import numpy as np
import datetime
import gzip
import zipfile
import StringIO

## URL of the data files
qclcd_url = "http://cdo.ncdc.noaa.gov/qclcd_ascii/"

## Get the directory listing and pull out the archive file names by
## splitting the raw HTML on double quotes (the link targets in the
## listing are quoted strings)
url = urllib.urlopen(qclcd_url)
d = url.read()
d = d.split("\"")
Files = [x for x in d if x.endswith(".tar.gz") or x.endswith(".zip")]

After downloading the archive files to a temporary location, we leave the archives intact and use the Python zipfile and tarfile libraries to programmatically extract the members of interest. A member can be read into a string and then wrapped in a StringIO object, so that NumPy can process it as a "file-like object".
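As a minimal illustration (with a made-up two-row table, using the imports above), NumPy's loadtxt will read from any object with file-like read behavior, including a StringIO wrapper around an ordinary string:

## A small made-up table held in an ordinary string
buf = StringIO.StringIO("1,2\n3,4\n")

## loadtxt only requires file-like read behavior, not a real file
X = np.loadtxt(buf, delimiter=",")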

The individual text files are in comma-separated values (CSV) format, so after extracting the header from each file, we can use the NumPy loadtxt function to read the data into a NumPy array. We want our data set to include Ann Arbor, Michigan, and 999 other randomly chosen locations.

We will also use the Python datetime package to process the dates on which the temperature recordings were made.
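The YearMonthDay field encodes each date as an eight-digit integer, and the parsing step used in the script below works like this (20070704 is a made-up example value; datetime was imported above):

## Parse a YearMonthDay value such as 20070704
s = str(20070704)
dt = datetime.date(int(s[0:4]), int(s[4:6]), int(s[6:8]))  ## 2007-07-04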

## Set this variable to a path where temporary files can be stored
tmp_path = "/data/kshedden/tmp/qclcd"

## Fields to retain when reading the files
Fields = {"tar": ["Wban Number", "YearMonthDay", "Max Temp", "Dep from Normal"],
          "zip": ["WBAN", "YearMonthDay", "Tmax", "Depart"]}

## List of WBAN station numbers for which we will retrieve data
stations = []

## Maps from date to vectors containing the maximum temperature and
## temperature anomaly at each station
MXT = {}
ANL = {}

## Try to convert to float, return nan on failure
def convert_float(x):
    try:
        return float(x)
    except ValueError:
        return float("nan")

## Loop over the files on the server
for fname in Files:

    ## Store the file here
    dpath = os.path.join(tmp_path, fname)

    ## Retrieve the file if we don't already have it
    try:
        with open(dpath):
            print "Skipping download of " + fname
    except IOError:
        ## Get the file
        print "Retrieving " + fname
        urllib.urlretrieve(qclcd_url + fname, dpath)

    if dpath.endswith(".tar.gz"):

        ## The numerical index of the file (tar archives are named
        ## like "<index>.tar.gz")
        fix = int(fname.split(".")[0])

        ## Open the tar archive and extract the daily data file as a
        ## file-like object
        tf = tarfile.open(dpath)
        fid = tf.extractfile(str(fix) + "daily.txt")

    elif dpath.endswith(".zip"):

        ## The numerical index of the file (zip archives are named
        ## like "QCLCD<index>.zip")
        fix = int(fname[5:].split(".")[0])

        ## Open the zip archive
        gid = zipfile.ZipFile(dpath)

        ## Read the entire member file, and wrap it in a StringIO
        D = gid.read(str(fix) + "daily.txt")
        fid = StringIO.StringIO(D)

    ## Extract the data file, process it as a csv file
    R = csv.reader(fid)

    ## Read the header and remove whitespace
    H = R.next()
    H = [x.strip() for x in H]

    ## The columns we will extract, depending on the archive type
    ftype = "tar" if "tar" in dpath else "zip"
    col_ix = [H.index(field) for field in Fields[ftype]]

    ## Read the data
    converters = {k: convert_float for k in col_ix}
    Z = np.loadtxt(fid, delimiter=",", converters=converters, usecols=col_ix)

    ## Select a random subset of stations from the first file
    if not stations:

        ## Always include Ann Arbor (WBAN number 94889)
        stations = [94889,]

        ## Choose 999 distinct stations at random (sampling unique
        ## station IDs so that no station, including Ann Arbor, is
        ## selected twice)
        sta = np.unique(Z[:,0].astype(np.int32))
        sta = sta[sta != 94889]
        ii = np.random.permutation(len(sta))[0:999]
        stations.extend(sta[ii].tolist())

    ## Enables fast station lookups
    stations_h = {station : k for k,station in enumerate(stations)}

    for z in Z:

        try:
            ix = stations_h[z[0]]
        except KeyError:
            continue

        s = str(int(z[1]))
        year = int(s[0:4])
        month = int(s[4:6])
        day = int(s[6:])
        dt = datetime.date(year, month, day)

        if dt not in MXT:
            MXT[dt] = np.nan*np.zeros(1000, dtype=np.float64)
            ANL[dt] = np.nan*np.zeros(1000, dtype=np.float64)

        MXT[dt][ix] = z[2]
        ANL[dt][ix] = z[3]

Finally, we write the data to two gzipped text files in CSV format (one for the maximum temperatures and one for the anomalies), which we can read and process in our analysis scripts.

## Save everything into a file
DT = MXT.keys()
DT.sort()

for J,Z in enumerate((MXT,ANL)):

    if J == 0:
        fid = gzip.open("../Data/qclcd_mxt.csv.gz", "w")
    else:
        fid = gzip.open("../Data/qclcd_anl.csv.gz", "w")

    fid.write("Date," + ",".join([str(x) for x in stations]) + "\n")

    for dt in DT:
        fid.write(dt.isoformat() + ",")
        fid.write(",".join(["%.2f" % x for x in Z[dt]]) + "\n")

    fid.close()

Plotting code

We first load the data file (that we generated above) into a NumPy array called Z, and we also store the date information in a Python list called DT. Note that the date in DT[i] is the date on which the maximum temperature values in row i of Z (Z[i,:]) were recorded. The columns of Z correspond to different weather stations, with the first column (Z[:,0]) being Ann Arbor, Michigan.

import numpy as np
import gzip
import datetime
import csv
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from statsmodels.nonparametric.smoothers_lowess import lowess

"""
This script generates plots of the daily maximum temperature in Ann
Arbor, Michigan.
"""


pdf = PdfPages("qclcd_plots.pdf")

dpath = "../Data/qclcd_mxt.csv.gz"

fid = gzip.open(dpath)
R = csv.reader(fid)

## Skip past the header row (the date label and station IDs)
H = R.next()

## Read the data file, saving the dates as datetime.date objects,
## and saving the other data in a numpy array.
DT,Z = [],[]
for V in R:

    ## Parse the date
    U = V[0].split("-")
    U = [int(u) for u in U]
    DT.append(datetime.date(U[0], U[1], U[2]))

    ## The remaining data are numerical
    Z.append([float(v) for v in V[1:]])

Z = np.array(Z)
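As a quick check on this layout, Ann Arbor's maximum temperature on a particular day can be looked up as follows (2008-07-04 is a made-up example date that we assume lies within the observed range):

## Look up Ann Arbor's maximum temperature on a given date
i = DT.index(datetime.date(2008, 7, 4))
print Z[i, 0]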

We first make a simple plot showing the maximum daily temperatures in Ann Arbor, Michigan, over the entire period covered by the data set.

## A time plot of Ann Arbor's maximum daily temperature
plt.clf()
plt.plot(DT, Z[:,0], '-', alpha=0.6)
xt = plt.gca().get_xticklabels()
for u in xt:
    u.set_rotation(-90)
plt.ylabel("Maximum daily temperature", size=17)
pdf.savefig()

plt.clf()

Next we prepare for some analyses in which the data are aligned by month and day. We do this by changing the dates so that the year for all values is set to 1992 (chosen because it is a leap year, so February 29th remains a valid date).
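To see why the common year must be a leap year, note that constructing February 29th raises an exception in any other year:

## February 29th exists only in leap years
datetime.date(1992, 2, 29)        ## OK, 1992 is a leap year
try:
    datetime.date(1993, 2, 29)    ## 1993 is not a leap year
except ValueError:
    print "1993-02-29 is not a valid date"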

## All years in the data set, sorted
Y = np.array([d.year for d in DT])
YU = list(set(Y))
YU.sort()

## Save the dates after modifying so that they have a common 
## year (to enable overplotting of the years).
DTY = []

## Loop over the years
for y in YU:

    ## Select the data for a single year
    ii = np.flatnonzero(Y == y)

    ## Choose a leap year, else Feb 29th raises an exception
    DT1 = [datetime.date(1992, DT[i].month, DT[i].day) for i in ii]

    DTY.extend(DT1)

    ## Plot the data for year y
    plt.plot(DT1, Z[ii,0], '-', color='black', alpha=0.5)

Next we use the Statsmodels lowess function to estimate a smooth curve relating temperature to time (month and day within the year). Since lowess expects numerical data for both the independent and dependent variables, we use the toordinal method of the Python date objects to convert each date into an integer day count. Finally, we plot the individual data series for the different years on top of each other, and plot the smooth conditional mean function as a heavier colored line.

## Convert the dates to numbers for smoothing
DTO = np.array([d.toordinal() for d in DTY])

## Sort by time
ii = np.argsort(DTO)
DTO = DTO[ii]
DTY = [DTY[i] for i in ii]
YY = Z[ii,0]

## The lowess function doesn't accept nan
ii = np.flatnonzero(np.isfinite(YY))
YY = YY[ii]
DTO = DTO[ii]
DTY = [DTY[i] for i in ii]

## Do the smoothing
lf = lowess(YY, DTO, frac=0.25)

## Plot the smoothed curve on top of the raw data
plt.plot(lf[:,0], lf[:,1], '-', color='orange', lw=5, alpha=0.6)

## The year is fake, so we don't want to display it on the axes
fmt = DateFormatter("%b")
plt.gca().xaxis.set_major_formatter(fmt)

plt.title("Ann Arbor, %s to %s" % (min(DT).isoformat(), max(DT).isoformat()))
plt.ylabel("Maximum daily temperature", size=17)
pdf.savefig()

We will also use smoothing to assess the year-to-year variability of the daily maximum temperature at each day of the year. To do this, we first calculate residuals relative to the annual curve we fit earlier. Since the variance at a given day is approximately the expected squared residual, smoothing the squared residuals against the time index estimates the variance function; the square root of the smoothed values, plotted below, estimates the standard deviation.

## Plot the residuals

## Residuals
R = YY - lf[:,1]

plt.clf()

plt.plot(DTY, R, '-', color='purple', alpha=0.6)

fmt = DateFormatter("%b")
plt.gca().xaxis.set_major_formatter(fmt)
plt.title("Residuals")

pdf.savefig()


## Use smoothing to estimate the variance function

## Squared residuals
R2 = R**2

## Smooth the squared residuals
lfr = lowess(R2, DTO, frac=0.3)

plt.clf()

plt.plot(DTY, np.sqrt(lfr[:,1]), '-', lw=5, color='purple')

fmt = DateFormatter("%b")
plt.gca().xaxis.set_major_formatter(fmt)
plt.ylabel("Max temperature SD", size=17)

pdf.savefig()

pdf.close()

Downloads