Contents:
The National Climate Data Center provides data sets of various climate and weather measures. Here we will perform some simple analyses using their land-based temperature data, which is available here. Specifically, we will be using data from their "Quality Controlled Local Climatological Data" (QCLCD) data files.
The QCLCD files are stored as text files at http://cdo.ncdc.noaa.gov/qclcd_ascii. Apparently in 2007 they switched their data file archive format from tar format to zip format. Therefore we will need to write code that can manipulate both types of archives.
We need to import quite a few libraries, and then connect to the web server to get a listing of the relevant file names (the script will download all files ending with .zip or .tar.gz).
import urllib
import os
import tarfile
import csv
import numpy as np
import datetime
import gzip
import zipfile
import StringIO
## URL of the data files
qclcd_url = "http://cdo.ncdc.noaa.gov/qclcd_ascii/"
## Get the directory contents
url = urllib.urlopen(qclcd_url)
d = url.read()
d = d.split("\"")
Files = [x for x in d if x.endswith(".tar.gz") or x.endswith(".zip")]
After downloading the archive files to a temporary location, we will
leave these files intact and use the Python zipfile and tarfile
libraries to programmatically extract the files of interest. The
files inside the archives can be read as strings, then wrapped in a
StringIO
class so that they can be processed by NumPy as "file like
objects".
The individual text files are in a comma separated values (csv)
format, so after extracting the header from each file, we can use the
NumPy loadtxt
function to read the data into a NumPy array. We want
our data set to include Ann Arbor, Michigan, and 999 other randomly
chosen locations.
We also will make use of the Python datetime package to process the dates at which the temperature recordings were made.
## Set this variable to a path where temporary files can be stored
tmp_path = "/data/kshedden/tmp/qclcd"
## Fields to retain when reading the files
Fields = {"tar": ["Wban Number", "YearMonthDay", "Max Temp", "Dep from Normal"],
"zip": ["WBAN", "YearMonthDay", "Tmax", "Depart"]}
## List of WBAN station numbers for which we will retrieve data
stations = []
## Maps from date to vectors containing the maximum temperature and
## temparature anomaly at each station
MXT = {}
ANL = {}
## Try to convert to float, return nan on failure
def convert_float(x):
try:
return float(x)
except ValueError:
return float("nan")
## Loop over the files on the server
for file in Files:
## Store the file here
dpath = os.path.join(tmp_path, file)
## Retrieve the file if we don't already have it
try:
with open(dpath):
print "Skipping download of " + file
pass
except IOError:
## Get the file
print "Retrieving " + file
urllib.urlretrieve(qclcd_url + file, dpath)
if dpath.endswith(".tar.gz"):
## The numerical index of the file
fix = int(file.split(".")[0])
## Open the tar archive for reading
tf = tarfile.open(dpath)
fid = tf.extractfile(str(fix) + "daily.txt")
elif dpath.endswith(".zip"):
## The numerical index of the file
fix = int(file[5:].split(".")[0])
## Open the zip archive
gid = zipfile.ZipFile(dpath)
## Read the entire file, and wrap it in a stringio
D = gid.read(str(fix) + "daily.txt")
fid = StringIO.StringIO(D)
## Extract the data file, process it as a csv file
R = csv.reader(fid)
## Read the header and remove whitespace
H = R.next()
H = [x.strip() for x in H]
## The columns we will extract
col_ix = [H.index(field) for field in Fields[["zip", "tar"]["tar" in dpath]]]
## Read the data
converters = {k: convert_float for k in col_ix}
Z = np.loadtxt(fid, delimiter=",", converters=converters, usecols=col_ix)
## Select a random subset of stations from the first file
if not stations:
## Always include Ann Arbor
stations = [94889,]
## Choose 999 stations at random
ii = np.random.permutation(Z.shape[0])[0:999]
stations.extend(Z[ii,0].astype(np.int32).tolist())
## Enables fast station lookups
stations_h = {station : k for k,station in enumerate(stations)}
for z in Z:
try:
ix = stations_h[z[0]]
except KeyError:
continue
s = str(int(z[1]))
year = int(s[0:4])
month = int(s[4:6])
day = int(s[6:])
dt = datetime.date(year, month, day)
if dt not in MXT:
MXT[dt] = np.nan*np.zeros(1000, dtype=np.float64)
ANL[dt] = np.nan*np.zeros(1000, dtype=np.float64)
MXT[dt][ix] = z[2]
ANL[dt][ix] = z[3]
Finally, we write the data to a gzipped text file, in CSV format, which we can read and process in our analysis scripts.
## Save everything into a file
DT = MXT.keys()
DT.sort()
for J,Z in enumerate((MXT,ANL)):
if J == 0:
fid = gzip.open("../Data/qclcd_mxt.csv.gz", "w")
else:
fid = gzip.open("../Data/qclcd_anl.csv.gz", "w")
fid.write("Date," + ",".join([str(x) for x in stations]) + "\n")
for dt in DT:
fid.write(dt.isoformat() + ",")
fid.write(",".join(["%.2f" % x for x in Z[dt]]) + "\n")
fid.close()
We first load the data file (that we generated above) into a NumPy
array called Z
, and we also store the date information in a Python
list called DT
. Note that the date in DT[i]
is the date at which
the maximum temperature values in row i of Z
(Z[i,:]
) occurred.
The columns of Z
corresond to different weather stations, with the
first column(Z[:,0]
) being Ann Arbor, Michigan.
import numpy as np
import gzip
import datetime
import csv
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
from statsmodels.nonparametric.smoothers_lowess import lowess
"""
This script generates plots of the daily maximum temperature in Ann
Arbor, Michigan.
"""
pdf = PdfPages("qclcd_plots.pdf")
dpath = "../Data/qclcd_mxt.csv.gz"
fid = gzip.open(dpath)
R = csv.reader(fid)
H = R.next()
## Read the data file, saving the dates as datetime.date objects,
## and saving the other data in a numpy array.
DT,Z = [],[]
for V in R:
## Parse the date
U = V[0].split("-")
U = [int(u) for u in U]
DT.append(datetime.date(U[0], U[1], U[2]))
## The remaining data are numerical
Z.append([float(v) for v in V[1:]])
Z = np.array(Z)
We first make a simple plot showing the maximum daily temperatures in Ann Arbor, Michigan, over the entire duration that is present in the data set.
## A time plot of Ann Arbor's maximum daily temperature
plt.clf()
plt.plot(DT, Z[:,0], '-', alpha=0.6)
xt = plt.gca().get_xticklabels()
for u in xt:
u.set_rotation(-90)
plt.ylabel("Maximum daily temperature", size=17)
pdf.savefig()
plt.clf()
Next we want to prepare to do some analyses in which we align the data by month and day. We do this by changing the dates so that the year for all values is set to 1992 (chosen because it was a leap year).
## All years in the data set, sorted
Y = np.array([d.year for d in DT])
YU = list(set(Y))
YU.sort()
## Save the dates after modifying so that they have a common
## year (to enable overplotting of the years).
DTY = []
## Loop over the years
for y in YU:
## Select the data for a single year
ii = np.flatnonzero(Y == y)
## Choose a leap year, else Feb 29th raises an exception
DT1 = [datetime.date(1992, DT[i].month, DT[i].day) for i in ii]
DTY.extend(DT1)
## Plot the data for year y
plt.plot(DT1, Z[ii,0], '-', color='black', alpha=0.5)
Next we want to use the Statsmodels lowess
function to estimate a
smooth curve relating temperature to time (month and day within a
year). Since lowess expects numerical data for both the independent
and dependent variables, we use the toordinal
method of the Python
time objects to convert the time objects into integers. Finally, we
plot the individual data series for the different years on top of each
other, and plot the smooth conditional mean function in a heavier
colored line.
## Convert the dates to numbers for smoothing
DTO = np.array([d.toordinal() for d in DTY])
## Sort by time
ii = np.argsort(DTO)
DTO = DTO[ii]
DTY = [DTY[i] for i in ii]
YY = Z[ii,0]
## The lowess function doesn't accept nan
ii = np.flatnonzero(np.isfinite(YY))
YY = YY[ii]
DTO = DTO[ii]
DTY = [DTY[i] for i in ii]
## Do the smoothing
lf = lowess(YY, DTO, frac=0.25)
## Plot the smoothed curve on top of the raw data
plt.plot(lf[:,0], lf[:,1], '-', color='orange', lw=5, alpha=0.6)
## The year is fake, so we don't want to display it on the axes
fmt = DateFormatter("%b")
plt.gca().xaxis.set_major_formatter(fmt)
plt.title("Ann Arbor, %s to %s" % (min(DT).isoformat(), max(DT).isoformat()))
plt.ylabel("Maximum daily temperature", size=17)
pdf.savefig()
We also will use smoothing to assess the year-to-year variability in the daily maximum temperatures at each day of the year. To do this, we first calculate residuals based on the annual curve we fit earlier. We then smooth the squared residuals against the time index.
## Plot the residuals
## Residuals
R = YY - lf[:,1]
plt.clf()
plt.plot(DTY, R, '-', color='purple', alpha=0.6)
fmt = DateFormatter("%b")
plt.gca().xaxis.set_major_formatter(fmt)
plt.title("Residuals")
pdf.savefig()
## Use smoothing to estimate the variance function
## Squared residuals
R2 = R**2
## Smooth the squared residuals
lfr = lowess(R2, DTO, frac=0.3)
plt.clf()
plt.plot(DTY, np.sqrt(lfr[:,1]), '-', lw=5, color='purple')
fmt = DateFormatter("%b")
plt.gca().xaxis.set_major_formatter(fmt)
plt.ylabel("Max temperature SD", size=17)
pdf.savefig()
pdf.close()