Contents:
The US Department of Transportation aggregates data collected by the states about their highway bridges. This effort is called the "National Bridge Inventory" (NBI). Bridges are periodically inspected, and traffic flows are recorded (if you question why this is important, see here).
Two important elements in the NBI data set are the average daily traffic counts, and information about when the bridges were constructed and improved. This is an administrative data set, meaning that the US federal government requires the states to submit information about their bridges for inclusion in the NBI database. Not all states are equally thorough about record keeping and data collection. The data should be expected to have some errors, omissions, and inconsistencies.
The traffic counts are obtained by placing traffic sensors on the bridges. This is done on an irregular basis for each bridge, so we do not have traffic count data for every bridge in every year. Also, the counts may be rounded to the 10's or 100's place.
This data set can be used to address many types of questions. The number of bridges per state, and the ages of the bridges can be used to understand the level of transportation infrastructure in each state and its condition. The traffic information can be used to understand how driving behavior has changed over time, and how driving density relates to infrastructure age. Finally, this dataset can be used to investigate variation in the compliance of different states to the NBI program, in terms of the frequency with which traffic counts are obtained, anomalies in the traffic data, and inconsistencies or anomalies in the bridge age and reconstruction data.
The data are stored as zip
archives, with all the data
for one year stored in a single archive file. The data can be
manually downloaded from the data download
page, but this is a
good opportunity to write an automated download script. The urllib
module does all the work, we just need to loop over the files and
retrieve them one-by-one.
import urllib
import os
"""
This script downloads the National Bridge inventory zip archives from
http://www.fhwa.dot.gov/bridge/nbi/ascii.cfm
and saves them in a temporary directory.
"""
## Set this variable to a path where temporary files can be stored
tmp_path = "/data/kshedden/tmp/nbi"
## Retrieve data from these years
first_year = 1992
last_year = 2011
## The base URL for the NBI files
nbi_url = "http://www.fhwa.dot.gov/bridge/nbi/"
for year in range(first_year, last_year):
file = "%d.zip" % year
## Store the file here
dpath = os.path.join(tmp_path, file)
## Don't download the file again if we already have it.
try:
with open(dpath):
print "Skipping %s" % file
continue
except IOError:
print "Downloading %s..." % file
urllib.urlretrieve(nbi_url + file, dpath)
Inside each zip archive file are state-level files. We could manually
unzip each archive, producing around 1000 files. Instead, we will use
the Python zipfile
package to extract the data programatically from
the archives.
The state-level files are stored in fixed width format, with the specification given here. We only need a few of these fields. The column locations of these fields, and some other constants, are defined near the top of our script.
import zipfile
import os
import pandas as pd
import StringIO
import numpy as np
import gzip
"""
Generate a single text data file from the National Bridge Inventory
zip archives.
The columns of the data file that is generated are:
Column 1: State abbreviation
Column 2: Bridge identifier
Column 3: Year the bridge was constructed
Columns 4-: Traffic levels
The years of the traffic levels range from year0 to yearmax, inclusive (see below).
When the traffic level was not observed, it is recorded as nan.
This script takes several hours to run.
"""
## Need to set this to a path that can be used to store temporary files
tmp_path = "/data/kshedden/tmp/nbi"
## This year is taken as the time origin
year0 = 1990
## Discard data after this year
yearmax = 2011
## From http://www.fhwa.dot.gov/bridge/nbi/format.cfm
State = [1, 3]
Structure_id = [4, 18]
Traffic = [165, 170]
Traffic_year = [171, 174]
Year_built = [157, 160]
Latitude = [130, 137]
Longitude = [138, 146]
Year_reconstructed = [362, 365]
item_5a = [19, 19]
item_49 = [223, 228]
item_112 = [374, 374]
item_42a = [200, 200]
colspecs = [State, Structure_id, Traffic, Traffic_year, Year_built, item_5a,\
item_49, item_112, item_42a, Year_reconstructed, Latitude,
Longitude]
## Convert to zero-based indexing.
for c in colspecs:
c[0] -= 1
## Use these column names for the columns in `colspecs`.
colnames = {0: "State", 1: "Structure_id", 2: "Traffic", 3: "Traffic_year",
4: "Year_built", 5: "item_5a", 6: "item_49", 7: "item_112",
8: "item_42a", 9: "Year_reconstructed", 10: "Latitude",
11: "Longitude"}
## State codes
ST = {"AL" : 14, "AK" : 20, "AZ" : 49, "AR" : 56, "CA" : 69, "CO" :
88, "CT" : 91, "DE" : 103, "DC" : 113, "FL" : 124, "GA" : 134,
"HI" : 159, "ID" : 160, "IL" : 175, "IN" : 185, "IA" : 197, "KS"
: 207, "KY" : 214, "LA" : 226, "ME" : 231, "MD" : 243, "MA" :
251, "MI" : 265, "MN" : 275, "MS" : 284, "MO" : 297, "MT" : 308,
"NE" : 317, "NV" : 329, "NH" : 331, "NJ" : 342, "NM" : 356, "NY"
: 362, "NC" : 374, "ND" : 388, "OH" : 395, "OK" : 406, "OR" :
410, "PA" : 423, "RI" : 441, "SC" : 454, "SD" : 468, "TN" : 474,
"TX" : 486, "UT" : 498, "VT" : 501, "VA" : 513, "WA" : 530, "WV"
: 543, "WI" : 555, "WY" : 568}
## Reverse the state codes map
STR = {ST[k]: k for k in ST}
def my_float(x):
"""Convert a string to a float, returning Nan if this fails."""
try:
return float(x)
except ValueError:
return float("nan")
Pandas has a function called read_fwf
that can read fixed width
format files. However it can only read from a "file like object",
whereas the zip archive contents are returned as strings. By wrapping
this string in a StringIO
object, we can pass it into read_fwf
.
## TR[state][bridge] is a vector of annual traffic counts starting in year 1992.
## Missing data are denoted nan.
TR = {}
## YB[state][bridge] is the year that bridge `bridge` in state `state` as built.
YB = {}
## LatLon[state][bridge] is the location of bridge `bridge` in state `state`.
LatLon = {}
## Loop over the zip archives
L = os.listdir(tmp_path)
L = [x for x in L if x.endswith(".zip")]
for file in L:
print "Processing " + file + ":"
## The path to the zip archive
dpath = os.path.join(tmp_path, file)
## Open the zip archive
fid = zipfile.ZipFile(dpath)
## Get a list of files in the archive
IL = fid.infolist()
## Loop over the files, and process them one by one
for ilj,il in enumerate(IL):
print " %.0f%% " % np.floor(100*ilj/float(len(IL)))
## The name of a file in the archive
fname = il.filename
## Read the entire file, and wrap it in a stringio
D = fid.read(fname)
sio = StringIO.StringIO(D)
## Put the data into a pandas data frame
D = pd.read_fwf(sio, colspecs=colspecs, header=None)
## Give the columns useful names
D.rename(columns=colnames, inplace=True)
## Require certain items to be numbers.
for vn in "item_5a","Traffic":
D[vn] = D[vn].map(my_float)
D[vn] = D[vn].astype(np.float64)
The NBI database includes bridges that are not highway bridges. Here, we include only the highway bridges in our analysis. The following code accomplishes this selection.
## Select only highway bridges, from
## Note on using the data (http://www.fhwa.dot.gov/bridge/nbi/ascii.cfm):
##
## After downloading the data you will need to distinguish the subset
## of "highway bridges". To obtain this set items 5a=1; 49>=6.1meters;
## 112=Y and 42a=1 or 4 or 5 or 6 or 7 or 8.
ix1 = [x == 1 for x in D["item_5a"]]
ix2 = D["item_49"] >= 61
ix3 = D["item_112"] == "Y"
ix4 = [x in (1,4,5,6,7,8) for x in D["item_42a"]]
ix = ix1 & ix2 & ix3 & ix4
D = D.loc[ix,:]
## Drop any rows with missing values
D = D.dropna()
Now we can loop through the rows of the data frame and store the
values of interest in the dictionaries TR
(for traffic data) and
YB
for each bridge's year of construction.
## Loop over the records in a file
for ix in D.index:
## The state for the current record
try:
state = STR[D.loc[ix,"State"]]
except KeyError:
continue
## The first time this state was encountered
if state not in TR:
TR[state] = {}
YB[state] = {}
LatLon[state] = {}
## The first time this bridge was encountered
structure_id = D.loc[ix,"Structure_id"]
## The first time this structure was encountered
if structure_id not in TR[state]:
TR[state][structure_id] = [float("nan"),]*(yearmax - year0 + 1)
## Be prepared for non-paresable year data
try:
YB[state][structure_id] = float(D.loc[ix,"Year_built"])
except ValueError:
YB[state][structure_id] = float("nan")
lat = str(D.loc[ix,"Latitude"])
lon = str(D.loc[ix,"Longitude"])
latlon = lat + ";" + lon
LatLon[state][structure_id] = latlon
## Store the traffic data if possible
try:
year = int(D.loc[ix,"Traffic_year"])
except ValueError:
continue
if year >= year0 and year <= yearmax:
traffic = D.loc[ix,"Traffic"]
try:
TR[state][structure_id][year-year0] = traffic
except ValueError:
continue
Now that we have all of the data of interest stored in dictionaries, we can write it out to a flat file.
## Save all the results
with gzip.open("../Data/nbi_data.csv.gz", "w") as fid:
fid.write("State,Structure id,Year built,Latitude,Longitude,")
fid.write(",".join([str(y) for y in range(year0, yearmax+1)]) + "\n")
## We will write out the data for the states in alphabetical order
States = TR.keys()
States.sort()
for state in States:
## Alphabetically sort the bridge id's, file easier to
## manually scan.
K = TR[state].keys()
K.sort()
for k in K:
## Some states may be missing year of construction data
yb = float("nan")
try:
yb = YB[state][k]
except KeyError:
pass
latlon = ""
try:
latlon = LatLon[state][k]
except KeyError:
pass
latlon = latlon.split(";")
fid.write("%s,%s,%.0f,%s,%s," % (state, k, yb, latlon[0], latlon[1]))
fid.write(",".join(["%.0f" % x for x in TR[state][k]]) + "\n")
Our analysis code is in the script nbi_analysis.py. We start by reading in the data and constructing some summary statistics. The summary statistics are written out to a file in CSV format.
import numpy as np
from nbi_load_data import TR,BI
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
## The state labels
ST = TR.keys()
## Map from each state name to the number of bridges
## in that state
NB = {state: TR[state].shape[0] for state in ST}
## Map from each state name to the median year that bridges were built
## in that state
MDYB = {state: np.median(BI[state][:,0]) for state in ST}
## Map from each state name to the median number of traffic data points
## per bridge in that state
MTC = {state: np.median(np.isfinite(TR[state]).sum(1)) for state in ST}
## Create a file containing some summary statistics
with open("nbi_summaries.csv", "w") as OUT:
OUT.write("State,#bridges,Median year built,Median #traffic points/bridge\n")
for st in ST:
OUT.write("%s,%d,%d,%d\n" % (st, NB[st], MDYB[st], MTC[st]))
Next we comnstruct a series of plots. The first shows the distribution of the years in which bridges were built within each state, using side by side box plots.
pdf = PdfPages("nbi_plots.pdf")
##
## Boxplot of the years in which bridges were built in each state
##
## Sort the states by median year of bridges being built
YB = [MDYB[st] for st in ST]
ii = np.argsort(YB)
## Reorder the data by median year
F = [BI[state][:,0] for state in ST]
F = [F[i] for i in ii]
## Reorder the state names accordingly
STi = [ST[i] for i in ii]
## We need an unusually wide figure
plt.figure(figsize=(10,5))
plt.clf()
plt.boxplot(F)
plt.gca().set_xticklabels([])
for i,st in enumerate(STi):
plt.text(i+1, 1830 - 10 * (i % 2), st, ha='center')
plt.ylabel("Year built", size=17)
pdf.savefig()
Next, we make a simple dot plot of the number of bridges in each state (note that most bridges are counted twice, once for the traffic flow in each direction).
##
## Dot plot of the number of bridges per state
##
## Sort the states by the number of bridges
NBV = [NB[st] for st in ST]
ii = np.argsort(NBV)
NBV = [NBV[i] for i in ii]
STi = [ST[i] for i in ii]
## An unusually wide figure
plt.figure(figsize=(10,5))
plt.clf()
plt.gca().set_xticklabels([])
for i,st in enumerate(STi):
## Guide lines
plt.plot([i,i], [0,80000], '-', color='lightgrey')
## Horizontal axis labels
plt.text(i, -3000 - 3000 * (i % 2), st, ha='center')
## Plot the data points
plt.plot(range(len(NBV)), NBV, 'o')
plt.xlim(-1, 51)
plt.ylabel("Number of bridges", size=17)
pdf.savefig()
Here we plot the number of new bridges built each year. Note that the spikes on the decade points probably reflet some states coding the years at which older bridges were built to the nearest decade.
##
## Plot the number of bridges built per year in the US
##
## Aggregate the counts, exclude unless between 1850 and 2012
FY = np.zeros(2012-1850)
for st in ST:
for h in BI[st]:
if h[0] > 1850 and h[0] < 2012:
FY[h[0] - 1850] += 1
plt.clf()
plt.plot(range(1850,2012), FY, '-', lw=5, color='orange')
plt.xlabel("Year", size=17)
plt.ylabel("Number of new bridges", size=17)
pdf.savefig()
Next we plot the traffic count per year:
##
## Plot the total traffic count per year
##
## Aggregate the traffic counts
TT = np.zeros(2012-1990)
for st in ST:
for j,v in enumerate(TR[st].T): ## Iterate over columns
ii = np.flatnonzero(np.isfinite(v))
TT[j] += v[ii].sum()
plt.clf()
plt.plot(range(1990,2012), TT, '-', lw=5, color='orange')
plt.xlabel("Year", size=17)
plt.ylabel("Traffic count", size=17)
pdf.savefig()
And finally we make a scatterplot showing the observed traffic counts on individual bridges in 1995 and 2000 (for bridges that had counts recorded in both of those years).
##
## Plot the total traffic count per year
##
## Aggregate the traffic counts
Q = []
for st in ST:
## 1995 and 2000
V = TR[st][:,[5,10]]
ii = np.flatnonzero(np.isfinite(V).sum(1) == 2)
V = V[ii,:]
Q.append(V)
Q = np.concatenate(Q, axis=0)
## Take a random subset to keep the file size down
ii = np.random.permutation(Q.shape[0])[0:10000]
Q = Q[ii,:]
plt.figure(figsize=(7,5))
plt.clf()
plt.axes([0.15,0.12,0.8,0.8])
plt.grid(True)
plt.plot(Q[:,0], Q[:,1], 'o', lw=5, alpha=0.5, color='orange')
plt.xlabel("1995", size=17)
plt.ylabel("2000", size=17)
pdf.savefig()
pdf.close()