US National Bridge Inventory

Contents:

Background
Downloading the data
Extracting the data
Some simple analyses
Links
Downloads

Background

The US Department of Transportation aggregates data collected by the states about their highway bridges. This effort is called the "National Bridge Inventory" (NBI). Bridges are periodically inspected, and traffic flows are recorded (if you question why this is important, see here).

Two important elements in the NBI data set are the average daily traffic counts, and information about when the bridges were constructed and improved. This is an administrative data set, meaning that the US federal government requires the states to submit information about their bridges for inclusion in the NBI database. Not all states are equally thorough about record keeping and data collection. The data should be expected to have some errors, omissions, and inconsistencies.

The traffic counts are obtained by placing traffic sensors on the bridges. This is done on an irregular basis for each bridge, so we do not have traffic count data for every bridge in every year. Also, the counts may be rounded to the 10's or 100's place.

This data set can be used to address many types of questions. The number of bridges per state, and the ages of the bridges can be used to understand the level of transportation infrastructure in each state and its condition. The traffic information can be used to understand how driving behavior has changed over time, and how driving density relates to infrastructure age. Finally, this dataset can be used to investigate variation in the compliance of different states to the NBI program, in terms of the frequency with which traffic counts are obtained, anomalies in the traffic data, and inconsistencies or anomalies in the bridge age and reconstruction data.

Downloading the data

The data are stored as zip archives, with all the data for one year stored in a single archive file. The data can be manually downloaded from the data download page, but this is a good opportunity to write an automated download script. The urllib module does all the work, we just need to loop over the files and retrieve them one-by-one.

import urllib
import os

"""
This script downloads the National Bridge inventory zip archives from

http://www.fhwa.dot.gov/bridge/nbi/ascii.cfm

and saves them in a temporary directory.
"""

## Set this variable to a path where temporary files can be stored
tmp_path = "/data/kshedden/tmp/nbi"

## Retrieve data from these years
first_year = 1992
last_year = 2011

## The base URL for the NBI files
nbi_url = "http://www.fhwa.dot.gov/bridge/nbi/"


for year in range(first_year, last_year):

    file = "%d.zip" % year

    ## Store the file here
    dpath = os.path.join(tmp_path, file)

    ## Don't download the file again if we already have it.
    try:
        with open(dpath):
            print "Skipping %s" % file
            continue
    except IOError:
        print "Downloading %s..." % file
        urllib.urlretrieve(nbi_url + file, dpath)

Extracting the data

Inside each zip archive file are state-level files. We could manually unzip each archive, producing around 1000 files. Instead, we will use the Python zipfile package to extract the data programatically from the archives.

The state-level files are stored in fixed width format, with the specification given here. We only need a few of these fields. The column locations of these fields, and some other constants, are defined near the top of our script.

import zipfile
import os
import pandas as pd
import StringIO
import numpy as np
import gzip

"""
Generate a single text data file from the National Bridge Inventory
zip archives.

The columns of the data file that is generated are:

Column 1: State abbreviation
Column 2: Bridge identifier
Column 3: Year the bridge was constructed
Columns 4-: Traffic levels

The years of the traffic levels range from year0 to yearmax, inclusive (see below).

When the traffic level was not observed, it is recorded as nan.

This script takes several hours to run.
"""

## Need to set this to a path that can be used to store temporary files
tmp_path = "/data/kshedden/tmp/nbi"

## This year is taken as the time origin
year0 = 1990

## Discard data after this year
yearmax = 2011

## From http://www.fhwa.dot.gov/bridge/nbi/format.cfm
State = [1, 3]
Structure_id = [4, 18]
Traffic = [165, 170]
Traffic_year = [171, 174]
Year_built = [157, 160]
Latitude = [130, 137]
Longitude = [138, 146]
Year_reconstructed = [362, 365]
item_5a = [19, 19]
item_49 = [223, 228]
item_112 = [374, 374]
item_42a = [200, 200]

colspecs = [State, Structure_id, Traffic, Traffic_year, Year_built, item_5a,\
            item_49, item_112, item_42a, Year_reconstructed, Latitude,
            Longitude]

## Convert to zero-based indexing.
for c in colspecs:
    c[0] -= 1

## Use these column names for the columns in `colspecs`.
colnames = {0: "State", 1: "Structure_id", 2: "Traffic", 3: "Traffic_year",
            4: "Year_built", 5: "item_5a", 6: "item_49", 7: "item_112",
            8: "item_42a", 9: "Year_reconstructed", 10: "Latitude",
            11: "Longitude"}


## State codes
ST = {"AL" : 14, "AK" : 20, "AZ" : 49, "AR" : 56, "CA" : 69, "CO" :
      88, "CT" : 91, "DE" : 103, "DC" : 113, "FL" : 124, "GA" : 134,
      "HI" : 159, "ID" : 160, "IL" : 175, "IN" : 185, "IA" : 197, "KS"
      : 207, "KY" : 214, "LA" : 226, "ME" : 231, "MD" : 243, "MA" :
      251, "MI" : 265, "MN" : 275, "MS" : 284, "MO" : 297, "MT" : 308,
      "NE" : 317, "NV" : 329, "NH" : 331, "NJ" : 342, "NM" : 356, "NY"
      : 362, "NC" : 374, "ND" : 388, "OH" : 395, "OK" : 406, "OR" :
      410, "PA" : 423, "RI" : 441, "SC" : 454, "SD" : 468, "TN" : 474,
      "TX" : 486, "UT" : 498, "VT" : 501, "VA" : 513, "WA" : 530, "WV"
      : 543, "WI" : 555, "WY" : 568}

## Reverse the state codes map
STR = {ST[k]: k for k in ST}


def my_float(x):
    """Convert a string to a float, returning Nan if this fails."""
    try:
        return float(x)
    except ValueError:
        return float("nan")

Pandas has a function called read_fwf that can read fixed width format files. However it can only read from a "file like object", whereas the zip archive contents are returned as strings. By wrapping this string in a StringIO object, we can pass it into read_fwf.

## TR[state][bridge] is a vector of annual traffic counts starting in year 1992.
## Missing data are denoted nan.
TR = {}

## YB[state][bridge] is the year that bridge `bridge` in state `state` as built.
YB = {}

## LatLon[state][bridge] is the location of bridge `bridge` in state `state`.
LatLon = {}

## Loop over the zip archives
L = os.listdir(tmp_path)
L = [x for x in L if x.endswith(".zip")]
for file in L:

    print "Processing " + file + ":"

    ## The path to the zip archive 
    dpath = os.path.join(tmp_path, file)

    ## Open the zip archive
    fid = zipfile.ZipFile(dpath)

    ## Get a list of files in the archive
    IL = fid.infolist()

    ## Loop over the files, and process them one by one
    for ilj,il in enumerate(IL):

        print "    %.0f%% " % np.floor(100*ilj/float(len(IL)))

        ## The name of a file in the archive
        fname = il.filename

        ## Read the entire file, and wrap it in a stringio
        D = fid.read(fname)
        sio = StringIO.StringIO(D)

        ## Put the data into a pandas data frame
        D = pd.read_fwf(sio, colspecs=colspecs, header=None)

        ## Give the columns useful names
        D.rename(columns=colnames, inplace=True)

        ## Require certain items to be numbers.
        for vn in "item_5a","Traffic":
            D[vn] = D[vn].map(my_float)
            D[vn] = D[vn].astype(np.float64)

The NBI database includes bridges that are not highway bridges. Here, we include only the highway bridges in our analysis. The following code accomplishes this selection.

        ## Select only highway bridges, from
        ## Note on using the data (http://www.fhwa.dot.gov/bridge/nbi/ascii.cfm):
        ##
        ## After downloading the data you will need to distinguish the subset
        ## of "highway bridges". To obtain this set items 5a=1; 49>=6.1meters;
        ## 112=Y and 42a=1 or 4 or 5 or 6 or 7 or 8.
        ix1 = [x == 1 for x in D["item_5a"]]
        ix2 = D["item_49"] >= 61
        ix3 = D["item_112"] == "Y" 
        ix4 = [x in (1,4,5,6,7,8) for x in D["item_42a"]]
        ix = ix1 & ix2 & ix3 & ix4
        D = D.loc[ix,:]

        ## Drop any rows with missing values
        D = D.dropna()

Now we can loop through the rows of the data frame and store the values of interest in the dictionaries TR (for traffic data) and YB for each bridge's year of construction.

        ## Loop over the records in a file
        for ix in D.index:

            ## The state for the current record
            try:
                state = STR[D.loc[ix,"State"]]
            except KeyError:
                continue

            ## The first time this state was encountered
            if state not in TR:
                TR[state] = {}
                YB[state] = {}
                LatLon[state] = {}

            ## The first time this bridge was encountered
            structure_id = D.loc[ix,"Structure_id"]

            ## The first time this structure was encountered
            if structure_id not in TR[state]:
                TR[state][structure_id] = [float("nan"),]*(yearmax - year0 + 1)

                ## Be prepared for non-paresable year data
                try:
                    YB[state][structure_id] = float(D.loc[ix,"Year_built"])
                except ValueError:
                    YB[state][structure_id] = float("nan")

                lat = str(D.loc[ix,"Latitude"])
                lon = str(D.loc[ix,"Longitude"])
                latlon = lat + ";" + lon
                LatLon[state][structure_id] = latlon

            ## Store the traffic data if possible
            try:
                year = int(D.loc[ix,"Traffic_year"])
            except ValueError:
                continue
            if year >= year0 and year <= yearmax:
                traffic = D.loc[ix,"Traffic"]
                try:
                    TR[state][structure_id][year-year0] = traffic
                except ValueError:
                    continue

Now that we have all of the data of interest stored in dictionaries, we can write it out to a flat file.

## Save all the results
with gzip.open("../Data/nbi_data.csv.gz", "w") as fid:

    fid.write("State,Structure id,Year built,Latitude,Longitude,")
    fid.write(",".join([str(y) for y in range(year0, yearmax+1)]) + "\n")

    ## We will write out the data for the states in alphabetical order
    States = TR.keys()
    States.sort()

    for state in States:

        ## Alphabetically sort the bridge id's, file easier to
        ## manually scan.
        K = TR[state].keys()
        K.sort()

        for k in K:

            ## Some states may be missing year of construction data
            yb = float("nan")
            try:
                yb = YB[state][k]
            except KeyError:
                pass

            latlon = ""
            try:
                latlon = LatLon[state][k]
            except KeyError:
                pass

            latlon = latlon.split(";")
            fid.write("%s,%s,%.0f,%s,%s," % (state, k, yb, latlon[0], latlon[1]))
            fid.write(",".join(["%.0f" % x for x in TR[state][k]]) + "\n")

Some simple analyses

Our analysis code is in the script nbi_analysis.py. We start by reading in the data and constructing some summary statistics. The summary statistics are written out to a file in CSV format.

import numpy as np
from nbi_load_data import TR,BI
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

## The state labels
ST = TR.keys()

## Map from each state name to the number of bridges
## in that state
NB = {state: TR[state].shape[0] for state in ST}

## Map from each state name to the median year that bridges were built
## in that state
MDYB = {state: np.median(BI[state][:,0]) for state in ST}

## Map from each state name to the median number of traffic data points 
## per bridge in that state
MTC = {state: np.median(np.isfinite(TR[state]).sum(1)) for state in ST}

## Create a file containing some summary statistics
with open("nbi_summaries.csv", "w") as OUT:
    OUT.write("State,#bridges,Median year built,Median #traffic points/bridge\n")
    for st in ST:
        OUT.write("%s,%d,%d,%d\n" % (st, NB[st], MDYB[st], MTC[st]))

Next we comnstruct a series of plots. The first shows the distribution of the years in which bridges were built within each state, using side by side box plots.

pdf = PdfPages("nbi_plots.pdf")

##
## Boxplot of the years in which bridges were built in each state
##

## Sort the states by median year of bridges being built
YB = [MDYB[st] for st in ST]
ii = np.argsort(YB)

## Reorder the data by median year
F = [BI[state][:,0] for state in ST]
F = [F[i] for i in ii]

## Reorder the state names accordingly
STi = [ST[i] for i in ii]

## We need an unusually wide figure
plt.figure(figsize=(10,5))
plt.clf()
plt.boxplot(F)

plt.gca().set_xticklabels([])
for i,st in enumerate(STi):
    plt.text(i+1, 1830 - 10 * (i % 2), st, ha='center')

plt.ylabel("Year built", size=17)
pdf.savefig()

Next, we make a simple dot plot of the number of bridges in each state (note that most bridges are counted twice, once for the traffic flow in each direction).

##
## Dot plot of the number of bridges per state
##

## Sort the states by the number of bridges
NBV = [NB[st] for st in ST]
ii = np.argsort(NBV)
NBV = [NBV[i] for i in ii]
STi = [ST[i] for i in ii]

## An unusually wide figure
plt.figure(figsize=(10,5))
plt.clf()

plt.gca().set_xticklabels([])
for i,st in enumerate(STi):

    ## Guide lines
    plt.plot([i,i], [0,80000], '-', color='lightgrey')

    ## Horizontal axis labels
    plt.text(i, -3000 - 3000 * (i % 2), st, ha='center')

## Plot the data points
plt.plot(range(len(NBV)), NBV, 'o')

plt.xlim(-1, 51)
plt.ylabel("Number of bridges", size=17)
pdf.savefig()

Here we plot the number of new bridges built each year. Note that the spikes on the decade points probably reflet some states coding the years at which older bridges were built to the nearest decade.

##
## Plot the number of bridges built per year in the US
##

## Aggregate the counts, exclude unless between 1850 and 2012
FY = np.zeros(2012-1850)
for st in ST:
    for h in BI[st]:
        if h[0] > 1850 and h[0] < 2012:
            FY[h[0] - 1850] += 1

plt.clf()
plt.plot(range(1850,2012), FY, '-', lw=5, color='orange')
plt.xlabel("Year", size=17)
plt.ylabel("Number of new bridges", size=17)
pdf.savefig()

Next we plot the traffic count per year:

##
## Plot the total traffic count per year
##

## Aggregate the traffic counts
TT = np.zeros(2012-1990)
for st in ST:
    for j,v in enumerate(TR[st].T): ## Iterate over columns
        ii = np.flatnonzero(np.isfinite(v))
        TT[j] += v[ii].sum()

plt.clf()
plt.plot(range(1990,2012), TT, '-', lw=5, color='orange')
plt.xlabel("Year", size=17)
plt.ylabel("Traffic count", size=17)
pdf.savefig()

And finally we make a scatterplot showing the observed traffic counts on individual bridges in 1995 and 2000 (for bridges that had counts recorded in both of those years).

##
## Plot the total traffic count per year
##

## Aggregate the traffic counts
Q = []
for st in ST:

    ## 1995 and 2000
    V = TR[st][:,[5,10]]

    ii = np.flatnonzero(np.isfinite(V).sum(1) == 2)
    V = V[ii,:]
    Q.append(V)

Q = np.concatenate(Q, axis=0)

## Take a random subset to keep the file size down
ii = np.random.permutation(Q.shape[0])[0:10000]
Q = Q[ii,:]

plt.figure(figsize=(7,5))
plt.clf()
plt.axes([0.15,0.12,0.8,0.8])
plt.grid(True)
plt.plot(Q[:,0], Q[:,1], 'o', lw=5, alpha=0.5, color='orange')
plt.xlabel("1995", size=17)
plt.ylabel("2000", size=17)
pdf.savefig()

pdf.close()

US National Bridge Inventory

Background

Downloading the data

Extracting the data

Some simple analyses

Links

Downloads