Web Scraping

Thank you for visiting my notes. I've noticed a large uptick in views lately which is terrific! However, these notes are quite old (2017) and some details may be outdated. I have a more recent class (2023) which covers much of the same information and is hopefully more up-to-date.

If you are using these notes, I'd love to hear from you and how you are using them! Please reach out to jerrick (at) umich (dot) edu.

Disclaimer

As web scraping involves pulling data directly off a website, its replicable success depends entirely on the webpage in question not changing! The last modification to this document was made March 21, 2017, at which point all these scripts worked. Please contact me if any of the pages change and break the code.

Introduction

Web scraping is a general term for any sort of procedure that retrieves data stored on the web. This can be as simple as downloading a csv file that’s hosted online (e.g. read.csv("www.website.com/data.csv")). However, we’ll be discussing slightly more interesting versions, where the data is embedded in a website and we want to extract it. Typically this data will already be in a table or list, though in theory we could work with more complicated structures.

The real trick of web scraping is that there is no silver bullet. Any web scraping procedure will work only with the page it was designed for1. These notes will cover several general approaches which can be used, but if you are scraping your own data, you’ll need to figure out which method is appropriate and implement it, often from scratch.

Manual scraping

The most basic form of scraping (basic in terms of required tools, not simplicity) involves grabbing the entire webpage and extracting the pieces needed. There are two components - obtaining the text of the webpage and extracting only the relevant information. The first piece is very straightforward, but the second can, in some situations, be extremely time-consuming. Modern web design is much more opaque than classic design, not to mention often very poorly done, such that even if the data is clearly laid out on screen, it may not be as clearly laid out in the HTML.

Copy and Pasting

No, this isn’t a joke. If the data is in a clean table, and small enough that you can highlight it all, copying and pasting often produces a text document that can be read in easily. For example, Wikipedia tables are easy to approach this way. Consider Comparison of file systems. If you highlight the General Information table and paste it into a plaintext editor2, you can load it into R with

read.csv("comparison_of_file_systems.txt", sep = "\t")

Note that the format is what’s known as tsv (tab separated values), which is similar to csv (comma separated values). Recall that “\t” is the escape character for tab.
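
As an aside, base R’s read.delim is read.csv’s tab-separated cousin, with sep = "\t" (and header = TRUE) as its defaults, so an equivalent call would be:

read.delim("comparison_of_file_systems.txt")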

If the data is not clearly presented (or the webpage is “modern”), or if the data is at all large, this method won’t work.

Obtaining the webpage

There are various ways to get the HTML that makes up a website. The easiest is with the readLines function. This function is similar to read.csv in that it takes a source of text, but instead of trying to force the results into a table, it stores each line of the file as a separate string, in a character vector.

For this example, we’ll use hockey-reference.com’s list of NHL career leaders in goals, http://www.hockey-reference.com/leaders/goals_career.html.

page <- readLines("http://www.hockey-reference.com/leaders/goals_career.html")
head(page)
## [1] "<!DOCTYPE html>"                                                                                                                            "<html data-version=\"klecko-\" data-root=\"/home/hr/build\" itemscope itemtype=\"http://schema.org/WebSite\" lang=\"en\" class=\"no-js\" >"
## [3] "<head>"                                                                                                                                     "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0, maximum-scale=2.0\" />"
## [5] "    <link rel=\"dns-prefetch\" href=\"https://d2p3bygnnzw9w3.cloudfront.net/req/201703134\" />"                                             "    <link href=\"https://d2p3bygnnzw9w3.cloudfront.net\" rel=\"preconnect\" crossorigin>"

Extracting the relevant information

There are many ways to do this extraction; this is only one. At this point it is merely an exercise in data cleaning, so whatever tools you use to accomplish it are fine.

If you start looking at page, you’ll notice a lot of the beginning information is general web stuff - javascript, web SEO tools, the site header itself. We can drop all of that. The top scoring player is Wayne Gretzky. Let’s find which entry he’s listed at.

library(stringr)
str_which(page, "[Gg]retzky")
## [1]   39   45  504 1267 1518 1627

I search for both “gretzky” and “Gretzky” because occasionally capitalization is due to CSS formatting instead of actual saved capitals. In this case it doesn’t matter, but I include it as a best practice.
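
Alternatively, stringr can handle the case-insensitivity for us via its regex() modifier:

str_which(page, regex("gretzky", ignore_case = TRUE))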

We see that Gretzky was found in 6 places. Let’s look at the entries around each to see which is the right one.

page[37:41]
## [1] "    <meta itemprop=\"name\"          content=\"Hockey Reference\">"
## [2] "    <meta itemprop=\"alternateName\" content=\"HkRef\">"
## [3] "    <meta name=\"Description\" content=\"1. Wayne Gretzky (894), 2. Gordie Howe (801), 3. Jaromir Jagr (763), 4. Brett Hull (741), 5. Marcel Dionne (731), 6. Phil Esposito (717), 7. Mike Gartner (708), 8. Mark Messier (694), 9. Steve Yzerman (692), 10. Mario Lemieux (690)\">"
## [4] "<meta property=\"fb:app_id\"     content=\"212945115395487\">"
## [5] "    <meta property=\"og:url\"          content=\"http://www.hockey-reference.com/leaders/goals_career.html\">"
# page[43:47] Not run to save space, since I know it's in the next one...
page[504:510]
## [1] "</tr></thead><tbody><tr><td class=\"right\">1.</td><td><a href=\"/players/g/gretzwa01.html\">Wayne Gretzky</a>*</td>" "<td>1979-99</td><td class=\"right\">894</td>"                                                                         "</tr>"
## [4] "<tr><td class=\"right\">2.</td><td><a href=\"/players/h/howego01.html\">Gordie Howe</a>*</td>"                        "<td>1946-80</td><td class=\"right\">801</td>"                                                                         "</tr>"
## [7] "<tr><td class=\"right\">3.</td><td><strong><a href=\"/players/j/jagrja01.html\">Jaromir Jagr</a></strong></td>"

Let’s do the same thing with the last entry, Keith Primeau:

str_which(page, "[Pp]rimeau")
## [1] 1251
page[1251:1253]
## [1] "<tr><td class=\"right\">&nbsp</td><td><a href=\"/players/p/primeke02.html\">Keith Primeau</a></td>" "<td>1990-06</td><td class=\"right\">266</td>"                                                       "</tr>"

Now we can see that all the data we need is between rows 504 and 1252.

page <- page[504:1252]

Let’s look at the data in more detail.

page[1:9]
## [1] "</tr></thead><tbody><tr><td class=\"right\">1.</td><td><a href=\"/players/g/gretzwa01.html\">Wayne Gretzky</a>*</td>" "<td>1979-99</td><td class=\"right\">894</td>"                                                                         "</tr>"
## [4] "<tr><td class=\"right\">2.</td><td><a href=\"/players/h/howego01.html\">Gordie Howe</a>*</td>"                        "<td>1946-80</td><td class=\"right\">801</td>"                                                                         "</tr>"
## [7] "<tr><td class=\"right\">3.</td><td><strong><a href=\"/players/j/jagrja01.html\">Jaromir Jagr</a></strong></td>"       "<td>1990-17</td><td class=\"right\">763</td>"                                                                         "</tr>"

We note a few things.

  • Each entry in the table is three strings.
  • The third string (</tr>) in each entry is useless.
  • The first string in each entry contains the ranking (useless) and the player’s name.
  • The second string in each entry contains the years played and the total goals.
  • Aside from the data itself (and the very first entry), each string across entries is identical.

So this suggests a very natural strategy:

  1. Drop the useless third string.
  2. Separate the first and second strings.
  3. Remove the common text.
  4. Split into the proper three variables.

Drop the useless third string

Here are two easy ways to do this.

page2 <- page[seq_len(length(page)) %% 3 > 0]
# WRONG - see below
#page3 <- str_subset(page, "[^</tr>$]") 
#identical(page2, page3)
#page <- page2

The first approach creates a logical vector which is TRUE whenever the entry modulo 3 is non-zero, then extracts just those rows.
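
On a small scale, the modulo pattern looks like this:

seq_len(9) %% 3
## [1] 1 2 0 1 2 0 1 2 0
seq_len(9) %% 3 > 0
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE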

The second approach uses a regular expression which does not match </tr> at the end of a line. Note:

  • The $ to check the end of a line is necessary because the first string of each entry starts with </tr> as well.
  • None of <, / or > are special characters, so they don’t need to be escaped, though it wouldn’t hurt to escape them to be safe - "[^\\<\\/tr\\>$]" would work as well (but don’t escape ^, which means “not”, or $, which means end of line, since both need to act as special characters rather than literal strings).

Update: The 2nd version as above won’t work. Regular expressions “can’t”3 do inverse matching, that is, matching NOT a string. They can match “not a character” (e.g. "[^a]" matches any string containing at least one non-“a” character).

The correct 2nd version is to use str_detect and !:

page3 <- page[!str_detect(page, "^</tr>$")]
identical(page2, page3)
## [1] TRUE
page <- page2
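
For the curious, the ICU regex engine behind stringr does support the negative lookahead mentioned in footnote 3; run on the original unfiltered vector, this sketch should give the same result:

page4 <- str_subset(page, "^(?!</tr>$)")  # keep lines that are NOT exactly "</tr>"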

End Update

Separate the first and second strings

We have alternating rows, so cast it as a matrix, filling it in by row.

pagedata <- matrix(page, ncol = 2, byrow = TRUE)
head(pagedata)
##      [,1]                                                                                                                   [,2]
## [1,] "</tr></thead><tbody><tr><td class=\"right\">1.</td><td><a href=\"/players/g/gretzwa01.html\">Wayne Gretzky</a>*</td>" "<td>1979-99</td><td class=\"right\">894</td>"
## [2,] "<tr><td class=\"right\">2.</td><td><a href=\"/players/h/howego01.html\">Gordie Howe</a>*</td>"                        "<td>1946-80</td><td class=\"right\">801</td>"
## [3,] "<tr><td class=\"right\">3.</td><td><strong><a href=\"/players/j/jagrja01.html\">Jaromir Jagr</a></strong></td>"       "<td>1990-17</td><td class=\"right\">763</td>"
## [4,] "<tr><td class=\"right\">4.</td><td><a href=\"/players/h/hullbr01.html\">Brett Hull</a>*</td>"                         "<td>1986-06</td><td class=\"right\">741</td>"
## [5,] "<tr><td class=\"right\">5.</td><td><a href=\"/players/d/dionnma01.html\">Marcel Dionne</a>*</td>"                     "<td>1971-89</td><td class=\"right\">731</td>"
## [6,] "<tr><td class=\"right\">6.</td><td><a href=\"/players/e/esposph01.html\">Phil Esposito</a>*</td>"                     "<td>1963-81</td><td class=\"right\">717</td>"

Remove the common text

Let’s consider the first column, which contains the player’s name.

pagedata[1:3, 1]
## [1] "</tr></thead><tbody><tr><td class=\"right\">1.</td><td><a href=\"/players/g/gretzwa01.html\">Wayne Gretzky</a>*</td>" "<tr><td class=\"right\">2.</td><td><a href=\"/players/h/howego01.html\">Gordie Howe</a>*</td>"                        "<tr><td class=\"right\">3.</td><td><strong><a href=\"/players/j/jagrja01.html\">Jaromir Jagr</a></strong></td>"

We see that the first row has some extraneous tags (it’s not uncommon for the first and/or last rows to have a bit extra, so make sure to check this). Regardless, the player’s name is always between html\"> and </a>. We can use str_replace to replace those bits with blanks.

pagedata[, 1] <- str_replace(pagedata[, 1], "^[:print:]+html\">", "")
pagedata[, 1] <- str_replace(pagedata[, 1], "</a>[:print:]+$", "")
head(pagedata[, 1])
## [1] "Wayne Gretzky" "Gordie Howe"   "Jaromir Jagr"  "Brett Hull"    "Marcel Dionne" "Phil Esposito"

Using ^ and $ to mark the beginning and end of the lines here isn’t necessary, but it provides a bit of protection against accidentally matching in the middle of a string.

Alternatively, we could have done this by removing all the <...> tags:

pagedata2 <- matrix(page, ncol = 2, byrow = TRUE)
tmp <- str_replace_all(pagedata2[, 1], "<[^<>]+>", "")
# head(str_replace_all(tmp, "\\*|[0-9].", "")) # WRONG, see update

Update: A student pointed out that this actually isn’t identical. There are a few ties, such as 22 and 23. When this occurs, the page displays 22. for the first player and a blank for the second. If you look at the code, it’s not actually a blank, but &nbsp, which is a Non-Breaking SPace.

Also, there’s a mistake: I want to match multiple digits, and I need to escape the . (an unescaped . matches any character, including non-printable ones [such as \t for tab], so it is a more inclusive version of [:print:]). A fixed version of the line is

tmp <- str_replace_all(tmp, "\\*|[0-9]+\\.|&nbsp", "")
identical(tmp, pagedata[, 1])
## [1] TRUE

End Update

The first line matches any <...> tags. By default regular expressions are “greedy”, so <.+> would have matched the entire string! By requiring that the characters between them are non-<>, we restrict to the actual <...> tags. The second line removes the extra information: the rank (listed as 1.), the &nbsp placeholder for ties, and the * which marks Hall of Fame members.
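
To see the difference greediness makes, compare the two patterns on a single entry:

str_replace("<td>894</td>", "<.+>", "")      # greedy: .+ swallows everything
## [1] ""
str_replace("<td>894</td>", "<[^<>]+>", "")  # confined to a single tag
## [1] "894</td>"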

The second column is much the same, except with the added complication that we want to extract two pieces of information from it.

pagedata[, 2] <- str_replace(pagedata[, 2], "^<td>", "")
pagedata[, 2] <- str_replace(pagedata[, 2], "</td>[:print:]+\">", ",")
pagedata[, 2] <- str_replace(pagedata[, 2], "</td>$", "")

I replace the middle piece with a comma to make my next step a bit easier, …

Split into the proper three variables

… so I can use str_split directly.

year_goals <- str_split(pagedata[, 2], ",")
goals <- data.frame(player = pagedata[, 1],
                    years = sapply(year_goals, "[", 1),
                    goals = as.numeric(sapply(year_goals, "[", 2)),
                    stringsAsFactors = FALSE)
str(goals)
## 'data.frame':    250 obs. of  3 variables:
##  $ player: chr  "Wayne Gretzky" "Gordie Howe" "Jaromir Jagr" "Brett Hull" ...
##  $ years : chr  "1979-99" "1946-80" "1990-17" "1986-06" ...
##  $ goals : num  894 801 763 741 731 717 708 694 692 690 ...
head(goals)
##          player   years goals
## 1 Wayne Gretzky 1979-99   894
## 2   Gordie Howe 1946-80   801
## 3  Jaromir Jagr 1990-17   763
## 4    Brett Hull 1986-06   741
## 5 Marcel Dionne 1971-89   731
## 6 Phil Esposito 1963-81   717
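
As an aside, str_split() can return a character matrix directly via simplify = TRUE, which avoids the sapply() calls; a sketch:

year_goals2 <- str_split(pagedata[, 2], ",", simplify = TRUE)
goals2 <- data.frame(player = pagedata[, 1],
                     years = year_goals2[, 1],
                     goals = as.numeric(year_goals2[, 2]),
                     stringsAsFactors = FALSE)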

A second example

Presented with minimal comments: extracting the National Occupational Employment and Wage Estimates table from https://www.bls.gov/oes/current/oes_nat.htm.

page <- readLines("https://www.bls.gov/oes/current/oes_nat.htm")

# Since there are more columns than before, let's extract and save the variable names
str_which(page, "[Oo]ccupation [Cc]ode")
## [1] 745
str_which(page, "[Mm]ean [Ww]age [RSE|rse]")
## [1] 754
colnames <- page[745:754]
str_which(page, "All Occupations")
## [1] 717 772

# The first hit is in the links to subtables, so the first data row is 772
page[768:790]
##  [1] "</thead>"                                                                                      "<tbody>"                                                                                       "<tr>"                                                                                          "  <td class=\"bold\">00-0000</td>"
##  [5] "  <td class=\"bold\" id=\"00-0000\">All Occupations</td>"                                      "  <td>total</td>"                                                                              "  <td>137,896,660</td>"                                                                        "  <td>0.1%</td>"
##  [9] "  <td>1000.000</td>"                                                                           "  <td>$17.40</td>"                                                                             "  <td>$23.23</td>"                                                                             "  <td>$48,320</td>"
## [13] "  <td>0.1%</td>"                                                                               "</tr>"                                                                                         "<tr>"                                                                                          "  <td class=\"bold\">11-0000</td>"
## [17] "  <td class=\"bold\" id=\"11-0000\"><a HREF=\"oes110000.htm\">Management Occupations</a></td>" "  <td>major</td>"                                                                              "  <td>6,936,990</td>"                                                                          "  <td>0.2%</td>"
## [21] "  <td>50.306</td>"                                                                             "  <td>$47.38</td>"                                                                             "  <td>$55.30</td>"

# Going to start with 770 so each data point has the same entries (capturing 
# the "<tr>" and its following uninformative row)
str_which(page, "[Mm]aterial [Mm]oving [Ww]orkers, [Aa]ll [Oo]ther")
## [1] 13828
page[13828:13837]
##  [1] "<td><a HREF=\"oes537199.htm\">Material Moving Workers, All Other</a></td>" "  <td>detail</td>"                                                         "  <td>21,740</td>"                                                         "  <td>4.0%</td>"                                                           "  <td>0.158</td>"
##  [6] "  <td>$14.60</td>"                                                         "  <td>$17.82</td>"                                                         "  <td>$37,060</td>"                                                        "  <td>2.8%</td>"                                                           "</tr>"
page <- page[770:13837]

# Drop "<tr>" and "</tr>" lines (using str_detect as in the update above;
# a [^...] character class would only match these lines by accident).
page <- page[!str_detect(page, "^<[/]?tr>$")]

# Get rid of all <td> tags
page <- str_replace_all(str_trim(page), "<[/]?td[^>]*>", "")

# 10 columns of data
mat <- matrix(page, ncol = 10, byrow = TRUE)

# Columns 7 and 8 sometimes have (4) or (5), which correspond to special 
# cases; I'll replace them with the string "NA" in this context.
mat[, 7:8] <- str_replace(mat[, 7:8], "^[:print:]+href[:print:]+$", "NA")
# These become real NAs under as.numeric, making them easier to deal with later.

# Column 2 is sometimes a link
mat[, 2] <- str_replace_all(mat[, 2], "<[/]?a[^>]*>", "")

# Convert to data.frame, then to numeric.
saldata <- data.frame(mat, stringsAsFactors = FALSE)
saldata$X4 <- as.numeric(str_replace_all(saldata$X4, ",", ""))
saldata$X5 <- as.numeric(str_replace(saldata$X5, "%", ""))/100
saldata$X6 <- as.numeric(saldata$X6)
saldata$X7 <- as.numeric(str_replace_all(saldata$X7, "\\$", ""))
## Warning: NAs introduced by coercion
saldata$X8 <- as.numeric(str_replace_all(saldata$X8, "\\$", ""))
## Warning: NAs introduced by coercion
saldata$X9 <- as.numeric(str_replace_all(saldata$X9, "[\\$|,]", ""))
## Warning: NAs introduced by coercion
saldata$X10 <- as.numeric(str_replace(saldata$X10, "%", ""))/100

# Get the names
colnames <- str_replace_all(colnames, "<[/]?th>", "")
colnames[2] <- str_replace(colnames[2], " \\([:print:]+\\)", "")
names(saldata) <- colnames
str(saldata)
## 'data.frame':    1089 obs. of  10 variables:
##  $ Occupation code          : chr  "00-0000" "11-0000" "11-1000" "11-1011" ...
##  $ Occupation title         : chr  "All Occupations" "Management Occupations" "Top Executives" "Chief Executives" ...
##  $ Level                    : chr  "total" "major" "minor" "detail" ...
##  $ Employment               : num  1.38e+08 6.94e+06 2.44e+06 2.39e+05 2.15e+06 ...
##  $ Employment RSE           : num  0.001 0.002 0.002 0.007 0.003 0.013 0.005 0.026 0.006 0.01 ...
##  $ Employment per 1,000 jobs: num  1000 50.31 17.69 1.73 15.56 ...
##  $ Median hourly wage       : num  17.4 47.4 48.5 84.2 47 ...
##  $ Mean hourly wage         : num  23.2 55.3 59.7 89.3 57.4 ...
##  $ Annual mean wage         : num  48320 115020 124210 185850 119460 ...
##  $ Mean wage RSE            : num  0.001 0.001 0.002 0.004 0.002 0.012 0.003 0.017 0.004 0.006 ...
head(saldata)
##   Occupation code                Occupation title  Level Employment Employment RSE Employment per 1,000 jobs Median hourly wage Mean hourly wage Annual mean wage Mean wage RSE
## 1         00-0000                 All Occupations  total  137896660          0.001                  1000.000              17.40            23.23            48320         0.001
## 2         11-0000          Management Occupations  major    6936990          0.002                    50.306              47.38            55.30           115020         0.001
## 3         11-1000                  Top Executives  minor    2439900          0.002                    17.694              48.53            59.71           124210         0.002
## 4         11-1011                Chief Executives detail     238940          0.007                     1.733              84.19            89.35           185850         0.004
## 5         11-1021 General and Operations Managers detail    2145140          0.003                    15.556              46.99            57.44           119460         0.002
## 6         11-1031                     Legislators detail      55820          0.013                     0.405                 NA               NA            42530         0.012

Using rvest

Now that we’ve gone over all that, let’s replicate the results in just a few lines of code. The package rvest by Hadley Wickham automates a lot of this. Note though that a) it is not infallible, so you still might need the manual approach above, and b) it will not do the cleaning we did above.

library(rvest)

When rvest works, the basic workflow is

  1. Grab the website using read_html.
  2. Find the correct label for the table you want by either inspecting the source code or using html_nodes(..., "table").
  3. Extract the table you want using html_nodes(..., "table name")[#].
  4. Convert to a data frame using html_table. (This returns a list, so you’ll want html_table(...)[[1]].)

Hockey (again)

Let’s replicate the hockey results.

page <- read_html("http://www.hockey-reference.com/leaders/goals_career.html")

From looking at the page, we expect to see two tables.

html_nodes(page, "table")
## {xml_nodeset (2)}
## [1] <table class="suppress_glossary suppress_csv sortable stats_table" id="stats_career_NHL" data-cols-to-freeze="0">\n<caption>NHL Leaders Table</caption>\n<thead><tr>\n<th class="right">Rank</th>\n<th class="left">Player</th>\n<th>Years</th>\n<th>G</th>\n</tr></thead>\n<tbody>\n<tr>\n<td class="right">1.</td>\n<td>\n<a href="/players/g/gretzwa01.html">Wayne Gretzky</a>*</td>\n<td>1979-99</td> ...
## [2] <table class="suppress_glossary suppress_csv sortable stats_table" id="stats_career_WHA" data-cols-to-freeze="0">\n<caption>WHA Leaders Table</caption>\n<thead><tr>\n<th class="right">Rank</th>\n<th class="left">Player</th>\n<th>Years</th>\n<th>G</th>\n</tr></thead>\n<tbody>\n<tr>\n<td class="right">1.</td>\n<td><a href="/players/t/tardima01.html">Marc Tardif</a></td>\n<td>1973-79</td>\n<td ...

Thankfully, this is a simple page, and we see only two tables. The id attribute shows us that we have the NHL and WHA tables, so we can extract either.

nhl <- html_table(html_nodes(page, "table")[1])[[1]]
str(nhl)
## 'data.frame':    250 obs. of  4 variables:
##  $ Rank  : chr  "1." "2." "3." "4." ...
##  $ Player: chr  "Wayne Gretzky*" "Gordie Howe*" "Jaromir Jagr" "Brett Hull*" ...
##  $ Years : chr  "1979-99" "1946-80" "1990-17" "1986-06" ...
##  $ G     : int  894 801 763 741 731 717 708 694 692 690 ...
head(nhl)
##   Rank         Player   Years   G
## 1   1. Wayne Gretzky* 1979-99 894
## 2   2.   Gordie Howe* 1946-80 801
## 3   3.   Jaromir Jagr 1990-17 763
## 4   4.    Brett Hull* 1986-06 741
## 5   5. Marcel Dionne* 1971-89 731
## 6   6. Phil Esposito* 1963-81 717

If desired, we can clean up from here.

Salary data (again)

For the salary data, a similar workflow suffices.

page <- read_html("https://www.bls.gov/oes/current/oes_nat.htm")
html_nodes(page, "table")
## {xml_nodeset (2)}
## [1] <table id="main-content-table"><tr>\n<td id="secondary-nav-td">\r\n\t\t\t<div id="secondary-nav">\r\n\r\r\n<!-- 2COL LAYOUT COL1 END -->\r\r\n\r\r\n<!-- SECONDARY NAV BEGIN -->\t\t\r\r\n<div class="secondary-nav">\r\n  <ul class="level1">\n<li class="menuhead"><a href="/oes/home.htm">BROWSE OES</a></li>\r\n    <li><a href="/oes/home.htm">OES HOME</a></li>\r\n    <li class="submenu">\n<a hre ...
## [2] <table class="display sortable_datatable fixed-headers">\n<thead>\n<tr>\n<th>Occupation code</th>\r\n<th>Occupation title (click on the occupation title to view its profile)</th>\r\n<th>Level</th>\r\n<th>Employment</th>\r\n<th>Employment RSE</th>\r\n<th>Employment per 1,000 jobs</th>\r\n<th>Median hourly wage</th>\r\n<th>Mean hourly wage</th>\r\n<th>Annual mean wage</th>\r\n<th>Mean wage RS ...

There are only two tables listed here. We could examine each and pick the right one, or look at the page source and figure out whether “main-content-table” or “display sortable_datatable fixed-headers” is correct. Either way, the second is what we want.

salary <- html_table(html_nodes(page, "table")[2])[[1]]
str(salary)
## 'data.frame':    1090 obs. of  10 variables:
##  $ Occupation code                                                     : chr  "" "00-0000" "11-0000" "11-1000" ...
##  $ Occupation title (click on the occupation title to view its profile): chr  "" "All Occupations" "Management Occupations" "Top Executives" ...
##  $ Level                                                               : chr  "" "total" "major" "minor" ...
##  $ Employment                                                          : chr  "" "137,896,660" "6,936,990" "2,439,900" ...
##  $ Employment RSE                                                      : chr  "" "0.1%" "0.2%" "0.2%" ...
##  $ Employment per 1,000 jobs                                           : num  NA 1000 50.31 17.69 1.73 ...
##  $ Median hourly wage                                                  : chr  "" "$17.40" "$47.38" "$48.53" ...
##  $ Mean hourly wage                                                    : chr  "" "$23.23" "$55.30" "$59.71" ...
##  $ Annual mean wage                                                    : chr  "" "$48,320" "$115,020" "$124,210" ...
##  $ Mean wage RSE                                                       : chr  "" "0.1%" "0.1%" "0.2%" ...
head(salary)
##   Occupation code Occupation title (click on the occupation title to view its profile)  Level  Employment Employment RSE Employment per 1,000 jobs Median hourly wage Mean hourly wage Annual mean wage Mean wage RSE
## 1                                                                                                                                               NA
## 2         00-0000                                                      All Occupations  total 137,896,660           0.1%                  1000.000             $17.40           $23.23          $48,320          0.1%
## 3         11-0000                                               Management Occupations  major   6,936,990           0.2%                    50.306             $47.38           $55.30         $115,020          0.1%
## 4         11-1000                                                       Top Executives  minor   2,439,900           0.2%                    17.694             $48.53           $59.71         $124,210          0.2%
## 5         11-1011                                                     Chief Executives detail     238,940           0.7%                     1.733             $84.19           $89.35         $185,850          0.4%
## 6         11-1021                                      General and Operations Managers detail   2,145,140           0.3%                    15.556             $46.99           $57.44         $119,460          0.2%

Note that this is looking a lot less clean than the hockey data. There’s a blank row in the front (usually we can pass header = TRUE to html_table, but it doesn’t work here because of the NA in one of the columns), and we need to convert a lot from character to numeric.
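
A minimal sketch of that cleanup, reusing the stringr tools from the manual scrape (column names with spaces need backticks; entries holding footnote links will coerce to NA with a warning, as before):

salary <- salary[-1, ]  # drop the blank first row
salary$Employment <- as.numeric(str_replace_all(salary$Employment, ",", ""))
salary$`Employment RSE` <- as.numeric(str_replace(salary$`Employment RSE`, "%", ""))/100
salary$`Median hourly wage` <- as.numeric(str_replace_all(salary$`Median hourly wage`, "[\\$,]", ""))
salary$`Mean hourly wage` <- as.numeric(str_replace_all(salary$`Mean hourly wage`, "[\\$,]", ""))
salary$`Annual mean wage` <- as.numeric(str_replace_all(salary$`Annual mean wage`, "[\\$,]", ""))
salary$`Mean wage RSE` <- as.numeric(str_replace(salary$`Mean wage RSE`, "%", ""))/100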

Failure of rvest

rvest is not perfect. Here’s an example of a page where it fails. The Goodreads site strives to be the IMDb of books. Here’s a page that lists the most recent history books added to the site: https://www.goodreads.com/genres/new_releases/history. Notice that the list is displayed as images, not text. However, if we examine the HTML closely, we do see the book information stored as text, which appears if you mouse over a book.

rvest fails us here4:

books <- read_html("https://www.goodreads.com/genres/new_releases/history")
html_nodes(books, "table")
## {xml_nodeset (0)}

However, a manual scrape works fine:

page <- readLines("https://www.goodreads.com/genres/new_releases/history")
## Warning in readLines("https://www.goodreads.com/genres/new_releases/history"): incomplete final line found on 'https://www.goodreads.com/genres/new_releases/history'

The warning here doesn’t bother us, as the data we need is stored in the middle of the HTML, not the last line.

Looking at the HTML, we see that each mouseover is inside a <script>, and that each script includes the words “readable bookTitle”, which we can use to locate and count the books.

length(str_which(page, "readable bookTitle"))
## [1] 54

Looking at the mouseover, we can extract the title, author, rating, and number of ratings. The year is irrelevant here because these are all new books, and the description we’ll skip.

books <- str_subset(page, "readable bookTitle")
books <- str_replace(books, "^[:print:]+readable bookTitle[^>]+>", "")
books <- str_replace(books, "<[:print:]+authorName[^>]+>", "@@@")

I use “@@@” as the split between title and author because titles may have commas in them. You can use anything here that you’re sure won’t be in a title.

It’s tricky to clean up around the rating, so let’s instead extract it.

ratings <- as.numeric(str_extract(books, "[0-9]\\.[0-9]{2}"))
numratings <- str_extract(books, "[0-9]+ ratings")
numratings <- as.numeric(str_replace(numratings, " ratings", ""))
books <- str_replace(books, "<[:print:]+$", "")
bookdata <- data.frame(title = sapply(str_split(books, "@@@"), "[", 1),
                       author = sapply(str_split(books, "@@@"), "[", 2),
                       rating = ratings,
                       numrating = numratings,
                       stringsAsFactors = FALSE)
str(bookdata)
## 'data.frame':    54 obs. of  4 variables:
##  $ title    : chr  "Age of Anger: A History of the Present" "Cannibalism: A Perfectly Natural History" "The Book Thieves: The Nazi Looting of Europe&#39;s Libraries and the Race to Return a Literary Inheritance" "The Inkblots: Hermann Rorschach, His Iconic Test, and the Power of Seeing" ...
##  $ author   : chr  "Pankaj Mishra" "Bill Schutt" "Anders Rydell" "Damion Searls" ...
##  $ rating   : num  3.75 4.1 3.93 3.85 3.82 4.28 3.64 4.1 4.14 3.89 ...
##  $ numrating: num  145 229 98 67 179 133 69 94 146 54 ...

Data Spread Over Multiple pages

Often data will not be limited to a single page, but spread across many. This could be a separate page per year/date or just a limit on the number of results listed per page.

As an example, let’s try and collect the salary data for all employees of Tennessee public schools: https://www.tbr.edu/hr/salaries.

We see the first 50 individuals listed here, with no evidence of how to list more individuals on a single page. However, if we go to the second page of results (the pages are listed below the data), we see the URL changes to https://www.tbr.edu/hr/salaries?firstname=&lastname=&department=&jobtitle=&institution=&page=1. This is common in webpage design. Each of those arguments (firstname=, lastname=, etc.) could be non-blank (try it: restrict the results by using lastname=smith).

The most interesting argument for our purposes is page=1. Changing that to page=0 gives us the first page of results5, and clicking to the “last” results gives page=323. We can therefore use rvest to scrape all of these pages in a single loop.

tmp <- lapply(0:323, function(i) {
  url <- str_c("https://www.tbr.edu/hr/salaries?firstname=&lastname=&department=&jobtitle=&institution=&page=", i)
  page <- read_html(url)
  html_table(html_nodes(page, "table"))[[1]]
})
tnsal <- do.call("rbind", tmp)
str(tnsal)
## 'data.frame':    16151 obs. of  7 variables:
##  $ Institution: chr  "Middle TN State University" "Walters State Comm College" "Southwest TN Comm College" "Chattanooga State Comm. Coll." ...
##  $ Last Name  : chr  "Aaron" "Aarons" "Abadie" "Abbott" ...
##  $ First Name : chr  "Joshua" "Andrew" "Cynthia" "Joyce" ...
##  $ Job Title  : chr  "Associate Professor" "Associate Professor" "Associate Professor" "Admissions Records Clerk Pt" ...
##  $ Department : chr  "Management" "Industrial Technology" "Business and Legal Studies" "Customer Response Center" ...
##  $ Salary     : chr  "$100,350" "$59,583" "$47,868" "$21,684" ...
##  $ FTE        : num  1 1 1 0.8 1 1 1 1 1 1 ...
tail(tnsal, 1)
##        Institution Last Name First Name                      Job Title       Department  Salary FTE
## 16151 TCAT Dickson   Zwingle    Cynthia Administrative Support Assoc 6 Student Services $40,825   1

Note that this can be quite slow if the website you are loading is slow (even an imperceptible sluggishness on the site will blow up here; each page takes just short of 800ms to load, which over 324 pages adds up to more than 4 minutes).

APIs and JSON

API

An API (Application Programming Interface) is a set of methods for accessing data which is not publicly available as a complete data set. For example, if you wanted access to Google search results, you’d use the Google JSON/Atom Custom Search API, or if you wanted a list of Yelp businesses, you’d use the Yelp Fusion API. These are most commonly used by programmers (e.g. if you use any third-party apps to connect to resources such as Facebook or Twitter, the developers of those apps use APIs to access your information), but they are also useful for extracting data.

A lot of APIs are private - either they’re only available to people who are authorized to use them, or they’re used internally by development teams. However, a good number are public, though they may require purchasing access or at least registering to get an API key. There’s a useful list of public APIs maintained by Todd Motto. A lot of the truly open APIs are maintained by fans and volunteers rather than officially released by the companies themselves.

We’ve technically already seen a (basic and informal) API. The Tennessee salary url, “https://www.tbr.edu/hr/salaries?firstname=&lastname=&department=&jobtitle=&institution=&page=1” is an example. We could create an R function to generate that URL and extract just the data we need:

scrapeTN <- function(firstname = "",
                     lastname = "",
                     department = "",
                     jobtitle = "",
                     institution = "",
                     page = 1) {
  url <- str_c("https://www.tbr.edu/hr/salaries?",
               "firstname=", firstname,
               "&lastname=", lastname,
               "&department=", department,
               "&jobtitle=", jobtitle,
               "&institution=", institution,
               "&page=", page)
  html <- read_html(url)  # avoid overwriting the page argument
  html_table(html_nodes(html, "table"))[[1]]
}
scrapeTN(lastname = "Horton")
##                  Institution  Last Name First Name                     Job Title                         Department  Salary FTE
## 1  East Tennessee State Univ Fox-Horton      Julie           Assistant Professor                    Degree Programs $46,000   1
## 2      University of Memphis     Horton      Cathy             Cares Counselor I Ctr Acad Retention and Enrich Srvs $38,250   1
## 3      University of Memphis     Horton    Stephen            Research Scientist                               CERI $58,100   1
## 4      University of Memphis     Horton      Tracy   Asst Dir Empl Rel And Engmt                    Human Resources $61,200   1
## 5      University of Memphis     Horton    Whitney    Undergraduate Adms Analyst                         Admissions $27,540   1
## 6 Jackson State Comm College     Horton      Sonya Clerk, Learning Resource Cent             Technology Access Fees $21,419   1
## 7      TN Technological Univ     Horton      Laura                   Coordinator                Disability Services $35,540   1
## 8 Tennessee State University     Horton     Dianne             Records Associate                            Records $28,053   1

Rate Limits

Note that a lot of APIs have rate limits - a cap on the number of requests you can send in a particular window of time (e.g. 100 requests per hour). Before spending a lot of time scraping, make sure you know the rate limit and structure your requests appropriately. You may need to space your scraping out over a couple of days.
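
One simple way to space requests out is Sys.sleep. For example, revisiting the Tennessee loop with a pause between requests (the one-second delay is illustrative; match it to the actual limit):

tmp <- lapply(0:323, function(i) {
  Sys.sleep(1)  # pause between requests
  url <- str_c("https://www.tbr.edu/hr/salaries?firstname=&lastname=",
               "&department=&jobtitle=&institution=&page=", i)
  html_table(html_nodes(read_html(url), "table"))[[1]]
})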

JSON

A lot of formal APIs return data in a format known as JSON (JavaScript Object Notation). It is very similar to lists in R. An example modified from Wikipedia:

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "children": [],
  "spouse": null
}

We don’t have to parse this ourselves, thanks to the jsonlite package by Jeroen Ooms and others. We will obtain the JSON data from an API query and use the fromJSON function.
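
For instance, fromJSON accepts a JSON string (or a URL) and returns the corresponding R structure:

library(jsonlite)
person <- fromJSON('{"firstName": "John", "isAlive": true, "age": 25}')
person$firstName
## [1] "John"
person$isAlive
## [1] TRUE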

Example - Magic: The Gathering Cards

Let’s look at a complete example. I’ve restricted the examples here to completely open APIs, so the sources are somewhat limited.

Magic: The Gathering is a “trading card game” that’s been around since the 1990s. It involves two players each crafting a deck of cards, from which a hand is drawn; on each turn a player can play creatures or spells to try to reduce their opponent’s life total. There are over 15,000 different cards in print.

The Magic: The Gathering Developers site http://magicthegathering.io/ provides an unofficial API to the cards. The general URL is “https://api.magicthegathering.io/v1/cards”, followed by query parameters such as “power=gt5” for cards with attack power greater than 5. The full list of query parameters is available in the documentation.

Building an appropriate query can be done exactly as before. The only change is that we pass the results through fromJSON rather than read_html. Here we obtain a list of all Green6 creatures with mana cost 17 that are rare8 and banned9.

url <- str_c("http://api.magicthegathering.io/v1/cards?",
             "legality=banned&",
             "type=Creature&",
             "rarity=rare&",
             "colors=green&",
             "cmc=1")
library(jsonlite)
mtgbanned <- fromJSON(url)[[1]]
dim(mtgbanned)
## [1] 56 32
mtgbanned[1,1:10]
##                name manaCost cmc colors colorIdentity            type    types subtypes rarity set
## 1 Birds of Paradise      {G}   1  Green             G Creature — Bird Creature     Bird   Rare LEA

If you read the documentation, you will see that the rate limit is 5000 requests per hour, which is plenty high, but each request is limited to 100 cards returned. Similar to the Tennessee salary example, to obtain them all you’d need to iterate over the query parameter page= until the returned entry was empty:

url <- str_c(url, "&page=1000")
fromJSON(url)[[1]]
## list()
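
A sketch of that iteration, rebuilding the query URL without a fixed page= (this assumes each page returns the same columns, so the rows can be stacked):

baseurl <- str_c("http://api.magicthegathering.io/v1/cards?",
                 "legality=banned&type=Creature&rarity=rare&colors=green&cmc=1")
pages <- list()
i <- 1
repeat {
  res <- fromJSON(str_c(baseurl, "&page=", i))[[1]]
  if (length(res) == 0) break  # an empty page means we've seen everything
  pages[[i]] <- res
  i <- i + 1
}
mtgall <- do.call("rbind", pages)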

Example - Currency conversion

The site fixer.io provides an API to obtain daily currency conversion rates. The URL differs depending on the date desired; current results are at “http://api.fixer.io/latest” whereas historical data is at “http://api.fixer.io/2005-04-20”. Query parameters include base= to set the comparison currency (the default is “EUR”). Note in the example below that requesting a weekend date (2013-10-20 was a Sunday) returns rates from the most recent prior business day.

f <- fromJSON("http://api.fixer.io/2013-10-20?base=USD")
f[[1]]
## [1] "USD"
f[[2]]
## [1] "2013-10-18"
head(f[[3]])
## $AUD
## [1] 1.0357
##
## $BGN
## [1] 1.4293
##
## $BRL
## [1] 2.152
##
## $CAD
## [1] 1.0296
##
## $CHF
## [1] 0.902
##
## $CNY
## [1] 6.097

  1. Almost Surely in my experience

  2. NOT Word

  3. It does via something called “negative lookahead” which is far more complicated than we need for this context, so we’ll just say it’s true that it can’t.

  4. It may be possible that there is a complicated secondary way to use rvest, specifically html_text. But at that point, rvest provides no benefit over doing it manually since we’re still cleaning up everything ourselves.

  5. Like a lot of computer programming, this is using zero-based numbering.

  6. There are five “colors of magic”: red, green, blue, black and white.

  7. Each card costs “mana” to play.

  8. In terms of how common the card is when opening packs.

  9. Not legal for tournament play - generally overpowered cards.

Josh Errickson