A relational database is a set of rectangular data frames called tables linked by keys relating one table to another. Software implementations of such data structures are known as relational database management systems (RDBMS). Most RDBMS use structured query language or SQL (“sequel” or “S-Q-L”) to modify or search the relational database.
Here is an example of a relational database. “Primary” keys are ones which uniquely identify rows of a particular table; “foreign” keys simply refer to “primary” keys in other tables. A key can contain multiple variables.
SQL provides a syntax for interfacing with relational data. It is largely a declarative language in that we use SQL to specify what we wish to accomplish, leaving the how to the RDBMS. While there are standards for SQL implementations put out by the International Organization for Standardization (ISO) and the American National Standards Institute (ANSI), there are several open source and commercial implementations that each have unique features.
I will try to focus on the commonalities, but will be using an SQLite engine in R for providing examples. One unique feature of SQLite is that it does not follow the client-server model. In this model, a physical computer storing the data and executing queries within the RDMBS, the server, is separate from the machine requesting the queries known as the client.
The client-server model is popular in business, health care, and other domains as it allows security and monitoring of how the data is queried. It is also popular for many large open data projects (i.e ensemble) where it is beneficial for data to be centrally maintained and frequently accessed on the fly.
For the examples in class, we will use SQL to access smaller data-sets for which there are more efficient approaches. In real world scenarios, databases can be extremely large (multiple gigabytes or larger) that would be difficult to access directly in R.
There are a wild number of SQL-related packages in R - some for connecting to databases, some for sending SQL queries. We will primarily use the “DBI” package. There may be others which are better for interfacing with particular SQL databases, but generally the SQL syntax should not differ across packages.
The RSQLite package contains the backend required to load up an SQLite database. The DBI package interfaces with any database, in this case the SQLite data.
# Packageslibrary(DBI) # For interfacing with a database# Import the SQLite database of the Lahman datalahman <-dbConnect(RSQLite::SQLite(), "data/lahman_1871-2022.sqlite")lahman
The basic structure of a SQL query contains a SELECT statement indicating which columns are desired and a FROM clause explaining where to find them (as we saw above).
dbGetQuery(lahman, "SELECT playerID FROM Batting LIMIT 5")
The string, "SELECT playerID FROM Batting LIMIT 5" is an SQL query. SELECT is the statement, and it has the basic syntax of
SELECT var1, var2 FROMtable
We add an additional clause, LIMIT 5 to reduce the amount of output. This is good practice when developing new queries as it prevents large wait times only to discover a bug.
You can use a wild card * to select all columns in a table:
dbGetQuery(lahman, "SELECT * FROM Batting LIMIT 5")
playerID yearID stint teamID lgID G G_batting AB R H 2B 3B HR RBI SB CS BB
1 aardsda01 2004 1 SFN NL 11 NA 0 0 0 0 0 0 0 0 0 0
2 aardsda01 2006 1 CHN NL 45 NA 2 0 0 0 0 0 0 0 0 0
3 aardsda01 2007 1 CHA AL 25 NA 0 0 0 0 0 0 0 0 0 0
4 aardsda01 2008 1 BOS AL 47 NA 1 0 0 0 0 0 0 0 0 0
5 aardsda01 2009 1 SEA AL 73 NA 0 0 0 0 0 0 0 0 0 0
SO IBB HBP SH SF GIDP G_old
1 0 0 0 0 0 0 NA
2 0 0 0 1 0 0 NA
3 0 0 0 0 0 0 NA
4 1 0 0 0 0 0 NA
5 0 0 0 0 0 0 NA
Note that by convention, the keywords in SQL queries are capitalized, but SQL is itself not case sensitive, so this works as well:
dbGetQuery(lahman, "select * from batting limit 5")
playerID yearID stint teamID lgID G G_batting AB R H 2B 3B HR RBI SB CS BB
1 aardsda01 2004 1 SFN NL 11 NA 0 0 0 0 0 0 0 0 0 0
2 aardsda01 2006 1 CHN NL 45 NA 2 0 0 0 0 0 0 0 0 0
3 aardsda01 2007 1 CHA AL 25 NA 0 0 0 0 0 0 0 0 0 0
4 aardsda01 2008 1 BOS AL 47 NA 1 0 0 0 0 0 0 0 0 0
5 aardsda01 2009 1 SEA AL 73 NA 0 0 0 0 0 0 0 0 0 0
SO IBB HBP SH SF GIDP G_old
1 0 0 0 0 0 0 NA
2 0 0 0 1 0 0 NA
3 0 0 0 0 0 0 NA
4 1 0 0 0 0 0 NA
5 0 0 0 0 0 0 NA
To obtain the number of rows in a table, we can use the COUNT() function:
dbGetQuery(lahman, "SELECT COUNT(*) FROM Batting")
[1] "AllstarFull - 5516 x 8"
[1] "Appearances - 112106 x 21"
[1] "AwardsManagers - 179 x 6"
[1] "AwardsPlayers - 6531 x 6"
[1] "AwardsShareManagers - 425 x 7"
[1] "AwardsSharePlayers - 6879 x 7"
[1] "Batting - 112184 x 24"
[1] "BattingPost - 16374 x 22"
[1] "CollegePlaying - 17350 x 3"
[1] "Fielding - 149365 x 18"
[1] "FieldingOF - 12028 x 6"
[1] "FieldingOFsplit - 35315 x 18"
[1] "FieldingPost - 15540 x 17"
[1] "HallOfFame - 4323 x 9"
[1] "HomeGames - 3200 x 9"
[1] "Managers - 3718 x 10"
[1] "ManagersHalf - 93 x 10"
[1] "Parks - 255 x 7"
[1] "People - 20676 x 25"
[1] "Pitching - 50402 x 30"
[1] "PitchingPost - 6538 x 30"
[1] "Salaries - 26428 x 5"
[1] "Schools - 1207 x 5"
[1] "SeriesPost - 378 x 9"
[1] "Teams - 3015 x 48"
[1] "TeamsFranchises - 120 x 4"
[1] "TeamsHalf - 52 x 10"
Limiting Clauses
Aside from LIMIT, there are more nuanced and powerful ways of extracting specific entries.
WHERE
We can use Boolean comparisons in a WHERE clause as shown in the example below. We find all player-seasons since 2000 in which the player was credited with an RBI 100 or more times.
(Note that the spacing here is stylistic - in both SQL and dbGetQuery splitting onto multiple lines and aligning on the first space do not affect the query, but do make it easier to read!)
We of course could use LIMIT instead of head() to keep it entirely inside the query. In addition, LIMIT will be much faster than head():
library(microbenchmark)microbenchmark(head =head(dbGetQuery(lahman, "SELECT playerID FROM batting")),limit =dbGetQuery(lahman, "SELECT playerID FROM batting LIMIT 6"))
Warning in microbenchmark(head = head(dbGetQuery(lahman, "SELECT playerID FROM
batting")), : less accurate nanosecond times to avoid potential integer
overflows
Unit: microseconds
expr min lq mean median uq max neval cld
head 13734.63 13792.9330 13995.0458 13844.7980 13914.396 16649.198 100 a
limit 101.27 102.8485 108.7402 104.9805 108.076 175.152 100 b
IN
To select on a column by testing against a set of fixed values use IN. We also rename the columns as we gather them, leaving the original data alone
SELECT nameGiven AS given, nameLast ASlast, birthYearFROMmasterWHERE nameLast IN ('Alou', 'Griffey')
Use single quotations (') instead of double (") - in some flavors of SQL, double quotes have different meaning (if they’re used at all), whereas it is consistent that single quotes start and end strings. In addition, this avoids conflict if you use double quotes to start the string in R, and single quotes in the actual query.
dbGetQuery(lahman, "SELECT nameGiven AS given, nameLast AS last, birthYear FROM People WHERE nameLast IN ('Alou', 'Griffey')")
given last birthYear
1 Felipe Rojas Alou 1935
2 Jesus Maria Rojas Alou 1942
3 Mateo Rojas Alou 1938
4 Moises Rojas Alou 1966
5 George Kenneth Griffey 1950
6 George Kenneth Griffey 1969
LIKE
Use a LIKE predicate with a WHERE clause to get partial string matching. You can use % to match any sub-string.
SELECT nameGiven AS given, nameLast ASlast, birthYearFROM PeopleWHERE nameLast LIKE'%riff%'
## Find all players with last name containing a 'riff' sub-stringdbGetQuery(lahman, "SELECT nameGiven AS given, nameLast AS last, birthYear FROM People WHERE nameLast LIKE '%riff%' LIMIT 5")
given last birthYear
1 Arthur Joseph Griffin 1988
2 Alfredo Claudino Griffin 1957
3 Bartholomew Joseph Griffith 1896
4 Clark Calvin Griffith 1869
5 Robert Derrell Griffith 1943
Most SQL implementations also have a REGEXP or REGEXLIKE function that works with regular expressions, but SQLite requires a user definedregex() for its use so we skip it here.
Combining limiting clauses
Limiting WHERE clauses can be combined using AND and OR. Clauses can be negated using NOT.
## Find all players born in Puerto Rico during the 1990'sdbGetQuery(lahman, "SELECT nameGiven AS given, nameLast AS last, birthYear, birthCountry FROM People WHERE birthCountry == 'P.R.' AND birthYear LIKE '199%' LIMIT 5")
given last birthYear birthCountry
1 Ednel Javier Baez 1992 P.R.
2 Jose Orlando Berrios 1994 P.R.
3 Victor Manuel Caratini 1993 P.R.
4 Willi Rafael Castro 1997 P.R.
5 Alexander Claudio 1992 P.R.
## Find all players from Hawaii or AlaskadbGetQuery(lahman, "SELECT nameGiven AS given, nameLast AS last, birthYear, birthState FROM People WHERE birthState == 'HI' OR birthState == 'AK' LIMIT 5")
given last birthYear birthState
1 Benny Peter Agbayani 1971 HI
2 Dustin Kamakana Mai Ku'u Makualani Antolin 1989 HI
3 Anthony Lee Barnette 1983 AK
4 Chad Robert Bentz 1980 AK
5 Douglas Edmund Capilla 1952 HI
## Find all players from Hawaii or Alaska that aren't born in 1971dbGetQuery(lahman, "SELECT nameGiven AS given, nameLast AS last, birthYear, birthState FROM People WHERE (birthState == 'HI' OR birthState == 'AK') AND NOT birthyear == 1971 LIMIT 5")
given last birthYear birthState
1 Dustin Kamakana Mai Ku'u Makualani Antolin 1989 HI
2 Anthony Lee Barnette 1983 AK
3 Chad Robert Bentz 1980 AK
4 Douglas Edmund Capilla 1952 HI
5 Shawn Anthony Chacon 1977 AK
ORDER BY
Use an ORDER BY clause with a comma separated list of columns to arrange the table.
## Get all 100+ RBI seasons since 2010, orderedrbi100 <-dbGetQuery(lahman, "SELECT playerID, yearID, RBI FROM batting WHERE RBI >= 100 AND yearID >= 2010 ORDER BY yearID, -RBI")head(rbi100)
We can perform aggregations such as sums, means, and counts by using a GROUP BY clause.
Here we find the players with the most total RBI since 2010.
SELECT playerID, SUM(RBI) AS rbi_totalFROM BattingWHERE yearID >=2010GROUPBY playerIDORDERBY-rbi_total
## Count total RBIs since 2010 by playerdbGetQuery(lahman, "SELECT playerID, SUM(RBI) AS rbi_total FROM Batting WHERE yearID >= 2010 GROUP BY playerID ORDER BY -rbi_total LIMIT 10")
The operator defining a limiting clause on an aggregate variable is HAVING. It is essentially like WHERE except for operating on summary statistics rather than individual rows. In other words, HAVING refers to the output table specified in SELECT rather than the input table(s) specified using FROM.
In the query below, observe that the HAVING clause comes after the GROUP BY but before the ORDER BY.
## Players with 1000+ RBIs since 2010dbGetQuery(lahman, "SELECT playerID, SUM(RBI) AS rbi_total FROM Batting WHERE yearID >= 2010 GROUP BY playerIDHAVING rbi_total >= 1000 ORDER BY -rbi_total")
So far we have discussed working with single tables only. The SQL term for merging data from two or more tables is a ‘join’. All joins are based on the idea of equating rows that match on one or more variables. Here’s a nice visualization of the different types of joins. In this image, the A table is the primary table (e.g. FROM), and the B table is the table you are joining in.
We’ll demonstrate a few of these.
Inner Join
What if we wanted to supplement our earlier table showing players with 1000+ RBI since 2010 with information about those players? We could use an inner join of our RBI table with the “People” table to accomplish this.
JOINs are clauses (similar to FROM, GROUP BY, etc) but typically considered as “part of” the FROM. So in indentation here, I choose to indicate that relationship, though others may not.
SELECT p.nameFirst ASfirst, p.nameLast ASlast, p.birthState AS state, p.birthCountry AS country, SUM(b.RBI) AS rbi_totalFROM Batting AS bINNERJOIN People AS p ON b.playerID = p.playerIDWHERE b.yearID >=2010GROUPBY b.playerIDHAVING rbi_total >=1000ORDERBY-rbi_total
dbGetQuery(lahman, "SELECT p.nameFirst AS first, p.nameLast AS last, p.birthState AS state, p.birthCountry AS country, SUM(b.RBI) AS rbi_total FROM Batting AS b INNER JOIN People AS p ON b.playerID = p.playerID WHERE b.yearID >= 2010 GROUP BY b.playerIDHAVING rbi_total >= 1000 ORDER BY -rbi_total")
first last state country rbi_total
1 Nelson Cruz Monte Cristi D.R. 1144
2 Albert Pujols Distrito Nacional D.R. 1106
3 Miguel Cabrera Aragua Venezuela 1094
4 Paul Goldschmidt DE USA 1042
5 Freddie Freeman CA USA 1041
Note the renaming of the tables - FROM batting AS b and INNER JOIN People AS p. This allows us to preface variables names, e.g. p.nameFirst, to indicate which table to look for the variable. We could of course use People.nameFirst but it’s shorter to rename.
The ON clause determines the connecting variables between the two tables.
Left & Right (Outer) Joins
In a left join – sometimes called a left outer join – we add columns from the right table to the left table when matching rows are found. Rows from the left table with no matches from the right table are retained with columns from the right table filled in as NULL (i.e. NA). When there are multiple matches of a row from the left table to rows in the right table, these each become a row in the new table.
A right join is equivalent to a left join with the exception that the roles between right and left are reversed.
Left joins are particularly useful when the information in the right table is only applicable to a subset of the rows from the left table. As an example, suppose we would like to know which US colleges and universities have produced the most “Rookie of the Year Awards” given to the best debuting player(s) each season.
To get started, we first test a query to find the last college attended.
-- Last college attendedSELECT*FROM CollegePlayingGROUPBY playerIDHAVING yearID ==max(YearID)
# Query to find last college attendeddbGetQuery(lahman, "SELECT * FROM CollegePlaying GROUP BY playerIDHAVING yearID == max(YearID) LIMIT 5")
# Query to find Rookie of the Year AwardsdbGetQuery(lahman, "SELECT * FROM AwardsPlayers WHERE awardID LIKE 'Rookie%' LIMIT 5")
playerID awardID yearID lgID tie notes
1 robinja02 Rookie of the Year 1947 ML <NA> <NA>
2 darkal01 Rookie of the Year 1948 ML <NA> <NA>
3 sievero01 Rookie of the Year 1949 AL <NA> <NA>
4 newcodo01 Rookie of the Year 1949 NL <NA> <NA>
5 dropowa01 Rookie of the Year 1950 AL <NA> <NA>
Finally, we use a left join of the tables for Rookie of the Year awards and last college attended to match winners to their schools. We need a left join as many of the winners may never have played collegiate baseball, and we want to keep them, but we don’t want to keep colleges which never produced any winners.
SELECT roy.playerID AS playerID, roy.yearID ASyear, lgID AS league, schoolIDFROM AwardsPlayers AS royLEFTJOIN (SELECT*--Final College AttendedFROM CollegePlayingGROUPBY playerIDHAVING yearID ==MAX(YearID) ) AS c ON c.playerID = roy.playerIDWHERE awardID LIKE'Rookie%'
Note the nested structure here - Inside the LEFT JOIN, we write a separate SELECT statement. We could have done this in two steps: generate a new table (using CREATE TABLE finalcollege AS followed by the SELECT statement) and then directly use it in the clause, LEFT JOIN finalcollege AS c.
# Query to find last college for ROYdbGetQuery(lahman, "SELECT roy.playerID AS playerID, roy.yearID AS year, lgID AS league, schoolID FROM AwardsPlayers AS roy LEFT JOIN (SELECT * --Final College Attended FROM CollegePlaying GROUP BY playerID HAVING yearID == MAX(YearID) ) AS c ON c.playerID = roy.playerID WHERE awardID LIKE 'Rookie%' LIMIT 5")
playerID year league schoolID
1 robinja02 1947 ML ucla
2 darkal01 1948 ML ulala
3 sievero01 1949 AL <NA>
4 newcodo01 1949 NL <NA>
5 dropowa01 1950 AL uconn
To complete the example, we modify the query to display which schools have produced the most ROY awards in total.
SELECT schoolID, COUNT(c.playerID) AS ROY_awardsFROM AwardsPlayers royLEFTJOIN (SELECT*--Last College AttendedFROM CollegePlayingGROUPBY playerIDHAVING yearID ==MAX(YearID) ) c ON c.playerID = roy.playerIDWHERE awardID LIKE'Rookie%'AND schoolID ISNOTNULLGROUPBY schoolIDHAVING ROY_awards >1ORDERBY-ROY_awards
# Which schools have produced the most ROY?dbGetQuery(lahman, "SELECT schoolID, COUNT(c.playerID) AS ROY_awards FROM AwardsPlayers roy LEFT JOIN (SELECT * --Last College Attended FROM CollegePlaying GROUP BY playerID HAVING yearID == MAX(YearID) ) c ON c.playerID = roy.playerID WHERE awardID LIKE 'Rookie%' AND schoolID IS NOT NULL GROUP BY schoolIDHAVING ROY_awards > 1 ORDER BY -ROY_awards")
The previous example demonstrated the order of almost all clauses:
SELECTFROMJOINWHEREGROUPBYHAVINGORDERBYLIMIT
Another Example
The “sakila” database is a fake data set created by the MySQL team which simulates a very rich database of many tables. A map of its contents is:
It simulates the database of a Blockbuster-style rental store. It includes customer data, movie data, and rental data linking the two. An SQLite database containing the data can be downloaded from https://github.com/bradleygrant/sakila-sqlite3.
Let’s use this data to examine which actor or actress it the most “rented”. Looking at the tables, we have the “rental” table which contains records of every movie rented. From here, we are connected to the “inventory” table, which records which movie was actually rented. So first, let’s obtain a list of every movie ever rented.
SELECT i.film_idFROM rental AS rLEFTJOIN inventory AS i ON i.inventory_id = r.inventory_id
We use a left join here because there may be movies in the inventory that were never rented.
dbGetQuery(sakila, "SELECT i.film_id FROM rental AS r LEFT JOIN inventory AS i ON i.inventory_id = r.inventory_id LIMIT 5")
film_id
1 1
2 1
3 1
4 1
5 1
Now, we could join next to the “films” table, however, you may notice that both the “films” and “film_actor” have a film_id key, so we can completely bypass “films”.
SELECT fa.actor_idFROM film_actor AS faRIGHTJOIN (SELECT i.film_idFROM rental AS rLEFTJOIN inventory AS i ON i.inventory_id = r.inventory_id ) AS rr ON fa.film_id = rr.film_id
This time we use a right join as we don’t want to list any actors which were in movies that weren’t rented.
dbGetQuery(sakila, "SELECT fa.actor_id FROM film_actor AS fa RIGHT JOIN (SELECT i.film_id FROM rental AS r LEFT JOIN inventory AS i ON i.inventory_id = r.inventory_id ) AS rr ON fa.film_id = rr.film_id LIMIT 5")
actor_id
1 1
2 1
3 1
4 1
5 1
We now have a list of actor ID’s, next we just need to connect it to the actor names.
SELECTCOUNT(a.actor_id) AScount, a.first_name, a.last_nameFROM actor AS aRIGHTJOIN (SELECT fa.actor_idFROM film_actor AS faRIGHTJOIN (SELECT i.film_idFROM rental AS rLEFTJOIN inventory AS iON i.inventory_id = r.inventory_id ) AS rr ON fa.film_id = rr.film_id ) AS ff ON ff.actor_id = a.actor_idGROUPBY a.actor_idORDERby-count
dbGetQuery(sakila, "SELECT COUNT(a.actor_id) AS count, a.first_name, a.last_name FROM actor AS a RIGHT JOIN (SELECT fa.actor_id FROM film_actor AS fa RIGHT JOIN (SELECT i.film_id FROM rental AS r LEFT JOIN inventory AS i ON i.inventory_id = r.inventory_id ) AS rr ON fa.film_id = rr.film_id ) AS ff ON ff.actor_id = a.actor_id GROUP BY a.actor_id ORDER by -count LIMIT 5")
count first_name last_name
1 753 GINA DEGENERES
2 678 MATTHEW CARREY
3 674 MARY KEITEL
4 654 ANGELA WITHERSPOON
5 640 WALTER TORN