SQL in R

Introduction
- SQL in this class
sqldf package
SQL Queries
- Wildcard
- LIMIT
- ORDER BY
- WHERE
  - IN
  - LIKE
- Aggregated data
  - Counting data
Conclusion

Thank you for visiting my notes. I've noticed a large uptick in views lately which is terrific! However, these notes are quite old (2017) and some details may be outdated. I have a more recent class (2023) which covers much of the same information and is hopefully more up-to-date.

If you are using these notes, I'd love to hear from you and how you are using them! Please reach out to jerrick (at) umich (dot) edu.

Introduction

SQL is a database query language - a language designed specifically for interacting with a database. It offers syntax for extracting data, updating data, replacing data, creating data, etc. For our purposes, it will typically be used when accessing data off a server database. If the database isn’t too large, you can grab the entire data set and stick it in a data.frame. However, often the data are quite large so you interact with it piecemeal via SQL.

There are various database implementations (SQLite, Microsoft SQL Server, PostgreSQL, etc) which are database management software which use SQL to access the data. The method of connecting with each database may differ, but they support SQL (specifically they support ANSI SQL) and often extend it in subtle ways. This means that in general, SQL written to access a SQLite database may not work to access a PostgreSQL database.

Thankfully, most of these differences are on more fringe operations, and standard commands tend to be equivalent.

SQL in this class

The first (and in my opinion, biggest) hurdle in using SQL is accessing the actual data. If you are accessing data which someone else is hosting, they will (hopefully) have IT administrators who can assist you with this task. The exact method and settings used with differ greatly by project, so we will not be covering it in this class. Here’s a few packages that might be of use in connecting:

For these notes, we will cover only the syntax of SQL queries, using the sqldf package which enables SQL queries on a data.frame. We will not cover connecting to a SQL server, nor modifying an existing database, only extracting the data to analyze in R with normal methods.

sqldf package

library(sqldf)

The sqldf package is incredibly simple, from R’s point of view. There is a single function are concerned about: sqldf. Passed to this function is a SQL statement, such as

sqldf('SELECT age, circumference FROM Orange WHERE Tree = 1 ORDER BY circumference ASC')

## Warning: Quoted identifiers should have class SQL, use DBI::SQL() if the
## caller performs the quoting.

##    age circumference
## 1  118            30
## 2  484            58
## 3  664            87
## 4 1004           115
## 5 1231           120
## 6 1372           142
## 7 1582           145

(Note: The above warning is due to some compatibility issues between sqldf and RSQLite and shouldn’t affect anything.)

SQL Queries

There are a large number of SQL major commands¹. Queries are accomplished with the SELECT command. First a note about convention:

By convention, SQL syntax is written in all UPPER CASE and variable names/database names are written in lower case. Technically, the SQL syntax is case insensitive, so it can be written in lower case or otherwise. Note however that R is not case insensitive, so variable names and data frame names must have proper capitalization. Hence

sqldf("SELECT * FROM iris")
sqldf("select * from iris")

are equivalent, but this would fail (assuming you haven’t created a new object called “IRIS”):

sqldf("SELECT * from IRIS")

The basic syntax for SELECT is

SELECT variable1, variable2 FROM data

For example,

data(BOD)
BOD

##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8

sqldf('SELECT demand FROM BOD')

##   demand
## 1    8.3
## 2   10.3
## 3   19.0
## 4   16.0
## 5   15.6
## 6   19.8

sqldf('SELECT Time, demand from BOD')

##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8

A quick sidenote: SQL does not like variables with . in their name. If you have any, refer to the variable wrapped in quotes, such as

iris1 <- sqldf('SELECT Petal.Width FROM iris')

## Error in rsqlite_send_query(conn@ptr, statement): no such column: Petal.Width

iris2 <- sqldf('SELECT "Petal.Width" FROM iris')

Wildcard

A wild card can be passed to extract everything.

bod2 <- sqldf('SELECT * FROM BOD')
bod2

##   Time demand
## 1    1    8.3
## 2    2   10.3
## 3    3   19.0
## 4    4   16.0
## 5    5   15.6
## 6    7   19.8

LIMIT

To control the number of results returned, use LIMIT #.

sqldf('SELECT * FROM iris LIMIT 5')

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa

ORDER BY

To order variables, use the syntax

ORDER BY var1 {ASC/DESC}, var2 {ASC/DESC}

where the choice of ASC for ascending or DESC for descending is made per variable.

sqldf("SELECT * FROM Orange ORDER BY age ASC, circumference DESC LIMIT 5")

##   Tree age circumference
## 1    2 118            33
## 2    4 118            32
## 3    1 118            30
## 4    3 118            30
## 5    5 118            30

WHERE

Conditional statements can be added via WHERE:

sqldf('SELECT demand FROM BOD WHERE Time < 3')

##   demand
## 1    8.3
## 2   10.3

Both AND and OR are valid, along with paranthese to affect order of operations.

sqldf('SELECT * FROM rock WHERE (peri > 5000 AND shape < .05) OR perm > 1000')

##   area     peri    shape perm
## 1 5048  941.543 0.328641 1300
## 2 1016  308.642 0.230081 1300
## 3 5605 1145.690 0.464125 1300
## 4 8793 2280.490 0.420477 1300

There are few more complicated ways to use WHERE:

IN

WHERE IN is used similar to R’s %in%. It also supports NOT.

sqldf('SELECT * FROM BOD WHERE Time IN (1,7)')

##   Time demand
## 1    1    8.3
## 2    7   19.8

sqldf('SELECT * FROM BOD WHERE Time NOT IN (1,7)')

##   Time demand
## 1    2   10.3
## 2    3   19.0
## 3    4   16.0
## 4    5   15.6

LIKE

LIKE can be thought of as a weak regular expression command. It only allows the single wildcard % which matches any number of characters. For example, to extract the data where the feed ends with “bean”:

sqldf('SELECT * FROM chickwts WHERE feed LIKE "%bean" LIMIT 5')

##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean

sqldf('SELECT * FROM chickwts WHERE feed NOT LIKE "%bean" LIMIT 5')

##   weight    feed
## 1    309 linseed
## 2    229 linseed
## 3    181 linseed
## 4    141 linseed
## 5    260 linseed

Aggregated data

Select statements can create aggregated data using AVG, MEDIAN, MAX, MIN, and SUM as functions in the list of variables to select. The GROUP BY statement can be added to aggregate by groups. AS can name the

sqldf("SELECT AVG(circumference) FROM Orange")

##   AVG(circumference)
## 1           115.8571

sqldf("SELECT tree, AVG(circumference) AS meancirc FROM Orange GROUP BY tree")

##   Tree  meancirc
## 1    1  99.57143
## 2    2 135.28571
## 3    3  94.00000
## 4    4 139.28571
## 5    5 111.14286

Counting data

SELECT COUNT() returns the number of observations. Passing * or nothing returns total rows, passing a variable name returns the number of non-NA entries. AS works as well.

d <- data.frame(a = c(1,1,1), b = c(1,NA,NA))
d

##   a  b
## 1 1  1
## 2 1 NA
## 3 1 NA

sqldf("SELECT COUNT() as numrows FROM d")

##   numrows
## 1       3

sqldf("SELECT COUNT(b) FROM d")

##   COUNT(b)
## 1        1

Conclusion

This is only the tip of the iceberg which is SQL. There are far more advanced commands available, from DELETE or UPDATE to modify a database, to various JOIN commands for merging. If the database is large enough that you cannot store the entire dataset on your computer, you may need to learn more commands. For some tutorials into more advanced SQL see the following:

For example, see all the sections here.↩