SQL in R
Introduction
SQL is a database query language - a language designed specifically for interacting with a database. It offers syntax for extracting data, updating data, replacing data, creating data, etc. For our purposes, it will typically be used when accessing data off a server database. If the database isn’t too large, you can grab the entire data set and stick it in a data.frame
. However, often the data are quite large so you interact with it piecemeal via SQL.
There are various database implementations (SQLite, Microsoft SQL Server, PostgreSQL, etc) which are database management software which use SQL to access the data. The method of connecting with each database may differ, but they support SQL (specifically they support ANSI SQL) and often extend it in subtle ways. This means that in general, SQL written to access a SQLite database may not work to access a PostgreSQL database.
Thankfully, most of these differences are on more fringe operations, and standard commands tend to be equivalent.
SQL in this class
The first (and in my opinion, biggest) hurdle in using SQL is accessing the actual data. If you are accessing data which someone else is hosting, they will (hopefully) have IT administrators who can assist you with this task. The exact method and settings used with differ greatly by project, so we will not be covering it in this class. Here’s a few packages that might be of use in connecting:
For these notes, we will cover only the syntax of SQL queries, using the sqldf package which enables SQL queries on a data.frame
. We will not cover connecting to a SQL server, nor modifying an existing database, only extracting the data to analyze in R with normal methods.
sqldf package
library(sqldf)
The sqldf package is incredibly simple, from R’s point of view. There is a single function are concerned about: sqldf
. Passed to this function is a SQL statement, such as
sqldf('SELECT age, circumference FROM Orange WHERE Tree = 1 ORDER BY circumference ASC')
## Warning: Quoted identifiers should have class SQL, use DBI::SQL() if the
## caller performs the quoting.
## age circumference
## 1 118 30
## 2 484 58
## 3 664 87
## 4 1004 115
## 5 1231 120
## 6 1372 142
## 7 1582 145
(Note: The above warning is due to some compatibility issues between sqldf and RSQLite and shouldn’t affect anything.)
SQL Queries
There are a large number of SQL major commands1. Queries are accomplished with the SELECT
command. First a note about convention:
By convention, SQL syntax is written in all UPPER CASE and variable names/database names are written in lower case. Technically, the SQL syntax is case insensitive, so it can be written in lower case or otherwise. Note however that R is not case insensitive, so variable names and data frame names must have proper capitalization. Hence
sqldf("SELECT * FROM iris")
sqldf("select * from iris")
are equivalent, but this would fail (assuming you haven’t created a new object called “IRIS”):
sqldf("SELECT * from IRIS")
The basic syntax for SELECT is
SELECT variable1, variable2 FROM data
For example,
data(BOD)
BOD
## Time demand
## 1 1 8.3
## 2 2 10.3
## 3 3 19.0
## 4 4 16.0
## 5 5 15.6
## 6 7 19.8
sqldf('SELECT demand FROM BOD')
## demand
## 1 8.3
## 2 10.3
## 3 19.0
## 4 16.0
## 5 15.6
## 6 19.8
sqldf('SELECT Time, demand from BOD')
## Time demand
## 1 1 8.3
## 2 2 10.3
## 3 3 19.0
## 4 4 16.0
## 5 5 15.6
## 6 7 19.8
A quick sidenote: SQL does not like variables with .
in their name. If you have any, refer to the variable wrapped in quotes, such as
iris1 <- sqldf('SELECT Petal.Width FROM iris')
## Error in rsqlite_send_query(conn@ptr, statement): no such column: Petal.Width
iris2 <- sqldf('SELECT "Petal.Width" FROM iris')
Wildcard
A wild card can be passed to extract everything.
bod2 <- sqldf('SELECT * FROM BOD')
bod2
## Time demand
## 1 1 8.3
## 2 2 10.3
## 3 3 19.0
## 4 4 16.0
## 5 5 15.6
## 6 7 19.8
LIMIT
To control the number of results returned, use LIMIT #
.
sqldf('SELECT * FROM iris LIMIT 5')
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
ORDER BY
To order variables, use the syntax
ORDER BY var1 {ASC/DESC}, var2 {ASC/DESC}
where the choice of ASC
for ascending or DESC
for descending is made per variable.
sqldf("SELECT * FROM Orange ORDER BY age ASC, circumference DESC LIMIT 5")
## Tree age circumference
## 1 2 118 33
## 2 4 118 32
## 3 1 118 30
## 4 3 118 30
## 5 5 118 30
WHERE
Conditional statements can be added via WHERE
:
sqldf('SELECT demand FROM BOD WHERE Time < 3')
## demand
## 1 8.3
## 2 10.3
Both AND
and OR
are valid, along with paranthese to affect order of operations.
sqldf('SELECT * FROM rock WHERE (peri > 5000 AND shape < .05) OR perm > 1000')
## area peri shape perm
## 1 5048 941.543 0.328641 1300
## 2 1016 308.642 0.230081 1300
## 3 5605 1145.690 0.464125 1300
## 4 8793 2280.490 0.420477 1300
There are few more complicated ways to use WHERE
:
IN
WHERE IN
is used similar to R’s %in%
. It also supports NOT
.
sqldf('SELECT * FROM BOD WHERE Time IN (1,7)')
## Time demand
## 1 1 8.3
## 2 7 19.8
sqldf('SELECT * FROM BOD WHERE Time NOT IN (1,7)')
## Time demand
## 1 2 10.3
## 2 3 19.0
## 3 4 16.0
## 4 5 15.6
LIKE
LIKE
can be thought of as a weak regular expression command. It only allows the single wildcard %
which matches any number of characters. For example, to extract the data where the feed ends with “bean”:
sqldf('SELECT * FROM chickwts WHERE feed LIKE "%bean" LIMIT 5')
## weight feed
## 1 179 horsebean
## 2 160 horsebean
## 3 136 horsebean
## 4 227 horsebean
## 5 217 horsebean
sqldf('SELECT * FROM chickwts WHERE feed NOT LIKE "%bean" LIMIT 5')
## weight feed
## 1 309 linseed
## 2 229 linseed
## 3 181 linseed
## 4 141 linseed
## 5 260 linseed
Aggregated data
Select statements can create aggregated data using AVG
, MEDIAN
, MAX
, MIN
, and SUM
as functions in the list of variables to select. The GROUP BY
statement can be added to aggregate by groups. AS
can name the
sqldf("SELECT AVG(circumference) FROM Orange")
## AVG(circumference)
## 1 115.8571
sqldf("SELECT tree, AVG(circumference) AS meancirc FROM Orange GROUP BY tree")
## Tree meancirc
## 1 1 99.57143
## 2 2 135.28571
## 3 3 94.00000
## 4 4 139.28571
## 5 5 111.14286
Counting data
SELECT COUNT()
returns the number of observations. Passing *
or nothing returns total rows, passing a variable name returns the number of non-NA entries. AS
works as well.
d <- data.frame(a = c(1,1,1), b = c(1,NA,NA))
d
## a b
## 1 1 1
## 2 1 NA
## 3 1 NA
sqldf("SELECT COUNT() as numrows FROM d")
## numrows
## 1 3
sqldf("SELECT COUNT(b) FROM d")
## COUNT(b)
## 1 1
Conclusion
This is only the tip of the iceberg which is SQL. There are far more advanced commands available, from DELETE
or UPDATE
to modify a database, to various JOIN
commands for merging. If the database is large enough that you cannot store the entire dataset on your computer, you may need to learn more commands. For some tutorials into more advanced SQL see the following: