library(data.table)
<- fread("data/flights14.csv")
nyc14 class(nyc14)
[1] "data.table" "data.frame"
data.table
The data.table
package in R provides an extension of the data.frame
class that aims to be both more computationally and memory efficient. It is particularly well suited for large in-memory data sets and utilizes indexed keys to allow quick search, subset, and aggregate by group operations. The package also automatically employs multicore computations through its back end, enabling a degree of no-hassle parallelism.
The data.table
package also provides an expressive, compact syntax for working with data. However, compared to dplyr
, this syntax is less literate and may be more difficult for a non-expert to read and make sense of.
data.table
objectsThe data.table package provides a function fread()
for reading delimited files like read.table()
and readr::read_delim()
, but returns a data.table
object instead. As with the tbl_df
class from tibble, data.table
inherits from data.frame
. Here is an example using the 2014 New York City Flights data. It is available from download at https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv.
library(data.table)
<- fread("data/flights14.csv")
nyc14 class(nyc14)
[1] "data.table" "data.frame"
You can also create a data.table
like a data.frame
using:
<- 1e3
n data.table(a = 1:n,
b = rnorm(n),
c = sample(letters, n, replace = TRUE))
a b c
1: 1 -1.2522775 k
2: 2 0.6924447 i
3: 3 0.7121352 r
4: 4 1.2519223 k
5: 5 -0.3826398 r
---
996: 996 -1.8878485 q
997: 997 -1.8024806 w
998: 998 -1.3121417 e
999: 999 0.4838444 a
1000: 1000 -0.6714461 f
Note that similar to tibble, data.table
objects by default print only a subset of the data.
The syntax for the data.table
package is inspired by the bracket ([]
) notation for indexing matrices and data frames. At the same time, it aims to allow many common data operations (i.e. dplyr
verbs) to be expressed within these brackets.
The basic idea is DT[i, j, by]
where we:
i
statement,j
statementby
statement.Additional operations can also be expressed within the brackets. Remember that, even for a data.frame or matrix the left bracket [
is actually a function:
<- data.frame(a = 1:3, b = 4:6)
d 1:2, ] d[
a b
1 1 4
2 2 5
`[`(d, 1:2, )
a b
1 1 4
2 2 5
Basic subsetting of data.table
objects works similarly to data.frame
s.
1:2, ] nyc14[
year month day dep_time dep_delay arr_time arr_delay cancelled carrier
1: 2014 1 1 914 14 1238 13 0 AA
2: 2014 1 1 1157 -3 1523 13 0 AA
tailnum flight origin dest air_time distance hour min
1: N338AA 1 JFK LAX 359 2475 9 14
2: N335AA 3 JFK LAX 363 2475 11 57
Find all flights from LGA to DTW:
<- nyc14[origin == "LGA" & dest == "DTW", ] lga_dtw
Get the first and last rows of lga_dtw
:
c(1, .N)] lga_dtw[
year month day dep_time dep_delay arr_time arr_delay cancelled carrier
1: 2014 1 1 901 -4 1102 -11 0 DL
2: 2014 10 31 1106 -4 1325 15 0 MQ
tailnum flight origin dest air_time distance hour min
1: N917DL 181 LGA DTW 99 502 9 1
2: N511MQ 3592 LGA DTW 75 502 11 6
In the above, we used .N
to index the last row. This is a special symbol defined by data.table
to hold the number of rows or observations in the “current” group. “Current” here refers to the scope in which it is used; in this example, that is the entire data.table
.
Also, notice the difference from standard data.frame
or matrix sub-setting, in that we did not leave a blank for columns: lga_dta[c(1, .N)]
versus lga_dta[c(1, .N), ]
. With data.frame
s, a single argument to `[`
treats the object as a list and returns as expected.
1, ] nyc14[
year month day dep_time dep_delay arr_time arr_delay cancelled carrier
1: 2014 1 1 914 14 1238 13 0 AA
tailnum flight origin dest air_time distance hour min
1: N338AA 1 JFK LAX 359 2475 9 14
1] nyc14[
year month day dep_time dep_delay arr_time arr_delay cancelled carrier
1: 2014 1 1 914 14 1238 13 0 AA
tailnum flight origin dest air_time distance hour min
1: N338AA 1 JFK LAX 359 2475 9 14
<- as.data.frame(nyc14)
nyc14df 1, ] nyc14df[
year month day dep_time dep_delay arr_time arr_delay cancelled carrier
1 2014 1 1 914 14 1238 13 0 AA
tailnum flight origin dest air_time distance hour min
1 N338AA 1 JFK LAX 359 2475 9 14
head(nyc14df[1]) # don"t print too much!
year
1 2014
2 2014
3 2014
4 2014
5 2014
6 2014
You can also use the i
clause to order a data.table:
order(-month, -day, dep_time)] lga_dtw[
year month day dep_time dep_delay arr_time arr_delay cancelled carrier
1: 2014 10 31 718 -2 904 -9 0 DL
2: 2014 10 31 900 -5 1051 -22 0 DL
3: 2014 10 31 939 -6 1138 -2 0 MQ
4: 2014 10 31 1106 -4 1325 15 0 MQ
5: 2014 10 31 1113 8 1317 8 0 DL
---
3659: 2014 1 1 1302 17 1503 21 0 DL
3660: 2014 1 1 1350 -5 1600 5 0 MQ
3661: 2014 1 1 1628 -1 1829 -8 0 DL
3662: 2014 1 1 1920 130 2137 147 0 MQ
3663: 2014 1 1 2037 52 2242 52 0 MQ
tailnum flight origin dest air_time distance hour min
1: N320US 831 LGA DTW 74 502 7 18
2: N818DA 189 LGA DTW 79 502 9 0
3: N528MQ 3478 LGA DTW 80 502 9 39
4: N511MQ 3592 LGA DTW 75 502 11 6
5: N3758Y 2449 LGA DTW 74 502 11 13
---
3659: N331NW 1131 LGA DTW 98 502 13 2
3660: N839MQ 3340 LGA DTW 101 502 13 50
3661: N310NW 2231 LGA DTW 98 502 16 28
3662: N833MQ 3530 LGA DTW 103 502 19 20
3663: N856MQ 3603 LGA DTW 105 502 20 37
Get the departure and arrival times, flight number, and carrier for all flights from LGA to DTW:
== "LGA" & dest == "DTW",
nyc14[origin list(dep_time, arr_time, carrier, flight)]
dep_time arr_time carrier flight
1: 901 1102 DL 181
2: 555 745 DL 731
3: 1302 1503 DL 1131
4: 1628 1829 DL 2231
5: 849 1058 MQ 3478
---
3659: 1613 1757 DL 2231
3660: 939 1138 MQ 3478
3661: 1912 2104 MQ 3603
3662: 1346 1535 MQ 3631
3663: 1106 1325 MQ 3592
Notice the use of list()
to select columns. A synonym for list()
within data.table
is .()
to save typing and enhance readability:
== "LGA" & dest == "DTW",
nyc14[origin .(dep_time, arr_time, carrier, flight)]
dep_time arr_time carrier flight
1: 901 1102 DL 181
2: 555 745 DL 731
3: 1302 1503 DL 1131
4: 1628 1829 DL 2231
5: 849 1058 MQ 3478
---
3659: 1613 1757 DL 2231
3660: 939 1138 MQ 3478
3661: 1912 2104 MQ 3603
3662: 1346 1535 MQ 3631
3663: 1106 1325 MQ 3592
Columns can also be selected using a character vector of column names.
== "LGA" & dest == "DTW",
nyc14[origin c("dep_time", "arr_time", "carrier", "flight")]
dep_time arr_time carrier flight
1: 901 1102 DL 181
2: 555 745 DL 731
3: 1302 1503 DL 1131
4: 1628 1829 DL 2231
5: 849 1058 MQ 3478
---
3659: 1613 1757 DL 2231
3660: 939 1138 MQ 3478
3661: 1912 2104 MQ 3603
3662: 1346 1535 MQ 3631
3663: 1106 1325 MQ 3592
Note that a vector of variable names (which are not characters) will return a vector. E.g. DT[, var]
(or DT$var
) returns a vector similar to how data.frame$var
does. However, DT[, c(var1, var2)]
returns the equivalent of c(data.frame$var1, data.frame$var2)
which probably isn’t what you want.
We can deselect columns using negation (-
or !
).
ncol(nyc14)
[1] 17
<- nyc14[, -c("tailnum")]
nyc14b ncol(nyc14b)
[1] 16
<- nyc14b[, !c("cancelled", "year", "day", "hour", "min")]
nyc14b ncol(nyc14b)
[1] 11
Note that this only works with the character vector name.
-list(tailnum)] nyc14[,
Error in -list(tailnum): invalid argument to unary operator
-.(tailnum)] nyc14[,
Error in -list(tailnum): invalid argument to unary operator
The j
-clause can be used to compute with column variables like dplyr::summarize()
. Below, we find the mean and IQR of departure delays for flights between LGA and DTW during this period:
median = median(dep_delay),
lga_dtw[ , .(p25 = quantile(dep_delay, .25),
p75 = quantile(dep_delay, .75))]
median p25 p75
1: -3 -6 4
Note that both median
and quantile
are base R functions. Any function which takes in a vector and returns a scalar can be used in this fashion. Note also the use of .()
instead of list()
.
The j
-clause can also be used to compute with column variables much like dplyr::transmute()
. Here, we create new columns indicating whether the arrival or departure delays were greater than 15 minutes:
delay15 = dep_delay > 15 | arr_delay > 15)] nyc14[, .(
delay15
1: FALSE
2: FALSE
3: FALSE
4: FALSE
5: FALSE
---
253312: FALSE
253313: FALSE
253314: TRUE
253315: FALSE
253316: FALSE
To get behavior like dplyr::mutate()
we need reference semantics. This allows adding/updating/removing columns in-place.
The short version is: modifying columns of a data.frame
creates copies. This can be slow and/or memory exhaustive. By using the :=
operator, data.table avoids making any copies.
ncol(nyc14)
[1] 17
:= dep_delay > 30 | arr_delay > 30]
nyc14[, delay30 ncol(nyc14)
[1] 18
1:2, ] nyc14[
year month day dep_time dep_delay arr_time arr_delay cancelled carrier
1: 2014 1 1 914 14 1238 13 0 AA
2: 2014 1 1 1157 -3 1523 13 0 AA
tailnum flight origin dest air_time distance hour min delay30
1: N338AA 1 JFK LAX 359 2475 9 14 FALSE
2: N335AA 3 JFK LAX 363 2475 11 57 FALSE
We can see how much faster the data.table
approach is
<- fread("data/flights14.csv")
nyc2 <- fread("data/flights14.csv")
nyc3 library(microbenchmark)
microbenchmark(
baseR = {
$delay30 <- nyc2$dep_delay > 30 | nyc2$arr_delay > 30
nyc2
},data.table = {
:= dep_delay > 30 | arr_delay > 30]
nyc3[, delay30 })
Warning in microbenchmark(baseR = {: less accurate nanosecond times to avoid
potential integer overflows
Unit: milliseconds
expr min lq mean median uq max neval
baseR 1.653202 2.369903 5.102965 3.058661 4.337021 29.55916 100
data.table 1.084901 1.217105 1.620644 1.272066 1.354906 23.93174 100
identical(nyc2, nyc3)
[1] TRUE
Note that the creation of the new column was done “in place” - nothing was returned.
The :=
is a function, so you can use it in its functional form to make multiple modifications. E.g.
# Not evaluated
`:=`(new1 = xxx,
DT[, new2 = yyy,
new3 = zzz)
Technically you can also do DT[, c("new1", "new2", "new3") := list(xxx, yyy, zzz)]
but I find this notation difficult to read, especially as xxx
, yyy
and zz
get long.
To perform operations group-wise use a by
argument after the j
statement. Let’s find the percent of flights with delays of 30 minutes or more by carrier.
del30_pct = 100 * mean(delay30)), by = dest] nyc14[, .(
dest del30_pct
1: LAX 15.525842
2: PBI 19.123218
3: MIA 13.839645
4: SEA 16.671909
5: SFO 18.795666
---
105: ANC 23.076923
106: TVC 21.428571
107: HYA 9.333333
108: SBN 0.000000
109: DAL 0.000000
We can use a list to specify multiple grouping variables.
del30_pct = 100 * mean(delay30)), by = .(origin, dest)] nyc14[, .(
origin dest del30_pct
1: JFK LAX 13.89107
2: LGA PBI 17.20850
3: EWR LAX 19.47468
4: JFK MIA 14.07273
5: JFK SEA 14.49036
---
217: LGA AVL 0.00000
218: LGA GSP 33.33333
219: LGA SBN 0.00000
220: EWR SBN 0.00000
221: LGA DAL 0.00000
Pay attention to how the result is ordered - or rather isn’t ordered. Specifically, that the result is not alphabetically ordered. Rather, it retains the original ordering as much as possible to minimize memory usage.
head(nyc14[, dest], 20)
[1] "LAX" "LAX" "LAX" "PBI" "LAX" "LAX" "LAX" "LAX" "MIA" "SEA" "MIA" "SFO"
[13] "BOS" "LAX" "BOS" "ORD" "IAH" "AUS" "DFW" "ORD"
keyby
To order according to the values in the by
argument, use keyby
which sets a key with the data.table ordered by this key. More on keys can be found below.
<- nyc14[, .(del30_pct = 100 * mean(delay30)), by = dest]
delay_pct1 key(delay_pct1)
NULL
<- nyc14[, .(del30_pct = 100 * mean(delay30)), keyby = dest]
delay_pct2 key(delay_pct2)
[1] "dest"
cbind(delay_pct1, delay_pct2)
dest del30_pct dest del30_pct
1: LAX 15.525842 ABQ 25.53957
2: PBI 19.123218 ACK 10.10830
3: MIA 13.839645 AGS 0.00000
4: SEA 16.671909 ALB 27.81065
5: SFO 18.795666 ANC 23.07692
---
105: ANC 23.076923 TPA 19.42496
106: TVC 21.428571 TUL 32.83582
107: HYA 9.333333 TVC 21.42857
108: SBN 0.000000 TYS 33.58025
109: DAL 0.000000 XNA 14.77987
As with standard data.frame
indexing, we can compose data.table
bracketed expressions using chaining.
## Find max departure delay by flight among all flights from LGA to DTW
## Then, select flights within the shortest 10% of max_delay
== "LGA" & dest == "DTW",
nyc14[origin max_delay = max(dep_delay)),
.(= .(carrier, flight)
by
][, .(carrier, flight, max_delay,max_delay_q10 = quantile(max_delay, .1))
< max_delay_q10, -"max_delay_q10"] ][max_delay
carrier flight max_delay
1: DL 1107 -2
2: EV 5405 -5
3: DL 796 -8
4: EV 5596 -10
Unlike tidyverse’s mutate
which allows for variables to appear on both the RHS and LHS, data.table
doesn’t allow this - so instead we use this chaining.
<- data.table(a = 1:4)
dt ::mutate(dt,
dplyrb = a - 1,
c = b * 2)
a b c
1: 1 0 0
2: 2 1 2
3: 3 2 4
4: 4 3 6
b = a - 1, c = b * 2)] dt[, .(a,
Error in eval(jsub, SDenv, parent.frame()): object 'b' not found
b = a - 1)][, .(a, b, c = b * 2)] dt[, .(a,
a b c
1: 1 0 0
2: 2 1 2
3: 3 2 4
4: 4 3 6
:= a - 1][, c := b * 2]
dt[, b dt
a b c
1: 1 0 0
2: 2 1 2
3: 3 2 4
4: 4 3 6
If you prefer pipes |>
for clarity, you can use them by appending a _
before the opening bracket:
== "LGA" & dest == "DTW",
nyc14[origin max_delay = max(dep_delay)),
.(= .(carrier, flight)] |>
by
_[, .(carrier, flight, max_delay,max_delay_q10 = quantile(max_delay, .1))] |>
< max_delay_q10, -"max_delay_q10"] _[max_delay
carrier flight max_delay
1: DL 1107 -2
2: EV 5405 -5
3: DL 796 -8
4: EV 5596 -10
You could do the same thing with %>%
from magrittr replacing the _
with .
.
.SD
Recall that the special symbol .N
contains the number of rows in each subset defined using by
or keyby
.
== "DTW", .N, by = carrier] nyc14[dest
carrier N
1: DL 3095
2: EV 1584
3: MQ 1331
There is another special symbol .SD
which references the entire subset of data for each group. It is itself a data.table
. Consider trying to get the mean of each column of a toy dataset.
<- data.table(a = sample(1:100, 6),
dt b = sample(1:100, 6),
c = c(1, 1, 1, 2, 2, 2))
lapply(.SD, mean)] dt[,
a b c
1: 48.16667 29.83333 1.5
lapply(.SD, mean), by = c] dt[,
c a b
1: 1 34.33333 45.33333
2: 2 62.00000 14.33333
(Note the use of lapply
since we want a list
. What happens if we use sapply
? Why?)
Here’s a practical example. Let’s get the dimensions of each sub-table defined by carrier.
== "DTW",
nyc14[dest rows = nrow(.SD),
.(n = .N, # A second way to count the number of rows
cols = ncol(.SD),
class = class(.SD)[1]), # also return the class to see the `.SD` is a
# `data.table`
= carrier] by
carrier rows n cols class
1: DL 3095 3095 17 data.table
2: EV 1584 1584 17 data.table
3: MQ 1331 1331 17 data.table
As a reminder, any valid R expression can be placed in j
.
== "DTW", print(.SD[1:2]), by = carrier] nyc14[dest
year month day dep_time dep_delay arr_time arr_delay cancelled tailnum
1: 2014 1 1 901 -4 1102 -11 0 N917DL
2: 2014 1 1 555 -5 745 -7 0 N342NB
flight origin dest air_time distance hour min delay30
1: 181 LGA DTW 99 502 9 1 FALSE
2: 731 LGA DTW 93 502 5 55 FALSE
year month day dep_time dep_delay arr_time arr_delay cancelled tailnum
1: 2014 1 1 1225 10 1428 9 0 N14173
2: 2014 1 1 2055 -4 2305 11 0 N14186
flight origin dest air_time distance hour min delay30
1: 4118 EWR DTW 103 488 12 25 FALSE
2: 4247 EWR DTW 102 488 20 55 FALSE
year month day dep_time dep_delay arr_time arr_delay cancelled tailnum
1: 2014 1 1 849 -6 1058 -2 0 N818MQ
2: 2014 1 1 1920 130 2137 147 0 N833MQ
flight origin dest air_time distance hour min delay30
1: 3478 LGA DTW 103 502 8 49 FALSE
2: 3530 LGA DTW 103 502 19 20 TRUE
Empty data.table (0 rows and 1 cols): carrier
Notice that the grouping variable carrier
is not a column in .SD
.
We can pass an additional argument .SDcols
to the bracketing function to limit the columns in .SD
.
== "DTW",
nyc14[dest rows = nrow(.SD),
.(cols = ncol(.SD)),
= carrier,
by = c("origin", "dest", "flight", "dep_time")] .SDcols
carrier rows cols
1: DL 3095 4
2: EV 1584 4
3: MQ 1331 4
== "DTW",
nyc14[dest print(.SD),
= carrier,
by = c("origin", "dest", "flight", "dep_time")] .SDcols
origin dest flight dep_time
1: LGA DTW 181 901
2: LGA DTW 731 555
3: LGA DTW 1131 1302
4: JFK DTW 2184 1601
5: LGA DTW 2231 1628
---
3091: EWR DTW 825 1628
3092: LGA DTW 831 718
3093: LGA DTW 1131 1235
3094: LGA DTW 2131 1813
3095: LGA DTW 2231 1613
origin dest flight dep_time
1: EWR DTW 4118 1225
2: EWR DTW 4247 2055
3: EWR DTW 4381 1639
4: EWR DTW 5078 1250
5: EWR DTW 4246 1117
---
1580: EWR DTW 4297 842
1581: EWR DTW 4911 910
1582: EWR DTW 4246 731
1583: EWR DTW 4247 2133
1584: EWR DTW 4911 913
origin dest flight dep_time
1: LGA DTW 3478 849
2: LGA DTW 3530 1920
3: LGA DTW 3603 2037
4: LGA DTW 3689 1128
5: LGA DTW 3340 1350
---
1327: LGA DTW 3592 1058
1328: LGA DTW 3478 939
1329: LGA DTW 3603 1912
1330: LGA DTW 3631 1346
1331: LGA DTW 3592 1106
Empty data.table (0 rows and 1 cols): carrier
This can be useful in the j
statement because it allows us to use lapply
or any other function that returns a list to compute on multiple columns.
# What is the mean departure & arrival delay for each flight to DTW?
== "DTW",
nyc14[dest lapply(.SD, mean),
= .(origin, dest, carrier, flight),
by = c("arr_delay", "dep_delay")] .SDcols
origin dest carrier flight arr_delay dep_delay
1: LGA DTW DL 181 5.694118 11.8176471
2: LGA DTW DL 731 -2.637795 -0.6653543
3: LGA DTW DL 1131 5.774648 9.3450704
4: JFK DTW DL 2184 -8.000000 -4.0000000
5: LGA DTW DL 2231 10.861486 17.9054054
---
168: EWR DTW DL 2509 -12.037037 -2.5555556
169: EWR DTW EV 5283 -14.947368 -1.9473684
170: EWR DTW EV 4886 -18.750000 -6.0000000
171: LGA DTW EV 5596 -12.000000 -10.0000000
172: EWR DTW EV 4911 -10.000000 -3.5000000
We could have instead defined something like `:=`(arr_mean = mean(arr_delay), dep_delay = mean(dep_delay))
but as the number of elements get larger, using .SDcols
is cleaner.
Columns can also be specified as ranges in .SDcols
.
== "DTW", lapply(.SD, mean),
nyc14[dest = .(origin, dest, carrier, flight),
by = arr_delay:dep_delay
.SDcols ]
origin dest carrier flight arr_delay arr_time dep_delay
1: LGA DTW DL 181 5.694118 1120.9706 11.8176471
2: LGA DTW DL 731 -2.637795 754.5906 -0.6653543
3: LGA DTW DL 1131 5.774648 1450.4366 9.3450704
4: JFK DTW DL 2184 -8.000000 1815.0000 -4.0000000
5: LGA DTW DL 2231 10.861486 1872.9527 17.9054054
---
168: EWR DTW DL 2509 -12.037037 1412.0000 -2.5555556
169: EWR DTW EV 5283 -14.947368 1239.1579 -1.9473684
170: EWR DTW EV 4886 -18.750000 1398.2500 -6.0000000
171: LGA DTW EV 5596 -12.000000 1018.0000 -10.0000000
172: EWR DTW EV 4911 -10.000000 1087.0000 -3.5000000
Because .SDcols
takes a character vector it is often useful to construct it programmatically from the names()
of the data.table object.
<- names(nyc14)[ grep("delay", names(nyc14)) ]
delay_cols <-
delay_stats == "DTW",
nyc14[dest c(lapply(.SD, mean),
lapply(.SD, sd)),
= .(carrier),
keyby = delay_cols]
.SDcols delay_stats
carrier dep_delay arr_delay delay30 dep_delay arr_delay delay30
1: DL 9.406785 3.074960 0.1350565 37.47738 40.83548 0.3418392
2: EV 17.391414 13.052399 0.2468434 43.13941 46.21437 0.4313110
3: MQ 4.903080 4.510143 0.1367393 26.47499 30.55465 0.3437011
<- c(key(delay_stats),
new_names paste(delay_cols, "mean", sep = "_"),
paste(delay_cols, "sd", sep = "_"))
setnames(delay_stats, new_names)
delay_stats
carrier dep_delay_mean arr_delay_mean delay30_mean dep_delay_sd arr_delay_sd
1: DL 9.406785 3.074960 0.1350565 37.47738 40.83548
2: EV 17.391414 13.052399 0.2468434 43.13941 46.21437
3: MQ 4.903080 4.510143 0.1367393 26.47499 30.55465
delay30_sd
1: 0.3418392
2: 0.4313110
3: 0.3437011
In this example, note that setnames()
like all set*
functions in data.table updates in place by reference.
One of the goals of the data.table
package is to use memory efficiently. This is achieved in part by preferring “shallow” copies by reference over “deep copies” by value when appropriate. When an object is copied by reference it shares physical memory address with the object it is copied from. This is more efficient, but may lead to confusion as changing the value in memory also changes what is pointed to by both objects.
In the example below, we create a data.table
DT1
and then assign it to DT2
. Typical R objects would be copied by value using “copy on modify” semantics, but DT2
is copied by reference. We can ask for a copy by value explicitly using copy()
.
<- data.table(a = 5:1, b = letters[5:1])
DT1 <- DT1 # Copy by reference
DT2 <- copy(DT1) # Copy by value
DT3 rbind(address(DT1),
address(DT2),
address(DT3))
[,1]
[1,] "0x127504200"
[2,] "0x127504200"
[3,] "0x12749c200"
:= 2 * a] # Create a new column
DT1[, c DT1
a b c
1: 5 e 10
2: 4 d 8
3: 3 c 6
4: 2 b 4
5: 1 a 2
DT2
a b c
1: 5 e 10
2: 4 d 8
3: 3 c 6
4: 2 b 4
5: 1 a 2
DT3
a b
1: 5 e
2: 4 d
3: 3 c
4: 2 b
5: 1 a
rbind(address(DT1),
address(DT2),
address(DT3))
[,1]
[1,] "0x127504200"
[2,] "0x127504200"
[3,] "0x12749c200"
After updating DT1
to include a new column, C
, the column appears in DT2
as well because DT1
and DT2
refer to the same object. This is in stark constrast to the majority of R which does lazy evaluation - it references by copy until an object is modified, then creates a copy by value.
<- data.frame(a = 5:1, b = letters[5:1])
df <- df
df2 rbind(address(df),
address(df2))
[,1]
[1,] "0x140473b08"
[2,] "0x140473b08"
$c <- 2*df$a
df df
a b c
1 5 e 10
2 4 d 8
3 3 c 6
4 2 b 4
5 1 a 2
df2
a b
1 5 e
2 4 d
3 3 c
4 2 b
5 1 a
rbind(address(df),
address(df2))
[,1]
[1,] "0x127a98488"
[2,] "0x140473b08"
In the last example above we used reference semantics to create a new column in DT1
without copying the other other columns and reassigning to a new DT1
object.
One way in which this is useful is to modify subsets of a data.table
without re-allocating the entire thing. As an example, let’s truncate all arr_delay
below 0.
range(nyc14$arr_delay)
[1] -112 1494
The tracemem
function tracks an object and prints a message whenever it is copied. Additionally, it tells us the memory location of an object.
tracemem(nyc14$arr_delay)
[1] "<0x149a18000>"
< 0, arr_delay := 0]
nyc14[arr_delay range(nyc14$arr_delay)
[1] 0 1494
untracemem(nyc14$arr_delay)
# Ending memory location
address(nyc14$arr_delay)
[1] "0x149a18000"
So what happened here - we modified a data.table
object in place, without copying it. Let’s see what would happen on a data.frame
.
<- as.data.frame(nyc14)
nyc14df tracemem(nyc14df$arr_delay)
[1] "<0x108d28000>"
$arr_delay[nyc14df$arr_delay < 0] <- 0 nyc14df
tracemem[0x108d28000 -> 0x109ed8000]: eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers withCallingHandlers handle_error process_file <Anonymous> <Anonymous> execute .main
tracemem[0x109ed8000 -> 0x10a1c0000]: eval eval eval_with_user_handlers withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir in_input_dir eng_r block_exec call_block process_group.block process_group withCallingHandlers withCallingHandlers handle_error process_file <Anonymous> <Anonymous> execute .main
untracemem(nyc14$arr_delay)
address(nyc14df$arr_delay)
[1] "0x10a1c0000"
There are two copies of the memory in this simple operation.
We can also delete columns by reference using NULL
:
"month" := NULL]
nyc14[, # i.e. nyc14$month = NULL
It turns out that this is a substantially faster operation than using negative indexing.
microbenchmark(
copy = nyc14b <- copy(nyc14),
null = {
<- copy(nyc14)
nyc14b "year" := NULL]
nyc14b[,
},negindex = {
<- copy(nyc14)
nyc14b -"year"]
nyc14b[,
} )
Unit: microseconds
expr min lq mean median uq max neval
copy 644.479 1298.572 2248.583 1489.017 2992.221 31107.89 100
null 773.055 1453.430 4234.325 1643.629 3596.418 33203.60 100
negindex 4277.079 5696.130 8742.005 6958.520 8383.373 37336.04 100
It’s hard to demonstrate here because of the need to include the copy()
inside, but according to the documentation, setting a column to NULL
actually takes identically 0 time. Try running this a few times on your own - you’ll see that most times, the copy
and null
have nearly identical timings
We can use this with by
to accomplish tasks such as adding a column showing the maximum departure delay by flight.
:= max(dep_delay), by = .(carrier, flight)][] nyc14[, max_dep_delay
year day dep_time dep_delay arr_time arr_delay cancelled carrier
1: 2014 1 914 14 1238 13 0 AA
2: 2014 1 1157 -3 1523 13 0 AA
3: 2014 1 1902 2 2224 9 0 AA
4: 2014 1 722 -8 1014 0 0 AA
5: 2014 1 1347 2 1706 1 0 AA
---
253312: 2014 31 1459 1 1747 0 0 UA
253313: 2014 31 854 -5 1147 0 0 UA
253314: 2014 31 1102 -8 1311 16 0 MQ
253315: 2014 31 1106 -4 1325 15 0 MQ
253316: 2014 31 824 -5 1045 1 0 MQ
tailnum flight origin dest air_time distance hour min delay30
1: N338AA 1 JFK LAX 359 2475 9 14 FALSE
2: N335AA 3 JFK LAX 363 2475 11 57 FALSE
3: N327AA 21 JFK LAX 351 2475 19 2 FALSE
4: N3EHAA 29 LGA PBI 157 1035 7 22 FALSE
5: N319AA 117 JFK LAX 350 2475 13 47 FALSE
---
253312: N23708 1744 LGA IAH 201 1416 14 59 FALSE
253313: N33132 1758 EWR IAH 189 1400 8 54 FALSE
253314: N827MQ 3591 LGA RDU 83 431 11 2 FALSE
253315: N511MQ 3592 LGA DTW 75 502 11 6 FALSE
253316: N813MQ 3599 LGA SDF 110 659 8 24 FALSE
max_dep_delay
1: 156
2: 284
3: 848
4: 89
5: 248
---
253312: 385
253313: 42
253314: 240
253315: 68
253316: 121
The last set of empty brackets above is a short-hand for a subsequent call to print(nyc14)
.
Above we used “indexing” in a generic sense to mean “subsetting”. What we mean by “indexing” here is more specific and technical: we create an indexed data table by designating specific columns as keys and sorting the table by these keys to create more efficient look-ups and aggregations. This is similar to keys in SQL, or how Stata stored what variables a dataset was sorted on.
We saw earlier the keyby=
argument for grouping i
and j
operations. The “key” we generate performs a similar role to row names in a data.frame
- a way to refer to a row by name rather than position.
Earlier we saw a key being added by keyby=
. The more explicit way to add a key is with the setkey()
function.
setkey(nyc14, origin) #also, setkeyv(nyc14, "origin") if character is preferred.
key(nyc14)
[1] "origin"
After a key has been set, we can subset in the i
-statement using lists:
"LGA")] nyc14[.(
year day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum
1: 2014 1 722 -8 1014 0 0 AA N3EHAA
2: 2014 1 553 -7 739 0 0 AA N3KHAA
3: 2014 1 623 -7 815 0 0 AA N3BSAA
4: 2014 1 652 -8 833 0 0 AA N560AA
5: 2014 1 738 -2 940 15 0 AA N3GMAA
---
84429: 2014 31 609 24 843 0 0 UA N16709
84430: 2014 31 1459 1 1747 0 0 UA N23708
84431: 2014 31 1102 -8 1311 16 0 MQ N827MQ
84432: 2014 31 1106 -4 1325 15 0 MQ N511MQ
84433: 2014 31 824 -5 1045 1 0 MQ N813MQ
flight origin dest air_time distance hour min delay30 max_dep_delay
1: 29 LGA PBI 157 1035 7 22 FALSE 89
2: 301 LGA ORD 142 733 5 53 FALSE 65
3: 303 LGA ORD 143 733 6 23 FALSE 126
4: 305 LGA ORD 139 733 6 52 FALSE 263
5: 307 LGA ORD 145 733 7 38 FALSE 140
---
84429: 1714 LGA IAH 198 1416 6 9 FALSE 53
84430: 1744 LGA IAH 201 1416 14 59 FALSE 385
84431: 3591 LGA RDU 83 431 11 2 FALSE 240
84432: 3592 LGA DTW 75 502 11 6 FALSE 68
84433: 3599 LGA SDF 110 659 8 24 FALSE 121
rather than having to specifically refer to the column origin
, e.g. origin == "LGA"
, as we’ve been doing.
We can have more than one column contribute to the order used to form the key.
# key by origin and destination
setkey(nyc14, origin, dest)
key(nyc14)
[1] "origin" "dest"
"LGA", "ATL")] nyc14[.(
year day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum
1: 2014 1 1810 10 2054 10 0 DL N930DL
2: 2014 1 1657 -3 1940 0 0 DL N965DL
3: 2014 1 1255 -5 1521 0 0 DL N994DL
4: 2014 1 1558 -1 1835 0 0 DL N955DL
5: 2014 1 603 3 815 0 0 DL N392DA
---
6921: 2014 31 1708 -2 1933 0 0 WN N434WN
6922: 2014 31 1533 -2 1748 0 0 WN N797MX
6923: 2014 31 1259 4 1538 0 0 WN N298WN
6924: 2014 31 929 -1 1158 0 0 WN N243WN
6925: 2014 31 2025 -5 2252 0 0 WN N913WN
flight origin dest air_time distance hour min delay30 max_dep_delay
1: 61 LGA ATL 126 762 18 10 FALSE 325
2: 221 LGA ATL 129 762 16 57 FALSE 375
3: 781 LGA ATL 121 762 12 55 FALSE 341
4: 847 LGA ATL 130 762 15 58 FALSE 416
5: 904 LGA ATL 116 762 6 3 FALSE 164
---
6921: 397 LGA ATL 114 762 17 8 FALSE 156
6922: 419 LGA ATL 115 762 15 33 FALSE 168
6923: 538 LGA ATL 116 762 12 59 FALSE 4
6924: 706 LGA ATL 112 762 9 29 FALSE 43
6925: 2969 LGA ATL 112 762 20 25 FALSE 189
Note that each element of the list
defined by .()
corresponds to a key - in this example, it is equivalent to origin == "LGA" & dest == "ATL"
. You can combine the list
with c()
as well.
c("LGA", "EWR"), "ATL")] nyc14[.(
year day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum
1: 2014 1 1810 10 2054 10 0 DL N930DL
2: 2014 1 1657 -3 1940 0 0 DL N965DL
3: 2014 1 1255 -5 1521 0 0 DL N994DL
4: 2014 1 1558 -1 1835 0 0 DL N955DL
5: 2014 1 603 3 815 0 0 DL N392DA
---
11103: 2014 31 741 -4 958 0 0 DL N926AT
11104: 2014 31 1521 50 1752 43 0 UA N15712
11105: 2014 31 555 -5 805 0 0 UA N14731
11106: 2014 31 1159 -9 1426 0 0 UA N817UA
11107: 2014 31 811 11 1027 0 0 UA N33203
flight origin dest air_time distance hour min delay30 max_dep_delay
1: 61 LGA ATL 126 762 18 10 FALSE 325
2: 221 LGA ATL 129 762 16 57 FALSE 375
3: 781 LGA ATL 121 762 12 55 FALSE 341
4: 847 LGA ATL 130 762 15 58 FALSE 416
5: 904 LGA ATL 116 762 6 3 FALSE 164
---
11103: 807 EWR ATL 108 746 7 41 FALSE 197
11104: 1554 EWR ATL 113 746 15 21 TRUE 344
11105: 1614 EWR ATL 111 746 5 55 FALSE 4
11106: 606 EWR ATL 119 746 11 59 FALSE 224
11107: 1162 EWR ATL 109 746 8 11 FALSE 275
This is all flights with origin
LGA or EWR, and destination ATL.
To only refer to the first key, pass a single argument to the list
:
"LGA")] nyc14[.(
year day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum
1: 2014 6 1059 -6 1332 0 0 EV N760EV
2: 2014 7 1122 2 1352 1 0 EV N197PQ
3: 2014 11 1033 0 1245 0 0 EV N391CA
4: 2014 1 1810 10 2054 10 0 DL N930DL
5: 2014 1 1657 -3 1940 0 0 DL N965DL
---
84429: 2014 29 630 0 849 9 0 MQ N530MQ
84430: 2014 29 1454 -5 1658 0 0 MQ N517MQ
84431: 2014 30 626 -4 817 0 0 MQ N542MQ
84432: 2014 30 1452 -7 1703 0 0 MQ N514MQ
84433: 2014 31 625 -5 829 0 0 MQ N501MQ
flight origin dest air_time distance hour min delay30 max_dep_delay
1: 5624 LGA AGS 110 678 10 59 FALSE 19
2: 5625 LGA AGS 111 678 11 22 FALSE 2
3: 5632 LGA AGS 102 678 10 33 FALSE 0
4: 61 LGA ATL 126 762 18 10 FALSE 325
5: 221 LGA ATL 129 762 16 57 FALSE 375
---
84429: 3547 LGA XNA 174 1147 6 30 FALSE 105
84430: 3553 LGA XNA 162 1147 14 54 FALSE 163
84431: 3547 LGA XNA 154 1147 6 26 FALSE 105
84432: 3553 LGA XNA 157 1147 14 52 FALSE 163
84433: 3547 LGA XNA 165 1147 6 25 FALSE 105
To refer only to the second key, you need to get all the first keys.
unique(origin), "LGA")] nyc14[.(
year day dep_time dep_delay arr_time arr_delay cancelled carrier tailnum
1: NA NA NA NA NA NA NA <NA> <NA>
2: NA NA NA NA NA NA NA <NA> <NA>
3: NA NA NA NA NA NA NA <NA> <NA>
flight origin dest air_time distance hour min delay30 max_dep_delay
1: NA EWR LGA NA NA NA NA NA NA
2: NA JFK LGA NA NA NA NA NA NA
3: NA LGA LGA NA NA NA NA NA NA
unique(origin)
returns a vector of all origins, so it is essentially matching the origin
key to any input.
We can combine this with j
and by
statements.
# Find the median departure delay for all flights to DTW
unique(origin), "DTW"),
nyc14[.(med_dep_delay = as.numeric(median(dep_delay)), n = .N),
.(= .(origin, dest, flight)] |>
by order(origin, med_dep_delay, -n)] _[
origin dest flight med_dep_delay n
1: EWR DTW 3810 -12.0 1
2: EWR DTW 5823 -10.0 1
3: EWR DTW 355 -8.0 1
4: EWR DTW 4132 -8.0 1
5: EWR DTW 4886 -7.5 4
---
168: LGA DTW 2352 32.0 1
169: LGA DTW 2801 36.0 1
170: LGA DTW 2458 76.0 4
171: LGA DTW 5437 148.0 1
172: LGA DTW 2099 243.0 1
In data.table
when we designate columns as keys, the rows are re-ordered by reference in increasing order. This physically reorders the rows but uses the same locations in memory for the columns.
First, let’s do a toy example. Let’s generate a large data set, and time subsetting the data with and without a key.
<- 2e8
n <- data.table(group = sample(1:26, n, replace = TRUE),
DT x = rnorm(n))
print(object.size(DT), units = "GB", digits = 2)
2.24 Gb
Let’s subset a single group.
system.time(DT[group == 9])
user system elapsed
0.901 0.221 1.128
Next, set a key and repeat subsetting.
setkey(DT, group)
system.time(DT[.(9), ])
user system elapsed
0.038 0.007 0.045
Note that the speed-up here is due to the key, not the way we index:
system.time(DT[group == 9])
user system elapsed
0.026 0.005 0.032
Here, even with the non-key approach to indexing, because we do have group
keyed, we still see the speed improvement.
Setting the key does add some time:
setkey(DT, NULL)
system.time(setkey(DT, group))
user system elapsed
0.942 0.073 1.015
<- 2e8
n <- data.table(group = sample(1:26, n, replace = TRUE))
DT <- DT[ , .(count = rpois(.N, group)), by = group]
DT 1:2,] DT[
group count
1: 3 4
2: 3 4
print(object.size(DT), units = "GB", digits = 2)
1.49 Gb
group
identifies group membership, and count
is a Poisson random variable associated with each observation.
To test, we’ll calculate the average count
within each group
to obtain \(\hat{\lambda}\), focusing only on the first and last group.
First, an approach without keys.
## Unkeyed approach
key(DT)
NULL
<- system.time({
tm1 <- DT[group == 1 | group == 26,
ans1 lambda_hat = mean(count)),
.(= group]
by })
Next, we’ll set the key and use the keyed approach.
<- system.time({
tm_key setkey(DT, group)
})key(DT)
[1] "group"
# keyed approach
<- system.time({
tm2 <- DT[.(c(1, 26)),
ans2 lambda_hat = mean(count)),
.(= group]
by })
ans1
group lambda_hat
1: 1 0.9999069
2: 26 25.9993898
ans2
group lambda_hat
1: 1 0.9999069
2: 26 25.9993898
rbind(naive = tm1, addkey = tm_key, keyed = tm2)[, "elapsed"]
naive addkey keyed
1.225 1.469 0.121
So while adding the key was slow; the actual processing time of the oepration is an order of magnitude smaller. For a single operation, perhaps keying isn’t worth it, but the more operations you do, the more you gain.
Let’s compare against base R.
<- system.time({
tm3 <- DT[DT$group == 1 | DT$group == 26]
DTsmall <- aggregate(DTsmall$count, by = list(DTsmall$group), FUN = mean)
ans3
}) ans3
Group.1 x
1 1 0.9999069
2 26 25.9993898
And the tidyverse.
<- tibble::as_tibble(DT)
dtibble <- system.time({
tm4 |>
dtibble ::filter(group == 1 | group == 26) |>
dplyr::group_by(group) |>
dplyr::summarize(lambda_hat = mean(count)) |>
dplyr::ungroup() -> ans4
dplyr
}) ans4
# A tibble: 2 × 2
group lambda_hat
<int> <dbl>
1 1 1.00
2 26 26.0
rbind(naive = tm1, addkey = tm_key, keyed = tm2,
baseR = tm3, tidy = tm4)[, "elapsed"]
naive addkey keyed baseR tidy
1.225 1.469 0.121 3.078 1.778
rm(DT) # clean up the large object
data.table
’sThere is often a need to join information stored across two or more data frames. In R we have previously used dplyr::left_join()
or similar *_join()
functions for this. In base R two data.frames can be joined using the S3 generic merge()
which dispatches the merge.data.frame()
method.
merge()
The data.table
package defines a merge.data.table()
method. In addition to the tables to join, there are two key parameters: by
and all
.
We use by
to specify which columns to join on.
== "JFK",
nyc14[ origin
.N,|>
.(carrier)] merge(x=_, nycflights13::airlines,
by = "carrier", all = TRUE) |>
order(-N)] _[
carrier N name
1: B6 34220 JetBlue Airways
2: DL 18860 Delta Air Lines Inc.
3: AA 11923 American Airlines Inc.
4: MQ 5444 Envoy Air
5: UA 3924 United Air Lines Inc.
6: VX 3138 Virgin America
7: US 2645 US Airways Inc.
8: EV 1069 ExpressJet Airlines Inc.
9: HA 260 Hawaiian Airlines Inc.
10: 9E NA Endeavor Air Inc.
11: AS NA Alaska Airlines Inc.
12: F9 NA Frontier Airlines Inc.
13: FL NA AirTran Airways Corporation
14: OO NA SkyWest Airlines Inc.
15: WN NA Southwest Airlines Co.
16: YV NA Mesa Airlines Inc.
The by
variables must exist in both tables. If not, use by.x
and by.y
instead.
We can specify inner (all=FALSE
), left (all.x=TRUE
), right (all.y=TRUE
), or full (all=TRUE
) joins using the all*
parameters.
The resulting merge contains all columns from both tables with duplicate names not used in by
renamed using a suffix, i.e. col.x
or col.y
.
<- data.table(id = 0:4, letter = letters[26 - 0:4])
x <- data.table(id = 1:5, letter = LETTERS[26 - 1:5])
y merge(x, y, by = "id", all = TRUE)
id letter.x letter.y
1: 0 z <NA>
2: 1 y Y
3: 2 x X
4: 3 w W
5: 4 v V
6: 5 <NA> U
[]
There may be times where you wish to perform some computation using columns from two tables without the need for an explicit merge first.
In these cases you can use the DT1[DT2, ]
syntax for joins.
= "id"] x[y, , on
id letter i.letter
1: 1 y Y
2: 2 x X
3: 3 w W
4: 4 v V
5: 5 <NA> U
= "id"] y[x, , on
id letter i.letter
1: 0 <NA> z
2: 1 Y y
3: 2 X x
4: 3 W w
5: 4 V v
= "id"] x[y, .(id, letter), on
id letter
1: 1 y
2: 2 x
3: 3 w
4: 4 v
5: 5 <NA>
= "id", nomatch = 0L] x[y, .(id, letter), on
id letter
1: 1 y
2: 2 x
3: 3 w
4: 4 v
If we are matching on set keys, we do not need to provide these as on
.
setkey(x, id)
x[y]## Fails because we have no key set for y
y[x] setkey(y, id)
y[x]
Here is the previous flights example using this syntax.
<- data.table(nycflights13::airlines)
airlines == "JFK",
nyc14[origin
.N,= carrier] |>
keyby = 0L] |>
_[airlines, nomatch order(-N)] _[
carrier N name
1: B6 34220 JetBlue Airways
2: DL 18860 Delta Air Lines Inc.
3: AA 11923 American Airlines Inc.
4: MQ 5444 Envoy Air
5: UA 3924 United Air Lines Inc.
6: VX 3138 Virgin America
7: US 2645 US Airways Inc.
8: EV 1069 ExpressJet Airlines Inc.
9: HA 260 Hawaiian Airlines Inc.