The “Tidyverse” is a series of R packages developed primarily by Hadley Wickham and his team at Posit (formerly RStudio). In its own words, it is an “opinionated collection of R packages designed for data science”.
Proponents of the tidyverse (so-named because one of the original packages was tidyr) argue that it provides a consistent “grammar” of statistics that is easier for new users to understand. Whether this is true or not remains to be seen.
The primary package in the tidyverse is dplyr, which we will cover below. Additionally, the tibble package introduces the tibble, an extension of the data.frame. A number of other packages are more specialized:
tidyr: Reshaping data (wide to long)
readr: Reading in CSV data
purrr: Functional programming
stringr: String manipulation
forcats: Factor manipulation
Finally, ggplot2 predates the rest of the tidyverse, but is nonetheless now considered part of it. We will be covering ggplot2 in a separate set of notes.
In addition to these formal tidyverse packages, you will find many packages written by other authors which interact with the tidyverse. These typically aren’t as “opinionated” and can be used with or without the rest of the tidyverse. For example,
haven: Reading and writing data from Stata, SAS and SPSS
lubridate: Working with datetime variables
rvest: Web-scraping
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Loading the meta-package tidyverse also attaches the core packages, as listed in the startup message above.
Piping
The tidyverse is heavily invested in the idea of “piping”. The “pipe” operator, %>%, is formally defined in the magrittr package.
x <- rnorm(10)
mean(x)
[1] 0.3076673
x %>% mean
[1] 0.3076673
x %>% mean()
[1] 0.3076673
The left side of the pipe gets included as the first argument of the right side function. Additional arguments can be passed as needed.
x[1] <- NA
x %>% mean(na.rm = TRUE)
[1] 0.4145994
The object can be passed into different slots with the .:
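A minimal sketch of that placeholder, assuming magrittr (or the tidyverse) is attached. The numbers and the regression are made up for illustration. When the dot appears as a direct argument on the right-hand side, the left-hand side is placed there instead of in the first slot.

```r
library(magrittr)

values <- c(1.234, 5.678)

# Default: the left-hand side becomes the first argument, i.e. round(values, 1).
values %>% round(1)

# With the dot, the left-hand side fills the marked slot instead:
# this call is round(3.14159, digits = 2).
digits_res <- 2 %>% round(3.14159, digits = .)

# A common use: functions whose first argument is not the data.
fit <- mtcars %>% lm(mpg ~ wt, data = .)
```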
Since version 4.1.0, R has also shipped its own native pipe, |>. There are a lot of differences between %>% and |>, which this Stack Overflow answer covers in great detail, but in most situations they function identically.
Of note is that |> is substantially faster, primarily because it does simple substitution: x |> mean() simply processes mean(x) without any additional processing. %>% does a lot of additional processing, which does enable some other features, but those features are not commonly used.
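One way to see the substitution for yourself, as a small sketch that needs only base R 4.1.0 or later: quoting a native-pipe expression shows that the parser has already rewritten it into an ordinary call before anything runs.

```r
# The native pipe is rewritten at parse time, so the quoted expression
# is already the plain function call.
native_call <- quote(x |> mean())
print(native_call)

# The two expressions are the same object as far as R is concerned.
same_call <- identical(native_call, quote(mean(x)))
same_call
```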
Should you use pipes?
There is nothing pipes can do that cannot be accomplished without them. The choice of whether to use pipes is (speed considerations of %>% vs |> aside) entirely a matter of personal code style.
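To make the style choice concrete, here is the same made-up calculation written three ways in base R (the native pipe needs R 4.1.0 or later). All three produce the same value; they differ only in how the code reads.

```r
set.seed(1)
x <- rnorm(10)

# Nested calls, read inside-out.
nested <- round(mean(abs(x)), 2)

# Intermediate objects, read top to bottom.
abs_x <- abs(x)
mean_abs <- mean(abs_x)
stepwise <- round(mean_abs, 2)

# Native pipe, read left to right.
piped <- x |> abs() |> mean() |> round(2)
```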
dplyr
We will be using the 2009 RECS data to demonstrate the functionality of dplyr. We’ll approach this as a case study in which we set out to answer the question:
Which state has the highest proportion of single-family attached homes?
There are five main functions that dplyr uses. There are, of course, many more, but these are the most common ones.
select() picks variables based on their names.
filter() picks cases based on their values.
arrange() changes the ordering of the rows.
mutate() adds new variables that are functions of existing variables.
summarize() reduces multiple values down to a single summary.
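As a quick orientation before turning to the real data, here is a sketch of all five verbs on a tiny made-up table. The column names and values are hypothetical and are not taken from RECS.

```r
library(dplyr)

# Hypothetical toy data, not the RECS data.
homes <- tibble(
  state  = c("MI", "MI", "OH", "OH"),
  type   = c("Detached", "Attached", "Detached", "Attached"),
  weight = c(100, 50, 300, 60),
  sqft   = c(1800, 1200, 2000, 1100)
)

result <- homes |>
  select(state, type, weight) |>        # keep three columns
  filter(type == "Attached") |>         # keep rows matching a condition
  mutate(weight_k = weight / 1000) |>   # add a derived column
  arrange(desc(weight))                 # sort rows, largest weight first

total <- homes |>
  summarize(total_weight = sum(weight)) # collapse to a single row
```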
Data cleaning
Let’s begin by creating a clean and tidy data frame with the necessary variables. We’ll need to keep the two variables of interest and the sample weight. Later we will also make use of the replicate weights to compute standard errors.
Here we read in the data, either from a local file or directly from the web.
Note the use of readr rather than read.csv to stay within the tidyverse. recs_tib is now a tibble. We will go into more detail about tibbles later; for now, treat them as mostly just data.frames.
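A minimal sketch of the reading step, using a stand-in file so that it runs anywhere. The file contents here are made up; only the column names follow the RECS variables used in these notes. For the real data, the path (or URL) of the RECS file would go in place of csv_path.

```r
library(readr)
library(tibble)

# Stand-in for the real data: write a tiny CSV to a temporary file.
# The values are invented for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("DOEID,REPORTABLE_DOMAIN,TYPEHUQ,NWEIGHT",
    "1,12,2,2471.68",
    "2,26,2,8599.17"),
  csv_path
)

# read_csv() accepts a local path or a URL and returns a tibble.
recs_demo <- read_csv(csv_path, show_col_types = FALSE)

is_tibble(recs_demo)      # a tibble ...
is.data.frame(recs_demo)  # ... which is still a data.frame
```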
Next, we’ll use select() to drop all but a subset of variables. We’ll need to keep “REPORTABLE_DOMAIN”, which records the state; “TYPEHUQ”, which records the type of housing unit; and “NWEIGHT”, which records the sampling weight for the record, which we’ll need later. (Sampling weights are a massive topic outside the scope of this class; for now just understand that by using these weights in our analysis [e.g. weighted means or weighted least squares], we can obtain estimates which are appropriate for the entire US population.)
# A tibble: 12,083 × 3
state type weight
<chr> <chr> <dbl>
1 MO SingleFamilyDetached 2472.
2 CA SingleFamilyDetached 8599.
3 CT, ME, NH, RI, VT ApartmentMany 8970.
4 IN, OH SingleFamilyDetached 18004.
5 CT, ME, NH, RI, VT SingleFamilyAttached 6000.
6 IA, MN, ND, SD SingleFamilyDetached 4232.
7 NY SingleFamilyDetached 7862.
8 FL SingleFamilyDetached 6297.
9 PA SingleFamilyAttached 12157.
10 MO SingleFamilyDetached 3242.
# ℹ 12,073 more rows
It probably would have been cleaner to write those functions externally. They certainly would be easier to test.
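Here is a sketch of what that could look like, with the decoding written as a standalone helper. The helper name decode_type is hypothetical, the code-to-label mapping is an assumption that should be checked against the RECS codebook, and the rows are made up. The state codes are left undecoded to keep the example short.

```r
library(dplyr)

# Hypothetical helper: map TYPEHUQ codes to labels. The mapping is an
# assumption for illustration; verify it against the RECS codebook.
decode_type <- function(code) {
  labels <- c(
    "1" = "MobileHome",
    "2" = "SingleFamilyDetached",
    "3" = "SingleFamilyAttached",
    "4" = "ApartmentFew",
    "5" = "ApartmentMany"
  )
  unname(labels[as.character(code)])
}

# Made-up rows using the RECS column names.
recs_demo <- tibble(
  DOEID = 1:3,
  REPORTABLE_DOMAIN = c(12, 26, 1),
  TYPEHUQ = c(2, 2, 5),
  NWEIGHT = c(2471.68, 8599.17, 8969.92)
)

recs_small <- recs_demo |>
  # select() can rename while it selects: new_name = old_name.
  select(state = REPORTABLE_DOMAIN, type = TYPEHUQ, weight = NWEIGHT) |>
  mutate(type = decode_type(type))

# Because the helper is a standalone function, it can be tested directly.
stopifnot(decode_type(3) == "SingleFamilyAttached")
```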
Aggregating by group
Recall that we are interested in computing the proportion of each housing type by state. We can do this using a split-apply-combine paradigm. We split the data by a grouping variable, apply a function to each split of the data, then combine the results back into a single dataset.
In dplyr the group_by function handles the split step, typically summarize handles the apply step, and ungroup (optionally) handles the combine step.
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
recs_type_state_sum
# A tibble: 134 × 3
# Groups: state [27]
state type homes
<chr> <chr> <dbl>
1 AK, HI, OR, WA ApartmentFew 374743.
2 AK, HI, OR, WA ApartmentMany 946196.
3 AK, HI, OR, WA MobileHome 384298.
4 AK, HI, OR, WA SingleFamilyAttached 189645.
5 AK, HI, OR, WA SingleFamilyDetached 2833057.
6 AL, KY, MS ApartmentFew 183983.
7 AL, KY, MS ApartmentMany 201344.
8 AL, KY, MS MobileHome 422086.
9 AL, KY, MS SingleFamilyAttached 192720.
10 AL, KY, MS SingleFamilyDetached 3637141.
# ℹ 124 more rows
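Pay close attention to the change in grouping. When summarize() is called, we lose the innermost (last-listed) grouping variable: the data were grouped by state and type, and the result is grouped by state only.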
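The split and apply steps, and the change in grouping, can be seen on a small made-up table. This is a sketch with hypothetical values, not the RECS computation itself.

```r
library(dplyr)

# Made-up weights, not the RECS data.
homes <- tibble(
  state  = c("MI", "MI", "MI", "OH", "OH"),
  type   = c("Detached", "Detached", "Attached", "Detached", "Attached"),
  weight = c(100, 150, 50, 300, 60)
)

by_state_type <- homes |>
  group_by(state, type) |>         # split
  summarize(homes = sum(weight))   # apply: one row per state/type pair

# Two grouping variables went in; only the first is left afterwards.
group_vars(by_state_type)
```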
Finally we can optionally ungroup. The reason it is optional is that a lot of functions are not aware of the grouping, so it rarely is wrong to simply leave it grouped. However, there are issues that can occur when leaving something grouped, so for safety I recommend always ungrouping.
`summarise()` has grouped output by 'state'. You can override using the
`.groups` argument.
recs_types_state_sum
# A tibble: 134 × 3
state type homes
<chr> <chr> <dbl>
1 AK, HI, OR, WA ApartmentFew 374743.
2 AK, HI, OR, WA ApartmentMany 946196.
3 AK, HI, OR, WA MobileHome 384298.
4 AK, HI, OR, WA SingleFamilyAttached 189645.
5 AK, HI, OR, WA SingleFamilyDetached 2833057.
6 AL, KY, MS ApartmentFew 183983.
7 AL, KY, MS ApartmentMany 201344.
8 AL, KY, MS MobileHome 422086.
9 AL, KY, MS SingleFamilyAttached 192720.
10 AL, KY, MS SingleFamilyDetached 3637141.
# ℹ 124 more rows
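To illustrate why leftover grouping matters, here is a sketch on made-up data. The same mutate() call gives different answers depending on whether the table is still grouped, and the `.groups` argument mentioned in the message can remove the grouping directly.

```r
library(dplyr)

# Made-up totals, not the RECS data.
homes <- tibble(
  state  = c("MI", "MI", "OH", "OH"),
  type   = c("Detached", "Attached", "Detached", "Attached"),
  weight = c(250, 50, 300, 60)
)

grouped <- homes |>
  group_by(state, type) |>
  summarize(homes = sum(weight), .groups = "drop_last")  # still grouped by state

# On grouped data, mutate() works within each group:
# these are shares within each state.
within_state <- grouped |> mutate(share = homes / sum(homes))

# After ungroup(), the same mutate() uses the whole table.
overall <- grouped |> ungroup() |> mutate(share = homes / sum(homes))

# .groups = "drop" removes all grouping in the summarize() call itself.
dropped <- homes |>
  group_by(state, type) |>
  summarize(homes = sum(weight), .groups = "drop")
```

The within-state shares sum to 1 for each state, while the ungrouped shares sum to 1 over the whole table, so forgetting about a leftover group silently changes the denominator.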
Reshaping and formatting results for presentation
To proceed, let’s reshape the data to have one row per state. We can do this using the tidyr::pivot_wider() function. The tidyr package is designed for
However, to refer to these non-syntactically valid names, you need to use the backticks.
tb$`123`
[1] 4 5 6
select(tb, `my data`)
# A tibble: 3 × 1
`my data`
<int>
1 7
2 8
3 9
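For comparison, a small sketch of how the two constructors treat such names. The object tb_demo is a hypothetical stand-in with values chosen to match the output above.

```r
library(tibble)

# data.frame() repairs non-syntactic names by default (check.names = TRUE).
df_names <- names(data.frame(`123` = 4:6, `my data` = 7:9))
df_names

# tibble() keeps the names exactly as given.
tb_demo <- tibble(`123` = 4:6, `my data` = 7:9)
tb_names <- names(tb_demo)
tb_names
```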
Lazy evaluation
Tibble columns are created sequentially rather than all at once, so a later column can refer to an earlier one in the same call:
df <- data.frame(a = 1:3)
df$b <- df$a + 2
df
a b
1 1 3
2 2 4
3 3 5
tb <- tibble(a = 1:3, b = a + 2)
tb
# A tibble: 3 × 2
a b
<int> <dbl>
1 1 3
2 2 4
3 3 5
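A short sketch of the contrast, using hypothetical column names. The tibble() call can chain columns, while the equivalent data.frame() call fails because its arguments are evaluated before any column exists.

```r
library(tibble)

# tibble() evaluates its arguments in order, so `b` can use `a`,
# and `c` can use `b`.
tb_seq <- tibble(a = 1:3, b = a + 2, c = b * 10)

# data.frame() evaluates every argument up front, so this fails
# unless an object named `a2` happens to exist elsewhere.
df_attempt <- tryCatch(
  data.frame(a2 = 1:3, b2 = a2 + 2),
  error = function(e) conditionMessage(e)
)
df_attempt
```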
row.names
Tibbles do not support row names.
df
a b
1 1 3
2 2 4
3 3 5
tb
# A tibble: 3 × 2
a b
<int> <dbl>
1 1 3
2 2 4
3 3 5
row.names(df)
[1] "1" "2" "3"
row.names(tb)
[1] "1" "2" "3"
row.names(df) <- letters[21:23]
row.names(tb) <- letters[21:23]
Warning: Setting row names on a tibble is deprecated.
df
a b
u 1 3
v 2 4
w 3 5
tb
# A tibble: 3 × 2
a b
* <int> <dbl>
1 1 3
2 2 4
3 3 5
Watch out for this - it can lead to weird bugs if you try and use row names.
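If the row names carry information, one option is to move them into an ordinary column before or during conversion. A minimal sketch, where the column name "id" is an arbitrary choice for this example:

```r
library(tibble)

df_named <- data.frame(a = 1:3, b = c(3, 4, 5), row.names = c("u", "v", "w"))

# as_tibble() drops row names by default ...
dropped <- as_tibble(df_named)

# ... unless you ask for them to be stored in a column.
kept <- as_tibble(df_named, rownames = "id")

# rownames_to_column() does the same conversion on a data.frame.
kept_df <- rownames_to_column(df_named, var = "id")
```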
Recycling vectors
data.frames can recycle vectors as normal. Tibbles only recycle length-1 vectors. Imagine we’re trying to create a data set containing each pairwise combination of “temperature” and “setting”.
temperature <- c("low", "medium", "high")
setting <- c("forward", "backwards")
results <- rnorm(6)
df <- data.frame(temperature, setting, results)
df
temperature setting results
1 low forward -0.59933897
2 medium backwards -0.06798187
3 high forward 0.32391186
4 low backwards 1.04543315
5 medium forward 0.07345601
6 high backwards 1.85675372
tibble(temperature, setting, results)
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 3: Existing data.
• Size 2: Column at position 2.
ℹ Only values of size one are recycled.
tb <- as_tibble(df)
tb
# A tibble: 6 × 3
temperature setting results
<chr> <chr> <dbl>
1 low forward -0.599
2 medium backwards -0.0680
3 high forward 0.324
4 low backwards 1.05
5 medium forward 0.0735
6 high backwards 1.86
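Rather than relying on recycling, the combinations can be built explicitly. This sketch assumes tidyr is installed and uses its expand_grid() function; the batch column is a made-up example of the one case tibble() does recycle.

```r
library(tibble)
library(tidyr)

temperature <- c("low", "medium", "high")
setting <- c("forward", "backwards")

# Length-1 values are recycled by tibble().
tb_const <- tibble(temperature, batch = 1)

# For every pairwise combination, build the grid explicitly.
# expand_grid() returns a tibble with 3 * 2 = 6 rows.
grid <- expand_grid(temperature = temperature, setting = setting)
grid$results <- rnorm(nrow(grid))
grid
```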
Subsetting
Subsetting a data.frame with [] can yield a vector or a data.frame, whereas subsetting a tibble with [] always yields a tibble.
(Tibbles support drop = TRUE if you do want it to return a vector.)
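A small sketch of the difference, on made-up data:

```r
library(tibble)

df_sub <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb_sub <- as_tibble(df_sub)

class(df_sub[, 1])         # a single column is dropped to a plain vector
class(tb_sub[, 1])         # still a tibble

tb_sub[, 1, drop = TRUE]   # a vector, on request
tb_sub[["x"]]              # [[ ]] always extracts the column itself
```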
Additionally, tibbles do not support partial matching with $:
names(df)
[1] "temperature" "setting" "results"
df$temp
[1] "low" "medium" "high" "low" "medium" "high"
names(tb)
[1] "temperature" "setting" "results"
tb$temp
Warning: Unknown or uninitialised column: `temp`.
NULL
Printing tibbles
The most visible difference between tibbles and data.frames is how much they print by default.
data(starwars)
starwars
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
As you can see, a large number of columns and rows were suppressed from the output. If we were to convert this to a data.frame and print it, it would display the entire result.
# not evaluated!
as.data.frame(starwars)
The print() function can control how a tibble is displayed:
print(starwars, n =3, width =50)
# A tibble: 87 × 14
name height mass hair_color skin_color
<chr> <int> <dbl> <chr> <chr>
1 Luke Skywalk… 172 77 blond fair
2 C-3PO 167 75 <NA> gold
3 R2-D2 96 32 <NA> white, bl…
# ℹ 84 more rows
# ℹ 9 more variables: eye_color <chr>,
# birth_year <dbl>, sex <chr>, gender <chr>,
# homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Note that width controls the actual width of the output, not the number of columns.
Tidyverse vs base R
I personally restrict my use of the tidyverse as much as possible. There are a number of reasons for this, including:
Tidyverse changes its API and deprecates functions very rapidly.
Tidyverse uses nonstandard evaluation frequently.
Tidyverse packages have no issue overloading function names which can lead to confusing results depending on the order in which packages are loaded.
It is often more complex to do basic operations in tidyverse than base R.
Debugging long piped operations is challenging (a pipe problem rather than a specific tidyverse problem).
Using the tidyverse adds a massive set of requirements to your analysis.
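The name-masking point can be made concrete with the filter() conflict reported when the tidyverse is attached. In this sketch (made-up numbers), naming the package explicitly makes the intent independent of load order.

```r
library(dplyr)

nums <- c(1, 2, 3, 4, 5)

# With dplyr attached, a bare filter() refers to dplyr::filter().
# Qualifying the call with its package removes the ambiguity.

# stats::filter(): linear filtering; here a 3-point moving average.
smoothed <- stats::filter(nums, rep(1 / 3, 3))

# dplyr::filter(): keep rows that satisfy a condition.
kept <- dplyr::filter(data.frame(x = nums), x > 3)
```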
The following document explains many of the issues with the tidyverse and why it isn’t necessarily the best way to learn R or move R forward: https://github.com/matloff/TidyverseSkeptic