Lecture 8: Tidy evaluation
Lecture learning objectives:#
By the end of this lecture and worksheet 8, students should be able to:
Describe data masking as it relates to the dplyr functions. Explain the problems it solves for interactive programming and the problems it creates for programming in a non-interactive setting.
Explain what the enquo() function and the !! operator do in R in the context of data masking as it relates to the dplyr functions.
Use the {{ (read: curly curly) operator (abstracts quote-and-unquote into a single interpolation step), the := (read: walrus) operator in R, and ... (read: pass the dots) to write functions which wrap the dplyr functions.
library(gapminder)
library(tidyverse)
options(repr.matrix.max.rows = 5)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5 ✔ purrr 0.3.4
✔ tibble 3.1.4 ✔ dplyr 1.0.7
✔ tidyr 1.1.3 ✔ stringr 1.4.0
✔ readr 2.0.1 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
What Metaprogramming lets you do in R#
write library(purrr) instead of library("purrr")
enable plot(x, sin(x)) to automatically label the axes with x and sin(x)
create a model object via lm(y ~ x1 + x2, data = df)
and much much more (that you will see in Data Wrangling as we explore the tidyverse)
What is metaprogramming?#
Code that writes code/code that mutates code.
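To make "code that writes code" concrete, here is a minimal base R sketch. It is only an illustration (the helper name show_code is made up for this example): an expression is captured without being run, modified, and then evaluated.

```r
# Capture an expression without evaluating it
expr <- quote(1 + 2 * 3)
class(expr)    # "call"
eval(expr)     # 7

# Code is data: swap the outer `+` for `-` before running it
expr[[1]] <- as.name("-")
deparse(expr)  # "1 - 2 * 3"
eval(expr)     # -5

# A function can see the code its caller typed (hypothetical helper).
# This is the mechanism plot() relies on to label axes.
show_code <- function(x) deparse(substitute(x))
show_code(sin(x))  # "sin(x)", even though x is never evaluated
```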
Our narrow focus on metaprogramming for this course:#
Tidy evaluation
Why focus on tidy evaluation#
In the rest of MDS you will be relying on functions from the tidyverse to do a lot of:
data wrangling
statistics
data visualization
Tidy evaluation#
The functions from the tidyverse are beautiful to use interactively.
gapminder
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
<fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.0386 |
Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.7093 |
With base R:
gapminder[gapminder$country == "Canada" & gapminder$year == 1952, ]
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
<fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
Canada | Americas | 1952 | 68.75 | 14785584 | 11367.16 |
In the tidyverse:
filter(gapminder, country == "Canada", year == 1952)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
<fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
Canada | Americas | 1952 | 68.75 | 14785584 | 11367.16 |
How does that even work?#
When functions like filter are called, evaluation of their arguments is delayed and the columns of the data frame are temporarily promoted to first-class objects, so that the data frame masks the workspace (global environment).
When this happens, R can find the relevant columns for the computation.
This is referred to as data masking.
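The idea can be tried outside of dplyr. The sketch below uses quo() and eval_tidy() from rlang on a small made-up data frame (df, x and threshold are assumptions for illustration). It shows the same expression evaluated with and without a data mask. It is not a description of how filter is implemented internally.

```r
library(rlang)

df <- data.frame(x = c(1, 5, 10))
x <- 100        # a workspace variable with the same name as a column
threshold <- 4  # exists only in the workspace

# Quote the expression, which delays its evaluation
e <- quo(x > threshold)

# With df as the data mask, `x` is found in df first,
# while `threshold` falls through to the workspace
eval_tidy(e, data = df)  # FALSE TRUE TRUE

# Without a mask, `x` refers to the workspace variable
eval_tidy(e)             # TRUE
```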
Back to our example:#
What is going on here?
code evaluation is delayed
the filter function quotes the columns country and year
the filter function then creates a data mask (to mingle variables from the environment and the data frame)
the columns country and year are unquoted and evaluated within the data mask
filter(gapminder, country == "Canada", year == 1952)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
<fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
Canada | Americas | 1952 | 68.75 | 14785584 | 11367.16 |
Trade off of lovely interactivity of tidyverse functions…#
programming with them can be more challenging.#
Let’s try writing a function which wraps filter for gapminder:
filter_gap <- function(col, val) {
filter(gapminder, col == val)
}
filter_gap(country, "Canada")
Error: object 'country' not found
Traceback:
1. filter_gap(country, "Canada")
2. filter(gapminder, col == val) # at line 4 of file <text>
3. filter.tbl_df(gapminder, col == val)
4. filter_impl(.data, quo)
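Why does filter work with non-quoted variable names, but our function filter_gap fails?
Inside filter_gap, filter quotes the expression col == val exactly as written. No column in gapminder is called col, so R falls back to the function argument col, tries to evaluate country as an ordinary object in the workspace, and fails because no such object exists.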
Defining functions using tidy eval’s enquo and !!:#
Use enquo to quote the column names, and then !! to unquote them in context.
filter_gap <- function(col, val) {
col <- enquo(col)
filter(gapminder, !!col == val)
}
filter_gap(country, "Canada")
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
<fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
Canada | Americas | 1952 | 68.75 | 14785584 | 11367.16 |
Canada | Americas | 1957 | 69.96 | 17010154 | 12489.95 |
Canada | Americas | 1962 | 71.30 | 18985849 | 13462.49 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Canada | Americas | 2002 | 79.770 | 31902268 | 33328.97 |
Canada | Americas | 2007 | 80.653 | 33390141 | 36319.24 |
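To see what enquo captures, it can help to look at the quoted argument directly. This is a small sketch using as_label() from rlang. The helper name show_quoted is made up for illustration.

```r
library(rlang)

# Hypothetical helper: report what enquo() captured
show_quoted <- function(col) {
  col <- enquo(col)  # capture the expression the caller typed, unevaluated
  as_label(col)      # convert the captured expression to a string
}

# No object called `country` exists here, yet no error is raised,
# because the argument is quoted rather than evaluated
show_quoted(country)  # "country"
```

Unquoting with !! then inserts the captured expression back into the call, so filter sees country == val rather than col == val.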
Defining functions by embracing column names: {{ }}#
Since version 0.4.0, rlang provides the {{ (pronounced “curly curly”) operator.
It does the same thing as enquo and !! but is (hopefully) easier to use.
filter_gap <- function(col, val) {
filter(gapminder, {{col}} == val)
}
filter_gap(country, "Canada")
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
<fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
Canada | Americas | 1952 | 68.75 | 14785584 | 11367.16 |
Canada | Americas | 1957 | 69.96 | 17010154 | 12489.95 |
Canada | Americas | 1962 | 71.30 | 18985849 | 13462.49 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Canada | Americas | 2002 | 79.770 | 31902268 | 33328.97 |
Canada | Americas | 2007 | 80.653 | 33390141 | 36319.24 |
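As a quick check that the two styles behave the same, the sketch below defines both versions side by side and compares their results on a small made-up tibble (toy, filter_quo and filter_embrace are names invented for this example).

```r
library(dplyr)
library(rlang)

toy <- tibble(g = c("a", "a", "b"), v = c(1, 2, 3))

# Quote, then unquote, in two explicit steps
filter_quo <- function(data, col, val) {
  col <- enquo(col)
  filter(data, !!col == val)
}

# Embrace: quote and unquote in a single step
filter_embrace <- function(data, col, val) {
  filter(data, {{ col }} == val)
}

identical(filter_quo(toy, g, "a"), filter_embrace(toy, g, "a"))  # TRUE
```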
(Optional) Creating functions that handle column names as strings:#
Sometimes you want to pass a column name into a function as a string (often useful when you are programming and have the column names as a character vector).
You can do this by using symbols + unquoting with sym + !!:
# example of what we want to wrap: filter(gapminder, country == "Canada")
filter_gap <- function(col, val) {
col <- sym(col)
filter(gapminder, !!col == val)
}
filter_gap("country", "Canada")
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
<fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
Canada | Americas | 1952 | 68.75 | 14785584 | 11367.16 |
Canada | Americas | 1957 | 69.96 | 17010154 | 12489.95 |
Canada | Americas | 1962 | 71.30 | 18985849 | 13462.49 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Canada | Americas | 2002 | 79.770 | 31902268 | 33328.97 |
Canada | Americas | 2007 | 80.653 | 33390141 | 36319.24 |
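An alternative to sym + !! for string column names is the .data pronoun, subset with [[. This is a different technique from the one shown above, sketched here on a made-up tibble (toy and filter_chr are illustrative names).

```r
library(dplyr)

toy <- tibble(g = c("a", "a", "b"), v = c(1, 2, 3))

# `col` is a string; .data[[col]] looks up the column of that name
# inside the data mask
filter_chr <- function(data, col, val) {
  filter(data, .data[[col]] == val)
}

filter_chr(toy, "g", "b")  # one row: g == "b", v == 3
```

One reason to prefer this form is that it makes explicit that the name refers to a column of the data frame and not to a variable in the workspace.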
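The walrus operator := is needed when assigning values#
:= is needed when the name on the left-hand side of an assignment is supplied through tidy evaluation, because the syntax of R does not allow an expression such as {{ col }} to the left of =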
group_summary <- function(data, group, col, fun) {
data %>%
group_by({{ group }}) %>%
summarise( {{ col }} := fun({{ col }}))
}
group_summary(gapminder, continent, gdpPercap, mean)
continent | gdpPercap |
---|---|
<fct> | <dbl> |
Africa | 2193.755 |
Americas | 7136.110 |
Asia | 7902.150 |
Europe | 14469.476 |
Oceania | 18621.609 |
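The walrus operator also accepts a string on its left-hand side, and embraced arguments can be interpolated into that string. The sketch below assumes dplyr 1.0 or later with a recent rlang. The function name group_summary_named, the "_summary" suffix and the toy data are made up for illustration.

```r
library(dplyr)

# Variant of group_summary that derives a new column name from `col`
group_summary_named <- function(data, group, col, fun) {
  data %>%
    group_by({{ group }}) %>%
    summarise("{{ col }}_summary" := fun({{ col }}))
}

toy <- tibble(g = c("a", "a", "b"), v = c(1, 3, 10))
res <- group_summary_named(toy, g, v, mean)
names(res)  # "g" "v_summary"
```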
Pass the dots when you can#
If you are only passing variables on to a single tidyverse function, and those variables are not used in logical comparisons or in variable assignment, you can get away with passing the dots:
sort_gap <- function(...) {
arrange(gapminder, ...)
}
sort_gap(year)
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
<fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Albania | Europe | 1952 | 55.230 | 1282697 | 1601.0561 |
Algeria | Africa | 1952 | 43.077 | 9279525 | 2449.0082 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Zambia | Africa | 2007 | 42.384 | 11746035 | 1271.2116 |
Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.7093 |
Notes on passing the dots#
the dots should be the last function argument (otherwise any argument placed after them cannot be matched by position and must be named)
they are useful because you can pass multiple arguments through them
For example:
sort_gap <- function(..., x) {
print(x + 1)
arrange(gapminder, ...)
}
sort_gap(year, continent, country, 2)
Error in print(x + 1): argument "x" is missing, with no default
Traceback:
1. sort_gap(year, continent, country, 2)
2. print(x + 1) # at line 2 of file <text>
sort_gap <- function(x, ...) {
print(x + 1)
arrange(gapminder, ...)
}
sort_gap(1, year, continent, country)
[1] 2
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
<fct> | <fct> | <int> | <dbl> | <int> | <dbl> |
Algeria | Africa | 1952 | 43.077 | 9279525 | 2449.008 |
Angola | Africa | 1952 | 30.015 | 4232095 | 3520.610 |
Benin | Africa | 1952 | 38.223 | 1738315 | 1062.752 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.37 |
New Zealand | Oceania | 2007 | 80.204 | 4115771 | 25185.01 |
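An argument that comes after the dots can still be used, as long as the caller supplies it by name. The following sketch uses a made-up tibble and a hypothetical sort_toy function with a default value for the trailing argument.

```r
library(dplyr)

toy <- tibble(g = c("b", "a", "c"), v = c(2, 1, 3))

# `n` comes after the dots, so it can only be matched by name
sort_toy <- function(..., n = 2) {
  toy %>%
    arrange(...) %>%
    head(n)
}

sort_toy(v)         # first two rows after sorting by v
sort_toy(v, n = 1)  # one row: g == "a", v == 1
```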
Pass the dots is not always the solution…#
square_diff_n_select <- function(data, ...) {
data %>%
mutate(... := (... - mean(...))^2) %>%
select(...)
}
square_diff_n_select(mtcars, mpg, mpg:hp)
Error: Problem with `mutate()` input `...`.
✖ object 'hp' not found
ℹ Input `...` is `(... - mean(...))^2`.
Traceback:
1. square_diff_n_select(mtcars, mpg, mpg:hp)
2. data %>% mutate(`:=`(..., (... - mean(...))^2)) %>% select(...) # at line 2-4 of file <text>
3. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
4. eval(quote(`_fseq`(`_lhs`)), env, env)
5. eval(quote(`_fseq`(`_lhs`)), env, env)
6. `_fseq`(`_lhs`)
7. freduce(value, `_function_list`)
8. function_list[[i]](value)
9. mutate(., `:=`(..., (... - mean(...))^2))
10. mutate.data.frame(., `:=`(..., (... - mean(...))^2))
11. mutate_cols(.data, ...)
12. withCallingHandlers({
. for (i in seq_along(dots)) {
. not_named <- (is.null(dots_names) || dots_names[i] ==
. "")
...
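One way to see why this fails is to look at what the dots contain. In the call above they hold two separate expressions, and every use of ... inside the function receives both of them. The sketch below uses enquos() and as_label() from rlang. The helper show_dots is hypothetical.

```r
library(rlang)

# Hypothetical helper: list the expressions captured by the dots
show_dots <- function(...) {
  unname(vapply(enquos(...), as_label, character(1)))
}

labels <- show_dots(mpg, mpg:hp)
length(labels)  # 2
labels[1]       # "mpg"
```

Because mutate receives both mpg and mpg:hp, it tries to evaluate mpg:hp as a computation and not as a column selection, and there is no way to tell it that only the first argument was meant for it.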
When passing in different column names to different functions, embrace multiple column names#
square_diff_n_select <- function(data, col_to_change, col_range) {
data %>%
mutate({{ col_to_change }} := ({{ col_to_change }} - mean({{ col_to_change }}))^2) %>%
select({{col_range}})
}
square_diff_n_select(mtcars, mpg, mpg:hp)
 | mpg | cyl | disp | hp |
---|---|---|---|---|
 | <dbl> | <dbl> | <dbl> | <dbl> |
Mazda RX4 | 0.8269629 | 6 | 160 | 110 |
Mazda RX4 Wag | 0.8269629 | 6 | 160 | 110 |
Datsun 710 | 7.3407129 | 4 | 108 | 93 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
Maserati Bora | 25.914463 | 8 | 301 | 335 |
Volvo 142E | 1.714463 | 4 | 121 | 109 |
Combining embracing with pass the dots:#
square_diff_n_select <- function(data, col_to_change, ...) {
data %>%
mutate({{ col_to_change }} := ({{ col_to_change }} - mean({{ col_to_change }}))^2) %>%
select(..., {{ col_to_change }})
}
square_diff_n_select(mtcars, mpg, drat, carb)
 | drat | carb | mpg |
---|---|---|---|
 | <dbl> | <dbl> | <dbl> |
Mazda RX4 | 3.90 | 4 | 0.8269629 |
Mazda RX4 Wag | 3.90 | 4 | 0.8269629 |
Datsun 710 | 3.85 | 1 | 7.3407129 |
⋮ | ⋮ | ⋮ | ⋮ |
Maserati Bora | 3.54 | 8 | 25.914463 |
Volvo 142E | 4.11 | 2 | 1.714463 |
Programming defensively with tidy evaluation#
You can embrace ({{ }}) the column names in an if + stop statement to check user input when unquoted column names are used as function arguments.
First, we demonstrate how to check if a column is numeric using DATA_FRAME %>% pull({{ COLUMN_NAME }}) to access the column:
check_if_numeric <- function(data, col) {
is.numeric(data %>% pull({{ col }}))
}
check_if_numeric(gapminder, pop)
Next, we add an if + stop to this, to throw an error in our square_diff_n_select function when the column type is not what our function is designed to handle. Here our function works, as the lifeExp column in the gapminder data set is numeric.
square_diff_n_select <- function(data, col_to_change, ...) {
if (!is.numeric(data %>% pull({{ col_to_change }}))) {
stop('col_to_change must be numeric')
}
data %>%
mutate({{ col_to_change }} := ({{ col_to_change }} - mean({{ col_to_change }}))^2) %>%
select(..., {{ col_to_change }})
}
square_diff_n_select(gapminder, lifeExp, country, year)
country | year | lifeExp |
---|---|---|
<fct> | <int> | <dbl> |
Afghanistan | 1952 | 940.8599 |
Afghanistan | 1957 | 849.2818 |
Afghanistan | 1962 | 755.0097 |
⋮ | ⋮ | ⋮ |
Zimbabwe | 2002 | 379.6823 |
Zimbabwe | 2007 | 255.5982 |
Here our function throws an error, as the continent column in the gapminder data set is not numeric.
square_diff_n_select(gapminder, continent, country, year)
Error in square_diff_n_select(gapminder, continent, country, year): col_to_change must be numeric
Traceback:
1. square_diff_n_select(gapminder, continent, country, year)
2. stop("col_to_change must be numeric") # at line 3 of file <text>
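The error message can be made more helpful by naming the offending column. The sketch below is one possible way to do this, using as_label(enquo(col)) from rlang to turn the unquoted argument into text. The function check_numeric_col and the toy data are made up for illustration.

```r
library(dplyr)
library(rlang)

# Hypothetical checker that reports which column failed the test
check_numeric_col <- function(data, col) {
  if (!is.numeric(pull(data, {{ col }}))) {
    stop("`", as_label(enquo(col)), "` must be numeric", call. = FALSE)
  }
  invisible(data)
}

toy <- tibble(g = c("a", "b"), v = c(1.5, 2.5))

check_numeric_col(toy, v)  # passes silently

# Capture the error message without stopping the script
msg <- tryCatch(
  check_numeric_col(toy, g),
  error = function(e) conditionMessage(e)
)
msg  # "`g` must be numeric"
```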
What did we learn?#
data masking and its role in tidy evaluation
programming with tidy-evaluated functions by embracing column names {{ }}
the walrus := operator for assignment when programming with tidy-evaluated functions
more useful examples of pass the dots ...
Attribution:#
Tidy evaluation by Lionel Henry & Hadley Wickham
Tidy eval in context talk by Jenny Bryan
Advanced R by Hadley Wickham