Lecture 8: Tidy evaluation#

Lecture learning objectives:#

By the end of this lecture & worksheet 8, students should be able to:

  • Describe data masking as it relates to the dplyr functions. Explain the problems it solves for interactive programming and the problems it creates for programming in a non-interactive setting

  • Explain what the enquo() function and the !! operator do in R in the context of data masking as it relates to the dplyr functions

  • Use the {{ (read: curly curly) operator (abstracts quote-and-unquote into a single interpolation step), the := (read: walrus) operator in R, and ... (read: pass the dots) to write functions which wrap the dplyr functions

library(gapminder)
library(tidyverse)
options(repr.matrix.max.rows = 5)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

 ggplot2 3.3.5      purrr   0.3.4
 tibble  3.1.4      dplyr   1.0.7
 tidyr   1.1.3      stringr 1.4.0
 readr   2.0.1      forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()

What Metaprogramming lets you do in R#

  • write library(purrr) instead of library("purrr")

  • enable plot(x, sin(x)) to automatically label the axes with x and sin(x)

  • create a model object via lm(y ~ x1 + x2, data = df)

  • and much much more (that you will see in Data Wrangling as we explore the tidyverse)
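
As a small illustration of the axis-label bullet above, base R's substitute() lets a function see the code of its argument rather than its value. The helper name label_of is hypothetical, and this is only a sketch of the mechanism, not a claim about how plot() is written internally.

```r
# Hypothetical helper: return the code of the argument as text.
label_of <- function(x) {
  deparse(substitute(x))
}

x <- seq(0, pi, length.out = 5)
label_of(sin(x))
# "sin(x)"
```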

What is metaprogramming?#

Code that writes code/code that mutates code.
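
A minimal base R sketch of "code that mutates code": an expression is captured with quote(), edited like a list, and only then evaluated. The variable name e is arbitrary.

```r
# Capture code without running it.
e <- quote(1 + 2)
eval(e)   # 3

# Code is data: swap the function being called from `+` to `*`.
e[[1]] <- as.name("*")
e         # 1 * 2
eval(e)   # 2
```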

Our narrow focus on metaprogramming for this course:#

Tidy evaluation

Why focus on tidy evaluation#

In the rest of MDS you will be relying on functions from the tidyverse to do a lot of:

  • data wrangling

  • statistics

  • data visualization

Tidy evaluation#

The functions from the tidyverse are beautiful to use interactively.

gapminder
A tibble: 1704 × 6
country      continent  year   lifeExp  pop       gdpPercap
<fct>        <fct>      <int>  <dbl>    <int>     <dbl>
Afghanistan  Asia       1952   28.801   8425333   779.4453
Afghanistan  Asia       1957   30.332   9240934   820.8530
Afghanistan  Asia       1962   31.997   10267083  853.1007
⋮            ⋮          ⋮      ⋮        ⋮         ⋮
Zimbabwe     Africa     2002   39.989   11926563  672.0386
Zimbabwe     Africa     2007   43.487   12311143  469.7093

With base R:

gapminder[gapminder$country == "Canada" & gapminder$year == 1952, ]
A tibble: 1 × 6
country  continent  year   lifeExp  pop       gdpPercap
<fct>    <fct>      <int>  <dbl>    <int>     <dbl>
Canada   Americas   1952   68.75    14785584  11367.16

In the tidyverse:

filter(gapminder, country == "Canada", year == 1952)
A tibble: 1 × 6
country  continent  year   lifeExp  pop       gdpPercap
<fct>    <fct>      <int>  <dbl>    <int>     <dbl>
Canada   Americas   1952   68.75    14785584  11367.16

How does that even work?#

  • When functions like filter are called, evaluation of their arguments is delayed and the columns of the data frame are temporarily promoted to first-class objects

  • This promotion means the data frame masks the workspace (global environment): a column takes precedence over a workspace object of the same name

  • When this happens, R can then find the relevant columns for the computation

This is referred to as data masking
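
The masking behaviour can be seen in isolation with rlang's eval_tidy(), which evaluates a quoted expression with a data frame laid over the environment. This is a minimal sketch with a made-up data frame df, not a description of dplyr's internals.

```r
library(rlang)

df <- data.frame(x = 1:3)
x <- 100   # same name as a column: masked by the data frame
y <- 10    # not a column: found in the workspace

# x comes from df, y comes from the workspace.
masked <- eval_tidy(quo(x + y), data = df)
masked
# 11 12 13
```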

Back to our example:#

What is going on here?

  • code evaluation is delayed

  • the filter function quotes the expressions that refer to the columns country and year

  • the filter function then creates a data mask (to mingle variables from the environment and the data frame)

  • the expressions are unquoted and evaluated within the data mask, where country and year resolve to columns

filter(gapminder, country == "Canada", year == 1952)
A tibble: 1 × 6
country  continent  year   lifeExp  pop       gdpPercap
<fct>    <fct>      <int>  <dbl>    <int>     <dbl>
Canada   Americas   1952   68.75    14785584  11367.16
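
The quote, mask, and evaluate steps listed above can be sketched by hand with rlang. The function my_filter and the data frame toy are hypothetical and heavily simplified; the real filter() does considerably more (grouping, NA handling, multiple conditions).

```r
library(rlang)

# Hypothetical, simplified stand-in for filter().
my_filter <- function(data, condition) {
  condition <- enquo(condition)               # quote the expression
  keep <- eval_tidy(condition, data = data)   # evaluate it inside a data mask
  data[keep, , drop = FALSE]                  # keep rows where it is TRUE
}

toy <- data.frame(
  country = c("Canada", "Canada", "Chile"),
  year = c(1952, 1957, 1952)
)

res <- my_filter(toy, country == "Canada" & year == 1952)
nrow(res)
# 1
```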

Trade off of lovely interactivity of tidyverse functions…#

programming with them can be more challenging.#

Let’s try writing a function which wraps filter for gapminder:

filter_gap <- function(col, val) {
    filter(gapminder, col == val)
}

filter_gap(country, "Canada")
Error: object 'country' not found
Traceback:

1. filter_gap(country, "Canada")
2. filter(gapminder, col == val)   # at line 4 of file <text>
3. filter.tbl_df(gapminder, col == val)
4. filter_impl(.data, quo)

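Why does filter work with non-quoted variable names, but our function filter_gap fail?

Inside filter_gap, filter quotes the expression col == val literally. col is not a column of gapminder, so R falls back to the function argument col, which tries to evaluate country in the calling environment, where no such object exists.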
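
A related hazard is that the same kind of wrapper can also fail silently rather than with an error. In the sketch below (the toy data and the name filter_toy are made up for illustration), a workspace object happens to share its name with the column, so the wrapper runs but returns the wrong rows.

```r
library(dplyr)

toy <- data.frame(
  country = c("Canada", "Canada", "Chile"),
  year = c(1952, 1957, 1952)
)

filter_toy <- function(col, val) {
  filter(toy, col == val)
}

# A workspace object that happens to share its name with the column.
country <- "Canada"

# col now evaluates to the string "Canada", so the condition is the single
# value TRUE and every row is kept, including the Chile row.
res <- filter_toy(country, "Canada")
nrow(res)
# 3
```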

Defining functions using tidy eval’s enquo and !!:#

Use enquo to quote the column names, and then !! to unquote them in context.

filter_gap <- function(col, val) {
    col <- enquo(col)
    filter(gapminder, !!col == val)
}

filter_gap(country, "Canada")
A tibble: 12 × 6
country  continent  year   lifeExp  pop       gdpPercap
<fct>    <fct>      <int>  <dbl>    <int>     <dbl>
Canada   Americas   1952   68.75    14785584  11367.16
Canada   Americas   1957   69.96    17010154  12489.95
Canada   Americas   1962   71.30    18985849  13462.49
⋮        ⋮          ⋮      ⋮        ⋮         ⋮
Canada   Americas   2002   79.770   31902268  33328.97
Canada   Americas   2007   80.653   33390141  36319.24
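
To make the two steps more concrete, the sketch below inspects what enquo() captures and what the condition looks like after injection with !!. The helper inspect_capture is hypothetical and exists only for illustration; note that country does not need to exist anywhere for this to run, because it is never evaluated.

```r
library(rlang)

# Hypothetical helper that reports what enquo() captured and what the
# condition looks like once the captured symbol is injected with !!.
inspect_capture <- function(col) {
  col <- enquo(col)               # quosure: the expression plus its environment
  col_expr <- quo_get_expr(col)   # just the expression, here a symbol
  list(
    label = as_label(col),
    injected = expr(!!col_expr == "Canada")
  )
}

res <- inspect_capture(country)
res$label      # "country"
res$injected   # country == "Canada"
```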

Defining functions by embracing column names: {{ }}#

  • rlang 0.4.0 introduced the {{ (pronounced “curly curly”) operator.

  • Does the same thing as enquo and !! but (hopefully) easier to use.

filter_gap <- function(col, val) {
    filter(gapminder, {{col}} == val)
}

filter_gap(country, "Canada")
A tibble: 12 × 6
country  continent  year   lifeExp  pop       gdpPercap
<fct>    <fct>      <int>  <dbl>    <int>     <dbl>
Canada   Americas   1952   68.75    14785584  11367.16
Canada   Americas   1957   69.96    17010154  12489.95
Canada   Americas   1962   71.30    18985849  13462.49
⋮        ⋮          ⋮      ⋮        ⋮         ⋮
Canada   Americas   2002   79.770   31902268  33328.97
Canada   Americas   2007   80.653   33390141  36319.24
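
As a quick check that the two styles behave the same, the sketch below writes one wrapper each way and compares the results. It uses the built-in mtcars data and takes the data frame as an argument; both function names are made up for this example.

```r
library(dplyr)
library(rlang)

# Quote with enquo(), then unquote with !!.
filter_quote_unquote <- function(data, col, val) {
  col <- enquo(col)
  filter(data, !!col == val)
}

# Embrace: quote and unquote in one step.
filter_embrace <- function(data, col, val) {
  filter(data, {{ col }} == val)
}

a <- filter_quote_unquote(mtcars, cyl, 6)
b <- filter_embrace(mtcars, cyl, 6)
identical(a, b)
# TRUE
```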

(Optional) Creating functions that handle column names as strings:#

Sometimes you want to pass a column name into a function as a string (often useful when you are programming and have the column names as a character vector).

You can do this by using symbols + unquoting with sym + !! :

# example of what we want to wrap: filter(gapminder, country == "Canada")
filter_gap <- function(col, val) {
    col <- sym(col)
    filter(gapminder, !!col == val)
}

filter_gap("country", "Canada")
A tibble: 12 × 6
country  continent  year   lifeExp  pop       gdpPercap
<fct>    <fct>      <int>  <dbl>    <int>     <dbl>
Canada   Americas   1952   68.75    14785584  11367.16
Canada   Americas   1957   69.96    17010154  12489.95
Canada   Americas   1962   71.30    18985849  13462.49
⋮        ⋮          ⋮      ⋮        ⋮         ⋮
Canada   Americas   2002   79.770   31902268  33328.97
Canada   Americas   2007   80.653   33390141  36319.24
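
An alternative to sym + !! that is not used in this lecture is the .data pronoun, which looks a column up by its string name inside the data mask. The sketch below uses mtcars and a hypothetical function name, and takes the data frame as an argument.

```r
library(dplyr)

# Look the column up by name with the .data pronoun.
filter_by_name <- function(data, col, val) {
  filter(data, .data[[col]] == val)
}

res <- filter_by_name(mtcars, "cyl", 6)
nrow(res)
# 7
```

Because the column is addressed through .data, a workspace object that happens to share the column's name cannot be picked up by mistake.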

The walrus operator := is needed when assigning values#

  • := is needed when assigning values with tidy evaluation, because R does not allow an unquoted or embraced name on the left-hand side of =

group_summary <- function(data, group, col, fun) {
    data %>% 
        group_by({{ group }}) %>% 
        summarise( {{ col }} := fun({{ col }}))
}

group_summary(gapminder, continent, gdpPercap, mean)
A tibble: 5 × 2
continent  gdpPercap
<fct>      <dbl>
Africa     2193.755
Americas   7136.110
Asia       7902.150
Europe     14469.476
Oceania    18621.609
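
The walrus operator also allows the new column's name to be built from the embraced argument using a string on the left-hand side. This is a sketch assuming dplyr 1.0 or later; the function name group_mean and the mean_ prefix are choices made for this example.

```r
library(dplyr)

group_mean <- function(data, group, col) {
  data %>%
    group_by({{ group }}) %>%
    # The string on the left is a template: {{ col }} is replaced by the
    # name of the column that was passed in.
    summarise("mean_{{ col }}" := mean({{ col }}))
}

res <- group_mean(mtcars, cyl, mpg)
names(res)
# "cyl" "mean_mpg"
```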

Pass the dots when you can#

If you are only passing variables on to a tidyverse function, and those variables are not used in logical comparisons or in variable assignment, you can get away with passing the dots:

sort_gap <- function(...) {
    arrange(gapminder, ...)
}

sort_gap(year)
A tibble: 1704 × 6
country      continent  year   lifeExp  pop       gdpPercap
<fct>        <fct>      <int>  <dbl>    <int>     <dbl>
Afghanistan  Asia       1952   28.801   8425333   779.4453
Albania      Europe     1952   55.230   1282697   1601.0561
Algeria      Africa     1952   43.077   9279525   2449.0082
⋮            ⋮          ⋮      ⋮        ⋮         ⋮
Zambia       Africa     2007   42.384   11746035  1271.2116
Zimbabwe     Africa     2007   43.487   12311143  469.7093
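
The same pattern works with other dplyr verbs that accept several columns. Below is a minimal sketch with group_by, using mtcars and a hypothetical function name; the .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Any number of grouping columns can be forwarded through the dots.
count_by <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise(n = n(), .groups = "drop")
}

res <- count_by(mtcars, cyl, gear)
sum(res$n)
# 32
```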

Notes on passing the dots#

  • the dots should be the last function argument (arguments placed after the dots can only be matched by name, not by position)

  • they are useful because you can add multiple arguments

For example, when the dots come first, the 2 below is absorbed by the dots and x is never supplied:

sort_gap <- function(..., x) {
    print(x + 1)
    arrange(gapminder, ...)
}

sort_gap(year, continent, country, 2)
Error in print(x + 1): argument "x" is missing, with no default
Traceback:

1. sort_gap(year, continent, country, 2)
2. print(x + 1)   # at line 2 of file <text>

Moving the dots to the end lets x be matched by position:

sort_gap <- function(x, ...) {
    print(x + 1)
    arrange(gapminder, ...)
}

sort_gap(1, year, continent, country)
[1] 2
A tibble: 1704 × 6
country      continent  year   lifeExp  pop       gdpPercap
<fct>        <fct>      <int>  <dbl>    <int>     <dbl>
Algeria      Africa     1952   43.077   9279525   2449.008
Angola       Africa     1952   30.015   4232095   3520.610
Benin        Africa     1952   38.223   1738315   1062.752
⋮            ⋮          ⋮      ⋮        ⋮         ⋮
Australia    Oceania    2007   81.235   20434176  34435.37
New Zealand  Oceania    2007   80.204   4115771   25185.01
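
If an argument does have to come after the dots, it still works as long as the caller names it. This is a sketch with mtcars; the function sort_cars and its argument n are hypothetical.

```r
library(dplyr)

# n comes after the dots, so it must be supplied by name.
sort_cars <- function(..., n) {
  head(arrange(mtcars, ...), n)
}

res <- sort_cars(cyl, mpg, n = 3)
nrow(res)
# 3
```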

Pass the dots is not always the solution…#

square_diff_n_select <- function(data, ...) {
    data %>% 
        mutate(... := (... - mean(...))^2) %>% 
        select(...)
}

square_diff_n_select(mtcars, mpg, mpg:hp)
Error: Problem with `mutate()` input `...`.
✖ object 'hp' not found
ℹ Input `...` is `(... - mean(...))^2`.
Traceback:

1. square_diff_n_select(mtcars, mpg, mpg:hp)
2. data %>% mutate(`:=`(..., (... - mean(...))^2)) %>% select(...)   # at line 2-4 of file <text>
3. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
4. eval(quote(`_fseq`(`_lhs`)), env, env)
5. eval(quote(`_fseq`(`_lhs`)), env, env)
6. `_fseq`(`_lhs`)
7. freduce(value, `_function_list`)
8. function_list[[i]](value)
9. mutate(., `:=`(..., (... - mean(...))^2))
10. mutate.data.frame(., `:=`(..., (... - mean(...))^2))
11. mutate_cols(.data, ...)
12. withCallingHandlers({
  .     for (i in seq_along(dots)) {
  .         not_named <- (is.null(dots_names) || dots_names[i] == 
  .             "")
  ...
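
One way to see why this fails is to look at what the dots actually carry. They hold every extra argument together, so there is no way to hand mpg to mutate and mpg:hp to select separately. The sketch below uses rlang's enquos() and as_label() with a hypothetical helper name.

```r
library(rlang)

# Hypothetical helper: list the expressions captured by the dots.
show_dots <- function(...) {
  unname(vapply(enquos(...), as_label, character(1)))
}

show_dots(mpg, mpg:hp)
# "mpg"    "mpg:hp"
```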

When passing in different column names to different functions, embrace multiple column names#

square_diff_n_select <- function(data, col_to_change, col_range) {
    data %>% 
        mutate({{ col_to_change }} := ({{ col_to_change }} - mean({{ col_to_change }}))^2) %>% 
        select({{col_range}})
}

square_diff_n_select(mtcars, mpg, mpg:hp)
A data.frame: 32 × 4
               mpg        cyl    disp   hp
               <dbl>      <dbl>  <dbl>  <dbl>
Mazda RX4      0.8269629  6      160    110
Mazda RX4 Wag  0.8269629  6      160    110
Datsun 710     7.3407129  4      108    93
⋮              ⋮          ⋮      ⋮      ⋮
Maserati Bora  25.914463  8      301    335
Volvo 142E     1.714463   4      121    109

Combining embracing with pass the dots:#

square_diff_n_select <- function(data, col_to_change, ...) {
    data %>% 
        mutate({{ col_to_change }} := ({{ col_to_change }} - mean({{ col_to_change }}))^2) %>% 
        select(..., {{ col_to_change }})
}

square_diff_n_select(mtcars, mpg, drat, carb)
A data.frame: 32 × 3
               drat   carb   mpg
               <dbl>  <dbl>  <dbl>
Mazda RX4      3.90   4      0.8269629
Mazda RX4 Wag  3.90   4      0.8269629
Datsun 710     3.85   1      7.3407129
⋮              ⋮      ⋮      ⋮
Maserati Bora  3.54   8      25.914463
Volvo 142E     4.11   2      1.714463

Programming defensively with tidy evaluation#

You can embrace {{ the column names in an if + stop statement to check user input when unquoted column names are used as function arguments.

First, we demonstrate how to check if a column is numeric using DATA_FRAME %>% pull({{ COLUMN_NAME }}) to access the column:

check_if_numeric <- function(data, col) {
    is.numeric(data %>% pull({{ col }}))
}

check_if_numeric(gapminder, pop)
TRUE

Next, we add an if + stop to this, to throw an error in our square_diff_n_select function when the column type is not what our function is designed to handle. Here our function works, as the lifeExp column in the gapminder data set is numeric.

square_diff_n_select <- function(data, col_to_change, ...) {
    if (!is.numeric(data %>% pull({{ col_to_change }}))) {
        stop('col_to_change must be numeric')
    }
    
    data %>% 
        mutate({{ col_to_change }} := ({{ col_to_change }} - mean({{ col_to_change }}))^2) %>% 
        select(..., {{ col_to_change }})
}

square_diff_n_select(gapminder, lifeExp, country, year)
A tibble: 1704 × 3
country      year   lifeExp
<fct>        <int>  <dbl>
Afghanistan  1952   940.8599
Afghanistan  1957   849.2818
Afghanistan  1962   755.0097
⋮            ⋮      ⋮
Zimbabwe     2002   379.6823
Zimbabwe     2007   255.5982

Here our function throws an error, as the continent column in the gapminder data set is not numeric.

square_diff_n_select(gapminder, continent, country, year)
Error in square_diff_n_select(gapminder, continent, country, year): col_to_change must be numeric
Traceback:

1. square_diff_n_select(gapminder, continent, country, year)
2. stop("col_to_change must be numeric")   # at line 3 of file <text>
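
The error message can be made more helpful by naming the offending column. The sketch below is one possible variation, not part of the lecture's function: it uses rlang's as_label() on the captured argument, and the helper name check_numeric_col and the built-in iris data are choices made for this example.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: stop with a message that names the column.
check_numeric_col <- function(data, col) {
  if (!is.numeric(pull(data, {{ col }}))) {
    stop("`", as_label(enquo(col)), "` must be numeric", call. = FALSE)
  }
  invisible(TRUE)
}

# Capture the message instead of halting, to show what the user would see.
msg <- tryCatch(
  check_numeric_col(iris, Species),
  error = function(e) conditionMessage(e)
)
msg
# "`Species` must be numeric"
```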

What did we learn?#

  • data masking and its role in tidy evaluation

  • programming with tidy-evaluated functions by embracing column names {{ }}

  • the walrus := operator for assignment when programming with tidy-evaluated functions

  • more useful examples of pass the dots ...

Attribution:#