Lecture 5 - Tidy control flow in R#

Lecture learning objectives:#

By the end of this lecture and worksheet 5, students should be able to:

  • Explain what a grouped data frame is, and how it can be used

  • Use {dplyr}’s group_by + summarize to perform the split-apply-combine approach in R to iterate over and summarize data by groups

  • Identify missing and erroneous values and manage them by removing (via {dplyr}’s filter) or replacing (using {dplyr}’s mutate + case_when)

  • Identify where in R code a commonly used functional, a {purrr} map_* function, could be used in place of for loops and write code to do this

library(gapminder)
library(tidyverse)
options(repr.matrix.max.rows = 10)

Change or remove specific values#

  • Sometimes we’d like to selectively change values. We can do this using control flow within mutate

  • A related task is removing rows where there are NAs, either in a specified column or across the whole data frame or tibble

Selectively change values#

What if we want to programmatically change certain values? For example, update the country name “Cambodia” to its official English name, “Kingdom of Cambodia”?

If we want to change some (but not all) of the values in a column, one function we can use is case_when inside a mutate call:

gapminder %>% 
    mutate(country = case_when(country == "Cambodia" ~ "Kingdom of Cambodia",
                            TRUE ~ country))
Error: Problem with `mutate()` input `country`.
✖ must be a character vector, not a `factor` object.
ℹ Input `country` is `case_when(...)`.
Traceback:

1. gapminder %>% mutate(country = case_when(country == "Cambodia" ~ 
 .     "Kingdom of Cambodia", TRUE ~ country))
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(quote(`_fseq`(`_lhs`)), env
...

Ah! This doesn’t work? What is going on here? Let’s look at the structure of gapminder:

str(gapminder)
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

country (and continent) are factors! Let’s change these to character vectors so we can manipulate the character values they hold. We’ll make a new tibble named gap to do this:

gap <- gapminder %>% 
    mutate(country = as.character(country),
          continent = as.character(continent))

OK, now let’s try that conditional mutate again!

gap <- gap %>% 
    mutate(country = case_when(country == "Cambodia" ~ "Kingdom of Cambodia",
                            TRUE ~ country))
gap
A tibble: 1704 × 6
country      continent  year   lifeExp  pop       gdpPercap
<chr>        <chr>      <int>  <dbl>    <int>     <dbl>
Afghanistan  Asia       1952   28.801    8425333  779.4453
Afghanistan  Asia       1957   30.332    9240934  820.8530
Afghanistan  Asia       1962   31.997   10267083  853.1007
Afghanistan  Asia       1967   34.020   11537966  836.1971
Afghanistan  Asia       1972   36.088   13079460  739.9811
⋮
Zimbabwe     Africa     1987   62.351    9216418  706.1573
Zimbabwe     Africa     1992   60.377   10704340  693.4208
Zimbabwe     Africa     1997   46.809   11404948  792.4500
Zimbabwe     Africa     2002   39.989   11926563  672.0386
Zimbabwe     Africa     2007   43.487   12311143  469.7093

Did it work?

gap %>%
    filter(country == "Kingdom of Cambodia")
A tibble: 12 × 6
country              continent  year   lifeExp  pop       gdpPercap
<chr>                <chr>      <int>  <dbl>    <int>     <dbl>
Kingdom of Cambodia  Asia       1952   39.417    4693836   368.4693
Kingdom of Cambodia  Asia       1957   41.366    5322536   434.0383
Kingdom of Cambodia  Asia       1962   43.415    6083619   496.9136
Kingdom of Cambodia  Asia       1967   45.415    6960067   523.4323
Kingdom of Cambodia  Asia       1972   40.317    7450606   421.6240
⋮
Kingdom of Cambodia  Asia       1987   53.914    8371791   683.8956
Kingdom of Cambodia  Asia       1992   55.803   10150094   682.3032
Kingdom of Cambodia  Asia       1997   56.534   11782962   734.2852
Kingdom of Cambodia  Asia       2002   56.752   12926707   896.2260
Kingdom of Cambodia  Asia       2007   59.723   14131858  1713.7787
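
If the column should stay a factor rather than become a character vector, one alternative (not used in this lecture) is fct_recode from {forcats}, which renames factor levels directly. The sketch below uses a small made-up factor rather than the gapminder data, so the values are illustrative only:

```r
library(forcats)

# A toy factor standing in for gapminder's country column (illustrative values)
country <- factor(c("Cambodia", "Canada", "Cambodia"))

# fct_recode takes new_name = "old_name" pairs and returns a factor
recoded <- fct_recode(country, "Kingdom of Cambodia" = "Cambodia")

levels(recoded)
```

Levels that are not mentioned are left unchanged, so no catch-all case is needed here.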

Selectively change two or more values with case_when#

case_when also lets us easily handle more than two cases. What if, for example, we wanted to change the continents from their English names to their French names?

french_continents <- gap %>% 
  mutate(continent = case_when(continent == "Asia" ~ "Asie", 
                               continent == "Europe" ~ "L'Europe",
                               continent == "Africa" ~ "Afrique", 
                               continent == "Americas" ~ "les amériques", 
                               continent == "Oceania" ~ "Océanie"))
head(french_continents)
A tibble: 6 × 6
country      continent  year   lifeExp  pop       gdpPercap
<chr>        <chr>      <int>  <dbl>    <int>     <dbl>
Afghanistan  Asie       1952   28.801    8425333  779.4453
Afghanistan  Asie       1957   30.332    9240934  820.8530
Afghanistan  Asie       1962   31.997   10267083  853.1007
Afghanistan  Asie       1967   34.020   11537966  836.1971
Afghanistan  Asie       1972   36.088   13079460  739.9811
Afghanistan  Asie       1977   38.438   14880372  786.1134

What if we don’t want to change Asia?

french_continents <- gap %>% 
  mutate(continent = case_when(#continent == "Asia" ~ "Asie", 
                               continent == "Europe" ~ "L'Europe",
                               continent == "Africa" ~ "Afrique", 
                               continent == "Americas" ~ "les amériques", 
                               continent == "Oceania" ~ "Océanie"))
head(french_continents)
A tibble: 6 × 6
country      continent  year   lifeExp  pop       gdpPercap
<chr>        <chr>      <int>  <dbl>    <int>     <dbl>
Afghanistan  NA         1952   28.801    8425333  779.4453
Afghanistan  NA         1957   30.332    9240934  820.8530
Afghanistan  NA         1962   31.997   10267083  853.1007
Afghanistan  NA         1967   34.020   11537966  836.1971
Afghanistan  NA         1972   36.088   13079460  739.9811
Afghanistan  NA         1977   38.438   14880372  786.1134

Uh oh, now Asia is NA! case_when returns NA for any value that matches none of the conditions. To leave such values as they were, we add a final catch-all case, TRUE ~ column_name:

french_continents <- gap %>% 
  mutate(continent = case_when(#continent == "Asia" ~ "Asie", 
                               continent == "Europe" ~ "L'Europe",
                               continent == "Africa" ~ "Afrique", 
                               continent == "Americas" ~ "les amériques", 
                               continent == "Oceania" ~ "Océanie", 
                               TRUE ~ continent))
head(french_continents)
A tibble: 6 × 6
country      continent  year   lifeExp  pop       gdpPercap
<chr>        <chr>      <int>  <dbl>    <int>     <dbl>
Afghanistan  Asia       1952   28.801    8425333  779.4453
Afghanistan  Asia       1957   30.332    9240934  820.8530
Afghanistan  Asia       1962   31.997   10267083  853.1007
Afghanistan  Asia       1967   34.020   11537966  836.1971
Afghanistan  Asia       1972   36.088   13079460  739.9811
Afghanistan  Asia       1977   38.438   14880372  786.1134
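
In more recent versions of {dplyr} (1.1.0 and later, which is an assumption about your installation), the same catch-all can be written with the .default argument of case_when. The sketch below uses a small made-up tibble instead of gap so that it runs on its own:

```r
library(dplyr)
library(tibble)

# Toy data standing in for the continent column (illustrative values only)
toy <- tibble(continent = c("Asia", "Europe", "Africa"))

toy_fr <- toy %>%
    mutate(continent = case_when(continent == "Europe" ~ "L'Europe",
                                 continent == "Africa" ~ "Afrique",
                                 # values matching no condition keep their original value
                                 .default = continent))

toy_fr$continent
```

Both spellings express the same idea: any value not matched by an earlier condition falls through to the final case.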

Removing rows where a specific column holds NAs#

Here, we’d like to remove the rows where the column x is NA:

df <- tibble(x = c(3, 1, 2, NA), y = c("z", "a", NA, "b"))
df
A tibble: 4 × 2
x      y
<dbl>  <chr>
3      z
1      a
2      NA
NA     b

drop_na is the tidyverse function we can use to do this.

In this case we pass drop_na the column whose NAs should drive the removal of rows, here x:

df %>% drop_na(x)
A tibble: 3 × 2
x      y
<dbl>  <chr>
3      z
1      a
2      NA

Removing all rows where there is a NA in any column#

If no arguments are given to drop_na, then all rows with NAs will be dropped:

df %>% drop_na()
A tibble: 2 × 2
x      y
<dbl>  <chr>
3      z
1      a
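
The learning objectives also mention removing rows with {dplyr}’s filter. As a minimal sketch, filtering on !is.na(x) should keep the same rows as passing x to drop_na. The names by_filter and by_drop_na are made up for this illustration:

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tibble(x = c(3, 1, 2, NA), y = c("z", "a", NA, "b"))

# Keep only the rows where x is not missing
by_filter <- df %>% filter(!is.na(x))

# The drop_na equivalent, restricted to column x
by_drop_na <- df %>% drop_na(x)

all.equal(by_filter, by_drop_na)
```

filter is the more flexible of the two, since the condition can combine missingness with other checks (for example, !is.na(x) & x > 1).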

Iterate over groups of rows#

summarise calculates summaries over rows#

Examples might be calculating the mean horsepower and mean miles per gallon of cars in the mtcars data set:

# calculate mean hp and mpg for all cars
mtcars %>% 
    summarise(mean_hp = mean(hp),
             mean_mpg = mean(mpg),
             sum_hp = sum(hp))
A data.frame: 1 × 3
mean_hp   mean_mpg  sum_hp
<dbl>     <dbl>     <dbl>
146.6875  20.09062  4694

Iteration with group_by + summarise#

  • Useful when you want to do something repeatedly to a group of rows

  • For example, say we want to calculate the average life expectancy (lifeExp) for each continent in the gapminder data set

# calculate the average life expectancy for each continent
gapminder %>% 
    group_by(continent) %>% 
    summarise(mean_life_exp = mean(lifeExp))
A tibble: 5 × 2
continent  mean_life_exp
<fct>      <dbl>
Africa     48.86533
Americas   64.65874
Asia       60.06490
Europe     71.90369
Oceania    74.32621
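
To make the split-apply-combine idea concrete, here is a minimal sketch that does the three steps by hand in base R and compares the result with group_by + summarise. It uses the built-in mtcars data (grouped by cyl) so that it runs without the gapminder package; the object names are made up for this illustration:

```r
library(dplyr)

# Split: one vector of mpg values per number of cylinders
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)

# Apply + combine: take the mean of each piece and collect the results
by_hand <- vapply(mpg_by_cyl, mean, numeric(1))

# The same computation expressed with group_by + summarise
by_dplyr <- mtcars %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg))

by_hand
by_dplyr
```

group_by records which rows belong to which group, and summarise then applies the function once per group and combines the results into one row each.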

We can also group_by multiple columns. For example, what if we want to know the mean life expectancy for each continent and each year?

# calculate the mean life expectancy for each continent and each year
gapminder %>% 
    group_by(continent, year) %>% 
    summarise(mean_life_exp = mean(lifeExp))
`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
A grouped_df: 60 × 3
continent  year   mean_life_exp
<fct>      <int>  <dbl>
Africa     1952   39.13550
Africa     1957   41.26635
Africa     1962   43.31944
Africa     1967   45.33454
Africa     1972   47.45094
⋮
Oceania    1987   75.3200
Oceania    1992   76.9450
Oceania    1997   78.1900
Oceania    2002   79.7400
Oceania    2007   80.7195
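
The message printed above says that the result is still grouped by continent: by default, summarise only removes the last grouping variable. A minimal sketch of how the .groups argument changes this, using the built-in mtcars data so that it runs on its own (object names are made up for this illustration):

```r
library(dplyr)

# Default behaviour: the result stays grouped by the first variable (cyl)
still_grouped <- mtcars %>%
    group_by(cyl, am) %>%
    summarise(mean_mpg = mean(mpg))

# .groups = "drop" returns an ungrouped tibble and silences the message
ungrouped <- mtcars %>%
    group_by(cyl, am) %>%
    summarise(mean_mpg = mean(mpg), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

Dropping the groups explicitly avoids surprises when later steps such as mutate or another summarise would otherwise be computed within the leftover groups.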

Watch those NA’s!#

Let’s calculate the maximum bill length for each penguin species in the {palmerpenguins} penguins data set.

library(palmerpenguins)
penguins
A tibble: 344 × 8
species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
<fct>      <fct>      <dbl>           <dbl>          <int>              <int>        <fct>   <int>
Adelie     Torgersen  39.1            18.7           181                3750         male    2007
Adelie     Torgersen  39.5            17.4           186                3800         female  2007
Adelie     Torgersen  40.3            18.0           195                3250         female  2007
Adelie     Torgersen  NA              NA             NA                 NA           NA      2007
Adelie     Torgersen  36.7            19.3           193                3450         female  2007
⋮
Chinstrap  Dream      55.8            19.8           207                4000         male    2009
Chinstrap  Dream      43.5            18.1           202                3400         female  2009
Chinstrap  Dream      49.6            18.2           193                3775         male    2009
Chinstrap  Dream      50.8            19.0           210                4100         male    2009
Chinstrap  Dream      50.2            18.7           198                3775         female  2009
penguins %>%
    group_by(species) %>%
    summarise(max_bill_length = max(bill_length_mm))
A tibble: 3 × 2
species    max_bill_length
<fct>      <dbl>
Adelie     NA
Chinstrap  58
Gentoo     NA

Huh???

By default, summary statistics functions in R return NA if there are any NA observations in the data. This is often not ideal: we would like R to ignore the NAs and return the statistic we asked for. We can request this by setting the na.rm argument to TRUE.

penguins %>%
    group_by(species) %>%
    summarise(max_bill_length = max(bill_length_mm, na.rm = TRUE))
A tibble: 3 × 2
species    max_bill_length
<fct>      <dbl>
Adelie     46.0
Chinstrap  58.0
Gentoo     59.6
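
The same behaviour can be seen on a plain vector, outside of any data frame. The values below are made up for illustration:

```r
# A small vector with one missing value (illustrative numbers)
x <- c(39.1, NA, 40.3)

with_na    <- max(x)                 # NA: the missing value propagates
without_na <- max(x, na.rm = TRUE)   # 40.3: the NA is dropped before computing

with_na
without_na
```

One caveat: if a group contains only NAs, max with na.rm = TRUE has nothing left to work with, so it returns -Inf with a warning rather than NA.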

Grouped mutate#

Sometimes you don’t want to collapse the n rows for each group into one row. You want to keep your groups, but compute within them.

Let’s make a new variable that is the years of life expectancy gained (lost) relative to 1952, for each individual country. We group by country and use mutate to make a new variable. The first function extracts the first value from a vector. Notice that first is operating on the vector of life expectancies within each country group.

# calculate life expectancy gained (or lost) relative to 1952
gapminder %>% 
    group_by(country) %>% 
    mutate(life_exp_gain = lifeExp - first(lifeExp)) %>% 
    head()
A grouped_df: 6 × 7
country      continent  year   lifeExp  pop       gdpPercap  life_exp_gain
<fct>        <fct>      <int>  <dbl>    <int>     <dbl>      <dbl>
Afghanistan  Asia       1952   28.801    8425333  779.4453   0.000
Afghanistan  Asia       1957   30.332    9240934  820.8530   1.531
Afghanistan  Asia       1962   31.997   10267083  853.1007   3.196
Afghanistan  Asia       1967   34.020   11537966  836.1971   5.219
Afghanistan  Asia       1972   36.088   13079460  739.9811   7.287
Afghanistan  Asia       1977   38.438   14880372  786.1134   9.637
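
Note that first picks whichever row comes first within each group, so this calculation relies on the rows being ordered by year. A minimal sketch on a made-up tibble (toy and toy_gain are hypothetical names) that sorts explicitly and removes the grouping afterwards:

```r
library(dplyr)
library(tibble)

# Toy data with two countries and two years each (illustrative values only)
toy <- tibble(country = c("A", "A", "B", "B"),
              year    = c(1957, 1952, 1952, 1957),
              lifeExp = c(32, 30, 60, 59))

toy_gain <- toy %>%
    arrange(country, year) %>%      # make sure the earliest year comes first
    group_by(country) %>%
    mutate(life_exp_gain = lifeExp - first(lifeExp)) %>%
    ungroup()                       # drop the grouping once we are done

toy_gain
```

Unlike summarise, mutate keeps every row: each country still has two rows, with the gain computed relative to that country's own first value.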

Purrring instead of for loops#

https://purrr.tidyverse.org/

{purrr} map_* functions#

  • an alternative to for loops that helps you make fewer syntax errors

If you have programmed in R before#

purrr is an alternative to “apply” functions

purrr::map() is the counterpart of base::lapply()

Iterating over columns of a data frame#

Say, for example we wanted to calculate the median for each column in the mtcars data frame:

head(mtcars)
A data.frame: 6 × 11
                   mpg    cyl    disp   hp     drat   wt     qsec   vs     am     gear   carb
                   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
Mazda RX4          21.0   6      160    110    3.90   2.620  16.46  0      1      4      4
Mazda RX4 Wag      21.0   6      160    110    3.90   2.875  17.02  0      1      4      4
Datsun 710         22.8   4      108     93    3.85   2.320  18.61  1      1      4      1
Hornet 4 Drive     21.4   6      258    110    3.08   3.215  19.44  1      0      3      1
Hornet Sportabout  18.7   8      360    175    3.15   3.440  17.02  0      0      3      2
Valiant            18.1   6      225    105    2.76   3.460  20.22  1      0      3      1
medians <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
    medians[i] <- median(mtcars[[i]], na.rm = TRUE)
}

OK, then next we want to calculate the mean for all of the columns:

means <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
    means[i] <- mean(mtcars[[i]], na.rm = TRUE)
}

OK, and then the variance…

variances <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
    variances[i] <- var(mtcars[[i]], na.rm = TRUE)
}

This is getting a little repetitive… What are we repeating?

Can we write this as a function?#

Given that functions are objects in R, this seems reasonable!

medians <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
    medians[i] <- median(mtcars[[i]], na.rm = TRUE)
}

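If we wrap this loop in a function that takes the function to apply as an argument, we get essentially the guts of purrr::map_dbl. The main differences are that purrr::map_dbl is coded in C and uses ... to pass additional arguments along to the function.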

mds_map <- function(x, fun)  {
    out <- vector("double", ncol(x))
    for (i in seq_along(x)) {
        out[i] <- fun(x[[i]], na.rm = TRUE)
    }
    out
}
mds_map(mtcars, min)
  1. 10.4
  2. 4
  3. 71.1
  4. 52
  5. 2.76
  6. 1.513
  7. 14.5
  8. 0
  9. 0
  10. 3
  11. 1
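
mds_map hard-codes na.rm = TRUE, so it only works with functions that accept that argument. As a rough sketch of the ... idea mentioned above (map_dbl_sketch is a hypothetical name, and this is not purrr's actual implementation), the extra arguments can be forwarded to fun instead:

```r
# A hypothetical variant of mds_map that forwards extra arguments via `...`
map_dbl_sketch <- function(x, fun, ...) {
    out <- vector("double", length(x))   # pre-allocate one slot per element
    for (i in seq_along(x)) {
        out[i] <- fun(x[[i]], ...)       # `...` is passed on to fun unchanged
    }
    out
}

# The caller now decides which extra arguments fun receives
medians <- map_dbl_sketch(mtcars, median, na.rm = TRUE)
upper_deciles <- map_dbl_sketch(mtcars, quantile, probs = 0.9)

medians
```

With this change the same wrapper works for functions that take na.rm, functions that take other arguments such as probs, and functions that take no extra arguments at all.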

Functionals#

We have just written what is called a functional.

A functional is a function that takes a function (and other things) as an input and returns a vector as output.

R has several other functionals outside of purrr that you might have already encountered: lapply, apply, tapply, integrate or optim.

What can you do with functionals?#

  • Common use is as an alternative to for loops

  • For loops are actually quite effective and efficient for iteration. However, it is easy to make mistakes when setting them up, as you have to:

    • pre-allocate space for the output

    • iterate over the thing the right number of times

    • properly use the iteration index
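
A small sketch of two of these pitfalls, using made-up vectors. The first part shows why 1:length(x) iterates the wrong number of times on an empty input, and the second shows a loop with pre-allocation and correct indexing:

```r
x <- numeric(0)               # an empty input

# 1:length(x) is 1:0, i.e. c(1, 0), so a loop over it would run twice
bad_index <- 1:length(x)

# seq_along(x) is empty, so a loop over it runs zero times, as intended
good_index <- seq_along(x)

# Pre-allocating the output and using the index consistently
values <- c(2, 4, 6)
squares <- vector("double", length(values))
for (i in seq_along(values)) {
    squares[i] <- values[i]^2
}

squares
```

A functional such as map_dbl takes care of all three steps, which is why it leaves less room for this kind of mistake.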

Of course someone has to write for loops#

It doesn’t have to be you#

Jenny Bryan, Software Engineer at RStudio and UBC MDS Founder

The purrr::map* family of functions#

../../_images/map_family.png

Source: Advanced R by Hadley Wickham

Let’s start at the beginning with the most general purrr function: map#

map(.x, .f, ...)

Above reads as: for every element of .x apply .f

and can be pictured as:

https://d33wubrfki0l68.cloudfront.net/12f6af8404d9723dff9cc665028a35f07759299d/d0d9a/diagrams/functionals/map-list.png

Or pictured as…

../../_images/minis_as_data.png

Source: Row-oriented workflows in R with the tidyverse by Jenny Bryan

../../_images/minis_map.png

Source: Row-oriented workflows in R with the tidyverse by Jenny Bryan
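
As a minimal sketch of "for every element of .x apply .f", here map is applied to a small made-up list (pieces is a hypothetical name). The result has one element per input element and keeps the input names:

```r
library(purrr)

# A toy list with three named elements (illustrative values only)
pieces <- list(a = 1:3, b = c(10, 20), c = 5)

# map always returns a list, one element per element of the input
sums <- map(pieces, sum)

# map_dbl does the same work but returns a named double vector
sums_dbl <- map_dbl(pieces, sum)

sums
sums_dbl
```

A data frame is a list of columns, which is why the same call pattern iterates over columns when .x is a data frame.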

purrr::map test drive#

Let’s calculate the median of all the columns of the mtcars data frame using purrr::map:

library(purrr)
map(mtcars, median)
$mpg
19.2
$cyl
6
$disp
196.3
$hp
123
$drat
3.695
$wt
3.325
$qsec
17.71
$vs
0
$am
0
$gear
4
$carb
2

That looks different from our mds_map function! The output is of type list. We can use map_dbl to get a vector of type double:

map_dbl(mtcars, median)
mpg    cyl    disp   hp     drat   wt     qsec   vs     am     gear   carb
19.2   6      196.3  123    3.695  3.325  17.71  0      0      4      2

And map_df will give us a tibble:

map_df(mtcars, median)
A tibble: 1 × 11
mpg    cyl    disp   hp     drat   wt     qsec   vs     am     gear   carb
<dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
19.2   6      196.3  123    3.695  3.325  17.71  0      0      4      2

So now it’s super efficient and easy to get several summary stats across columns!

map_df(mtcars, median)
map_df(mtcars, mean)
map_df(mtcars, max)
map_df(mtcars, min)
A tibble: 1 × 11
mpg       cyl     disp      hp        drat      wt       qsec      vs      am       gear    carb
<dbl>     <dbl>   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>   <dbl>    <dbl>   <dbl>
19.2      6       196.3     123       3.695     3.325    17.71     0       0        4       2
A tibble: 1 × 11
mpg       cyl     disp      hp        drat      wt       qsec      vs      am       gear    carb
<dbl>     <dbl>   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>   <dbl>    <dbl>   <dbl>
20.09062  6.1875  230.7219  146.6875  3.596563  3.21725  17.84875  0.4375  0.40625  3.6875  2.8125
A tibble: 1 × 11
mpg       cyl     disp      hp        drat      wt       qsec      vs      am       gear    carb
<dbl>     <dbl>   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>   <dbl>    <dbl>   <dbl>
33.9      8       472       335       4.93      5.424    22.9      1       1        5       8
A tibble: 1 × 11
mpg       cyl     disp      hp        drat      wt       qsec      vs      am       gear    carb
<dbl>     <dbl>   <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>   <dbl>    <dbl>   <dbl>
10.4      4       71.1      52        2.76      1.513    14.5      0       0        3       1

What if our data frame had missing values?#

Let’s make some to see the consequences…

mtcars_NA <- mtcars
mtcars_NA[1, 1] <- NA

map_dbl(mtcars_NA, median)
mpg    cyl    disp   hp     drat   wt     qsec   vs     am     gear   carb
<NA>   6      196.3  123    3.695  3.325  17.71  0      0      4      2

How do we tell median to ignore NA’s? Using na.rm = TRUE! But how do we add this to our map_dbl call?

map_dbl(mtcars_NA, median, na.rm = TRUE)
mpg    cyl    disp   hp     drat   wt     qsec   vs     am     gear   carb
19.2   6      196.3  123    3.695  3.325  17.71  0      0      4      2
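
Passing na.rm = TRUE after the function name works because map_dbl forwards any extra arguments to the function. An equivalent way to write it, sketched below, is to wrap the call in an anonymous function. The names via_dots and via_anon are made up for this comparison:

```r
library(purrr)

mtcars_NA <- mtcars
mtcars_NA[1, 1] <- NA

# Extra arguments after the function are forwarded to median
via_dots <- map_dbl(mtcars_NA, median, na.rm = TRUE)

# The same computation with an explicit anonymous function
via_anon <- map_dbl(mtcars_NA, function(col) median(col, na.rm = TRUE))

identical(via_dots, via_anon)
```

The anonymous-function form is more verbose, but it makes it explicit which function receives na.rm, which can help once the mapped function takes several arguments.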

That’s all we will learn of the {purrr} functions for now - however we will meet them again (and learn more about them) next week when we start working with nested data frames.

What did we learn today?#

  • How to use case_when to selectively change values in a data frame (similar to base R if statements)

  • How to use group_by to iterate over groups of rows (similar to for loops in base R)

  • How to use {purrr} map_* functions to iterate over columns (similar to for loops in base R)

Attributions#