Lecture 7: Functional-style programming in R#

Note from Firas

Our series of R lectures will be presented by Dr. Tiffany Timbers, the other option co-director of the Vancouver MDS program.

First, some things leftover from last week…#

Reading in functions from an R script#

Usually, the step before packaging your code is having some functions in another script that you want to read into your analysis. We use the source function to do this:

source("src/kelvin_to_celsius.R")

Warning message in file(filename, "r", encoding = encoding):
“cannot open file 'src/kelvin_to_celsius.R': No such file or directory”

Error in file(filename, "r", encoding = encoding): cannot open the connection
Traceback:

1. source("src/kelvin_to_celsius.R")
2. file(filename, "r", encoding = encoding)

Once you do this, you have access to all functions contained within that script:

kelvin_to_celsius(273.15)

Note - this is how the test_* functions are brought into your Jupyter notebooks for the autograding part of your lab3 homework.
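As a self-contained sketch of the same workflow, the block below writes a small script to a temporary file and then sources it. The body of kelvin_to_celsius here is an assumption made for illustration; the actual contents of src/kelvin_to_celsius.R are not shown in these notes.

```r
# Hypothetical stand-in for a script such as src/kelvin_to_celsius.R
script_path <- tempfile(fileext = ".R")
writeLines(
    c(
        "kelvin_to_celsius <- function(temp_k) {",
        "    temp_k - 273.15",
        "}"
    ),
    script_path
)

# source() evaluates the script, so the functions it defines become available
source(script_path)
kelvin_to_celsius(273.15)
```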

Introduction to R packages#

source("script_with_functions.R") is useful, but when you start using these functions in different projects you need to keep copying the script or use overly specific paths…

The next step is packaging your R code so that it can be installed and then used across multiple projects on your (and others') machines. Instead of pointing directly to where the code is stored, you access it using the library function.

You will learn how to do this in Collaborative Software Development (term 2), but for now, let’s tour a simple R package to get a better understanding of what they are: https://github.com/ttimbers/convertemp

Install the convertemp R package:#

In RStudio, type: devtools::install_github("ttimbers/convertemp")

library(convertemp)

?celsius_to_kelvin

celsius_to_kelvin(0)

Packages and environments#

Each package attached by library() becomes one of the parents of the global environment
The immediate parent of the global environment is the last package you attached, the parent of that package is the second to last package you attached, …

https://d33wubrfki0l68.cloudfront.net/038b2da4f5db1d2a8acaf4ee1e7d08d04ab36ebc/ac22a/diagrams/environments/search-path.png

Source: Advanced R by Hadley Wickham

Packages and environments#

When you attach another package with library(), the parent environment of the global environment changes:

https://d33wubrfki0l68.cloudfront.net/7c87a5711e92f0269cead3e59fc1e1e45f3667e9/0290f/diagrams/environments/search-path-2.png

Source: Advanced R by Hadley Wickham
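You can inspect this chain of parents yourself with base R. The sketch below uses the splines package only because it ships with R; any package you attach would behave the same way.

```r
# The search path lists the global environment first, then its parents in order
search()[1:2]

# Attach a package that ships with R (chosen here purely as an example)
library(splines)

# The newly attached package now sits directly after the global environment
search()[1:2]

# The second entry on the search path is the parent of the global environment
parent_is_second <- identical(parent.env(globalenv()), as.environment(2))
parent_is_second
```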

Functional style programming in R with `purrr`#

https://purrr.tidyverse.org/

If you have programmed in R before#

purrr is an alternative to “apply” functions

purrr::map() ≈ base::lapply()
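A quick way to see the correspondence is to run both on the same input. This is a minimal sketch that assumes purrr is installed; the input 1:3 and the squaring function are arbitrary choices.

```r
library(purrr)

# Both calls apply the function to each element and return a list
squares_base <- lapply(1:3, function(n) n^2)
squares_purrr <- map(1:3, function(n) n^2)

# Compare the two results
identical(squares_base, squares_purrr)
```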

How do we apply a function to all columns of a data frame?#

Say, for example, we wanted to calculate the median for each column in the mtcars data frame:

head(mtcars)

medians <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
    medians[i] <- median(mtcars[[i]], na.rm = TRUE)
}

OK, then next we want to calculate the mean for all of the columns:

means <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
    means[i] <- mean(mtcars[[i]], na.rm = TRUE)
}

OK, and then the variance…

variances <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
    variances[i] <- var(mtcars[[i]], na.rm = TRUE)
}

This is getting a little repetitive… What are we repeating?

Can we write this as a function?#

Given that functions are objects in R, this seems reasonable!

medians <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
    medians[i] <- median(mtcars[[i]], na.rm = TRUE)
}

This is essentially the guts of purrr::map_dbl. The only differences are that purrr::map_dbl is coded in C and that it uses ... to pass additional arguments.

mds_map <- function(x, fun)  {
    out <- vector("double", ncol(x))
    for (i in seq_along(x)) {
        out[i] <- fun(x[[i]], na.rm = TRUE)
    }
    out
}
mds_map(mtcars, min)
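To illustrate the role of ... mentioned above, here is a sketch of a variant that forwards extra arguments instead of hard-coding na.rm = TRUE. The name mds_map_dots is made up for this example, and this is not how purrr itself is implemented.

```r
# Like mds_map, but any additional arguments are passed on to fun via ...
mds_map_dots <- function(x, fun, ...) {
    out <- vector("double", length(x))
    for (i in seq_along(x)) {
        out[[i]] <- fun(x[[i]], ...)
    }
    out
}

# The caller now decides which extra arguments fun receives
mds_map_dots(mtcars, median, na.rm = TRUE)
mds_map_dots(mtcars, quantile, probs = 0.9)
```

Using length(x) rather than ncol(x) is a small design choice that lets the same function work on plain lists as well as data frames.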

Functionals#

We have just written what is called a functional.

A functional is a function that takes a function (and other things) as an input and returns a vector as output.

R has several other functionals outside of purrr that you might have already encountered: lapply, apply, tapply, integrate or optim.
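For a feel of what these look like in use, the sketch below calls each of them on small inputs. The particular functions being passed in (median, max, mean, sin, and a simple quadratic) are arbitrary examples.

```r
# lapply: apply a function to each column of a data frame, returning a list
lapply(mtcars[1:3], median)

# apply: apply a function over the columns (margin 2) of a matrix
apply(as.matrix(mtcars), 2, max)

# tapply: apply a function to groups of values defined by a second vector
tapply(mtcars$mpg, mtcars$cyl, mean)

# integrate: numerically integrate a function over an interval
area <- integrate(sin, lower = 0, upper = pi)
area$value

# optim: search for the parameter value that minimizes a function
fit <- optim(par = 0, fn = function(p) (p - 3)^2, method = "BFGS")
fit$par
```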

What can you do with functionals?#

Common use is as an alternative to for loops

For loops are actually quite effective for iteration, and efficient when used correctly. However, it is easy to make mistakes when setting them up, because you have to:
- pre-allocate space for the output
- iterate over the thing the right amount of times
- properly use the iteration index
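One classic example of getting the iteration wrong is building the index with 1:length(x) when the input might be empty. The sketch below uses a made-up empty vector to show the difference from seq_along().

```r
values <- numeric(0)  # an empty input

# Pitfall: 1:length(values) is 1:0, which counts down instead of being empty
bad_index <- 1:length(values)
bad_index

# seq_along() returns an empty index, so the loop body never runs
good_index <- seq_along(values)
good_index

# Pre-allocate the output, then iterate with the safe index
doubled <- vector("double", length(values))
for (i in good_index) {
    doubled[i] <- values[i] * 2
}
doubled
```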

Of course someone has to write for loops#

It doesn’t have to be you#

– Jenny Bryan, Software Developer at RStudio and MDS Founder

The `purrr::map*` family of functions#

Source: Advanced R by Hadley Wickham

Let’s start at the beginning with the most general `purrr` function: `map`#

map(.x, .f, ...)

Above reads as: for every element of .x apply .f

and can be pictured as:

https://d33wubrfki0l68.cloudfront.net/12f6af8404d9723dff9cc665028a35f07759299d/d0d9a/diagrams/functionals/map-list.png

Or picture as…

Source: Row-oriented workflows in R with the tidyverse by Jenny Bryan

Source: Row-oriented workflows in R with the tidyverse by Jenny Bryan

`purrr::map` test drive#

Let’s calculate the median of all the columns of the mtcars data frame using purrr::map:

library(purrr)
map(mtcars, median)

That looks different from our mds_map function! The output is of type list.

Choosing the `purrr::map*` function based on your desired output#

Source: Advanced R by Hadley Wickham

Trying again with `purrr::map_dbl`#

map_dbl(mtcars, median)
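The other typed variants follow the same pattern, with the suffix naming the type of vector you get back. A short sketch, assuming purrr is installed; the functions being mapped are just examples that happen to return the matching type.

```r
library(purrr)

# map_chr returns a character vector
col_types <- map_chr(mtcars, typeof)

# map_lgl returns a logical vector
col_is_numeric <- map_lgl(mtcars, is.numeric)

# map_int returns an integer vector
col_lengths <- map_int(mtcars, length)

# map_dbl returns a double vector
col_medians <- map_dbl(mtcars, median)

col_types
col_is_numeric
col_lengths
col_medians
```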

What if our data frame had missing values?#

Let’s make some to see the consequences…

mtcars_NA <- mtcars
mtcars_NA[1, 1] <- NA

map_dbl(mtcars_NA, median)

map_dbl still returns a vector of type double, but the median of mpg (the column we put the NA in) is now NA.

How do we tell median to ignore NA’s? Using na.rm = TRUE! But how do we add this to our map_dbl call?

Solution!#

Creating an anonymous function within the purrr::map_dbl function!

map_dbl(mtcars_NA, function(df) median(df, na.rm  = TRUE))
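As an aside, the map functions also forward anything supplied in ... to .f, so na.rm = TRUE can be passed directly. This sketch assumes purrr is installed; newer purrr documentation tends to favour the anonymous-function form, so treat this as an alternative rather than the recommended style.

```r
library(purrr)

mtcars_NA <- mtcars
mtcars_NA[1, 1] <- NA

# Extra arguments after .f are passed along to median on every call
medians_dots <- map_dbl(mtcars_NA, median, na.rm = TRUE)

# The anonymous-function version from above, for comparison
medians_anon <- map_dbl(mtcars_NA, function(df) median(df, na.rm = TRUE))

all.equal(medians_dots, medians_anon)
```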

Aside: Anonymous functions in R#

General format: function(x) body_of_function

To use one in the global environment, outside of another function call, you do the following:

(function(x) x + 1)(1)

Above, the function takes in x as an argument and adds one to it. The function definition is surrounded by round brackets, as is the value being passed to the anonymous function.

Back to anonymous function calls within `purrr::map*`#

Long form:

map_dbl(mtcars_NA, function(df) median(df, na.rm  = TRUE))

Short form:

map_dbl(mtcars_NA, ~ median(., na.rm  = TRUE))

In the shortcut we replace function(VARIABLE) with a ~ and replace the VARIABLE in the function call with a .
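If you are on R 4.1 or later, base R also has its own shorthand, \(x), which can be used anywhere function(x) can. The sketch below puts the three spellings side by side; it assumes purrr is installed, and col is just an illustrative argument name.

```r
library(purrr)

mtcars_NA <- mtcars
mtcars_NA[1, 1] <- NA

# Long form
long_form <- map_dbl(mtcars_NA, function(col) median(col, na.rm = TRUE))

# purrr formula shortcut (.x can be used in place of .)
short_form <- map_dbl(mtcars_NA, ~ median(.x, na.rm = TRUE))

# Base R lambda shorthand (requires R >= 4.1)
lambda_form <- map_dbl(mtcars_NA, \(col) median(col, na.rm = TRUE))

identical(long_form, short_form)
identical(long_form, lambda_form)
```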

Challenge 1:#

Use a purrr::map function to calculate the variance (using var) of each of the numerical columns in the iris dataset. Return the object as a data frame.

Mapping with > 1 data objects#

What if the function you want to map takes in > 1 data objects?

map2* and pmap* are your friends here!

`purrr::map2*`#

map2*(.x, .y, .f, ...)

Above reads as: for every element of .x and .y apply .f

Or picture as…

Source: purrr workshop by Jenny Bryan

Source: purrr workshop by Jenny Bryan
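A tiny example may help make the pairing concrete. In this sketch (assuming purrr is installed, with made-up vectors widths and heights), the function is called once per pair of elements.

```r
library(purrr)

widths <- c(1, 2, 3)
heights <- c(10, 20, 30)

# First call gets (1, 10), second gets (2, 20), third gets (3, 30)
areas <- map2_dbl(widths, heights, function(w, h) w * h)
areas
```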

`purrr::map2_df` example:#

For example, say you want to calculate weighted means (using weighted.mean) for the columns of a data frame, where you have another data frame containing the weights.

Let’s make some data:

library(dplyr, quietly = TRUE)

data <- tibble(frequency = runif(10),
               loudness = runif(10),
               power = runif(10),
               rating = rpois(10, 5) + 1,
               year = rpois(10, 5) + 1999)
data[1, 1] <- NA
data

# redefine data with three columns so that it lines up with weights
data <- tibble(x1 = runif(10),
               x2 = runif(10),
               x3 = runif(10))
data[1, 1] <- NA
weights <- tibble(x1 = rpois(10, 5) + 1,
                  x2 = rpois(10, 5) + 1,
                  x3 = rpois(10, 5) + 1)

data
weights

`purrr::map2_df` example:#

Let’s use map2_df to calculate the weighted mean using these two data frames.

?weighted.mean

map2_df(data, weights, weighted.mean)

Ah! That NA got us again! We need to write this as an anonymous function so that we can pass in na.rm = TRUE

`purrr::map2_df` example:#

Now using an anonymous function with the long form:

map2_df(data, weights, function(x, y) weighted.mean(x, y, na.rm = TRUE))

Now with the short form:

map2_df(data, weights, ~ weighted.mean(.x, .y, na.rm = TRUE))

Not too bad eh!

`purrr::map2*`#

Also, if y has fewer elements than x, map2 recycles y (purrr only recycles vectors of length 1; other length mismatches raise an error):

https://d33wubrfki0l68.cloudfront.net/55032525ec77409e381dcd200a47e1787e65b964/dcaef/diagrams/functionals/map2-recycle.png

This is most useful when y has only one element.
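Here is a sketch of that single-element case, assuming purrr is installed. The vectors prices and tax_rate are made up for the example, and the last call shows that a length mismatch other than length 1 is an error.

```r
library(purrr)

prices <- c(10, 20, 30)
tax_rate <- 0.5  # a single value, recycled for every element of prices

with_tax <- map2_dbl(prices, tax_rate, ~ .x * (1 + .y))
with_tax

# Lengths 3 and 2 cannot be recycled, so this call raises an error
mismatch_fails <- tryCatch(
    {
        map2_dbl(1:3, 1:2, `+`)
        FALSE
    },
    error = function(e) TRUE
)
mismatch_fails
```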

`purrr::pmap*`#

pmap*(list(.x1, .x2, ... .xn), .f, ...)

Above reads as: for every element in the list (that contains .x1, .x2, ... .xn) apply .f

Example of using `pmap_df` to calculate the weighted means:#

pmap_df(list(data, weights), ~ weighted.mean(.x, .y, na.rm = TRUE))

But what happens when you have > 2 arguments?

More than two arguments#

Without an anonymous function, pmap works like so:

f1 <- function(x, y, z) {
    x + y + z
}

pmap_dbl(list(c(1, 1), c(1, 2), c(2, 2)), f1)

If you want to use an anonymous function, then use ..1, ..2, ..3, and so on to specify where the mapped objects go in your function:

f2 <- function(x, y, z, a = 0) {
    x + y + z + a
}

pmap_dbl(list(c(1, 1), c(1, 2), c(2, 2)), ~ f2(..1, ..2, ..3, a = -1))

We only used three inputs to our function here, but we can use any number with pmap; we just need to add them to our list!
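If the list passed to pmap is named, the elements are matched to the function's arguments by name rather than by position. A minimal sketch, assuming purrr is installed; f3 is a made-up function where argument order matters, so you can see the matching at work.

```r
library(purrr)

# Order matters here: x + y * z is not symmetric in its arguments
f3 <- function(x, y, z) {
    x + y * z
}

# The list is deliberately out of order; names decide which argument gets what
by_name <- pmap_dbl(list(z = c(2, 2), x = c(1, 1), y = c(1, 2)), f3)
by_name
```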

Want to iterate row-wise, instead of column-wise?#

Here you can use purrr::pmap on a single data frame!

This: purrr::pmap(df, .f)

reads as: for every tuple in .l (i.e., each row of df) apply .f

The key point is that pmap() iterates over tuples = the collection of i-th elements of k lists. A data frame row is an interesting special case.

Here’s an example of row-wise iteration#

Here we calculate the sum for each row in the mtcars data frame:

pmap(mtcars, sum)
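Because each column is passed to the function as a named argument, you can also pick out specific columns by name and absorb the rest with ... in the function signature. The sketch below assumes purrr is installed; the mpg-to-weight ratio is just an illustrative calculation.

```r
library(purrr)

# pmap_dbl gives a double vector of row sums instead of a list
row_sums <- pmap_dbl(mtcars, sum)
all.equal(unname(row_sums), unname(rowSums(mtcars)))

# Use only the mpg and wt values from each row; ... swallows the other columns
mpg_per_wt <- pmap_dbl(mtcars, function(mpg, wt, ...) mpg / wt)
head(mpg_per_wt)
```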

What about mapping over groups of rows???#

There are two strategies we will learn in the Data Wrangling course next block:

dplyr::group_by + dplyr::summarize
dplyr::group_by + tidyr::nest
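As a preview only, here is a rough sketch of what the two strategies can look like, assuming dplyr, tidyr and purrr are installed. Grouping mtcars by cyl and averaging mpg is an arbitrary example, not something taken from the Data Wrangling course.

```r
library(dplyr, quietly = TRUE)
library(tidyr)
library(purrr)

# Strategy 1: group_by + summarize collapses each group to one row
by_cyl_summary <- mtcars %>%
    group_by(cyl) %>%
    summarize(mean_mpg = mean(mpg))

# Strategy 2: group_by + nest stores each group's rows in a list column,
# which can then be mapped over like any other list
by_cyl_nested <- mtcars %>%
    group_by(cyl) %>%
    nest() %>%
    mutate(mean_mpg = map_dbl(data, ~ mean(.x$mpg)))

by_cyl_summary
by_cyl_nested
```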

What did we learn today?#

Attribution#

Advanced R by Hadley Wickham
Jenny Bryan’s purrr tutorial

Data 531
