Lecture 6: Functions and testing in R

Note from Firas

Our series of R lectures will be presented by Dr. Tiffany Timbers, the other option co-director of the Vancouver MDS program.

Her lecture videos are not yet posted on YouTube, but I’ve got them locally so that’s how we’ll watch them.

Correction from last class:

You can do manipulations across rows as well as columns in R with data frames without iteration:

head(cars)
A data.frame: 6 × 2
   speed  dist
   <dbl> <dbl>
1      4     2
2      4    10
3      7     4
4      7    22
5      8    16
6      9    10
cars[1:3, ] <- (cars[1:3, ] + 1)
head(cars)
A data.frame: 6 × 2
   speed  dist
   <dbl> <dbl>
1      5     3
2      5    11
3      8     5
4      7    22
5      8    16
6      9    10

The memory implications I stated on Tuesday, however, still stand.

Another question from lab

What is a factor?

df <- data.frame(province = c("BC", "AB", "ON"), location = c("west", "west", "east"))
df
A data.frame: 3 × 2
province location
<chr>    <chr>
BC       west
AB       west
ON       east
str(df)
'data.frame':	3 obs. of  2 variables:
 $ province: chr  "BC" "AB" "ON"
 $ location: chr  "west" "west" "east"

Factors are integer vectors with attributes

  • have two attributes:

    • class

    • levels

typeof(df$location)
'character'
attributes(df$location)
NULL
  • used to store categorical data, and very useful for data visualization and modeling

  • not at all helpful for data wrangling
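From R 4.0.0 onward, data.frame() keeps strings as character vectors by default, which is why typeof() reports 'character' and attributes() reports NULL for df$location above. A minimal sketch of what an actual factor looks like, using a standalone vector (the name location is only for illustration):

```r
# Build a factor directly from a character vector
location <- factor(c("west", "west", "east"))

typeof(location)      # "integer": the underlying storage type
attributes(location)  # $levels ("east", "west") and $class ("factor")
as.integer(location)  # 2 2 1: integer codes that index into the levels
```

The levels are sorted alphabetically by default, so "east" gets code 1 and "west" gets code 2.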

Other vector types built on top of the atomic vectors

https://d33wubrfki0l68.cloudfront.net/baa19d0ebf9b97949a7ad259b29a1c4ae031c8e2/8e9b8/diagrams/vectors/summary-tree-s3-1.png

Writing readable R code

  • WriTing AND reading (code) TaKes cognitive RESOURCES, & We only hAvE so MUCh!

  • To help free up cognitive capacity, we will follow the tidyverse style guide

Sample code not in tidyverse style

Can we spot what’s wrong?

Sample code in tidyverse style

Lecture learning objectives:

By the end of the lecture & lab 3, students should be able to:

  • In R, define and use a named function that accepts parameters and returns values

  • Describe lazy evaluation and ... (variable arguments) and how they affect functions in R

  • Explain the importance of scoping and environments in R as they relate to functions

  • Use testthat to formulate a test case to prove a function design specification

  • Use test-driven development principles to define a function that accepts parameters, returns values and passes all tests

  • Handle errors gracefully via exception handling

  • Use roxygen2 friendly function documentation to describe parameters, return values, description and example(s).

  • Write comments within a function to improve readability

  • Evaluate the readability, complexity and performance of a function

  • Source and use functions stored as R code in another file, as well as those in R packages/libraries

  • Describe what R packages/libraries are, as well as explain when and why they are useful

Defining functions in R

  • Use variable <- function(arguments) { body } to create a function and give it a name

Example:

add <- function(x, y) {
  x + y
}

add(5, 10)
15
  • As in Python, functions in R are objects. This is referred to as “first-class functions”.

  • A function returns the value of the last expression it evaluates; to return a value early, use return()

add <- function(x, y) {
    if (!is.numeric(x) | !is.numeric(y)) {
        return(NA)
    }
    x + y
}

add(5, "a")
<NA>

Note: in practice you would probably throw an error here rather than return NA; this example is purely to illustrate early returns.

Default function arguments
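Because functions are objects (the "first-class functions" point above), they can be passed to other functions or stored in data structures like any other value. A minimal sketch; apply_twice and operations are hypothetical names used only for illustration:

```r
square <- function(x) {
  x^2
}

# A function can be passed as an argument to another function
apply_twice <- function(f, x) {
  f(f(x))
}
apply_twice(square, 3)  # 81

# Functions can also be stored in a list and called later
operations <- list(
  double = function(x) x * 2,
  negate = function(x) -x
)
operations$double(5)  # 10
```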

Same as in Python!

repeat_string <- function(x, n = 2) {
    repeated <- ""
    for (i in seq_len(n)) {
        repeated <- paste0(repeated, x)
    }
    repeated
}

repeat_string("MDS")
'MDSMDS'
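A default can be overridden by position or by name. The sketch below re-implements repeat_string with rep() and paste() instead of a loop so that it stays short; the behaviour is intended to match the version above:

```r
repeat_string <- function(x, n = 2) {
  paste(rep(x, n), collapse = "")
}

repeat_string("MDS")             # default n = 2 is used
repeat_string("MDS", 3)          # n matched by position
repeat_string(n = 3, x = "MDS")  # arguments matched by name, in any order
```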

Optional - Advanced

Extra arguments via ...

If we want our function to be able to take extra arguments that we don’t specify, we must explicitly convert ... to a list:

add <- function(x, y, ...) {
    total <- x + y
    for (value in list(...)) {
        total <- total + value
    }
    print(list(...))  # show the extra arguments captured by ...
    total             # last expression, so this is the return value
}
add(1, 3, 5, 6)
[[1]]
[1] 5

[[2]]
[1] 6

15
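Converting ... to a list is needed when the function inspects the extra arguments itself. When they are only handed on to another function, ... can be forwarded unchanged. A minimal sketch, where add_all is a hypothetical name:

```r
add_all <- function(x, y, ...) {
  # sum() accepts any number of arguments, so ... is passed straight through
  sum(x, y, ...)
}

add_all(1, 3)        # 4
add_all(1, 3, 5, 6)  # 15
```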

Lexical scoping in R

R’s lexical scoping follows several rules; we will cover the following three:

  • Name masking

  • Dynamic lookup

  • A fresh start

Name masking

  • Names defined inside a function mask names defined outside a function

  • If a name isn’t defined inside a function, R looks one level up (and then all the way up into the global environment and even loaded packages!)

Talk through the following code with your neighbour and predict the output, then let’s confirm the result by running the code.

x <- 1
g04 <- function() {
  y <- 2
  i <- function() {
    z <- 3
    c(x, y, z)
  }
  i()
}
g04()
  1. 1
  2. 2
  3. 3
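The example above only shows lookup moving outward. A small sketch of the masking half of the rule, where the same name exists both inside and outside the function (mask_demo is a hypothetical name):

```r
x <- 10
y <- 20

mask_demo <- function() {
  x <- 1   # masks the global x, but only inside this function
  c(x, y)  # y is not defined here, so R finds it one level up
}

mask_demo()  # 1 20
x            # still 10: the global x is untouched
```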

Dynamic lookup

  • R looks for values when the function is run, not when the function is created.

  • This means that the output of a function can differ depending on the objects outside the function’s environment.

Talk through the following code with your neighbour and predict the output, then let’s confirm the result by running the code.

g12 <- function() x + 1
x <- 15
g12()

x <- 20
g12()
16
21

A fresh start

  • Every time a function is called a new environment is created to host its execution.

  • This means that a function has no way to tell what happened the last time it was run; each invocation is completely independent.

Talk through the following code with your neighbour and predict the output, then let’s confirm the result by running the code.

g11 <- function() {
  if (!exists("a")) {
    a <- 1
  } else {
    a <- a + 1
  }
  a
}

g11()
g11()
g11()
1
1
1

Lazy evaluation

In R, function arguments are lazily evaluated: they’re only evaluated if accessed.

Knowing that, now consider the add_one function written in both R and Python below:

# R code (this would work)
add_one <- function(x, y) {
    x <- x + 1
    return(x)
} 
# Python code (this would not work)
def add_one(x, y):
    x = x + 1
    return x

Answer the poll on Slack.

Poll:

From the list below, select the reason why the above add_one function will work in R, but the equivalent version of the function in Python would break.

  1. Python evaluates the function arguments before it evaluates the function and because it doesn’t know what y is, it will break even though it is not used in the function.

  2. R performs lazy evaluation, meaning it delays the evaluation of each function argument until its value is needed inside the function.

  3. The question is wrong, both functions would work in their respective languages.

  4. Answers 1 & 2 are correct
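The R half of this comparison can be checked directly. A minimal sketch; use_y is a hypothetical helper added only for contrast:

```r
add_one <- function(x, y) {
  x + 1
}

add_one(1)                           # 2: y is missing but never used
add_one(1, stop("never evaluated"))  # 2: the stop() call is never run

# As soon as the body touches an argument, it is evaluated
use_y <- function(x, y) {
  x + y
}
result <- try(use_y(1), silent = TRUE)
inherits(result, "try-error")        # TRUE: y was needed but missing
```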

The power of lazy evaluation

Lets you write easy-to-use interactive code like this:

head(mtcars, n = 2)
A data.frame: 2 × 11
                mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
              <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
Mazda RX4        21     6   160   110   3.9 2.620 16.46     0     1     4     4
Mazda RX4 Wag    21     6   160   110   3.9 2.875 17.02     0     1     4     4
dplyr::select(mtcars, mpg, cyl, hp, qsec)
A data.frame: 32 × 4
                      mpg   cyl    hp  qsec
                    <dbl> <dbl> <dbl> <dbl>
Mazda RX4            21.0     6   110 16.46
Mazda RX4 Wag        21.0     6   110 17.02
Datsun 710           22.8     4    93 18.61
Hornet 4 Drive       21.4     6   110 19.44
Hornet Sportabout    18.7     8   175 17.02
Valiant              18.1     6   105 20.22
Duster 360           14.3     8   245 15.84
Merc 240D            24.4     4    62 20.00
Merc 230             22.8     4    95 22.90
Merc 280             19.2     6   123 18.30
Merc 280C            17.8     6   123 18.90
Merc 450SE           16.4     8   180 17.40
Merc 450SL           17.3     8   180 17.60
Merc 450SLC          15.2     8   180 18.00
Cadillac Fleetwood   10.4     8   205 17.98
Lincoln Continental  10.4     8   215 17.82
Chrysler Imperial    14.7     8   230 17.42
Fiat 128             32.4     4    66 19.47
Honda Civic          30.4     4    52 18.52
Toyota Corolla       33.9     4    65 19.90
Toyota Corona        21.5     4    97 20.01
Dodge Challenger     15.5     8   150 16.87
AMC Javelin          15.2     8   150 17.30
Camaro Z28           13.3     8   245 15.41
Pontiac Firebird     19.2     8   175 17.05
Fiat X1-9            27.3     4    66 18.90
Porsche 914-2        26.0     4    91 16.70
Lotus Europa         30.4     4   113 16.90
Ford Pantera L       15.8     8   264 14.50
Ferrari Dino         19.7     6   175 15.50
Maserati Bora        15.0     8   335 14.60
Volvo 142E           21.4     4   109 18.60

Notes:

  • There’s more than just lazy evaluation happening in the code above, but lazy evaluation is part of it.

  • package::function() is a way to use a function from an R package without attaching the whole package via library().

Function composition
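The same unquoted-column convenience exists in base R. In the sketch below, subset() is swapped in for dplyr::select() so that no extra package is needed, and stats::median() illustrates the package::function() form:

```r
# Column names are given unquoted; subset() evaluates them inside mtcars
small <- subset(mtcars, select = c(mpg, cyl, hp, qsec))
head(small, n = 2)

# package::function() calls a function without attaching its package
stats::median(mtcars$mpg)
```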

You have 3 options in R:

  • assigning values to intermediate objects,

  • nested function calls, or

  • the binary operator %>%, which is called the pipe and is pronounced as “and then”.

For example, imagine we want to compute the population standard deviation using sqrt() and mean() as building blocks, together with two helper functions that we define ourselves:

square <- function(x) {
    x^2
}
deviation <- function(x) {
    x - mean(x)
}
x <- runif(100)
x
  1. 0.25022658566013
  2. 0.234014874557033
  3. 0.144372418988496
  4. 0.785271859029308
  5. 0.699429926462471
  6. 0.771207594079897
  7. 0.314416424371302
  8. 0.870611026417464
  9. 0.207680284744129
  10. 0.168246673652902
  11. 0.69520825939253
  12. 0.0292122275568545
  13. 0.537068107165396
  14. 0.521737476810813
  15. 0.895236987853423
  16. 0.0903797801584005
  17. 0.335109637351707
  18. 0.607225279323757
  19. 0.227828048169613
  20. 0.628951021237299
  21. 0.942104737507179
  22. 0.699409523978829
  23. 0.467985502909869
  24. 0.160552969668061
  25. 0.241791844600812
  26. 0.238411884522066
  27. 0.76329149492085
  28. 0.802768731722608
  29. 0.754189879400656
  30. 0.949982912512496
  31. 0.255262546474114
  32. 0.0539967038203031
  33. 0.549355506896973
  34. 0.234619335038587
  35. 0.612806283170357
  36. 0.111309222411364
  37. 0.765110291074961
  38. 0.321251665242016
  39. 0.200868639163673
  40. 0.212012768723071
  41. 0.517911856528372
  42. 0.291292979381979
  43. 0.306515076430514
  44. 0.278687868732959
  45. 0.630747026298195
  46. 0.783034417545423
  47. 0.00465247919782996
  48. 0.160219222074375
  49. 0.706471200101078
  50. 0.675747690955177
  51. 0.768958376022056
  52. 0.079433829523623
  53. 0.708048960659653
  54. 0.571499577956274
  55. 0.588008775375783
  56. 0.801330324960873
  57. 0.50124762672931
  58. 0.436468082480133
  59. 0.59256277536042
  60. 0.402759515214711
  61. 0.0106187257915735
  62. 0.731107883388177
  63. 0.619682423537597
  64. 0.0847816201858222
  65. 0.0228138982784003
  66. 0.589616667712107
  67. 0.564619412645698
  68. 0.77781374193728
  69. 0.397121073212475
  70. 0.403469764161855
  71. 0.354331506416202
  72. 0.726002972107381
  73. 0.808434300823137
  74. 0.0770575746428221
  75. 0.68574546626769
  76. 0.470633488846943
  77. 0.721869681961834
  78. 0.105715058045462
  79. 0.524519525235519
  80. 0.629048575414345
  81. 0.837673249188811
  82. 0.281168871093541
  83. 0.833539023064077
  84. 0.0159828360192478
  85. 0.582876326749101
  86. 0.703502891585231
  87. 0.122999010141939
  88. 0.878704373957589
  89. 0.347580010537058
  90. 0.933304481673986
  91. 0.10655885306187
  92. 0.0534643181599677
  93. 0.743015992222354
  94. 0.0733375798445195
  95. 0.848653605673462
  96. 0.209068194963038
  97. 0.777745527680963
  98. 0.785296281101182
  99. 0.688700990285724
  100. 0.39948672358878

Option 1: assigning values to intermediate objects

out <- deviation(x)
out <- square(out)
out <- mean(out)
out <- sqrt(out)
out
0.277957314206755

Option 2: nested function calls

sqrt(mean(square(deviation(x))))
0.277957314206755

Option 3: the binary operator %>%, which is called the pipe and is pronounced as “and then”.

library(magrittr, quietly = TRUE) # also loaded as a dependency of dplyr and tidyverse

x %>%
  deviation() %>%
  square() %>%
  mean() %>%
  sqrt()
Attaching package: ‘magrittr’
The following object is masked _by_ ‘.GlobalEnv’:

    add
0.277957314206755

What to choose?

Each of the three options has its own strengths and weaknesses:

Intermediate objects:

  • requires you to name intermediate objects. This is a strength when objects are important, but a weakness when values are truly intermediate.

Nesting:

  • is concise, and well suited for short sequences.

  • But longer sequences are hard to read because they are read inside out and right to left.

Piping:

  • allows you to read code in straightforward left-to-right fashion and doesn’t require you to name intermediate objects.

  • But you can only use it with linear sequences of transformations of a single object.

  • It also requires an additional third party package and assumes that the reader understands piping.
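On the third-party package point: R 4.1 and later also ship a native pipe, |>, in base R. A minimal sketch of the same composition with it, assuming R >= 4.1 and using a small hand-picked vector so the result is easy to verify by hand:

```r
square <- function(x) {
  x^2
}
deviation <- function(x) {
  x - mean(x)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # mean is 5, squared deviations sum to 32

# Base R's native pipe needs no extra package
x |>
  deviation() |>
  square() |>
  mean() |>
  sqrt()
# 2
```

The native pipe covers the simple left-to-right case shown here; it is not a drop-in replacement for every magrittr feature.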

5 min break

Writing tests in R with test_that

  • The industry-standard tool for writing tests in R is the testthat package.

  • To use an R package, we typically load the package into R using the library function:

library(testthat)
Attaching package: ‘testthat’
The following objects are masked from ‘package:magrittr’:

    equals, is_less_than, not

How to write a test with testthat::test_that

test_that("Message to print if test fails", expect_*(...))

Often our test_that function calls are longer than 80 characters, so we use { to split the code across multiple lines, for example:

x <- c(3.5, 3.5, 3.5)
y <- c(3.5, 3.5, 3.49999)
test_that("x and y should contain the same values", {
    expect_equal(x, y)
})
Error: Test failed: 'x and y should contain the same values'
* <text>:4: `x` not equal to `y`.
1/3 mismatches
[3] 3.5 - 3.5 == 1e-05
Traceback:

1. test_that("x and y should contain the same values", {
 .     expect_equal(x, y)
 . })
2. test_code(desc, code, env = parent.frame())
3. get_reporter()$end_test(context = get_reporter()$.context, test = test)
4. stop(message, call. = FALSE)

Are you starting to see a pattern with { yet?

Common expect_* statements for use with test_that

Is the object equal to a value?

  • expect_identical - test two objects for being exactly equal

  • expect_equal - compare R objects x and y testing ‘near equality’ (can set a tolerance)

  • expect_equivalent - compare R objects x and y testing ‘near equality’ (can set a tolerance) and does not assess attributes
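The difference between exact and near equality shows up most often with floating point arithmetic. A minimal sketch, assuming the testthat package is installed:

```r
library(testthat)

test_that("expect_equal tolerates floating point error", {
  # 0.1 + 0.2 differs from 0.3 only by rounding error, so this passes
  expect_equal(0.1 + 0.2, 0.3)
  # ...but the two values are not bit-for-bit identical
  expect_false(identical(0.1 + 0.2, 0.3))
})

test_that("expect_identical requires matching type and value", {
  expect_identical(c(1L, 2L), 1:2)  # both are integer vectors
})
```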

Does code produce an output/message/warning/error?

  • expect_error - tests if an expression throws an error

  • expect_warning - tests whether an expression outputs a warning

  • expect_output - tests that print output matches a specified value
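A short sketch of the three expectations listed above, assuming testthat is installed. The second argument to expect_error and expect_output is a regular expression matched against the message or printed text:

```r
library(testthat)

test_that("errors, warnings and printed output are detected", {
  expect_error(stop("boom"), "boom")      # error message must match "boom"
  expect_warning(as.integer("a"))         # coercing "a" warns and yields NA
  expect_output(print("hello"), "hello")  # printed text must match "hello"
})
```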

Is the object true/false?

These are fall-back expectations that you can use when none of the other more specific expectations apply. The disadvantage is that you may get a less informative error message.

  • expect_true - tests if the object returns TRUE

  • expect_false - tests if the object returns FALSE

Challenge 1:

Add a tolerance argument to the expect_equal statement such that the observed difference between these very similar vectors doesn’t cause the test to fail.

x <- c(3.5, 3.5, 3.5)
y <- c(3.5, 3.5, 3.49999)
test_that("x and y should contain the same values", {
    expect_equal(x, y)
})

Unit test example

celsius_to_fahr <- function(temp) {
  (temp * (9 / 5)) + 32
}
test_that("Temperature should be the same in Celsius and Fahrenheit at -40", {
    expect_identical(celsius_to_fahr(-40), -40)
})
test_that("Room temperature should be about 23 degrees Celsius and 73 degrees Fahrenheit", {
    expect_equal(celsius_to_fahr(23), 73, tolerance = 1)
})

Test-driven development (TDD) review

  1. Write your tests first (that call the function you haven’t yet written), based on edge cases you expect or can calculate by hand

  2. If necessary, create some “helper” data to test your function with (this might be done in conjunction with step 1)

  3. Write your function to make the tests pass (in this process you might think of more tests that you want to add)

Toy example of how TDD can be helpful

Let’s create a function called fahr_to_celsius that converts temperatures from Fahrenheit to Celsius.

First we’ll write the tests (which will fail):

test_fahr_to_celsius <- function() {
    test_that("Temperature should be the same in Celsius and Fahrenheit at -40", {
        expect_identical(fahr_to_celsius(-40), -40)
    })
    test_that("Room temperature should be about 73 degrees Fahrenheit and 23 degrees Celsius", {
        expect_equal(fahr_to_celsius(73), 23, tolerance = 1)
    })
}

Then we write our function to pass the tests:

fahr_to_celsius <- function(temp) {
    (temp - 32) * 5/9
}

Then we call our tests to check it:

test_fahr_to_celsius()

Exception handling in R

How to check type and throw an error if not the expected type:

if (!is.numeric(c(1, 2, "c"))) {
  stop("Cannot compute on a vector of characters.")
}

Example of defensive programming at the beginning of a function:

fahr_to_celsius <- function(temp) {
    if (!is.numeric(temp)) {
        stop("Cannot convert temperature from Fahrenheit for non-numeric values")
    }
    (temp - 32) * 5/9
}
fahr_to_celsius("thirty")

If you wanted to issue a warning instead of an error, you could use warning in place of stop in the example above. However, in most cases it is better practice to throw an error than to print a warning.

We can test our exceptions using test_that:

test_that("Non-numeric values for temp should throw an error", {
    expect_error(fahr_to_celsius("thirty"))
    expect_error(fahr_to_celsius(list(4)))
    })

try in R

Similar to Python, R has a try function to attempt to run code, and continue running subsequent code even if code in the try block does not work:

try({
    # some code
    # that can be 
    # split across several
    # lines
})

# code to continue even if error in code 
# in try code block above

This code normally results in an error that stops following code from running:

x <- data.frame(col1 = c(1, 2, 3, 2, 1),
                col2 = c(0, 1, 0, 0, 1))
x[3]
dim(x)

try lets the code following the error run:

try({
    x <- data.frame(col1 = c(1, 2, 3, 2, 1),
                    col2 = c(0, 1, 0, 0, 1))
    x[3]
})
dim(x)

Sensibly (IMHO) try has a default of silent = FALSE, which you can change if you find good reason to.

roxygen2 friendly function documentation
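When an error is suppressed with try, the value it returns can be inspected to decide what to do next: on failure, try returns an object of class "try-error". A minimal sketch, where falling back to NA is an assumption made purely for illustration:

```r
# log() of a character string is an error; silent = TRUE hides the message
result <- try(log("a"), silent = TRUE)

if (inherits(result, "try-error")) {
  # The original condition object is stored as an attribute of the result
  message("Falling back to NA because: ",
          conditionMessage(attr(result, "condition")))
  result <- NA
}
result
```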

#' Converts temperatures from Fahrenheit to Celsius.
#'    
#' @param temp a vector of temperatures in Fahrenheit
#' 
#' @return a vector of temperatures in Celsius
#' 
#' @examples
#' fahr_to_celsius(-20)
fahr_to_celsius <- function(temp) {
    (temp - 32) * 5/9
}

Why roxygen2 documentation? If you document your functions like this, when you create an R package to share them they will be set up to have the fancy documentation that we get using ?function_name.

RStudio has a template for roxygen2 documentation

../_images/insert_roxygen.png

Reading in functions from an R script

Usually the step before packaging your code is having some functions in another script that you want to read into your analysis. We use the source function to do this:

source("src/kelvin_to_celsius.R")

Once you do this, you have access to all functions contained within that script:

kelvin_to_celsius(273.15)

Note - this is how the test_* functions are brought into your Jupyter notebooks for the autograding part of your lab3 homework.
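To try source without an existing project layout, the sketch below writes a one-function script to a temporary file and then sources it. The body of kelvin_to_celsius is an assumption; the contents of src/kelvin_to_celsius.R are not shown in these notes:

```r
# Write a small script to a temporary file (stand-in for src/kelvin_to_celsius.R)
script_path <- tempfile(fileext = ".R")
writeLines(
  "kelvin_to_celsius <- function(temp) temp - 273.15",  # assumed definition
  script_path
)

# source() runs the script, which defines the function in the current session
source(script_path)
kelvin_to_celsius(273.15)  # 0
```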

Introduction to R packages

  • source("script_with_functions.R") is useful, but when you start using these functions in different projects you need to keep copying the script or rely on overly specific paths

  • The next step is packaging your R code so that it can be installed and then used across multiple projects on your (and others') machines without directly pointing to where the code is stored; instead, it is accessed using the library function.

  • You will learn how to do this in Collaborative Software Development (term 2), but for now, let’s tour a simple R package to get a better understanding of what they are: https://github.com/ttimbers/convertemp

Install the convertemp R package:

In RStudio, type: devtools::install_github("ttimbers/convertemp")

library(convertemp)
?celsius_to_kelvin
celsius_to_kelvin(0)

What did we learn today?

Attribution: