Lecture 5: Introduction to R

Note from Firas

Our series of R lectures will be presented by Dr. Tiffany Timbers, the other option co-director of the Vancouver MDS program.

Her lecture videos are not yet posted on YouTube, but I’ve got them locally so that’s how we’ll watch them.

High level goal for part II of this course:

Learn the R’isms of the R programming language.

First, a bit of history about me

  • Ph.D. in Neuroscience (2012)

  • Started using R in ~ 2010 because I needed to do “complex” statistics

  • Other programming languages I have used:

    • Turing

    • Java

    • Matlab

    • Python (only other language that I still currently use and remember)

Oh, and I like gifs, so you might see some in my lecture notes…

https://media.giphy.com/media/3PAL5bChWnak0WJ32x/giphy.gif

Now, a bit of history about R

  • An implementation of the S programming language (created at Bell Labs in 1976)

  • written in C, Fortran, and R itself

  • R was created by Ross Ihaka and Robert Gentleman (Statisticians from NZ)

  • R is named partly after the authors and partly as a play on the name of S

  • First stable beta version in 2000

http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d2594d25970c-pi

Source: https://blog.revolutionanalytics.com/2016/03/16-years-of-r-history.html

R currently has more than 15,000 additional packages (as of September 2018)!

So, who’s used R before?

Let’s start with a vignette

library(tidyverse, quietly = TRUE)
us_2015_econ <- read_csv("data/state_property_data.csv")
head(us_2015_econ)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
 ggplot2 3.3.2      purrr   0.3.4
 tibble  3.0.1      dplyr   1.0.0
 tidyr   1.1.0      stringr 1.4.0
 readr   1.3.1      forcats 0.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()
options(repr.plot.height = 3, repr.plot.width = 4)
ggplot(us_2015_econ, aes(x = med_income, y = med_prop_val)) +
  geom_point() +
  xlab("Income (USD)") +
  ylab("Median property value (USD)")
../_images/lecture5_16_0.png
us_2016_vote <- read_csv("data/2016_presidential_election_state_vote.csv")
head(us_2016_vote)
Parsed with column specification:
cols(
  party = col_character(),
  state = col_character()
)
A tibble: 6 × 2
party       state
<chr>       <chr>
republican  AL
republican  AK
republican  AZ
republican  AR
democrat    CA
democrat    CO
us_data <- left_join(us_2015_econ, us_2016_vote)
head(us_data)
Joining, by = "state"
A tibble: 6 × 6
state  med_income  med_prop_val  population  mean_commute_minutes  party
<chr>  <dbl>       <dbl>         <dbl>       <dbl>                 <chr>
AK     64222       197300        733375      10.46830              republican
AL     36924       94800         4830620     25.30991              republican
AR     35833       83300         2958208     22.40109              republican
AZ     44748       128700        6641928     20.58786              republican
CA     53075       252100        38421464    23.38085              democrat
CO     48098       198900        5278906     19.50792              democrat
options(repr.plot.height = 3, repr.plot.width = 5)
ggplot(us_data, aes(x = med_income, y = med_prop_val, color = party)) +
  geom_point() +
  xlab("Income (USD)") +
  ylab("Median property value (USD)")
../_images/lecture5_19_0.png
state_data <- filter(us_data, party != "Not Applicable")
ggplot(state_data, aes(x = med_income, y = med_prop_val, color = party)) +
    geom_point() +
    xlab("Income (USD)") +
    ylab("Median property value (USD)") +
    scale_colour_manual(values = c("blue", "red")) +
    scale_x_continuous(labels = scales::dollar_format()) +
    scale_y_continuous(labels = scales::dollar_format())
../_images/lecture5_20_0.png

The whole game

  • What about this makes R an attractive language for data science?

  • What is different about this R code compared to other common languages?

  • How & why does R do these things?

Answering these questions is the aim of the second part of this course!

Lecture learning objectives:

By the end of this lecture and lab 3, students should be able to:

  • Use the assignment symbol, <-, to assign values to objects in R and explain how it differs from =

  • Create in R, and define and differentiate in English, the key data types listed below:

    • logical, numeric and character vectors

    • lists

    • data frames and tibbles

  • Use R to determine the type and structure of an object

  • Explain the distinction between names and values, and when R will copy an object

  • Use the three subsetting operators, [[, [, and $, to subset single and multiple elements from vectors and data frames

  • Compute numeric and boolean values using their respective types and operations

  • Write conditional statements in R with if, else if and else to run different code depending on the input

  • Write for loops in R to repeatedly run code

  • Write R code that is human readable and follows the tidyverse style guide

The assignment symbol, <-

  • R came from S, S used <-

  • S was inspired by APL, which also used <-

  • APL was designed on a specific keyboard, which had a key for <-

  • At that time there was no == for testing equality; it was tested with =, so something else needed to be used for assignment.

https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/APL-keybd2.svg/410px-APL-keybd2.svg.png

source: https://colinfay.me/r-assignment/

The assignment symbol, <-

  • Nowadays, = can also be used for assignment, however there are some things to be aware of…

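  • Stylistically, <- is preferred over = for readability

  • Both <- and -> are valid in R; the latter can be useful at the end of pipelines (more on this in data wrangling)

  • <- and = behave differently with respect to environments: inside a function call, = names an argument, while <- creates an object in the calling environment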

  • we expect you to use <- in MDS for object assignment in R

Assignment readability

Consider this code:

c <- 12
d <- 13

Which equality is easier to read?

e = c == d

or

e <- c == d

Assignment environment

What value does x hold at the end of each of these code chunks?

median(x = 1:10)

vs

median(x <- 1:10)

median(x = 1:10)
x
5.5
Error: object 'x' not found
median(x <- 1:10)
x
5.5
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
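
The difference can also be checked programmatically. The sketch below wraps both calls in a hypothetical helper, demo_assignment() (not part of the lecture), so that nothing leaks into your workspace. It uses mean() rather than median() purely for illustration; both take an argument named x.

```r
# Hypothetical helper: runs both forms inside a function so the
# experiment does not create objects in the global environment.
demo_assignment <- function() {
  here <- environment()

  mean(x = 1:10)    # `=` only names the argument x of mean()
  after_equals <- exists("x", envir = here, inherits = FALSE)

  mean(x <- 1:10)   # `<-` creates x in this environment, then passes it on
  after_arrow <- exists("x", envir = here, inherits = FALSE)

  c(after_equals = after_equals, after_arrow = after_arrow)
}

demo_assignment()
```

Only the second call leaves an object named x behind, which is why the choice of symbol matters inside function calls.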

What does assignment do in R?

When you type this into R: x <- c(1, 2, 3)

This is what R does:

https://d33wubrfki0l68.cloudfront.net/bd90c87ac98708b1731c92900f2f53ec6a71edaf/ce375/diagrams/name-value/binding-1.png

Source: Advanced R by Hadley Wickham

What does assignment do in R?

And then if you type y <- c(1, 2, 3)

Then R does this:

https://d33wubrfki0l68.cloudfront.net/bdc72c04d3135f19fb3ab13731129eb84c9170af/f0ab9/diagrams/name-value/binding-2.png

We are binding names like “x” and/or “y” to objects, not creating objects named something like “x” or “y”.

Source: Advanced R by Hadley Wickham

A note on names

Rules for syntactic names:

  • May use: letters, digits, . and _

  • Cannot begin with _ or a digit

  • Cannot use reserved words (e.g., for, if, return)

How to manage non-syntactic names

  • Usually come across these when reading in someone else’s data

  • Backticks, `, can be used to manage these cases (e.g., `_abc` <- 1)

  • If your data contains these, use R to rename things to make them syntactic (for your future sanity)
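
A minimal sketch of both approaches, using a made-up data frame (the column names and values are invented for illustration). Base R's make.names() is one way to rename; other tools exist.

```r
# Made-up data with non-syntactic column names, as you might get
# when reading in someone else's spreadsheet
raw <- data.frame(`2015 income` = c(100, 200), `_id` = c(1, 2),
                  check.names = FALSE)

# Backticks let you refer to the columns as they are
raw$`2015 income`

# make.names() turns arbitrary strings into syntactic names
clean <- raw
names(clean) <- make.names(names(clean))
names(clean)

# After renaming, no backticks are needed
clean[[1]]
```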

Key datatypes in R

../_images/r_datatypes.png

Note: R has no scalars; single values are represented as vectors of length 1.

Source: Advanced R by Hadley Wickham

  • NULL is not a vector, but it is closely related and frequently plays the role of a generic zero-length vector.

What is a data frame?

From a data perspective, it is a rectangle where the rows are the observations:

https://ubc-dsci.github.io/introduction-to-datascience/img/obs.jpeg

What is a data frame?

and the columns are the variables:

https://ubc-dsci.github.io/introduction-to-datascience/img/vars.jpeg

What is a data frame?

From a computer programming perspective, in R, a data frame is a special subtype of a list object whose elements (columns) are vectors.

https://ubc-dsci.github.io/introduction-to-datascience/img/vectors.jpeg

Question: What do you notice about the elements of each of the vectors in this data frame?

What is a vector?

  • objects that can contain zero or more elements

  • elements are ordered

  • elements must all be of the same type (e.g., double, integer, character, logical)

https://ubc-dsci.github.io/introduction-to-datascience/img/vector.jpeg

How are vectors different from a list?

https://ubc-dsci.github.io/introduction-to-datascience/img/vec_vs_list.jpeg

Reminder: what do lists have to do with data frames?

https://ubc-dsci.github.io/introduction-to-datascience/img/dataframe.jpeg
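
The relationship between the three structures can be sketched in a few lines. The object names and values below are made up for illustration.

```r
# An atomic vector: every element has the same type
ages <- c(31, 45, 27)

# A list: elements can differ in type and length
person <- list(name = "Ada", age = 36, languages = c("R", "Python"))

# A data frame: a list of equal-length vectors, one per column
people <- data.frame(name = c("Ada", "Grace", "Linus"), age = ages)

typeof(ages)      # "double"
typeof(person)    # "list"
typeof(people)    # "list": a data frame is built on top of a list
is.list(people)   # TRUE
length(people)    # 2, the number of columns (list elements)
```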

A bit more about Vectors

Your closest and most important friend in R

https://media.giphy.com/media/EQCgmS4lwDS8g/giphy.gif

Creating vectors and vector types

char_vec <- c("joy", "peace", "help", "fun", "sharing")
char_vec
typeof(char_vec)
  1. 'joy'
  2. 'peace'
  3. 'help'
  4. 'fun'
  5. 'sharing'
'character'
log_vec <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
log_vec
typeof(log_vec)
  1. TRUE
  2. TRUE
  3. FALSE
  4. FALSE
  5. TRUE
'logical'
double_vec <- c(1, 2, 3, 4, 5)
double_vec
typeof(double_vec)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
'double'
int_vec <- c(1L, 2L, 3L, 4L, 5L)
int_vec
typeof(int_vec)
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
'integer'

str is a useful command to get even more information about an object:

str(int_vec)
 int [1:5] 1 2 3 4 5

What happens to vectors of mixed type?

mixed_vec <- c("joy", 5.6, TRUE, 1L, "sharing")
typeof(mixed_vec)
'character'

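Hierarchy for coercion (a mixed vector is coerced to the most flexible type it contains; types are listed from most to least flexible):

character → double → integer → logical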

Useful functions for testing type and forcing coercion:

  • is.logical(), is.integer(), is.double(), and is.character() return TRUE or FALSE, depending on the type of the object and the function used.

  • as.logical(), as.integer(), as.double(), or as.character() coerce a vector to the type specified by the function name.
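
A short sketch of implicit and explicit coercion, using made-up values:

```r
# Implicit coercion: c() picks the most flexible type present
typeof(c(TRUE, 2L))     # "integer"
typeof(c(2L, 3.5))      # "double"
typeof(c(3.5, "a"))     # "character"

# Testing the type of an object
is.character(c(3.5, "a"))   # TRUE
is.double(2L)               # FALSE

# Explicit coercion with the as.*() functions
as.integer(c("1", "2"))       # 1 2
as.character(5.6)             # "5.6"
as.logical(c("TRUE", "no"))   # TRUE NA ("no" is not a recognised logical string)

# Logical values become 0 and 1 in arithmetic
sum(c(TRUE, FALSE, TRUE))     # 2
```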

How to subset and modify vectors

https://media.giphy.com/media/l4pTocra1lFDomV5S/giphy.gif

Subsetting

  • REMEMBER - R counts from 1!!!

name <- c("T", "i", "f", "f", "a", "n", "y")

What letter will I get in R? What would I get in Python?

name[3]
'f'

What letters will I get in R? What would I get in Python?

name[2:4]
  1. 'i'
  2. 'f'
  3. 'f'

What letters will I get in R? What would I get in Python?

name[-1]
  1. 'i'
  2. 'f'
  3. 'f'
  4. 'a'
  5. 'n'
  6. 'y'

How do I get the last element in a vector in R?

name[length(name)]
'y'

Modifying vectors

We can combine the assignment symbol and subsetting to modify vectors:

name
  1. 'T'
  2. 'i'
  3. 'f'
  4. 'f'
  5. 'a'
  6. 'n'
  7. 'y'
name[1] <- "t"
name
  1. 't'
  2. 'i'
  3. 'f'
  4. 'f'
  5. 'a'
  6. 'n'
  7. 'y'
name[1:3] <- c("T", "I", "F")
name
name[8:12]
  1. 'T'
  2. 'I'
  3. 'F'
  4. 'f'
  5. 'a'
  6. 'n'
  7. 'y'
  1. NA
  2. NA
  3. NA
  4. NA
  5. NA
name[8:12] <- c("-", "A", "n", "n", "e")
name
  1. 'T'
  2. 'I'
  3. 'F'
  4. 'f'
  5. 'a'
  6. 'n'
  7. 'y'
  8. '-'
  9. 'A'
  10. 'n'
  11. 'n'
  12. 'e'

What happens when you modify a vector in R?

Consider:

x <- c(1, 2, 3)
y <- x

y[3] <- 4
x
#> [1] 1 2 3

What is happening in R’s memory for each line of code?

Code

R’s memory representation

x <- c(1, 2, 3)

y <- x

y[[3]] <- 4

This is called “copy-on-modify”.

Source: Advanced R by Hadley Wickham

../_images/copy-on-modify.png
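
One way to watch the copy happen in base R is tracemem(), which reports when a traced object is duplicated. This is a sketch, assuming your R build has memory profiling enabled (checked below with capabilities("profmem")). Advanced R itself uses the lobstr package for this kind of inspection.

```r
x <- c(1, 2, 3)
y <- x                 # y is a second name bound to the same object

if (capabilities("profmem")) {
  tracemem(x)          # start reporting copies of this object
  y[[3]] <- 4          # modifying through y triggers a copy; a tracemem message is printed
  untracemem(x)
  untracemem(y)
} else {
  y[[3]] <- 4          # same result, just without the trace message
}

x   # still 1 2 3
y   # 1 2 4
```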

Why copy-on-modify

  • Since there are no scalars in R, vectors are essentially immutable

  • If you change one element of the vector, you have to copy the whole thing to update it

Why do we care about knowing this?

  • Given that data frames are built on top of vectors, this has implications for speed when working with large data frames

Why vectors?

Vectorized operations!

c(1, 2, 3, 4) + c(1, 1, 1, 1)
  1. 2
  2. 3
  3. 4
  4. 5

But watch out for vector recycling in R!

This makes sense:

c(1, 2, 3, 4) + c(1)
  1. 2
  2. 3
  3. 4
  4. 5

but this does not!

c(1, 2, 3, 4) + c(1, 2)
  1. 2
  2. 4
  3. 4
  4. 6
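
R only warns about recycling when the longer length is not a multiple of the shorter one, so it is easy to miss. The sketch below shows both cases, plus a hypothetical helper, add_strict() (not a base R function), that refuses to recycle anything other than a single value.

```r
# Lengths 4 and 2: the shorter vector is silently repeated as 1, 2, 1, 2
c(1, 2, 3, 4) + c(1, 2)

# Lengths 3 and 2: R still recycles, but warns because 3 is not a multiple of 2
c(1, 2, 3) + c(1, 2)

# Hypothetical helper: only allow equal lengths or a length-1 input
add_strict <- function(a, b) {
  stopifnot(length(a) == length(b) || length(a) == 1 || length(b) == 1)
  a + b
}

add_strict(c(1, 2, 3, 4), 1)               # fine: 2 3 4 5
try(add_strict(c(1, 2, 3, 4), c(1, 2)))    # error instead of silent recycling
```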

A list of vector operators here: R Operators cheat sheet

One to watch out for: the logical "and" and "or" operators each come in two forms. & and | work elementwise, while && and || are meant for single TRUE/FALSE values (for example, inside an if condition). For example:

# compares the elements of each vector by position
c(TRUE, TRUE, TRUE) & c(FALSE, TRUE, TRUE)
  1. FALSE
  2. TRUE
  3. TRUE
# in R versions before 4.3.0, compares only the first element of each vector;
# from R 4.3.0 onwards this call is an error, because the inputs are longer than 1
c(TRUE, TRUE, TRUE) && c(FALSE, TRUE, TRUE)
FALSE
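
A sketch of how the two forms are typically combined, using made-up commute times: build an elementwise result with &, then collapse it to a single value with any() or all() before using && in a condition.

```r
commute <- c(10.5, 25.3, 22.4)

# Elementwise: one TRUE/FALSE per element, useful for subsetting
long_commute <- commute > 20 & commute < 30
long_commute              # FALSE TRUE TRUE
commute[long_commute]     # 25.3 22.4

# Collapse to a single value before using && inside if
if (any(long_commute) && length(commute) == 3) {
  message("At least one long commute")
}

all(long_commute)         # FALSE
```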

Extending our knowledge to data frames

https://ubc-dsci.github.io/introduction-to-datascience/img/dataframe.jpeg

Getting to know a data frame

head(us_data)
A tibble: 6 × 6
state  med_income  med_prop_val  population  mean_commute_minutes  party
<chr>  <dbl>       <dbl>         <dbl>       <dbl>                 <chr>
AK     64222       197300        733375      10.46830              republican
AL     36924       94800         4830620     25.30991              republican
AR     35833       83300         2958208     22.40109              republican
AZ     44748       128700        6641928     20.58786              republican
CA     53075       252100        38421464    23.38085              democrat
CO     48098       198900        5278906     19.50792              democrat
str(us_data)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':	52 obs. of  6 variables:
 $ state               : chr  "AK" "AL" "AR" "AZ" ...
 $ med_income          : num  64222 36924 35833 44748 53075 ...
 $ med_prop_val        : num  197300 94800 83300 128700 252100 ...
 $ population          : num  733375 4830620 2958208 6641928 38421464 ...
 $ mean_commute_minutes: num  10.5 25.3 22.4 20.6 23.4 ...
 $ party               : chr  "republican" "republican" "republican" "republican" ...

Subsetting and modifying data frames

There are 3 operators that can be used when subsetting data frames: [, $ and [[

../_images/subsetting.png

Note that $ and [[ remove a level of structure from the data frame (this happens with lists too).

Operator  Example use         What it returns
[         us_data[1:10, 2:4]  rows 1-10 for columns 2-4 of the data frame, as a data frame
[         us_data[1:10, ]     rows 1-10 for all columns of the data frame, as a data frame
[         us_data[1]          the first column of the data frame, as a data frame
[[        us_data[[1]]        the first column of the data frame, as a vector
$         us_data$state       the column that corresponds to the name that follows the $, as a vector
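
The same operators can be tried on a small stand-in for us_data. The toy data frame below is made up for illustration and is a base data.frame rather than a tibble.

```r
# Made-up stand-in for us_data with the same kind of columns
toy <- data.frame(
  state = c("AK", "AL", "AR"),
  med_income = c(64222, 36924, 35833),
  party = c("republican", "republican", "republican"),
  stringsAsFactors = FALSE
)

class(toy[1])       # "data.frame": [ keeps the data frame structure
class(toy[[1]])     # "character": [[ pulls out the column itself
class(toy$state)    # "character": $ does the same, by name
identical(toy[[1]], toy$state)   # TRUE

# Rows 1-2 of two columns, still a data frame
toy[1:2, c("state", "med_income")]
```

One difference to keep in mind: a base data.frame simplifies toy[, 1] to a vector unless you add drop = FALSE, whereas a tibble always returns a tibble from [.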

Logical indexing of data frames

We can also use logical statements to filter for rows containing certain values, or values above or below a threshold:

us_data[us_data$party == "republican", ]
A tibble: 32 × 6
state  med_income  med_prop_val  population  mean_commute_minutes  party
<chr>  <dbl>       <dbl>         <dbl>       <dbl>                 <chr>
AK     64222.0     197300        733375      10.46830              republican
AL     36924.0     94800         4830620     25.30991              republican
AR     35833.0     83300         2958208     22.40109              republican
AZ     44748.0     128700        6641928     20.58786              republican
FL     43355.0     125600        19645772    24.78056              republican
GA     37865.0     101700        10006693    24.54914              republican
IA     49448.0     102700        3093526     18.35024              republican
ID     43080.5     143900        1616547     19.85348              republican
IN     47194.0     111800        6568645     23.51750              republican
KS     46875.0     85200         2892987     16.69279              republican
KY     38827.5     94050         4397353     24.48608              republican
LA     40757.5     98250         4625253     25.91754              republican
ME     45594.0     150850        1329100     21.98533              republican
MI     42161.0     103200        9900571     21.92783              republican
MO     40597.0     101200        6045448     22.39583              republican
MS     33748.5     78700         2988081     25.55173              republican
MT     44267.0     135550        1014699     16.02368              republican
NC     40543.5     134000        9845333     23.76222              republican
ND     54960.0     101100        721640      16.33698              republican
NE     49068.0     91200         1869365     16.43135              republican
OH     46931.0     111050        11575977    23.52100              republican
OK     43781.0     88500         3849733     21.11157              republican
PA     47313.0     138500        12779559    24.08609              republican
NA     NA          NA            NA          NA                    NA
SC     38769.5     100450        4777576     25.11677              republican
SD     48415.5     88000         843190      15.06729              republican
TN     38576.0     108600        6499615     25.52855              republican
TX     53207.0     136000        26538614    21.49733              republican
UT     50781.0     173500        2903379     18.93005              republican
WI     49754.5     145000        5742117     21.14367              republican
WV     39096.0     94600         1851420     27.10972              republican
WY     56569.0     190000        579679      18.67275              republican

Another example:

us_data[us_data$mean_commute_minutes > 25, ]
A tibble: 11 × 6
state  med_income  med_prop_val  population  mean_commute_minutes  party
<chr>  <dbl>       <dbl>         <dbl>       <dbl>                 <chr>
AL     36924.0     94800         4830620     25.30991              republican
DC     70848.0     475800        647484      28.25340              democrat
LA     40757.5     98250         4625253     25.91754              republican
MD     66745.5     250950        5930538     28.61998              democrat
MS     33748.5     78700         2988081     25.55173              republican
NJ     70471.0     299700        8904413     28.80077              democrat
PR     16851.5     106750        3583073     28.02170              NA
SC     38769.5     100450        4777576     25.11677              republican
TN     38576.0     108600        6499615     25.52855              republican
VA     47911.0     176000        8256630     26.20629              democrat
WV     39096.0     94600         1851420     27.10972              republican
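
The all-NA row in the first result is worth a closer look. It presumably appears because party is NA for one row (PR in the second result): comparing NA with "republican" gives NA rather than TRUE or FALSE, and [ returns a row of NAs for it. A sketch with made-up data:

```r
toy <- data.frame(
  state = c("AL", "DC", "PR"),
  party = c("republican", "democrat", NA),
  stringsAsFactors = FALSE
)

toy$party == "republican"             # TRUE FALSE NA
toy[toy$party == "republican", ]      # two rows: AL plus an all-NA row

# Two common ways to drop the NA row
toy[which(toy$party == "republican"), ]                  # which() ignores NA
toy[!is.na(toy$party) & toy$party == "republican", ]     # explicit NA check
```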

Modifying data frames

Similar to vectors, we can combine the assignment symbol and subsetting to modify data frames.

For example, here we create a new column called mean_commute_hours:

us_data$mean_commute_hours <- us_data$mean_commute_minutes / 60
head(us_data)
A tibble: 6 × 7 (mean_commute_hours shown rounded to 4 decimal places)
state  med_income  med_prop_val  population  mean_commute_minutes  party       mean_commute_hours
<chr>  <dbl>       <dbl>         <dbl>       <dbl>                 <chr>       <dbl>
AK     64222       197300        733375      10.46830              republican  0.1745
AL     36924       94800         4830620     25.30991              republican  0.4218
AR     35833       83300         2958208     22.40109              republican  0.3734
AZ     44748       128700        6641928     20.58786              republican  0.3431
CA     53075       252100        38421464    23.38085              democrat    0.3897
CO     48098       198900        5278906     19.50792              democrat    0.3251

The same syntax works to overwrite an existing column.

What happens when we modify an entire column? or a row?

To answer this we need to look at how data frames are represented in R’s memory.

How R represents data frames:

  • Remember that data frames are lists of vectors

  • As such, they don’t store the values themselves, they store references to them:

d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))

https://d33wubrfki0l68.cloudfront.net/80d8995999aa240ff4bc91bb6aba2c7bf72afc24/95ee6/diagrams/name-value/dataframe.png

Source: Advanced R by Hadley Wickham

How R represents data frames:

If you modify a column, only that column needs to be modified; the others will still point to their original references:

d2 <- d1
d2[, 2] <- d2[, 2] * 2

https://d33wubrfki0l68.cloudfront.net/c19fd7e31bf34ceff73d0fac6e3ea22b09429e4a/23d8d/diagrams/name-value/d-modify-c.png

Source: Advanced R by Hadley Wickham

How R represents data frames:

However, if you modify a row, every column is modified, which means every column must be copied:

d3 <- d1
d3[1, ] <- d3[1, ] * 3

https://d33wubrfki0l68.cloudfront.net/36df61f54d1ac62e066fb814cb7ba38ea6047a74/facf8/diagrams/name-value/d-modify-r.png

Source: Advanced R by Hadley Wickham
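
The column and row modifications can be run end to end as below. This sketch only confirms the resulting values and that d1 is left untouched; it does not show the memory layout, which is what the diagrams from Advanced R illustrate.

```r
d1 <- data.frame(x = c(1, 5, 6), y = c(2, 4, 3))

d2 <- d1
d2[, 2] <- d2[, 2] * 2    # touches only column y

d3 <- d1
d3[1, ] <- d3[1, ] * 3    # touches the first value of every column

d1        # unchanged
d2$y      # 4 8 6
d3[1, ]   # x = 3, y = 6
```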

An exception to copy-on-modify

If an object has a single name bound to it, R will modify it in place:

v <- c(1, 2, 3)

https://d33wubrfki0l68.cloudfront.net/496ac87edf04d7e235747c3cf4a4e66deca754f2/3ac04/diagrams/name-value/v-inplace-1.png

v[[3]] <- 4

https://d33wubrfki0l68.cloudfront.net/a6ef7ab337f156cdb2c21816923368383bc2e858/1f8bb/diagrams/name-value/v-inplace-2.png

  • Hence, modify in place can be a useful optimization for speeding up code.

  • However, there are some complications that make it challenging to predict exactly when R applies this optimization (see here for details)

  • There is one other time R will do this; we will cover it when we get to environments.

Source: Advanced R by Hadley Wickham

Control Flow: for loops

  • for loops in R work like this: for (item in vector) perform_action

  • When the code needs to span multiple lines, we surround it with curly braces, { }

Challenge or, are you still awake ;p

Rearrange this code to make a working for loop in R. Then discuss with your neighbour when the index is updated.

Answer:

for (i in 1:3) {
    i <- i * 2
    print(i)
}
 
[1] 2
[1] 4
[1] 6
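
On the question of when the index is updated: i is set from the sequence at the start of every iteration, so reassigning it inside the body does not change which values the loop visits. A small sketch (the visited vector is only there to record what happens):

```r
visited <- c()
for (i in 1:3) {
  visited <- c(visited, i)   # value of i at the start of this iteration
  i <- i * 2                 # changes i only for the rest of this iteration
}

visited   # 1 2 3: the loop still visits every element of 1:3
i         # 6: the value left over from the final iteration
```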

Pro-tip for for loops in R

Beware of this:

means <- c()
for (i in 1:length(means)) {
  print(i)
}
[1] 1
[1] 0

That might not seem too bad, but it is if you are actually trying to do something:

means <- c()
out <- vector("list", length(means))
for (i in 1:length(means)) {
  out[[i]] <- rnorm(10, means[[i]])
}
Error in rnorm(10, means[[i]]): invalid arguments
Traceback:

1. rnorm(10, means[[i]])

What went wrong here and how can we avoid it?

This occurs because : works with both increasing and decreasing sequences:

1:length(means)
  1. 1
  2. 0

Source: Advanced R by Hadley Wickham

Pro-tip for for loops in R

Use seq_along(x) instead. It always returns a value the same length as x:

seq_along(means)
means <- c()
for (i in seq_along(means)) {
  print(i)
}
means <- c()
out <- vector("list", length(means))
for (i in seq_along(means)) {
  out[[i]] <- rnorm(10, means[[i]])
}

Source: Advanced R by Hadley Wickham
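
For comparison, here is the same pattern with a non-empty input. The values in means are made up for illustration.

```r
means <- c(0, 10, 100)
out <- vector("list", length(means))   # pre-allocate the result
for (i in seq_along(means)) {
  out[[i]] <- rnorm(10, mean = means[[i]])
}
lengths(out)         # 10 10 10

# With an empty input, seq_along() gives an empty sequence, so the loop body never runs
empty <- c()
seq_along(empty)     # integer(0)
1:length(empty)      # 1 0, the source of the bug above
```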

Control Flow: if, else if and else statements

The basic form of an if statement in R is as follows:

if (condition) true_action

if (condition) true_action else false_action

  • Again, when the code needs to span multiple lines, we surround it with curly braces, { }, to create a code block

Challenge or, are you still awake ;p

Rearrange this code to make a working if, else if and else statement in R.

Answer:

threshold <- 95.0
measure <- 93.5

if (measure > threshold) {
    print("Over the limit")
} else if (measure < threshold) {
    print("Under the limit")
} else {
    print("Exactly at threshold")
}
[1] "Under the limit"
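
The same conditional can be wrapped in a function so that each branch is easy to try. check_measure() is a hypothetical helper, not part of the lecture, and it returns the message instead of printing it.

```r
# Hypothetical helper wrapping the conditional above
check_measure <- function(measure, threshold = 95.0) {
  if (measure > threshold) {
    "Over the limit"
  } else if (measure < threshold) {
    "Under the limit"
  } else {
    "Exactly at threshold"
  }
}

check_measure(93.5)   # "Under the limit"
check_measure(97.0)   # "Over the limit"
check_measure(95.0)   # "Exactly at threshold"
```

Note that if expects a single TRUE or FALSE; to apply a condition to every element of a vector, use the vectorized ifelse() instead.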

Writing readable R code

  • WriTing AND reading (code) TaKes cognitive RESOURCES, & We only hAvE so MUCh!

  • To help free up cognitive capacity, we will follow the tidyverse style guide
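
As an illustration of the kind of difference the style guide makes, here is a made-up before/after pair. It is not the sample from the original slides, and the names are invented; both versions run and do the same thing.

```r
# Harder to read: no spacing, = for assignment, an uninformative name,
# everything on one line
X=c(93.5,97.2,95)
for(i in 1:length(X)){if(X[i]>95){print("over")}}

# Closer to the tidyverse style guide: snake_case names, <- for assignment,
# spaces around operators, one statement per line, seq_along()
measures <- c(93.5, 97.2, 95)
threshold <- 95

for (i in seq_along(measures)) {
  if (measures[i] > threshold) {
    print("over")
  }
}
```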

Sample code not in tidyverse style

Can we spot what’s wrong?

Sample code in tidyverse style

What did we learn today?

Attribution: