Lecture 1 - Introduction to R via the tidyverse#

(or Reading data, single data frame manipulations & tidying data in R)

Learning objectives#

By the end of this lecture and worksheet 1, students should be able to:

  • Choose and use the appropriate readr::read_* function and function arguments to load a given rectangular, plain text data set into R

  • Use the assignment symbol, <-, to assign values to objects in R

  • Write a dataframe to a .csv file using readr::write_csv

  • Use readr::read_csv to bring data from standard comma separated value (.csv) files into R

  • Recall and use the following dplyr functions and operators for their intended data wrangling tasks:

    • select

    • filter

    • mutate

    • arrange

    • desc

    • slice

    • pull

    • %in%

  • Use the pipe operator, |>, to combine two or more functions

  • Define the term โ€œtidy dataโ€

  • Discuss the advantages and disadvantages of the tidy data format

  • Use tidyr::pivot_wider & tidyr::pivot_longer in R to make untidy data tidy

First, a bit of history about me#

  • Ph.D. in Neuroscience (2012)

  • Started using R in ~ 2010 because I needed to do โ€œcomplexโ€ statistics

  • Other programming languages I have used:

    • Turing

    • Java

    • Matlab

    • Python (only other language that I still currently use and remember)

Oh, and I like gifs, so you might see some in my lecture notesโ€ฆ

https://media.giphy.com/media/3PAL5bChWnak0WJ32x/giphy.gif

Now, a bit of history about R#

  • An implementation of the S programming language (created at Bell labs in 1976)

  • written in C, Fortran, and R itself

  • R was created by Ross Ihaka and Robert Gentleman (Statisticians from NZ)

  • R is named partly after the authors and partly as a play on the name of S

  • First stable beta version in 2000

http://revolution-computing.typepad.com/.a/6a010534b1db25970b01b8d2594d25970c-pi

Source: https://blog.revolutionanalytics.com/2016/03/16-years-of-r-history.html

R currently has more than 15,000 additional packages (as of September 2018)!

So, whoโ€™s used R before?#

Letโ€™s take a poll!

Loading/importing data#

Taking our first step in data analysis:

Source: Grolemund & Wickham, R for Data Science

The four most common ways to do this in Data Science#

  1. read in a text file with data in a spreadsheet format

  2. read from a database (e.g., SQLite, PostgreSQL)

  3. scrape data from the web

  4. use a web API to read data from a website

Reading spreadsheet-like data into R#

  • We recommend using the readr package (part of the tidyverse) functions for plain text files

  • We recommed using the readxl package (part of the tidyverse but needs to be explicitly loaded in addition to tidyverse) for Microsoft Excel files.

Workflow for reading in spreadsheet-like data#

Step 1: LOOK AT THE FILE! ๐Ÿ‘€

Step 2: Match what you see with an appropriate {readr} or {readxl} package function

Step 3: Choose the correct arguments for that file and that functin

Step 1: LOOK AT THE RESULT TO MAKE SURE IT WORKED! ๐Ÿ‘€

The simplest case: a comma separated value file#

  • The simplest plain text data file you will encounter is a a comma separated value (.csv) file.

  • read_csv, from the {readr} package, is the function of choice here.

  • In its most basic use-case, read_csv expects that the data file:

    • has column names (or headers),

    • uses a comma (,) to separate the columns, and

    • does not have row names.

Reading in data/can_lang.csv data using read_csv:

First, load the {readr} or {tidyverse} library:

Note: {readr} is part of the {tidyverse} metapackage, so when you load {tidyverse} you get the {readr} package functions, as well as a bunch of other goodies weโ€™ll learn about shortly!

library(tidyverse)
Error in library(tidyverse): there is no package called โ€˜tidyverseโ€™
Traceback:

1. library(tidyverse)

Next, use the read_csv function to read the data into R from the can_lang.csv file:

can_lang1 <- read_csv("data/can_lang.csv")
head(can_lang1)
Rows: 214 Columns: 6

โ”€โ”€ Column specification โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Delimiter: ","
chr (2): category, language
dbl (4): mother_tongue, most_at_home, most_at_work, lang_known


โ„น Use `spec()` to retrieve the full column specification for this data.
โ„น Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 ร— 6
categorylanguagemother_tonguemost_at_homemost_at_worklang_known
<chr><chr><dbl><dbl><dbl><dbl>
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languagesAfrikaans 10260 4785 8523415
Non-Official & Non-Aboriginal languagesAfro-Asiatic languages, n.i.e. 1150 445 10 2775
Non-Official & Non-Aboriginal languagesAkan (Twi) 13460 5985 2522150
Non-Official & Non-Aboriginal languagesAlbanian 268951313534531930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120

Skipping rows when reading in data#

  • Often times information about how data was collected, or other relevant information, is included at the top of the data file.

  • Referred to as โ€œmetadataโ€

Reading in data/can_lang-meta-data.csv

can_lang2 <- read_csv("data/can_lang-meta-data.csv", skip = 2)
head(can_lang2)
Rows: 214 Columns: 6

โ”€โ”€ Column specification โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Delimiter: ","
chr (2): category, language
dbl (4): mother_tongue, most_at_home, most_at_work, lang_known


โ„น Use `spec()` to retrieve the full column specification for this data.
โ„น Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 ร— 6
categorylanguagemother_tonguemost_at_homemost_at_worklang_known
<chr><chr><dbl><dbl><dbl><dbl>
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languagesAfrikaans 10260 4785 8523415
Non-Official & Non-Aboriginal languagesAfro-Asiatic languages, n.i.e. 1150 445 10 2775
Non-Official & Non-Aboriginal languagesAkan (Twi) 13460 5985 2522150
Non-Official & Non-Aboriginal languagesAlbanian 268951313534531930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120

If you need to skip rows at the bottom of a file, use the n_max argument.

Note - you will need to know how many lines are in the file to succesfully use this. You can use some shell/command line tools to do this.

  • wc -l FILENAME will tell you how many lines are in the file

  • tail -n 10 FILENAME will print the last 10 lines of the file

read_delim as a more flexible method to get data into R#

  • most flexible readr function is read_delim

  • doesnโ€™t assume any delimiter, you have to specify

Reading in data/can_lang.tsv

can_lang3 <- read_delim("data/can_lang.tsv",  
                     delim = "\t", 
                     col_names = FALSE)
head(can_lang3)
Rows: 214 Columns: 6

โ”€โ”€ Column specification โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Delimiter: "\t"
chr (2): X1, X2
dbl (4): X3, X4, X5, X6


โ„น Use `spec()` to retrieve the full column specification for this data.
โ„น Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 ร— 6
X1X2X3X4X5X6
<chr><chr><dbl><dbl><dbl><dbl>
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languagesAfrikaans 10260 4785 8523415
Non-Official & Non-Aboriginal languagesAfro-Asiatic languages, n.i.e. 1150 445 10 2775
Non-Official & Non-Aboriginal languagesAkan (Twi) 13460 5985 2522150
Non-Official & Non-Aboriginal languagesAlbanian 268951313534531930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120

Note: you can see that above we also used another argument: col_names = FALSE. This is because this version of the data set had no column names, and if we have such a file and donโ€™t specify that, then the first observation will be taken (inocrrectly) to be the column names. Another useful extension of this arguement is using it to assign column names, see the example below:

can_lang3 <- read_delim("data/can_lang.tsv",  
                     delim = "\t", 
                     col_names = c("category", "language", "mother_tongue", 
                                 "most_at_home", "most_at_work", "lang_known"))

Reading tabular data directly from a URL#

  • We can also use read_csv or read_delim (and related functions) to read in tabular data directly from a url that contains tabular data

  • In this case, we provide the url to the read_* function as the path to the file instead of a path to a local file on our computer

Reading in https://github.com/ttimbers/canlang/blob/master/inst/extdata/can_lang.csv

can_lang4 <- read_csv("https://raw.githubusercontent.com/ttimbers/canlang/master/inst/extdata/can_lang.csv")
head(can_lang4)
Rows: 214 Columns: 6

โ”€โ”€ Column specification โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Delimiter: ","
chr (2): category, language
dbl (4): mother_tongue, most_at_home, most_at_work, lang_known


โ„น Use `spec()` to retrieve the full column specification for this data.
โ„น Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 ร— 6
categorylanguagemother_tonguemost_at_homemost_at_worklang_known
<chr><chr><dbl><dbl><dbl><dbl>
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languagesAfrikaans 10260 4785 8523415
Non-Official & Non-Aboriginal languagesAfro-Asiatic languages, n.i.e. 1150 445 10 2775
Non-Official & Non-Aboriginal languagesAkan (Twi) 13460 5985 2522150
Non-Official & Non-Aboriginal languagesAlbanian 268951313534531930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120

Reading data from an Microsoft Excel file#

  • it is very common to encounter, and need to load into R, data stored as a Microsoft Excel spreadsheet (with the filename extension .xlsx)

  • To be able to do this, a key thing to know is that even though .csv and .xlsx files look almost identical when loaded into Excel, the data themselves are stored completely differently.

Read in data/can_lang.xlsx:

library(readxl)
can_lang5 <- read_excel("data/can_lang.xlsx")
head(can_lang5)
A tibble: 6 ร— 6
categorylanguagemother_tonguemost_at_homemost_at_worklang_known
<chr><chr><dbl><dbl><dbl><dbl>
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languagesAfrikaans 10260 4785 8523415
Non-Official & Non-Aboriginal languagesAfro-Asiatic languages, n.i.e. 1150 445 10 2775
Non-Official & Non-Aboriginal languagesAkan (Twi) 13460 5985 2522150
Non-Official & Non-Aboriginal languagesAlbanian 268951313534531930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120

Note - if there are multiple sheets, use the sheet argument to specify the sheet number or name.

Reading a Microsoft Excel file from the web#

When trying to read a Microsoft Excel file from the web into R using a URL you cannot just pass the URL to read_excel.

First you must download the file, and then read it locally from there:

url <- "https://github.com/ttimbers/canlang/blob/master/inst/extdata/can_lang.xlsx?raw=true"
download.file(url, "temp.xlsx")
can_lang6 <- read_excel("temp.xlsx")
head(can_lang6)
A tibble: 6 ร— 6
categorylanguagemother_tonguemost_at_homemost_at_worklang_known
<chr><chr><dbl><dbl><dbl><dbl>
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languagesAfrikaans 10260 4785 8523415
Non-Official & Non-Aboriginal languagesAfro-Asiatic languages, n.i.e. 1150 445 10 2775
Non-Official & Non-Aboriginal languagesAkan (Twi) 13460 5985 2522150
Non-Official & Non-Aboriginal languagesAlbanian 268951313534531930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120

Note about loading data#

  • Itโ€™s important to do it carefully + check results after!

    • will help reduce bugs and speed up your analyses down the road

  • Think of it as tying your shoes before you run; not exciting, but if done wrong it will trip you up later!

Writing files from R to a .csv file#

  • Once you have modified or summarized some data, you often want to save that data to a file for use later, or sharing with others.

  • One of the most common formats for this is .csv

  • To write to .csv in R we recommend using readr::write_csv as it uses sensible choices for a .csv file (e.g., includes column, but not row names, does not write quotes around the data, uses a comma as the delimiter, etc)

write_csv(can_lang6, "data/can_lang7.csv")

Cleaning up column names#

Column names from wild data can be wild! For example, column names with symbols or white space require special syntax (surround the column names with back ticks) to program with them in R. One of the first things you want to do when you read in data can be to clean these up!

If things are not too bad, and there are not too many, rename can be a friend. Letโ€™s use rename to change the column names that have white space to using underscores. Letโ€™s look at the can_lang-colnames.csv file for example:

can_lang8 <- read_csv("data/can_lang-colnames.csv")
head(can_lang8)
Rows: 214 Columns: 6

โ”€โ”€ Column specification โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Delimiter: ","
chr (2): Category, Language
dbl (4): Mother tongue, Spoken most at home, Spoken most at work, Language k...


โ„น Use `spec()` to retrieve the full column specification for this data.
โ„น Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 ร— 6
CategoryLanguageMother tongueSpoken most at homeSpoken most at workLanguage known
<chr><chr><dbl><dbl><dbl><dbl>
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languagesAfrikaans 10260 4785 8523415
Non-Official & Non-Aboriginal languagesAfro-Asiatic languages, n.i.e. 1150 445 10 2775
Non-Official & Non-Aboriginal languagesAkan (Twi) 13460 5985 2522150
Non-Official & Non-Aboriginal languagesAlbanian 268951313534531930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120
can_lang9 <- rename(can_lang8, Mother_tongue = `Mother tongue`,
                   Spoken_most_at_home = `Spoken most at home`,
                   Spoken_most_at_work = `Spoken most at work`,
                   Language_known = `Language known`)
head(can_lang9)
A tibble: 6 ร— 6
CategoryLanguageMother_tongueSpoken_most_at_homeSpoken_most_at_workLanguage_known
<chr><chr><dbl><dbl><dbl><dbl>
Aboriginal languages Aboriginal languages, n.o.s. 590 235 30 665
Non-Official & Non-Aboriginal languagesAfrikaans 10260 4785 8523415
Non-Official & Non-Aboriginal languagesAfro-Asiatic languages, n.i.e. 1150 445 10 2775
Non-Official & Non-Aboriginal languagesAkan (Twi) 13460 5985 2522150
Non-Official & Non-Aboriginal languagesAlbanian 268951313534531930
Aboriginal languages Algonquian languages, n.i.e. 45 10 0 120

Cleaning up a lot of column names#

Sometimes there are A LOT of column names with non-syntactic names. rename is not the best choice here. Instead weโ€™ll use the clean_names from the {janitor} package to transform all column names to syntactic ones.

library(janitor)
Attaching package: โ€˜janitorโ€™


The following objects are masked from โ€˜package:statsโ€™:

    chisq.test, fisher.test
census <- read_csv("data/census_snippet.csv")
head(census, 3)
Rows: 12 Columns: 16

โ”€โ”€ Column specification โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Delimiter: ","
chr  (4): Geographic name, Geographic type, Geographic name, Province or ter...
dbl (12): Geographic code, Geographic code, Province or territory, Global no...


โ„น Use `spec()` to retrieve the full column specification for this data.
โ„น Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 3 ร— 16
Geographic codeGeographic nameGeographic typeGeographic name, Province or territoryGeographic code, Province or territoryGlobal non-response rateData quality flagHousehold typeNumber of households, 2006Number of households, 2016Median household total income (2015 constant dollars), 2005Median household total income (2015 constant dollars), 2015Median household total income (2015 constant dollars), % changeMedian household after-tax income (2015 constant dollars), 2005Median household after-tax income (2015 constant dollars), 2015Median household after-tax income (2015 constant dollars), % change
<dbl><chr><chr><chr><dbl><dbl><dbl><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1001Division No. 1Census divisionNewfoundland and Labrador103.70Total โ€“ Household type including census family structure 97015112620576677467629.5497506450429.7
1001Division No. 1Census divisionNewfoundland and Labrador103.70Census-family households 71960 78825697209612437.9597138148636.5
1001Division No. 1Census divisionNewfoundland and Labrador103.70Households consisting of only one census family without additional persons66250 73020688379456637.4587158001236.3
clean_census <- clean_names(census)
head(clean_census, 3)
A tibble: 3 ร— 16
geographic_codegeographic_namegeographic_typegeographic_name_province_or_territorygeographic_code_province_or_territoryglobal_non_response_ratedata_quality_flaghousehold_typenumber_of_households_2006number_of_households_2016median_household_total_income_2015_constant_dollars_2005median_household_total_income_2015_constant_dollars_2015median_household_total_income_2015_constant_dollars_percent_changemedian_household_after_tax_income_2015_constant_dollars_2005median_household_after_tax_income_2015_constant_dollars_2015median_household_after_tax_income_2015_constant_dollars_percent_change
<dbl><chr><chr><chr><dbl><dbl><dbl><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1001Division No. 1Census divisionNewfoundland and Labrador103.70Total โ€“ Household type including census family structure 97015112620576677467629.5497506450429.7
1001Division No. 1Census divisionNewfoundland and Labrador103.70Census-family households 71960 78825697209612437.9597138148636.5
1001Division No. 1Census divisionNewfoundland and Labrador103.70Households consisting of only one census family without additional persons66250 73020688379456637.4587158001236.3

clean_names is a thing of great beauty!!!

Single data frame manipulations#

Here we are going to learn how to use some of the most essential single data frame manipulation functions from the {dplyr} tidyverse package. To explore these, we will work with some of the data from the Gapminder project. Jenny Bryan (MDS Founder and Software Developer at RStudio) released this as an R package called {gapminder}. We can load the data by loading the {gapminder} library:

library(gapminder)

Letโ€™s take a quick look at the first 6 rows using head:

head(gapminder)
A tibble: 6 ร— 6
countrycontinentyearlifeExppopgdpPercap
<fct><fct><int><dbl><int><dbl>
AfghanistanAsia195228.801 8425333779.4453
AfghanistanAsia195730.332 9240934820.8530
AfghanistanAsia196231.99710267083853.1007
AfghanistanAsia196734.02011537966836.1971
AfghanistanAsia197236.08813079460739.9811
AfghanistanAsia197738.43814880372786.1134

Use select to subset columns#

select from the [dplyr] package is used to subset the data on variables or columns. Hereโ€™s a conventional call to select two columns, year and lifeExp:

select(gapminder, year, lifeExp)
A tibble: 1704 ร— 2
yearlifeExp
<int><dbl>
195228.801
195730.332
196231.997
196734.020
197236.088
197738.438
198239.854
198740.822
199241.674
199741.763
200242.129
200743.828
195255.230
195759.280
196264.820
196766.220
197267.690
197768.930
198270.420
198772.000
199271.581
199772.950
200275.651
200776.423
195243.077
195745.685
196248.303
196751.407
197254.518
197758.014
โ‹ฎโ‹ฎ
198249.113
198752.922
199255.599
199758.020
200260.308
200762.698
195242.038
195744.077
196246.023
196747.768
197250.107
197751.386
198251.821
198750.821
199246.100
199740.238
200239.193
200742.384
195248.451
195750.469
196252.358
196753.995
197255.635
197757.674
198260.363
198762.351
199260.377
199746.809
200239.989
200743.487

Wow! Thatโ€™s a lot of rowsโ€ฆ For teaching purposes I am intentionally not binding a name to the output of these commands, as they are for demonstration only, however, that means we donโ€™t have an object to call head onโ€ฆ In Jupyter we can use this options function call to limit the number of rows output when we call an entire data frame.

# run this command to limit data frame output to 10 rows
options(repr.matrix.max.rows = 10)
select(gapminder, year, lifeExp)
A tibble: 1704 ร— 2
yearlifeExp
<int><dbl>
195228.801
195730.332
196231.997
196734.020
197236.088
โ‹ฎโ‹ฎ
198762.351
199260.377
199746.809
200239.989
200743.487

You can also use select to get a range of columns using names:

select(gapminder, country:lifeExp)
A tibble: 1704 ร— 4
countrycontinentyearlifeExp
<fct><fct><int><dbl>
AfghanistanAsia195228.801
AfghanistanAsia195730.332
AfghanistanAsia196231.997
AfghanistanAsia196734.020
AfghanistanAsia197236.088
โ‹ฎโ‹ฎโ‹ฎโ‹ฎ
ZimbabweAfrica198762.351
ZimbabweAfrica199260.377
ZimbabweAfrica199746.809
ZimbabweAfrica200239.989
ZimbabweAfrica200743.487

Use filter to subset rows that meet a specific condition#

filter from the {dplyr package, takes logical expressions and returns the rows for which all are TRUE, for example we can subset rows which have a life expectancy > 29:

filter(gapminder, lifeExp < 29)
A tibble: 2 ร— 6
countrycontinentyearlifeExppopgdpPercap
<fct><fct><int><dbl><int><dbl>
AfghanistanAsia 195228.8018425333779.4453
Rwanda Africa199223.5997290203737.0686

We can susbet rows that meet two conditions, for example we can subset rows which are from the country Rwanda and whose year is > 1979. Note we use the comma , to separate the two conditions:

filter(gapminder, country == "Rwanda", year > 1979)
A tibble: 6 ร— 6
countrycontinentyearlifeExppopgdpPercap
<fct><fct><int><dbl><int><dbl>
RwandaAfrica198246.2185507565881.5706
RwandaAfrica198744.0206349365847.9912
RwandaAfrica199223.5997290203737.0686
RwandaAfrica199736.0877212583589.9445
RwandaAfrica200243.4137852401785.6538
RwandaAfrica200746.2428860588863.0885

to filter for multiple conditions that need not co-occur, use |:

filter(gapminder, lifeExp > 80 | year == 2007)
A tibble: 150 ร— 6
countrycontinentyearlifeExppopgdpPercap
<fct><fct><int><dbl><int><dbl>
AfghanistanAsia 200743.82831889923 974.5803
Albania Europe 200776.423 3600523 5937.0295
Algeria Africa 200772.30133333216 6223.3675
Angola Africa 200742.73112420476 4797.2313
Argentina Americas200775.3204030192712779.3796
โ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎ
Vietnam Asia 200774.249852623562441.5764
West Bank and GazaAsia 200773.422 40183323025.3498
Yemen, Rep. Asia 200762.698222117432280.7699
Zambia Africa200742.384117460351271.2116
Zimbabwe Africa200743.48712311143 469.7093

Finally, we can use the %in% operator to subset rows which have values that match a value from several possible values:

filter(gapminder, country %in% c("Mexico", "United States", "Canada"))
A tibble: 36 ร— 6
countrycontinentyearlifeExppopgdpPercap
<fct><fct><int><dbl><int><dbl>
CanadaAmericas195268.751478558411367.16
CanadaAmericas195769.961701015412489.95
CanadaAmericas196271.301898584913462.49
CanadaAmericas196772.132081976716076.59
CanadaAmericas197272.882228450018970.57
โ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎ
United StatesAmericas198775.02024280353329884.35
United StatesAmericas199276.09025689418932003.93
United StatesAmericas199776.81027291176035767.43
United StatesAmericas200277.31028767552639097.10
United StatesAmericas200778.24230113994742951.65

Combining functions with the pipe |> operator#

In R, we often have to call multiple functions in a sequence to process a data frame. The basic ways of doing this can become quickly unreadable if there are many steps. For example, suppose we need to perform three operations on a data frame data:

  1. add a new column new_col that is double another old_col

  2. filter for rows where another column, other_col, is more than 5, and

  3. select only the new column new_col for those rows.

One way of doing is to just write multiple lines of code, storing temporary objects as you go:

output_1 <- mutate(data, new_col = old_col * 2)
output_2 <- filter(output_1, other_col > 5)
output <- select(output_2, new_col)

This is difficult to understand for multiple reasons. The reader may be tricked into thinking the named output_1 and output_2 objects are important for some reason, while they are just temporary intermediate computations. Further, the reader has to look through and find where output_1 and output_2 are used in each subsequent line.

Another option for doing this would be to compose the functions:

output <- select(filter(mutate(data, new_col = old_col * 2), other_col > 5), new_col)

Code like this can also be difficult to understand. Functions compose (reading from left to right) in the opposite order in which they are computed by R (above, mutate happens first, then filter, then select). It is also just a really long line of code to read in one go.

The pipe operator |> solves this problem, resulting in cleaner and easier-to-follow code. |> in built into R so you donโ€™t need to load any packages to use it. The code below accomplishes the same thing as the previous two code blocks:

output <- data |>
mutate(new_col = old_col * 2) |>
filter(other_col > 5) |>
select(new_col)

The pipe operator takes the thing on the left-hand-side and pipes it into the function call on the right-hand-side โ€“ literally, drops it in as the first argument.

  • this year (2021) R created |> as a built-in pipe operator!!!

  • it was inspired from the pipe from the magrittr package %>%; which is imported by the tidyverse (and dplyr) package

Referring to the 1948 painting La Trahison des images by Rene Magritte

So now we also have:

base R pipe hex

Logo by @LuisDVerde

Source: https://twitter.com/LuisDVerde/status/1430905603405144065

Which to use? Most of the time there will be no difference. I will be adopting |> in my notes, homeworks, etc however, some videos from past times will use %>% and you will still see |> in other peopleโ€™s code.

I am choosing |> because I can then use piping in R packages without depending upon the magrittr package.

So instead of using intermediate objects to combine two functions (e.g., select & filter) like this:

gap_under_29 <- filter(gapminder, lifeExp < 29)
select(gap_under_29, country, year)
A tibble: 2 ร— 2
countryyear
<fct><int>
Afghanistan1952
Rwanda 1992

And instead of composing functions in a hard to read manner:

select(filter(gapminder, lifeExp < 29), country, year)
A tibble: 2 ร— 2
countryyear
<fct><int>
Afghanistan1952
Rwanda 1992

We can use the pipe |> to create easy to read piplines to connect our function calls:

filter(gapminder, lifeExp < 29) |>
    select(country, year)
A tibble: 2 ร— 2
countryyear
<fct><int>
Afghanistan1952
Rwanda 1992

We can even start out our pipelines with an R data frame!

gapminder |>
    filter(lifeExp < 29) |>
    select(country, year)
A tibble: 2 ร— 2
countryyear
<fct><int>
Afghanistan1952
Rwanda 1992

We will embrace the pipe going forward in MDS!

Pro-tip for creating objects using the pipe#

Donโ€™t start writing your code by assigning to an object - itโ€™s hard to debug and find the errors:

new_df <- gapminder |> 
    filter(country == "Cabodia") |>
    select(Year, lifeExp) 
Error in .f(.x[[i]], ...): object 'Year' not found
Traceback:

1. gapminder |> filter(country == "Cabodia") |> select(Year, lifeExp)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
5. `_fseq`(`_lhs`)
6. freduce(value, `_function_list`)
7. withVisible(function_list[[k]](value))
8. function_list[[k]](value)
9. select(., Year, lifeExp)
10. select.data.frame(., Year, lifeExp)
11. tidyselect::vars_select(tbl_vars(.data), !!!enquos(...))
12. vars_select_eval(.vars, quos)
13. map_if(quos, !is_helper, eval_tidy, mask)
14. map(.x[sel], .f, ...)
15. .f(.x[[i]], ...)

Instead, create your pipeline, just dumping it to the screen:

gapminder |> 
    filter(country == "Cambodia") |>
    select(year, lifeExp)
A tibble: 12 ร— 2
yearlifeExp
<int><dbl>
195239.417
195741.366
196243.415
196745.415
197240.317
โ‹ฎโ‹ฎ
198753.914
199255.803
199756.534
200256.752
200759.723

Then, once you know it works, assign it to an object:

new_df <- gapminder |>
    filter(country == "Cambodia") |>
    select(year, lifeExp) 

Use mutate to add new variables#

Imagine we wanted to recover each countryโ€™s GDP. After all, the Gapminder data has a variable for population and GDP per capita. Letโ€™s multiply them together.

mutate is a function that defines and inserts new variables into a tibble. You can refer to existing variables by name.

gapminder |> 
    mutate(tot_gdp = pop * gdpPercap)
A tibble: 1704 ร— 7
countrycontinentyearlifeExppopgdpPercaptot_gdp
<fct><fct><int><dbl><int><dbl><dbl>
AfghanistanAsia195228.801 8425333779.44536567086330
AfghanistanAsia195730.332 9240934820.85307585448670
AfghanistanAsia196231.99710267083853.10078758855797
AfghanistanAsia196734.02011537966836.19719648014150
AfghanistanAsia197236.08813079460739.98119678553274
โ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎ
ZimbabweAfrica198762.351 9216418706.15736508240905
ZimbabweAfrica199260.37710704340693.42087422611852
ZimbabweAfrica199746.80911404948792.45009037850590
ZimbabweAfrica200239.98911926563672.03868015110972
ZimbabweAfrica200743.48712311143469.70935782658337

To mutate more than one column, use commas to separate the mutation for each column:

gapminder |>
    mutate(tot_gdp = pop * gdpPercap,
          pop_thousands = pop / 1000)
A tibble: 1704 ร— 8
countrycontinentyearlifeExppopgdpPercaptot_gdppop_thousands
<fct><fct><int><dbl><int><dbl><dbl><dbl>
AfghanistanAsia195228.801 8425333779.44536567086330 8425.333
AfghanistanAsia195730.332 9240934820.85307585448670 9240.934
AfghanistanAsia196231.99710267083853.1007875885579710267.083
AfghanistanAsia196734.02011537966836.1971964801415011537.966
AfghanistanAsia197236.08813079460739.9811967855327413079.460
โ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎ
ZimbabweAfrica198762.351 9216418706.15736508240905 9216.418
ZimbabweAfrica199260.37710704340693.4208742261185210704.340
ZimbabweAfrica199746.80911404948792.4500903785059011404.948
ZimbabweAfrica200239.98911926563672.0386801511097211926.563
ZimbabweAfrica200743.48712311143469.7093578265833712311.143

We can even change a column in place:

gapminder |>
    mutate(lifeExp = round(lifeExp, 0))
A tibble: 1704 ร— 6
countrycontinentyearlifeExppopgdpPercap
<fct><fct><int><dbl><int><dbl>
AfghanistanAsia195229 8425333779.4453
AfghanistanAsia195730 9240934820.8530
AfghanistanAsia19623210267083853.1007
AfghanistanAsia19673411537966836.1971
AfghanistanAsia19723613079460739.9811
โ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎ
ZimbabweAfrica198762 9216418706.1573
ZimbabweAfrica19926010704340693.4208
ZimbabweAfrica19974711404948792.4500
ZimbabweAfrica20024011926563672.0386
ZimbabweAfrica20074312311143469.7093

Use arrange to sort a data frame#

Imagine we wanted to know the country (and year) with the shortest life expectancy? How can we do this?

arrange is a function that sorts data frames based on a column, or columns. Letโ€™s use it below to answer the questiom we just posed above!

gapminder |>
    arrange(lifeExp)
A tibble: 1704 ร— 6
countrycontinentyearlifeExppopgdpPercap
<fct><fct><int><dbl><int><dbl>
Rwanda Africa199223.5997290203 737.0686
Afghanistan Asia 195228.8018425333 779.4453
Gambia Africa195230.000 284320 485.2307
Angola Africa195230.01542320953520.6103
Sierra LeoneAfrica195230.3312143249 879.7877
โ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎ
Switzerland Europe200781.701 755466137506.42
Iceland Europe200781.757 30193136180.79
Japan Asia 200282.00012706584128604.59
Hong Kong, ChinaAsia 200782.208 698041239724.98
Japan Asia 200782.60312746797231656.07

What about the hightest life expectancy? We can pair the arrange function, with another function, called desc to sort the data frame in descending order:

gapminder |>
    arrange(desc(lifeExp))
A tibble: 1704 ร— 6
countrycontinentyearlifeExppopgdpPercap
<fct><fct><int><dbl><int><dbl>
Japan Asia 200782.60312746797231656.07
Hong Kong, ChinaAsia 200782.208 698041239724.98
Japan Asia 200282.00012706584128604.59
Iceland Europe200781.757 30193136180.79
Switzerland Europe200781.701 755466137506.42
โ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎโ‹ฎ
Sierra LeoneAfrica195230.3312143249 879.7877
Angola Africa195230.01542320953520.6103
Gambia Africa195230.000 284320 485.2307
Afghanistan Asia 195228.8018425333 779.4453
Rwanda Africa199223.5997290203 737.0686

Getting a single value from a data frame#

What if you want the name of the country with the longest life expectancy back as just a single character vector, and not a data frame? How do we do this?

We need two additional {dplyr} functions to get this job done using the tidyverse, slice and pull. slice allows you to numerically index rows in R, returning the row number(s) specified. pull removes the data frame structure, returning the data as the next simpler data structure in R, a vector.

Letโ€™s use slice and pull to now extract the name of the country with the longest life expectancy:

gapminder |> 
    arrange(desc(lifeExp)) |>
    slice(1) |>
    pull(lifeExp)
82.603

We just learned A LOT of new functions and operators:

  • select

  • filter

  • mutate

  • arrange

  • desc

  • slice

  • pull

  • |>

  • %in%

Trust that with practice (and you will get a lot of practice by completing the worksheet and lab homeworks) you will learn these. Also, when needed, you can refer to the data transformation cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

Tidy data#

Shameless borrowing of slides from Jenny Bryan#

https://www.slideshare.net/Plotly/plotcon-nyc-behind-every-great-plot-theres-a-great-deal-of-wrangling

../../_images/whisperer.png ../../_images/on_balcony.png ../../_images/tame.png

How should you wrangle your data?#

We make it โ€œtidyโ€!#

What is tidy data?#

A tidy data is one that is satified by these three criteria:

  • each row is a single observation,

  • each variable is a single column, and

  • each value is a single cell (i.e., its row, column position in the data frame is not shared with another value)

What is a variable and an observation may depend on your immediate goal.

Source: R for Data Science by Garrett Grolemund & Hadley Wickham

A tale of 4 data tablesโ€ฆ#

โ€ฆhere is the same data represented in 4 different ways, letโ€™s vote on which are tidy

Example source: https://garrettgman.github.io/tidying/

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

country

year

cases_per_capita

Afghanistan

1999

745/19987071

Afghanistan

2000

2666/20595360

Brazil

1999

37737/172006362

Brazil

2000

80488/174504898

China

1999

212258/1272915272

China

2000

213766/1280428583

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

country

cases (year=1999)

cases (year=2000)

Afghanistan

745

2666

Brazil

37737

80488

China

212258

213766

country

population (year=1999)

population (year=2000)

Afghanistan

19987071

20595360

Brazil

172006362

174504898

China

1272915272

1280428583

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

country

year

cases

population

Afghanistan

1999

745

19987071

Afghanistan

2000

2666

20595360

Brazil

1999

37737

172006362

Brazil

2000

80488

174504898

China

1999

212258

1272915272

China

2000

213766

1280428583

Statistical question: What variables are associated with the number of TB cases?

This data is tidy, true or false?

country

year

key

value

Afghanistan

1999

cases

745

Afghanistan

1999

population

19987071

Afghanistan

2000

cases

2666

Afghanistan

2000

population

20595360

Brazil

1999

cases

37737

Brazil

1999

population

172006362

Brazil

2000

cases

80488

Brazil

2000

population

174504898

China

1999

cases

212258

China

1999

population

1272915272

China

2000

cases

213766

China

2000

population

1280428583

Pivoting longer#

pivot_longer from the {tidyr} package takes a wide data frame and making it longer! {tidyr} is another package in the {tidyverse} metapackage.

Source: April Hillโ€™s teachthat GitHub repository

Consider the data frame below and that we are interested in finding what variables are associated with the number of TB cases?

table4a
A tibble: 3 ร— 3
country19992000
<chr><int><int>
1Afghanistan 745 2666
2Brazil 37737 80488
3China 212258213766

This is currently difficult as the values for the variable year are stuck in column names. We can use the pivot_longer function to tidy this data:

table4a |>
    pivot_longer(`1999`:`2000`, names_to = "year", values_to = "cases")
A tibble: 6 ร— 3
countryyearcases
<chr><chr><int>
Afghanistan1999 745
Afghanistan2000 2666
Brazil 1999 37737
Brazil 2000 80488
China 1999212258
China 2000213766
# or a less verbose and efficient way to specify this:
table4a |>
    pivot_longer(-country, names_to = "year", values_to = "cases")
A tibble: 6 ร— 3
countryyearcases
<chr><chr><int>
Afghanistan1999 745
Afghanistan2000 2666
Brazil 1999 37737
Brazil 2000 80488
China 1999212258
China 2000213766

To use pivot_longer we need to specify:

  1. the dataset

  2. The columns we want to apply our pivot operation to (we use select-like syntax for this, and if you want to operate on all columns in a data frame, you can provide the function everything() here.

  3. the name of a new column that will be created, whose values will come from the names of the columns that we want to combine (the result argument)

  4. the name of a new column that will be created, whose values will come from the values of the columns we want to combine (the value argument)

  5. the names of the columns that we want to combine (we list these after specifying the key and value, and separate the column names with commas)

Pivoting wider#

pivot_wider from the {tidyr} package takes a narrow data frame and making it wider! It is the opposite of pivot_longer!

Consider the data frame below and that we are interested in finding what variables are associated with the number of TB cases?

table2
A tibble: 12 ร— 4
countryyeartypecount
<chr><int><chr><int>
Afghanistan1999cases 745
Afghanistan1999population19987071
Afghanistan2000cases 2666
Afghanistan2000population20595360
Brazil 1999cases 37737
โ‹ฎโ‹ฎโ‹ฎโ‹ฎ
Brazil2000population 174504898
China 1999cases 212258
China 1999population1272915272
China 2000cases 213766
China 2000population1280428583

This is currently difficult as observation is scattered across multiple rows. We can tidy this data frame with the function pivot_wider.

table2 |>
    pivot_wider(names_from = type, values_from = count)
A tibble: 6 ร— 4
countryyearcasespopulation
<chr><int><int><int>
Afghanistan1999 745 19987071
Afghanistan2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 19992122581272915272
China 20002137661280428583

To use pivot_wider we need to specify:

  1. the dataset

  2. the name of the column whose values we would like to use as column names when we widen the data

  3. the name of the column whose values you would like to spread into separate columns based on the value they had in the column specified as the key

https://www.tidyverse.org/

The tidy tools manifesto#

โ€œThe goal of these principles is to provide a uniform interface so that tidyverse packages work together naturally, and once youโ€™ve mastered one, you have a head start on mastering the others.โ€

https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html

The tidy tools manifesto#

There are four basic principles to a tidy API:

  1. Reuse existing data structures.

  2. Compose simple functions with the pipe (|>).

  3. Embrace functional programming.

  4. Design for humans.

Reuse existing data structures.#

  • Where possible, re-use existing data structures, rather than creating custom data structures for your own package

  • When it is not possible to use an existing data structure, use S3 class (simplest OO system in R) built on top of an atomic vector or list

  • When working on data.frames/tibbles, assume the data is tidy

Compose simple functions with the pipe (|>)#

  • A powerful strategy for solving complex problems is to combine many simple pieces. Each piece should be easily understood in isolation, and have a standard way to combine with other pieces.

  • In R, this strategy plays out by composing single functions with the pipe, |>.

  • The pipe is a common composition tool that works across all packages.

Embrace functional programming#

R is a functional programming language; embrace it, donโ€™t fight it. If youโ€™re familiar with an object-oriented language like Python or C#, this is going to take some adjustment. But in the long run you will be much better off working with the language rather than fighting it.

Generally, this means you should favour:

  • Immutable objects and copy-on-modify semantics. This makes your code easier to reason about.

  • The generic functions provided by S3 and S4. These work very naturally inside a pipe. If you do need mutable state, try to make it an internal implementation detail, rather than exposing it to the user.

  • Tools that abstract over for-loops, like the apply family of functions or the map functions in purrr.

Design for humans#

Design your API primarily so that it is easy to use by humans. Computer efficiency is a secondary concern because the bottleneck in most data analysis is thinking time, not computing time.

  • Invest time in naming your functions. Evocative function names make your API easier to use and remember.

  • Favour explicit, lengthy names, over short, implicit, names. Save the shortest names for the most important operations.

  • Think about how autocomplete can also make an API thatโ€™s easy to write. Make sure that function families are identified by a common prefix, not a common suffix. This makes autocomplete more helpful, as you can jog your memory with the prompts. For smaller packages, this may mean that every function has a common prefix (e.g. stringr, xml2, rvest).

What did we learn today?#

  • How to read and write data using R

  • Single table functions from {dplyr} to manipulate data frames

  • A new operator, the pipe |> for combining function calls

  • What tidy data is, and how to use the {tidyr} functions to get our data tidy

  • That R is awesome for manipulating data!

Attributions#

  • These notes were prepared by Dr. Tiffany Timbers as part of DSCI 523 at UBC-Vancouver

  • MUCH of these notes were derived from Stat 545 created by Jenny Bryan

  • R for Data Science by Garrett Grolemund & Hadley Wickham