# Lecture 3 -  dates & times, strings,  as well as factors

### Lecture learning objectives:
By the end of this lecture and worksheet 3, students should be able to:

* Manipulate dates and times using the {lubridate} package
* Be able to modify strings in a data frame using regular expressions and the {stringr} package
* Cast categorical columns in a data frame as factors when appropriate, and manipulate factor levels as needed in preparation for data visualisation and statistical analysis (using base R and {forcats} package functions)


In [1]:
library(tidyverse)
library(gapminder)
options(repr.matrix.max.rows = 10)

── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.1 ──

[32m✔[39m [34mggplot2[39m 3.3.5     [32m✔[39m [34mpurrr  [39m 0.3.4
[32m✔[39m [34mtibble [39m 3.1.6     [32m✔[39m [34mdplyr  [39m 1.0.7
[32m✔[39m [34mtidyr  [39m 1.2.0     [32m✔[39m [34mstringr[39m 1.4.0
[32m✔[39m [34mreadr  [39m 2.0.1     [32m✔[39m [34mforcats[39m 0.5.1

── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()



## Tibbles versus data frames

You have seen and heard me talk about data frames and tibbles in R, and sometimes I carelessly interchange the terms. Let's take moment to discuss the difference between the two!

Tibbles are special data frames, with special/extra properties. The two important ones you will care about are:

- In RStudio, tibbles only output the first 10 rows

- When you numerically subset a data frame to 1 column, you get a vector. However, when you numerically subset a tibble you still get a tibble back.

When you create a data frame using base R functions, either via `data.frame` or one of the base R `read.*` functions, you get an objects whose class is data frame:

In [2]:
example <- data.frame(a = c(1, 5, 9), b = "z", "a", "t")
example

class(example)

a,b,X.a.,X.t.
<dbl>,<chr>,<chr>,<chr>
1,z,a,t
5,z,a,t
9,z,a,t


Tibbles inherit from the data frame class (meaning that have many of the same properties as data frames), but they also have the extra properties I just discussed:

In [3]:
example2 <- tibble(a = c(1, 5, 9), b = "z", "a", "t")
example

class(example2)

a,b,X.a.,X.t.
<dbl>,<chr>,<chr>,<chr>
1,z,a,t
5,z,a,t
9,z,a,t


Note: there are **some** tidyverse functions that will coerce a data frame to a tibble, because what the user is asking for is not possible with a data frame. One such example is `group_by` (which we will learn about next week):

In [4]:
group_by(example, a) %>% 
    class()

**Rule of thumb:**  if you want a tibble, it’s on you to know that and express that explicitly with `as_tibble()` (if it’s a data frame to start out with).

## Dates and times

<img src="https://d33wubrfki0l68.cloudfront.net/baa19d0ebf9b97949a7ad259b29a1c4ae031c8e2/8e9b8/diagrams/vectors/summary-tree-s3-1.png" width=300>

*Source: [Advanced R](https://adv-r.hadley.nz/) by Hadley Wickham*

### Working with dates and times

Your weapon: The lubridate package (CRAN; [GitHub](https://github.com/tidyverse/lubridate); [main vignette](https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html)).

In [5]:
library(lubridate)


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




### Get your hands on some dates or date-times

Use `lubridate`’s `today` to get today’s date, without any time:

In [6]:
today()

In [7]:
class(today())

Use `lubridate`'s `now` to get RIGHT NOW, meaning the date and the time:

In [8]:
now()

[1] "2022-02-01 09:49:35 PST"

In [9]:
class(now())

### Get date or date-time from character

Use the `lubridate` helpers to convert character or unquoted numbers into dates or date-times:

In [10]:
ymd("2017-01-31")

In [11]:
mdy("January 31st, 2017")

In [12]:
dmy("31-Jan-2017")

In [13]:
ymd(20170131)

In [14]:
ymd_hms("2017-01-31 20:11:59")

[1] "2017-01-31 20:11:59 UTC"

In [15]:
mdy_hm("01/31/2017 08:01")

[1] "2017-01-31 08:01:00 UTC"

You can also force the creation of a date-time from a date by supplying a timezone:

In [16]:
class(ymd(20170131, tz = "UTC"))

### Build date or date-time from parts

Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. 

In [17]:
dates <- tibble(year = c(2015, 2016, 2017, 2018, 2019),
               month = c(9, 9, 9, 9, 9),
               day = c(3, 4, 2, 6, 3))

dates

year,month,day
<dbl>,<dbl>,<dbl>
2015,9,3
2016,9,4
2017,9,2
2018,9,6
2019,9,3


To create a date/time from this sort of input, use `make_date` for dates, or `make_datetime` for date-times:

In [18]:
# make a single date from year, month and day
dates %>% 
    mutate(date = make_date(year, month, day))

year,month,day,date
<dbl>,<dbl>,<dbl>,<date>
2015,9,3,2015-09-03
2016,9,4,2016-09-04
2017,9,2,2017-09-02
2018,9,6,2018-09-06
2019,9,3,2019-09-03


### Getting components from a date or date-time

Sometimes you have the date or date-time and you want to extract a component, such as year or day.

In [19]:
datetime <- ymd_hms("2016-07-08 12:34:56")
datetime

[1] "2016-07-08 12:34:56 UTC"

In [20]:
year(datetime)

In [21]:
month(datetime)

In [22]:
mday(datetime)

In [23]:
yday(datetime)

In [24]:
wday(datetime, label = TRUE, abbr = FALSE)

For `month` and `wday` you can set `label = TRUE` to return the abbreviated name of the month or day of the week. Set `abbr = FALSE` to return the full name.

### More date and time wrangling possibilities:

- [Tools for working with time spans](https://r4ds.had.co.nz/dates-and-times.html#time-spans), to calculate durations, periods and intervals
- [Tools for dealing with time zones](https://r4ds.had.co.nz/dates-and-times.html#time-zones)

## String manipulations

#### The recommended tools:

- **[stringr](https://stringr.tidyverse.org/) package:** A core package in the tidyverse. Main functions start with str_. Auto-complete is your friend. [`stringr` cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/strings.pdf)

- **[tidyr](https://tidyr.tidyverse.org/) package:** Useful for functions that split one character vector into many and vice versa: `separate`, `unite`, `extract`. [`tidyr` cheatsheet](https://github.com/rstudio/cheatsheets/blob/master/data-import.pdf)

- **Base functions:** `nchar`, `strsplit`, `substr`, `paste`, and `paste0`.

- The [glue package](https://glue.tidyverse.org/) is fantastic for string interpolation. If stringr::str_interp() doesn’t get your job done, check out the glue package.

#### Regex-free string manipulation with stringr and tidyr

Basic string manipulation tasks:

- Study a single character vector
    - How long are the strings?
    - Presence/absence of a literal string
- Operate on a single character vector
    - Keep/discard elements that contain a literal string
    - Split into two or more character vectors using a fixed delimiter
    - Snip out pieces of the strings based on character position
    - Collapse into a single string
- Operate on two or more character vectors
    - Glue them together element-wise to get a new character vector.

`fruit`, `words`, and `sentences` are character vectors that ship with `stringr` for practicing.

NOTE - we will be working with vectors today. If you want to operate on data frames, you will need to use these functions inside of data frame/tibble functions, like `filter` and `mutate`.

#### Detect or filter on a target string

Determine presence/absence of a literal string with `str_detect`. Spoiler: later we see `str_detect` also detects regular expressions.

Which fruits actually use the word “fruit”?

In [25]:
# detect "fruit"
#typeof(fruit)
#fruit
str_detect(fruit, "fruit")

What’s the easiest way to get the actual fruits that match? Use `str_subset` to keep only the matching elements. Note we are storing this new vector `my_fruit` to use in later examples!

In [26]:
# subset "fruit"
my_fruit <- str_subset(fruit, "fruit")
my_fruit

#### String splitting by delimiter

Use `stringr::str_split` to split strings on a delimiter. 

Some of our fruits are compound words, like “grapefruit”, but some have two words, like “ugli fruit”. Here we split on a single space " ", but show use of a regular expression later.

In [27]:
# split on " "
str_split(my_fruit, " ")

It’s bummer that we get a list back. But it must be so! In full generality, split strings must return list, because who knows how many pieces there will be?

If you are willing to commit to the number of pieces, you can use `str_split_fixed` and get a character matrix. You’re welcome!

In [28]:
str_split_fixed(my_fruit, pattern = " ", n = 2)

0,1
breadfruit,
dragonfruit,
grapefruit,
jackfruit,
kiwi,fruit
passionfruit,
star,fruit
ugli,fruit


If the to-be-split variable lives in a data frame, `tidyr::separate` will split it into 2 or more variables:

In [29]:
# separate on " "
my_fruit[5] <- "yellow kiwi fruit"
my_fruit
tibble(unsplit = my_fruit) %>% 
    separate(unsplit, into = c("pre", "post"), sep = " ")

“Expected 2 pieces. Additional pieces discarded in 1 rows [5].”
“Expected 2 pieces. Missing pieces filled with `NA` in 5 rows [1, 2, 3, 4, 6].”


pre,post
<chr>,<chr>
breadfruit,
dragonfruit,
grapefruit,
jackfruit,
yellow,kiwi
passionfruit,
star,fruit
ugli,fruit


#### Substring extraction (and replacement) by position

Count characters in your strings with `str_length`. *Note this is different from the length of the character vector itself.*

In [30]:
# get length of each string
str_length(my_fruit)
my_fruit

You can snip out substrings based on character position with `str_sub`.

In [31]:
# remove first three strings
str_sub(my_fruit, 1, 3)

Finally, `str_sub` also works for assignment, i.e. on the left hand side of `<-`

In [32]:
# replace three characters with AAA
str_sub(my_fruit, 1, 3)  <- "AAA"
my_fruit

#### Collapse a vector

You can collapse a character vector of length n > 1 to a single string with `str_c`, which also has other uses (see the next section).

In [33]:
# collapse a character vector into one 
head(fruit) %>% 
    str_c(collapse = "-")

#### Create a character vector by catenating multiple vectors

If you have two or more character vectors of the same length, you can glue them together element-wise, to get a new vector of that length. Here are some … awful smoothie flavors?

In [34]:
# concatenate character vectors
fruit[1:4]
fruit[5:8]
str_c(fruit[1:4], fruit[5:8], sep = "")

If the to-be-combined vectors are variables in a data frame, you can use `tidyr::unite` to make a single new variable from them.

In [35]:
# concatenate character vectors when they are in a data frame
tibble(fruit1 = fruit[1:4],
      fruit2 = fruit[5:8]) %>% 
    unite("flavour_combo", fruit1, fruit2, sep = " & ")

flavour_combo
<chr>
apple & bell pepper
apricot & bilberry
avocado & blackberry
banana & blackcurrant


####  Substring replacement

You can replace a pattern with `str_replace`. Here we use an explicit string-to-replace, but later we revisit with a regular expression.

In [36]:
# replace fruit with vegetable
my_fruit <- str_subset(fruit, "fruit")
my_fruit
str_replace(my_fruit, "fruit", "vegetable")

- A special case that comes up a lot is replacing `NA`, for which there is `str_replace_na`.

- If the NA-afflicted variable lives in a data frame, you can use `tidyr::replace_na`.

#### Other `str_*` functions?

There are many many other useful `str_*` functions from the `stringr` package. Too many to go through them all here. If these shown in lecture aren't what you need, then you should try `?str` + tab to see the possibilities:

In [37]:
?str_

#### Regular expressions with stringr

or...

![](https://stat545.com/img/regexbytrialanderror-big-smaller.png)

#### Examples with gapminder

In [38]:
library(gapminder)
head(gapminder)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134


#### Filtering rows with `str_detect`

Let's filter for rows where the country name starts with "AL":

In [39]:
library(tidyverse)
library(gapminder)

In [40]:
# detect countries that start with "AL"
gapminder %>% 
    filter(str_detect(country, "^Al")) %>% 
    pull(country) %>% 
    unique() %>% 
    length()

And now rows where the country ends in `tan`:

In [41]:
# detect countries that end with "tan"
gapminder %>% 
    filter(str_detect(country, "tan$"))

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.8530
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.020,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
⋮,⋮,⋮,⋮,⋮,⋮
Pakistan,Asia,1987,58.245,105186881,1704.687
Pakistan,Asia,1992,60.838,120065004,1971.829
Pakistan,Asia,1997,61.818,135564834,2049.351
Pakistan,Asia,2002,63.610,153403524,2092.712


Or countries containing ", Dem. Rep." :

In [42]:
# detect countries that contain ", Dem. Rep." 
gapminder %>% 
    filter(str_detect(country, "\\, Dem. Rep."))

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
"Congo, Dem. Rep.",Africa,1952,39.143,14100005,780.5423
"Congo, Dem. Rep.",Africa,1957,40.652,15577932,905.8602
"Congo, Dem. Rep.",Africa,1962,42.122,17486434,896.3146
"Congo, Dem. Rep.",Africa,1967,44.056,19941073,861.5932
"Congo, Dem. Rep.",Africa,1972,45.989,23007669,904.8961
⋮,⋮,⋮,⋮,⋮,⋮
"Korea, Dem. Rep.",Asia,1987,70.647,19067554,4106.492
"Korea, Dem. Rep.",Asia,1992,69.978,20711375,3726.064
"Korea, Dem. Rep.",Asia,1997,67.727,21585105,1690.757
"Korea, Dem. Rep.",Asia,2002,66.662,22215365,1646.758


Replace ", Dem. Rep." with " Democratic Republic":

In [43]:
# replace ", Dem. Rep." with " Democratic Republic"
gapminder %>% 
    mutate(country = str_replace(country,
                                "\\, Dem. Rep.",
                                " Democratic Republic")) %>%
    filter(country == "Korea Democratic Republic")

country,continent,year,lifeExp,pop,gdpPercap
<chr>,<fct>,<int>,<dbl>,<int>,<dbl>
Korea Democratic Republic,Asia,1952,50.056,8865488,1088.278
Korea Democratic Republic,Asia,1957,54.081,9411381,1571.135
Korea Democratic Republic,Asia,1962,56.656,10917494,1621.694
Korea Democratic Republic,Asia,1967,59.942,12617009,2143.541
Korea Democratic Republic,Asia,1972,63.983,14781241,3701.622
⋮,⋮,⋮,⋮,⋮,⋮
Korea Democratic Republic,Asia,1987,70.647,19067554,4106.492
Korea Democratic Republic,Asia,1992,69.978,20711375,3726.064
Korea Democratic Republic,Asia,1997,67.727,21585105,1690.757
Korea Democratic Republic,Asia,2002,66.662,22215365,1646.758


### Extract matches

To extract the actual text of a match, use `str_extract`. 

We’re going to need a more complicated example, the Harvard `sentences` from the `stringr` package: 

In [44]:
typeof(sentences)
head(sentences)

Say we want to extract all of the colours used in the sentences. We can do this by creating a pattern which would match them, and passing that and our vector of sentences to `str_extract`:

In [45]:
colours <- "red|orange|yellow|green|blue|purple"

In [46]:
# extract colours used in sentences
str_extract(sentences, colours)

`str_extract` only returns the first match for each element. To return all matches from an element we need to use `str_extract_all`:

In [47]:
# extract all colours used in sentences
str_extract_all(sentences, colours)

Note: `str_extract` returns a character vector, whereas `str_extract_all` returns a litst. This is because when asking for multiple matches back, you do not know how many you will get, and thus we cannot expect a rectangular shape.

### Capture groups

You can also use parentheses to extract parts of a complex match. 

For example, imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the”. Defining a “word” in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn’t a space.

In [48]:
# extract nouns from sentences
noun <- "(a|the) ([^ ]+)"

str_match(sentences, noun) %>% 
    head()

0,1,2
the smooth,the,smooth
the sheet,the,sheet
the depth,the,depth
a chicken,a,chicken
,,
,,


Like `str_extract`, if you want all matches for each string, you’ll need `str_match_all`

### Summary of string manipulation functions covered so far:

| function | description |
|----------|-------------|
| `str_detect` | Detects elements in a vector that match a pattern, returns a vector of logicals |
| `srt_subset` | Detects and returns elements in a vector that match a pattern |
| `str_split` | Split strings in a vector on a delimiter. Returns a list (used `str_split_fixed` to get a matrix) |
| `separate` | Split character vectors from a data frame on a delimiter which get returned as additional columns in the data frame |
| `str_length` | Counts the number of characters for each element of a character vector, and returns a numeric vector of the counts |
| `str_sub` | Remove substrings based on character position |
| `str_c` | Collapse and/or concatenate elements from a character vector(s) |
| `unite` | Concatenate elements from character vectors from a data frame to create a single column |
| `str_replace` | Replace a pattern in a vector of character vectors with a given string |
| `str_extract` | Extract the actual text of a match from a character vector |
| `str_match` | Use capture groups to extract parts of a complex match from a character vector, returns the match and the capture groups as columns of a matrix |

## Factors

<img src="https://d33wubrfki0l68.cloudfront.net/baa19d0ebf9b97949a7ad259b29a1c4ae031c8e2/8e9b8/diagrams/vectors/summary-tree-s3-1.png" width=300>

*Source: [Advanced R](https://adv-r.hadley.nz/) by Hadley Wickham*

### Be the boss of your factors

- I love and hate factors

- I love them for data visualization and statistics because I do not need to make dummy variables 

- I hate them because if you are not careful, they fool you because they look like character vectors. And when you treat them like character vectors you get cryptic error messages, like we saw when we tried to do a conditional mutate on the gapminder data set

#### Tidyverse philosophy for factors

- Humans, not computers should decide which columns are factors

- Factors are not that useful until you are at the end of your data wrangling, before that you want character vectors so you can do string manipulations

- Tidyverse functions, like `tibble`, and `read_csv` give you columns with strings as character vectors, Base R functions like `data.frame` and `read.csv`

### Factor inspection

Get to know your factor before you start touching it! It’s polite. Let’s use `gapminder$continent` as our example.

In [49]:
str(gapminder$continent)

 Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...


In [50]:
levels(gapminder$continent)

In [51]:
nlevels(gapminder$continent)

In [52]:
class(gapminder$continent)

### Dropping unused levels

Just because you drop all the rows corresponding to a specific factor level, the levels of the factor itself do not change. Sometimes all these unused levels can come back to haunt you later, e.g., in figure legends.

Watch what happens to the levels of `country` when we filter Gapminder to a handful of countries:

In [53]:
nlevels(gapminder$country)

In [54]:
h_countries <- gapminder %>% 
    filter(country %in% c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela"))

nlevels(h_countries$country)

huh? Even though `h_gap` only has data for a handful of countries, we are still schlepping around all 142 levels from the original `gapminder` tibble.

How to get rid of them? We'll use the `forcats::fct_drop` function to do this:

In [55]:
h_countries$country %>% nlevels()

In [56]:
h_countries$country %>% 
    fct_drop() %>% 
    nlevels

### Change order of the levels, principled

By default, factor levels are ordered alphabetically. Which might as well be random, when you think about it! It is preferable to order the levels according to some principle:

- Frequency. Make the most common level the first and so on.
- Another variable. Order factor levels according to a summary statistic for another variable. Example: order Gapminder countries by life expectancy.

First, let’s order continent by frequency, forwards and backwards. This is often a great idea for tables and figures, esp. frequency barplots.

In [57]:
## default order is alphabetical
gapminder$continent %>% 
    levels()

Let's use `forcats::fct_infreq` to order by frequency:

In [58]:
gapminder$continent %>% 
    fct_infreq() %>% 
    levels()

gap2 <- gapminder %>% 
    mutate(continent = fct_infreq(continent))

gap2$continent %>% 
    levels()

Or reverse frequency:

In [59]:
gapminder$continent %>% 
  fct_infreq() %>%
  fct_rev() %>% 
  levels()

### Order one variable by another

You can use `forcats::fct_reorder` to order one variable by another.

The factor is the grouping variable and the default summarizing function is `median` but you can specify something else.

In [60]:
## order countries by median life expectancy
fct_reorder(gapminder$country, gapminder$lifeExp) %>% 
    levels() %>% 
    head()

Using `min` instead to reorder the factors:

In [61]:
## order accoring to minimum life exp instead of median
fct_reorder(gapminder$country, gapminder$lifeExp, min) %>% 
    levels() %>% 
    head()

### Change order of the levels, “because I said so”

Sometimes you just want to hoist one or more levels to the front. Why? Because I said so (sometimes really useful when creating visualizations).

Reminding ourselves of the level order for `gapminder$continent`:

In [62]:
gapminder$continent %>% levels()

Reorder and put Asia and Africa first:

In [63]:
gapminder$continent %>% 
    fct_relevel("Asia", "Africa")  %>% 
    levels()

#### Why do we need to know how to do this?

- Factor levels impact statistical analysis & data visualization!

- For example, these two barcharts of frequency by continent differ only in the order of the continents. Which do you prefer? Discuss with your neighbour.

![](img/factor_viz.png)

## What did we learn today?

- The differences between data frames and tibbles

- The beautiful {lubridate} package for working with dates and times

- Tools for manipulating and working with character data in the {stringr} and {tidyr} packages

- How to take control of our factors using {forcats} and how to investigate factors using base R functions

## Attributions
- [Stat 545](https://stat545.com/) created by Jenny Bryan
- [R for Data Science](https://r4ds.had.co.nz/index.html) by Garrett Grolemund & Hadley Wickham