add sesion 7 and 8

Verified Commit debf0f15 authored Sep 10, 2021 by Laurent Modolo
--- a/session_7/slides.Rmd
+++ b/session_7/slides.Rmd
+---
+title: '#7 String & RegExp'
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
+date: "08 Nov 2019"
+always_allow_html: yes
+output:
+  beamer_presentation:
+    theme: metropolis
+    slide_level: 3
+    fig_caption: no
+    df_print: tibble
+    highlight: tango
+    latex_engine: xelatex
+  slidy_presentation:
+    highlight: tango
+---
+```{r setup, include=FALSE, cache=TRUE}
+knitr::opts_chunk$set(echo = FALSE)
+library(tidyverse)
+```
+## String basics
+```
+string1 <- "This is a string"
+string2 <- 'If I want to include a "quote"
+inside a string, I use single quotes'
+```
+If you forget to close a quote, you’ll see +, the continuation character:
+```
+> "This is a string without a closing quote
++ 
++ 
++ HELP I'M STUCK
+```
+If this happens to you, press Escape and try again!
+## String basics
+To include a literal single or double quote in a string you can use \ to “escape” it:
+```
+double_quote <- "\"" # or '"'
+single_quote <- '\'' # or "'"
+```
+If you want to include a literal backslash, you’ll need to double it up: `"\\"`.
+## String basics
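+The printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`: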
+```
+x <- c("\"", "\\")
+x
+#> [1] "\"" "\\"
+writeLines(x)
+#> "
+#> \
+```
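As a small sketch of the same idea in base R (the variable names are illustrative only): `nchar()` counts the characters that are actually stored, so the escaped backslash counts once even though it is typed with two.

```r
# "\\" is typed with two backslashes but stores a single character.
backslash <- "\\"
n_chars <- nchar(backslash)
n_chars

# print() shows the escaped representation, writeLines() the raw contents.
print(backslash)
writeLines(backslash)
```

Comparing the last two calls side by side is usually the quickest way to check what a string really contains.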
+## String basics
+Special characters:
+The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`
+## String basics
+- String length
+```{r str_length, eval=T, message=FALSE, cache=T}
+str_length(c("a", "R for data science", NA))
+```
+- Combining strings
+```{r str_c, eval=T, message=FALSE, cache=T}
+str_c("x", "y", "z")
+```
+- Subsetting strings
+```{r str_sub, eval=T, message=FALSE, cache=T}
+x <- c("Apple", "Banana", "Pear")
+str_sub(x, 1, 3)
+```
+## String basics
+- Subsetting strings
+Negative numbers count backwards from the end:
+```{r str_sub2, eval=T, message=FALSE, cache=T}
+str_sub(x, -3, -1)
+```
+- Converting to lower case
+```{r str_to_lower, eval=T, message=FALSE, cache=T}
+str_to_lower(x)
+```
+- Sorting
+```{r str_sort, eval=T, message=FALSE, cache=T}
+str_sort(x)
+```
+## Matching patterns with regular expressions
+Regexps are a very terse language that allows you to describe patterns in strings.
+To learn regular expressions, we’ll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match.
+## Matching patterns with regular expressions
+```{r str_view, eval=T, message=FALSE, cache=T}
+x <- c("apple", "banana", "pear")
+str_view(x, "an")
+```
+The next step up in complexity is `.`, which matches any character (except a newline):
+```{r str_viewdot, eval=T, message=FALSE, cache=T}
+x <- c("apple", "banana", "pear")
+str_view(x, ".a.")
+```
+## Matching patterns with regular expressions
+But if “`.`” matches any character, how do you match the character “`.`”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match a `.`, you need the regexp `\.`. Unfortunately this creates a problem: we use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we need the string "`\\.`".
+## Matching patterns with regular expressions
+```{r str_viewdotescape, eval=T, message=FALSE, cache=T}
+dot <- "\\."
+writeLines(dot)
+str_view(c("abc", "a.c", "bef"), "a\\.c")
+```
+## Matching patterns with regular expressions
+If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write "`\\\\`": four backslashes to match one!
+## Matching patterns with regular expressions
+```{r str_viewbackslashescape, eval=T, message=FALSE, cache=T}
+x <- "a\\b"
+writeLines(x)
+str_view(x, "\\\\")
+```
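The two levels of escaping can also be checked with `str_detect()`. This is only a sketch on three made-up strings; the variable names are illustrative.

```r
library(stringr)

# The third element is the three characters a, \ and b.
x <- c("a.c", "abc", "a\\b")

# The string "\\." is the regexp \. and matches a literal dot.
dot_hits <- str_detect(x, "\\.")

# The string "\\\\" is the regexp \\ and matches a literal backslash.
backslash_hits <- str_detect(x, "\\\\")

dot_hits
backslash_hits
```

Only the first element contains a dot and only the last contains a backslash, so each pattern flags exactly one string.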
+## Exercises
+- Explain why each of these strings doesn’t match a `\`: "`\`", "`\\`", "`\\\`".
+- How would you match the sequence `"'\`?
+- What patterns will the regular expression `\..\..\..` match? How would you represent it as a string?
+## Anchors
+- `^` matches the start of the string.
+- `$` matches the end of the string.
+```{r str_viewanchors, eval=T, cache=T}
+x <- c("apple", "banana", "pear")
+str_view(x, "^a")
+```
+## Anchors
+- `^` matches the start of the string.
+- `$` matches the end of the string.
+```{r str_viewanchorsend, eval=T, cache=T}
+str_view(x, "a$")
+```
+## Anchors
+- `^` matches the start of the string.
+- `$` matches the end of the string.
+```{r str_viewanchorsstartend, eval=T, cache=T}
+x <- c("apple pie", "apple", "apple cake")
+str_view(x, "^apple$")
+```
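The effect of each anchor can be compared directly with `str_detect()`, which returns one logical value per string. A minimal sketch reusing the vector above:

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# Anchored at the start only: every string begins with "apple".
starts_with_apple <- str_detect(x, "^apple")

# Anchored at both ends: the whole string must be exactly "apple".
exactly_apple <- str_detect(x, "^apple$")

starts_with_apple
exactly_apple
```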
+## Exercises
+- How would you match the literal string `"$^$"`?
+- Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
+  - Start with “y”.
+  - End with “x”.
+  - Are exactly three letters long. (Don’t cheat by using `str_length()`!)
+  - Have seven letters or more.
+Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
+## Character classes and alternatives
+- `\d`: matches any digit.
+- `\s`: matches any whitespace (e.g. space, tab, newline).
+- `[abc]`: matches a, b, or c.
+- `[^abc]`: matches anything except a, b, or c.
+```
+str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
+str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
+str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
+```
+## Character classes and alternatives
+You can use alternation to pick between one or more alternative patterns. For example, `abc|d..f` will match either “abc” or “deaf”. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`. As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
+```
+str_view(c("grey", "gray"), "gr(e|a)y")
+```
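The precedence rule can be checked with `str_detect()`. This is only a sketch on four made-up strings; the anchors force the whole string to match, so the two groupings are easy to tell apart.

```r
library(stringr)

x <- c("abc", "xyz", "abcyz", "abxyz")

# Alternation between the full patterns abc and xyz.
either_word <- str_detect(x, "^(abc|xyz)$")

# Parentheses restrict the alternation to the single letter c or x.
inner_choice <- str_detect(x, "^ab(c|x)yz$")

either_word
inner_choice
```

The first pattern accepts only the first two strings, the second only the last two.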
+## Exercises
+Create regular expressions to find all words that:
+- Start with a vowel.
+- Contain only consonants. (Hint: think about matching “not”-vowels.)
+- End with ed, but not with eed.
+- End with ing or ise.
+## Repetition
+- `?`: 0 or 1
+- `+`: 1 or more
+- `*`: 0 or more
+```
+x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
+str_view(x, "CC?")
+str_view(x, "CC+")
+str_view(x, 'C[LX]+')
+```
+## Repetition
+You can also specify the number of matches precisely:
+- `{n}`: exactly n
+- `{n,}`: n or more
+- `{0,m}`: at most m (write the lower bound explicitly; stringr’s ICU engine does not accept `{,m}`)
+- `{n,m}`: between n and m
+```
+str_view(x, "C{2}")
+str_view(x, "C{2,}")
+str_view(x, "C{2,3}")
+```
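To see exactly which characters each quantifier consumes, `str_extract()` can return the matched text instead of highlighting it. A minimal sketch on the same string (the result names are illustrative):

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

# Exactly two C's: the match stops after the second C.
exactly_two <- str_extract(x, "C{2}")

# Two or more, and between two and three: both take all three C's,
# because quantifiers match as much as they can.
two_or_more <- str_extract(x, "C{2,}")
two_to_three <- str_extract(x, "C{2,3}")

# A C followed by one or more L or X characters.
c_then_lx <- str_extract(x, "C[LX]+")

c(exactly_two, two_or_more, two_to_three, c_then_lx)
```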
+## Exercises
+- Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
+  - `^.*$`
+  - `"\\{.+\\}"`
+  - `\d{4}-\d{2}-\d{2}`
+  - `"\\\\{4}"`
+- Create regular expressions to find all words that:
+  - Start with three consonants.
+  - Have three or more vowels in a row.
+  - Have two or more vowel-consonant pairs in a row.
+## Grouping
+You learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like `\1`, `\2` etc. 
+```
+str_view(fruit, "(..)\\1", match = TRUE)
+```
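A backreference can be tried on a handful of words before running it on the whole `fruit` vector. The three words below are chosen for illustration only.

```r
library(stringr)

x <- c("banana", "coconut", "apple")

# (..) captures any two characters; \\1 requires the same two again.
has_repeated_pair <- str_detect(x, "(..)\\1")
repeated_text <- str_extract(x, "(..)\\1")

has_repeated_pair
repeated_text
```

"banana" contains "anan" and "coconut" contains "coco"; "apple" repeats a single letter, not a pair, so it does not match.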
+## Exercises
+- Describe, in words, what these expressions will match:
+  - `(.)\1\1`
+  - `"(.)(.)\\2\\1"`
+  - `(..)\1`
+  - `"(.).\\1.\\1"`
+  - `"(.)(.)(.).*\\3\\2\\1"`
+- Construct regular expressions to match words that:
+  - Start and end with the same character.
+  - Contain a repeated pair of letters (e.g. `“church”` contains `“ch”` repeated twice.)
+  - Contain one letter repeated in at least three places (e.g. `“eleven”` contains three `“e”`s.)
+## Detect matches
+```
+x <- c("apple", "banana", "pear")
+str_detect(x, "e")
+```
+How many common words start with t?
+```
+sum(str_detect(words, "^t"))
+```
+What proportion of common words end with a vowel?
+```
+mean(str_detect(words, "[aeiou]$"))
+```
+## Combining detection
+Find all words containing at least one vowel, and negate
+```
+no_vowels_1 <- !str_detect(words, "[aeiou]")
+```
+Find all words consisting only of consonants (non-vowels)
+```
+no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
+identical(no_vowels_1, no_vowels_2)
+```
+## With tibble
+```{r str_detecttibble, eval=T, cache=T}
+df <- tibble(
+  word = words, 
+  i = seq_along(word)
+)
+df %>% 
+  filter(str_detect(word, "x$"))
+```
+## Extract matches
+```{r str_sentences, eval=T, cache=T}
+head(sentences)
+```
+We want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:
+```{r color_regex, eval=T, cache=T}
+colours <- c("red", "orange", "yellow", "green", "blue", "purple")
+colour_match <- str_c(colours, collapse = "|")
+colour_match
+```
+## Extract matches
+We can select the sentences that contain a colour, and then extract the colour to figure out which one it is:
+```{r color_regex_extract, eval=T, cache=T}
+has_colour <- str_subset(sentences, colour_match)
+matches <- str_extract(has_colour, colour_match)
+head(matches)
+```
+## Grouped matches
+Imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the”.
+```{r noun_regex, eval=T, cache=T}
+noun <- "(a|the) ([^ ]+)"
+has_noun <- sentences %>%
+  str_subset(noun) %>%
+  head(10)
+has_noun %>% 
+  str_extract(noun)
+```
+## Grouped matches
+`str_extract()` gives us the complete match; `str_match()` gives each individual component.
+```{r noun_regex_match, eval=T, cache=T}
+has_noun %>% 
+  str_match(noun)
+```
+## Exercises
+- Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
+## Replacing matches
+Instead of replacing with a fixed string you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words.
+```{r replacing_matches, eval=T, cache=T}
+sentences %>% 
+  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
+  head(5)
+```
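The same backreference syntax works on a single short string, which makes the reordering easier to follow. In this sketch the input sentence is made up; `str_replace()` changes only the first match, while `str_replace_all()` changes every match.

```r
library(stringr)

x <- "one two three four"

# Swap the second and third words of the first three-word match.
swapped <- str_replace(x, "([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2")

# Swap the words of every consecutive pair.
pairs_swapped <- str_replace_all(x, "([^ ]+) ([^ ]+)", "\\2 \\1")

swapped
pairs_swapped
```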
+## Exercises
+- Replace all forward slashes in a string with backslashes.
+- Implement a simple version of `str_to_lower()` using `str_replace_all()`.
+- Switch the first and last letters in words. Which of those strings are still words?
+## Splitting
+```{r splitting, eval=T, cache=T}
+sentences %>%
+  head(5) %>% 
+  str_split("\\s")
+```
\ No newline at end of file
--- a/session_8/slides.Rmd
+++ b/session_8/slides.Rmd
+---
+title: '#8 Factors'
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
+date: "31 Jan 2020"
+always_allow_html: yes
+output:
+  slidy_presentation:
+    highlight: tango
+  beamer_presentation:
+    theme: metropolis
+    slide_level: 3
+    fig_caption: no
+    df_print: tibble
+    highlight: tango
+    latex_engine: xelatex
+---
+```{r setup, include=FALSE, cache=TRUE}
+knitr::opts_chunk$set(echo = TRUE)
+library(tidyverse)
+```
+## Creating factors
+Imagine that you have a variable that records month:
+```{r declare_month, eval=T, cache=T}
+x1 <- c("Dec", "Apr", "Jan", "Mar")
+```
+Using a string to record this variable has two problems:
+1. There are only twelve possible months, and there’s nothing saving you from typos:
+```{r declare_month2, eval=T, cache=T}
+x2 <- c("Dec", "Apr", "Jam", "Mar")
+```
+2. It doesn’t sort in a useful way:
+```{r sort_month, eval=T, cache=T}
+sort(x1)
+```
+## Creating factors
+You can fix both of these problems with a factor.
+```{r sort_month_factor, eval=T, cache=T}
+month_levels <- c(
+  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
+  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
+)
+y1 <- factor(x1, levels = month_levels)
+y1
+sort(y1)
+```
+## Creating factors
+And any values not in the set will be converted to NA (`parse_factor()` also warns you about them):
+```{r sort_month_factor2, eval=T, cache=T}
+y2 <- parse_factor(x2, levels = month_levels)
+y2
+```
+Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.
+```{r inorder_month_factor, eval=T, cache=T}
+f2 <- x1 %>% factor() %>% fct_inorder()
+f2
+levels(f2)
+```
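For comparison, here is a base R sketch of the two level orders. Using `unique()` to supply the levels is an assumption of this example, not something the slides rely on; it gives the same first-appearance order without forcats.

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

# Default: levels are sorted alphabetically.
alphabetical_levels <- levels(factor(x1))

# Levels in order of first appearance in the data.
appearance_levels <- levels(factor(x1, levels = unique(x1)))

alphabetical_levels
appearance_levels
```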
+## General Social Survey
+```{r race_count, eval=T, cache=T}
+gss_cat %>%
+  count(race)
+```
+## General Social Survey
+By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:
+```{r race_plot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(gss_cat, aes(race)) +
+  geom_bar() +
+  scale_x_discrete(drop = FALSE)
+```
+## Modifying factor order
+It’s often useful to change the order of the factor levels in a visualisation.
+```{r tv_hour, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+relig_summary <- gss_cat %>%
+  group_by(relig) %>%
+  summarise(
+    age = mean(age, na.rm = TRUE),
+    tvhours = mean(tvhours, na.rm = TRUE),
+    n = n()
+  )
+ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
+```
+**8_a**
+## Modifying factor order
+It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of `relig` using `fct_reorder()`. `fct_reorder()` takes three arguments:
+- `f`, the factor whose levels you want to modify.
+- `x`, a numeric vector that you want to use to reorder the levels.
+- Optionally, `fun`, a function that’s used if there are multiple values of `x` for each value of `f`. The default value is `median`.
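The three arguments can be tried on a toy factor before applying them to `gss_cat`. The data below are invented for illustration; in forcats the summary function is passed as `.fun`.

```r
library(forcats)

f <- factor(rep(c("a", "b", "c"), each = 3))
x <- c(1, 1, 10,  2, 2, 2,  3, 3, 3)

# Default summary (median): a = 1, b = 2, c = 3.
by_median <- levels(fct_reorder(f, x))

# With the mean instead: a = 4, b = 2, c = 3.
by_mean <- levels(fct_reorder(f, x, .fun = mean))

by_median
by_mean
```

The outlier in level "a" moves it to the end when the mean is used, but leaves it first under the default median.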
+## Modifying factor order
+```{r tv_hour_order, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
+  geom_point()
+```
+**8_b**
+## Modifying factor order
+As you start making more complicated transformations, I’d recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as:
+```{r tv_hour_order_mutate, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+relig_summary %>%
+  mutate(relig = fct_reorder(relig, tvhours)) %>%
+  ggplot(aes(tvhours, relig)) +
+    geom_point()
+```
+**8_c**
+## `fct_reorder2()`
+Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend.
+```{r fct_reorder2, eval=T, plot=T}
+by_age <- gss_cat %>%
+  filter(!is.na(age)) %>%
+  count(age, marital) %>%
+  group_by(age) %>%
+  mutate(prop = n / sum(n))
+```
+**8_d**
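What `fct_reorder2()` does to the levels can be seen on a toy example with three short "lines". The data are invented for illustration; the sketch assumes the default behaviour, where levels are sorted in decreasing order of the `y` value found at the largest `x`.

```r
library(forcats)

f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 2, 1, 2, 1, 2)
y <- c(5, 1, 2, 3, 1, 2)

# At the largest x (x = 2) the y values are a = 1, b = 3, c = 2,
# so the level whose line ends highest comes first.
reordered_levels <- levels(fct_reorder2(f, x, y))

reordered_levels
```

Because the legend follows the level order, its first entry is then the line that ends highest on the right of the plot.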
+## `fct_reorder2()`
+```{r fct_reorder2a, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(by_age, aes(age, prop, colour = marital)) +
+  geom_line(na.rm = TRUE)
+```
+**8_e**
+## `fct_reorder2()`
+```{r fct_reorder2b, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
+  geom_line() +
+  labs(colour = "marital")
+```
+**8_f**
\ No newline at end of file