From debf0f15611956e9096354fdbb82b5ba4cce6a75 Mon Sep 17 00:00:00 2001
From: Laurent Modolo <laurent.modolo@ens-lyon.fr>
Date: Fri, 10 Sep 2021 17:34:59 +0200
Subject: [PATCH] add sessions 7 and 8

---
 session_7/slides.Rmd | 411 +++++++++++++++++++++++++++++++++++++++++++
 session_8/slides.Rmd | 168 ++++++++++++++++++
 2 files changed, 579 insertions(+)
 create mode 100644 session_7/slides.Rmd
 create mode 100644 session_8/slides.Rmd

diff --git a/session_7/slides.Rmd b/session_7/slides.Rmd
new file mode 100644
index 0000000..27d61a2
--- /dev/null
+++ b/session_7/slides.Rmd
@@ -0,0 +1,411 @@
+---
+title: '#7 String & RegExp'
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
+date: "08 Nov 2019"
+always_allow_html: yes
+output:
+  beamer_presentation:
+    theme: metropolis
+    slide_level: 3
+    fig_caption: no
+    df_print: tibble
+    highlight: tango
+    latex_engine: xelatex
+  slidy_presentation:
+    highlight: tango
+---
+```{r setup, include=FALSE, cache=TRUE}
+knitr::opts_chunk$set(echo = FALSE)
+library(tidyverse)
+```
+
+
+## String basics
+
+```
+string1 <- "This is a string"
+string2 <- 'If I want to include a "quote"
+inside a string, I use single quotes'
+```
+
+If you forget to close a quote, you'll see `+`, the continuation character:
+
+```
+> "This is a string without a closing quote
++ 
++ 
++ HELP I'M STUCK
+```
+
+If this happens to you, press Escape and try again!
+
+## String basics
+
+To include a literal single or double quote in a string, you can use `\` to "escape" it:
+
+```
+double_quote <- "\"" # or '"'
+single_quote <- '\'' # or "'"
+```
+If you want to include a literal backslash, you'll need to double it up: `"\\"`.
+ 
+## String basics
+
+The printed representation of a string is not the same as the string itself: printing shows the escapes. To see the raw contents, use `writeLines()`:
+
+```
+x <- c("\"", "\\")
+x
+#> [1] "\"" "\\"
+writeLines(x)
+#> "
+#> \
+```
+
+## String basics
+
+Special characters:
+
+The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`
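+
+As a minimal sketch (the string `x` is only an illustration), `writeLines()` shows how these two escapes are rendered:
+
+```r
+# "\n" starts a new line and "\t" inserts a tab when the string is written out
+x <- "first line\nsecond\tline"
+writeLines(x)
+# each escape sequence counts as a single character
+nchar("\n")
+```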
+
+## String basics
+
+- String length
+
+```{r str_length, eval=T, message=FALSE, cache=T}
+str_length(c("a", "R for data science", NA))
+```
+
+- Combining strings
+```{r str_c, eval=T, message=FALSE, cache=T}
+str_c("x", "y", "z")
+```
+
+- Subsetting strings
+```{r str_sub, eval=T, message=FALSE, cache=T}
+x <- c("Apple", "Banana", "Pear")
+str_sub(x, 1, 3)
+```
+
+## String basics
+- Subsetting strings:
+  negative numbers count backwards from the end
+```{r str_sub2, eval=T, message=FALSE, cache=T}
+str_sub(x, -3, -1)
+```
+
+- Lower case transform
+```{r str_to_lower, eval=T, message=FALSE, cache=T}
+str_to_lower(x)
+```
+
+- Ordering
+```{r str_sort, eval=T, message=FALSE, cache=T}
+str_sort(x)
+```
+
+## Matching patterns with regular expressions
+
+Regexps are a very terse language that allows you to describe patterns in strings.
+
+To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match.
+
+## Matching patterns with regular expressions
+
+```{r str_view, eval=T, message=FALSE, cache=T}
+x <- c("apple", "banana", "pear")
+str_view(x, "an")
+```
+
+The next step up in complexity is `.`, which matches any character (except a newline):
+
+```{r str_viewdot, eval=T, message=FALSE, cache=T}
+x <- c("apple", "banana", "pear")
+str_view(x, ".a.")
+```
+
+
+## Matching patterns with regular expressions
+
+But if `.` matches any character, how do you match the character `.` itself? You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match a `.`, you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we need the string `"\\."`.
+
+## Matching patterns with regular expressions
+
+```{r str_viewdotescape, eval=T, message=FALSE, cache=T}
+dot <- "\\."
+writeLines(dot)
+str_view(c("abc", "a.c", "bef"), "a\\.c")
+```
+
+## Matching patterns with regular expressions
+
+If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"`: four backslashes to match one!
+
+
+## Matching patterns with regular expressions
+
+```{r str_viewbackslashescape, eval=T, message=FALSE, cache=T}
+x <- "a\\b"
+writeLines(x)
+str_view(x, "\\\\")
+```
+
+## Exercises
+
+- Explain why each of these strings doesn't match a `\`: `"\"`, `"\\"`, `"\\\"`.
+- How would you match the sequence `"'\`?
+- What patterns will the regular expression `\..\..\..` match? How would you represent it as a string?
+
+## Anchors
+
+- `^` matches the start of the string.
+- `$` matches the end of the string.
+
+```{r str_viewanchors, eval=T, cache=T}
+x <- c("apple", "banana", "pear")
+str_view(x, "^a")
+```
+
+## Anchors
+
+- `^` matches the start of the string.
+- `$` matches the end of the string.
+
+```{r str_viewanchorsend, eval=T, cache=T}
+str_view(x, "a$")
+```
+
+## Anchors
+
+- `^` matches the start of the string.
+- `$` matches the end of the string.
+
+```{r str_viewanchorsstartend, eval=T, cache=T}
+x <- c("apple pie", "apple", "apple cake")
+str_view(x, "^apple$")
+```
+
+## Exercises
+
+
+
+- How would you match the literal string `"$^$"`?
+
+- Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
+  - Start with "y".
+  - End with "x".
+  - Are exactly three letters long. (Don't cheat by using `str_length()`!)
+  - Have seven letters or more.
+
+Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
+
+## Character classes and alternatives
+
+- `\d`: matches any digit.
+- `\s`: matches any whitespace (e.g. space, tab, newline).
+- `[abc]`: matches a, b, or c.
+- `[^abc]`: matches anything except a, b, or c.
+
+```
+str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
+str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
+str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
+```
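+
+The `\d` and `\s` classes follow the same string-escaping rule as `\.`: inside an R string the backslash has to be doubled. A small sketch, with made-up input strings:
+
+```r
+library(stringr)
+
+x <- c("room 101", "no digits", "tab\there")
+# "\\d" is the string form of the regexp \d (any digit)
+str_detect(x, "\\d")
+# "\\s" is the string form of the regexp \s (any whitespace, including the tab)
+str_detect(x, "\\s")
+```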
+
+## Character classes and alternatives
+
+You can use alternation to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`. Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
+
+```
+str_view(c("grey", "gray"), "gr(e|a)y")
+```
+
+## Exercises
+
+Create regular expressions to find all words that:
+
+- Start with a vowel.
+- Contain only consonants. (Hint: think about matching "not"-vowels.)
+- End with ed, but not with eed.
+- End with ing or ise.
+
+## Repetition
+
+- `?`: 0 or 1
+- `+`: 1 or more
+- `*`: 0 or more
+
+```
+x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
+str_view(x, "CC?")
+str_view(x, "CC+")
+str_view(x, 'C[LX]+')
+```
+
+## Repetition
+
+You can also specify the number of matches precisely:
+
+- `{n}`: exactly n
+- `{n,}`: n or more
+- `{0,m}`: at most m (the `{,m}` shorthand is not accepted by every regexp engine)
+- `{n,m}`: between n and m
+
+```
+str_view(x, "C{2}")
+str_view(x, "C{2,}")
+str_view(x, "C{2,3}")
+```
+
+## Exercises
+
+- Describe in words what these regular expressions match: (read carefully to see if I'm using a regular expression or a string that defines a regular expression.)
+  - `^.*$`
+  - `"\\{.+\\}"`
+  - `\d{4}-\d{2}-\d{2}`
+  - `"\\\\{4}"`
+- Create regular expressions to find all words that:
+  - Start with three consonants.
+  - Have three or more vowels in a row.
+  - Have two or more vowel-consonant pairs in a row.
+
+
+## Grouping
+
+You learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like `\1`, `\2` etc. 
+
+```
+str_view(fruit, "(..)\\1", match = TRUE)
+```
+
+## Exercises
+
+
+
+- Describe, in words, what these expressions will match:
+  - `"(.)\1\1"`
+  - `"(.)(.)\\2\\1"`
+  - `"(..)\1"`
+  - `"(.).\\1.\\1"`
+  - `"(.)(.)(.).*\\3\\2\\1"`
+- Construct regular expressions to match words that:
+  - Start and end with the same character.
+  - Contain a repeated pair of letters (e.g. `"church"` contains `"ch"` repeated twice.)
+  - Contain one letter repeated in at least three places (e.g. `"eleven"` contains three `"e"`s.)
+
+## Detect matches
+
+```
+x <- c("apple", "banana", "pear")
+str_detect(x, "e")
+```
+
+How many common words start with t?
+
+```
+sum(str_detect(words, "^t"))
+```
+
+What proportion of common words end with a vowel?
+
+```
+mean(str_detect(words, "[aeiou]$"))
+```
+
+## Combining detection
+
+Find all words containing at least one vowel, then negate:
+```
+no_vowels_1 <- !str_detect(words, "[aeiou]")
+```
+
+Find all words consisting only of consonants (non-vowels):
+```
+no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
+identical(no_vowels_1, no_vowels_2)
+```
+
+## With tibble
+
+```{r str_detecttibble, eval=T, cache=T}
+df <- tibble(
+  word = words, 
+  i = seq_along(word)
+)
+df %>% 
+  filter(str_detect(word, "x$"))
+```
+
+## Extract matches
+
+```{r str_sentences, eval=T, cache=T}
+head(sentences)
+```
+
+We want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:
+
+```{r color_regex, eval=T, cache=T}
+colours <- c("red", "orange", "yellow", "green", "blue", "purple")
+colour_match <- str_c(colours, collapse = "|")
+colour_match
+```
+
+## Extract matches
+
+We can select the sentences that contain a colour, and then extract the colour to figure out which one it is:
+
+```{r color_regex_extract, eval=T, cache=T}
+has_colour <- str_subset(sentences, colour_match)
+matches <- str_extract(has_colour, colour_match)
+head(matches)
+```
+
+## Grouped matches
+
+Imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the".
+
+```{r noun_regex, eval=T, cache=T}
+noun <- "(a|the) ([^ ]+)"
+has_noun <- sentences %>%
+  str_subset(noun) %>%
+  head(10)
+has_noun %>% 
+  str_extract(noun)
+```
+
+## Grouped matches
+
+`str_extract()` gives us the complete match; `str_match()` gives each individual component.
+
+```{r noun_regex_match, eval=T, cache=T}
+has_noun %>% 
+  str_match(noun)
+```
+
+## Exercises
+
+- Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.
+
+## Replacing matches
+
+Instead of replacing with a fixed string you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words.
+
+```{r replacing_matches, eval=T, cache=T}
+sentences %>% 
+  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
+  head(5)
+```
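+
+For comparison, replacing with a fixed string can be sketched as follows. `str_replace()` changes only the first match in each string, while `str_replace_all()` changes every match; the vector `x` is only an illustration:
+
+```r
+library(stringr)
+
+x <- c("apple", "pear", "banana")
+# replace only the first vowel of each string
+str_replace(x, "[aeiou]", "-")
+# replace every vowel of each string
+str_replace_all(x, "[aeiou]", "-")
+```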
+
+## Exercises
+
+- Replace all forward slashes in a string with backslashes.
+- Implement a simple version of `str_to_lower()` using `str_replace_all()`.
+- Switch the first and last letters in words. Which of those strings are still words?
+
+## Splitting
+
+```{r splitting, eval=T, cache=T}
+sentences %>%
+  head(5) %>% 
+  str_split("\\s")
+```
\ No newline at end of file
diff --git a/session_8/slides.Rmd b/session_8/slides.Rmd
new file mode 100644
index 0000000..add9984
--- /dev/null
+++ b/session_8/slides.Rmd
@@ -0,0 +1,168 @@
+---
+title: '#8 Factors'
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
+date: "31 Jan 2020"
+always_allow_html: yes
+output:
+  slidy_presentation:
+    highlight: tango
+  beamer_presentation:
+    theme: metropolis
+    slide_level: 3
+    fig_caption: no
+    df_print: tibble
+    highlight: tango
+    latex_engine: xelatex
+---
+```{r setup, include=FALSE, cache=TRUE}
+knitr::opts_chunk$set(echo = TRUE)
+library(tidyverse)
+```
+
+## Creating factors
+
+Imagine that you have a variable that records month:
+
+```{r declare_month, eval=T, cache=T}
+x1 <- c("Dec", "Apr", "Jan", "Mar")
+```
+
+Using a string to record this variable has two problems:
+
+1. There are only twelve possible months, and there's nothing saving you from typos:
+
+```{r declare_month2, eval=T, cache=T}
+x2 <- c("Dec", "Apr", "Jam", "Mar")
+```
+
+2. It doesn't sort in a useful way:
+
+```{r sort_month, eval=T, cache=T}
+sort(x1)
+```
+
+## Creating factors
+
+You can fix both of these problems with a factor.
+
+```{r sort_month_factor, eval=T, cache=T}
+month_levels <- c(
+  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
+  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
+)
+y1 <- factor(x1, levels = month_levels)
+y1
+sort(y1)
+```
+
+## Creating factors
+
+And any values not in the set will be converted to `NA` (`parse_factor()` also emits a warning):
+
+```{r sort_month_factor2, eval=T, cache=T}
+y2 <- parse_factor(x2, levels = month_levels)
+y2
+```
+
+Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data.
+
+```{r inorder_month_factor, eval=T, cache=T}
+f2 <- x1 %>% factor() %>% fct_inorder()
+f2
+levels(f2)
+```
+
+## General Social Survey
+
+```{r race_count, eval=T, cache=T}
+gss_cat %>%
+  count(race)
+```
+
+## General Social Survey
+
+By default, ggplot2 will drop levels that don't have any values. You can force them to display with:
+
+```{r race_plot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(gss_cat, aes(race)) +
+  geom_bar() +
+  scale_x_discrete(drop = FALSE)
+```
+
+## Modifying factor order
+
+It's often useful to change the order of the factor levels in a visualisation.
+
+```{r tv_hour, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+relig_summary <- gss_cat %>%
+  group_by(relig) %>%
+  summarise(
+    age = mean(age, na.rm = TRUE),
+    tvhours = mean(tvhours, na.rm = TRUE),
+    n = n()
+  )
+ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
+```
+
+**8_a**
+
+## Modifying factor order
+
+It is difficult to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of `relig` using `fct_reorder()`. `fct_reorder()` takes three arguments:
+
+- `.f`, the factor whose levels you want to modify.
+- `.x`, a numeric vector that you want to use to reorder the levels.
+- Optionally, `.fun`, a function that's used if there are multiple values of `.x` for each value of `.f`. The default value is `median`.
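+
+A toy illustration of these arguments (the factor `f` and the values `x` are invented for this sketch; in forcats the summary function is passed as `.fun`):
+
+```r
+library(forcats)
+
+f <- factor(c("a", "a", "a", "b", "b", "b"))
+x <- c(1, 9, 10, 4, 5, 6)
+# default summary is the median: a -> 9, b -> 5, so "b" becomes the first level
+levels(fct_reorder(f, x))
+# with the minimum instead: a -> 1, b -> 4, so "a" stays first
+levels(fct_reorder(f, x, .fun = min))
+```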
+
+## Modifying factor order
+
+```{r tv_hour_order, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
+  geom_point()
+```
+
+**8_b**
+
+## Modifying factor order
+
+As you start making more complicated transformations, I'd recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as:
+
+```{r tv_hour_order_mutate, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+relig_summary %>%
+  mutate(relig = fct_reorder(relig, tvhours)) %>%
+  ggplot(aes(tvhours, relig)) +
+    geom_point()
+```
+**8_c**
+
+## `fct_reorder2()`
+
+Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend.
+
+```{r fct_reorder2, eval=T, plot=T}
+by_age <- gss_cat %>%
+  filter(!is.na(age)) %>%
+  count(age, marital) %>%
+  group_by(age) %>%
+  mutate(prop = n / sum(n))
+```
+**8_d**
+
+## `fct_reorder2()`
+
+```{r fct_reorder2a, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(by_age, aes(age, prop, colour = marital)) +
+  geom_line(na.rm = TRUE)
+```
+
+**8_e**
+
+## `fct_reorder2()`
+
+```{r fct_reorder2b, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
+ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
+  geom_line() +
+  labs(colour = "marital")
+```
+
+**8_f**
\ No newline at end of file
-- 
GitLab