Verified Commit debf0f15 authored by Laurent Modolo's avatar Laurent Modolo

add sesion 7 and 8

parent 7772b8fd
---
title: '#7 String & RegExp'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "08 Nov 2019"
always_allow_html: yes
output:
  beamer_presentation:
    theme: metropolis
    slide_level: 3
    fig_caption: no
    df_print: tibble
    highlight: tango
    latex_engine: xelatex
  slidy_presentation:
    highlight: tango
---
```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
```
## String basics
```
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
```
If you forget to close a quote, you’ll see +, the continuation character:
```
> "This is a string without a closing quote
+
+
+ HELP I'M STUCK
```
If this happens to you, press Escape and try again!
## String basics
To include a literal single or double quote in a string you can use \ to “escape” it:
```
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```
If you want to include a literal backslash, you’ll need to double it up: `"\\"`.
## String basics
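The printed representation of a string is not the same as the string itself, because the printed representation shows the escapes. To see the raw contents of the string, use `writeLines()`: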
```
x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \
```
## String basics
Special characters:
The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`
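As a minimal sketch of how these escapes behave (the variable names `msg`, `tab_len` and `newline_len` are only illustrative), the following compares the printed representation with the actual content and checks that each escape sequence counts as a single character:

```r
# "\t" and "\n" are each ONE character, even though they take two keystrokes
msg <- "name\tvalue\nx\t1"

# print() shows the escaped representation ...
print(msg)

# ... while writeLines() shows the actual content (a tab and a line break)
writeLines(msg)

# Each escape sequence counts as a single character
tab_len <- nchar("\t")
newline_len <- nchar("\n")
```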
## String basics
- String length
```{r str_length, eval=T, message=FALSE, cache=T}
str_length(c("a", "R for data science", NA))
```
- Combining strings
```{r str_c, eval=T, message=FALSE, cache=T}
str_c("x", "y", "z")
```
- Subsetting strings
```{r str_sub, eval=T, message=FALSE, cache=T}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
```
## String basics
- Subsetting strings
Negative numbers count backwards from the end:
```{r str_sub2, eval=T, message=FALSE, cache=T}
str_sub(x, -3, -1)
```
- Lower case transform
```{r str_to_lower, eval=T, message=FALSE, cache=T}
str_to_lower(x)
```
- Ordering
```{r str_sort, eval=T, message=FALSE, cache=T}
str_sort(x)
```
## Matching patterns with regular expressions
Regexps are a very terse language that allows you to describe patterns in strings.
To learn regular expressions, we’ll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. (In stringr 1.5.0 and later, `str_view()` shows every match and `str_view_all()` is deprecated.)
## Matching patterns with regular expressions
```{r str_view, eval=T, message=FALSE, cache=T}
x <- c("apple", "banana", "pear")
str_view(x, "an")
```
The next step up in complexity is `.`, which matches any character (except a newline):
```{r str_viewdot, eval=T, message=FALSE, cache=T}
x <- c("apple", "banana", "pear")
str_view(x, ".a.")
```
## Matching patterns with regular expressions
But if `.` matches any character, how do you match the character `.`? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match a `.`, you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we need the string `"\\."`.
## Matching patterns with regular expressions
```{r str_viewdotescape, eval=T, message=FALSE, cache=T}
dot <- "\\."
writeLines(dot)
str_view(c("abc", "a.c", "bef"), "a\\.c")
```
## Matching patterns with regular expressions
If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write `"\\\\"`: four backslashes to match one!
## Matching patterns with regular expressions
```{r str_viewbackslashescape, eval=T, message=FALSE, cache=T}
x <- "a\\b"
writeLines(x)
str_view(x, "\\\\")
```
## Exercises
- Explain why each of these strings doesn’t match a `\`: `"\"`, `"\\"`, `"\\\"`.
- How would you match the sequence `"'\`?
- What patterns will the regular expression `\..\..\..` match? How would you represent it as a string?
## Anchors
- `^` matches the start of the string.
- `$` matches the end of the string.
```{r str_viewanchors, eval=T, cache=T}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
```
## Anchors
- `^` matches the start of the string.
- `$` matches the end of the string.
```{r str_viewanchorsend, eval=T, cache=T}
str_view(x, "a$")
```
## Anchors
- `^` matches the start of the string.
- `$` matches the end of the string.
```{r str_viewanchorsstartend, eval=T, cache=T}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "^apple$")
```
## Exercises
- How would you match the literal string `"$^$"`?
- Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
    - Start with “y”.
    - End with “x”.
    - Are exactly three letters long. (Don’t cheat by using `str_length()`!)
    - Have seven letters or more.

Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words.
## Character classes and alternatives
- `\d`: matches any digit.
- `\s`: matches any whitespace (e.g. space, tab, newline).
- `[abc]`: matches a, b, or c.
- `[^abc]`: matches anything except a, b, or c.
```
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
```
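The classes listed above can be tried with `str_detect()`. This is a small sketch on a made-up input vector (not taken from the slides); note that inside an R string the backslash of `\d` and `\s` has to be doubled:

```r
library(stringr)

# Illustrative input, not from the slides
x <- c("abc", "a1c", "a c", "a.c")

# The regexp \d is written "\\d" in an R string
has_digit <- str_detect(x, "a\\dc")     # a, one digit, c
# The regexp \s is written "\\s" in an R string
has_space <- str_detect(x, "a\\sc")     # a, one whitespace character, c
# [^abc] matches any single character except a, b or c
not_abc <- str_detect(x, "a[^abc]c")

has_digit  # FALSE  TRUE FALSE FALSE
has_space  # FALSE FALSE  TRUE FALSE
not_abc    # FALSE  TRUE  TRUE  TRUE
```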
## Character classes and alternatives
You can use alternation to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`. As with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
```
str_view(c("grey", "gray"), "gr(e|a)y")
```
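To see the low precedence of `|` in practice, here is a short sketch on an illustrative vector (the names `x`, `whole_pattern` and `grouped` are assumptions made for this example):

```r
library(stringr)

x <- c("abc", "xyz", "abxyz", "abcyz")

# Without parentheses, | splits the WHOLE pattern: "abc" or "xyz"
whole_pattern <- str_extract(x, "abc|xyz")
whole_pattern  # "abc" "xyz" "xyz" "abc"

# With parentheses, only the c/x choice alternates: "abcyz" or "abxyz"
grouped <- str_detect(x, "^ab(c|x)yz$")
grouped        # FALSE FALSE  TRUE  TRUE
```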
## Exercises
Create regular expressions to find all words that:
- Start with a vowel.
- Only contain consonants. (Hint: think about matching “not”-vowels.)
- End with ed, but not with eed.
- End with ing or ise.
## Repetition
- `?`: 0 or 1
- `+`: 1 or more
- `*`: 0 or more
```
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')
```
## Repetition
You can also specify the number of matches precisely:
- `{n}`: exactly n
- `{n,}`: n or more
- `{0,m}`: at most m (write the lower bound explicitly; the `{,m}` shorthand is not accepted by every regex engine)
- `{n,m}`: between n and m
```
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
```
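`str_view()` highlights the match in a viewer; to get the matched text as a value, the same patterns can be passed to `str_extract()`. The sketch below reuses the string from the slides (the result names are illustrative) and shows that quantifiers are greedy by default, taking as many characters as they are allowed to:

```r
library(stringr)

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"

q_optional <- str_extract(x, "CC?")      # "CC"    : C followed by 0 or 1 C
q_plus     <- str_extract(x, "CC+")      # "CCC"   : C followed by 1 or more C
q_class    <- str_extract(x, "C[LX]+")   # "CLXXX" : C followed by 1 or more L/X
q_exact    <- str_extract(x, "C{2}")     # "CC"    : exactly two C
q_range    <- str_extract(x, "C{2,3}")   # "CCC"   : greedy, takes three

# Adding ? after a quantifier makes it lazy: it takes as few as possible
q_lazy     <- str_extract(x, "C{2,3}?")  # "CC"
```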
## Exercises
- Describe in words what these regular expressions match (read carefully to see if I’m using a regular expression or a string that defines a regular expression):
    - `^.*$`
    - `"\\{.+\\}"`
    - `\d{4}-\d{2}-\d{2}`
    - `"\\\\{4}"`
- Create regular expressions to find all words that:
    - Start with three consonants.
    - Have three or more vowels in a row.
    - Have two or more vowel-consonant pairs in a row.
## Grouping
You learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like `\1`, `\2` etc.
```
str_view(fruit, "(..)\\1", match = TRUE)
```
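A small sketch of backreferences on a made-up word list (`words_demo` and the result names are illustrative, not from the slides). `str_subset()` keeps the matching strings, and `str_match()` shows what the capturing group stored:

```r
library(stringr)

words_demo <- c("banana", "coconut", "apple", "pepper")

# (..)\1 : any two characters, immediately followed by the same two characters
repeated_pair <- str_subset(words_demo, "(..)\\1")
repeated_pair    # "banana" "coconut"

# (.)\1 : any single character, immediately repeated
doubled_letter <- str_subset(words_demo, "(.)\\1")
doubled_letter   # "apple" "pepper"

# Column 1 is the whole match, column 2 is capturing group number 1
captured <- str_match("banana", "(..)\\1")
captured         # "anan" "an"
```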
## Exercises
- Describe, in words, what these expressions will match:
    - `"(.)\1\1"`
    - `"(.)(.)\\2\\1"`
    - `"(..)\1"`
    - `"(.).\\1.\\1"`
    - `"(.)(.)(.).*\\3\\2\\1"`
- Construct regular expressions to match words that:
    - Start and end with the same character.
    - Contain a repeated pair of letters (e.g. “church” contains “ch” repeated twice).
    - Contain one letter repeated in at least three places (e.g. “eleven” contains three “e”s).
## Detect matches
```
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```
How many common words start with t?
```
sum(str_detect(words, "^t"))
```
What proportion of common words end with a vowel?
```
mean(str_detect(words, "[aeiou]$"))
```
## Combining detection
Find all words containing at least one vowel, then negate the result:
```
no_vowels_1 <- !str_detect(words, "[aeiou]")
```
Find all words consisting only of consonants (non-vowels)
```
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
```
## With tibble
```{r str_detecttibble, eval=T, cache=T}
df <- tibble(
  word = words,
  i = seq_along(word)
)
df %>%
  filter(str_detect(word, "x$"))
```
## Extract matches
```{r str_sentences, eval=T, cache=T}
head(sentences)
```
We want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:
```{r color_regex, eval=T, cache=T}
colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
```
## Extract matches
We can select the sentences that contain a colour, and then extract the colour to figure out which one it is:
```{r color_regex_extract, eval=T, cache=T}
has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
```
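Note that `str_extract()` returns only the first match in each string. As a sketch on made-up sentences (`demo`, `first_only` and `all_found` are illustrative names), `str_extract_all()` returns every match as a list with one character vector per input string:

```r
library(stringr)

# Same alternation as colour_match above, written out so the block runs alone
colour_match <- "red|orange|yellow|green|blue|purple"
demo <- c("a red and blue kite", "green grass", "no colour here")

# One value per string: the first match, or NA when nothing matches
first_only <- str_extract(demo, colour_match)
first_only   # "red" "green" NA

# A list: all matches for each string (possibly none)
all_found <- str_extract_all(demo, colour_match)
all_found[[1]]   # "red" "blue"
```

Because the pattern is not anchored to word boundaries, it also matches inside longer words (for example the "red" in "flickered"); wrapping the alternation as `\\b(red|...)\\b` is one way to avoid that.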
## Grouped matches
Imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the”.
```{r noun_regex, eval=T, cache=T}
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_extract(noun)
```
## Grouped matches
`str_extract()` gives us the complete match; `str_match()` gives each individual component.
```{r noun_regex_match, eval=T, cache=T}
has_noun %>%
  str_match(noun)
```
## Exercises
- Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
## Replacing matches
Instead of replacing with a fixed string you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words.
```{r replacing_matches, eval=T, cache=T}
sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)
```
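For comparison, replacing with a fixed string looks like the sketch below (the input vector and result names are illustrative). `str_replace()` changes only the first match in each string, `str_replace_all()` changes every match, and a named vector performs several replacements in one call:

```r
library(stringr)

x <- c("apple", "pear", "banana")

# Only the first vowel of each string is replaced
first_vowel <- str_replace(x, "[aeiou]", "-")
first_vowel   # "-pple" "p-ar" "b-nana"

# Every vowel is replaced
all_vowels <- str_replace_all(x, "[aeiou]", "-")
all_vowels    # "-ppl-" "p--r" "b-n-n-"

# Named vector: names are the patterns, values are the replacements
multi <- str_replace_all("1 house, 2 cars", c("1" = "one", "2" = "two"))
multi         # "one house, two cars"
```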
## Exercises
- Replace all forward slashes in a string with backslashes.
- Implement a simple version of `str_to_lower()` using `str_replace_all()`.
- Switch the first and last letters in words. Which of those strings are still words?
## Splitting
```{r splitting, eval=T, cache=T}
sentences %>%
  head(5) %>%
  str_split("\\s")
```
\ No newline at end of file
---
title: '#8 Factors'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "31 Jan 2020"
always_allow_html: yes
output:
  slidy_presentation:
    highlight: tango
  beamer_presentation:
    theme: metropolis
    slide_level: 3
    fig_caption: no
    df_print: tibble
    highlight: tango
    latex_engine: xelatex
---
```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```
## Creating factors
Imagine that you have a variable that records month:
```{r declare_month, eval=T, cache=T}
x1 <- c("Dec", "Apr", "Jan", "Mar")
```
Using a string to record this variable has two problems:
1. There are only twelve possible months, and there’s nothing saving you from typos:
```{r declare_month2, eval=T, cache=T}
x2 <- c("Dec", "Apr", "Jam", "Mar")
```
2. It doesn’t sort in a useful way:
```{r sort_month, eval=T, cache=T}
sort(x1)
```
## Creating factors
You can fix both of these problems with a factor.
```{r sort_month_factor, eval=T, cache=T}
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
sort(y1)
```
## Creating factors
Any values not in the set of levels are converted to `NA`. `factor()` does this silently; `parse_factor()` from readr, used here, also raises a warning:
```{r sort_month_factor2, eval=T, cache=T}
y2 <- parse_factor(x2, levels = month_levels)
y2
```
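For contrast, a minimal base R sketch of the silent conversion with `factor()` (the name `y2_silent` is illustrative; `month_levels` and `x2` are repeated so the block runs on its own):

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# factor() gives no warning: the typo "Jam" simply becomes NA
y2_silent <- factor(x2, levels = month_levels)
is.na(y2_silent)        # FALSE FALSE  TRUE FALSE

# Internally each value is stored as the position of its level
as.integer(y2_silent)   # 12  4 NA  3
```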
Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.
```{r inorder_month_factor, eval=T, cache=T}
f2 <- x1 %>% factor() %>% fct_inorder()
f2
levels(f2)
```
## General Social Survey
```{r race_count, eval=T, cache=T}
gss_cat %>%
  count(race)
```
## General Social Survey
By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:
```{r race_plot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
```
## Modifying factor order
It’s often useful to change the order of the factor levels in a visualisation.
```{r tv_hour, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
```
**8_a**
## Modifying factor order
It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using `fct_reorder()`. `fct_reorder()` takes three arguments:
- `.f`, the factor whose levels you want to modify.
- `.x`, a numeric vector that you want to use to reorder the levels.
- Optionally, `.fun`, a function that’s used if there are multiple values of `.x` for each value of `.f`. The default value is `median`.
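A toy sketch of these three arguments, using made-up data (the names `f`, `x`, `by_median` and `by_min` are illustrative). In current forcats the summary function is passed as `.fun`:

```r
library(forcats)

# Two values of x for each level of f
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(10, 30, 1, 3, 5, 100)

# Default: levels ordered by the median of x (a = 20, b = 2, c = 52.5)
by_median <- fct_reorder(f, x)
levels(by_median)   # "b" "a" "c"

# Custom summary: levels ordered by the minimum of x (a = 10, b = 1, c = 5)
by_min <- fct_reorder(f, x, .fun = min)
levels(by_min)      # "b" "c" "a"
```

Only the order of the levels changes; the values stored in the factor stay the same.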
## Modifying factor order
```{r tv_hour_order, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()
```
**8_b**
## Modifying factor order
As you start making more complicated transformations, I’d recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as:
```{r tv_hour_order_mutate, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
  geom_point()
```
**8_c**
## `fct_reorder2()`
Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend.
```{r fct_reorder2, eval=T, plot=T}
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))
```
**8_d**
## `fct_reorder2()`
```{r fct_reorder2a, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)
```
**8_e**
## `fct_reorder2()`
```{r fct_reorder2b, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
```
**8_f**
\ No newline at end of file