From c9a59b1cb1238e674490c4cae900bf1c1ba7dd22 Mon Sep 17 00:00:00 2001
From: Gilquin <laurent.gilquin@ens-lyon.fr>
Date: Wed, 29 Jan 2025 18:34:00 +0100
Subject: [PATCH] fix: correct exercises and typos

* unified the vowels definition (in some exercises the "y" character was missing)
* moved one exercise where the repetition special character "+" was used before the section introducing it
* changed the "Grouping" section title to "Capture group"
* illustrated the difference between functions str_extract and str_extract_all
---
 session_7/session_7.Rmd | 65 +++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 32 deletions(-)

diff --git a/session_7/session_7.Rmd b/session_7/session_7.Rmd
index f869258..6152d0d 100644
--- a/session_7/session_7.Rmd
+++ b/session_7/session_7.Rmd
@@ -254,6 +254,12 @@ str_view(x, "^apple$")
     d. Have seven letters or more.

     Since this list is long, you might want to use the match argument to `str_view()` to show only the matching or non-matching words.
+
+3. What is the difference between these two commands:
+    ```{r, str_viewanchorsdiff, eval=F, cache=T}
+    str_view(stringr::words, "(or|ing$)")
+    str_view(stringr::words, "(or|ing)$")
+    ```
 :::

 <details><summary>Solution</summary>
@@ -261,7 +267,7 @@ str_view(x, "^apple$")

 1. We would need the pattern `"\\$\\^\\$"`

-<p></p>
+</p><p>

 2.
     a. start with "y": `"^y"`
@@ -269,6 +275,10 @@ str_view(x, "^apple$")
     c. three letters long: `"^...$"`
     d. seven letters or more: `"......."`

+</p><p>
+
+3. `"(or|ing$)"` matches words that either contain "or" or end with "ing", while `"(or|ing)$"` matches words that end either with "or" or "ing".
+
 </p>
 </details>

@@ -301,9 +311,8 @@ str_view(c("grey", "gray"), "gr(e|a)y")
 Create regular expressions to find all words that:

 1. Start with a vowel.
-2. That only contains consonants (Hint: thinking about matching "not"-vowels).
-3. End with "ed", but not with "eed".
-4. End with "ing" or "ise".
+2. End with "ed", but not with "eed".
+3. End with "ing" or "ise".

 :::

@@ -311,17 +320,10 @@ Create regular expressions to find all words that:
 <p>

 1. start with a vowel: `"^[aeiouy]"`
-
-2. decomposition:
-    - start with a consonant: `"^[^aeiouy]"`
-    - contains one or more consonant: `"[^aeiouy]+"`
-    - end with a consonant: `"[^aeiouy]$"`
-    - result is: `"^[^aeiouy][^aeiouy]+[^aeiouy]$"`.

-3. `"[^e]ed$"`
+2. `"[^e]ed$"`

-4. `"(ing|ise)$"`
+3. `"(ing|ise)$"`

 </p>
 </details>
@@ -369,6 +371,7 @@ str_view(x, "C{2,3}")
     a. Start with three consonants.
     b. Have three or more vowels in a row.
     c. Have two or more vowel-consonant pairs in a row.
+    d. Contain only consonants (Hint: thinking about matching "not"-vowels).

 :::

@@ -385,15 +388,16 @@ str_view(x, "C{2,3}")
 <p></p>

 2.
-    a. `"^[^aeoiouy]{3}"`
-    b. `"[aeiou]{3,}"`
-    c. `"([aeiou][^aeiou]){2,}"`
+    a. `"^[^aeiouy]{3}"`
+    b. `"[aeiouy]{3,}"`
+    c. `"([aeiouy][^aeiouy]){2,}"`
+    d. `"^[^aeiouy]+$"`

 </p>
 </details>


-### Grouping
+### Capture group

 You learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with back references, like `\1`, `\2` etc.

@@ -459,7 +463,7 @@ sum(str_detect(words, "^t"))
 What proportion of common words ends with a vowel?

 ```{r str_view_match_c, eval=T, cache=T}
-mean(str_detect(words, "[aeiou]$"))
+mean(str_detect(words, "[aeiouy]$"))
 ```

 ### Combining detection
@@ -467,25 +471,21 @@ mean(str_detect(words, "[aeiou]$"))
 Find all words containing at least one vowel, and negate

 ```{r str_view_detection, eval=T, cache=T}
-no_vowels_1 <- !str_detect(words, "[aeiou]")
+no_vowels_1 <- !str_detect(words, "[aeiouy]")
 ```

 Find all words consisting only of consonants (non-vowels)

 ```{r str_view_detection_b, eval=T, cache=T}
-no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
+no_vowels_2 <- str_detect(words, "^[^aeiouy]+$")
 identical(no_vowels_1, no_vowels_2)
 ```

 ### With tibble

 ```{r str_detecttibble, eval=T, cache=T}
-df <- tibble(
-  word = words,
-  i = seq_along(word)
-)
-df %>%
-  filter(str_detect(word, "x$"))
+df <- tibble(word = words) %>% mutate(i = rank(word))
+df %>% filter(str_detect(word, "x$"))
 ```

 ### Extract matches
@@ -502,14 +502,15 @@ colour_match <- str_c(colours, collapse = "|")
 colour_match
 ```

-### Extract matches
-
-We can select the sentences that contain a colour, and then extract the colour to figure out which one it is:
+We can select the sentences that contain a colour, and then extract the first colour from each sentence:

 ```{r color_regex_extract, eval=T, cache=T}
-has_colour <- str_subset(sentences, colour_match)
-matches <- str_extract(has_colour, colour_match)
-head(matches)
+sentences %>% str_subset(colour_match) %>% str_extract(colour_match)
+```
+
+We can also extract all colours from each selected sentence, as a list of vectors:
+```{r color_regex_extract_all, eval=F, cache=T}
+sentences %>% str_subset(colour_match) %>% str_extract_all(colour_match)
 ```

 ### Grouped matches
-- 
GitLab