From 9e84c706d0ce6376a07db68179447ea85cbffa6d Mon Sep 17 00:00:00 2001 From: Gilquin <laurent.gilquin@ens-lyon.fr> Date: Wed, 17 Jul 2024 11:16:34 +0200 Subject: [PATCH] feat: reshape exercise code block * replace pencadre div by custom Quarto callout block * correct exercise typos (english, markdown) * add solutions to session 7 * reshape author(s) --- session_1/session_1.Rmd | 80 ++++++----- session_2/session_2.Rmd | 56 ++++---- session_3/session_3.Rmd | 84 +++++------ session_4/session_4.Rmd | 122 +++++++++------- session_5/session_5.Rmd | 59 ++++---- session_6/session_6.Rmd | 28 ++-- session_7/session_7.Rmd | 307 +++++++++++++++++++++++++++++++++------- session_8/session_8.Rmd | 9 +- 8 files changed, 484 insertions(+), 261 deletions(-) diff --git a/session_1/session_1.Rmd b/session_1/session_1.Rmd index a5b5154..d7ac9e3 100644 --- a/session_1/session_1.Rmd +++ b/session_1/session_1.Rmd @@ -1,7 +1,11 @@ --- title: 'R.1: Introduction to R and RStudio' -author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\n Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)" +author: + - "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" + - "Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)" date: "2022" +filters: + - callout-exercise --- ```{r include=FALSE} @@ -151,15 +155,14 @@ Now that we know what we should do and what to expect, we are going to try some - Exponents: `^` or `**` - Parentheses: `(`, `)` -<div class="pencadre"> <!-- TODO: replace with quarto callout --> Now Open RStudio. +::: {.callout-tip} You can `copy paste` but I advise you to practice writing directly in the terminal. Like all the languages, you will become more familiar with R by using it. To validate the line at the end of your command: press `Return`. 
-</div> - +::: ### First commands @@ -283,11 +286,11 @@ If we want our future programs to be able to perform automatic choices, we need Comparisons can be made with R. The result will return a `TRUE` or `FALSE` value (which is not a number as before but a `boolean` type). -<div class="pencadre"> <!-- TODO: replace with quarto callout --> +::: {.callout-exercise} Try the following operator to get a `TRUE` then change your command to get a `FALSE`. You can use the `↑` (upper arrow) key to edit the last command and go through your history of commands -</div> +::: - equality (note: two equal signs read as "is equal to") @@ -315,8 +318,8 @@ You can use the `↑` (upper arrow) key to edit the last command and go through 1 > 0 ``` -<div class="pencadre"> <!-- TODO: replace with quarto callout --> - **Summary so far** +::: {.callout-note} +## Summary so far - R is a programming language and free software environment for statistical computing and graphics (free & opensource) with a large library of external packages available for performing diverse tasks. @@ -324,7 +327,7 @@ computing and graphics (free & opensource) with a large library of external pack - R can be used as a calculator - R can perform comparisons -</div> +::: ## Variables and assignment @@ -418,7 +421,9 @@ camelCaseToSeparateWords What you use is up to you, but be consistent. -<div class="pencadre"> <!-- TODO: replace with quarto callout --> Which of the following are valid R variable names?</div> +::: {.callout-exercise} +Which of the following are valid R variable names? +::: ```{r eval=F, } min_height @@ -478,9 +483,9 @@ This block allows you to view the different outputs (?help, graphs, etc.).  
-<div class="pencadre"> <!-- TODO: replace with quarto callout --> +::: {.callout-exercise} Test that your `logarithm` function can work in base 10 -</div> +::: <details><summary>Solution</summary> <p> @@ -540,8 +545,9 @@ function_name <- function(a, b){ - The order of arguments is important -<div class="pencadre"> <!-- TODO: replace with quarto callout --> +::: {.callout-exercise} Predict the result of R1, R2 and R3. +::: ```R minus <- function(a, b) { @@ -560,7 +566,6 @@ a <- 2 b <- 10 R3 <- minus(b, a) ``` -</div> <details><summary>Solution 1</summary> <p> @@ -596,8 +601,10 @@ minus(b, a) - Naming variables is more explicit and bypasses the order. -<div class="pencadre"> <!-- TODO: replace with quarto callout --> +::: {.callout-exercise} Predict the result of R1, R2, R3 and R4. +::: + ```R a <- 10 b <- 2 @@ -619,7 +626,6 @@ R3 <- a ## R4 R4 <- minus(b = b, a = a) ``` -</div> <details><summary>Solution 1</summary> <p> @@ -686,10 +692,9 @@ print_hw <- function() { } ``` -<div class="pencadre"> <!-- TODO: replace with quarto callout --> +::: {.callout-exercise} What is the difference between `print_hw` and `print_hw()` ? - -</div> +::: <details><summary>Solution</summary> <p> @@ -708,9 +713,8 @@ print_hw() </details> -### Some exercices +### Challenges -<div class="pencadre"> <!-- TODO: replace with quarto callout --> 1. Try a function (`rect_area`) to calculate the area of a rectangle of length "L" and width "W" 2. (more difficult) Try a function (`even_test`) to test if a number is even? @@ -723,8 +727,6 @@ of the modulo is equal to `0`. 3. Using your `even_test` function, write a new function `even_print` which will print the string "This number is even" or "This number is odd". You will need the `if`, `else` statements and the function `print`. Find help on how to use them. -</div> - <details><summary>Solution 1 </summary> <p> @@ -819,9 +821,9 @@ Check the documentation of this command. 
ls() ``` -<div class="pencadre"> <!-- TODO: replace with quarto callout --> +::: {.callout-exercise} Combine `rm` and `ls` to cleanup your *Environment* -</div> +::: <details><summary>Solution</summary> <p> @@ -835,7 +837,7 @@ rm(list = ls()) ls() ``` -<div class='pencadre'> <!-- TODO: replace with quarto callout --> +::: {.callout-note} **Summary so far:** - Assigning a variable is done with `<-`. @@ -844,7 +846,7 @@ ls() - Functions are also variables and can write in several forms. - An editing box is available on RStudio. -</div> +::: ## Complex variable type @@ -955,14 +957,14 @@ x[x > 5] <- 13 x ``` -<div class="pencadre"> <!-- TODO: replace with quarto callout --> +::: {.callout-note} **Summary so far** - A variable can be of different types : `numeric`, `character`, `vector`, `function`, etc. - Calculations and comparisons apply to vectors. - Do not hesitate to use the help box to understand functions! -</div> +::: We will see other complex variable types during this formation. @@ -993,16 +995,16 @@ or you can click on `Tools` and `Install Packages...`  -<!-- Install also the `ggplot2` package. --> +Install also the `ggplot2` package. -<!-- <details><summary>Solution</summary> --> -<!-- <p> --> -<!-- ```R --> -<!-- install.packages("ggplot2") --> -<!-- ``` --> -<!-- </p> --> -<!-- </details> --> +<details><summary>Solution</summary> +<p> +```R +install.packages("ggplot2") +``` +</p> +</details> #### From Bioconducor @@ -1057,9 +1059,9 @@ The command `sessionInfo` displays your session information. 
sessionInfo() ``` -<div class='pencadre'> <!-- TODO: replace with quarto callout --> +::: {.callout-exercise} Use the command `library` to load the `ggplot2` package and check your session -</div> +::: <details><summary>Solution</summary> <p> diff --git a/session_2/session_2.Rmd b/session_2/session_2.Rmd index 96516d4..31a153b 100644 --- a/session_2/session_2.Rmd +++ b/session_2/session_2.Rmd @@ -1,7 +1,11 @@ --- title: "R.2: introduction to Tidyverse" -author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nHélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)" +author: + - "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" + - "Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)" date: "2022" +filters: + - callout-exercise --- ```{r include=FALSE} @@ -216,9 +220,9 @@ ggplot(data = new_mpg) + geom_point(mapping = aes(x = displ, y = hwy)) ``` -<div class="pencadre"> +::: {.callout-exercise} Are cars with bigger engines less fuel efficient ? -</div> +::: `ggplot2` is a system for declaratively creating graphics, based on [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). You provide the data, tell `ggplot2` how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. @@ -234,9 +238,10 @@ ggplot(data = <DATA>) + ( for instance, `geom_point()` adds a layer with a scatterplot ) - each **geom** function in ggplot2 takes a `mapping` argument - the `mapping` argument is always paired with aesthetics `aes()` -<div class="pencadre"> + +::: {.callout-exercise} What happened when you only use the command `ggplot(data = mpg)` ? -</div> +::: <details><summary>Solution</summary> <p> @@ -247,9 +252,9 @@ ggplot(data = new_mpg) </details> -<div class="pencadre"> +::: {.callout-exercise} Make a scatterplot of `hwy` ( fuel efficiency ) vs. `cyl` ( number of cylinders ). 
-</div> +::: <details><summary>Solution</summary> @@ -261,9 +266,9 @@ ggplot(data = new_mpg, mapping = aes(x = hwy, y = cyl)) + </p> -<div class="pencadre"> +::: {.callout-exercise} What seems to be the problem ? -</div> +::: <details><summary>Solution</summary> <p> @@ -326,9 +331,9 @@ Here is a list of different shapes available in R: {width=300px} </center> -<div class="pencadre"> +::: {.callout-exercise} What's gone wrong with this code? Why are the points not blue? -</div> +::: ```{r new_mpg_plot_not_blue, cache = TRUE, fig.width=8, fig.height=4.5} ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = "blue")) + @@ -353,9 +358,9 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = cyl)) + geom_point() ``` -<div class="pencadre"> +::: {.callout-exercise} What happens if you map an aesthetic to something other than a variable name, like `color = displ < 5`? -</div> +::: <details><summary>Solution</summary> <p> @@ -380,9 +385,9 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + facet_wrap(~class, nrow = 2) ``` -<div class="pencadre"> +::: {.callout-exercise} Now try to facet your plot by `fuel + class` -</div> +::: <details><summary>Solution</summary> @@ -443,23 +448,19 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + geom_smooth(data = filter(mpg, class == "subcompact")) ``` -## Challenge ! +## Challenges ### First challenge -<div class="pencadre"> + Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions. -</div> ```R ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) + geom_point(show.legend = FALSE) + geom_smooth(se = FALSE) ``` -<div class="pencadre"> - What does `show.legend = FALSE` do? - What does the `se` argument to `geom_smooth()` do? 
-</div> - <details><summary>Solution</summary> <p> @@ -475,15 +476,12 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) + ### Second challenge -<div class="pencadre"> How being a `Two Seaters` car (*class column*) impact the engine size (*displ column*) versus fuel efficiency relationship (*hwy column*) ? 1. Make a plot of `hwy` in function of `displ ` 1. *Colorize* this plot in another color for `Two Seaters` class 2. *Split* this plot for each *class* -</div> - <details><summary>Solution 1</summary> <p> @@ -520,9 +518,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + </details> -<div class="pencadre"> -Write a `function` called `plot_color_a_class` that can take as argument the class and plot the same graph for this class -</div> +Write a `function` called `plot_color_a_class` that can take as argument the class and plot the same graph for this class. <details><summary>Solution</summary> <p> @@ -543,9 +539,7 @@ plot_color_a_class("Compact Cars") ### Third challenge -<div class="pencadre"> Recreate the R code necessary to generate the following graph (see "linetype" option of `geom_smooth`) -</div> ```{r new_mpg_plot_u, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE} ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) + @@ -646,9 +640,9 @@ ggsave("test_plot_1_and_2.png", p_final, width = 20, height = 8, units = "cm") You can learn more features about `cowplot` on [https://wilkelab.org/cowplot/articles/introduction.html](its website). -<div class="pencadre"> +::: {.callout-exercise} Use the `cowplot` documentation to reproduce this plot and save it. 
-</div> +::: ```{r, echo=F} p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) + diff --git a/session_3/session_3.Rmd b/session_3/session_3.Rmd index 3cca44e..3a91854 100644 --- a/session_3/session_3.Rmd +++ b/session_3/session_3.Rmd @@ -1,7 +1,11 @@ --- title: 'R.3: Transformations with ggplot2' -author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)" +author: + - "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" + - "Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)" date: "2022" +filters: + - callout-exercise --- ```{r include=FALSE} @@ -45,13 +49,13 @@ For example, we may want to have coordinates on an axis proportional to the numb We are going to use the `diamonds` data set included in `tidyverse`. -<div class="pencadre"> +::: {.callout-exercise} - Use the `help` and `View` commands to explore this data set. - How many records does this dataset contain ? - Try the `str` command. What information is displayed ? -</div> +::: ```{r str_diamon} str(diamonds) @@ -109,9 +113,9 @@ demo <- tribble( to guess their meaning from the context, and you will learn exactly what they do soon!) -<div class="pencadre"> +::: {.callout-exercise} So instead of using the default `geom_bar` parameter `stat = "count"` try to use `"identity"` -</div> +::: <details><summary>Solution</summary> <p> @@ -129,9 +133,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop), group = 1)) geom_bar() ``` -<div class="pencadre"> +::: {.callout-exercise} In our proportion bar chart, we need to set `group = 1`. Why? 
-</div> +::: <details><summary>Solution</summary> <p> @@ -146,24 +150,18 @@ If `group` is not used, the proportion is calculated with respect to the data th ### More details with `stat_summary` -<div class="pencadre"> You might want to draw greater attention to the statistical transformation in your code. You might use `stat_summary()`, which summarize the **y** values for each unique **x** value, to draw attention to the summary that you are computing. -</div> -<details><summary>Solution</summary> -<p> ```{r 3_c, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE} ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) + stat_summary() ``` -</p> -</details> -<div class="pencadre"> +::: {.callout-exercise} Set the `fun.min`, `fun.max` and `fun` to the `min`, `max` and `median` function respectively. -</div> +::: <details><summary>Solution</summary> <p> @@ -182,9 +180,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) + You can color a bar chart using either the `color` aesthetic, or, more usefully `fill`. -<div class="pencadre"> +::: {.callout-exercise} Try both approaches on a `cut`, histogram. -</div> +::: <details><summary>Solution</summary> <p> @@ -202,9 +200,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) + You can also use `fill` with another variable. -<div class="pencadre"> +::: {.callout-exercise} Try to color by `clarity`. Is `clarity` a continuous or categorical variable ? -</div> +::: <details><summary>Solution</summary> <p> @@ -219,9 +217,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + The stacking of the `fill` parameter is performed by the position adjustment `position`. 
-<div class="pencadre"> +::: {.callout-exercise} Try the following `position` parameter for `geom_bar`: `"fill"`, `"dodge"` and `"jitter"` -</div> +::: <details><summary>Solution</summary> @@ -245,9 +243,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + `jitter` is often used for plotting points when they are stacked on top of each other. -<div class="pencadre"> +::: {.callout-exercise} Compare `geom_point` to `geom_jitter` plot `cut` versus `depth` and color by `clarity` -</div> +::: <details><summary>Solution</summary> <p> @@ -263,9 +261,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) + </p> </details> -<div class="pencadre"> +::: {.callout-exercise} What parameters of `geom_jitter` control the amount of jittering ? -</div> +::: <details><summary>Solution</summary> <p> @@ -276,7 +274,12 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) + </p> </details> -In the `geom_jitter` plot that we made, we cannot really see the limits of the different clarity groups. Instead we can use the `geom_violin` to see their density. +In the `geom_jitter` plot that we made, we cannot really see the limits of the different clarity groups. +A `violin` plot is often used to display their density. + +::: {.callout-exercise} +Use `geom_violin` instead of `geom_jitter`. +::: <details><summary>Solution</summary> <p> @@ -296,9 +299,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) + geom_boxplot() ``` -<div class="pencardre"> +::: {.callout-exercise} Add the `coord_flip()` layer to the previous plot. -</div> +::: <details><summary>Solution</summary> <p> @@ -310,8 +313,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) + </p> </details> -<div class="pencardre"> -Add the `coord_polar()` layer to this plot: +::: {.callout-exercise} +Add the `coord_polar()` layer to the following plot. 
+::: ```{r diamonds_bar, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE, eval=FALSE} ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) + @@ -319,7 +323,6 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) + theme(aspect.ratio = 1) + labs(x = NULL, y = NULL) ``` -</div> <details><summary>Solution</summary> <p> @@ -350,9 +353,9 @@ library(gganimate) library(gifski) ``` -<div class="pencardre"> +::: {.callout-exercise} Use the `openxlsx` package to save the [gapminder.xlsx](https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx) file into the `gapminder` variable. -</div> +::: <details><summary>Solution</summary> <p> @@ -378,9 +381,9 @@ This dataset contains 4 variables of interest for us to display per country: - `pop` the population size - `contient` a factor with 5 levels -<div class="pencardre"> +::: {.callout-exercise} Using `ggplot2`, build a scatterplot of the `gdpPercap` vs `lifeExp`. Add the `pop` and `continent` information to this plot. -</div> +::: <details><summary>Solution</summary> <p> @@ -391,10 +394,10 @@ ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) + </p> </details> -<div class="pencardre"> +::: {.callout-exercise} What's wrong ? You can use the `scale_x_log10()` to display the `gdpPercap` on the `log10` scale. -</div> +::: <details><summary>Solution</summary> @@ -407,11 +410,12 @@ ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) + </p> </details> -<div class="pencardre"> + We would like to add the `year` information to the plots. We could use a `facet_wrap`, but instead we are going to use the `gganimate` package. -For this we need to add a `transition_time` layer that will take as an argument `year` to our plot. -</div> +::: {.callout-exercise} +Add a `transition_time` layer that will take as an argument `year` to our plot. 
+::: <details><summary>Solution</summary> <p> diff --git a/session_4/session_4.Rmd b/session_4/session_4.Rmd index 96ea42b..537de42 100644 --- a/session_4/session_4.Rmd +++ b/session_4/session_4.Rmd @@ -1,7 +1,11 @@ --- title: "R.4: data transformation" -author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)" +author: + - "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" + - "Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)" date: "2022" +filters: + - callout-exercise --- ```{r include=FALSE} @@ -30,9 +34,9 @@ The objectives will be to: For this session, we are going to work with a new dataset included in the `nycflights13` package. -<div class="pencadre"> +::: {.callout-exercise} Install this package and load it. As usual you will also need the `tidyverse` library. -</div> +::: <details><summary>Solution</summary> <p> @@ -118,9 +122,9 @@ filter(flights, month %in% c(5, 6, 7, 12)) `dplyr` functions never modify their inputs, so if you want to save the result, you'll need to use the assignment operator, `<-`. -<div class="pencadre"> +::: {.callout-exercise} Save the flights longer than 680 minutes in a `long_flights` variable. -</div> +::: <details><summary>Solution</summary> <p> @@ -146,8 +150,9 @@ In R you can use the symbols `&` (and), `|` (or), `!` (not) and the function `xo  -<div class="pencadre"> -Display the `long_flights` variable and predict the results of: +::: {.callout-exercise} +Display the `long_flights` variable and predict the results of the following operations. 
+::: ```{r logical_operators_exemples2, eval=FALSE} filter(long_flights, day <= 15 & carrier == "HA") @@ -155,8 +160,6 @@ filter(long_flights, day <= 15 | carrier == "HA") filter(long_flights, (day <= 15 | carrier == "HA") & (!month > 2)) ``` - -</div> <details><summary>Solution</summary> <p> @@ -173,8 +176,9 @@ filter(long_flights, (day <= 15 | carrier == "HA") & (!month > 2)) -<div class="pencadre"> +::: {.callout-exercise} Test the following operations and translate them with words. +::: ```{r filter_logical_operators_a, eval=FALSE} filter(flights, month == 11 | month == 12) @@ -196,20 +200,20 @@ filter(flights, arr_delay <= 120 & dep_delay <= 120) filter(flights, arr_delay <= 120, dep_delay <= 120) ``` -</div> - +::: {.callout-tip} Combining logical operators is a powerful programmatic way to select subset of data. However, keep in mind that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of a long one. +::: - -<div class="pencadre"> R either prints out the results, or saves them to a variable. + +::: {.callout-exercise} What happens when you put your variable assignment code between parenthesis `(` `)` ? +::: ```{r filter_month_day_sav_display, eval=FALSE} (dec25 <- filter(flights, month == 12, day == 25)) ``` -</div> ### Missing values @@ -245,12 +249,10 @@ filter(df, is.na(y) | y > 1) ### Challenges -<div class="pencadre"> Find all flights that: - Had an arrival delay (`arr_delay`) of two or more hours (you can check `?flights`) - Flew to Houston (IAH or HOU) -</div> <details><summary>Solution</summary> <p> @@ -260,9 +262,7 @@ filter(flights, arr_delay >= 120 & dest %in% c("IAH", "HOU")) </p> </details> -<div class="pencadre"> How many flights have a missing `dep_time` ? -</div> <details><summary>Solution</summary> <p> @@ -273,9 +273,7 @@ filter(flights, is.na(dep_time)) </p> </details> -<div class="pencadre"> Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? 
Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!) -</div> <details><summary>Solution</summary> <p> @@ -321,14 +319,11 @@ arrange(df, desc(y)) ``` ### Challenges -<div class="pencadre"> - Find the most delayed flight at arrival (`arr_delay`). - Find the flight that left earliest (`dep_delay`). - How could you arrange all missing values to the start in the `df` tibble ? -</div> - <details><summary>Solution</summary> <p> @@ -391,9 +386,6 @@ See `?select` for more details. ### Challenges -<div class="pencadre"> -<p> - - Brainstorm as many ways as possible to select only `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`. You can associate several selections arguments with `|` , `&` and `!`. The simplest way to start: @@ -403,7 +395,6 @@ df_dep_arr <- select(flights, dep_time, dep_delay, arr_time, arr_delay) colnames(df_dep_arr) ``` - <details><summary>Other solutions</summary> <p> @@ -473,30 +464,24 @@ select(flights, contains("TIME", ignore.case = FALSE)) ``` </p> </details> - - -</p> -</div> - ## Add new variables with `mutate()` It's often useful to add new columns that are functions of existing columns. That's the job of `mutate()`. -<div class="pencadre"> -First let's create a thinner dataset to work on `flights_thin` that contains: +We will first create a thinner dataset `flights_thin_toy` to work on `flights_thin` that contains: - columns from `year` to `day` - columns that ends with `delay` - the `distance` and `air_time` columns - the `dep_time` and `sched_dep_time` columns -Then let's create an even smaller toy dataset to test your commands before using them on the larger one (It a good reflex to take). For that you can use the function `head` or `sample_n` for a random sampling alternative. - -- select only 5 rows +Then we will create an even smaller toy dataset `flights_thin_toy2` to test our commands before using them on the larger one (it is a good reflex to have). 
For that you can use the function `head` or `sample_n` for a random sampling alternative. -</div> +::: {.callout-exercise} +Create both `flights_thin_toy` and `flights_thin_toy2`, select only 5 rows for the latter. +::: <details><summary>Solution</summary> @@ -526,10 +511,9 @@ We can create a `gain` column, which can be the difference between departure and mutate(flights_thin_toy, gain = dep_delay - arr_delay) ``` -<div class="pencadre"> - -Using `mutate` to add a new column `gain` and `speed` that contains the average speed of the plane to the `flights_thin_toy` tibble (speed = distance / time). -</div> +::: {.callout-exercise} +Use `mutate` to add the new columns `gain` and `speed` (the average speed of the plane) to the `flights_thin_toy` tibble (speed = distance / time). +::: <details><summary>Solution</summary> <p> @@ -545,8 +529,11 @@ flights_thin_toy </details> -<div class="pencadre"> -Currently `dep_time` and `sched_dep_time` are convenient to look at, but difficult to work with, as they're not really continuous numbers (see the help to get more information on these columns). In the flight dataset, convert them to a more convenient representation of the number of minutes since midnight. +Currently `dep_time` and `sched_dep_time` are convenient to look at, but difficult to work with, as they're not really continuous numbers (see the help to get more information on these columns). + +::: {.callout-exercise} +In the flight dataset, convert `dep_time` and `sched_dep_time` to a more convenient representation of the number of minutes since midnight. +::: **Hints** : @@ -566,7 +553,6 @@ HH * 60 + MM It is always a good idea to decompose a problem in small parts. First, only start with `dep_time`. Build the HH and MM columns. Then, try to write both conversions in one row. 
-</div> <details><summary> Partial solution </summary> <p> @@ -679,7 +665,10 @@ Modify the colors representing the class of cars with the palettes `Dark2` of [R ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() ``` + +::: {.callout-exercise} Go to the links to find the appropriate function: they are very similar between the two packages. +::: <details><summary>Solution</summary> <p> @@ -714,7 +703,10 @@ For the next part, we will use a real data set. Anterior tibial muscle tissue wa First, we will use the gene count table of these samples, formatted for use in ggplot2 ( `pivot_longer()` [function](https://tidyr.tidyverse.org/reference/pivot_longer.html) ). -Open the csv file using the `read_csv2()` function. The file is located at "https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv". +::: {.callout-exercise} +Open the csv file using the `read_csv2()` function. The file is located at: +<span style="font-size:0.75em;"><i>https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv</i></span> +::: <details><summary>Solution</summary> <p> @@ -737,8 +729,10 @@ or you can read it from the following url: </p> </details> +::: {.callout-exercise} With this tibble, use `ggplot2` and the `geom_tile()` function to make a heatmap. Fit the samples on the x-axis and the genes on the y-axis. +::: **Tip**: Transform the counts into log10(x + 1) for a better visualization. @@ -767,7 +761,9 @@ R interprets a large number of colors, indicated in RGB, hexadecimal, or just by {width=400px} </center> +::: {.callout-exercise} With `scale_fill_gradient2()` function, change the colors of the gradient, taking "white" for the minimum value and "springgreen4" for the maximum value. 
+::: <details><summary>Solution</summary> <p> @@ -780,7 +776,10 @@ DM1_tile_base + scale_fill_gradient2(low = "white", high = "springgreen4") </details> It's better, but still not perfect! -Now let's use the [viridis color gradient](https://gotellilab.github.io/GotelliLabMeetingHacks/NickGotelli/ViridisColorPalette.html) for this graph. + +::: {.callout-exercise} +Use the [viridis color gradient](https://gotellilab.github.io/GotelliLabMeetingHacks/NickGotelli/ViridisColorPalette.html) for this graph. +::: <details><summary>Solution</summary> <p> @@ -795,7 +794,10 @@ DM1_tile_base + scale_fill_viridis_c() For this last exercise, we will use the results of the differential gene expression analysis between DM1 vs WT conditions. -Open the csv file using the `read_csv2()` function. The file is located at "http://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv". +::: {.callout-exercise} +Open the csv file using the `read_csv2()` function. The file is located at: +<span style="font-size:0.75em;"><i>http://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv</i></span> +::: <details><summary>Solution</summary> <p> @@ -821,10 +823,14 @@ To make a Volcano plot, displaying different information on the significance of With `mutate()` and `ifelse()` [fonctions](https://dplyr.tidyverse.org/reference/if_else.html), we will have to create: -- a column 'sig': it indicates if the gene is significant ( TRUE or FALSE ). +- a column `sig`: it indicates if the gene is significant ( TRUE or FALSE ). **Thresholds**: baseMean > 20 and padj < 0.05 and abs(log2FoldChange) >= 1.5 -- a column 'UpDown': it indicates if the gene is significantly up-regulated (Up), down-regulated (Down), or not significantly regulated (NO). +- a column `UpDown`: it indicates if the gene is significantly up-regulated (Up), down-regulated (Down), or not significantly regulated (NO). 
+ +::: {.callout-exercise} +Create the columns `sig` and `UpDown`. +::: <details><summary>Solution</summary> <p> @@ -844,7 +850,10 @@ With `mutate()` and `ifelse()` [fonctions](https://dplyr.tidyverse.org/reference </details> We want to see the top10 DEGs on the graph. For this, we will use the package `ggrepel`. + +::: {.callout-exercise} Install and load the `ggrepel` package. +::: <details><summary>Solution</summary> <p> @@ -860,10 +869,13 @@ library(ggrepel) </details> -Let's **filter** out the table into a new variable, top10, to keep only the significant differentially expressed genes, those with the top 10 adjusted pvalue. The **smaller** the adjusted pvalue, the more significant the gene. +Let's **filter** out the table into a new variable, `top10`, to keep only the significant differentially expressed genes, those with the top 10 adjusted pvalue. The **smaller** the adjusted pvalue, the more significant the gene. +::: {.callout-exercise} +Create the new variable `top10`. +::: -**Tips**: You can use the [function](https://dplyr.tidyverse.org/reference/slice.html) `slice_min()` +**Tip**: You can use the [function](https://dplyr.tidyverse.org/reference/slice.html) `slice_min()`. <details><summary>Solution</summary> <p> @@ -883,9 +895,9 @@ Let's **filter** out the table into a new variable, top10, to keep only the sign The data is ready to be used to make a volcano plot! -<div class="pencadre"> +::: {.callout-exercise} To make the graph below, use `ggplot2`, the functions `geom_point()`, `geom_hline()`, `geom_vline()`, `theme_minimal()`, `theme()` (to remove the legend), `geom_label_repel()` and the function `scale_color_manual()` for the colors. -</div> +::: - **Tips 1**: Don't forget the transformation of the adjusted pvalue. 
diff --git a/session_5/session_5.Rmd b/session_5/session_5.Rmd index b2d1843..64c00e6 100644 --- a/session_5/session_5.Rmd +++ b/session_5/session_5.Rmd @@ -1,7 +1,10 @@ --- title: "R.5: Pipping and grouping" -author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" +author: + - "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" date: "2022" +filters: + - callout-exercise --- ```{r include=FALSE} @@ -26,11 +29,12 @@ The objectives will be to: - Combining multiple operations with the pipe `%>%` - Work on subgroup of the data with `group_by` -<div class="pencadre"> + For this session, we are going to work with a new dataset included in the `nycflights13` package. -Install this package and load it. -As usual you will also need the `tidyverse` library. -</div> + +::: {.callout-exercise} +Install this package and load it. As usual you will also need the `tidyverse` library. +::: <details><summary>Solution</summary> <p> @@ -43,9 +47,9 @@ library("nycflights13") ## Combining multiple operations with the pipe -<div id="pencadre"> +::: {.callout-exercise} Find the 10 most delayed flights using the ranking function `min_rank()`. -</div> +::: <details><summary>Solution</summary> <p> @@ -65,9 +69,9 @@ We don't want to create useless intermediate variables so we can use the pipe op Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. -<div id="pencadre"> +::: {.callout-exercise} Try to pipe operators to rewrite your precedent code with only **one** variable assignment. 
-</div> +::: <details><summary>Solution</summary> <p> @@ -128,15 +132,15 @@ ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) + theme(axis.text.x = element_blank()) ``` -<div class="pencadre"> +::: {.callout-exercise} Why did we `group_by` `year` and `month` and not only `year` ? -</div> +::: ### Missing values -<div class="pencadre"> +::: {.callout-exercise} You may have wondered about the `na.rm` argument we used above. What happens if we don't set it? -</div> +::: ```{r summarise_group_by_NA, include=TRUE} flights %>% @@ -170,9 +174,11 @@ ggplot(summ_delay_filghts, mapping = aes(x = avg_distance, y = avg_delay, size = theme(legend.position = "none") ``` -<div class="pencadre"> +::: {.callout-exercise} Imagine that we want to explore the relationship between the average distance (`distance`) and average delay (`arr_delay`) for each location (`dest`) and recreate the above figure. -Here are three steps to prepare those data: +::: + +**Hints** Here are the steps to prepare those data: 1. Group flights by destination. 2. Summarize to compute average distance (`avg_distance`), average delay (`avg_delay`), and number of flights using `n()` (`n_flights`). @@ -181,7 +187,7 @@ Here are three steps to prepare those data: 5. Create a `mapping` on `avg_distance`, `avg_delay` and `n_flights` as `size`. 6. Use the layer `geom_point()` and `geom_smooth()` (use method = lm) 7. We can hide the legend with the layer `theme(legend.position='none')` -</div> + <details><summary>Solution</summary> <p> @@ -208,9 +214,9 @@ flights %>% If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`. -<div class="pencadre"> -Try the following example -</div> +::: {.callout-exercise} +Try the following example. +::: ```{r ungroup, eval=T, message=FALSE, cache=T} flights %>% @@ -223,9 +229,6 @@ flights %>% ### First challenge -<div class="pencadre"> - - Look at the number of canceled flights per day. Is there a pattern? 
(A canceled flight is a flight where either the `dep_time` or the `arr_time` is `NA`) @@ -239,7 +242,6 @@ Look at the number of canceled flights per day. Is there a pattern? - We can use `geom_col` to have a barplot of the number of `cancel_day` for each `wday`. - You can use the function `fct_reorder()` to reorder the `wday` by number of `cancel_day` and make the plot easier to read. -</div> <details><summary>Solution</summary> <p> @@ -262,9 +264,7 @@ flights %>% ### Second challenge -<div class="pencadre"> Is the proportion of canceled flights by day of the week related to the average departure delay? -</div> <details><summary>Solution</summary> <p> @@ -288,10 +288,8 @@ Which day would you prefer to book a flight ? </p> </details> -<div class="pencadre"> We can add error bars to this plot to justify our decision. Brainstorm a way to have access to the mean and standard deviation of the `prop_cancel_day` and `av_delay`. -</div> <details><summary>Solution</summary> <p> @@ -331,9 +329,8 @@ flights %>% </p> </details> -<div class="pencadre"> + Now that you have seen how useful `geom_errorbar` is, what `hour` of the day should you fly if you want to avoid delays as much as possible? -</div> <details><summary>Solution</summary> <p> @@ -364,9 +361,7 @@ flights %>% ### Third challenge -<div class="pencadre"> Which carrier has the worst delays? -</div> <details><summary>Solution</summary> <p> @@ -383,9 +378,7 @@ flights %>% </p> </details> -<div class="pencadre"> Can you disentangle the effects of bad airports vs. bad carriers? 
(Hint: think about `group_by(carrier, dest) %>% summarise(n=n())`) -</div> <details><summary>Solution</summary> <p> diff --git a/session_6/session_6.Rmd b/session_6/session_6.Rmd index 449f989..3413668 100644 --- a/session_6/session_6.Rmd +++ b/session_6/session_6.Rmd @@ -1,7 +1,11 @@ --- title: "R.6: tidydata" -author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nCarine Rey [carine.rey@ens-lyon.fr](mailto:carine.rey@ens-lyon.fr)" +author: + - "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" + - "Carine Rey [carine.rey@ens-lyon.fr](mailto:carine.rey@ens-lyon.fr)" date: "2022" +filters: + - callout-exercise --- ```{r include=FALSE} @@ -32,9 +36,9 @@ Doing this kind and transformation is often called **data wrangling**, due to th But once this step is finish most of the subsequent analysis will be really fast to do ! -<div class="pencadre"> +::: {.callout-exercise} As usual we will need the `tidyverse` library. -</div> +::: <details><summary>Solution</summary> <p> @@ -46,9 +50,9 @@ library(tidyverse) For this session, we are going to use the `table*` set of datasets which demonstrate multiple ways to layout the same tabular data. -<div class="pencadre"> -Use the help to know more about `table1` dataset -</div> +::: {.callout-exercise} +Use the help to know more about `table1` dataset. +::: <details><summary>Solution</summary> @@ -105,15 +109,15 @@ long_example <- wide_example %>% #### Exercice -<div class="pencadre"> Visualize the `table4a` dataset (you can use the `View()` function). ```{r table4a, eval=F, message=T} View(table4a) ``` +::: {.callout-exercise} Is the data **tidy** ? How would you transform this dataset to make it **tidy** ? -</div> +::: <details><summary>Solution</summary> @@ -161,10 +165,12 @@ long_example %>% pivot_wider( #### Exercice -<div class="pencadre"> -Visualize the `table2` dataset + +Visualize the `table2` dataset. + +::: {.callout-exercise} Is the data **tidy** ? 
How would you transform this dataset to make it **tidy** ? (you can now also make a guess from the name of the subsection) -</div> +::: <details><summary>Solution</summary> <p> diff --git a/session_7/session_7.Rmd b/session_7/session_7.Rmd index 3521cba..f869258 100644 --- a/session_7/session_7.Rmd +++ b/session_7/session_7.Rmd @@ -1,7 +1,10 @@ --- title: "R.7: String & RegExp" -author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" +author: + - "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" date: "2022" +filters: + - callout-exercise --- ```{r include=FALSE} @@ -25,9 +28,9 @@ In R a sequence of characters is stored as a string. In this session you will learn the distinctive features of the string type and how we can use strings of characters within a programming language which is itself composed of particular strings of characters, such as function names and variables. -<div class="pencadre"> +::: {.callout-exercise} As usual we will need the `tidyverse` library. -</div> +::: <details><summary>Solution</summary> <p> @@ -126,9 +129,10 @@ regexps form a very terse language that allows you to describe patterns in strin To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. -<div class="pencadre"> +::: {.callout-exercise} + You need to install the `htmlwidgets` package to use these functions. -</div> +::: <details><summary>Solution</summary> <p> @@ -182,11 +186,38 @@ writeLines(x) str_view(x, "\\\\") ``` -### Exercises +::: {.callout-exercise} + +## Exercises + +1. Explain why each of these strings doesn't match a \: `"\"`, `"\\"`, `"\\\"`. +2. How would you match the sequence `"'\`? +3. What patterns will the regular expression `\..\..\..` match? How would you represent it as a string? 
-- How would you match the sequence `"'\`? -- What patterns will the regular expression `\..\..\..` match? How would you represent it as a string? +::: + +<details><summary>Solution</summary> +<p> + +1. + - `"\"`: would leave an open quote as `\"` would be interpreted as a literal double quote, + - `"\\"`: is the single character `\` once the string is parsed, which in a regular expression escapes the next character, so the pattern is incomplete, + - `"\\\"`: `\"` would again escape the quote so we would be left with an open quote. + +<p></p> +2. We would need the following pattern `"\\\"'\\\\"`: + + - `\\\"` to escape the double quote, + - `'` doesn't need to be escaped (because the string is defined within double quotes), + - `\\\\` to escape `\`. +<p></p> +3. It would match a string of the form: ".(anychar).(anychar).(anychar)" + ```{r str_dotstring, eval=F, message=FALSE, cache=T} + x <- c("alf.r.e.dd.ss..lsdf.d.kj") + str_view(x, "\\..\\..\\..") + ``` +</p> +</details> ### Anchors @@ -209,16 +240,37 @@ x <- c("apple pie", "apple", "apple cake") str_view(x, "^apple$") ``` -### Exercices +::: {.callout-exercise} + +## Exercises + +1. How would you match the literal string `"$^$"`? -- How would you match the literal string `"$^$"`? -- Given the corpus of common words in `stringr::words`, create regular expressions that find all words that: - - Start with "y". - - End with "x". - - Are exactly three letters long (Don't cheat by using `str_length()`!). - - Have seven letters or more. +2. Given the corpus of common words in `stringr::words`, create regular expressions that find all words that: -Since this list is long, you might want to use the match argument to `str_view()` to show only the matching or non-matching words. + a. Start with "y". + b. End with "x". + c. Are exactly three letters long (Don't cheat by using `str_length()`!). + d. Have seven letters or more. 
+ + Since this list is long, you might want to use the match argument to `str_view()` to show only the matching or non-matching words. 
+::: + +<details><summary>Solution</summary> +<p> + +1. We would need the pattern `"\\$\\^\\$"` + +<p></p> +2. + + a. start with "y": `"^y"` + b. end with "x": `"x$"` + c. three letters long: `"^...$"` + d. seven letters or more: `"......."` + +</p> +</details> ### Character classes and alternatives @@ -242,14 +294,38 @@ Like with mathematical expressions, if alternations ever get confusing, use pare str_view(c("grey", "gray"), "gr(e|a)y") ``` -### Exercices +::: {.callout-exercise} + +## Exercises Create regular expressions to find all words that: -- Start with a vowel. -- That only contains consonants. (Hint: thinking about matching "not"-vowels.) -- End with "ed", but not with "eed". -- End with "ing" or "ise". +1. Start with a vowel. +2. That only contains consonants (Hint: thinking about matching "not"-vowels). +3. End with "ed", but not with "eed". +4. End with "ing" or "ise". + +::: + +<details><summary>Solution</summary> +<p> + +1. start with a vowel: `"^[aeiouy]"` + +2. decomposition: + - start with a consonant: `"^[^aeiouy]"` + - contains one or more consonant: `"[^aeiouy]+"` + - end with a consonant: `"[^aeiouy]$"` + + result is: `"^[^aeiouy][^aeiouy]+[^aeiouy]$"`. + +3. `"[^e]ed$"` + +4. `"(ing|ise)$"` + +</p> +</details> + ### Repetition @@ -279,17 +355,42 @@ str_view(x, "C{2,}") str_view(x, "C{2,3}") ``` -### Exercices +::: {.callout-exercise} + +1. Describe in words what these regular expressions match (read carefully to see if I'm using a regular expression or a string that defines a regular expression): + + a. `^.*$` + b. `"\\{.+\\}"` + c. `\d{4}-\d{2}-\d{2}` + d. `"\\\\{4}"` + +2. Create regular expressions to find all words that: + + a. Start with three consonants. + b. Have three or more vowels in a row. + c. Have two or more vowel-consonant pairs in a row. + +::: + +<details><summary>Solution</summary> +<p> + +1. + + a. (regex) starts with anything and ends with anything, matches whole thing + b. 
(string regex) matches non-empty text in curly braces + c. (regex) matches date in format `yyyy-mm-dd` + d. (string regex) matches string that contains `\` repeated 4 times + +<p></p> +2. -- Describe in words what these regular expressions match (read carefully to see if I'm using a regular expression or a string that defines a regular expression): - - `^.*$` - - `"\\{.+\\}"` - - `\d{4}-\d{2}-\d{2}` - - `"\\\\{4}"` -- Create regular expressions to find all words that: - - Start with three consonants. - - Have three or more vowels in a row. - - Have two or more vowel-consonant pairs in a row. + a. `"^[^aeiouy]{3}"` + b. `"[aeiou]{3,}"` + c. `"([aeiou][^aeiou]){2,}"` + +</p> +</details> ### Grouping @@ -300,18 +401,47 @@ You learned about parentheses as a way to disambiguate complex expressions. Pare str_view(fruit, "(..)\\1", match = TRUE) ``` -### Exercices +::: {.callout-exercise} + +## Exercises + +1. Describe, in words, what these expressions will match: + + a. `"(.)\\1\\1"` + b. `"(.)(.)\\2\\1"` + c. `"(..)\\1"` + d. `"(.).\\1.\\1"` + e. `"(.)(.)(.).*\\3\\2\\1"` + +2. Construct regular expressions to match words that: + + a. Start and end with the same character. + b. Contain a repeated pair of letters (e.g. `"church"` contains `"ch"` repeated twice). + c. Contain one letter repeated in at least three places (e.g. `"eleven"` contains three `"e"`s). + +::: + +<details><summary>Solution</summary> +<p> + +1. + + a. matches a character repeated thrice + b. matches two characters followed by their reverse order ("abba") + c. 
`"([A-Za-z]).*\\1.*\\1"` + +</p> +</details> -- Describe, in words, what these expressions will match: - - `"(.)\1\1"` - - `"(.)(.)\\2\\1"` - - `"(..)\1"` - - `"(.).\\1.\\1"` - - `"(.)(.)(.).*\\3\\2\\1"` -- Construct regular expressions to match words that: - - Start and end with the same character. - - Contain a repeated pair of letters (e.g. `"church"` contains `"ch"` repeated twice). - - Contain one letter repeated in at least three places (e.g. `"eleven"` contains three `"e"`s). ### Detect matches @@ -402,9 +532,54 @@ has_noun %>% str_match(noun) ``` -### Exercises +::: {.callout-exercise} +Find all words that come after a `number` like `one`, `two`, `three` etc. Pull out both the number and the word. +::: -- Find all words that come after a `number` like `one`, `two`, `three` etc. Pull out both the number and the word. +<details><summary>Solution</summary> +<p> + +Start by creating a vector of words defining digits: +```{r digit_vec, eval=T, cache=T} +nums <- c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine") +``` + +Next, create the corresponding regular expression to catch any worded digit: +```{r digit_regex, eval=T, cache=T} +nums_c <- str_c(nums, collapse = "|") +``` + +Then, construct the full regular expression where: +`(?<![Y])X` means capture string `X` only if not preceded by string `Y`. +Here, `X` corresponds to our worded digit expression and `Y` is any letter (`:alpha:`). + +This way, `(?<![:alpha:]) (one|two|three|four|five|six|seven|eight|nine)` will match any of our digit only if not preceded by a letter. 
+ +We then add a blank space and `[A-Za-z]+` to capture the word following our worded digit: +```{r digit_regex_full, eval=T, cache=T} +re_str <- str_c("(?<![:alpha:])", "(", nums_c, ")", " ", "([A-Za-z]+)", sep = "") +``` + +Let's apply it to our sentences: +```{r sentences_digit_regex, eval=T, cache=T} +sentences %>% + # get the subset of sentences where a match occurred + str_subset(regex(re_str, ignore_case = TRUE)) %>% + # for each sentence get the matched string + str_extract_all(regex(re_str, ignore_case = TRUE)) %>% + # convert to vector + unlist() %>% + # convert to tibble + as_tibble_col(column_name = "expr") %>% + # split matched strings into components + tidyr::separate( + col = "expr", + into = c("digit", "word"), + remove = FALSE + ) +``` +</p> +</details> ### Replacing matches @@ -416,11 +591,45 @@ sentences %>% head(5) ``` -### Exercices +::: {.callout-exercise} + +## Exercises + +1. Replace all forward slashes in a string with backslashes. +2. Implement a simple version of `str_to_lower()` using `str_replace_all()`. +3. Switch the first and last letters in words. Which of those strings are still words? + +::: + +<details><summary>Solution</summary> +<p> + +1. We can use the function `str_replace_all` with a replacement string: + ```{r replacing_slashes, eval=T, cache=T} + test_str <- "/test/" + writeLines(test_str) + + test_str %>% + str_replace_all(pattern = "/", replacement = "\\\\") %>% + writeLines() + ``` + +2. We also can use the function `str_replace_all` with a replacement function: + ```{r replacing_to_lower, eval=T, cache=T} + sentences %>% + str_replace_all(pattern = "([A-Z])", replacement = tolower) %>% + head(5) + ``` + +3. Any words that start and end with the same letter and a few other examples like "war –> raw": + ```{r switching_words, eval=T, cache=T} + words %>% + str_replace(pattern = "(^.)(.*)(.$)", replacement = "\\3\\2\\1") %>% + head(5) + ``` -- Replace all forward slashes in a string with backslashes. 
-- Implement a simple version of `str_to_lower()` using `replace_all()`. -- Switch the first and last letters in words. Which of those strings are still words? +</p> +</details> ### Splitting diff --git a/session_8/session_8.Rmd b/session_8/session_8.Rmd index e875124..bf8f341 100644 --- a/session_8/session_8.Rmd +++ b/session_8/session_8.Rmd @@ -1,7 +1,10 @@ --- title: "R.8: Factors" -author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" +author: + - "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)" date: "2022" +filters: + - callout-exercise --- ```{r include=FALSE} @@ -24,9 +27,9 @@ In this session, you will learn more about the factor type in R. Factors can be very useful, but you have to be mindful of the implicit conversions from simple vector to factor ! They are the source of lot of pain for R programmers. -<div class="pencadre"> +::: {.callout-exercise} As usual we will need the `tidyverse` library. -</div> +::: <details><summary>Solution</summary> <p> -- GitLab