diff --git a/session_1/session_1.Rmd b/session_1/session_1.Rmd index a5995184ad45cb6598fe8e41d162a5f0c3e233e0..0739122a14522ca3f68e753eaccf11f1499ba244 100644 --- a/session_1/session_1.Rmd +++ b/session_1/session_1.Rmd @@ -59,7 +59,7 @@ computing and graphics supported by the *R Foundation for Statistical Computing* Reasons to use it: - It's open source, which means that we have access to every bit of underlying computer code to prove that our results are correct (which is always a good point in science). -- It’s free, well documented, and runs almost everywhere +- It's free, well documented, and runs almost everywhere - It has a large (and growing) user base among scientists - It has a large library of external packages available for performing diverse tasks. @@ -86,7 +86,7 @@ Unlike other statistical software programs like Excel, SPSS, or Minitab that pro This means that you have to write instructions for R. Which means that you are going to learn to write code / program in R. -R is usually used in a terminal in which you can type or paste your R code: +R is generally used in a terminal in which you can type or paste your R code:  @@ -95,13 +95,13 @@ But navigating between your terminal, your code and your plots can be tedious, t ### RStudio, the R Integrated development environment (*IDE*) An IDE application provides **comprehensive facilities** to computer programmers for -software development. Rstudio is **free** and **open-source**. +software development. RStudio is **free** and **open-source**. To open RStudio, you can install the [RStudio application](https://www.rstudio.com/products/rstudio/) and open the app. -Otherwise you can use the link and the login details provided to you by email. The web version of Rstudio is the same as the application expect that you can open it any recent browser. +Otherwise you can use the link and the login details provided to you by email. 
The web version of RStudio is the same as the application except that you can open it in any recent browser. -#### Rstudio interface +#### RStudio interface  @@ -118,7 +118,7 @@ The same console as before (in Red box) We are now going to write our first commands. We could do it directly in the R console, with multi-line commands but this process is tedious. -Instead we are going to use the Rstudio code editor panel, to write our code. +Instead we are going to use the RStudio code editor panel, to write our code. You can go to **File > New File > R script** to open your editor panel. Beside, you can keep your code history. @@ -126,15 +126,15 @@ Beside, you can keep your code history.  -### How to execute R code in Rstudio ? +### How to execute R code in RStudio ? -RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can +RStudio gives you great flexibility in running code from the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can: -- click on the `Run button` above the editor panel, or +- click on the `Run` button above the editor panel, or - select `Run Selected Lines` from the `Code` menu, or -- hit `Ctrl`+`Return` in Windows or Linux or `Cmd`+`Return` on OS X. To run a block of code, select it and then Run. +- hit `Ctrl`+`Return` in Windows or Linux or `Cmd`+`Return` on OS X. To run a block of code, select it and then click on `Run`. -If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run, you can use the next button along, Rerun the previous region. This will run the previous code block including the modifications you have made. +If you have modified a line of code within a block of code you have just run, there is no need to re-select the section and press `Run`. 
Instead, you can use the next button `Re-run the previous code region`. This will run the previous code block including the modifications you have made. ## R as a Calculator @@ -148,10 +148,10 @@ Now that we know what we should do and what to expect, we are going to try some - Exponents: `^` or `**` - Parentheses: `(`, `)` -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> Now Open RStudio. -You can `copy paste` but I advise you to practice writing directly in the terminal. +You can `copy paste` but I advise you to practice writing directly in the terminal. Like all the languages, you will become more familiar with R by using it. To validate the line at the end of your command: press `Return`. @@ -165,7 +165,7 @@ You should see a `>` character before a blinking cursor. The `>` is called a pro 1 + 100 ``` -For classical output R will write the results with a `[N]` with `N` the row number. +For classical output, R will write the results with a `[N]` with `N` the row number. Here you have a one-line results `[1]` ```{r calculatorstep1res, echo=F, eval=T} @@ -195,7 +195,7 @@ It is waiting for the next command. Write just `100` and press `⏎`: The R console is a textual interface, which means that you will enter code, but it also means that R is going to write information back to you and that you will have to pay attention at what is written. -There are 3 categories of messages that R can send you: **Errors** prefaced with `Error in…`, **Warnings** prefaced with `Warning:` and **Messages** which don’t start with either `Error` or `Warning`. +There are 3 categories of messages that R can send you: **Errors** prefaced with `Error in…`, **Warnings** prefaced with `Warning:` and **Messages** which don't start with either `Error` or `Warning`. - **Errors**, you must consider them as red light. You must figure out what is causing it. Usually you can find useful clues in the errors message about how to solve it. 
- **Warning**, warnings are yellow light. The code is running but you have to pay attention. It's almost always a good idea to try to fix warnings. @@ -216,7 +216,7 @@ You can use parenthesis `(` `)` to change this order. (3 + 5) * 2 ``` -But to much parenthesis can be hard to read +But too much parenthesis can be hard to read ```{r calculatorstep5, include=TRUE} (3 + (5 * (2 ^ 2))) # hard to read @@ -243,7 +243,7 @@ You can use `e` to write your own scientific notation. ### Mathematical functions R is distributed with a large number of existing functions. -To call mathematical function you must with `function_name(<number>)`. +To call a mathematical function, you must use `function_name(<number>)`. For example, for the natural logarithm: @@ -280,13 +280,13 @@ If we want our future programs to be able to perform automatic choices, we need Comparisons can be made with R. The result will return a `TRUE` or `FALSE` value (which is not a number as before but a `boolean` type). -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> Try the following operator to get a `TRUE` then change your command to get a `FALSE`. You can use the `↑` (upper arrow) key to edit the last command and go through your history of commands </div> -- equality (note two equal signs read as "is equal to") +- equality (note: two equal signs read as "is equal to") ```{r calculatorstep13, include=TRUE} 1 == 1 @@ -312,7 +312,7 @@ You can use the `↑` (upper arrow) key to edit the last command and go through 1 > 0 ``` -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> **Summary so far** - R is a programming language and free software environment for statistical @@ -325,14 +325,14 @@ computing and graphics (free & opensource) with a large library of external pack ## Variables and assignment -In addition to being able to perform a huge number of computations very fast, computers can also store information to memory. 
+In addition to being able to perform a huge number of computations very fast, computers can also store information in memory. This is a mandatory function to load your data and store intermediate states in your analysis. -In R `<-` is the assignment operator (read as left members take right member value). +In R `<-` is the assignment operator (read as: the left member takes the right member's value). -` = ` Also exists but is **not recommended!** It will be used preferentially in other cases. (*We will see them later*). +`=` also exists but is **not recommended!** It will be used preferentially in other cases. (*We will see them later*). If you really don't want to press two consecutive keys for assignment, you can press `alt` + `-` to write `<-`. -Rstudio provides lots of such shortcuts (you can display them by pressing `alt` + `shift` + `k`). +RStudio provides lots of such shortcuts (you can display them by pressing `alt` + `shift` + `k`). We assign a value to `x`, `x` is called a variable. @@ -351,6 +351,7 @@ x You now see the `x` value in the environment box (*in red*).  + This **variable** is present in your work environment. You can use it to perform different mathematical applications. @@ -372,7 +373,7 @@ y A variable can be assigned a `numeric` value as well as a `character` value. -Just put our character (or string) between double quote `"` when you assign this value. +Just put the character (or string) between double quotes `"` when you assign this value. ```{r VandAstep6, include=TRUE} z <- "x" # One character z @@ -394,8 +395,8 @@ b typeof(b) ``` -You can type `is.` and press `tabulation`. -Rstudio will show you a list of function whose names start with `is.`. +You can type `is.` and press the `tabulation` key (`↹`). +RStudio will show you a list of functions whose names start with `is.`. This is called autocompletion, don't hesitate to spam your `tabulation` key as you write R code. 
### Variables names @@ -414,7 +415,7 @@ camelCaseToSeparateWords What you use is up to you, but be consistent. -<div class="pencadre"> Which of the following are valid R variable names?</div> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> Which of the following are valid R variable names?</div> ```{r eval=F, } min_height @@ -454,7 +455,7 @@ A R function can have different arguments function (x, base = exp(1)) ``` -- `base` is a named argument are read from left to right +- `base` is a named argument read from left to right - named arguments breaks the reading order - named arguments make your code more readable @@ -474,7 +475,7 @@ This block allows you to view the different outputs (?help, graphs, etc.).  -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> Test that your `logarithm` function can work in base 10 </div> @@ -489,7 +490,7 @@ Test that your `logarithm` function can work in base 10 ### Writing function -We can define our own function with : +We can define our own function with: - function name, - declaration of function type: `function`, @@ -508,7 +509,7 @@ function_name <- function(a, b){ - a series of operations, The argument `a` and `b` are accessible from within the function body as the variable `a` and `b`. -In the function body argument are independant of the global environment. +In the function body argument are independent of the global environment. ```R function_name <- function(a, b){ @@ -518,7 +519,7 @@ function_name <- function(a, b){ } ``` -- `return` operation +- `return` operation, At the end of a function we want to return a result, so function calls will be equal to this result. @@ -532,11 +533,11 @@ function_name <- function(a, b){ **Note: ** if you don't use `return` by default the evaluation of the last line of your function body is returned. 
-**Note: ** The function variables (here `a` and `b`) are independant of the global environment: They define to which values the operation will be applied in the function body. +**Note: ** The function variables (here `a` and `b`) are independent of the global environment: They define to which values the operation will be applied in the function body. - The order of arguments is important -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> Predict the result of R1, R2 and R3. ```R @@ -592,7 +593,7 @@ minus(b,a) - Naming variables is more explicit and bypasses the order. -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> Predict the result of R1, R2, R3 and R4. ```R a <- 10 @@ -662,7 +663,7 @@ R4 </details> -- Default values for arguments may be set at definition and the Default value is used when argument is not provided. +- Default values for arguments may be set at definition and the default value is used when argument is not provided. ```{r minus10, include=TRUE} minus_10 <- function(a, b=10){ @@ -682,14 +683,14 @@ print_hw <- function(){ } ``` -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> What is the difference between `print_hw` and `print_hw()` ? </div> <details><summary>Solution</summary> <p> -`print_hw` is considered as an environment variable, and R return the definition of `print_hw`. +`print_hw` is considered as an environment variable, and R returns the definition of `print_hw`. You need to add `()` to execute it ```{r print_hw_env, include=TRUE} @@ -706,18 +707,18 @@ print_hw() ### Some exercices -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> 1. Try a function (`rect_area`) to calculate the area of a rectangle of length "L" and width "W" 2. (more difficult) Try a function (`even_test`) to test if a number is even? 
-For that, you can use the `%%` modulo operators to get the remainder of an euclidean division and use the `==` comparison to test if the results +For that, you can use the modulo operator `%%` to get the remainder of an euclidean division and use the comparison `==` to test if the results of the modulo is equal to `0`. ```{r modulo, include=TRUE} 13 %% 2 ``` -3. Using your `even_test` function, write a new function `even_print` which will print "This number is even" or "This number is odd". You will need the `if else` statement and the function `print`. Find help on how to use them. +3. Using your `even_test` function, write a new function `even_print` which will print the string "This number is even" or "This number is odd". You will need the `if`, `else` statements and the function `print`. Find help on how to use them. </div> @@ -794,14 +795,14 @@ even_print(3) ### Cleaning up -We can now clean your environment +We can now clean our environment ```{r VandAstep15, include=TRUE} rm(minus) ``` -What appenned in the *Environment* panel ? -Check the documentation of this command +What happened in the *Environment* panel ? +Check the documentation of this command. <details><summary>Solution</summary> <p> @@ -815,7 +816,7 @@ Check the documentation of this command ls() ``` -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> Combine `rm` and `ls` to cleanup your *Environment* </div> @@ -831,25 +832,25 @@ rm(list = ls()) ls() ``` -<div class='pencadre'> +<div class='pencadre'> <!-- TODO: replace with quarto callout --> **Summary so far:** - - Assigning a variable is done with ` <- `. + - Assigning a variable is done with `<-`. - The assigned variables are listed in the environment box. - Variable names can contain letters, numbers, underscores and periods. - - Functions are also variable and can write in several forms - - An editing box is available on Rstudio. + - Functions are also variables and can write in several forms. 
+ - An editing box is available on RStudio. </div> ## Complex variable type You can only go so far with the variables we have already seen. -In R there are also **complex variable type**, which can be seen as combination of simple variable type. +In R there are also **complex variable type**, which can be seen as a combination of simple variable types. ### Vector (aka list) -Vectors are simple list of variable of the same type +Vectors are simple list of variable of the same type: ```{r Vecstep1, include=TRUE} c(1, 2, 3, 4, 5) @@ -909,20 +910,20 @@ x == y ### Accessing values There are multiple ways to access or replace values in vectors or other data structures. The most common approach is to use "indexing". -In the below, note that brackets `[ ]` are used for indexing, whereas you have already seen that parentheses `( )` are used to call a function and `{ }` to define function. It is very important not to mix these up. +In what follows, note that brackets `[ ]` are used for indexing, whereas you have already seen that parentheses `( )` are used to call a function and `{ }` to define function. It is very important not to mix these up. Here are some examples that show how elements of vectors can be obtained by indexing. -You can use the position(s) of the value(s) in the vector +You can use the position(s) of the value(s) in the vector: ```{r index1, include=TRUE} x <- c(1,5,7,8) x[4] x[c(1,3,4)] ``` -You can use booleans to define which values should be kept. +You can use booleans to define which values should be kept: ```{r index2, include=TRUE} x <- c(1,5,7,8,15) @@ -933,14 +934,14 @@ y <- c(TRUE,FALSE,FALSE,FALSE,TRUE) x[y] ``` -You can use names in the case of a named vector. 
+You can use names in the case of a named vector: ```{r index3, include=TRUE} x <-c(a = 1, b = 2, c = 3, d = 4, e = 5) x[c("a","c")] ``` -You can also use an index to change values +You can also use an index to change values: ```{r index4, include=TRUE} x <- c(1,5,7,8,15) @@ -951,7 +952,7 @@ x[x>5] <- 13 x ``` -<div class="pencadre"> +<div class="pencadre"> <!-- TODO: replace with quarto callout --> **Summary so far** - A variable can be of different types : `numeric`, `character`, `vector`, `function`, etc. @@ -960,26 +961,26 @@ x </div> -We will see other complex variables type during this formation. +We will see other complex variable types during this course. ## Packages -R base is like a new smartphone, you can do loots of things with it but you can also install new apps to a huge range of other things. +Base R is like a new smartphone: you can do lots of things with it, but you can also install new apps to do a huge range of other things. In R those apps are called **packages**. There are different sources to get packages from: - The [CRAN](https://cran.r-project.org/) which is the default source - [Bioconducor](http://www.bioconductor.org) which is another source specialized for biology packages -- Directly from [github](https://github.com/) +- Directly from [GitHub](https://github.com/) -To install packages from [Bioconducor](http://www.bioconductor.org) and [github](https://github.com/) you will need to install specific packages from the [CRAN](https://cran.r-project.org/). +To install packages from [Bioconductor](http://www.bioconductor.org) and [GitHub](https://github.com/) you will need to install specific packages from the [CRAN](https://cran.r-project.org/). ### Installing packages #### From CRAN -To install packages, you can use the `install.packages` function (don't forget to use tabulation for long variable names). +To install packages, you can use the `install.packages` function (don't forget to use tabulation for long variable names). 
For instance: ```R install.packages("tidyverse") @@ -989,16 +990,16 @@ or you can click on `Tools` and `Install Packages...`  -Install also the `ggplot2` package. +<!-- Install also the `ggplot2` package. --> -<details><summary>Solution</summary> -<p> -```R -install.packages("ggplot2") -``` -</p> -</details> +<!-- <details><summary>Solution</summary> --> +<!-- <p> --> +<!-- ```R --> +<!-- install.packages("ggplot2") --> +<!-- ``` --> +<!-- </p> --> +<!-- </details> --> #### From Bioconducor @@ -1017,9 +1018,9 @@ Then to install, for example "tximport", you just have to write: BiocManager::install("tximport") ``` -#### From github +#### From GitHub -If you need to install a package that is not available on the CRAN but on a github repository, you can do it using the "remotes" package. Indeed this package imports functions that will allow you to install a package available on [github](https://github.com/) or bitbucket or gitlab directly on your computer. +If you need to install a package that is not available on the CRAN but on a GitHub repository, you can do it using the "remotes" package. Indeed this package imports functions that will allow you to install a package available on [GitHub](https://github.com/) or Bitbucket or GitLab directly on your computer. To use the "remotes" packages, you must first install it: @@ -1027,33 +1028,33 @@ To use the "remotes" packages, you must first install it: install.packages("remotes") ``` -Once "remotes" is installed, you will be able to install all R package from github or from their URL. +Once "remotes" is installed, you will be able to install all R packages from GitHub or from their URL. 
+For example, if you want to install the latest version of "gganimate", which allows you to animate ggplot2 graphs, you can use: ```R remotes::install_github("thomasp85/gganimate") ``` -By default the latest version of the package is installed, if you want a given version you can specify it : +By default the latest version of the package is installed; if you want a given version you can specify it: ```R remotes::install_github("thomasp85/gganimate@v1.0.7") ``` -You can find more information in the documentation of remotes : [https://remotes.r-lib.org](https://remotes.r-lib.org) +You can find more information in the documentation of remotes: [https://remotes.r-lib.org](https://remotes.r-lib.org) ### Loading packages Once a package is installed, you need to load it in your R session to be able to use it. -The command `sessionInfo` display your session information. +The command `sessionInfo` displays your session information. ```{r packagesstep1, include=TRUE} sessionInfo() ``` -<div class='pencadre'> +<div class='pencadre'> <!-- TODO: replace with quarto callout --> Use the command `library` to load the `ggplot2` package and check your session </div> @@ -1068,7 +1069,7 @@ sessionInfo() ### Unloading packages -Sometime, you may want to unload package from your session instead of relaunching R. +Sometimes, you may want to unload a package from your session instead of relaunching R. ```{r packagesstep4, include=TRUE} unloadNamespace("ggplot2") diff --git a/session_2/session_2.Rmd b/session_2/session_2.Rmd index 04e443e87bcaaabba5f1c5c58e04ea3aaf4df173..16c7ba6f6a66da1ff1106b27c3979593408d22dc 100644 --- a/session_2/session_2.Rmd +++ b/session_2/session_2.Rmd @@ -110,7 +110,7 @@ read_csv("data-raw/vehicles.csv") %>% ## Introduction -In the last session, we have gone through the basis of R. 
+In the last session, we have gone through the basics of R. Instead of continuing to learn more about R programming, in this session we are going to jump directly to rendering plots. We make this choice for three reasons: @@ -142,7 +142,7 @@ All packages share an underlying design philosophy, grammar, and data structures install.packages("tidyverse") ``` -Luckily for you, `tidyverse` is preinstalled on your Rstudio server. So you just have to load the ` library` +Luckily for you, `tidyverse` is pre-installed on your RStudio server. So you just have to load the ` library`: ```{R load_tidyverse} library("tidyverse") @@ -150,7 +150,7 @@ library("tidyverse") ### Toy data set `mpg` -This dataset contains a subset of the fuel economy data that the EPA makes available on [fueleconomy.gov](http://fueleconomy.gov). +This dataset contains a subset of the fuel economy data that the EPA made available on [fueleconomy.gov](http://fueleconomy.gov). It contains only models which had a new release every year between 1999 and 2008. You can use the `?` command to know more about this dataset. @@ -159,10 +159,10 @@ You can use the `?` command to know more about this dataset. ?mpg ``` -But instead of using a dataset included in a R package, you may want to be able to use any dataset with the same format. -For that we are going to use the command `read_csv` which is able to read a [csv](https://en.wikipedia.org/wiki/Comma-separated_values) file. +But instead of using a dataset included in a R package, you may want to use any dataset with the same format. +For that, we are going to use the command `read_csv` which is able to read a [csv](https://en.wikipedia.org/wiki/Comma-separated_values) file. 
-This command also works for file URL +This command also works for file URL: ```{r mpg_download_local, cache=TRUE, message=FALSE, echo = F, include=F} new_mpg <- read_csv("./mpg.csv") @@ -180,34 +180,34 @@ You can check the number of lines and columns of the data with `dim`: dim(new_mpg) ``` -To visualize the data in Rstudio you can use the command. `View` +To visualize the data in RStudio you can use the command `View`, ```R View(new_mpg) ``` -Or by simply calling the variable. -Like for simple data type calling a variable print it. +Or simply calling the variable. +As with a simple data type, calling a variable prints it. But complex data type like `new_mpg` can use complex print function. ```{r mpg_inspect3, include=TRUE} new_mpg ``` -Here we can see that `new_mpg` is a `tibble` we will come back to `tibble` later. +Here we can see that `new_mpg` is a `tibble`. We will come back to `tibble` later. ### New script -Like in the last session, instead of typing your commands directly in the console, you are going to write them in an R script. +As in the last session, instead of typing your commands directly in the console, you will write them in an R script.  ## First plot with `ggplot2` -We are going to make the simplest plot possible to study the relationship between two variables: the scatterplot. +We are going to make the simplest plot possible to study the relationship between two variables: a scatterplot. -The following command generates a plot between engine size `displ` and fuel efficiency `hwy` present the `new_mpg` `tibble`. +The following command generates a plot between engine size `displ` and fuel efficiency `hwy` from the `new_mpg` `tibble`. ```{r new_mpg_plot_a, cache = TRUE, fig.width=8, fig.height=4.5} ggplot(data = new_mpg) + @@ -220,19 +220,20 @@ Are cars with bigger engines less fuel efficient ? 
`ggplot2` is a system for declaratively creating graphics, based on [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). You provide the data, tell `ggplot2` how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details. +All `ggplot2` plots begin with the same call: + ``` ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) ``` -- you begin a plot with the function `ggplot()` -- you complete your graph by adding one or more layers -- `geom_point()` adds a layer with a scatterplot -- each **geom **function in `ggplot2` takes a `mapping` argument -- the `mapping` argument is always paired with `aes()` - +- you instantiate a plot with the function `ggplot()` +- you complete your graph by adding, with `+`, one or more layers + (for instance, `geom_point()` adds a layer with a scatterplot) +- each **geom** function in `ggplot2` takes a `mapping` argument +- the `mapping` argument is always paired with aesthetics `aes()` <div class="pencadre"> -What happend when you use only the command `ggplot(data = mpg)` ? +What happens when you use only the command `ggplot(data = mpg)` ? </div> <details><summary>Solution</summary> @@ -276,13 +277,13 @@ Dots with the same coordinates are superposed. `ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. `ggplot2` will also add a legend that explains which levels correspond to which values. -Try the following aesthetic: +Try the following aesthetics: - `size` - `alpha` - `shape` -### `color` mapping +### `color` mapping {#sec-color-mapping} ```{r new_mpg_plot_e, cache = TRUE, fig.width=8, fig.height=4.5} ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) + @@ -324,7 +325,7 @@ Here is a list of different shapes available in R: </center> <div class="pencadre"> -What’s gone wrong with this code? 
Why are the points not blue? +What's gone wrong with this code? Why are the points not blue? </div> ```{r new_mpg_plot_not_blue, cache = TRUE, fig.width=8, fig.height=4.5} @@ -341,7 +342,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + </p> </details> -### Mapping a **continuous** variable to a color. +### Mapping a **continuous** variable to a color You can also map continuous variable to a color @@ -367,7 +368,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = displ < 5)) + You can create multiple plots at once by faceting. For this you can use the command `facet_wrap`. This command takes a formula as input. -We will come back to formulas in R later, for now, you have to know that formulas start with a `~` symbol. +We will come back to formulas in R later, for now, you just have to know that formulas start with a `~` symbol. To make a scatterplot of `displ` versus `hwy` per car `class` you can use the following code: @@ -396,7 +397,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + ## Composition -There are different ways to represent the information : +There are different ways to represent the information: ```{r new_mpg_plot_o, cache = TRUE, fig.width=8, fig.height=4.5} ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + @@ -412,7 +413,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + \ -We can add as many layers as we want +We can add as many layers as we want: ```{r new_mpg_plot_q, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE} ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + @@ -422,7 +423,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + \ -We can make `mapping` layer specific +We can make `mapping` layer specific: ```{r new_mpg_plot_s, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE} ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + @@ -541,7 +542,7 @@ plot_color_a_class("Compact Cars") ### Third challenge <div class="pencadre"> 
-Recreate the R code necessary to generate the following graph (see "linetype" option of "geom_smooth") +Recreate the R code necessary to generate the following graph (see "linetype" option of `geom_smooth`) </div> ```{r new_mpg_plot_u, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE} @@ -576,7 +577,7 @@ p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() ``` -Then save it in the wanted format: +Then save it in the format of your choice: ```{r, eval=F} ggsave("test_plot_1.png", p1, width = 12, height = 8, units = "cm") diff --git a/session_3/session_3.Rmd b/session_3/session_3.Rmd index 09a9513e052738db7f31c05bfcdff83901d5f894..2b57277d068a00a8c1e76fddfd535e06d9fd7f74 100644 --- a/session_3/session_3.Rmd +++ b/session_3/session_3.Rmd @@ -16,9 +16,9 @@ knitr::opts_chunk$set(comment = NA) ## Introduction -In the last session, we have seen how to use `ggplot2` and [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). The goal of this practical is to practices more advanced features of `ggplot2`. +In the last session, we have seen how to use `ggplot2` and [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). The goal of this session is to practice more advanced features of `ggplot2`. -The objectives of this session will be to: +The objectives will be to: - learn about statistical transformations - practices position adjustments @@ -39,7 +39,7 @@ Like in the previous sessions, it's good practice to create a new **.R** file to ## `ggplot2` statistical transformations -In the previous session, we have plotted the data as they are by using the variable values as **x** or **y** coordinates, color shade, size or transparency. +In the previous session, we have plotted the data as they are by using the variable values as **x** or **y** coordinates, color shade, size or transparency. 
When dealing with categorical variables, also called **factors**, it can be interesting to perform some simple statistical transformations. For example, we may want to have coordinates on an axis proportional to the number of records for a given category. @@ -47,9 +47,9 @@ We are going to use the `diamonds` data set included in `tidyverse`. <div class="pencadre"> -- Use the `help` and `View` command to explore this data set. -- How much records does this dataset contain ? -- Try the `str` command, which information are displayed ? +- Use the `help` and `View` commands to explore this data set. +- How many records does this dataset contain ? +- Try the `str` command. What information is displayed ? </div> @@ -59,8 +59,8 @@ str(diamonds) ### Introduction to `geom_bar` -We saw scatterplot (`geom_point()`), smoothplot (`geom_smooth()`). -Now barplot with `geom_bar()` : +We saw scatterplot (`geom_point()`) and smoothplot (`geom_smooth()`). +We can also use `geom_bar()` to draw barplot: ```{r diamonds_barplot, cache = TRUE, fig.width=8, fig.height=4.5} ggplot(data = diamonds, mapping = aes(x = cut)) + @@ -86,11 +86,12 @@ ggplot(data = diamonds, mapping = aes(x = cut)) + stat_count() ``` -Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three reasons you might need to use a **stat** explicitly: +Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three main reasons you might need to use a **stat** explicitly, we discuss them in the next two sections. ### Why **stat** ? You might want to override the default stat. + For example, in the following `demo` dataset we already have a variable for the **counts** per `cut`. 
```{r 3_a, include=TRUE, fig.width=8, fig.height=4.5} @@ -104,8 +105,8 @@ demo <- tribble( ) ``` -(Don't worry that you haven't seen `tribble()` before. You might be able -to guess at their meaning from the context, and you will learn exactly what +(Don't worry that you haven't seen `tribble()` before. You may be able +to guess their meaning from the context, and you will learn exactly what they do soon!) <div class="pencadre"> @@ -139,7 +140,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = ..prop..)) + geom_bar() ``` -If group is not used, the proportion is calculated with respect to the data that contains that field and is ultimately going to be 100% in any case. For instance, the proportion of an ideal cut in the ideal cut specific data will be 1. +If `group` is not used, the proportion is calculated with respect to the data that contain that field and is ultimately going to be 100% in any case. For instance, the proportion of an ideal cut in the ideal cut specific data will be 1. </p> </details> @@ -147,8 +148,8 @@ If group is not used, the proportion is calculated with respect to the data that <div class="pencadre"> You might want to draw greater attention to the statistical transformation in your code. -you might use `stat_summary()`, which summarize the **y** values for each unique **x** -value, to draw attention to the summary that you are computing +You might use `stat_summary()`, which summarize the **y** values for each unique **x** +value, to draw attention to the summary that you are computing. </div> <details><summary>Solution</summary> @@ -162,7 +163,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) + </details> <div class="pencadre"> -Set the `fun.min`, `fun.max` and `fun` to the `min`, `max` and `median` function respectively +Set the `fun.min`, `fun.max` and `fun` to the `min`, `max` and `median` function respectively. 
</div> <details><summary>Solution</summary> @@ -180,9 +181,10 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) + ## Coloring area plots +You can color a bar chart using either the `color` aesthetic, or, more usefully `fill`. + <div class="pencadre"> -You can color a bar chart using either the `color` aesthetic, or, more usefully `fill`: -Try both solutions on a `cut`, histogram. +Try both approaches on a `cut`, histogram. </div> <details><summary>Solution</summary> @@ -199,9 +201,10 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) + </p> </details> +You can also use `fill` with another variable. + <div class="pencadre"> -You can also use `fill` with another variable: -Try to color by `clarity`. Is `clarity` a continuous or categorial variable ? +Try to color by `clarity`. Is `clarity` a continuous or categorical variable ? </div> <details><summary>Solution</summary> @@ -215,7 +218,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + ## Position adjustments -The stacking of the `fill` parameter is performed by the position adjustment `position` +The stacking of the `fill` parameter is performed by the position adjustment `position`. <div class="pencadre"> Try the following `position` parameter for `geom_bar`: `"fill"`, `"dodge"` and `"jitter"` @@ -287,7 +290,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) + ## Coordinate systems -Cartesian coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful. +A Cartesian coordinate system is a coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful. 
```{r dia_boxplot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE} ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) + @@ -295,7 +298,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) + ``` <div class="pencardre"> -Add the `coord_flip()` layer to the previous plot +Add the `coord_flip()` layer to the previous plot. </div> <details><summary>Solution</summary> @@ -335,9 +338,9 @@ By combining the right **geom**, **coordinates** and **faceting** functions, you ## See you in [R.4: data transformation](/session_4/session_4.html) {.unnumbered .unlisted} -## To go further: animated plots from xls files +## To go further: animated plots from xls files -In order to be able to read information from a xls file, we will use the `openxlsx` packages. To generate animation we will use the `ggannimate` package. The additional `gifski` package will allow R to save your animation in the gif format (Graphics Interchange Format) +In order to be able to read information from a xls file, we will use the `openxlsx` packages. To generate animation we will use the `ggannimate` package. The additional `gifski` package will allow R to save your animation in the GIF (Graphics Interchange Format) format. ```{r install_readxl, eval=F} install.packages(c("openxlsx", "gganimate", "gifski")) @@ -349,19 +352,19 @@ library(gifski) ``` <div class="pencardre"> -Use the `openxlsx` package to save the [https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx](https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx) file to the `gapminder` variable +Use the `openxlsx` package to save the [gapminder.xlsx](https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx) file into the `gapminder` variable. 
</div> <details><summary>Solution</summary> <p> 2 solutions : -Use directly the url +Use directly the url: ```{r load_xlsx_url, eval = F} gapminder <- read.xlsx("https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx") ``` -Dowload the file, save it in the same directory as your script then use the local path +Download the file, save it in the same directory as your script then use the local path: ```{r load_xlsx} gapminder <- read.xlsx("gapminder.xlsx") ``` @@ -370,6 +373,7 @@ gapminder <- read.xlsx("gapminder.xlsx") </details> This dataset contains 4 variables of interest for us to display per country: + - `gdpPercap` the GDP par capita (US$, inflation-adjusted) - `lifeExp` the life expectancy at birth, in years - `pop` the population size diff --git a/session_4/session_4.Rmd b/session_4/session_4.Rmd index ed26d29ba5749f9ed8abf5f3140d6a65ef7b3223..3f2c517d849470f5ddc0205452268267a84dfa1e 100644 --- a/session_4/session_4.Rmd +++ b/session_4/session_4.Rmd @@ -19,18 +19,18 @@ knitr::opts_chunk$set(comment = NA) ## Introduction -The goal of this practical is to practice data transformation with `tidyverse`. -The objectives of this session will be to: +The goal of this session is to practice data transformation with `tidyverse`. +The objectives will be to: - Filter rows with `filter()` - Arrange rows with `arrange()` - Select columns with `select()` - Add new variables with `mutate()` +For this session, we are going to work with a new dataset included in the `nycflights13` package. + <div class="pencadre"> -For this session we are going to work with a new dataset included in the `nycflights13` package. -Install this package and load it. -As usual you will also need the `tidyverse` library. +Install this package and load it. As usual you will also need the `tidyverse` library. 
</div> <details><summary>Solution</summary> @@ -49,7 +49,7 @@ library("nycflights13") ### Data set : nycflights13 -`nycflights13::flights` Contains all 336,776 flights that departed from New York City in 2013. +`nycflights13::flights` contains all $336 \ 776$ flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in `?flights` ```R @@ -62,7 +62,7 @@ You can display the first rows of the dataset to have an overview of the data. flights ``` -To know all the colnames of a table you can use the function `colnames(dataset)` +You can use the function `colnames(dataset)` to get all the column names of a table: ```{r display_colnames, include=TRUE} colnames(flights) @@ -71,9 +71,9 @@ colnames(flights) ### Data type -In programming languages, all variables are not equal. +In programming languages, variables can have different types. When you display a `tibble` you can see the **type** of a column. -Here is a list of common variable **types** that you will encounter +Here is a list of common variable **types** that you will encounter: - **int** stands for integers. - **dbl** stands for doubles or real numbers. @@ -83,7 +83,8 @@ Here is a list of common variable **types** that you will encounter - **fctr** stands for factors, which R uses to represent categorical variables with fixed possible values. - **date** stands for dates. -You cannot add an **int** to a **chr**, but you can add an **int** to a **dbl** the results will be a **dbl**. +It's important for you to know about and understand the different types because certain operations are only allowed between certain types. +For instance, you cannot add an **int** to a **chr**, but you can add an **int** to a **dbl** the results will be a **dbl**. 
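To illustrate the point about types, here is a minimal sketch in base R. The variable names `n`, `x` and `label` are made up for this example and do not come from the `flights` data. Adding an **int** to a **dbl** works and gives a **dbl**, while adding an **int** to a **chr** raises an error, which is caught here with `tryCatch()` so that the script keeps running.

```r
# Hypothetical values, chosen only for illustration
n <- 2L        # an integer (int)
x <- 0.5       # a double (dbl)
label <- "a"   # a character string (chr)

# int + dbl works: the integer is converted and the result is a double
total <- n + x
typeof(total)

# int + chr is not allowed: R raises an error, which we capture here
# instead of letting it stop the script
outcome <- tryCatch(
  n + label,
  error = function(e) "error"
)
outcome
```

If you are unsure about the type of a value, `typeof()` or `class()` can be used to check it before combining it with another one.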
## `filter` rows @@ -106,7 +107,7 @@ filter(flights, air_time >= 680) filter(flights, carrier == "HA") filter(flights, origin != "JFK") ``` -The operator `%in%` is very usefull to test if a value is in a list. +The operator `%in%` is very useful to test if a value is in a list. ```{r filter_sup_inf, include=TRUE, eval=FALSE} filter(flights, carrier %in% c("OO","AS")) @@ -114,10 +115,10 @@ filter(flights, month %in% c(5,6,7,12)) ``` -`dplyr` functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `<-` +`dplyr` functions never modify their inputs, so if you want to save the result, you'll need to use the assignment operator, `<-`. <div class="pencadre"> -Save the flights longer than 680 minutes in a `long_flights` variable +Save the flights longer than 680 minutes in a `long_flights` variable. </div> <details><summary>Solution</summary> @@ -145,7 +146,7 @@ In R you can use the symbols `&` (and), `|` (or), `!` (not) and the function `xo <div class="pencadre"> -Display the `long_flights` variable and predict the results of +Display the `long_flights` variable and predict the results of: ```{r logical_operators_exemples2, eval=FALSE} filter(long_flights, day <= 15 & carrier == "HA") @@ -172,7 +173,7 @@ filter(long_flights, (day <= 15 | carrier == "HA") & (! month > 2)) <div class="pencadre"> -Test the following operations and translate them with words +Test the following operations and translate them with words. ```{r filter_logical_operators_a, eval=FALSE} filter(flights, month == 11 | month == 12) @@ -196,8 +197,8 @@ filter(flights, arr_delay <= 120, dep_delay <= 120) </div> -Combinations of logical operators is a powerful programmatic way to select subset of data. -Keep in mind, however, that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of one long one. 
+Combining logical operators is a powerful programmatic way to select subset of data. +However, keep in mind that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of a long one. <div class="pencadre"> @@ -211,11 +212,11 @@ What happens when you put your variable assignment code between parenthesis `(` ### Missing values -One important feature of R that can make comparison tricky is missing values, or `NA`s for **Not Availables**. -Indeed each of the variable type can contain either a value of this type (i.e., `2` for an **int**) or nothing. +One important feature of R that can make comparison tricky are missing values, or `NA`s for **Not Availables**. +Indeed, each of the variable type can contain either a value of this type (i.e., `2` for an **int**) or nothing. The *nothing recorded in a variable* status is represented with the `NA` symbol. -As operations with `NA` values don't make sense, if you have `NA` somewhere in your operation, the results will be `NA` +As operations with `NA` values don't make sense, if you have `NA` somewhere in your operation, the results will be `NA`: ```{r filter_logical_operators_NA, include=TRUE} NA > 5 @@ -244,6 +245,7 @@ filter(df, is.na(y) | y > 1) <div class="pencadre"> Find all flights that: + - Had an arrival delay (`arr_delay`) of two or more hours (you can check `?flights`) - Flew to Houston (IAH or HOU) </div> @@ -327,7 +329,7 @@ arrange(df, desc(y)) <details><summary>Solution</summary> <p> -Find the most delayed flight at arrival +Find the most delayed flight at arrival. ```{r chalange_arrange_desc_a, include=TRUE} arrange(flights, desc(arr_delay)) ``` @@ -336,7 +338,6 @@ Find the flight that left earliest. arrange(flights, dep_delay) ``` How could you arrange all missing values to the start in the `df` tibble ? 
- ```{r chalange_arrange_desc_c, include=TRUE} arrange(df, desc(is.na(y))) ``` @@ -346,27 +347,27 @@ arrange(df, desc(is.na(y))) ## Select columns with `select()` -`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables. +`select()` lets you quickly zoom in on a useful subset using operations based on variable names. -You can select by column names +You can select by column names: ```{r select_ymd_a, include=TRUE} select(flights, year, month, day) ``` -By defining a range of columns +By defining a range of columns: ```{r select_ymd_b, include=TRUE} select(flights, year:day) ``` -Or, you can do a negative (`-`) to remove columns. +Or, you can use a negative (`-`) to remove columns: ```{r select_ymd_c, include=TRUE} select(flights, -(year:day)) ``` -And, you can also rename column names on the fly. +You can also rename column names on the fly: ```{r select_ymd_d, include=TRUE} select(flights, Y = year, M = month, D = day) @@ -375,13 +376,13 @@ select(flights, Y = year, M = month, D = day) ### Helper functions -here are a number of helper functions you can use within `select()`: +Here are a number of helper functions you can use within `select()`: - `starts_with("abc")`: matches column names that begin with `"abc"`. - `ends_with("xyz")`: matches column names that end with `"xyz"`. - `contains("ijk")`: matches column names that contain `"ijk"`. - `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`. -- `where(test_function)`: select columns for which the result is TRUE. +- `where(test_function)`: selects columns for which the result is TRUE. See `?select` for more details. @@ -400,7 +401,6 @@ colnames(df_dep_arr) ``` - <details><summary>Other solutions</summary> <p> @@ -432,8 +432,8 @@ select(flights, all_of(vars)) From the help message (`?all_of()`) : - - all_of() is for strict selection. If any of the variables in the character vector is missing, an error is thrown. 
- - any_of() doesn't check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed. + - `all_of()` is for strict selection. If any of the variables in the character vector is missing, an error is thrown. + - `any_of()` doesn't check for missing variables. It is particularly useful with negative selections, when you would like to make sure a variable is removed. ```{r challenge_select_b2, eval=FALSE} vars <- c(vars, "toto") @@ -443,8 +443,7 @@ select(flights, all_of(vars)) </p> </details> -- Select all columns wich contain character values ? numeric values ? - +- Select all columns which contain character values ? numeric values ? <details><summary>Solution</summary> @@ -480,17 +479,17 @@ select(flights, contains("TIME", ignore.case = FALSE)) ## Add new variables with `mutate()` -It’s often useful to add new columns that are functions of existing columns. That’s the job of `mutate()`. +It's often useful to add new columns that are functions of existing columns. That's the job of `mutate()`. <div class="pencadre"> -First let's create a thiner dataset to work on `flights_thin` that contains +First let's create a thinner dataset to work on `flights_thin` that contains: - columns from `year` to `day` - columns that ends with `delays` - the `distance` and `air_time` columns - the `dep_time` and `sched_dep_time` columns -Then let's create an even smaller dataset as toy dataset to test your commands before using them on the large dataset (It a good reflex to take). For that you can use the function `head` or `sample_n` for a more random sampling. +Then let's create an even smaller toy dataset to test your commands before using them on the larger one (It a good reflex to take). For that you can use the function `head` or `sample_n` for a random sampling alternative. 
- select only 5 rows @@ -518,8 +517,7 @@ mutate(tbl, new_var_a = opperation_a, ..., new_var_n = opperation_n) `mutate()` allows you to add new columns (`new_var_a`, ... , `new_var_n`) and to fill them with the results of an operation. - -We can create a `gain` column whic can be the difference betwenn the delay at the departure and at the arrival to check if the pilot managed to compensate is departure delay. +We can create a `gain` column, which can be the difference between departure and arrival delays, to check whether the pilot has managed to compensate for his departure delay. ```{r mutate_gain} mutate(flights_thin_toy, gain = dep_delay - arr_delay) @@ -545,15 +543,15 @@ flights_thin_toy <div class="pencadre"> -Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. (see the help to get more information on these columns) In the flight dataset, convert them to a more convenient representation of the number of minutes since midnight. +Currently `dep_time` and `sched_dep_time` are convenient to look at, but difficult to work with, as they're not really continuous numbers (see the help to get more information on these columns). In the flight dataset, convert them to a more convenient representation of the number of minutes since midnight. **Hints** : - `dep_time` and `sched_dep_time` are in the HHMM format (see the help to get these information). So you have to first get the number of hours `HH`, convert them in minutes and then add the number of minutes `MM`. - - For exemple : `20:03` will be display `2003`, so to convert it in minutes you have to do `20 * 60 + 03 (= 1203) `. + - For example: `20:03` will be display `2003`, so to convert it in minutes you have to do `20 * 60 + 03 (= 1203)`. 
- - To split the number `HHMM` in hours (`HH`) and minutes (`MM`) you have to use an eucledean division of HHMM by 100 to get the number of hours as the divisor and the number of minute as the remainder. For that use the modulo operator `%%` to get the remainder and it's friend `%/%` which return the divisor. + - To split the number `HHMM` in hours (`HH`) and minutes (`MM`) you have to use an euclidean division of HHMM by 100 to get the number of hours as the divisor and the number of minute as the remainder. For that, use the modulo operator `%%` to get the remainder and it's friend `%/%` which returns the divisor. ```{r mutate_exemple, include=TRUE} HH <- 2003 %/% 100 @@ -563,7 +561,7 @@ MM HH * 60 + MM ``` It is always a good idea to decompose a problem in small parts. -First train you only on `dep_time`. Build the HH and MM columns. Then try to do the convertions in one row. +First, only start with `dep_time`. Build the HH and MM columns. Then, try to write both conversions in one row. </div> @@ -579,7 +577,7 @@ mutate( ) ``` -** Note ** You can use the `.after` option to tell where to put the new columns +**Note**: You can use the `.after` option to tell where to put the new columns, ```{r mutate_challenges_a2, include=TRUE} mutate( @@ -590,7 +588,7 @@ mutate( .after = "dep_time" ) ``` -or `.keep = "used"` to keep only the columns used for the calculus which can be usefull for debugging +or `.keep = "used"` to keep only the columns used for the calculus which can be usefull for debugging, ```{r mutate_challenges_a21, include=TRUE} mutate( @@ -610,7 +608,7 @@ mutate( .after = "dep_time" ) ``` -** Note ** You can also directly replace a column by the result of the mutate operation. 
+**Note**: You can also directly replace a column by the result of the mutate operation, ```{r mutate_challenges_a4, include=TRUE, eval = F} mutate( @@ -634,17 +632,17 @@ mutate( ``` - </p> </details> ### Useful creation functions -- Offsets: lead(x) and lag(x) allow you to refer to the previous or next values of the column x. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`). +- Offsets: `lead(x)` and `lag(x)` allow you to refer to the previous or next values of the column x. + This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`). - R provides functions for running cumulative sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means. -- Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==` -- Ranking: there are a number of ranking functions, the most frequently used being min_rank(). They differ by the way ties are treated, etc. Try ?mutate, ?min_rank, ?rank, for more information. +- Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`. +- Ranking: there are a number of ranking functions, the most frequently used being `min_rank()`. They differ by the way ties are treated, etc. Try ?mutate, ?min_rank, ?rank, for more information. ## See you in [R.5: Pipping and grouping](/session_5/session_5.html) {.unnumbered .unlisted} @@ -669,7 +667,7 @@ library(viridis) ### RColorBrewer & Ghibli -Using `mpg` and the 'ggplot2' package, reproduce the graph studied in session 2, 3.1: color mapping. +Using `mpg` and the ggplot2 package, reproduce the graph studied in @sec-color-mapping. Modify the colors representing the class of cars with the palettes `Dark2` of [RColorBrewer](https://www.datanovia.com/en/fr/blog/palette-de-couleurs-rcolorbrewer-de-a-a-z/), then `MononokeMedium` from [Ghibli](https://github.com/ewenme/ghibli). 
```{r mpg_color} @@ -679,7 +677,7 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + Go to the links to find the appropriate function: they are very similar between the two packages. <details><summary>Solution</summary> - <p> +<p> ```{r mpg_color1} ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + @@ -692,10 +690,10 @@ ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + geom_point() + scale_colour_ghibli_d("MononokeMedium") ``` - </p> +</p> </details> -The choice of colors is very important for the comprehension of a graphic. Some palettes are not suitable for everyone. For example, for people with color blindness, color gradients from green to red, or from yellow to blue should be avoided. +The choice of colors is very important for the comprehension of a graphic. Some palettes are not suitable for everyone. For example, for people with color blindness, color gradients from green to red, or from yellow to blue should be avoided. To display only Brewer palettes that are colorblind friendly, specify the option `colorblindFriendly = TRUE` as follows: @@ -705,9 +703,9 @@ display.brewer.all(colorblindFriendly = TRUE) ### Viridis -`viridis` package provide a series of color maps that are designed to improve graph readability for readers with common forms of color blindness and/or color vision deficiency. +The `viridis` package provides a series of color maps that are designed to improve graph readability for readers with common forms of color blindness and/or color vision deficiency. -For the next part, we will use a real data set. Anterior tibial muscle tissue was collected from 20 patients, with or without confirmed myotonic dystrophy type 1 (DM1). Illumina RNAseq was performed on these samples and the sequencing data are available on GEO with the identifier GSE86356. +For the next part, we will use a real data set. 
Anterior tibial muscle tissue was collected from 20 patients, with or without confirmed myotonic dystrophy type 1 (DM1). Illumina RNAseq was performed on these samples and the sequencing data are available on GEO with the identifier GSE86356. First, we will use the gene count table of these samples, formatted for use in ggplot2 ( `pivot_longer()` [function](https://tidyr.tidyverse.org/reference/pivot_longer.html) ). @@ -716,7 +714,7 @@ Open the csv file using the `read_csv2()` function. The file is located at "http <details><summary>Solution</summary> <p> -Download the Expression_matrice_pivot_longer_DEGs_GSE86356.csv file and save it in your working directory. +Download the file "Expression_matrice_pivot_longer_DEGs_GSE86356.csv" and save it in your working directory. You may have to set you working directory using `setwd()` ```{r read_csv1} @@ -725,7 +723,7 @@ expr_DM1 <- read_csv2("Expression_matrice_pivot_longer_DEGs_GSE86356.csv") expr_DM1 ``` -or you can read it from the url +or you can read it from the following url: ```{r read_csv1_url, eval = F} (expr_DM1 <- read_csv2("https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv")) @@ -737,10 +735,10 @@ or you can read it from the url With this tibble, use `ggplot2` and the `geom_tile()` function to make a heatmap. Fit the samples on the x-axis and the genes on the y-axis. -**Tip:** Transform the counts into log10(x + 1) for a better visualization. +**Tip**: Transform the counts into log10(x + 1) for a better visualization. <details><summary>Solution</summary> - <p> +<p> ```{r heatmap1} (DM1_tile_base <- @@ -752,19 +750,19 @@ Fit the samples on the x-axis and the genes on the y-axis. axis.text.x = element_text(size = 6, angle = 90) )) ``` -**Nota bene :** The elements of the axes, and the theme in general, are modified in the `theme()` function. - </p> +**Nota bene**: The elements of the axes, and the theme in general, are modified in the `theme()` function. 
+</p> </details> With the default color gradient, even with the transformation, the heatmap is difficult to study. -R interprets a large number of colors, indicated in RGB, hexadimal, or just by name. For example : +R interprets a large number of colors, indicated in RGB, hexadecimal, or just by name. For example : <center> {width=400px} </center> -With `scale_fill_gradient2()` function, change the colors of the gradient, taking white for the minimum value and 'springgreen4' for the maximum value. +With `scale_fill_gradient2()` function, change the colors of the gradient, taking "white" for the minimum value and "springgreen4" for the maximum value. <details><summary>Solution</summary> <p> @@ -776,16 +774,16 @@ DM1_tile_base + scale_fill_gradient2(low = "white", high = "springgreen4") </p> </details> -It s better, but still not perfect! -Now let s use the [viridis color gradient](https://gotellilab.github.io/GotelliLabMeetingHacks/NickGotelli/ViridisColorPalette.html) for this graph. +It's better, but still not perfect! +Now let's use the [viridis color gradient](https://gotellilab.github.io/GotelliLabMeetingHacks/NickGotelli/ViridisColorPalette.html) for this graph. <details><summary>Solution</summary> - <p> +<p> ```{r heatmapViridis} DM1_tile_base + scale_fill_viridis_c() ``` - </p> +</p> </details> ### Volcano Plot @@ -795,9 +793,9 @@ For this last exercise, we will use the results of the differential gene express Open the csv file using the `read_csv2()` function. The file is located at "http://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv". <details><summary>Solution</summary> - <p> +<p> -Download the "EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv" file and save it in your working directory. +Download the file "EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv" and save it in your working directory. 
```{r read_csv2} tab <- read_csv2("EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv") @@ -811,22 +809,20 @@ tab ``` - - - </p> +</p> </details> -To make a Volcano plot, displaying different information about the significativity of the variation thanks to the colors, we will have to make a series of modifications on this table. +To make a Volcano plot, displaying different information on the significance of variation using colors, we will have to make a series of modifications on this table. -With `mutate()` and `ifelse()` [fonctions](https://dplyr.tidyverse.org/reference/if_else.html), we will have to create : +With `mutate()` and `ifelse()` [fonctions](https://dplyr.tidyverse.org/reference/if_else.html), we will have to create: -- a column 'sig' : it indicates if the gene is significant ( TRUE or FALSE ). -**Thresholds :** baseMean > 20 and padj < 0.05 and abs(log2FoldChange) >= 1.5 +- a column 'sig': it indicates if the gene is significant ( TRUE or FALSE ). + **Thresholds**: baseMean > 20 and padj < 0.05 and abs(log2FoldChange) >= 1.5 -- a column 'UpDown' : it indicates if the gene is significantly up-regulated (Up), down-regulated (Down), or not significantly regulated (NO). +- a column 'UpDown': it indicates if the gene is significantly up-regulated (Up), down-regulated (Down), or not significantly regulated (NO). <details><summary>Solution</summary> - <p> +<p> ```{r sig} (tab.sig <- mutate(tab, @@ -836,14 +832,14 @@ With `mutate()` and `ifelse()` [fonctions](https://dplyr.tidyverse.org/reference ) ) ``` - </p> +</p> </details> We want to see the top10 DEGs on the graph. For this, we will use the package `ggrepel`. Install and load the `ggrepel` package. 
<details><summary>Solution</summary> - <p> +<p> ```{r ggrepel, eval = F} install.packages("ggrepel") @@ -852,14 +848,14 @@ install.packages("ggrepel") ```{r ggrepel2} library(ggrepel) ``` - </p> +</p> </details> -Let's **filter** out table into a new variable, top10, to keep only the significant differentialy expressed genes with the top 10 adjusted pvalue. The **smaller** the adjusted pvalue, the more significant. +Let's **filter** out the table into a new variable, top10, to keep only the significant differentially expressed genes, those with the top 10 adjusted pvalue. The **smaller** the adjusted pvalue, the more significant the gene. -**Tips :** You can use the [function](https://dplyr.tidyverse.org/reference/slice.html) `slice_min()` +**Tips**: You can use the [function](https://dplyr.tidyverse.org/reference/slice.html) `slice_min()` <details><summary>Solution</summary> <p> @@ -884,9 +880,9 @@ To make the graph below, use `ggplot2`, the functions `geom_point()`, `geom_hlin </div> -- **Tips 1 :** Don t forget the transformation of the adjusted pvalue. -- **Tips 2 :** Feel free to search your favorite Web browser for help. -- **Tips 3 :** `geom_label_repel()` function needs a new parameter 'data' and 'label' in aes parameters. +- **Tips 1**: Don't forget the transformation of the adjusted pvalue. +- **Tips 2**: Feel free to search your favorite Web browser for help. +- **Tips 3**: `geom_label_repel()` function needs a new parameter 'data' and 'label' in `aes` parameters. 
```{r VolcanoPlotDemo, echo = FALSE} @@ -903,7 +899,7 @@ ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) + ``` <details><summary>Solution</summary> - <p> +<p> ```{r VolcanoPlotSolut, echo = TRUE, results = 'hide'} ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) + @@ -917,6 +913,6 @@ ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) + geom_label_repel(data = top10, mapping = aes(label = gene_symbol)) ``` - </p> +</p> </details> diff --git a/session_5/session_5.Rmd b/session_5/session_5.Rmd index 704c36433b67475357d02b5c885facc4b4454fe5..7dbb74f71fd4a71fd52b690150947e369a773d21 100644 --- a/session_5/session_5.Rmd +++ b/session_5/session_5.Rmd @@ -19,14 +19,14 @@ knitr::opts_chunk$set(comment = NA) ## Introduction -The goal of this practical is to practice combining data transformation with `tidyverse`. -The objectives of this session will be to: +The goal of this session is to practice combining data transformation with `tidyverse`. +The objectives will be to: - Combining multiple operations with the pipe `%>%` - Work on subgroup of the data with `group_by` <div class="pencadre"> -For this session we are going to work with a new dataset included in the `nycflights13` package. +For this session, we are going to work with a new dataset included in the `nycflights13` package. Install this package and load it. As usual you will also need the `tidyverse` library. </div> @@ -43,7 +43,7 @@ library("nycflights13") ## Combining multiple operations with the pipe <div id="pencadre"> -Find the 10 most delayed flights using a ranking function. `min_rank()` +Find the 10 most delayed flights using the ranking function `min_rank()`. </div> <details><summary>Solution</summary> @@ -57,7 +57,6 @@ flights_md <- arrange(flights_md, most_delay) </p> </details> - We don't want to create useless intermediate variables so we can use the pipe operator: `%>%` (or `ctrl + shift + M`). 
@@ -78,14 +77,19 @@ flights_md2 <- flights %>% </p> </details> -Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered and use `+` instead of `%>%`. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet. +Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered and uses `+` instead of `%>%`. +<!-- ggvis project is dormant +Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn't quite ready for prime time yet. +--> -The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when: +The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. ### When not to use the pipe +You should reach for another tool when: + - Your pipes are longer than (say) ten steps. In that case, create intermediate variables with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent. -- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or split the results. +- You have multiple inputs or outputs. If there isn't one primary object being transformed, but two or more objects being combined together, don't use the pipe. You can create a function that combines or splits the results.
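The first point can be sketched as follows: a long pipeline is cut into named intermediate objects that are easy to inspect one by one. This is only an illustration, not part of the session's solutions; the toy `delays` table and the names `valid_delays` and `delay_by_carrier` are made up, and only `dplyr` functions are assumed.

```r
library(dplyr)

# Hypothetical toy data, standing in for a larger table such as `flights`
delays <- tibble(
  carrier   = c("AA", "AA", "UA", "UA", "DL"),
  dep_delay = c(10, NA, 5, 25, -3)
)

# Step 1: keep only the rows with a recorded delay (can be checked on its own)
valid_delays <- delays %>%
  filter(!is.na(dep_delay))

# Step 2: summarise per carrier, starting from the named intermediate result
delay_by_carrier <- valid_delays %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay), n_flights = n())

delay_by_carrier
```

Each intermediate name documents what the step produced, which is the benefit the bullet point describes.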
## Grouping variable @@ -99,13 +103,13 @@ flights %>% summarise(delay = mean(dep_delay, na.rm = TRUE)) ``` -Where mutate compute the `mean` of `dep_delay` row by row (which is not useful), `summarise` compute the `mean` of the whole `dep_delay` column. +Whereas `mutate()` computes the `mean` of `dep_delay` row by row (which is not useful), `summarise()` computes the `mean` of the whole `dep_delay` column. ### The power of `summarise()` with `group_by()` The `group_by()` function changes the unit of analysis from the complete dataset to individual groups. -Individual groups are defined by categorial variable or **factors**. -Then, when you use the function you already know on grouped data frame and they’ll be automatically applied *by groups*. +Individual groups are defined by categorical variables or **factors**. +Then, when you use aggregation functions on the grouped data frame, they'll be automatically applied *by groups*. You can use the following code to compute the average delay per month across years. @@ -128,7 +132,7 @@ Why did we `group_by` `year` and `month` and not only `year` ? ### Missing values <div class="pencadre"> -You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it? +You may have wondered about the `na.rm` argument we used above. What happens if we don't set it? </div> ```{r summarise_group_by_NA, include=TRUE} @@ -140,11 +144,11 @@ flights %>% ) ``` -Aggregation functions obey the usual rule of missing values: **if there’s any missing value in the input, the output will be a missing value**. +Aggregation functions obey the usual rule of missing values: **if there's any missing value in the input, the output will be a missing value**. ### Counts -Whenever you do any aggregation, it’s always a good idea to include either a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
+Whenever you do any aggregation, it's always a good idea to include a count (`n()`). This way, you can check that you're not drawing conclusions based on very small amounts of data. ```{r summarise_group_by_count, include = T, echo=F, warning=F, message=F, fig.width=8, fig.height=3.5} summ_delay_filghts <- flights %>% @@ -165,11 +169,11 @@ ggplot(summ_delay_filghts, mapping = aes(x = avg_distance, y = avg_delay, size = <div class="pencadre"> Imagine that we want to explore the relationship between the average distance (`distance`) and average delay (`arr_delay`) for each location (`dest`) and recreate the above figure. -here are three steps to prepare this data: +Here are three steps to prepare those data: 1. Group flights by destination. 2. Summarize to compute average distance (`avg_distance`), average delay (`avg_delay`), and number of flights using `n()` (`n_flights`). -3. Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport. +3. Filter to remove Honolulu airport ("HNL"), which is almost twice as far away as the next closest airport. 4. Filter to remove noisy points with delay superior to 40 or inferior to -20 5. Create a `mapping` on `avg_distance`, `avg_delay` and `n_flights` as `size`. 6. Use the layer `geom_point()` and `geom_smooth()` (use method = lm) @@ -221,7 +225,7 @@ flights %>% Look at the number of canceled flights per day. Is there a pattern? 
-(A canceled flight is a flight where the `dep_time` or the `arr_time` is `NA`) +(A canceled flight is a flight where either the `dep_time` or the `arr_time` is `NA`) **Remember to always try to decompose complex questions into smaller and simpler problems** diff --git a/session_6/session_6.Rmd b/session_6/session_6.Rmd index d7f621f4040d292b1de94f1f9fe77beaafed9cd7..7d0c2e986eaabcffc45b7aee5612f16a3216a62f 100644 --- a/session_6/session_6.Rmd +++ b/session_6/session_6.Rmd @@ -16,9 +16,9 @@ knitr::opts_chunk$set(comment = NA) ## Introduction -Until now we have worked with data already formated in a *nice way*. -In the `tidyverse` data formated in a *nice way* are called **tidy** -The goal of this practical is to understand how to transform an hugly blob of information into a **tidy** data set. +Until now we have worked with data already formatted in a *nice way*. +In the `tidyverse`, data formatted in a *nice way* are called **tidy**. +The goal of this session is to understand how to transform an ugly blob of information into a **tidy** data set. ### Tidydata @@ -28,9 +28,9 @@ There are three interrelated rules which make a dataset tidy: - Each observation must have its own row. - Each value must have its own cell. -Doing this kind and transformation is often called **data wrangling**, due to the felling that we have to *wrangle* with the data to force them into a **tidy** format. +Doing this kind of transformation is often called **data wrangling**, due to the feeling that we have to *wrangle* (struggle) with the data to force them into a **tidy** format. -But once this step is finish most of the subsequent analysis will be realy fast to do ! +But once this step is finished, most of the subsequent analysis will be really fast to do! <div class="pencadre"> As usual we will need the `tidyverse` library.
@@ -44,7 +44,7 @@ library(tidyverse) </p> </details> -For this practical we are going to use the `table` set of datasets which demonstrate multiple ways to layout the same tabular data. +For this session, we are going to use the `table*` set of datasets which demonstrate multiple ways to layout the same tabular data. <div class="pencadre"> Use the help to know more about `table1` dataset @@ -80,7 +80,7 @@ wide_example <- tibble(X1 = c("A","B"), If you have a wide dataset, such as `wide_example`, that you want to make longer, you will use the `pivot_longer()` function. -You have to specify the names of the columns you want to pivot into longer format (X2,X3,X4): +You have to specify the names of the columns you want to pivot into longer format (X2, X3, X4): ```{r, eval = F} wide_example %>% @@ -116,11 +116,11 @@ Is the data **tidy** ? How would you transform this dataset to make it **tidy** <details><summary>Solution</summary> <p> -We have information about 3 variables in the `table4a`: `country`, `year` and number of `cases`. +We have information about 3 variables in `table4a`: `country`, `year` and number of `cases`. However, the variable information (`year`) is stored as column names. We want to pivot the horizontal column year, vertically and make the table longer. -You can use the `pivot_longer` fonction to make your table longer and have one observation per row and one variable per column. +You can use the `pivot_longer` function to make your table longer and have one observation per row and one variable per column. For this we need to : @@ -164,7 +164,7 @@ Is the data **tidy** ? How would you transform this dataset to make it **tidy** <p> The column `count` store two types of information: the `population` size of the country and the number of `cases` in the country. -You can use the `pivot_wider` fonction to make your table wider and have one observation per row and one variable per column. 
+You can use the `pivot_wider` function to make your table wider and have one observation per row and one variable per column. ```{r pivot_wider, eval=T, message=T} table2 %>% @@ -178,7 +178,7 @@ table2 %>% ### Relational data -To avoid having a huge table and to save space, information is often splited between different tables. +To avoid having a huge table and to save space, information is often split between different tables. In our `flights` dataset, information about the `carrier` or the `airports` (origin and dest) is saved in separate tables (`airlines`, `airports`). @@ -194,7 +194,7 @@ flights2 <- flights %>% ### Relational schema -The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation. +Relationships between tables can be displayed in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation. ```{r airlines_dag, echo=FALSE, out.width='100%'} knitr::include_graphics('img/relational-nycflights.png') @@ -202,9 +202,9 @@ knitr::include_graphics('img/relational-nycflights.png') ### Joins -If you have to combine data from 2 tables in a a new table, you will use `joints`. +If you have to combine data from two tables in a new one, you will use `*_join` functions. -There are several types of joints depending of what you want to get. +There are several types of joins depending on what you want to get.
```{r joints, echo=FALSE, out.width='100%'} knitr::include_graphics('img/join-venn.png') @@ -218,7 +218,7 @@ knitr::include_graphics('img/overview_joins.png') #### `inner_join()` -keeps observations in `x` AND `y` +Keeps observations in `x` AND `y` ```{r inner_joint, eval=T} flights2 %>% @@ -227,7 +227,7 @@ flights2 %>% #### `left_join()` -keeps all observations in `x` +Keeps all observations in `x` ```{r left_joint, eval=T} flights2 %>% @@ -236,7 +236,7 @@ flights2 %>% #### `right_join()` -keeps all observations in `y` +Keeps all observations in `y` ```{r right_joint, eval=T} flights2 %>% @@ -245,7 +245,7 @@ flights2 %>% #### `full_join()` -keeps all observations in `x` and `y` +Keeps all observations in `x` and `y` ```{r full_joint, eval=T} flights2 %>% @@ -275,7 +275,7 @@ flights2 %>% left_join(airports, c("dest" = "faa")) ``` -If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffix `.x` and `.y` because all column names must be different in the output table. +If two columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffixes `.x` and `.y` because all column names must be different in the output table. ```{r , eval=T, echo = T} flights2 %>% @@ -308,7 +308,7 @@ flights %>% ### Set operations -These expect the x and y inputs to have the same variables, and treat the observations like sets: +These expect the `x` and `y` inputs to have the same variables, and treat the observations like sets: - `intersect(x, y)`: return only observations in both `x` and `y`. - `union(x, y)`: return unique observations in `x` and `y`.
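As a small illustration of these two set operations, here is a sketch on two made-up tibbles, `x` and `y`, that share the same variables. It assumes `dplyr` is loaded, so that its data-frame versions of `intersect()` and `union()` are the ones being called; the data are hypothetical and not taken from the session.

```r
library(dplyr)

# Two hypothetical tables with the same variables
x <- tibble(id = c(1, 2), value = c("a", "b"))
y <- tibble(id = c(2, 3), value = c("b", "c"))

# Rows present in both tables
in_both <- intersect(x, y)
in_both

# Unique rows found in either table
in_either <- union(x, y)
in_either
```

Rows are compared as a whole, across all columns, which is why both inputs need the same variables.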
diff --git a/session_7/session_7.Rmd b/session_7/session_7.Rmd index 0e1ae139c24bbcf88fb94a292b170e3ee1a13614..79dc194dcc1bcf648709904944451920f82bfcd0 100644 --- a/session_7/session_7.Rmd +++ b/session_7/session_7.Rmd @@ -22,7 +22,7 @@ knitr::opts_chunk$set(comment = NA) In the previous session, we have often overlooked a particular type of data, the **string**. In R a sequence of characters is stored as a string. -In this session you will learn the distinctive features of the string type and how we can use string of character within a programming language which is composed of particular string of characters as function names, variables. +In this session you will learn the distinctive features of the string type and how we can use string of characters within a programming language which is composed of particular string of characters as function names, variables. <div class="pencadre"> As usual we will need the `tidyverse` library. @@ -40,7 +40,7 @@ library(tidyverse) ### String definition -A string can be defined within double `"` or simple `'` quote +A string can be defined within double `"` or simple `'` quote: ```{r string_def, eval=F, message=T} string1 <- "This is a string" @@ -48,7 +48,7 @@ string2 <- 'If I want to include a "quote" inside a string, I use single quotes' ``` -If you forget to close a quote, you’ll see +, the continuation character: +If you forget to close a quote, you'll see `+`, the continuation character: ``` > "This is a string without a closing quote @@ -65,12 +65,12 @@ To include a literal single or double quote in a string you can use \\ to *escap double_quote <- "\"" # or '"' single_quote <- '\'' # or "'" ``` -If you want to include a literal backslash, you’ll need to double it up: `"\\"`. +If you want to include a literal backslash, you'll need to double it up: `"\\"`. 
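To check that the doubled backslash really stands for a single character, here is a minimal base R sketch (the variable name `backslash` is only illustrative):

```r
# Typed with two backslashes in the source code
backslash <- "\\"

# The stored string holds a single character
n_char <- nchar(backslash)
n_char

# writeLines() shows the raw content of the string
writeLines(backslash)
```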
### String representation -The printed representation of a string is not the same as string itself +The printed representation of a string is not the same as a string itself: ```{r string_rep_escape_a, eval=T, message=T} x <- c("\"", "\\") @@ -104,8 +104,7 @@ x <- c("Apple", "Banana", "Pear") str_sub(x, 1, 3) ``` -- Subsetting strings -negative numbers count backwards from the end +- Subsetting strings negative numbers count backwards from the end ```{r str_sub2, eval=T, message=FALSE, cache=T} str_sub(x, -3, -1) ``` @@ -115,19 +114,19 @@ str_sub(x, -3, -1) str_to_lower(x) ``` -- ordering +- Ordering ```{r str_sort, eval=T, message=FALSE, cache=T} str_sort(x) ``` -## Matching patterns with regular expressions +## Matching patterns with REGular EXpressions (regex) -Regexps are a very terse language that allows you to describe patterns in strings. +regexps form a very terse language that allows you to describe patterns in strings. -To learn regular expressions, we’ll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. +To learn regular expressions, we'll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match. <div class="pencadre"> -You need to install the `htmlwidgets` packages to use these functions +You need to install the `htmlwidgets` packages to use these functions. </div> <details><summary>Solution</summary> @@ -159,8 +158,8 @@ x <- c("apple", "banana", "pear") str_view(x, ".a.") ``` -But if “`.`” matches any character, how do you match the character “`.`”? -You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behavior. +But if `.` matches any character, how do you match the character "`.`"? +You need to use an "escape" to tell the regular expression you want to match it exactly, not use its special behaviour. 
Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match a `.`, you need the regexp `\.`. Unfortunately this creates a problem. @@ -184,7 +183,7 @@ str_view(x, "\\\\") ### Exercises -- Explain why each of these strings doesn’t match a \: "`\`", "`\\`", "`\\\`". +- Explain why each of these strings doesn't match a \: "`\`", "`\\`", "`\\\`". - How would you match the sequence `"'\`? - What patterns will the regular expression `\..\..\..` match? How would you represent it as a string? @@ -212,10 +211,10 @@ str_view(x, "^apple$") ### Exercises - How would you match the literal string `"$^$"`? -- Given the corpus of common words in stringr::words, create regular expressions that find all words that: - -Start with “y”. - - End with “x” - - Are exactly three letters long. (Don’t cheat by using `str_length()`!) +- Given the corpus of common words in `stringr::words`, create regular expressions that find all words that: + - Start with "y". + - End with "x". + - Are exactly three letters long (don't cheat by using `str_length()`!). - Have seven letters or more. Since this list is long, you might want to use the `match` argument to `str_view()` to show only the matching or non-matching words. @@ -235,7 +234,8 @@ str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c") str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]") ``` -You can use alternations to pick between one or more alternative patterns. For example, `abc|d..f` will match either `abc`, or `deaf`. Note that the precedent for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`. Like with mathematical expressions, if presidents ever get confusing, use parentheses to make it clear what you want: +You can use alternations to pick between one or more alternative patterns. For example, `abc|d..f` will match either `abc`, or `deaf`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`.
+Like with mathematical expressions, if alternations ever get confusing, use parentheses to make it clear what you want: ```{r str_viewanchorsstartend_c, eval=T, cache=T} str_view(c("grey", "gray"), "gr(e|a)y") @@ -246,9 +246,9 @@ str_view(c("grey", "gray"), "gr(e|a)y") Create regular expressions to find all words that: - Start with a vowel. -- That only contains consonants. (Hint: thinking about matching “not”-vowels.) -- End with ed, but not with eed. -- End with ing or ise. +- That only contains consonants. (Hint: thinking about matching "not"-vowels.) +- End with "ed", but not with "eed". +- End with "ing" or "ise". ### Repetition @@ -280,7 +280,7 @@ str_view(x, "C{2,3}") ### Exercices -- Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.) +- Describe in words what these regular expressions match (read carefully to see if I'm using a regular expression or a string that defines a regular expression): - `^.*$` - `"\\{.+\\}"` - `\d{4}-\d{2}-\d{2}` @@ -309,8 +309,8 @@ str_view(fruit, "(..)\\1", match = TRUE) - `"(.)(.)(.).*\\3\\2\\1"` - Construct regular expressions to match words that: - Start and end with the same character. - - Contain a repeated pair of letters (e.g. `“church”` contains `“ch”` repeated twice.) - - Contain one letter repeated in at least three places (e.g. `“eleven”` contains three `“e”`s.) + - Contain a repeated pair of letters (e.g. `"church"` contains `"ch"` repeated twice). + - Contain one letter repeated in at least three places (e.g. `"eleven"` contains three `"e"`s). ### Detect matches @@ -319,7 +319,7 @@ x <- c("apple", "banana", "pear") str_detect(x, "e") ``` -How many common words start with t? +How many common words start with "t"? ```{r str_view_match_b, eval=T, cache=T} sum(str_detect(words, "^t")) @@ -383,7 +383,7 @@ head(matches) ### Grouped matches -Imagine we want to extract nouns from the sentences. 
As a heuristic, we’ll look for any word that comes after “a” or “the”. +Imagine we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after "a" or "the". ```{r noun_regex, eval=T, cache=T} noun <- "(a|the) ([^ ]+)" diff --git a/session_8/session_8.Rmd b/session_8/session_8.Rmd index ad0e404819c2fd68d9dcc1dda142c475f11e8c38..a6aae5a8d5a0bf6e37a91aa02f0a02d84ed90662 100644 --- a/session_8/session_8.Rmd +++ b/session_8/session_8.Rmd @@ -21,7 +21,7 @@ knitr::opts_chunk$set(comment = NA) In this session, you will learn more about the factor type in R. Factors can be very useful, but you have to be mindful of the implicit conversions from simple vectors to factors! -They are the source of loot of pain for R programmers. +They are the source of a lot of pain for R programmers. <div class="pencadre"> As usual we will need the `tidyverse` library. @@ -45,13 +45,13 @@ x1 <- c("Dec", "Apr", "Jan", "Mar") Using a string to record this variable has two problems: -1. There are only twelve possible months, and there’s nothing saving you from typos: +1. There are only twelve possible months, and there's nothing saving you from typos: ```{r declare_month2, eval=T, cache=T} x2 <- c("Dec", "Apr", "Jam", "Mar") ``` -2. It doesn’t sort in a useful way: +2. It doesn't sort in a useful way: ```{r sort_month, eval=T, cache=T} sort(x1) @@ -76,7 +76,7 @@ y2 <- parse_factor(x2, levels = month_levels) y2 ``` -Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data. +Sometimes you'd prefer that the order of the levels match the order of the first appearance in the data. ```{r inorder_month_factor, eval=T, cache=T} f2 <- x1 %>% factor() %>% fct_inorder() @@ -91,7 +91,7 @@ gss_cat %>% count(race) ``` -By default, `ggplot2` will drop levels that don’t have any values. You can force them to display with: +By default, `ggplot2` will drop levels that don't have any values.
You can force them to display with: ```{r race_plot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE} ggplot(gss_cat, aes(x = race)) + @@ -101,7 +101,7 @@ ggplot(gss_cat, aes(x = race)) + ## Modifying factor order -It’s often useful to change the order of the factor levels in a visualisation. +It's often useful to change the order of the factor levels in a visualisation. ```{r tv_hour, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE} relig_summary <- gss_cat %>% @@ -114,18 +114,18 @@ relig_summary <- gss_cat %>% ggplot(relig_summary, aes(x = tvhours, y = relig)) + geom_point() ``` -It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using `fct_reorder()`. `fct_reorder()` takes three arguments: +It is difficult to interpret this plot because there's no overall pattern. We can improve it by reordering the levels of the factor relig using `fct_reorder()`. `fct_reorder()` takes three arguments: - `f`, the factor whose levels you want to modify. - `x`, a numeric vector that you want to use to reorder the levels. -- Optionally, `fun`, a function that’s used if there are multiple values of `x` for each value of `f`. The default value is `median`. +- Optionally, `fun`, a function that's used if there are multiple values of `x` for each value of `f`. The default value is `median`. ```{r tv_hour_order, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE} ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) + geom_point() ``` -As you start making more complicated transformations, I’d recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as: +As you start making more complicated transformations, I would recommend moving them out of `aes()` and into a separate `mutate()` step. 
For example, you could rewrite the plot above as: ```{r tv_hour_order_mutate, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE} relig_summary %>% @@ -136,7 +136,7 @@ relig_summary %>% ## `fct_reorder2()` -Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend. +Another useful type of reordering applies when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend. ```{r fct_reorder2, eval=T, plot=T} by_age <- gss_cat %>% @@ -159,21 +159,21 @@ ggplot(by_age, aes(x = age, y = prop, colour = fct_reorder2(marital, age, prop)) ## Materials -There are lots of material online for R and more particularly on `tidyverse` and `Rstudio` +There is a lot of material online for R, and more particularly on the `tidyverse` and `RStudio`. You can find cheat sheets for all the packages of the `tidyverse` on this page: -[https://www.rstudio.com/resources/cheatsheets/](https://www.rstudio.com/resources/cheatsheets/) +[https://posit.co/resources/cheatsheets/](https://posit.co/resources/cheatsheets/) -The `Rstudio` websites are also a good place to learn more about R and the meta-package maintenained by the `Rstudio` community: +The `RStudio` websites are also a good place to learn more about R and the meta-package maintained by the `RStudio` community: -- [https://www.rstudio.com/resources/webinars/](https://www.rstudio.com/resources/webinars/) -- [https://www.rstudio.com/products/rpackages/](https://www.rstudio.com/products/rpackages/) +- [webinars](https://posit.co/resources/videos/) +- [R packages](https://posit.co/products/open-source/rpackages/) For example [rmarkdown](https://rmarkdown.rstudio.com/) is a great way to turn
your analyses into high quality documents, reports, presentations and dashboards: - - A comprehensive guide: [https://bookdown.org/yihui/rmarkdown/](https://bookdown.org/yihui/rmarkdown/) - - The cheatsheet [https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf) + - [a comprehensive guide](https://bookdown.org/yihui/rmarkdown/) + - [the cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf) -In addition most packages will provide **vignette**s on how to perform an analysis from scratch. On the [bioconductor.org](http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) website (specialised on R packages for biologists), you will have direct links to the packages vignette. +In addition, most packages provide **vignette**s on how to perform an analysis from scratch. On the [cran.r-project.org](https://cran.r-project.org/web/packages/ggplot2/index.html) or [bioconductor.org](http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) (specialised in R packages for biologists) websites, you will find direct links to a package's vignettes. -Finally, don't forget to search the web for your problems or error in R websites like [stackoverflow](https://stackoverflow.com/) contains high quality and well-curated answers. \ No newline at end of file +Finally, don't forget to search the web for your problems or errors in R; for instance, [stackoverflow](https://stackoverflow.com/) contains high-quality and well-curated answers. \ No newline at end of file