
Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Showing
with 19274 additions and 654 deletions
---
title: 'R.1: Installing packages from github'
author: "Carine Rey [carine.rey@ens-lyon.fr](mailto:carine.rey@ens-lyon.fr), Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
  rmdformats::downcute:
    self_contain: true
    use_bookdown: true
    default_style: "dark"
    lightbox: true
    css: "../src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
If you need to install a package that is not available on CRAN but is hosted in a GitHub repository, you can do so with the "remotes" package. This package provides functions that install a package directly from [github](https://github.com/), Bitbucket or GitLab onto your computer.
To use the "remotes" package, you must first install it:
```R
install.packages("remotes")
```
Once "remotes" is installed, you can install any R package from GitHub or from its URL.
For example, if you want to install the latest version of "gganimate", which allows you to animate ggplot2 graphs, you can use:
```R
remotes::install_github("thomasp85/gganimate")
```
By default, the latest version of the package is installed. If you want a given version, you can specify it:
```R
remotes::install_github("thomasp85/gganimate@v1.0.7")
```
You can find more information in the documentation of remotes: [https://remotes.r-lib.org](https://remotes.r-lib.org)
---
title: "R.2: introduction to Tidyverse"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
  rmdformats::downcute:
    self_contain: true
    use_bookdown: true
    default_style: "dark"
    lightbox: true
    css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nHélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
```{r download_data, include=FALSE, eval=FALSE}
```{r download_data, include=FALSE, eval=T}
library("tidyverse")
tmp <- tempfile(fileext = ".zip")
download.file("http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
......@@ -114,7 +108,7 @@ read_csv("data-raw/vehicles.csv") %>%
write_csv("mpg.csv")
```
# Introduction
## Introduction
In the last session, we went through the basics of R.
Instead of continuing to learn more about R programming, in this session we are going to jump directly to rendering plots.
......@@ -133,7 +127,7 @@ The objectives of this session will be to:
- Learn the different aesthetics in R plots
- Compose complex graphics
## Tidyverse
### Tidyverse
The `tidyverse` package is a collection of R packages designed for data science that includes `ggplot2`.
......@@ -154,7 +148,7 @@ Luckily for you, `tidyverse` is preinstalled on your Rstudio server. So you just
library("tidyverse")
```
## Toy data set `mpg`
### Toy data set `mpg`
This dataset contains a subset of the fuel economy data that the EPA makes available on [fueleconomy.gov](http://fueleconomy.gov).
It contains only models which had a new release every year between 1999 and 2008.
......@@ -170,9 +164,13 @@ For that we are going to use the command `read_csv` which is able to read a [csv
This command also works with a file URL.
```{r mpg_download, cache=TRUE, message=FALSE}
```{r mpg_download_local, cache=TRUE, message=FALSE, echo = F, include=F}
new_mpg <- read_csv("./mpg.csv")
```
```{r mpg_download, cache=TRUE, message=FALSE, eval = F}
new_mpg <- read_csv(
"http://perso.ens-lyon.fr/laurent.modolo/R/session_2/mpg.csv"
"https://can.gitbiopages.ens-lyon.fr/R_basis/session_2/mpg.csv"
)
```
......@@ -199,13 +197,13 @@ new_mpg
Here we can see that `new_mpg` is a `tibble`; we will come back to `tibble` later.
## New script
### New script
Like in the last session, instead of typing your commands directly in the console, you are going to write them in an R script.
![](./img/formationR_session2_scriptR.png)
# First plot with `ggplot2`
## First plot with `ggplot2`
We are going to make the simplest plot possible to study the relationship between two variables: the scatterplot.
......@@ -237,7 +235,15 @@ ggplot(data = <DATA>) +
What happens when you use only the command `ggplot(data = mpg)`?
</div>
<details><summary>Solution</summary>
<p>
```{r only_ggplot, cache = TRUE, fig.width=4.5, fig.height=2}
ggplot(data = new_mpg)
```
</p>
</details>
<div class="pencadre">
Make a scatterplot of `hwy` ( fuel efficiency ) vs. `cyl` ( number of cylinders ).
</div>
......@@ -251,12 +257,22 @@ ggplot(data = new_mpg, mapping = aes(x = hwy, y = cyl)) +
```
</p>
<div class="pencadre">
What seems to be the problem ?
</div>
<details><summary>Solution</summary>
<p>
Dots with the same coordinates are superposed.
</p>
</details>
</details>
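The overplotting noted in the solution above can also be checked numerically. The sketch below is only an illustration and assumes the `mpg` table shipped with `ggplot2` as a stand-in for `new_mpg`; it compares the number of rows with the number of distinct (`hwy`, `cyl`) positions.

```R
library(ggplot2)

# Assumption: ggplot2::mpg is used here instead of new_mpg
n_points <- nrow(mpg)

# Number of distinct (hwy, cyl) coordinates actually drawn
n_positions <- nrow(unique(mpg[, c("hwy", "cyl")]))

# Fewer positions than points means that dots are drawn on top of each other
n_positions < n_points
```

If the last line returns `TRUE`, several cars share the same coordinates and only one dot is visible for them.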
# Aesthetic mappings
## Aesthetic mappings
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. `ggplot2` will also add a legend that explains which levels correspond to which values.
......@@ -266,7 +282,7 @@ Try the following aesthetic:
- `alpha`
- `shape`
## `color` mapping
### `color` mapping
```{r new_mpg_plot_e, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
......@@ -274,21 +290,21 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
```
## `size` mapping
### `size` mapping
```{r new_mpg_plot_f, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, size = class)) +
geom_point()
```
## `alpha` mapping
### `alpha` mapping
```{r new_mpg_plot_g, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, alpha = class)) +
geom_point()
```
## `shape` mapping
### `shape` mapping
```{r new_mpg_plot_h, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, shape = class)) +
......@@ -325,7 +341,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
</p>
</details>
## Mapping a **continuous** variable to a color.
### Mapping a **continuous** variable to a color.
You can also map a continuous variable to a color
......@@ -347,7 +363,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = displ < 5)) +
</p>
</details>
# Facets
## Facets
You can create multiple plots at once by faceting. For this you can use the command `facet_wrap`.
This command takes a formula as input.
......@@ -362,7 +378,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
```
<div class="pencadre">
Now try to facet your plot by `fl + class`
Now try to facet your plot by `fuel + class`
</div>
......@@ -371,14 +387,14 @@ Now try to facet your plot by `fl + class`
Formulas allow you to express complex relationships between variables in R!
```{r new_mpg_plot_l, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fl + class, nrow = 2)
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fuel + class, nrow = 2)
```
</p>
</details>
# Composition
## Composition
There are different ways to represent the information :
......@@ -416,7 +432,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
\
We can use different `data` for different layers (you will learn more about `filter()` later)
We can use different `data` (here the new_mpg and mpg tables) for different layers (you will learn more about `filter()` later)
```{r new_mpg_plot_t, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
......@@ -424,14 +440,14 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = filter(mpg, class == "subcompact"))
```
# Challenge !
## Challenge !
## First challenge
### First challenge
<div class="pencadre">
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
</div>
```R
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
......@@ -441,71 +457,117 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
- What does the `se` argument to `geom_smooth()` do?
</div>
## Second challenge
<details><summary>Solution</summary>
<p>
```{r soluce_challenge_1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
  geom_point(show.legend = FALSE) +
  geom_smooth(se = FALSE)
```
</p>
</details>
### Second challenge
<div class="pencadre">
How does being a `2seater` car impact the engine size versus fuel efficiency relationship?
How does being a `Two Seaters` car (*class column*) impact the engine size (*displ column*) versus fuel efficiency (*hwy column*) relationship?
1. Make a plot of `hwy` as a function of `displ`
2. *Colorize* this plot in another color for the `Two Seaters` class
3. *Split* this plot for each *class*
Make a plot *colorizing* this information
</div>
<details><summary>Solution</summary>
<details><summary>Solution 1</summary>
<p>
```{r new_mpg_plot_color_2seater1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()
```
</p>
</details>
<details><summary>Solution 2</summary>
<p>
```{r new_mpg_plot_color_2seater, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
```{r new_mpg_plot_color_2seater2, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red")
```
</p>
</details>
<details><summary>Solution 3</summary>
<p>
```{r new_mpg_plot_color_2seater_facet, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red") +
  facet_wrap(~class)
```
</p>
</details>
<div class="pencadre">
Write a `function` called `plot_color_2seater` that takes the variable `mpg` as its sole argument and plots the same graph.
Write a `function` called `plot_color_a_class` that takes a class as its argument and plots the same graph for this class.
</div>
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_color_2seater_fx, cache = TRUE, fig.width=8, fig.height=4.5}
plot_color_2seater <- function(mpg) {
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
plot_color_a_class <- function(my_class) {
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == my_class), color = "red") +
facet_wrap(~class)
}
plot_color_2seater(mpg)
plot_color_a_class("Two Seaters")
plot_color_a_class("Compact Cars")
```
</p>
</details>
## Third challenge
### Third challenge
<div class="pencadre">
Recreate the R code necessary to generate the following graph
Recreate the R code necessary to generate the following graph (see "linetype" option of "geom_smooth")
</div>
```{r new_mpg_plot_u, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_v, eval=F}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
</p>
</details>
## See you in [R.3: Transformations with ggplot2](http://perso.ens-lyon.fr/laurent.modolo/R/session_3/)
### See you in [R.3: Transformations with ggplot2](/session_3/session_3.html)
# To go further: publication ready plots
## To go further: publication ready plots
Once you have created the graph you need for your publication, you have to save it.
You can do it with the the `ggsave` function.
You can do it with the `ggsave` function.
First save your plot in a variable :
......@@ -537,10 +599,17 @@ p1 + theme_minimal()
You may have to combine several plots, for that you can use the `cowplot` package which is a `ggplot2` extension.
First install it :
```{r, eval=F}
install.packages("cowplot")
```
```{r, include=F, echo =F}
if (! require("cowplot")) {
install.packages("cowplot")
}
```
Then you can use the function `plot_grid` to combine plots in a publication ready style:
```{r,message=FALSE}
......
---
title: 'R.3: Transformations with ggplot2'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
  rmdformats::downcute:
    self_contain: true
    use_bookdown: true
    default_style: "dark"
    lightbox: true
    css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
In the last session, we have seen how to use `ggplot2` and [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). The goal of this practical is to practice more advanced features of `ggplot2`.
......@@ -47,7 +37,7 @@ library("tidyverse")
Like in the previous sessions, it's good practice to create a new **.R** file to write your code instead of using the R terminal directly.
# `ggplot2` statistical transformations
## `ggplot2` statistical transformations
In the previous session, we have plotted the data as they are by using the variable values as **x** or **y** coordinates, color shade, size or transparency.
When dealing with categorical variables, also called **factors**, it can be interesting to perform some simple statistical transformations.
......@@ -67,7 +57,7 @@ We are going to use the `diamonds` data set included in `tidyverse`.
str(diamonds)
```
## Introduction to `geom_bar`
### Introduction to `geom_bar`
We have seen scatterplots (`geom_point()`) and smoothed curves (`geom_smooth()`).
Now let us make a barplot with `geom_bar()`:
......@@ -82,7 +72,7 @@ More diamonds are available with high quality cuts.
On the x-axis, the chart displays **cut**, a variable from diamonds. On the y-axis, it displays **count**, **but count is not a variable in diamonds!**
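The **count** shown on the y-axis is computed by `ggplot2` for you. As a rough sketch of what happens behind the scenes, the same numbers can be obtained by hand with `count()` from `dplyr` and drawn with `geom_col()`. The name `cut_counts` is only chosen here for illustration.

```R
library(tidyverse)

# Count the number of diamonds for each cut by hand
cut_counts <- diamonds %>%
  count(cut)

# geom_col() takes the bar heights from the precomputed column n
ggplot(data = cut_counts, mapping = aes(x = cut, y = n)) +
  geom_col()
```

The resulting plot should look like the `geom_bar()` one, except that the y-axis is labelled `n` instead of `count`.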
## **geom** and **stat**
### **geom** and **stat**
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
......@@ -98,7 +88,7 @@ ggplot(data = diamonds, mapping = aes(x = cut)) +
Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three reasons you might need to use a **stat** explicitly:
## Why **stat** ?
### Why **stat** ?
You might want to override the default stat.
For example, in the following `demo` dataset we already have a variable for the **counts** per `cut`.
......@@ -153,7 +143,7 @@ If group is not used, the proportion is calculated with respect to the data that
</p>
</details>
## More details with `stat_summary`
### More details with `stat_summary`
<div class="pencadre">
You might want to draw greater attention to the statistical transformation in your code.
......@@ -188,7 +178,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
</p>
</details>
# Coloring area plots
## Coloring area plots
<div class="pencadre">
You can color a bar chart using either the `color` aesthetic, or, more usefully `fill`:
......@@ -223,7 +213,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
</p>
</details>
# Position adjustments
## Position adjustments
The stacking of the `fill` parameter is performed by the position adjustment `position`
......@@ -295,7 +285,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
</p>
</details>
# Coordinate systems
## Coordinate systems
The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
......@@ -343,9 +333,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
By combining the right **geom**, **coordinates** and **faceting** functions, you can build a large number of different plots to present your results.
# See you in [R.4: data transformation](http://perso.ens-lyon.fr/laurent.modolo/R/session_4/)
## See you in [R.4: data transformation](/session_4/session_4.html)
# To go further: animated plots from xls files
## To go further: animated plots from xls files
In order to read information from an xls file, we will use the `openxlsx` package. To generate animations we will use the `gganimate` package. The additional `gifski` package will allow R to save your animation in the GIF format (Graphics Interchange Format).
......@@ -359,14 +349,23 @@ library(gifski)
```
<div class="pencardre">
Use the `openxlsx` package to load the [http://perso.ens-lyon.fr/laurent.modolo/R/session_3/gapminder.xlsx](http://perso.ens-lyon.fr/laurent.modolo/R/session_3/gapminder.xlsx) file into the `gapminder` variable
Use the `openxlsx` package to load the [https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx](https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx) file into the `gapminder` variable
</div>
<details><summary>Solution</summary>
<p>
Two solutions:
Use the URL directly
```{r load_xlsx_url, eval = F}
gapminder <- read.xlsx("https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx")
```
Download the file, save it in the same directory as your script, then use the local path
```{r load_xlsx}
gapminder <- read.xlsx("http://perso.ens-lyon.fr/laurent.modolo/R/session_3/gapminder.xlsx")
gapminder <- read.xlsx("gapminder.xlsx")
```
</p>
</details>
......@@ -417,7 +416,8 @@ For this we need to add a `transition_time` layer that will take as an argument
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10() +
transition_time(year)
transition_time(year) +
labs(title = 'Year: {as.integer(frame_time)}')
```
</p>
</details>
\ No newline at end of file
session_4/img/transform-logical.png

70.2 KiB | W: 0px | H: 0px

session_4/img/transform-logical.png

82.8 KiB | W: 0px | H: 0px

---
title: "R#5: Piping and grouping"
title: "R.5: Piping and grouping"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
  rmdformats::downcute:
    self_contain: true
    use_bookdown: true
    default_style: "dark"
    lightbox: true
    css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
The goal of this practical is to practice combining data transformation with `tidyverse`.
The objectives of this session will be to:
......@@ -47,7 +40,7 @@ library("nycflights13")
</p>
</details>
# Combining multiple operations with the pipe
## Combining multiple operations with the pipe
<div id="pencadre">
Find the 10 most delayed flights using a ranking function: `min_rank()`.
......@@ -78,8 +71,8 @@ Try to pipe operators to rewrite your precedent code with only **one** variable
<p>
```{r pipe_example_b, include=TRUE}
flights_md2 <- flights %>%
  mutate(most_delay = min_rank(desc(dep_delay))) %>%
  # ranks start at 1, so "<= 10" keeps the 10 most delayed flights
  filter(most_delay <= 10) %>%
  arrange(most_delay)
```
</p>
......@@ -89,12 +82,12 @@ Working with the pipe is one of the key criteria for belonging to the `tidyverse
The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:
## When not to use the pipe
### When not to use the pipe
- Your pipes are longer than (say) ten steps. In that case, create intermediate objects with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or splits the results.
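As a minimal sketch of the first point, the steps below are stored in intermediate objects instead of one long pipe. The table `toy_flights` and the object names are hypothetical and only serve the illustration.

```R
library(dplyr)

# Hypothetical toy table standing in for a larger data set
toy_flights <- tibble(
  carrier = c("AA", "AA", "UA", "UA"),
  dep_delay = c(5, NA, 30, 10)
)

# Each intermediate object has a meaningful name and can be inspected on its own
flights_no_na <- filter(toy_flights, !is.na(dep_delay))

delay_by_carrier <- flights_no_na %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay))

delay_by_carrier
```

Printing `flights_no_na` before going further is an easy way to check that the filtering step did what you expected.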
# Grouping variable
## Grouping variable
The `summarise()` function collapses a data frame to a single row.
Check the difference between `summarise()` and `mutate()` with the following commands:
......@@ -108,7 +101,7 @@ flights %>%
Where `mutate` repeats the `mean` of `dep_delay` on every row (which is not useful), `summarise` collapses the whole `dep_delay` column into a single `mean` value.
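The difference is easier to see on a very small table. The sketch below uses a hypothetical three-row tibble called `toy` instead of `flights`.

```R
library(dplyr)

# Hypothetical toy table, used only for illustration
toy <- tibble(dep_delay = c(10, 20, 30))

# mutate() keeps the three rows and repeats the mean on each of them
toy_mutate <- toy %>%
  mutate(delay = mean(dep_delay, na.rm = TRUE))

# summarise() collapses the table to a single row
toy_summarise <- toy %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

nrow(toy_mutate)
nrow(toy_summarise)
```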
## The power of `summarise()` with `group_by()`
### The power of `summarise()` with `group_by()`
The `group_by()` function changes the unit of analysis from the complete dataset to individual groups.
Individual groups are defined by categorical variables or **factors**.
......@@ -132,7 +125,7 @@ ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
Why did we `group_by` `year` and `month` and not only `year` ?
</div>
## Missing values
### Missing values
<div class="pencadre">
You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
......@@ -149,7 +142,7 @@ flights %>%
Aggregation functions obey the usual rule of missing values: **if there’s any missing value in the input, the output will be a missing value**.
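This rule can be checked on a small vector. The values in `delays` below are made up for the illustration.

```R
# Hypothetical delays with one missing value
delays <- c(10, NA, 20)

# The missing value propagates to the result
mean(delays)

# With na.rm = TRUE the missing value is removed before the computation
mean(delays, na.rm = TRUE)
```

The first call returns `NA`, while the second one returns the mean of the two known values.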
## Counts
### Counts
Whenever you do any aggregation, it’s always a good idea to include a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
......@@ -157,31 +150,29 @@ Whenever you do any aggregation, it’s always a good idea to include either a c
summ_delay_filghts <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(delay < 40 & delay > -20)
filter(avg_delay < 40 & avg_delay > -20)
ggplot(data = summ_delay_filghts, mapping = aes(x = dist, y = delay, size = count)) +
ggplot(summ_delay_filghts, mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
```
<div class="pencadre">
Imagine that we want to explore the relationship between the distance and average delay for each location and recreate the above figure.
Imagine that we want to explore the relationship between the average distance (`distance`) and average delay (`arr_delay`) for each location (`dest`) and recreate the above figure.
Here are the steps to prepare this data and draw the figure:
1. Group flights by destination.
2. Summarize to compute distance, average delay, and number of flights using `n()`.
3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
2. Summarize to compute average distance (`avg_distance`), average delay (`avg_delay`), and number of flights using `n()` (`n_flights`).
3. Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport.
4. Filter to remove noisy points with a delay greater than 40 or less than -20
5. Create a `mapping` on `dist`, `delay` and `count` as `size`.
6. Use the layer `geom_point()` and `geom_smooth()`
5. Create a `mapping` on `avg_distance`, `avg_delay` and `n_flights` as `size`.
6. Use the layer `geom_point()` and `geom_smooth()` (use method = lm)
7. We can hide the legend with the layer `theme(legend.position='none')`
</div>
......@@ -191,13 +182,13 @@ here are three steps to prepare this data:
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(delay < 40 & delay > -20) %>%
ggplot(mapping = aes(x = dist, y = delay, size = count)) +
filter(avg_delay < 40 & avg_delay > -20) %>%
ggplot(mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
......@@ -206,7 +197,7 @@ flights %>%
</details>
## Ungrouping
### Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
......@@ -221,19 +212,21 @@ flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
# Grouping challenges
## Grouping challenges
## First challenge
### First challenge
<div class="pencadre">
Look at the number of canceled flights per day. Is there a pattern?
(A canceled flight is a flight where the `dep_time` or the `arr_time` is `NA`)
**Remember to always try to decompose complex questions into smaller and simple problems**
- What are `canceled` flights?
- Who can I `canceled` flights?
- We need to define the day of the week `wday` variable (`strftime(x,'%A')` give you the name of the day from a POSIXct date).
- How can you create a `canceled` flights variable which will be TRUE if the flight is canceled or FALSE if not?
- We need to define the day of the week `wday` variable (Monday, Tuesday, ...). To do that, you can use `strftime(x,'%A')` to get the name of the day of a date `x` in the POSIXct format, as in the `time_hour` column. For example, `strftime("2013-01-01 05:00:00 EST",'%A')` returns "Tuesday".
- We can count the number of canceled flights (`cancel_day`) by day of the week (`wday`).
- We can pipe the transformed and filtered tibble into a `ggplot` function.
- We can use `geom_col` to have a barplot of the number of `cancel_day` for each `wday`.
......@@ -260,7 +253,7 @@ flights %>%
</p>
</details>
## Second challenge
### Second challenge
<div class="pencadre">
Is the proportion of canceled flights by day of the week related to the average departure delay?
......@@ -275,8 +268,8 @@ flights %>%
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
mutate(
prop_cancel_day = sum(canceled)/sum(!canceled),
summarise(
prop_cancel_day = sum(canceled)/n(),
av_delay = mean(dep_delay, na.rm = TRUE)
) %>%
ungroup() %>%
......@@ -353,7 +346,7 @@ flights %>%
</p>
</details>
## Third challenge
### Third challenge
<div class="pencadre">
Which carrier has the worst delays?
......@@ -375,7 +368,7 @@ flights %>%
</details>
<div class="pencadre">
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n())`)
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n=n())`)
</div>
<details><summary>Solution</summary>
......@@ -395,4 +388,4 @@ flights %>%
</p>
</details>
## See you in [R.6: tidydata](http://perso.ens-lyon.fr/laurent.modolo/R/session_6/)
### See you in [R.6: tidydata](/session_6/session_6.html)
session_6/img/overview_joins.png

50.5 KiB

session_6/img/overview_set.png

11.5 KiB

session_6/img/pivot_longer.png

21.1 KiB

session_6/img/pivot_wider.png

21.7 KiB

---
title: "R.6: tidydata"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
  rmdformats::downcute:
    self_contain: true
    use_bookdown: true
    default_style: "dark"
    lightbox: true
    css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nCarine Rey [carine.rey@ens-lyon.fr](mailto:carine.rey@ens-lyon.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
Until now we have worked with data already formatted in a *nice way*.
In the `tidyverse`, data formatted in a *nice way* are called **tidy**.
The goal of this practical is to understand how to transform an ugly blob of information into a **tidy** data set.
## Tidydata
### Tidydata
There are three interrelated rules which make a dataset tidy:
......@@ -54,13 +44,18 @@ library(tidyverse)
</p>
</details>
For this practical we are going to use the `table` dataset which demonstrates multiple ways to lay out the same tabular data.
For this practical we are going to use the `table` set of datasets which demonstrate multiple ways to lay out the same tabular data.
<div class="pencadre">
Use the help to know more about this dataset
Use the help to know more about the `table1` dataset
</div>
<details><summary>Solution</summary>
```{r}
?table1
```
<p>
`table1`, `table2`, `table3`, `table4a`, `table4b`, and `table5` all display the number of TB (Tuberculosis) cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.
......@@ -68,9 +63,44 @@ The data is a subset of the data contained in the World Health Organization Glob
</p>
</details>
# Pivoting data
## Pivoting data
### pivot longer
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_longer.png')
```
```{r, eval = F}
wide_example <- tibble(X1 = c("A","B"),
X2 = c(1,2),
X3 = c(0.1,0.2),
X4 = c(10,20))
```
If you have a wide dataset, such as `wide_example`, that you want to make longer, you will use the `pivot_longer()` function.
You have to specify the names of the columns you want to pivot into longer format (X2,X3,X4):
```{r, eval = F}
wide_example %>%
pivot_longer(c(X2,X3,X4))
```
... or the reverse selection (-X1):
```{r, eval = F}
wide_example %>% pivot_longer(-X1)
```
You can specify the names of the new columns that will hold the tidy data (by default, they are `name` and `value`):
```{r, eval = F}
long_example <- wide_example %>%
pivot_longer(-X1, names_to = "V1", values_to = "V2")
```
## pivot longer
#### Exercise
<div class="pencadre">
Visualize the `table4a` dataset (you can use the `View()` function).
......@@ -107,7 +137,23 @@ table4a %>%
</p>
</details>
## pivot wider
### pivot wider
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_wider.png')
```
If you have a long dataset that you want to make wider, you will use the `pivot_wider()` function.
You have to specify which column contains the names of the output columns (`names_from`), and which column contains the cell values (`values_from`).
```{r, eval = F}
long_example %>% pivot_wider(names_from = V1,
values_from = V2)
```
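As a quick check of how the two functions relate, the sketch below (an illustration only, reusing the `wide_example` table defined earlier; `round_trip` is a name introduced here) pivots a table to the long format and back, which should recover the original layout.

```r
library(tidyverse)

# Same toy table as in the pivot_longer() section
wide_example <- tibble(X1 = c("A", "B"),
                       X2 = c(1, 2),
                       X3 = c(0.1, 0.2),
                       X4 = c(10, 20))

# Wide -> long: one row per (X1, former column name) pair
long_example <- wide_example %>%
  pivot_longer(-X1, names_to = "V1", values_to = "V2")

# Long -> wide: the former column names become columns again
round_trip <- long_example %>%
  pivot_wider(names_from = V1, values_from = V2)

round_trip
```

Comparing `round_trip` with `wide_example` is a simple way to convince yourself that `pivot_wider()` undoes `pivot_longer()` in this case.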
#### Exercise
<div class="pencadre">
Visualize the `table2` dataset
......@@ -128,11 +174,13 @@ table2 %>%
</p>
</details>
# Merging data
## Merging data
## Relational data
### Relational data
Sometime the information can be split between different table
To avoid having a huge table and to save space, information is often split between different tables.
In our `flights` dataset, information about the `carrier` or the `airports` (origin and dest) is saved in separate tables (`airlines`, `airports`).
```{r airlines, eval=T, echo = T}
library(nycflights13)
......@@ -144,27 +192,40 @@ flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
```
## Relational data
### Relational schema
The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
```{r airlines_dag, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/relational-nycflights.png')
```
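One way to check whether a column can serve as a key is to count how many times each value appears: a key value should never appear more than once. The sketch below uses two small made-up tables (`planes_demo` and `flights_demo` are hypothetical and are not part of `nycflights13`).

```r
library(tidyverse)

# Hypothetical toy tables, for illustration only
planes_demo <- tibble(tailnum = c("N1", "N2", "N3"),
                      seats = c(100, 150, 200))
flights_demo <- tibble(tailnum = c("N1", "N1", "N2"),
                       dest = c("IAH", "MIA", "IAH"))

# Keep only the values that appear more than once
planes_dup <- planes_demo %>%
  count(tailnum) %>%
  filter(n > 1)

flights_dup <- flights_demo %>%
  count(tailnum) %>%
  filter(n > 1)

nrow(planes_dup)   # no duplicates: tailnum identifies each plane
nrow(flights_dup)  # "N1" appears twice: tailnum is not a key here
```

The same `count()` then `filter(n > 1)` pattern can be tried on the real tables before choosing the columns to join on.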
## joints
### Joins
If you have to combine data from 2 tables in a new table, you will use `joins`.
There are several types of joins depending on what you want to get.
```{r joints, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/join-venn.png')
```
## `inner_joint()`
Small concrete examples:
Matches pairs of observations whenever their keys are equal
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_joins.png')
```
#### `inner_join()`
keeps observations in `x` AND `y`
```{r inner_joint, eval=T}
flights2 %>%
inner_join(airlines)
```
## `left_joint()`
#### `left_join()`
keeps all observations in `x`
......@@ -173,7 +234,7 @@ flights2 %>%
left_join(airlines)
```
## `right_joint()`
#### `right_join()`
keeps all observations in `y`
......@@ -182,7 +243,7 @@ flights2 %>%
right_join(airlines)
```
## `full_joint()`
#### `full_join()`
keeps all observations in `x` and `y`
......@@ -191,43 +252,52 @@ flights2 %>%
full_join(airlines)
```
## Defining the key columns
### Defining the key columns
The default, `by = NULL`, uses all variables that appear in both tables, the so-called natural join.
```{r left_join_weather, eval=T}
```{r , eval=T}
flights2 %>%
left_join(weather)
```
## Defining the key columns
If the two tables contain columns with the same names but corresponding to different things (such as `year` in `flights2` and `planes`), you have to manually define the key or the keys.
The default, `by = NULL`, uses all variables that appear in both tables, the so called natural join.
```{r left_join_tailnum, eval=T, echo = T}
```{r , eval=T, echo = T}
flights2 %>%
left_join(planes, by = "tailnum")
```
## Defining the key columns
A named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
```{r left_join_airport, eval=T, echo = T}
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa"))
```
## Filtering joins
If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffixes `.x` and `.y` because all column names must be different in the output table.
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa")) %>%
left_join(airports, c("origin" = "faa"))
```
You can change the suffixes using the `suffix` argument:
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, by = c("dest" = "faa")) %>%
left_join(airports, by = c("origin" = "faa"), suffix = c(".dest",".origin"))
```
### Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
- `semi_join(x, y)` keeps all observations in `x` that have a match in `y`.
- `anti_join(x, y)` drops all observations in `x` that have a match in `y`.
## Filtering joins
```{r top_dest, eval=T, echo = T}
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
......@@ -236,7 +306,7 @@ flights %>%
semi_join(top_dest)
```
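To see the difference between the two filtering joins on a smaller scale, here is a minimal sketch with two made-up tables (`x`, `y` and their columns are hypothetical). Note that neither join adds the columns of `y` to the result.

```r
library(dplyr)

# Hypothetical toy tables sharing the column `key`
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# Rows of x that have a match in y (keys 1 and 2)
kept <- semi_join(x, y, by = "key")

# Rows of x that have no match in y (key 3)
dropped <- anti_join(x, y, by = "key")

kept
dropped
```

`anti_join()` is often useful to diagnose join mismatches, since it lists the rows that would not find a partner.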
## Set operations
### Set operations
These expect the x and y inputs to have the same variables, and treat the observations like sets:
......@@ -244,4 +314,8 @@ These expect the x and y inputs to have the same variables, and treat the observ
- `union(x, y)`: return unique observations in `x` and `y`.
- `setdiff(x, y)`: return observations in `x`, but not in `y`.
## See you in [R.7: String & RegExp](http://perso.ens-lyon.fr/laurent.modolo/R/session_7/)
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_set.png')
```
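The set operations can be tried on two small tables that have the same variables. This is only a sketch: `x` and `y` are made-up tables, and `dplyr` is loaded on its own so that its versions of `intersect()`, `union()` and `setdiff()` are the ones used.

```r
library(dplyr)

# Hypothetical toy tables with identical columns
x <- tibble(a = c(1, 2), b = c("u", "v"))
y <- tibble(a = c(2, 3), b = c("v", "w"))

in_both <- intersect(x, y)  # rows present in x and in y
in_either <- union(x, y)    # unique rows of x and y together
only_x <- setdiff(x, y)     # rows of x that are not in y

in_both
in_either
only_x
```

Here a row counts as shared only if all of its values are identical in both tables.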
### See you in [R.7: String & RegExp](/session_7/session_7.html)
---
title: '#7 String & RegExp'
title: "R.7: String & RegExp"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "08 Nov 2019"
always_allow_html: yes
output:
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
slidy_presentation:
highlight: tango
date: "2022"
---
```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
## String basics
## Introduction
In the previous sessions, we often overlooked a particular type of data, the **string**.
In R, a sequence of characters is stored as a string.
In this session you will learn the distinctive features of the string type and how we can use strings of characters within a programming language that is itself composed of particular strings of characters, such as function names and variables.
<div class="pencadre">
As usual we will need the `tidyverse` library.
</div>
<details><summary>Solution</summary>
<p>
```{r load_data, eval=T, message=F}
library(tidyverse)
```
</p>
</details>
## String basics
### String definition
A string can be defined within double `"` or single `'` quotes:
```{r string_def, eval=F, message=T}
string1 <- "This is a string"
string2 <- 'If I want to include a "quote"
inside a string, I use single quotes'
......@@ -37,38 +57,35 @@ If you forget to close a quote, you’ll see +, the continuation character:
+ HELP I'M STUCK
```
If this happen to you, press Escape and try again!
If this happens to you, press `Escape` and try again!
## String basics
To include a literal single or double quote in a string you can use \ to “escape” it:
To include a literal single or double quote in a string you can use \\ to *escape* it:
```
```{r string_def_escape, eval=F, message=T}
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
```
if you want to include a literal backslash, you’ll need to double it up: `"\\"`.
If you want to include a literal backslash, you’ll need to double it up: `"\\"`.
## String basics
### String representation
the printed representation of a string is not the same as string itself
The printed representation of a string is not the same as the string itself:
```
```{r string_rep_escape_a, eval=T, message=T}
x <- c("\"", "\\")
x
#> [1] "\"" "\\"
```
```{r string_rep_escape_b, eval=T, message=T}
writeLines(x)
#> "
#> \
```
## String basics
Some characters have a special representation, they are called **special characters**.
The most common are `"\n"`, newline, and `"\t"`, tabulation, but you can see the complete list by requesting help on `"`: `?'"'`
Special characters:
### String operation
The most common are `"\n"`, newline, and `"\t"`, tab, but you can see the complete list by requesting help on `"`: `?'"'`
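As a small illustration of special characters (a sketch with a made-up string `x`), compare the printed representation of a string containing a tab and a newline with what `writeLines()` displays:

```r
library(stringr)

# A made-up string containing two tabs and one newline
x <- "col1\tcol2\nval1\tval2"

x              # printed representation: the escapes \t and \n are shown
writeLines(x)  # raw content: the tabs and the newline are rendered

str_length(x)  # each special character counts as a single character
```

Even though `"\t"` is typed with two symbols, it is stored as one character, which is why the length is smaller than the number of symbols typed.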
## String basics
You can perform basic operations on strings, such as:
- String length
......@@ -87,9 +104,8 @@ x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
```
## String basics
- Subsetting strings
negative numbers count backwards from end
negative numbers count backwards from the end
```{r str_sub2, eval=T, message=FALSE, cache=T}
str_sub(x, -3, -1)
```
......@@ -106,11 +122,30 @@ str_sort(x)
## Matching patterns with regular expressions
Regexps are a very terse language that allow you to describe patterns in strings.
Regexps are a very terse language that allows you to describe patterns in strings.
To learn regular expressions, we’ll use `str_view()` and `str_view_all()`. These functions take a character vector and a regular expression, and show you how they match.
## Matching patterns with regular expressions
<div class="pencadre">
You need to install the `htmlwidgets` package to use these functions
</div>
<details><summary>Solution</summary>
<p>
```{r install_htmlwidgets, eval=T, message=F, include=F, echo=F}
if (! require("htmlwidgets")) {
install.packages("htmlwidgets")
}
```
```{r load_htmlwidgets, eval=T, message=F}
library(htmlwidgets)
```
</p>
</details>
The most basic regular expression is the exact match.
```{r str_view, eval=T, message=FALSE, cache=T}
x <- c("apple", "banana", "pear")
......@@ -124,12 +159,14 @@ x <- c("apple", "banana", "pear")
str_view(x, ".a.")
```
But if “`.`” matches any character, how do you match the character “`.`”?
You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behavior.
## Matching patterns with regular expressions
But if “`.`” matches any character, how do you match the character “`.`”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, `\`, to escape special behaviour. So to match an ., you need the regexp `\.`. Unfortunately this creates a problem. We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings. So to create the regular expression `\.` we need the string "`\\.`".
Like strings, regexps use the backslash, `\`, to escape special behaviour.
So to match a `.`, you need the regexp `\.`. Unfortunately this creates a problem.
## Matching patterns with regular expressions
We use strings to represent regular expressions, and `\` is also used as an escape symbol in strings.
So to create the regular expression `\.` we need the string "`\\.`".
```{r str_viewdotescape, eval=T, message=FALSE, cache=T}
dot <- "\\."
......@@ -137,12 +174,7 @@ writeLines(dot)
str_view(c("abc", "a.c", "bef"), "a\\.c")
```
## Matching patterns with regular expressions
If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write "`\\\\`" — you need four backslashes to match one!
## Matching patterns with regular expressions
If `\` is used as an escape character in regular expressions, how do you match a literal `\`? Well, you need to escape it, creating the regular expression `\\`. To create that regular expression, you need to use a string, which also needs to escape `\`. That means to match a literal `\` you need to write "`\\\\`" — you need four backslashes to match one!
```{r str_viewbackslashescape, eval=T, message=FALSE, cache=T}
x <- "a\\b"
......@@ -150,100 +182,89 @@ writeLines(x)
str_view(x, "\\\\")
```
## Exercises
### Exercises
- Explain why each of these strings don’t match a \: "`\`", "`\\`", "`\\\`".
- Explain why each of these strings doesn’t match a \: "`\`", "`\\`", "`\\\`".
- How would you match the sequence `"'\`?
- What patterns will the regular expression `\..\..\..` match? How would you represent it as a string?
## Anchors
### Anchors
- `^` match the start of the string.
- `$` match the end of the string.
Until now we searched for patterns anywhere in the target string. But we can use anchors to be more precise.
- `^` Match the start of the string.
- `$` Match the end of the string.
```{r str_viewanchors, eval=T, cache=T}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
```
## Anchors
- `^` match the start of the string.
- `$` match the end of the string.
```{r str_viewanchorsend, eval=T, cache=T}
str_view(x, "a$")
```
## Anchors
- `^` match the start of the string.
- `$` match the end of the string.
```{r str_viewanchorsstartend, eval=T, cache=T}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "^apple$")
```
## Exercices
### Exercises
- How would you match the literal string `"$^$"`?
- Given the corpus of common words in `stringr::words`, create regular expressions that find all words that:
- Start with “y”.
- End with “x”.
- Are exactly three letters long. (Don’t cheat by using `str_length()`!)
- Have seven letters or more.
Since this list is long, you might want to use the match argument to str_view() to show only the matching or non-matching words.
Since this list is long, you might want to use the match argument to `str_view()` to show only the matching or non-matching words.
## Character classes and alternatives
### Character classes and alternatives
In regular expressions, we have special characters and patterns that match groups of characters.
- `\d`: matches any digit.
- `\s`: matches any whitespace (e.g. space, tab, newline).
- `[abc]`: matches a, b, or c.
- `[^abc]`: matches anything except a, b, or c.
```
```{r str_viewanchorsstartend_b, eval=T, cache=T}
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")
```
## Character classes and alternatives
You can use alternations to pick between one or more alternative patterns. For example, `abc|d..f` will match either `abc`, or `deaf`. Note that the precedent for `|` is low, so that `abc|xyz` matches `abc` or `xyz` not `abcyz` or `abxyz`. Like with mathematical expressions, if presidents ever get confusing, use parentheses to make it clear what you want:
You can use alternation to pick between one or more alternative patterns. For example, `abc|d..f` will match either `"abc"` or `"deaf"`. Note that the precedence for `|` is low, so that `abc|xyz` matches `abc` or `xyz`, not `abcyz` or `abxyz`. Like with mathematical expressions, if precedence ever gets confusing, use parentheses to make it clear what you want:
```
```{r str_viewanchorsstartend_c, eval=T, cache=T}
str_view(c("grey", "gray"), "gr(e|a)y")
```
## Exercices
### Exercises
Create regular expressions to find all words that:
- Start with a vowel.
- That only contain consonants. (Hint: thinking about matching “not”-vowels.)
- Only contain consonants. (Hint: think about matching “not”-vowels.)
- End with ed, but not with eed.
- End with ing or ise.
## Repetition
### Repetition
Now that you know how to search for groups of characters you can define the number of times you want to see them.
- `?`: 0 or 1
- `+`: 1 or more
- `*`: 0 or more
```
```{r str_view_repetition, eval=T, cache=T}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, 'C[LX]+')
```
## Repetition
You can also specify the number of matches precisely:
- `{n}`: exactly n
......@@ -251,13 +272,13 @@ You can also specify the number of matches precisely:
- `{,m}`: at most m
- `{n,m}`: between n and m
```
```{r str_view_repetition_b, eval=T, cache=T}
str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
```
## Exercices
### Exercises
- Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
- `^.*$`
......@@ -270,17 +291,15 @@ str_view(x, "C{2,3}")
- Have two or more vowel-consonant pairs in a row.
## Grouping
### Grouping
You learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like `\1`, `\2` etc.
You learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with back references, like `\1`, `\2` etc.
```
```{r str_view_grouping, eval=T, cache=T}
str_view(fruit, "(..)\\1", match = TRUE)
```
## Exercices
### Exercises
- Describe, in words, what these expressions will match:
- `"(.)\1\1"`
......@@ -293,39 +312,41 @@ str_view(fruit, "(..)\\1", match = TRUE)
- Contain a repeated pair of letters (e.g. `“church”` contains `“ch”` repeated twice.)
- Contain one letter repeated in at least three places (e.g. `“eleven”` contains three `“e”`s.)
## Detect matches
### Detect matches
```
```{r str_view_match, eval=T, cache=T}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```
How many common words start with t?
```
```{r str_view_match_b, eval=T, cache=T}
sum(str_detect(words, "^t"))
```
What proportion of common words end with a vowel?
What proportion of common words ends with a vowel?
```
```{r str_view_match_c, eval=T, cache=T}
mean(str_detect(words, "[aeiou]$"))
```
## Combining detection
### Combining detection
Find all words containing at least one vowel, and negate
```
```{r str_view_detection, eval=T, cache=T}
no_vowels_1 <- !str_detect(words, "[aeiou]")
```
Find all words consisting only of consonants (non-vowels)
```
```{r str_view_detection_b, eval=T, cache=T}
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
```
## With tibble
### With tibble
```{r str_detecttibble, eval=T, cache=T}
df <- tibble(
......@@ -336,7 +357,7 @@ df %>%
filter(str_detect(word, "x$"))
```
## Extract matches
### Extract matches
```{r str_sentences, eval=T, cache=T}
head(sentences)
......@@ -350,7 +371,7 @@ colour_match <- str_c(colours, collapse = "|")
colour_match
```
## Extract matches
### Extract matches
We can select the sentences that contain a colour, and then extract the colour to figure out which one it is:
......@@ -360,7 +381,7 @@ matches <- str_extract(has_colour, colour_match)
head(matches)
```
## Grouped matches
### Grouped matches
Imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the”.
......@@ -373,8 +394,6 @@ has_noun %>%
str_extract(noun)
```
## Grouped matches
`str_extract()` gives us the complete match; `str_match()` gives each individual component.
```{r noun_regex_match, eval=T, cache=T}
......@@ -382,13 +401,13 @@ has_noun %>%
str_match(noun)
```
## Exercises
### Exercises
- Find all words that come after a number like one, two, three etc. Pull out both the number and the word.
- Find all words that come after a `number` like `one`, `two`, `three` etc. Pull out both the number and the word.
## Replacing matches
### Replacing matches
Instead of replacing with a fixed string you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words.
Instead of replacing with a fixed string, you can use back references to insert components of the match. In the following code, I flip the order of the second and third words.
```{r replacing_matches, eval=T, cache=T}
sentences %>%
......@@ -396,16 +415,18 @@ sentences %>%
head(5)
```
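The same idea can be tried on a shorter input. In this sketch (the strings and the name `swapped` are made up for illustration), each pair of parentheses creates a capturing group, and `\\1` and `\\2` in the replacement refer to what the first and second groups matched:

```r
library(stringr)

# Swap the two words of each string using back references
swapped <- str_replace(c("hello world", "apple pie"),
                       "(\\w+) (\\w+)",
                       "\\2 \\1")
swapped
#> [1] "world hello" "pie apple"
```

As with regular expressions themselves, the backslash has to be doubled because the replacement is written as an R string.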
## Exercices
### Exercises
- Replace all forward slashes in a string with backslashes.
- Implement a simple version of `str_to_lower()` using `str_replace_all()`.
- Switch the first and last letters in words. Which of those strings are still words?
## Splitting
### Splitting
```{r splitting, eval=T, cache=T}
sentences %>%
head(5) %>%
str_split("\\s")
```
\ No newline at end of file
```
### See you in [R.8: Factors](/session_8/session_8.html)
\ No newline at end of file
FROM rocker/tidyverse
RUN apt-get update \
&& apt-get install -y \
libxt6 \
cargo
RUN Rscript -e "install.packages('rmdformats')"
#session 1
RUN Rscript -e "install.packages('rvest')"
RUN Rscript -e "install.packages('remotes'); remotes::install_github('rlesur/klippy')"
#session 3
RUN Rscript -e "install.packages('gganimate')"
RUN Rscript -e "install.packages('gifski')"
RUN Rscript -e "install.packages('openxlsx')"
#session4
RUN Rscript -e "install.packages(c('ghibli', 'nycflights13','viridis','ggrepel'))"
\ No newline at end of file
#!/bin/bash
set -euo pipefail +o nounset
TAG="v2022"
IMAGE_NAME="r_for_beginners"
DOCKERFILE_DIR="."
REPO=carinerey/$IMAGE_NAME
echo "## Build docker: $REPO:$TAG ##"
docker build -t $REPO:$TAG $DOCKERFILE_DIR
echo "## Build docker: $REPO ##"
docker build -t $REPO $DOCKERFILE_DIR
if [[ $1 == "push_yes" ]]
then
echo "## Push docker ##"
docker push $REPO:$TAG
docker push $REPO
fi
#! /usr/bin/bash
# USAGE
# wget https://gitbio.ens-lyon.fr/can/R_basis/-/raw/main/src/create_users_from_user_list_csv.sh
# upload r_user_list_<day_number>_<day>.csv from your computer to the rstudio server
# sudo bash create_users_from_user_list_csv.sh r_user_list_<day_number>_<day>.csv
USER_PASSWORD_FILENAME=$@
while IFS=';' read -r NAME SURNAME EMAIL LAB COMMENT STATUS USERNAME PASSWD ; do
if [[ $EMAIL =~ "@" ]]
then
echo "=========================================="
echo user: $NAME $SURNAME $EMAIL $LAB
echo r_login: $USERNAME
echo r_passwd: $PASSWD
adduser ${USERNAME} --gecos 'First Last,RoomNumber,WorkPhone,HomePhone' --disabled-password --force-badname > /dev/null
echo "${USERNAME}:${PASSWD}" | chpasswd > /dev/null
fi
done < $USER_PASSWORD_FILENAME
echo "=========================================="
#! /usr/bin/bash
# USAGE
# wget https://gitbio.ens-lyon.fr/can/R_basis/-/raw/master/src/create_users_from_user_pwd_list.sh
# upload X_user_pwd_list.tsv from your computer to the rstudio server
# sudo bash create_users_from_user_pwd_list.sh X_user_pwd_list.tsv
USER_PASSWORD_FILENAME=$@
while IFS=$'\t' read -r GROUPE NAME SURNAME MAIL LOGIN_CBP PASSWD_CBP LABO R_USERNAME R_PASSWD ; do
if [[ $MAIL =~ "@" ]]
then
echo "=========================================="
echo user: $NAME $SURNAME $MAIL $LABO group:$GROUPE
if ! [[ $GROUPE =~ "L" ]]
then
echo computer_login: $LOGIN_CBP
echo computer_passwd: $PASSWD_CBP
else
echo computer_login: "TP"
echo computer_passwd:
fi
echo r_login: $R_USERNAME
echo r_passwd: $R_PASSWD
adduser ${R_USERNAME} --gecos 'First Last,RoomNumber,WorkPhone,HomePhone' --disabled-password --force-badname > /dev/null
echo "${R_USERNAME}:${R_PASSWD}" | chpasswd > /dev/null
fi
done < $USER_PASSWORD_FILENAME
echo "=========================================="
\ No newline at end of file