Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Showing with 20196 additions and 571 deletions
---
title: 'R.1: Installing packages from Bioconductor'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: false
use_bookdown: true
default_style: "dark"
lightbox: true
css: "../src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
To install packages from [Bioconductor](http://www.bioconductor.org) you need to
\ No newline at end of file
---
title: 'R.1: Installing packages from github'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: false
use_bookdown: true
default_style: "dark"
lightbox: true
css: "../src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
To install packages from [github](https://github.com/) you need to
\ No newline at end of file
This diff is collapsed.
---
title: "R.2: introduction to Tidyverse"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "dark"
lightbox: true
css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nHélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
```{r download_data, include=FALSE, eval=FALSE}
```{r download_data, include=FALSE, eval=T}
library("tidyverse")
tmp <- tempfile(fileext = ".zip")
download.file("http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
......@@ -114,7 +108,7 @@ read_csv("data-raw/vehicles.csv") %>%
write_csv("mpg.csv")
```
# Introduction
## Introduction
In the last session, we went through the basics of R.
Instead of continuing to learn more about R programming, in this session we are going to jump directly to rendering plots.
......@@ -122,7 +116,7 @@ Instead of continuing to learn more about R programming, in this session we are
We make this choice for three reasons:
- Rendering nice plots is directly rewarding
- You will be able to apply what you learn in this session to your own data (given that they are *correctly formated*)
- You will be able to apply what you learn in this session to your own data (given that they are *correctly formatted*)
- We will come back to R programming later, when you have all the necessary tools to visualize your results.
......@@ -133,7 +127,7 @@ The objectives of this session will be to:
- Learn the different aesthetics in R plots
- Compose complex graphics
## Tidyverse
### Tidyverse
The `tidyverse` package is a collection of R packages designed for data science that includes `ggplot2`.
......@@ -154,7 +148,7 @@ Luckily for you, `tidyverse` is preinstalled on your Rstudio server. So you just
library("tidyverse")
```
## Toy data set `mpg`
### Toy data set `mpg`
This dataset contains a subset of the fuel economy data that the EPA makes available on [fueleconomy.gov](http://fueleconomy.gov).
It contains only models which had a new release every year between 1999 and 2008.
......@@ -170,9 +164,13 @@ For that we are going to use the command `read_csv` which is able to read a [csv
This command also works with file URLs.
```{r mpg_download, cache=TRUE, message=FALSE}
```{r mpg_download_local, cache=TRUE, message=FALSE, echo = F, include=F}
new_mpg <- read_csv("./mpg.csv")
```
```{r mpg_download, cache=TRUE, message=FALSE, eval = F}
new_mpg <- read_csv(
"http://perso.ens-lyon.fr/laurent.modolo/R/mpg.csv"
"https://can.gitbiopages.ens-lyon.fr/R_basis/session_2/mpg.csv"
)
```
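As a self-contained sketch of how `read_csv` behaves, the snippet below writes a tiny table to a temporary file and reads it back. The `toy` table and its values are made up for illustration and are not part of the course data; `show_col_types = FALSE` only silences the column summary and assumes a recent version of `readr`.

```r
library(readr)

# Hypothetical two-row table, written to a temporary csv file
tmp_csv <- tempfile(fileext = ".csv")
write_csv(data.frame(displ = c(1.8, 2.0), hwy = c(29, 31)), tmp_csv)

# read_csv() returns a tibble, whether the path is a local file or a URL
toy <- read_csv(tmp_csv, show_col_types = FALSE)
toy
```

The same call should work when the path is replaced by the URL of a csv file.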
......@@ -199,13 +197,13 @@ new_mpg
Here we can see that `new_mpg` is a `tibble`. We will come back to `tibble` later.
## New script
### New script
Like in the last session, instead of typing your commands directly in the console, you are going to write them in an R script.
![](./img/formationR_session2_scriptR.png)
# First plot with `ggplot2`
## First plot with `ggplot2`
We are going to make the simplest plot possible to study the relationship between two variables: the scatterplot.
......@@ -237,7 +235,15 @@ ggplot(data = <DATA>) +
What happens when you use only the command `ggplot(data = mpg)` ?
</div>
<details><summary>Solution</summary>
<p>
```{r only_ggplot, cache = TRUE, fig.width=4.5, fig.height=2}
ggplot(data = new_mpg)
```
</p>
</details>
<div class="pencadre">
Make a scatterplot of `hwy` ( fuel efficiency ) vs. `cyl` ( number of cylinders ).
</div>
......@@ -251,12 +257,22 @@ ggplot(data = new_mpg, mapping = aes(x = hwy, y = cyl)) +
```
</p>
<div class="pencadre">
What seems to be the problem ?
</div>
<details><summary>Solution</summary>
<p>
Dots with the same coordinates are superimposed.
</p>
</details>
</details>
# Aesthetic mappings
## Aesthetic mappings
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. `ggplot2` will also add a legend that explains which levels correspond to which values.
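To see scaling at work, the sketch below maps `class` to `color` and then inspects the values that `ggplot2` computed for the layer. It uses the `mpg` table shipped with `ggplot2` rather than `new_mpg` (an assumption made to keep the example self-contained), and `layer_data()` to retrieve the computed colours.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# One colour is assigned per unique value of the mapped variable
n_classes <- length(unique(mpg$class))
n_colours <- length(unique(layer_data(p)$colour))
n_classes == n_colours
```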
......@@ -266,7 +282,7 @@ Try the following aesthetic:
- `alpha`
- `shape`
## `color` mapping
### `color` mapping
```{r new_mpg_plot_e, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
......@@ -274,21 +290,21 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
```
## `size` mapping
### `size` mapping
```{r new_mpg_plot_f, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, size = class)) +
geom_point()
```
## `alpha` mapping
### `alpha` mapping
```{r new_mpg_plot_g, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, alpha = class)) +
geom_point()
```
## `shape` mapping
### `shape` mapping
```{r new_mpg_plot_h, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, shape = class)) +
......@@ -325,7 +341,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
</p>
</details>
## Mapping a **continuous** variable to a color.
### Mapping a **continuous** variable to a color.
You can also map a continuous variable to a color.
......@@ -347,7 +363,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = displ < 5)) +
</p>
</details>
# Facets
## Facets
You can create multiple plots at once by faceting. For this you can use the command `facet_wrap`.
This command takes a formula as input.
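A formula is an ordinary R object, which the following sketch tries to make visible. It uses the `mpg` table shipped with `ggplot2` (an assumption, so that the example runs on its own), and the name `facet_formula` is purely illustrative.

```r
library(ggplot2)

# A one-sided formula lists the variables used to split the plot
facet_formula <- ~ drv + class
class(facet_formula)
all.vars(facet_formula)

# The same object can be passed to facet_wrap()
p_facets <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(facet_formula, nrow = 2)
```

Printing `p_facets` draws one panel per combination of `drv` and `class` present in the data.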
......@@ -362,7 +378,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
```
<div class="pencadre">
Now try to facet your plot by `fl + class`
Now try to facet your plot by `fuel + class`
</div>
......@@ -371,14 +387,14 @@ Now try to facet your plot by `fl + class`
Formulas allow you to express complex relationships between variables in R !
```{r new_mpg_plot_l, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fl + class, nrow = 2)
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fuel + class, nrow = 2)
```
</p>
</details>
# Composition
## Composition
There are different ways to represent the information :
......@@ -416,7 +432,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
\
We can use different `data` for different layers (you will lean more on `filter()` later)
We can use different `data` (here new_mpg and mpg tables) for different layers (you will learn more on `filter()` later)
```{r new_mpg_plot_t, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
......@@ -424,14 +440,14 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = filter(mpg, class == "subcompact"))
```
# Challenge !
## Challenge !
## First challenge
### First challenge
<div class="pencadre">
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
</div>
```R
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
......@@ -441,61 +457,231 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
- What does the `se` argument to `geom_smooth()` do?
</div>
## Second challenge
<details><summary>Solution</summary>
<p>
```{r soluce_challenge_1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
</p>
</details>
### Second challenge
<div class="pencadre">
How being a `2seater` car impact the engine size versus fuel efficiency relationship ?
How does being a `Two Seaters` car (*class column*) impact the engine size (*displ column*) versus fuel efficiency relationship (*hwy column*) ?
1. Make a plot of `hwy` as a function of `displ`
2. *Colorize* this plot in another color for `Two Seaters` class
3. *Split* this plot for each *class*
Make a plot *colorizing* this information
</div>
<details><summary>Solution</summary>
<details><summary>Solution 1</summary>
<p>
```{r new_mpg_plot_color_2seater1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()
```
</p>
</details>
<details><summary>Solution 2</summary>
<p>
```{r new_mpg_plot_color_2seater, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
```{r new_mpg_plot_color_2seater2, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red")
```
</p>
</details>
<details><summary>Solution 3</summary>
<p>
```{r new_mpg_plot_color_2seater_facet, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red") +
facet_wrap(~class)
```
</p>
</details>
<div class="pencadre">
Write a `function` called `plot_color_2seater` that can take as sol argument the variable `mpg` and plot the same graph.
Write a `function` called `plot_color_a_class` that takes a class as its argument and plots the same graph for this class.
</div>
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_color_2seater_fx, cache = TRUE, fig.width=8, fig.height=4.5}
plot_color_2seater <- function(mpg) {
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
plot_color_a_class <- function(my_class) {
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == my_class), color = "red") +
facet_wrap(~class)
}
plot_color_2seater(mpg)
plot_color_a_class("Two Seaters")
plot_color_a_class("Compact Cars")
```
</p>
</details>
## Third challenge
### Third challenge
<div class="pencadre">
Recreate the R code necessary to generate the following graph
Recreate the R code necessary to generate the following graph (see "linetype" option of "geom_smooth")
</div>
```{r new_mpg_plot_u, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_v, eval=F}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
</p>
</details>
\ No newline at end of file
</details>
### See you in [R.3: Transformations with ggplot2](/session_3/session_3.html)
## To go further: publication ready plots
Once you have created the graph you need for your publication, you have to save it.
You can do it with the `ggsave` function.
First save your plot in a variable :
```{r}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()
```
Then save it in the wanted format:
```{r, eval=F}
ggsave("test_plot_1.png", p1, width = 12, height = 8, units = "cm")
```
```{r, eval=F}
ggsave("test_plot_1.pdf", p1, width = 12, height = 8, units = "cm")
```
You may also change the appearance of your plot by adding a `theme` layer to your plot:
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 + theme_bw()
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 + theme_minimal()
```
You may have to combine several plots, for that you can use the `cowplot` package which is a `ggplot2` extension.
First install it :
```{r, eval=F}
install.packages("cowplot")
```
```{r, include=F, echo =F}
if (! require("cowplot")) {
install.packages("cowplot")
}
```
Then you can use the function `plot_grid` to combine plots in a publication ready style:
```{r,message=FALSE}
library(cowplot)
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 <- ggplot(data = new_mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
p1
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
p2
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)
```
You can also save it in a file.
```{r, eval=F}
p_final <- plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)
ggsave("test_plot_1_and_2.png", p_final, width = 20, height = 8, units = "cm")
```
You can learn more about `cowplot` features on [its website](https://wilkelab.org/cowplot/articles/introduction.html).
<div class="pencadre">
Use the `cowplot` documentation to reproduce this plot and save it.
</div>
```{r, echo=F}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() + theme_bw()
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_point() + theme_bw()
p_row <- plot_grid(p1 + theme(legend.position = "none"), p2 + theme(legend.position = "none"), labels = c('A', 'B'), label_size = 12)
p_legend <- get_legend(p1 + theme(legend.position = "top"))
plot_grid(p_row, p_legend, nrow = 2, rel_heights = c(1,0.2))
```
<details><summary>Solution</summary>
<p>
```{r , echo = TRUE, eval = F}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() + theme_bw()
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_point() + theme_bw()
p_row <- plot_grid(p1 + theme(legend.position = "none"), p2 + theme(legend.position = "none"), labels = c('A', 'B'), label_size = 12)
p_legend <- get_legend(p1 + theme(legend.position = "top"))
p_final <- plot_grid(p_row, p_legend, nrow = 2, rel_heights = c(1,0.2))
p_final
```
```{r , echo = TRUE, eval = F}
ggsave("plot_1_2_and_legend.png", p_final, width = 20, height = 8, units = "cm")
```
</p>
</details>
There are a lot of other available `ggplot2` extensions which can be useful (and also beautiful).
You can take a look at them here: [ggplot2 gallery](https://exts.ggplot2.tidyverse.org/gallery/)
File added
---
title: 'R.3: Transformations with ggplot2'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "dark"
lightbox: true
css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
In the last session, we have seen how to use `ggplot2` and [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). The goal of this practical is to practice more advanced features of `ggplot2`.
......@@ -47,7 +37,7 @@ library("tidyverse")
Like in the previous sessions, it's good practice to create a new **.R** file to write your code instead of using the R terminal directly.
# `ggplot2` statistical transformations
## `ggplot2` statistical transformations
In the previous session, we have plotted the data as they are by using the variable values as **x** or **y** coordinates, color shade, size or transparency.
When dealing with categorical variables, also called **factors**, it can be interesting to perform some simple statistical transformations.
......@@ -67,7 +57,7 @@ We are going to use the `diamonds` data set included in `tidyverse`.
str(diamonds)
```
## Introduction to `geom_bar`
### Introduction to `geom_bar`
We have seen scatterplots (`geom_point()`) and smoothed plots (`geom_smooth()`).
Now let's look at barplots with `geom_bar()` :
......@@ -82,7 +72,7 @@ More diamonds are available with high quality cuts.
On the x-axis, the chart displays **cut**, a variable from diamonds. On the y-axis, it displays **count**, **but count is not a variable in diamonds!**
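The sketch below shows where **count** comes from: it is computed by the layer, not read from the data. `layer_data()` is used here to retrieve the values computed for the bars, and the names `p_bar` and `computed` are illustrative.

```r
library(ggplot2)

p_bar <- ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()

# Values computed by the layer: one row per cut, with a new `count` column
computed <- layer_data(p_bar)
computed[, c("x", "count")]

# The same counts obtained directly from the raw data
as.vector(table(diamonds$cut))
```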
## **geom** and **stat**
### **geom** and **stat**
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
......@@ -98,7 +88,7 @@ ggplot(data = diamonds, mapping = aes(x = cut)) +
Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three reasons you might need to use a **stat** explicitly:
## Why **stat** ?
### Why **stat** ?
You might want to override the default stat.
For example, in the following `demo` dataset we already have a variable for the **counts** per `cut`.
......@@ -153,7 +143,7 @@ If group is not used, the proportion is calculated with respect to the data that
</p>
</details>
## More details with `stat_summary`
### More details with `stat_summary`
<div class="pencadre">
You might want to draw greater attention to the statistical transformation in your code.
......@@ -188,7 +178,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
</p>
</details>
# Coloring area plots
## Coloring area plots
<div class="pencadre">
You can color a bar chart using either the `color` aesthetic, or, more usefully `fill`:
......@@ -223,7 +213,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
</p>
</details>
# Position adjustments
## Position adjustments
The stacking of the `fill` parameter is performed by the position adjustment `position`
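As a minimal sketch of what a position adjustment changes, the code below builds the same bar chart with `position = "stack"` and `position = "dodge"` and compares the computed bar bases. The object names are illustrative, and `layer_data()` is only used to look at the computed coordinates.

```r
library(ggplot2)

base_plot <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

# Bars of the same cut are stacked on top of each other
stacked <- layer_data(base_plot + geom_bar(position = "stack"))

# Bars of the same cut are placed side by side
dodged <- layer_data(base_plot + geom_bar(position = "dodge"))

# Stacked bars start where the previous one ends, dodged bars all start at 0
range(stacked$ymin)
range(dodged$ymin)
```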
......@@ -295,7 +285,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
</p>
</details>
# Coordinate systems
## Coordinate systems
The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
......@@ -343,3 +333,91 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
By combining the right **geom**, **coordinates** and **faceting** functions, you can build a large number of different plots to present your results.
## See you in [R.4: data transformation](/session_4/session_4.html)
## To go further: animated plots from xls files
In order to be able to read information from an xls file, we will use the `openxlsx` package. To generate animations we will use the `gganimate` package. The additional `gifski` package will allow R to save your animation in the gif format (Graphics Interchange Format).
```{r install_readxl, eval=F}
install.packages(c("openxlsx", "gganimate", "gifski"))
```
```{r load_readxl}
library(openxlsx)
library(gganimate)
library(gifski)
```
<div class="pencardre">
Use the `openxlsx` package to load the [https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx](https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx) file into the `gapminder` variable
</div>
<details><summary>Solution</summary>
<p>
2 solutions :
Use directly the url
```{r load_xlsx_url, eval = F}
gapminder <- read.xlsx("https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx")
```
Download the file, save it in the same directory as your script, then use the local path
```{r load_xlsx}
gapminder <- read.xlsx("gapminder.xlsx")
```
</p>
</details>
This dataset contains 4 variables of interest for us to display per country:
- `gdpPercap` the GDP per capita (US$, inflation-adjusted)
- `lifeExp` the life expectancy at birth, in years
- `pop` the population size
- `continent` a factor with 5 levels
<div class="pencardre">
Using `ggplot2`, build a scatterplot of the `gdpPercap` vs `lifeExp`. Add the `pop` and `continent` information to this plot.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_a}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point()
```
</p>
</details>
<div class="pencardre">
What's wrong ?
You can use the `scale_x_log10()` to display the `gdpPercap` on the `log10` scale.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_b}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10()
```
</p>
</details>
<div class="pencardre">
We would like to add the `year` information to the plots. We could use a `facet_wrap`, but instead we are going to use the `gganimate` package.
For this we need to add a `transition_time` layer that will take as an argument `year` to our plot.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_c}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10() +
transition_time(year) +
labs(title = 'Year: {as.integer(frame_time)}')
```
</p>
</details>
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
session_4/img/colorsR.png

370 KiB

session_4/img/transform-logical.png

70.2 KiB | W: 0px | H: 0px

session_4/img/transform-logical.png

82.8 KiB | W: 0px | H: 0px

This diff is collapsed.
---
title: "R#5: Pipping and grouping"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)
date: "2021"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "dark"
lightbox: true
css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
title: "R.5: Piping and grouping"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
The goal of this practical is to practice combining data transformation with `tidyverse`.
The objectives of this session will be to:
......@@ -47,7 +40,7 @@ library("nycflights13")
</p>
</details>
# Combining multiple operations with the pipe
## Combining multiple operations with the pipe
<div id="pencadre">
Find the 10 most delayed flights using the ranking function `min_rank()`.
......@@ -78,8 +71,8 @@ Try to pipe operators to rewrite your precedent code with only **one** variable
<p>
```{r pipe_example_b, include=TRUE}
flights_md2 <- flights %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(most_delay < 10) %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(most_delay <= 10) %>%
arrange(most_delay)
```
</p>
......@@ -89,12 +82,12 @@ Working with the pipe is one of the key criteria for belonging to the `tidyverse
The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:
## When not to use the pipe
### When not to use the pipe
- Your pipes are longer than (say) ten steps. In that case, create intermediate objects with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or splits the results.
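To make the trade-off concrete, the sketch below writes the same short sequence once with the pipe and once with named intermediate objects. The `toy_flights` table and its values are made up for illustration; it only assumes that `dplyr` is installed.

```r
library(dplyr)

# Hypothetical data: four flights, one with a missing delay
toy_flights <- tibble(
  flight = c("A1", "B2", "C3", "D4"),
  dep_delay = c(5, 120, NA, 45)
)

# Version 1: a short linear sequence, well suited to the pipe
ranked_pipe <- toy_flights %>%
  mutate(most_delay = min_rank(desc(dep_delay))) %>%
  filter(most_delay <= 2) %>%
  arrange(most_delay)

# Version 2: the same steps with named intermediate objects,
# which can each be inspected while debugging
with_rank <- mutate(toy_flights, most_delay = min_rank(desc(dep_delay)))
top_two <- filter(with_rank, most_delay <= 2)
ranked_steps <- arrange(top_two, most_delay)

identical(ranked_pipe, ranked_steps)
```

Both versions give the same table; the second one is more verbose but lets you look at `with_rank` or `top_two` when a step misbehaves.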
# Grouping variable
## Grouping variable
The `summarise()` function collapses a data frame to a single row.
Check the difference between `summarise()` and `mutate()` with the following commands:
......@@ -108,7 +101,7 @@ flights %>%
Whereas `mutate` computes the `mean` of `dep_delay` and repeats it on every row (which is not useful), `summarise` computes the `mean` of the whole `dep_delay` column and returns a single row.
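The difference is easy to check on a very small table. The sketch below uses a made-up `toy_flights` tibble (not the `flights` data) so that the result can be verified by hand.

```r
library(dplyr)

# Hypothetical data: three flights with delays of 10, 20 and 30 minutes
toy_flights <- tibble(
  flight = c("A1", "B2", "C3"),
  dep_delay = c(10, 20, 30)
)

# mutate() keeps every row and repeats the mean in a new column
toy_mutated <- toy_flights %>%
  mutate(delay = mean(dep_delay, na.rm = TRUE))

# summarise() collapses the table to a single row
toy_summarised <- toy_flights %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

nrow(toy_mutated)
nrow(toy_summarised)
```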
## The power of `summarise()` with `group_by()`
### The power of `summarise()` with `group_by()`
The `group_by()` function changes the unit of analysis from the complete dataset to individual groups.
Individual groups are defined by categorical variables or **factors**.
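As a minimal sketch of this change of unit, the code below summarises a made-up `toy_flights` table per `carrier`. The data and the column names `n_flights` and `delay` are assumptions chosen for the example.

```r
library(dplyr)

# Hypothetical data: two carriers, two flights each, one missing delay
toy_flights <- tibble(
  carrier = c("AA", "AA", "UA", "UA"),
  dep_delay = c(10, 30, NA, 50)
)

# With group_by(), summarise() returns one row per group
toy_by_carrier <- toy_flights %>%
  group_by(carrier) %>%
  summarise(
    n_flights = n(),
    delay = mean(dep_delay, na.rm = TRUE)
  )
toy_by_carrier
```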
......@@ -132,7 +125,7 @@ ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
Why did we `group_by` `year` and `month` and not only `year` ?
</div>
## Missing values
### Missing values
<div class="pencadre">
You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
......@@ -149,7 +142,7 @@ flights %>%
Aggregation functions obey the usual rule of missing values: **if there’s any missing value in the input, the output will be a missing value**.
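This rule can be checked with base R on a made-up vector of delays; the names below are illustrative.

```r
# Hypothetical delays, one of which is missing
delays <- c(10, NA, 30)

# A single missing value makes the result missing
mean_with_na <- mean(delays)

# na.rm = TRUE drops missing values before computing the mean
mean_without_na <- mean(delays, na.rm = TRUE)

# Counting the missing values tells you how much data was dropped
n_missing <- sum(is.na(delays))
```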
## Counts
### Counts
Whenever you do any aggregation, it’s always a good idea to include a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
......@@ -157,31 +150,29 @@ Whenever you do any aggregation, it’s always a good idea to include either a c
summ_delay_filghts <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(delay < 40 & delay > -20)
filter(avg_delay < 40 & avg_delay > -20)
ggplot(data = summ_delay_filghts, mapping = aes(x = dist, y = delay, size = count)) +
ggplot(summ_delay_filghts, mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
```
<div class="pencadre">
Imagine that we want to explore the relationship between the distance and average delay for each location and recreate the above figure.
Imagine that we want to explore the relationship between the average distance (`distance`) and average delay (`arr_delay`) for each location (`dest`) and recreate the above figure.
Here are the steps to prepare and plot this data:
1. Group flights by destination.
2. Summarize to compute distance, average delay, and number of flights using `n()`.
3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
2. Summarize to compute average distance (`avg_distance`), average delay (`avg_delay`), and number of flights using `n()` (`n_flights`).
3. Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport.
4. Filter to remove noisy points with a delay greater than 40 or less than -20
5. Create a `mapping` on `dist`, `delay` and `count` as `size`.
6. Use the layer `geom_point()` and `geom_smooth()`
5. Create a `mapping` on `avg_distance`, `avg_delay` and `n_flights` as `size`.
6. Use the layer `geom_point()` and `geom_smooth()` (use method = lm)
7. We can hide the legend with the layer `theme(legend.position='none')`
</div>
......@@ -191,13 +182,13 @@ here are three steps to prepare this data:
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(delay < 40 & delay > -20) %>%
ggplot(mapping = aes(x = dist, y = delay, size = count)) +
filter(avg_delay < 40 & avg_delay > -20) %>%
ggplot(mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
......@@ -206,7 +197,7 @@ flights %>%
</details>
## Ungrouping
### Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
......@@ -221,19 +212,21 @@ flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
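To check whether a table is still grouped, you can look at its grouping variables with `group_vars()`. The sketch below does so on a made-up `toy_flights` tibble, before and after `ungroup()`.

```r
library(dplyr)

# Hypothetical data: two carriers, two flights each
toy_flights <- tibble(
  carrier = c("AA", "AA", "UA", "UA"),
  dep_delay = c(10, 30, 20, 50)
)

toy_grouped <- group_by(toy_flights, carrier)
group_vars(toy_grouped)

# After ungroup(), no grouping variable is left
toy_ungrouped <- ungroup(toy_grouped)
group_vars(toy_ungrouped)

# summarise() now works on the whole table again
toy_overall <- summarise(toy_ungrouped, delay = mean(dep_delay))
```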
# Grouping challenges
## Grouping challenges
## First challenge
### First challenge
<div class="pencadre">
Look at the number of canceled flights per day. Is there a pattern?
(A canceled flight is a flight where the `dep_time` or the `arr_time` is `NA`)
**Remember to always try to decompose complex questions into smaller and simple problems**
- What are `canceled` flights?
- Who can I `canceled` flights?
- We need to define the day of the week `wday` variable (`strftime(x,'%A')` give you the name of the day from a POSIXct date).
- How can you create a `canceled` flights variable which will be TRUE if the flight is canceled or FALSE if not?
- We need to define the day of the week `wday` variable (Monday, Tuesday, ...). To do that, you can use `strftime(x,'%A')` to get the name of the day of a `x` date in the POSIXct format as in the `time_hour` column, ex: `strftime("2013-01-01 05:00:00 EST",'%A')` return "Tuesday" ).
- We can count the number of canceled flight (`cancel_day`) by day of the week (`wday`).
- We can pipe transformed and filtered tibble into a `ggplot` function.
- We can use `geom_col` to have a barplot of the number of `cancel_day` for each `wday`.
......@@ -260,7 +253,7 @@ flights %>%
</p>
</details>
## Second challenge
### Second challenge
<div class="pencadre">
Is the proportion of canceled flights by day of the week related to the average departure delay?
......@@ -275,8 +268,8 @@ flights %>%
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
mutate(
prop_cancel_day = sum(canceled)/sum(!canceled),
summarise(
prop_cancel_day = sum(canceled)/n(),
av_delay = mean(dep_delay, na.rm = TRUE)
) %>%
ungroup() %>%
......@@ -353,7 +346,7 @@ flights %>%
</p>
</details>
## Third challenge
### Third challenge
<div class="pencadre">
Which carrier has the worst delays?
......@@ -361,7 +354,7 @@ Which carrier has the worst delays?
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_c, eval=F, echo = T, message=FALSE, cache=T}
```{r grouping_challenges_c2, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
group_by(carrier) %>%
summarise(
......@@ -375,12 +368,12 @@ flights %>%
</details>
<div class="pencadre">
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n())`)
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n=n())`)
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_c, eval=F, echo = T, message=FALSE, cache=T}
```{r grouping_challenges_c1, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
group_by(carrier, dest) %>%
summarise(
......@@ -394,3 +387,5 @@ flights %>%
```
</p>
</details>
### See you in [R.6: tidydata](/session_6/session_6.html)
session_6/img/overview_joins.png

50.5 KiB

session_6/img/overview_set.png

11.5 KiB

session_6/img/pivot_longer.png

21.1 KiB

session_6/img/pivot_wider.png

21.7 KiB

---
title: "R.6: tidydata"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nCarine Rey [carine.rey@ens-lyon.fr](mailto:carine.rey@ens-lyon.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
## Introduction
Until now we have worked with data already formatted in a *nice way*.
In the `tidyverse`, data formatted in a *nice way* are called **tidy**.
The goal of this practical is to understand how to transform an ugly blob of information into a **tidy** data set.
### Tidydata
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
This kind of transformation is often called **data wrangling**, due to the feeling that we have to *wrangle* with the data to force them into a **tidy** format.
But once this step is finished, most of the subsequent analysis will be really fast to do!
<div class="pencadre">
As usual we will need the `tidyverse` library.
</div>
<details><summary>Solution</summary>
<p>
```{r load_data, eval=T, message=F}
library(tidyverse)
```
</p>
</details>
For this practical we are going to use the `table` set of datasets which demonstrate multiple ways to layout the same tabular data.
<div class="pencadre">
Use the help to learn more about the `table1` dataset.
</div>
<details><summary>Solution</summary>
```{r}
?table1
```
<p>
`table1`, `table2`, `table3`, `table4a`, `table4b`, and `table5` all display the number of TB (Tuberculosis) cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.
The data is a subset of the data contained in the World Health Organization Global Tuberculosis Report
</p>
</details>
## Pivoting data
### pivot longer
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_longer.png')
```
```{r, eval = F}
wide_example <- tibble(X1 = c("A","B"),
                       X2 = c(1,2),
                       X3 = c(0.1,0.2),
                       X4 = c(10,20))
```
If you have a wide dataset, such as `wide_example`, that you want to make longer, you will use the `pivot_longer()` function.
You have to specify the names of the columns you want to pivot into longer format (X2,X3,X4):
```{r, eval = F}
wide_example %>%
  pivot_longer(c(X2,X3,X4))
```
... or the reverse selection (-X1):
```{r, eval = F}
wide_example %>% pivot_longer(-X1)
```
You can specify the names of the new columns that will hold the pivoted data (by default, they are `name` and `value`):
```{r, eval = F}
long_example <- wide_example %>%
  pivot_longer(-X1, names_to = "V1", values_to = "V2")
```
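As a minimal, self-contained sketch of the reshaping described above (it rebuilds `wide_example` so it can run on its own, and loads `dplyr` and `tidyr` directly instead of the whole `tidyverse`), you can check that each of the 2 rows is repeated once per pivoted column:

```r
library(dplyr)
library(tidyr)

# Same toy table as above: one identifier column and three value columns
wide_example <- tibble(
  X1 = c("A", "B"),
  X2 = c(1, 2),
  X3 = c(0.1, 0.2),
  X4 = c(10, 20)
)

# Pivot every column except X1; the old column names go to V1, the cell values to V2
long_example <- wide_example %>%
  pivot_longer(-X1, names_to = "V1", values_to = "V2")

# 2 rows x 3 pivoted columns give 6 rows and 3 columns (X1, V1, V2)
dim(long_example)
```

The number of rows of the long table is the number of rows of the wide table multiplied by the number of pivoted columns, which is a quick sanity check after any `pivot_longer()` call.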
#### Exercise
<div class="pencadre">
Visualize the `table4a` dataset (you can use the `View()` function).
```{r table4a, eval=F, message=T}
View(table4a)
```
Is the data **tidy** ? How would you transform this dataset to make it **tidy** ?
</div>
<details><summary>Solution</summary>
<p>
We have information about 3 variables in the `table4a`: `country`, `year` and number of `cases`.
However, the values of the `year` variable are stored as column names.
We want to pivot the year columns vertically and make the table longer.
You can use the `pivot_longer` function to make your table longer and have one observation per row and one variable per column.
For this we need to:
- specify which columns to select (all except `country`),
- give the name of the new variable (`year`),
- give the name of the variable stored in the cells of the year columns (`case`).
```{r pivot_longer, eval=T, message=T}
table4a %>%
  pivot_longer(-country,
               names_to = "year",
               values_to = "case")
```
</p>
</details>
### pivot wider
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_wider.png')
```
If you have a long dataset, that you want to make wider, you will use the `pivot_wider()` function.
You have to specify which column contains the name of the output column (`names_from`), and which column contains the cell values from (`values_from`).
```{r, eval = F}
long_example %>% pivot_wider(names_from = V1,
                             values_from = V2)
```
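The sketch below illustrates that `pivot_wider()` is the inverse of `pivot_longer()`. It builds a small long table by hand (the column names `X1`, `V1` and `V2` follow the toy example of the previous subsection) and widens it again:

```r
library(dplyr)
library(tidyr)

# Hand-written long table: one row per (X1, V1) pair
long_example <- tibble(
  X1 = c("A", "A", "A", "B", "B", "B"),
  V1 = rep(c("X2", "X3", "X4"), times = 2),
  V2 = c(1, 0.1, 10, 2, 0.2, 20)
)

# The values of V1 become column names, the values of V2 fill the cells
wide_again <- long_example %>%
  pivot_wider(names_from = V1, values_from = V2)

wide_again
```

Each distinct value of `V1` becomes one column, so the result has one row per value of `X1`.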
#### Exercise
<div class="pencadre">
Visualize the `table2` dataset.
Is the data **tidy** ? How would you transform this dataset to make it **tidy** ? (You can also make a guess from the name of the subsection.)
</div>
<details><summary>Solution</summary>
<p>
The column `count` stores two types of information: the `population` size of the country and the number of `cases` in the country.
You can use the `pivot_wider` function to make your table wider and have one observation per row and one variable per column.
```{r pivot_wider, eval=T, message=T}
table2 %>%
  pivot_wider(names_from = type,
              values_from = count)
```
</p>
</details>
## Merging data
### Relational data
To avoid having a huge table and to save space, information is often split between different tables.
In our `flights` dataset, information about the carriers and the airports (origin and destination) is saved in separate tables (`airlines`, `airports`).
```{r airlines, eval=T, echo = T}
library(nycflights13)
flights
airlines
airports
weather
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
```
### Relational schema
The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
```{r airlines_dag, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/relational-nycflights.png')
```
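One way to check that a variable really is a key is to count how often each of its values appears and look for counts above one. The sketch below assumes the `nycflights13` package is installed; the object names `dup_planes` and `dup_flights` are only illustrative:

```r
library(dplyr)
library(nycflights13)

# tailnum is a key for planes if no tail number appears more than once
dup_planes <- planes %>%
  count(tailnum) %>%
  filter(n > 1)
nrow(dup_planes)

# In flights, the date plus the flight number is NOT a key:
# the same flight number can be used several times on the same day
dup_flights <- flights %>%
  count(year, month, day, flight) %>%
  filter(n > 1)
nrow(dup_flights)
```

An empty result means the candidate variable (or set of variables) uniquely identifies each observation.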
### Joins
If you have to combine data from 2 tables into a new table, you will use joins.
There are several types of joins, depending on what you want to get.
```{r joints, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/join-venn.png')
```
Small concrete examples:
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_joins.png')
```
#### `inner_join()`
keeps observations in `x` AND `y`
```{r inner_joint, eval=T}
flights2 %>%
inner_join(airlines)
```
#### `left_join()`
keeps all observations in `x`
```{r left_joint, eval=T}
flights2 %>%
left_join(airlines)
```
#### `right_join()`
keeps all observations in `y`
```{r right_joint, eval=T}
flights2 %>%
right_join(airlines)
```
#### `full_join()`
keeps all observations in `x` and `y`
```{r full_joint, eval=T}
flights2 %>%
full_join(airlines)
```
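To see the difference between the four joins on something smaller than `flights2`, here is a sketch on two hypothetical toy tables `x` and `y` that share keys 1 and 2, while key 3 exists only in `x` and key 4 only in `y`:

```r
library(dplyr)

# Toy tables (hypothetical data, for illustration only)
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

# Number of rows kept by each type of join
n_rows <- c(
  inner = nrow(inner_join(x, y, by = "key")),  # keys 1, 2
  left  = nrow(left_join(x, y, by = "key")),   # keys 1, 2, 3
  right = nrow(right_join(x, y, by = "key")),  # keys 1, 2, 4
  full  = nrow(full_join(x, y, by = "key"))    # keys 1, 2, 3, 4
)
n_rows
```

Rows without a match in the other table are kept by the left, right and full joins with `NA` in the columns coming from that other table.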
### Defining the key columns
The default, `by = NULL`, uses all variables that appear in both tables, the so-called natural join.
```{r , eval=T}
flights2 %>%
left_join(weather)
```
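Before relying on a natural join, it can help to list the columns that the two tables have in common, since these are the ones used as keys. A minimal sketch, assuming `nycflights13` is installed and rebuilding `flights2` as above (`common_cols` is an illustrative name):

```r
library(dplyr)
library(nycflights13)

flights2 <- flights %>%
  select(year:day, hour, origin, dest, tailnum, carrier)

# Columns present in both tables: the keys of the natural join
common_cols <- intersect(names(flights2), names(weather))
common_cols
```

If this list contains a column that does not mean the same thing in both tables, the natural join is not what you want and the keys should be given explicitly.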
If the two tables contain columns with the same names but corresponding to different things (such as `year` in `flights2` and `planes`) you have to manually define the key or the keys.
```{r , eval=T, echo = T}
flights2 %>%
left_join(planes, by = "tailnum")
```
If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa"))
```
If two columns have identical names in the input tables but are not used as join keys, they are automatically renamed with the suffixes `.x` and `.y`, because all column names must be different in the output table.
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa")) %>%
left_join(airports, c("origin" = "faa"))
```
You can change the suffixes using the `suffix` option:
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, by = c("dest" = "faa")) %>%
left_join(airports, by = c("origin" = "faa"), suffix = c(".dest",".origin"))
```
### Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
- `semi_join(x, y)` keeps all observations in `x` that have a match in `y`.
- `anti_join(x, y)` drops all observations in `x` that have a match in `y`.
```{r top_dest, eval=T, echo = T}
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
flights %>%
  semi_join(top_dest)
```
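`anti_join()` is useful to diagnose mismatches between tables. As a sketch (assuming `nycflights13` is installed; `missing_planes` is an illustrative name), the following lists the tail numbers found in `flights` that have no entry in `planes`:

```r
library(dplyr)
library(nycflights13)

# Flights whose tailnum has no match in planes, counted by tail number
missing_planes <- flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)

missing_planes
```

Like `semi_join()`, `anti_join()` only filters the rows of `x`: no column from `planes` is added to the result.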
### Set operations
These expect the x and y inputs to have the same variables, and treat the observations like sets:
- `intersect(x, y)`: return only observations in both `x` and `y`.
- `union(x, y)`: return unique observations in `x` and `y`.
- `setdiff(x, y)`: return observations in `x`, but not in `y`.
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_set.png')
```
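A small sketch of the three set operations on two hypothetical tables `df1` and `df2` with the same variables (the rows are compared as a whole, not column by column):

```r
library(dplyr)

# Two toy tables with the same columns; they share the row (x = 1, y = 1)
df1 <- tibble(x = c(1, 2), y = c(1, 1))
df2 <- tibble(x = c(1, 1), y = c(1, 2))

in_both  <- intersect(df1, df2)  # the row (1, 1)
in_any   <- union(df1, df2)      # the 3 distinct rows
only_df1 <- setdiff(df1, df2)    # the row (2, 1)

in_both
in_any
only_df1
```

Note that `setdiff()` is not symmetric: `setdiff(df2, df1)` would return the row that is only in `df2`.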
### See you in [R.7: String & RegExp](/session_7/session_7.html)
---
title: "R#6: tidydata"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "19 Dec 2019"
output:
slidy_presentation:
highlight: tango
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
---
```{r setup, include=FALSE, echo = F}
library(tidyverse)
library(nycflights13)
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
```
## Tidydata
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
```{r load_data, eval=T, message=T}
library(tidyverse)
```
## pivot longer
```{r table4a, eval=T, message=T}
table4a # number of TB cases
```
## pivot longer
```{r pivot_longer, eval=T, message=T}
table4a %>%
pivot_longer(-country,
names_to = "year",
values_to = "case")
```
## pivot wider
```{r table2, eval=T, message=T}
table2
```
## pivot wider
```{r pivot_wider, eval=T, message=T}
table2 %>%
pivot_wider(names_from = type,
values_from = count)
```
## Relational data
Sometimes the information is split between different tables.
```{r airlines, eval=F, echo = T}
library(nycflights13)
flights
airlines
airports
weather
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
```
## Relational data
```{r airlines_dag, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/relational-nycflights.png')
```
## joins
```{r joints, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/join-venn.png')
```
## `inner_join()`
Matches pairs of observations whenever their keys are equal
```{r inner_joint, eval=T}
flights2 %>%
inner_join(airlines)
```
## `left_join()`
keeps all observations in `x`
```{r left_joint, eval=T}
flights2 %>%
left_join(airlines)
```
## `right_join()`
keeps all observations in `y`
```{r right_joint, eval=T}
flights2 %>%
right_join(airlines)
```
## `full_join()`
keeps all observations in `x` and `y`
```{r full_joint, eval=T}
flights2 %>%
full_join(airlines)
```
## Defining the key columns
The default, `by = NULL`, uses all variables that appear in both tables, the so called natural join.
```{r left_join_weather, eval=T}
flights2 %>%
left_join(weather)
```
## Defining the key columns
If the two tables contain columns with the same names but corresponding to different things (such as `year` in `flights2` and `planes`), you have to define the key manually with `by`.
```{r left_join_tailnum, eval=T, echo = T}
flights2 %>%
left_join(planes, by = "tailnum")
```
## Defining the key columns
A named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
```{r left_join_airport, eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa"))
```
## Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
- `semi_join(x, y)` keeps all observations in `x` that have a match in `y`.
- `anti_join(x, y)` drops all observations in `x` that have a match in `y`.
## Filtering joins
```{r top_dest, eval=T, echo = T}
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
flights %>%
semi_join(top_dest)
```
## Set operations
These expect the x and y inputs to have the same variables, and treat the observations like sets:
- `intersect(x, y)`: return only observations in both `x` and `y`.
- `union(x, y)`: return unique observations in `x` and `y`.
- `setdiff(x, y)`: return observations in `x`, but not in `y`.