Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Showing
with 19929 additions and 593 deletions
---
title: 'R.1: Installing packages from Bioconductor'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: false
use_bookdown: true
default_style: "dark"
lightbox: true
css: "../src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
To install packages from [bioconductor](http://www.bioconductor.org) you must first install a package called "BiocManager".
This package provides a function called "install" that lets you install packages hosted on Bioconductor by name.
To install "BiocManager" you must type:
```R
install.packages("BiocManager")
```
Then to install, for example "tximport", you just have to write:
```R
BiocManager::install("tximport")
```
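Re-running the install commands each time a script is sourced is slow. Below is a minimal sketch, assuming you only want to install what is missing. `missing_packages()` and `install_bioc_if_missing()` are hypothetical helpers written for this example; they are not part of "BiocManager".

```R
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install BiocManager from CRAN if needed, then install
# only the missing Bioconductor packages.
install_bioc_if_missing <- function(pkgs) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    BiocManager::install(to_install)
  }
  invisible(to_install)
}
```

Calling `install_bioc_if_missing("tximport")` then does nothing when "tximport" is already available.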
---
title: 'R.1: Installing packages from github'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: false
use_bookdown: true
default_style: "dark"
lightbox: true
css: "../src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
If you need to install a package that is not available on CRAN but is hosted in a GitHub repository, you can use the "remotes" package. It provides functions that install a package directly from [github](https://github.com/), Bitbucket or GitLab onto your computer.
To use the "remotes" package, you must first install it:
```R
install.packages("remotes")
```
Once "remotes" is installed, you will be able to install any R package from GitHub or from its URL.
For example, if you want to install the latest version of "gganimate", which allows you to animate ggplot2 graphs, you can use:
```R
remotes::install_github("thomasp85/gganimate")
```
By default the latest version of the package is installed. If you want a given version, you can specify it:
```R
remotes::install_github("thomasp85/gganimate@v1.0.7")
```
You can find more information in the documentation of remotes : [https://remotes.r-lib.org](https://remotes.r-lib.org)
---
title: "R.2: introduction to Tidyverse"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "dark"
lightbox: true
css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nHélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
```{r download_data, include=FALSE, eval=FALSE}
```{r download_data, include=FALSE, eval=T}
library("tidyverse")
tmp <- tempfile(fileext = ".zip")
download.file("http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
......@@ -114,7 +108,7 @@ read_csv("data-raw/vehicles.csv") %>%
write_csv("mpg.csv")
```
# Introduction
## Introduction
In the last session, we went through the basics of R.
Instead of continuing to learn more about R programming, in this session we are going to jump directly to rendering plots.
......@@ -122,7 +116,7 @@ Instead of continuing to learn more about R programming, in this session we are
We make this choice for three reasons:
- Rendering nice plots is directly rewarding
- You will be able to apply what you learn in this session to your own data (given that they are *correctly formated*)
- You will be able to apply what you learn in this session to your own data (given that they are *correctly formatted*)
- We will come back to R programming later, when you have all the necessary tools to visualize your results.
......@@ -133,7 +127,7 @@ The objectives of this session will be to:
- Learn the different aesthetics in R plots
- Compose complex graphics
## Tidyverse
### Tidyverse
The `tidyverse` package is a collection of R packages designed for data science that include `ggplot2`.
......@@ -154,7 +148,7 @@ Luckily for you, `tidyverse` is preinstalled on your Rstudio server. So you just
library("tidyverse")
```
## Toy data set `mpg`
### Toy data set `mpg`
This dataset contains a subset of the fuel economy data that the EPA makes available on [fueleconomy.gov](http://fueleconomy.gov).
It contains only models which had a new release every year between 1999 and 2008.
......@@ -170,9 +164,13 @@ For that we are going to use the command `read_csv` which is able to read a [csv
This command also works for file URL
```{r mpg_download, cache=TRUE, message=FALSE}
```{r mpg_download_local, cache=TRUE, message=FALSE, echo = F, include=F}
new_mpg <- read_csv("./mpg.csv")
```
```{r mpg_download, cache=TRUE, message=FALSE, eval = F}
new_mpg <- read_csv(
"http://perso.ens-lyon.fr/laurent.modolo/R/mpg.csv"
"https://can.gitbiopages.ens-lyon.fr/R_basis/session_2/mpg.csv"
)
```
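To see what `read_csv` returns without needing a network connection, here is a minimal sketch that writes a tiny CSV file to a temporary location and reads it back. The file content and the names `csv_path` and `toy` are invented for this illustration.

```R
library(readr)

# Write a three-line CSV file (header + two rows) to a temporary path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("model,displ,hwy", "a4,1.8,29", "corvette,5.7,26"), csv_path)

# read_csv() guesses the column types and returns a tibble.
toy <- read_csv(csv_path, show_col_types = FALSE)
toy
```

The same call works with a URL in place of `csv_path`.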
......@@ -199,13 +197,13 @@ new_mpg
Here we can see that `new_mpg` is a `tibble` we will come back to `tibble` later.
## New script
### New script
Like in the last session, instead of typing your commands directly in the console, you are going to write them in an R script.
![](./img/formationR_session2_scriptR.png)
# First plot with `ggplot2`
## First plot with `ggplot2`
We are going to make the simplest plot possible to study the relationship between two variables: the scatterplot.
......@@ -237,7 +235,15 @@ ggplot(data = <DATA>) +
What happens when you use only the command `ggplot(data = mpg)` ?
</div>
<details><summary>Solution</summary>
<p>
```{r only_ggplot, cache = TRUE, fig.width=4.5, fig.height=2}
ggplot(data = new_mpg)
```
</p>
</details>
<div class="pencadre">
Make a scatterplot of `hwy` ( fuel efficiency ) vs. `cyl` ( number of cylinders ).
</div>
......@@ -251,12 +257,22 @@ ggplot(data = new_mpg, mapping = aes(x = hwy, y = cyl)) +
```
</p>
<div class="pencadre">
What seems to be the problem ?
</div>
<details><summary>Solution</summary>
<p>
Dots with the same coordinates are drawn on top of each other, so several observations look like a single point.
</p>
</details>
</details>
# Aesthetic mappings
## Aesthetic mappings
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. `ggplot2` will also add a legend that explains which levels correspond to which values.
......@@ -266,7 +282,7 @@ Try the following aesthetic:
- `alpha`
- `shape`
## `color` mapping
### `color` mapping
```{r new_mpg_plot_e, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
......@@ -274,21 +290,21 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
```
## `size` mapping
### `size` mapping
```{r new_mpg_plot_f, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, size = class)) +
geom_point()
```
## `alpha` mapping
### `alpha` mapping
```{r new_mpg_plot_g, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, alpha = class)) +
geom_point()
```
## `shape` mapping
### `shape` mapping
```{r new_mpg_plot_h, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, shape = class)) +
......@@ -325,7 +341,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
</p>
</details>
## Mapping a **continuous** variable to a color.
### Mapping a **continuous** variable to a color.
You can also map a continuous variable to a color.
......@@ -347,7 +363,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = displ < 5)) +
</p>
</details>
# Facets
## Facets
You can create multiple plots at once by faceting. For this you can use the command `facet_wrap`.
This command takes a formula as input.
......@@ -362,7 +378,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
```
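As a rough sketch of how the formula drives the faceting, the example below uses the `mpg` table bundled with `ggplot2` (not the `new_mpg` table loaded earlier, whose column names differ). The names `p_one` and `p_two` are introduced only for this illustration.

```R
library(ggplot2)

# One panel per value of `drv` (the variable on the right of the ~).
p_one <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv)

# One panel per combination of `drv` and `cyl` found in the data,
# arranged on two rows.
p_two <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv + cyl, nrow = 2)

p_one
p_two
```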
<div class="pencadre">
Now try to facet your plot by `fl + class`
Now try to facet your plot by `fuel + class`
</div>
......@@ -371,14 +387,14 @@ Now try to facet your plot by `fl + class`
Formulas allow you to express complex relationships between variables in R !
```{r new_mpg_plot_l, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fl + class, nrow = 2)
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fuel + class, nrow = 2)
```
</p>
</details>
# Composition
## Composition
There are different ways to represent the information :
......@@ -416,7 +432,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
\
We can use different `data` for different layers (you will learn more on `filter()` later)
We can use different `data` (here new_mpg and mpg tables) for different layers (you will learn more on `filter()` later)
```{r new_mpg_plot_t, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
......@@ -424,14 +440,14 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = filter(mpg, class == "subcompact"))
```
# Challenge !
## Challenge !
## First challenge
### First challenge
<div class="pencadre">
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
</div>
```R
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
......@@ -441,61 +457,231 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
- What does the `se` argument to `geom_smooth()` do?
</div>
## Second challenge
<details><summary>Solution</summary>
<p>
```{r soluce_challenge_1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
</p>
</details>
### Second challenge
<div class="pencadre">
How does being a `2seater` car impact the engine size versus fuel efficiency relationship ?
How does being a `Two Seaters` car (*class column*) impact the relationship between engine size (*displ column*) and fuel efficiency (*hwy column*) ?
1. Make a plot of `hwy` as a function of `displ`
2. *Colorize* this plot in another color for the `Two Seaters` class
3. *Split* this plot for each *class*
Make a plot *colorizing* this information
</div>
<details><summary>Solution</summary>
<details><summary>Solution 1</summary>
<p>
```{r new_mpg_plot_color_2seater1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()
```
</p>
</details>
<details><summary>Solution 2</summary>
<p>
```{r new_mpg_plot_color_2seater, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
```{r new_mpg_plot_color_2seater2, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red")
```
</p>
</details>
<details><summary>Solution 3</summary>
<p>
```{r new_mpg_plot_color_2seater_facet, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red") +
facet_wrap(~class)
```
</p>
</details>
<div class="pencadre">
Write a `function` called `plot_color_2seater` that takes the variable `mpg` as its sole argument and plots the same graph.
Write a `function` called `plot_color_a_class` that takes the class as its argument and plots the same graph for this class.
</div>
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_color_2seater_fx, cache = TRUE, fig.width=8, fig.height=4.5}
plot_color_2seater <- function(mpg) {
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
plot_color_a_class <- function(my_class) {
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == my_class), color = "red") +
facet_wrap(~class)
}
plot_color_2seater(mpg)
plot_color_a_class("Two Seaters")
plot_color_a_class("Compact Cars")
```
</p>
</details>
## Third challenge
### Third challenge
<div class="pencadre">
Recreate the R code necessary to generate the following graph
Recreate the R code necessary to generate the following graph (see "linetype" option of "geom_smooth")
</div>
```{r new_mpg_plot_u, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_v, eval=F}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
</p>
</details>
</details>
### See you in [R.3: Transformations with ggplot2](/session_3/session_3.html)
## To go further: publication ready plots
Once you have created the graph you need for your publication, you have to save it.
You can do it with the `ggsave` function.
First save your plot in a variable :
```{r}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()
```
Then save it in the desired format:
```{r, eval=F}
ggsave("test_plot_1.png", p1, width = 12, height = 8, units = "cm")
```
```{r, eval=F}
ggsave("test_plot_1.pdf", p1, width = 12, height = 8, units = "cm")
```
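The file extension selects the output format, and `width`, `height`, `units` and `dpi` control the final size. Here is a minimal sketch that saves into a temporary directory so that it does not clutter your project; it uses the `mpg` table bundled with `ggplot2`, and the names `p` and `out_file` are illustrative.

```R
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Save a 12 cm x 8 cm PNG at 300 dots per inch in the session's temp directory.
out_file <- file.path(tempdir(), "example_plot.png")
ggsave(out_file, p, width = 12, height = 8, units = "cm", dpi = 300)

# Check that the file was written.
file.exists(out_file)
```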
You may also change the appearance of your plot by adding a `theme` layer to your plot:
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 + theme_bw()
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 + theme_minimal()
```
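A complete theme such as `theme_bw()` can be combined with `theme()` and `labs()` to adjust individual elements. The sketch below is one possible combination, not a recommended house style; it uses the `mpg` table bundled with `ggplot2`, and `p_pub` is a name chosen for this example.

```R
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

p_pub <- p +
  theme_bw(base_size = 12) +               # complete theme, 12 pt base font
  theme(legend.position = "bottom") +      # tweak one element of that theme
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel efficiency (mpg)",
    color = "Vehicle class"
  )
p_pub
```

Note that `theme()` must come after `theme_bw()`, otherwise the complete theme overwrites the tweak.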
You may have to combine several plots. For that you can use the `cowplot` package, which is a `ggplot2` extension.
First install it:
```{r, eval=F}
install.packages("cowplot")
```
```{r, include=F, echo =F}
if (! require("cowplot")) {
install.packages("cowplot")
}
```
Then you can use the function `plot_grid` to combine plots in a publication ready style:
```{r,message=FALSE}
library(cowplot)
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 <- ggplot(data = new_mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
p1
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
p2
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)
```
You can also save it in a file.
```{r, eval=F}
p_final = plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)
ggsave("test_plot_1_and_2.png", p_final, width = 20, height = 8, units = "cm")
```
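`plot_grid` can also stack plots vertically and give them different relative sizes. A small sketch, assuming `cowplot` is installed; it uses the `mpg` table bundled with `ggplot2`, and the names `pa`, `pb` and `stacked` are illustrative.

```R
library(ggplot2)
library(cowplot)

pa <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
pb <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

# One column, automatic labels (A, B), first panel twice as tall as the second.
stacked <- plot_grid(pa, pb, ncol = 1, labels = "AUTO", rel_heights = c(2, 1))
stacked
```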
You can learn more about the features of `cowplot` on [its website](https://wilkelab.org/cowplot/articles/introduction.html).
<div class="pencadre">
Use the `cowplot` documentation to reproduce this plot and save it.
</div>
```{r, echo=F}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() + theme_bw()
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_point() + theme_bw()
p_row <- plot_grid(p1 + theme(legend.position = "none"), p2 + theme(legend.position = "none"), labels = c('A', 'B'), label_size = 12)
p_legend <- get_legend(p1 + theme(legend.position = "top"))
plot_grid(p_row, p_legend, nrow = 2, rel_heights = c(1,0.2))
```
<details><summary>Solution</summary>
<p>
```{r , echo = TRUE, eval = F}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() + theme_bw()
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_point() + theme_bw()
p_row <- plot_grid(p1 + theme(legend.position = "none"), p2 + theme(legend.position = "none"), labels = c('A', 'B'), label_size = 12)
p_legend <- get_legend(p1 + theme(legend.position = "top"))
p_final <- plot_grid(p_row, p_legend, nrow = 2, rel_heights = c(1,0.2))
p_final
```
```{r , echo = TRUE, eval = F}
ggsave("plot_1_2_and_legend.png", p_final, width = 20, height = 8, units = "cm")
```
</p>
</details>
There are a lot of other available `ggplot2` extensions which can be useful (and also beautiful).
You can take a look at them here: [ggplot2 gallery](https://exts.ggplot2.tidyverse.org/gallery/)
---
title: 'R.3: Transformations with ggplot2'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "dark"
lightbox: true
css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
In the last session, we have seen how to use `ggplot2` and [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). The goal of this practical is to practice more advanced features of `ggplot2`.
......@@ -47,7 +37,7 @@ library("tidyverse")
Like in the previous sessions, it's good practice to create a new **.R** file to write your code instead of using the R terminal directly.
# `ggplot2` statistical transformations
## `ggplot2` statistical transformations
In the previous session, we have plotted the data as they are by using the variable values as **x** or **y** coordinates, color shade, size or transparency.
When dealing with categorical variables, also called **factors**, it can be interesting to perform some simple statistical transformations.
......@@ -67,7 +57,7 @@ We are going to use the `diamonds` data set included in `tidyverse`.
str(diamonds)
```
## Introduction to `geom_bar`
### Introduction to `geom_bar`
We have seen scatterplots (`geom_point()`) and smoothed curves (`geom_smooth()`).
Now let's look at barplots with `geom_bar()`:
......@@ -82,7 +72,7 @@ More diamonds are available with high quality cuts.
On the x-axis, the chart displays **cut**, a variable from diamonds. On the y-axis, it displays **count**, **but count is not a variable in diamonds!**
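To make the origin of **count** concrete, the sketch below computes the same numbers by hand with base R and plots them with `geom_col()`, which draws bar heights exactly as given. `cut_counts` is a name introduced for this illustration.

```R
library(ggplot2)

# Count the number of diamonds per cut: this is the table that
# geom_bar() builds behind the scenes before drawing the bars.
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "count")
cut_counts

# geom_col() uses the precomputed `count` column as bar heights.
ggplot(data = cut_counts, mapping = aes(x = cut, y = count)) +
  geom_col()
```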
## **geom** and **stat**
### **geom** and **stat**
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
......@@ -98,7 +88,7 @@ ggplot(data = diamonds, mapping = aes(x = cut)) +
Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three reasons you might need to use a **stat** explicitly:
## Why **stat** ?
### Why **stat** ?
You might want to override the default stat.
For example, in the following `demo` dataset we already have a variable for the **counts** per `cut`.
......@@ -153,7 +143,7 @@ If group is not used, the proportion is calculated with respect to the data that
</p>
</details>
## More details with `stat_summary`
### More details with `stat_summary`
<div class="pencadre">
You might want to draw greater attention to the statistical transformation in your code.
......@@ -188,7 +178,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
</p>
</details>
# Coloring area plots
## Coloring area plots
<div class="pencadre">
You can color a bar chart using either the `color` aesthetic, or, more usefully `fill`:
......@@ -223,7 +213,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
</p>
</details>
# Position adjustments
## Position adjustments
The stacking of the `fill` parameter is performed by the position adjustment `position`
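As a quick sketch of what the `position` argument changes, the three plots below use the same data and mapping and differ only by their position adjustment. The names `p_stack`, `p_dodge` and `p_fill` are illustrative.

```R
library(ggplot2)

base_plot <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

# Default for geom_bar(): bars of each clarity are stacked on top of each other.
p_stack <- base_plot + geom_bar(position = "stack")

# Bars of each clarity are placed side by side.
p_dodge <- base_plot + geom_bar(position = "dodge")

# Stacked bars rescaled to the same height, to compare proportions.
p_fill <- base_plot + geom_bar(position = "fill")

p_dodge
```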
......@@ -295,7 +285,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
</p>
</details>
# Coordinate systems
## Coordinate systems
The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
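A minimal sketch of two alternative coordinate systems applied to the same bar chart; `p_bar`, `p_flip` and `p_polar` are names chosen for this example.

```R
library(ggplot2)

p_bar <- ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
  geom_bar(show.legend = FALSE)

# Swap the x and y axes: useful for long category labels.
p_flip <- p_bar + coord_flip()

# Map x to the angle and y to the radius.
p_polar <- p_bar + coord_polar()

p_flip
p_polar
```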
......@@ -343,3 +333,91 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
By combining the right **geom**, **coordinates** and **faceting** functions, you can build a large number of different plots to present your results.
## See you in [R.4: data transformation](/session_4/session_4.html)
## To go further: animated plots from xls files
To read data from an xlsx file, we will use the `openxlsx` package. To generate animations we will use the `gganimate` package. The additional `gifski` package allows R to save your animation in the GIF format (Graphics Interchange Format).
```{r install_readxl, eval=F}
install.packages(c("openxlsx", "gganimate", "gifski"))
```
```{r load_readxl}
library(openxlsx)
library(gganimate)
library(gifski)
```
<div class="pencardre">
Use the `openxlsx` package to load the [https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx](https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx) file into the `gapminder` variable
</div>
<details><summary>Solution</summary>
<p>
2 solutions :
Use directly the url
```{r load_xlsx_url, eval = F}
gapminder <- read.xlsx("https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx")
```
Download the file, save it in the same directory as your script, then use the local path
```{r load_xlsx}
gapminder <- read.xlsx("gapminder.xlsx")
```
</p>
</details>
This dataset contains 4 variables of interest for us to display per country:
- `gdpPercap` the GDP per capita (US$, inflation-adjusted)
- `lifeExp` the life expectancy at birth, in years
- `pop` the population size
- `continent` a factor with 5 levels
<div class="pencardre">
Using `ggplot2`, build a scatterplot of the `gdpPercap` vs `lifeExp`. Add the `pop` and `continent` information to this plot.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_a}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point()
```
</p>
</details>
<div class="pencardre">
What's wrong ?
You can use the `scale_x_log10()` to display the `gdpPercap` on the `log10` scale.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_b}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10()
```
</p>
</details>
<div class="pencardre">
We would like to add the `year` information to the plots. We could use a `facet_wrap`, but instead we are going to use the `gganimate` package.
For this we need to add a `transition_time` layer that will take as an argument `year` to our plot.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_c}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10() +
transition_time(year) +
labs(title = 'Year: {as.integer(frame_time)}')
```
</p>
</details>
Dear all,
The first session of the R basics training will be in the CBP TP room on:
- 14/09 at 11h for the Tuesday session
- 17/09 at 11h for the Friday session
- 20/09 at 11h for the Monday session
For this first session, some trainers will wait for you at the reception of the ENS Monod site 15 min before the start of the session to guide you to the room.
You will have access to a computer to do all the practicals with your ens email account (same login and password).
There are no prerequisites for this training, as we will start from scratch.
If you want to work on your own laptop, you will need:
- a recent browser
- access to the eduroam wifi network
In case of problems, we won't provide any IT support; we will just advise you to switch to a computer available in the TP room.
If you are unable to attend a session, please give us a heads-up so we will not wait for you. All the course materials will be available online so you can try to catch up before the next session.
Best,
---
title: "R.5: Piping and grouping"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
## Introduction
The goal of this practical is to practice combining data transformation with `tidyverse`.
The objectives of this session will be to:
- Combine multiple operations with the pipe `%>%`
- Work on subgroups of the data with `group_by`
<div class="pencadre">
For this session we are going to work with a new dataset included in the `nycflights13` package.
Install this package and load it.
As usual you will also need the `tidyverse` library.
</div>
<details><summary>Solution</summary>
<p>
```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
library("nycflights13")
```
</p>
</details>
## Combining multiple operations with the pipe
<div id="pencadre">
Find the 10 most delayed flights using the ranking function `min_rank()`.
</div>
<details><summary>Solution</summary>
<p>
```{r pipe_example_a, include=TRUE}
flights_md <- mutate(flights,
most_delay = min_rank(desc(dep_delay)))
flights_md <- filter(flights_md, most_delay <= 10)
flights_md <- arrange(flights_md, most_delay)
```
</p>
</details>
We don't want to create useless intermediate variables so we can use the pipe operator: `%>%`
(or `ctrl + shift + M`).
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
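The rewriting rule can be checked on a small vector. This is a minimal sketch using the `magrittr` package (which provides `%>%` and is loaded with `tidyverse`); the vector `x` and the result names are invented for the example.

```R
library(magrittr)

x <- c(1.234, 5.678)

# x %>% f(y) is the same as f(x, y)
same_single <- identical(x %>% round(1), round(x, 1))

# x %>% f(y) %>% g(z) is the same as g(f(x, y), z)
same_chain <- identical(
  x %>% round(1) %>% paste("cm"),
  paste(round(x, 1), "cm")
)

same_single
same_chain
```

Both comparisons return `TRUE`: the pipe only changes how the code reads, not what it computes.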
<div id="pencadre">
Try to use the pipe operator to rewrite your previous code with only **one** variable assignment.
</div>
<details><summary>Solution</summary>
<p>
```{r pipe_example_b, include=TRUE}
flights_md2 <- flights %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(most_delay <= 10) %>%
arrange(most_delay)
```
</p>
</details>
Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered and uses `+` instead of `%>%`. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet.
The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:
### When not to use the pipe
- Your pipes are longer than (say) ten steps. In that case, create intermediate functions with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or splits the results.
## Grouping variable
The `summarise()` function collapses a data frame to a single row.
Check the difference between `summarise()` and `mutate()` with the following commands:
```{r load_data, eval=FALSE}
flights %>%
mutate(delay = mean(dep_delay, na.rm = TRUE))
flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
Where `mutate` computes the `mean` of the whole `dep_delay` column and repeats it on every row (which is not useful), `summarise` collapses the data frame to a single row containing that `mean`.
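A toy data frame makes the difference easy to see. This is a minimal sketch; the table `toy` and its values are invented for the illustration.

```R
library(dplyr)

toy <- tibble(
  flight = c("A", "B", "C"),
  dep_delay = c(10, NA, 20)
)

# mutate() keeps the 3 rows and adds a column holding the same mean on each row.
with_mutate <- toy %>%
  mutate(delay = mean(dep_delay, na.rm = TRUE))

# summarise() returns a single row with the mean.
with_summarise <- toy %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

with_mutate
with_summarise
```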
### The power of `summarise()` with `group_by()`
The `group_by()` function changes the unit of analysis from the complete dataset to individual groups.
Individual groups are defined by categorical variables or **factors**.
Then, when you use the functions you already know on a grouped data frame, they’ll be automatically applied *by group*.
You can use the following code to compute the average delay per month across years.
```{r summarise_group_by, include=TRUE, fig.width=8, fig.height=3.5}
flights_delay <- flights %>%
group_by(year, month) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>%
arrange(month)
ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
geom_bar(stat="identity", color="black", fill = "#619CFF") +
geom_errorbar(mapping = aes( ymin=0, ymax=delay+sd)) +
theme(axis.text.x = element_blank())
```
<div class="pencadre">
Why did we `group_by` `year` and `month` and not only `year` ?
</div>
### Missing values
<div class="pencadre">
You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
</div>
```{r summarise_group_by_NA, include=TRUE}
flights %>%
group_by(dest) %>%
summarise(
dist = mean(distance),
delay = mean(arr_delay)
)
```
Aggregation functions obey the usual rule of missing values: **if there’s any missing value in the input, the output will be a missing value**.
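The rule can be checked on a plain vector. A minimal sketch, with invented values:

```R
delays <- c(5, NA, 25)

# One missing value is enough to make the mean missing.
mean_with_na <- mean(delays)

# na.rm = TRUE drops the missing values before computing the mean.
mean_without_na <- mean(delays, na.rm = TRUE)

# Count how many values were missing.
n_missing <- sum(is.na(delays))

mean_with_na
mean_without_na
n_missing
```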
### Counts
Whenever you do any aggregation, it’s always a good idea to include a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
```{r summarise_group_by_count, include = T, echo=F, warning=F, message=F, fig.width=8, fig.height=3.5}
summ_delay_flights <- flights %>%
group_by(dest) %>%
summarise(
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(avg_delay < 40 & avg_delay > -20)
ggplot(summ_delay_flights, mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
```
<div class="pencadre">
Imagine that we want to explore the relationship between the average distance (`distance`) and average delay (`arr_delay`) for each location (`dest`) and recreate the above figure.
Here are the steps to prepare and plot this data:
1. Group flights by destination.
2. Summarize to compute average distance (`avg_distance`), average delay (`avg_delay`), and number of flights using `n()` (`n_flights`).
3. Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport.
4. Filter to remove noisy points with an average delay greater than 40 or less than -20.
5. Create a `mapping` on `avg_distance`, `avg_delay` and `n_flights` as `size`.
6. Use the layer `geom_point()` and `geom_smooth()` (use method = lm)
7. We can hide the legend with the layer `theme(legend.position='none')`
</div>
<details><summary>Solution</summary>
<p>
```{r summarise_group_by_count_b, include = T, eval=F, warning=F, message=F, fig.width=8, fig.height=3.5}
flights %>%
group_by(dest) %>%
summarise(
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(avg_delay < 40 & avg_delay > -20) %>%
ggplot(mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
```
</p>
</details>
### Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
<div class="pencadre">
Try the following example
</div>
```{r ungroup, eval=T, message=FALSE, cache=T}
flights %>%
group_by(year, month, day) %>%
ungroup() %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
## Grouping challenges
### First challenge
<div class="pencadre">
Look at the number of canceled flights per day. Is there a pattern?
(A canceled flight is a flight where the `dep_time` or the `arr_time` is `NA`)
**Remember to always try to decompose complex questions into smaller and simple problems**
- How can you create a `canceled` flights variable which will be TRUE if the flight is canceled or FALSE if not?
- We need to define the day of the week variable `wday` (Monday, Tuesday, ...). To do that, you can use `strftime(x,'%A')` to get the name of the day of a date `x` in the POSIXct format, as in the `time_hour` column. For example, `strftime("2013-01-01 05:00:00 EST",'%A')` returns "Tuesday".
- We can count the number of canceled flights (`cancel_day`) by day of the week (`wday`).
- We can pipe the transformed and filtered tibble into a `ggplot` function.
- We can use `geom_col` to have a barplot of the number of `cancel_day` for each `wday`.
- You can use the function `fct_reorder()` to reorder the `wday` by number of `cancel_day` and make the plot easier to read.
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_a, eval=T, message=FALSE, cache=T}
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
filter(canceled) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
summarise(
cancel_day = n()
) %>%
ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
geom_col()
```
</p>
</details>
### Second challenge
<div class="pencadre">
Is the proportion of canceled flights by day of the week related to the average departure delay?
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_b1, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
summarise(
prop_cancel_day = sum(canceled)/n(),
av_delay = mean(dep_delay, na.rm = TRUE)
) %>%
ungroup() %>%
ggplot(mapping = aes(x = av_delay, y = prop_cancel_day, color = wday)) +
geom_point()
```
Which day would you prefer to book a flight ?
</p>
</details>
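A side note on the proportion computed with `sum(canceled)/n()`: since `TRUE` counts as 1 and `FALSE` as 0, this is the mean of the logical column. A minimal sketch on an invented toy tibble (the data and the names `toy` and `toy_prop` are assumptions made for illustration):

```r
library(dplyr)

# Hypothetical toy data: 5 flights over 2 weekdays, 2 of them canceled
toy <- tibble(
  wday = c("Monday", "Monday", "Monday", "Tuesday", "Tuesday"),
  canceled = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

toy_prop <- toy %>%
  group_by(wday) %>%
  summarise(
    prop_sum = sum(canceled) / n(), # explicit ratio
    prop_mean = mean(canceled)      # equivalent shortcut
  )
toy_prop
```

Both columns contain the same values, so either form can be used in `summarise()`.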
<div class="pencadre">
We can add error bars to this plot to justify our decision.
Brainstorm a way to have access to the mean and standard deviation of the `prop_cancel_day` and `av_delay`.
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_b2, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(month, day) %>% # one group per calendar date (all flights are in 2013)
mutate(
prop_cancel_day = sum(canceled)/n(),
av_delay = mean(dep_delay, na.rm = TRUE)
) %>%
group_by(wday) %>%
summarize(
mean_cancel_day = mean(prop_cancel_day, na.rm = TRUE),
sd_cancel_day = sd(prop_cancel_day, na.rm = TRUE),
mean_av_delay = mean(av_delay, na.rm = TRUE),
sd_av_delay = sd(av_delay, na.rm = TRUE)
) %>%
ggplot(mapping = aes(x = mean_av_delay, y = mean_cancel_day, color = wday)) +
geom_point() +
geom_errorbarh(mapping = aes(
xmin = -sd_av_delay + mean_av_delay,
xmax = sd_av_delay + mean_av_delay
)) +
geom_errorbar(mapping = aes(
ymin = -sd_cancel_day + mean_cancel_day,
ymax = sd_cancel_day + mean_cancel_day
))
```
</p>
</details>
<div class="pencadre">
Now that you are aware of the value of `geom_errorbar`, at what `hour` of the day should you fly if you want to avoid delays as much as possible?
</div>
<details><summary>Solution</summary>
<p>
```{r group_filter_b3, eval=T, warning=F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
group_by(hour) %>%
summarise(
mean_delay = mean(arr_delay, na.rm = T),
sd_delay = sd(arr_delay, na.rm = T),
) %>%
ggplot() +
geom_errorbar(mapping = aes(
x = hour,
ymax = mean_delay + sd_delay,
ymin = mean_delay - sd_delay)) +
geom_point(mapping = aes(
x = hour,
y = mean_delay,
))
```
</p>
</details>
### Third challenge
<div class="pencadre">
Which carrier has the worst delays?
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_c2, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
group_by(carrier) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T)
) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_col(alpha = 0.5)
```
</p>
</details>
<div class="pencadre">
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n=n())`)
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_c1, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
group_by(carrier, dest) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T),
number_of_flight = n()
) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_boxplot() +
geom_jitter(height = 0)
```
</p>
</details>
### See you in [R.6: tidydata](/session_6/session_6.html)
session_6/img/overview_joins.png

50.5 KiB

session_6/img/overview_set.png

11.5 KiB

session_6/img/pivot_longer.png

21.1 KiB

session_6/img/pivot_wider.png

21.7 KiB

---
title: "R.6: tidydata"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "dark"
lightbox: true
css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nCarine Rey [carine.rey@ens-lyon.fr](mailto:carine.rey@ens-lyon.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
Until now we have worked with data already formatted in a *nice way*.
In the `tidyverse`, data formatted in a *nice way* are called **tidy**.
The goal of this practical is to understand how to transform an ugly blob of information into a **tidy** data set.
## Tidydata
### Tidydata
There are three interrelated rules which make a dataset tidy:
......@@ -54,13 +44,18 @@ library(tidyverse)
</p>
</details>
For this practical we are going to use the `table` dataset which demonstrate multiple ways to layout the same tabular data.
For this practical we are going to use the `table` set of datasets, which demonstrates multiple ways to lay out the same tabular data.
<div class="pencadre">
Use the help to know more about this dataset
Use the help to learn more about the `table1` dataset
</div>
<details><summary>Solution</summary>
```{r}
?table1
```
<p>
`table1`, `table2`, `table3`, `table4a`, `table4b`, and `table5` all display the number of TB (Tuberculosis) cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.
......@@ -68,9 +63,44 @@ The data is a subset of the data contained in the World Health Organization Glob
</p>
</details>
# Pivoting data
## Pivoting data
### pivot longer
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_longer.png')
```
```{r, eval = F}
wide_example <- tibble(X1 = c("A","B"),
X2 = c(1,2),
X3 = c(0.1,0.2),
X4 = c(10,20))
```
If you have a wide dataset, such as `wide_example`, that you want to make longer, you will use the `pivot_longer()` function.
You have to specify the names of the columns you want to pivot into longer format (X2,X3,X4):
```{r, eval = F}
wide_example %>%
pivot_longer(c(X2,X3,X4))
```
... or the reverse selection (-X1):
```{r, eval = F}
wide_example %>% pivot_longer(-X1)
```
You can specify the names of the new columns that hold the former column names and their values (by default, `name` and `value`):
```{r, eval = F}
long_example <- wide_example %>%
pivot_longer(-X1, names_to = "V1", values_to = "V2")
```
## pivot longer
#### Exercise
<div class="pencadre">
Visualize the `table4a` dataset (you can use the `View()` function).
......@@ -107,7 +137,23 @@ table4a %>%
</p>
</details>
## pivot wider
### pivot wider
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_wider.png')
```
If you have a long dataset, that you want to make wider, you will use the `pivot_wider()` function.
You have to specify which column contains the names of the output columns (`names_from`), and which column contains the cell values (`values_from`).
```{r, eval = F}
long_example %>% pivot_wider(names_from = V1,
values_from = V2)
```
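To check that the two functions are inverses of each other, here is a runnable sketch of the round trip. It reuses the toy `wide_example` table and the column names `V1` and `V2` from the examples above; the name `round_trip` is introduced only for illustration.

```r
library(dplyr)
library(tidyr)

# Same toy table as `wide_example` above
wide_example <- tibble(X1 = c("A", "B"),
                       X2 = c(1, 2),
                       X3 = c(0.1, 0.2),
                       X4 = c(10, 20))

# Wide -> long: one row per (X1, former column name) pair
long_example <- wide_example %>%
  pivot_longer(-X1, names_to = "V1", values_to = "V2")
dim(long_example) # 6 rows, 3 columns

# Long -> wide: back to the original layout
round_trip <- long_example %>%
  pivot_wider(names_from = V1, values_from = V2)
round_trip
```

Two rows and three pivoted columns give 2 x 3 = 6 rows in the long format.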
#### Exercise
<div class="pencadre">
Visualize the `table2` dataset
......@@ -128,13 +174,15 @@ table2 %>%
</p>
</details>
# Merging data
## Merging data
## Relational data
### Relational data
Sometime the information can be split between different table
To avoid having a huge table and to save space, information is often split between different tables.
```{r airlines, eval=F, echo = T}
In our `flights` dataset, information about the `carrier` and the `airports` (origin and dest) is saved in separate tables (`airlines`, `airports`).
```{r airlines, eval=T, echo = T}
library(nycflights13)
flights
airlines
......@@ -144,27 +192,40 @@ flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
```
## Relational data
### Relational schema
The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
```{r airlines_dag, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/relational-nycflights.png')
```
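One way to check that a variable really is a key is to count its values and look for duplicates. The sketch below uses two invented toy tables (`toy_planes` and `toy_flights` are hypothetical, not the `nycflights13` tables):

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration
toy_planes <- tibble(tailnum = c("N1", "N2", "N3"), seats = c(100, 150, 200))
toy_flights <- tibble(tailnum = c("N1", "N1", "N2"), dest = c("IAH", "MIA", "IAH"))

# A variable is a key if no value appears more than once
dup_planes <- toy_planes %>% count(tailnum) %>% filter(n > 1)
dup_flights <- toy_flights %>% count(tailnum) %>% filter(n > 1)

nrow(dup_planes)  # 0: tailnum uniquely identifies each plane
nrow(dup_flights) # 1: tailnum alone does not identify a flight
```

The same `count()` then `filter(n > 1)` pattern should work on any table and any set of candidate key columns.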
## joins
### Joins
If you have to combine data from 2 tables in a new table, you will use joins.
There are several types of joins, depending on what you want to get.
```{r joints, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/join-venn.png')
```
## `inner_join()`
Small concrete examples:
Matches pairs of observations whenever their keys are equal
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_joins.png')
```
#### `inner_join()`
keeps observations in `x` AND `y`
```{r inner_joint, eval=T}
flights2 %>%
inner_join(airlines)
```
## `left_join()`
#### `left_join()`
keeps all observations in `x`
......@@ -173,7 +234,7 @@ flights2 %>%
left_join(airlines)
```
## `right_join()`
#### `right_join()`
keeps all observations in `y`
......@@ -182,7 +243,7 @@ flights2 %>%
right_join(airlines)
```
## `full_join()`
#### `full_join()`
keeps all observations in `x` and `y`
......@@ -191,43 +252,52 @@ flights2 %>%
full_join(airlines)
```
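The difference between the four joins is easier to see on small tables. A minimal sketch, assuming two invented toy tables `x` and `y` that share a column `key` (all names here are hypothetical):

```r
library(dplyr)

# Hypothetical toy tables sharing the key column `key`
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

in_both <- inner_join(x, y, by = "key") # keys 1, 2
all_x   <- left_join(x, y, by = "key")  # keys 1, 2, 3 (val_y is NA for key 3)
all_y   <- right_join(x, y, by = "key") # keys 1, 2, 4 (val_x is NA for key 4)
all_xy  <- full_join(x, y, by = "key")  # keys 1, 2, 3, 4
```

Rows without a match are kept or dropped depending on the join, and the missing side is filled with `NA`.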
## Defining the key columns
### Defining the key columns
The default, `by = NULL`, uses all variables that appear in both tables, the so-called natural join.
```{r left_join_weather, eval=T}
```{r , eval=T}
flights2 %>%
left_join(weather)
```
## Defining the key columns
If the two tables contain columns with the same names but corresponding to different things (such as `year` in `flights2` and `planes`) you have to manually define the key or the keys.
The default, `by = NULL`, uses all variables that appear in both tables, the so called natural join.
```{r left_join_tailnum, eval=T, echo = T}
```{r , eval=T, echo = T}
flights2 %>%
left_join(planes, by = "tailnum")
```
## Defining the key columns
A named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: `by = c("a" = "b")`. This will match variable `a` in table `x` to variable `b` in table `y`.
```{r left_join_airport, eval=T, echo = T}
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa"))
```
## Filtering joins
If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffixes `.x` and `.y`, because all column names must be different in the output table.
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, c("dest" = "faa")) %>%
left_join(airports, c("origin" = "faa"))
```
You can change the suffixes using the option `suffix`:
```{r , eval=T, echo = T}
flights2 %>%
left_join(airports, by = c("dest" = "faa")) %>%
left_join(airports, by = c("origin" = "faa"), suffix = c(".dest",".origin"))
```
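The renaming is easier to follow on small tables. A minimal sketch, assuming two invented toy tables whose key has different names (`dest` and `faa`) and which both contain a non-key column `name` (the tables and the suffixes `.trip` and `.airport` are hypothetical):

```r
library(dplyr)

# Hypothetical toy tables: different key names, same non-key column `name`
toy_trips <- tibble(dest = c("IAH", "MIA"), name = c("trip1", "trip2"))
toy_airports <- tibble(faa = c("IAH", "MIA"), name = c("Houston", "Miami"))

# Default suffixes
default_suffix <- toy_trips %>%
  left_join(toy_airports, by = c("dest" = "faa"))
names(default_suffix) # "dest" "name.x" "name.y"

# Custom suffixes
custom_suffix <- toy_trips %>%
  left_join(toy_airports, by = c("dest" = "faa"), suffix = c(".trip", ".airport"))
names(custom_suffix) # "dest" "name.trip" "name.airport"
```

Only the key column of `x` (`dest`) is kept in the output; `faa` does not appear.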
### Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
- `semi_join(x, y)` keeps all observations in `x` that have a match in `y`.
- `anti_join(x, y)` drops all observations in `x` that have a match in `y`.
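A minimal sketch of both filtering joins on two invented toy tables (`x`, `y`, `kept` and `dropped` are hypothetical names used only for illustration):

```r
library(dplyr)

# Hypothetical toy tables sharing the key column `key`
x <- tibble(key = c(1, 2, 3), val_x = c("x1", "x2", "x3"))
y <- tibble(key = c(1, 2, 4), val_y = c("y1", "y2", "y4"))

kept <- semi_join(x, y, by = "key")    # rows of x with a match in y: keys 1, 2
dropped <- anti_join(x, y, by = "key") # rows of x without a match in y: key 3

names(kept) # only the columns of x: no variable from y is added
```

Unlike `inner_join()`, `semi_join()` never adds columns from `y`.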
## Filtering joins
```{r top_dest, eval=T, echo = T}
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
......@@ -236,10 +306,16 @@ flights %>%
semi_join(top_dest)
```
## Set operations
### Set operations
These expect the x and y inputs to have the same variables, and treat the observations like sets:
- `intersect(x, y)`: return only observations in both `x` and `y`.
- `union(x, y)`: return unique observations in `x` and `y`.
- `setdiff(x, y)`: return observations in `x`, but not in `y`.
```{r , echo=FALSE, out.width='100%'}
knitr::include_graphics('img/overview_set.png')
```
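A minimal sketch of the three set operations, assuming two invented toy tables with the same variables (`x` and `y` are hypothetical):

```r
library(dplyr)

# Hypothetical toy tables with identical columns
x <- tibble(a = c(1, 2), b = c("u", "v"))
y <- tibble(a = c(2, 3), b = c("v", "w"))

both <- intersect(x, y) # the row (2, "v")
either <- union(x, y)   # the rows (1, "u"), (2, "v"), (3, "w")
only_x <- setdiff(x, y) # the row (1, "u")
```

Rows are compared as a whole: two rows are equal only if all their variables are equal.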
### See you in [R.7: String & RegExp](/session_7/session_7.html)
---
title: '#8 Factors'
title: "R.8: Factors"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "31 Jan 2020"
always_allow_html: yes
output:
slidy_presentation:
highlight: tango
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
date: "2022"
---
```{r setup, include=FALSE, cache=TRUE}
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
## Introduction
In this session, you will learn more about the factor type in R.
Factors can be very useful, but you have to be mindful of the implicit conversions from simple vectors to factors!
They are the source of a lot of pain for R programmers.
<div class="pencadre">
As usual we will need the `tidyverse` library.
</div>
<details><summary>Solution</summary>
<p>
```{r load_data, eval=T, message=F}
library(tidyverse)
```
</p>
</details>
## Creating factors
......@@ -41,8 +57,6 @@ x2 <- c("Dec", "Apr", "Jam", "Mar")
sort(x1)
```
## Creating factors
You can fix both of these problems with a factor.
```{r sort_month_factor, eval=T, cache=T}
......@@ -55,8 +69,6 @@ y1
sort(y1)
```
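The effect of the levels on sorting can be sketched with a self-contained example. It assumes base R's built-in `month.abb` vector ("Jan", ..., "Dec") as the levels, which may differ from the levels vector defined in the chunks of this session:

```r
# Assumption: month.abb is used as the set of valid levels
x1 <- c("Dec", "Apr", "Jan", "Mar")

sort(x1) # alphabetical order: "Apr" "Dec" "Jan" "Mar"

y1 <- factor(x1, levels = month.abb)
sort(y1) # order of the levels: Jan Mar Apr Dec
```

A character vector is sorted alphabetically, whereas a factor is sorted in the order of its levels.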
## Creating factors
And any values not in the set will be converted to NA:
```{r sort_month_factor2, eval=T, cache=T}
......@@ -79,12 +91,10 @@ gss_cat %>%
count(race)
```
## General Social Survey
By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:
By default, `ggplot2` will drop levels that don’t have any values. You can force them to display with:
```{r race_plot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(gss_cat, aes(race)) +
ggplot(gss_cat, aes(x = race)) +
geom_bar() +
scale_x_discrete(drop = FALSE)
```
......@@ -101,39 +111,28 @@ relig_summary <- gss_cat %>%
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
ggplot(relig_summary, aes(x = tvhours, y = relig)) + geom_point()
```
**8_a**
## Modifying factor order
It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using `fct_reorder()`. `fct_reorder()` takes three arguments:
- `f`, the factor whose levels you want to modify.
- `x`, a numeric vector that you want to use to reorder the levels.
- Optionally, `fun`, a function that’s used if there are multiple values of `x` for each value of `f`. The default value is `median`.
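A minimal sketch of the reordering on invented toy data (`f` and `x` here are hypothetical, one value of `x` per level of `f`):

```r
library(forcats)

# Hypothetical toy data
f <- factor(c("a", "b", "c"))
x <- c(3, 1, 2)

levels(f)                 # "a" "b" "c"
levels(fct_reorder(f, x)) # "b" "c" "a": levels sorted by increasing x
```

Only the order of the levels changes; the values stored in the factor stay the same.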
## Modifying factor order
```{r tv_hour_order, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
geom_point()
```
**8_b**
## Modifying factor order
As you start making more complicated transformations, I’d recommend moving them out of `aes()` and into a separate `mutate()` step. For example, you could rewrite the plot above as:
```{r tv_hour_order_mutate, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
relig_summary %>%
mutate(relig = fct_reorder(relig, tvhours)) %>%
ggplot(aes(tvhours, relig)) +
ggplot(aes(x = tvhours, y = relig)) +
geom_point()
```
**8_c**
## `fct_reorder2()`
......@@ -146,23 +145,35 @@ by_age <- gss_cat %>%
group_by(age) %>%
mutate(prop = n / sum(n))
```
**8_d**
## `fct_reorder2()`
```{r fct_reorder2a, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(by_age, aes(age, prop, colour = marital)) +
ggplot(by_age, aes(x = age, y = prop, colour = marital)) +
geom_line(na.rm = TRUE)
```
**8_e**
## `fct_reorder2()`
```{r fct_reorder2b, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
ggplot(by_age, aes(x = age, y = prop, colour = fct_reorder2(marital, age, prop))) +
geom_line() +
labs(colour = "marital")
```
**8_f**
\ No newline at end of file
## Materials
There is a lot of material online for R, and more particularly for the `tidyverse` and `Rstudio`.
You can find cheat sheets for all the packages of the `tidyverse` on this page:
[https://www.rstudio.com/resources/cheatsheets/](https://www.rstudio.com/resources/cheatsheets/)
The `Rstudio` websites are also a good place to learn more about R and the meta-package maintained by the `Rstudio` community:
- [https://www.rstudio.com/resources/webinars/](https://www.rstudio.com/resources/webinars/)
- [https://www.rstudio.com/products/rpackages/](https://www.rstudio.com/products/rpackages/)
For example [rmarkdown](https://rmarkdown.rstudio.com/) is a great way to turn your analyses into high quality documents, reports, presentations and dashboards:
- A comprehensive guide: [https://bookdown.org/yihui/rmarkdown/](https://bookdown.org/yihui/rmarkdown/)
- The cheatsheet [https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf)
In addition, most packages provide **vignettes** on how to perform an analysis from scratch. On the [bioconductor.org](http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) website (specialised in R packages for biologists), you will find direct links to each package's vignettes.
Finally, don't forget to search the web for your problems or errors in R: websites like [stackoverflow](https://stackoverflow.com/) contain high-quality, well-curated answers.
\ No newline at end of file