Showing
with 20252 additions and 0 deletions
session_2/img/colors.png

286 KiB

session_2/img/formationR_session2_scriptR.png

198 KiB

session_2/img/shapes.png

25.6 KiB

session_2/img/tidyverse.jpg

69.5 KiB

---
title: "R.2: introduction to Tidyverse"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nHélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r download_data, include=FALSE, eval=T}
library("tidyverse")
tmp <- tempfile(fileext = ".zip")
download.file("http://www.fueleconomy.gov/feg/epadata/vehicles.csv.zip",
tmp,
quiet = TRUE)
unzip(tmp, exdir = "data-raw")
new_class_level <- c(
"Compact Cars",
"Large Cars",
"Midsize Cars",
"Midsize Cars",
"Midsize Cars",
"Compact Cars",
"Minivan",
"Minivan",
"Pickup Trucks",
"Pickup Trucks",
"Pickup Trucks",
"Sport Utility Vehicle",
"Sport Utility Vehicle",
"Compact Cars",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Special Purpose Vehicle",
"Sport Utility Vehicle",
"Sport Utility Vehicle",
"Pickup Trucks",
"Pickup Trucks",
"Pickup Trucks",
"Pickup Trucks",
"Sport Utility Vehicle",
"Sport Utility Vehicle",
"Compact Cars",
"Two Seaters",
"Vans",
"Vans",
"Vans",
"Vans"
)
new_fuel_level <- c(
"gas",
"Diesel",
"Regular",
"gas",
"gas",
"Regular",
"Regular",
"Hybrid",
"Hybrid",
"Regular",
"Regular",
"Hybrid",
"Hybrid"
)
read_csv("data-raw/vehicles.csv") %>%
select(
"id",
"make",
"model",
"year",
"VClass",
"trany",
"drive",
"cylinders",
"displ",
"fuelType",
"highway08",
"city08"
) %>%
rename(
"class" = "VClass",
"trans" = "trany",
"drive" = "drive",
"cyl" = "cylinders",
"displ" = "displ",
"fuel" = "fuelType",
"hwy" = "highway08",
"cty" = "city08"
) %>%
filter(drive != "") %>%
drop_na() %>%
arrange(make, model, year) %>%
mutate(class = factor(as.factor(class), labels = new_class_level)) %>%
mutate(fuel = factor(as.factor(fuel), labels = new_fuel_level)) %>%
write_csv("mpg.csv")
```
## Introduction
In the last session, we went through the basics of R.
Instead of continuing to learn more about R programming, in this session we are going to jump directly to rendering plots.
We make this choice for three reasons:
- Rendering nice plots is directly rewarding
- You will be able to apply what you learn in this session to your own data (given that they are *correctly formatted*)
- We will come back to R programming later, when you have all the necessary tools to visualize your results.
The objectives of this session will be to:
- Create basic plots with the `ggplot2` library
- Understand the `tibble` type
- Learn the different aesthetics in R plots
- Compose complex graphics
### Tidyverse
The `tidyverse` package is a collection of R packages designed for data science that includes `ggplot2`.
All packages share an underlying design philosophy, grammar, and data structures (plus the same shape of logo).
<center>
![](./img/tidyverse.jpg){width=500px}
</center>
`tidyverse` is a meta-package, which can take a long time to install with the following command:
```R
install.packages("tidyverse")
```
Luckily for you, `tidyverse` is preinstalled on your RStudio server, so you just have to load the library:
```{R load_tidyverse}
library("tidyverse")
```
### Toy data set `mpg`
This dataset contains a subset of the fuel economy data that the EPA makes available on [fueleconomy.gov](http://fueleconomy.gov).
It contains only models which had a new release every year between 1999 and 2008.
You can use the `?` command to know more about this dataset.
```{r mpg_inspect, include=TRUE}
?mpg
```
But instead of using a dataset included in an R package, you may want to be able to use any dataset with the same format.
For that we are going to use the command `read_csv`, which is able to read a [csv](https://en.wikipedia.org/wiki/Comma-separated_values) file.
This command also works with a file URL:
```{r mpg_download_local, cache=TRUE, message=FALSE, echo = F, include=F}
new_mpg <- read_csv("./mpg.csv")
```
```{r mpg_download, cache=TRUE, message=FALSE, eval = F}
new_mpg <- read_csv(
"https://can.gitbiopages.ens-lyon.fr/R_basis/session_2/mpg.csv"
)
```
You can check the number of lines and columns of the data with `dim`:
```{r mpg_inspect2, include=TRUE}
dim(new_mpg)
```
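If you want to try the reading and inspection steps without network access, the following minimal sketch writes a small table to a temporary file and reads it back. The table content and the name `toy` are made up for illustration and are not part of the course data.

```R
library(readr)

# Hypothetical toy table written to a temporary file, so that the example
# does not depend on the network or on the course files.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c("model,displ,hwy", "a4,1.8,29", "a4,2.0,31", "corvette,5.7,26"),
  toy_path
)

# col_types = cols() lets readr guess the column types without printing them
toy <- read_csv(toy_path, col_types = cols())

dim(toy) # number of rows, then number of columns
```

The same call works with `new_mpg`: only the path (or URL) given to `read_csv` changes.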
To visualize the data in RStudio, you can use the `View` command:
```R
View(new_mpg)
```
You can also simply call the variable.
As with simple data types, calling a variable prints it.
Complex data types like `new_mpg` can define their own, more elaborate print function.
```{r mpg_inspect3, include=TRUE}
new_mpg
```
Here we can see that `new_mpg` is a `tibble`. We will come back to `tibble` later.
### New script
Like in the last session, instead of typing your commands directly in the console, you are going to write them in an R script.
![](./img/formationR_session2_scriptR.png)
## First plot with `ggplot2`
We are going to make the simplest plot possible to study the relationship between two variables: the scatterplot.
The following command generates a plot of engine size `displ` versus fuel efficiency `hwy`, two columns present in the `new_mpg` `tibble`.
```{r new_mpg_plot_a, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
```
<div class="pencadre">
Are cars with bigger engines less fuel efficient ?
</div>
`ggplot2` is a system for declaratively creating graphics, based on [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). You provide the data, tell `ggplot2` how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.
```
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
```
- you begin a plot with the function `ggplot()`
- you complete your graph by adding one or more layers
- `geom_point()` adds a layer with a scatterplot
- each **geom** function in `ggplot2` takes a `mapping` argument
- the `mapping` argument is always paired with `aes()`
<div class="pencadre">
What happens when you use only the command `ggplot(data = new_mpg)`?
</div>
<details><summary>Solution</summary>
<p>
```{r only_ggplot, cache = TRUE, fig.width=4.5, fig.height=2}
ggplot(data = new_mpg)
```
</p>
</details>
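One way to check how layers accumulate is to store the plot in a variable and count its layers. This is a small sketch that uses the `mpg` table shipped with `ggplot2` rather than `new_mpg`; the names `p_empty` and `p_points` are made up for the example.

```R
library(ggplot2)

# `mpg` is the example table shipped with ggplot2 (not the course `new_mpg`)
p_empty <- ggplot(data = mpg) # canvas only, nothing is drawn yet
p_points <- p_empty +
  geom_point(mapping = aes(x = displ, y = hwy)) # adds one scatterplot layer

length(p_empty$layers)  # number of layers before adding a geom
length(p_points$layers) # number of layers after adding geom_point()
```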
<div class="pencadre">
Make a scatterplot of `hwy` ( fuel efficiency ) vs. `cyl` ( number of cylinders ).
</div>
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_b, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = hwy, y = cyl)) +
geom_point()
```
</p>
<div class="pencadre">
What seems to be the problem ?
</div>
<details><summary>Solution</summary>
<p>
Dots with the same coordinates are drawn on top of each other, so you cannot see how many observations each dot represents.
</p>
</details>
</details>
## Aesthetic mappings
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. `ggplot2` will also add a legend that explains which levels correspond to which values.
Try the following aesthetics:
- `size`
- `alpha`
- `shape`
### `color` mapping
```{r new_mpg_plot_e, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()
```
### `size` mapping
```{r new_mpg_plot_f, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, size = class)) +
geom_point()
```
### `alpha` mapping
```{r new_mpg_plot_g, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, alpha = class)) +
geom_point()
```
### `shape` mapping
```{r new_mpg_plot_h, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, shape = class)) +
geom_point()
```
You can also set the aesthetic properties of your **geom** manually. For example, we can make all of the points in our plot blue squares:
```{r new_mpg_plot_i, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(color = "blue", shape=0)
```
Here is a list of different shapes available in R:
<center>
![](./img/shapes.png){width=300px}
</center>
<div class="pencadre">
What’s gone wrong with this code? Why are the points not blue?
</div>
```{r new_mpg_plot_not_blue, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = "blue")) +
geom_point()
```
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_blue, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(color = "blue")
```
</p>
</details>
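The difference between mapping and setting a color can also be inspected in the data that `ggplot2` computes for each layer. The sketch below uses the `mpg` table shipped with `ggplot2` and the `layer_data()` helper; the names `mapped` and `fixed` are chosen for the example.

```R
library(ggplot2)

# Inside aes(): "blue" is treated as a one-value variable and goes through
# the color scale, like any other data column.
mapped <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

# Outside aes(): "blue" is a fixed property of the layer.
fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

unique(layer_data(mapped)$colour) # a color picked by the default scale
unique(layer_data(fixed)$colour)  # the color that was set by hand
```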
### Mapping a **continuous** variable to a color.
You can also map a continuous variable to a color:
```{r continu, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = cyl)) +
geom_point()
```
<div class="pencadre">
What happens if you map an aesthetic to something other than a variable name, like `color = displ < 5`?
</div>
<details><summary>Solution</summary>
<p>
```{r condiColor, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = displ < 5)) +
geom_point()
```
</p>
</details>
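The expression `displ < 5` is evaluated for every row and gives a logical value, so the color scale only has two levels to display. A short sketch on the `mpg` table shipped with `ggplot2` (the name `p_cond` is made up for the example):

```R
library(ggplot2)

# The condition is computed row by row: TRUE or FALSE for each car
table(mpg$displ < 5)

p_cond <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = displ < 5)) +
  geom_point()

# Number of distinct colors actually used in the plot
n_colors <- length(unique(layer_data(p_cond)$colour))
n_colors
```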
## Facets
You can create multiple plots at once by faceting. For this you can use the command `facet_wrap`.
This command takes a formula as input.
We will come back to formulas in R later, for now, you have to know that formulas start with a `~` symbol.
To make a scatterplot of `displ` versus `hwy` per car `class` you can use the following code:
```{r new_mpg_plot_k, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~class, nrow = 2)
```
<div class="pencadre">
Now try to facet your plot by `fuel + class`
</div>
<details><summary>Solution</summary>
<p>
Formulas allow you to express complex relationships between variables in R!
```{r new_mpg_plot_l, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fuel + class, nrow = 2)
```
</p>
</details>
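If the formula notation feels unfamiliar, `facet_wrap` also accepts variables wrapped in `vars()`. The sketch below, on the `mpg` table shipped with `ggplot2`, builds the same faceting both ways and counts the resulting panels; `n_panels` is a small helper written for this example, not a `ggplot2` function.

```R
library(ggplot2)

p_formula <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~class, nrow = 2)

p_vars <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(vars(class), nrow = 2)

# Helper for this example: count the panels in which points are drawn
n_panels <- function(p) length(unique(layer_data(p)$PANEL))

n_panels(p_formula) # one panel per class
n_panels(p_vars)    # same number of panels
```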
## Composition
There are different ways to represent the information :
```{r new_mpg_plot_o, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()
```
\
```{r new_mpg_plot_p, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth()
```
\
We can add as many layers as we want
```{r new_mpg_plot_q, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
```
\
We can make `mapping` layer specific
```{r new_mpg_plot_s, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
```
\
We can use different `data` (here the `new_mpg` and `mpg` tables) for different layers (you will learn more about `filter()` later)
```{r new_mpg_plot_t, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"))
```
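To check that each layer really works on its own table, you can compare the data computed for each layer. This sketch only uses the `mpg` table shipped with `ggplot2`; the `method = "lm"` choice is an assumption made here to keep the example fast and quiet, it is not the smoothing used in the plot above.

```R
library(ggplot2)
library(dplyr)

subcompact <- filter(mpg, class == "subcompact")

p_layers <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +                       # layer 1: all cars
  geom_smooth(data = subcompact, method = "lm", formula = y ~ x)   # layer 2: subcompact only

nrow(layer_data(p_layers, 1))     # one point per row of mpg
range(layer_data(p_layers, 2)$x)  # the curve only spans the subcompact engine sizes
range(subcompact$displ)
```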
## Challenge !
### First challenge
<div class="pencadre">
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
</div>
```R
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
<div class="pencadre">
- What does `show.legend = FALSE` do?
- What does the `se` argument to `geom_smooth()` do?
</div>
<details><summary>Solution</summary>
<p>
```{r soluce_challenge_1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
</p>
</details>
### Second challenge
<div class="pencadre">
How does being a `Two Seaters` car (*class column*) affect the relationship between engine size (*displ column*) and fuel efficiency (*hwy column*)?
1. Make a plot of `hwy` as a function of `displ`
2. *Colorize* the `Two Seaters` class in another color on this plot
3. *Split* this plot for each *class*
</div>
<details><summary>Solution 1</summary>
<p>
```{r new_mpg_plot_color_2seater1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()
```
</p>
</details>
<details><summary>Solution 2</summary>
<p>
```{r new_mpg_plot_color_2seater2, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red")
```
</p>
</details>
<details><summary>Solution 3</summary>
<p>
```{r new_mpg_plot_color_2seater_facet, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red") +
facet_wrap(~class)
```
</p>
</details>
<div class="pencadre">
Write a `function` called `plot_color_a_class` that can take as argument the class and plot the same graph for this class
</div>
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_color_2seater_fx, cache = TRUE, fig.width=8, fig.height=4.5}
plot_color_a_class <- function(my_class) {
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(new_mpg, class == my_class), color = "red") +
facet_wrap(~class)
}
plot_color_a_class("Two Seaters")
plot_color_a_class("Compact Cars")
```
</p>
</details>
### Third challenge
<div class="pencadre">
Recreate the R code necessary to generate the following graph (see "linetype" option of "geom_smooth")
</div>
```{r new_mpg_plot_u, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_v, eval=F}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
</p>
</details>
### See you in [R.3: Transformations with ggplot2](/session_3/session_3.html)
## To go further: publication ready plots
Once you have created the graph you need for your publication, you have to save it.
You can do it with the `ggsave` function.
First save your plot in a variable :
```{r}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()
```
Then save it in the wanted format:
```{r, eval=F}
ggsave("test_plot_1.png", p1, width = 12, height = 8, units = "cm")
```
```{r, eval=F}
ggsave("test_plot_1.pdf", p1, width = 12, height = 8, units = "cm")
```
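A quick way to confirm that the file was written is to save it in a temporary directory and test for its existence. This is a sketch on the `mpg` table shipped with `ggplot2`; the name `out_file` is made up for the example.

```R
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Save in the session's temporary directory to avoid cluttering your project
out_file <- file.path(tempdir(), "test_plot_1.pdf")
ggsave(out_file, p, width = 12, height = 8, units = "cm")

file.exists(out_file) # TRUE when the file was written
```

The output format is deduced from the file extension, which is why the two calls above only differ by `.png` and `.pdf`.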
You may also change the appearance of your plot by adding a `theme` layer to your plot:
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 + theme_bw()
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 + theme_minimal()
```
You may have to combine several plots, for that you can use the `cowplot` package which is a `ggplot2` extension.
First install it :
```{r, eval=F}
install.packages("cowplot")
```
```{r, include=F, echo =F}
if (! require("cowplot")) {
install.packages("cowplot")
}
```
Then you can use the function `plot_grid` to combine plots in a publication ready style:
```{r,message=FALSE}
library(cowplot)
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p1 <- ggplot(data = new_mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
p1
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
p2
```
```{r,fig.width=8, fig.height=4.5, message=FALSE}
plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)
```
You can also save it in a file.
```{r, eval=F}
p_final <- plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)
ggsave("test_plot_1_and_2.png", p_final, width = 20, height = 8, units = "cm")
```
You can learn more about `cowplot` features on [its website](https://wilkelab.org/cowplot/articles/introduction.html).
<div class="pencadre">
Use the `cowplot` documentation to reproduce this plot and save it.
</div>
```{r, echo=F}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() + theme_bw()
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_point() + theme_bw()
p_row <- plot_grid(p1 + theme(legend.position = "none"), p2 + theme(legend.position = "none"), labels = c('A', 'B'), label_size = 12)
p_legend <- get_legend(p1 + theme(legend.position = "top"))
plot_grid(p_row, p_legend, nrow = 2, rel_heights = c(1,0.2))
```
<details><summary>Solution</summary>
<p>
```{r , echo = TRUE, eval = F}
p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() + theme_bw()
p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_point() + theme_bw()
p_row <- plot_grid(p1 + theme(legend.position = "none"), p2 + theme(legend.position = "none"), labels = c('A', 'B'), label_size = 12)
p_legend <- get_legend(p1 + theme(legend.position = "top"))
p_final <- plot_grid(p_row, p_legend, nrow = 2, rel_heights = c(1,0.2))
p_final
```
```{r , echo = TRUE, eval = F}
ggsave("plot_1_2_and_legend.png", p_final, width = 20, height = 8, units = "cm")
```
</p>
</details>
There are a lot of other available `ggplot2` extensions which can be useful (and also beautiful).
You can take a look at them here: [ggplot2 gallery](https://exts.ggplot2.tidyverse.org/gallery/)
File added
session_3/img/visualization-stat-bar.png

257 KiB

---
title: 'R.3: Transformations with ggplot2'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
## Introduction
In the last session, we have seen how to use `ggplot2` and [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). The goal of this practical is to practice more advanced features of `ggplot2`.
The objectives of this session will be to:
- learn about statistical transformations
- practice position adjustments
- change the coordinate systems
The first step is to load the `tidyverse`.
<details><summary>Solution</summary>
<p>
```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
```
</p>
</details>
Like in the previous sessions, it's good practice to create a new **.R** file to write your code instead of using the R terminal directly.
## `ggplot2` statistical transformations
In the previous session, we have plotted the data as they are by using the variable values as **x** or **y** coordinates, color shade, size or transparency.
When dealing with categorical variables, also called **factors**, it can be interesting to perform some simple statistical transformations.
For example, we may want to have coordinates on an axis proportional to the number of records for a given category.
We are going to use the `diamonds` data set included in `tidyverse`.
<div class="pencadre">
- Use the `help` and `View` command to explore this data set.
- How many records does this dataset contain?
- Try the `str` command: what information is displayed?
</div>
```{r str_diamon}
str(diamonds)
```
### Introduction to `geom_bar`
We have seen scatterplots (`geom_point()`) and smoothed curves (`geom_smooth()`).
Now let's make a barplot with `geom_bar()`:
```{r diamonds_barplot, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = diamonds, mapping = aes(x = cut)) +
geom_bar()
```
More diamonds are available with high quality cuts.
On the x-axis, the chart displays **cut**, a variable from diamonds. On the y-axis, it displays **count**, **but count is not a variable in diamonds!**
### **geom** and **stat**
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
![](img/visualization-stat-bar.png)
You can generally use **geoms** and **stats** interchangeably. For example, you can recreate the previous plot using `stat_count()` instead of `geom_bar()`:
```{r diamonds_stat_count, include=TRUE, fig.width=8, fig.height=4.5}
ggplot(data = diamonds, mapping = aes(x = cut)) +
stat_count()
```
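To make the **count** computed by the stat visible, you can compare it with a count done by hand. The following sketch uses `count()` from `dplyr` and the `layer_data()` helper of `ggplot2`; the names `by_hand`, `p_bar` and `by_stat` are chosen for the example.

```R
library(ggplot2)
library(dplyr)

# Counting by hand: one row per cut, with the number of records in column `n`
by_hand <- count(diamonds, cut)
by_hand

# Counting with the stat: the values that geom_bar() uses as bar heights
p_bar <- ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()
by_stat <- layer_data(p_bar)$count

all(by_hand$n == by_stat) # both approaches give the same numbers
```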
Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three reasons you might need to use a **stat** explicitly:
### Why **stat** ?
You might want to override the default stat.
For example, in the following `demo` dataset we already have a variable for the **counts** per `cut`.
```{r 3_a, include=TRUE, fig.width=8, fig.height=4.5}
demo <- tribble(
~cut, ~freq,
"Fair", 1610,
"Good", 4906,
"Very Good", 12082,
"Premium", 13791,
"Ideal", 21551
)
```
(Don't worry that you haven't seen `tribble()` before. You might be able
to guess its meaning from the context, and you will learn exactly what
it does soon!)
<div class="pencadre">
So instead of using the default `geom_bar` parameter `stat = "count"` try to use `"identity"`
</div>
<details><summary>Solution</summary>
<p>
```{r 3_ab, include=TRUE, fig.width=8, fig.height=4.5}
ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
geom_bar(stat = "identity")
```
</p>
</details>
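`ggplot2` also provides `geom_col()`, which draws bars from values that are already in the data. The sketch below rebuilds the `demo` table and compares the bar heights obtained with both approaches; the names `p_identity` and `p_col` are made up for the example.

```R
library(ggplot2)
library(tibble)

demo <- tribble(
  ~cut, ~freq,
  "Fair", 1610,
  "Good", 4906,
  "Very Good", 12082,
  "Premium", 13791,
  "Ideal", 21551
)

p_identity <- ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
  geom_bar(stat = "identity")
p_col <- ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
  geom_col()

# Bar heights are the same with both approaches
all.equal(layer_data(p_identity)$y, layer_data(p_col)$y)
```

Note that `cut` is a plain character column in `demo`, so the bars are ordered alphabetically rather than by cut quality.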
You might want to override the default mapping from transformed variables to aesthetics ( e.g., proportion).
```{r 3_b, include=TRUE, fig.width=8, fig.height=4.5}
ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
geom_bar()
```
<div class="pencadre">
In our proportion bar chart, we need to set `group = 1`. Why?
</div>
<details><summary>Solution</summary>
<p>
```{r diamonds_stats_challenge, include=TRUE, message=FALSE, fig.width=8, fig.height=4.5}
ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop))) +
geom_bar()
```
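If `group` is not set, each `cut` level forms its own group, and the proportion is computed within each group. Every bar then represents 100% of its own group: for instance, the proportion of ideal cuts among the ideal-cut diamonds is 1. Setting `group = 1` puts all the records in a single group, so the proportions are computed over the whole dataset.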
</p>
</details>
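The proportions can also be computed by hand, which makes the role of `group = 1` easier to see. This sketch assumes `ggplot2` 3.3.0 or later for `after_stat()`; the names `props`, `p_prop` and `p_nogroup` are made up for the example.

```R
library(ggplot2)
library(dplyr)

# By hand: count per cut, then divide by the total number of records
props <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))
props

# With group = 1: all records are in one group, proportions sum to 1
p_prop <- ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()
layer_data(p_prop)$prop

# Without group: each cut is its own group, so every proportion is 1
p_nogroup <- ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop))) +
  geom_bar()
layer_data(p_nogroup)$prop
```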
### More details with `stat_summary`
<div class="pencadre">
You might want to draw greater attention to the statistical transformation in your code.
You can use `stat_summary()`, which summarizes the **y** values for each unique **x**
value, to draw attention to the summary that you are computing.
</div>
<details><summary>Solution</summary>
<p>
```{r 3_c, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
stat_summary()
```
</p>
</details>
<div class="pencadre">
Set the `fun.min`, `fun.max` and `fun` to the `min`, `max` and `median` function respectively
</div>
<details><summary>Solution</summary>
<p>
```{r 3_d, include=TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
stat_summary(
fun.min = min,
fun.max = max,
fun = median
)
```
</p>
</details>
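The three values drawn by `stat_summary()` for each `cut` can be computed by hand with `group_by()` and `summarise()` from `dplyr` (both are covered in a later session). This is only a sketch; the name `depth_summary` and its column names are chosen for the example.

```R
library(ggplot2) # provides the diamonds table
library(dplyr)

depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    ymin = min(depth),  # lower end of the vertical line
    y = median(depth),  # position of the point
    ymax = max(depth)   # upper end of the vertical line
  )
depth_summary
```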
## Coloring area plots
<div class="pencadre">
You can color a bar chart using either the `color` aesthetic or, more usefully, `fill`.
Try both solutions on a bar chart of `cut`.
</div>
<details><summary>Solution</summary>
<p>
```{r diamonds_barplot_color, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, color = cut)) +
geom_bar()
```
```{r diamonds_barplot_fill, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
geom_bar()
```
</p>
</details>
<div class="pencadre">
You can also use `fill` with another variable:
Try to color by `clarity`. Is `clarity` a continuous or categorical variable?
</div>
<details><summary>Solution</summary>
<p>
```{r diamonds_barplot_fill_clarity, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar()
```
</p>
</details>
## Position adjustments
When `fill` is mapped to another variable, the bars are stacked by default. This stacking is performed by the position adjustment, set with the `position` argument.
<div class="pencadre">
Try the following `position` parameter for `geom_bar`: `"fill"`, `"dodge"` and `"jitter"`
</div>
<details><summary>Solution</summary>
<p>
```{r diamonds_barplot_pos_fill, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar( position = "fill")
```
```{r diamonds_barplot_pos_dodge, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar( position = "dodge")
```
```{r diamonds_barplot_pos_jitter, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar( position = "jitter")
```
</p>
</details>
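With `position = "fill"`, each bar is rescaled so that the `clarity` segments of a given `cut` add up to 1. The following sketch computes these proportions by hand with `dplyr`; the name `fill_props` is made up for the example.

```R
library(ggplot2) # provides the diamonds table
library(dplyr)

fill_props <- diamonds %>%
  count(cut, clarity) %>%       # size of each segment
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>% # segment size relative to its bar
  ungroup()
fill_props

# Check: the segments of each bar sum to 1
fill_props %>%
  group_by(cut) %>%
  summarise(total = sum(prop))
```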
`jitter` is often used for plotting points when they are stacked on top of each other.
<div class="pencadre">
Compare `geom_point` to `geom_jitter`: plot `cut` versus `depth` and color by `clarity`.
</div>
<details><summary>Solution</summary>
<p>
```{r dia_jitter2, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_point()
```
```{r dia_jitter3, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_jitter()
```
</p>
</details>
<div class="pencadre">
What parameters of `geom_jitter` control the amount of jittering ?
</div>
<details><summary>Solution</summary>
<p>
```{r dia_jitter4, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_jitter(width = .1, height = .1)
```
</p>
</details>
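Jittering is random, so the same code gives a slightly different figure at each run. If you need a reproducible figure, one option is to pass a `seed` to `position_jitter()`. This is a sketch; the seed value and the name `p_jit` are arbitrary choices for the example.

```R
library(ggplot2)

p_jit <- ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_point(
    position = position_jitter(width = .1, height = .1, seed = 42)
  )

# Building the plot twice gives the same jittered coordinates
x_first <- layer_data(p_jit)$x
x_second <- layer_data(p_jit)$x
identical(x_first, x_second)
```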
In the `geom_jitter` plot that we made, we cannot really see the limits of the different clarity groups. Instead we can use the `geom_violin` to see their density.
<details><summary>Solution</summary>
<p>
```{r dia_violon, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_violin()
```
</p>
</details>
## Coordinate systems
The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
```{r dia_boxplot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_boxplot()
```
<div class="pencardre">
Add the `coord_flip()` layer to the previous plot
</div>
<details><summary>Solution</summary>
<p>
```{r dia_boxplot_flip, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
geom_boxplot() +
coord_flip()
```
</p>
</details>
<div class="pencardre">
Add the `coord_polar()` layer to this plot:
```{r diamonds_bar, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE, eval=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
geom_bar( show.legend = FALSE, width = 1 ) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
```
</div>
<details><summary>Solution</summary>
<p>
```{r diamonds_bar2, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
geom_bar( show.legend = FALSE, width = 1 ) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL) +
coord_polar()
```
</p>
</details>
By combining the right **geom**, **coordinates** and **faceting** functions, you can build a large number of different plots to present your results.
## See you in [R.4: data transformation](/session_4/session_4.html)
## To go further: animated plots from xls files
In order to be able to read information from an xlsx file, we will use the `openxlsx` package. To generate animations we will use the `gganimate` package. The additional `gifski` package will allow R to save your animation in the gif format (Graphics Interchange Format).
```{r install_readxl, eval=F}
install.packages(c("openxlsx", "gganimate", "gifski"))
```
```{r load_readxl}
library(openxlsx)
library(gganimate)
library(gifski)
```
<div class="pencardre">
Use the `openxlsx` package to load the [https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx](https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx) file into the `gapminder` variable
</div>
<details><summary>Solution</summary>
<p>
There are 2 solutions:
Use the URL directly
```{r load_xlsx_url, eval = F}
gapminder <- read.xlsx("https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx")
```
Download the file, save it in the same directory as your script, then use the local path
```{r load_xlsx}
gapminder <- read.xlsx("gapminder.xlsx")
```
</p>
</details>
This dataset contains 4 variables of interest for us to display per country:
- `gdpPercap` the GDP per capita (US$, inflation-adjusted)
- `lifeExp` the life expectancy at birth, in years
- `pop` the population size
- `continent` a factor with 5 levels
<div class="pencardre">
Using `ggplot2`, build a scatterplot of the `gdpPercap` vs `lifeExp`. Add the `pop` and `continent` information to this plot.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_a}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point()
```
</p>
</details>
<div class="pencardre">
What's wrong ?
You can use the `scale_x_log10()` to display the `gdpPercap` on the `log10` scale.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_b}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10()
```
</p>
</details>
<div class="pencardre">
We would like to add the `year` information to the plots. We could use a `facet_wrap`, but instead we are going to use the `gganimate` package.
For this we need to add a `transition_time` layer that will take as an argument `year` to our plot.
</div>
<details><summary>Solution</summary>
<p>
```{r gapminder_plot_c}
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10() +
transition_time(year) +
labs(title = 'Year: {as.integer(frame_time)}')
```
</p>
</details>
session_4/img/colorsR.png

370 KiB

session_4/img/transform-logical.png

82.8 KiB

---
title: "R.5: Piping and grouping"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2022"
---
```{r include=FALSE}
library(fontawesome)
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
## Introduction
The goal of this practical is to practice combining data transformations with the `tidyverse`.
The objectives of this session are to:
- Combine multiple operations with the pipe `%>%`
- Work on subgroups of the data with `group_by`
<div class="pencadre">
For this session we are going to work with a new dataset included in the `nycflights13` package.
Install this package and load it.
As usual you will also need the `tidyverse` library.
</div>
<details><summary>Solution</summary>
<p>
```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
library("nycflights13")
```
</p>
</details>
## Combining multiple operations with the pipe
<div id="pencadre">
Find the 10 most delayed flights using the ranking function `min_rank()`.
</div>
<details><summary>Solution</summary>
<p>
```{r pipe_example_a, include=TRUE}
flights_md <- mutate(flights,
most_delay = min_rank(desc(dep_delay)))
flights_md <- filter(flights_md, most_delay <= 10) # ranks 1 to 10: the 10 most delayed flights
flights_md <- arrange(flights_md, most_delay)
```
</p>
</details>
We don't want to create useless intermediate variables, so we can use the pipe operator: `%>%`
(keyboard shortcut in RStudio: `ctrl + shift + M`).
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
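As a minimal sketch of this equivalence, the nested and the piped calls below should give the same result. The vector `x` is made up for illustration; `%>%` comes from the `magrittr` package and is re-exported by `dplyr`, so loading the `tidyverse` is enough in practice.

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9)  # toy vector, for illustration only

# nested call: has to be read from the inside out
nested <- round(sum(sqrt(x)), 1)

# piped call: reads left to right, top to bottom
piped <- x %>%
  sqrt() %>%
  sum() %>%
  round(1)

identical(nested, piped)  # TRUE
```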
<div id="pencadre">
Try to use the pipe operator to rewrite your previous code with only **one** variable assignment.
</div>
<details><summary>Solution</summary>
<p>
```{r pipe_example_b, include=TRUE}
flights_md2 <- flights %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(most_delay <= 10) %>%
arrange(most_delay)
```
</p>
</details>
Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered and uses `+` instead of `%>%`. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet.
The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations.
### When not to use the pipe
You should reach for another tool when:
- Your pipes are longer than (say) ten steps. In that case, create intermediate variables with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or splits the results.
## Grouping variable
The `summarise()` function collapses a data frame to a single row.
Check the difference between `summarise()` and `mutate()` with the following commands:
```{r load_data, eval=FALSE}
flights %>%
mutate(delay = mean(dep_delay, na.rm = TRUE))
flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
Whereas `mutate()` computes the `mean` of the whole `dep_delay` column and repeats it on every row (which is not useful), `summarise()` collapses the data frame to a single row containing this `mean`.
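A small sketch on a made-up three-row table may make the difference easier to see. The `toy` tibble and its values are hypothetical and are not part of `nycflights13`.

```r
library(dplyr)

# toy table, for illustration only
toy <- tibble(flight = c("a", "b", "c"), dep_delay = c(10, NA, 30))

# mutate() keeps the 3 rows and adds the same column mean to each of them
toy_mutate <- toy %>%
  mutate(delay = mean(dep_delay, na.rm = TRUE))

# summarise() collapses the table to a single row
toy_summarise <- toy %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

nrow(toy_mutate)     # 3
nrow(toy_summarise)  # 1
```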
### The power of `summarise()` with `group_by()`
The `group_by()` function changes the unit of analysis from the complete dataset to individual groups.
Individual groups are defined by categorical variables or **factors**.
Then, when you use the functions you already know on a grouped data frame, they are automatically applied *by group*.
You can use the following code to compute the average delay per month across years.
```{r summarise_group_by, include=TRUE, fig.width=8, fig.height=3.5}
flights_delay <- flights %>%
group_by(year, month) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>%
arrange(month)
ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
geom_bar(stat="identity", color="black", fill = "#619CFF") +
geom_errorbar(mapping = aes( ymin=0, ymax=delay+sd)) +
theme(axis.text.x = element_blank())
```
<div class="pencadre">
Why did we `group_by` `year` and `month` and not only `year`?
</div>
### Missing values
<div class="pencadre">
You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
</div>
```{r summarise_group_by_NA, include=TRUE}
flights %>%
group_by(dest) %>%
summarise(
dist = mean(distance),
delay = mean(arr_delay)
)
```
Aggregation functions obey the usual rule of missing values: **if there’s any missing value in the input, the output will be a missing value**.
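The rule can be checked on a toy vector (the `delays` values below are made up for illustration), using only base R:

```r
delays <- c(5, NA, 15)  # toy vector with one missing value

mean(delays)                # NA: the missing value propagates to the result
mean(delays, na.rm = TRUE)  # 10: the NA is dropped before averaging
sum(!is.na(delays))         # 2: number of non-missing values actually used
```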
### Counts
Whenever you do any aggregation, it’s always a good idea to include a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
```{r summarise_group_by_count, include = T, echo=F, warning=F, message=F, fig.width=8, fig.height=3.5}
summ_delay_flights <- flights %>%
group_by(dest) %>%
summarise(
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(avg_delay < 40 & avg_delay > -20)
ggplot(summ_delay_flights, mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
```
<div class="pencadre">
Imagine that we want to explore the relationship between the average distance (`distance`) and average delay (`arr_delay`) for each location (`dest`) and recreate the above figure.
Here are the steps to prepare this data and draw the figure:
1. Group flights by destination.
2. Summarize to compute average distance (`avg_distance`), average delay (`avg_delay`), and number of flights using `n()` (`n_flights`).
3. Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport.
4. Filter to remove noisy points with an average delay greater than 40 or lower than -20.
5. Create a `mapping` on `avg_distance`, `avg_delay` and `n_flights` as `size`.
6. Use the layer `geom_point()` and `geom_smooth()` (use method = lm)
7. We can hide the legend with the layer `theme(legend.position='none')`
</div>
<details><summary>Solution</summary>
<p>
```{r summarise_group_by_count_b, include = T, eval=F, warning=F, message=F, fig.width=8, fig.height=3.5}
flights %>%
group_by(dest) %>%
summarise(
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(avg_delay < 40 & avg_delay > -20) %>%
ggplot(mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
```
</p>
</details>
### Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
<div class="pencadre">
Try the following example:
</div>
```{r ungroup, eval=T, message=FALSE, cache=T}
flights %>%
group_by(year, month, day) %>%
ungroup() %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
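To see what `ungroup()` changes, here is a minimal sketch on a hypothetical toy table (the `toy` tibble and its columns `g` and `x` are made up for illustration). It uses `group_vars()` from `dplyr` to list the current grouping variables.

```r
library(dplyr)

# toy table, for illustration only
toy <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3))

grouped <- toy %>% group_by(g)

group_vars(grouped)           # "g"
group_vars(ungroup(grouped))  # character(0): no grouping left

# summarise() returns one row per group on grouped data...
by_group <- grouped %>% summarise(m = mean(x))

# ...and a single row once the grouping is removed
overall <- grouped %>% ungroup() %>% summarise(m = mean(x))
```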
## Grouping challenges
### First challenge
<div class="pencadre">
Look at the number of canceled flights per day. Is there a pattern?
(A canceled flight is a flight where the `dep_time` or the `arr_time` is `NA`)
**Remember to always try to decompose complex questions into smaller and simple problems**
- How can you create a `canceled` flights variable which will be TRUE if the flight is canceled or FALSE if not?
- We need to define a day-of-the-week variable `wday` (Monday, Tuesday, ...). To do that, you can use `strftime(x,'%A')` to get the name of the day of a date `x` in the POSIXct format, as in the `time_hour` column. For example, `strftime("2013-01-01 05:00:00 EST",'%A')` returns "Tuesday" (in an English locale).
- We can count the number of canceled flights (`cancel_day`) by day of the week (`wday`).
- We can pipe the transformed and filtered tibble into a `ggplot` function.
- We can use `geom_col` to have a barplot of the number of `cancel_day` for each `wday`.
- You can use the function `fct_reorder()` to reorder the `wday` by number of `cancel_day` and make the plot easier to read.
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_a, eval=T, message=FALSE, cache=T}
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
filter(canceled) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
summarise(
cancel_day = n()
) %>%
ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
geom_col()
```
</p>
</details>
### Second challenge
<div class="pencadre">
Is the proportion of canceled flights by day of the week related to the average departure delay?
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_b1, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
summarise(
prop_cancel_day = sum(canceled)/n(),
av_delay = mean(dep_delay, na.rm = TRUE)
) %>%
ungroup() %>%
ggplot(mapping = aes(x = av_delay, y = prop_cancel_day, color = wday)) +
geom_point()
```
On which day would you prefer to book a flight?
</p>
</details>
<div class="pencadre">
We can add error bars to this plot to justify our decision.
Brainstorm a way to have access to the mean and standard deviation of the `prop_cancel_day` and `av_delay`.
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_b2, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
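group_by(year, month, day) %>% # one group per calendar date, not per day of the month
mutate(
prop_cancel_day = sum(canceled)/n(), # proportion of canceled flights, as in the previous solution
av_delay = mean(dep_delay, na.rm = TRUE)
) %>%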
group_by(wday) %>%
summarize(
mean_cancel_day = mean(prop_cancel_day, na.rm = TRUE),
sd_cancel_day = sd(prop_cancel_day, na.rm = TRUE),
mean_av_delay = mean(av_delay, na.rm = TRUE),
sd_av_delay = sd(av_delay, na.rm = TRUE)
) %>%
ggplot(mapping = aes(x = mean_av_delay, y = mean_cancel_day, color = wday)) +
geom_point() +
geom_errorbarh(mapping = aes(
xmin = -sd_av_delay + mean_av_delay,
xmax = sd_av_delay + mean_av_delay
)) +
geom_errorbar(mapping = aes(
ymin = -sd_cancel_day + mean_cancel_day,
ymax = sd_cancel_day + mean_cancel_day
))
```
</p>
</details>
<div class="pencadre">
Now that you have seen the value of `geom_errorbar`, at what `hour` of the day should you fly if you want to avoid delays as much as possible?
</div>
<details><summary>Solution</summary>
<p>
```{r group_filter_b3, eval=T, warning=F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
group_by(hour) %>%
summarise(
mean_delay = mean(arr_delay, na.rm = T),
sd_delay = sd(arr_delay, na.rm = T),
) %>%
ggplot() +
geom_errorbar(mapping = aes(
x = hour,
ymax = mean_delay + sd_delay,
ymin = mean_delay - sd_delay)) +
geom_point(mapping = aes(
x = hour,
y = mean_delay,
))
```
</p>
</details>
### Third challenge
<div class="pencadre">
Which carrier has the worst delays?
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_c2, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
group_by(carrier) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T)
) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_col(alpha = 0.5)
```
</p>
</details>
<div class="pencadre">
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n=n())`)
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_c1, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
group_by(carrier, dest) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T),
number_of_flight = n()
) %>%
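ungroup() %>% # the result of summarise() is still grouped by carrier, which mutate() cannot modify
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%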
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_boxplot() +
geom_jitter(height = 0)
```
</p>
</details>
### See you in [R.6: tidydata](/session_6/session_6.html)
session_6/img/join-venn.png

50 KiB