
Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

......@@ -2,33 +2,21 @@
title: "R.2: introduction to Tidyverse"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nHélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2022"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "light"
lightbox: true
css: "../www/style_Rmd.css"
---
```{r include=FALSE}
library(fontawesome)
```
 `r fa(name = "fas fa-house", fill = "grey", height = "1em")`  https://can.gitbiopages.ens-lyon.fr/R_basis/
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
```{r download_data, include=FALSE, eval=T}
library("tidyverse")
tmp <- tempfile(fileext = ".zip")
......@@ -120,7 +108,7 @@ read_csv("data-raw/vehicles.csv") %>%
write_csv("mpg.csv")
```
# Introduction
## Introduction
In the last session, we went through the basics of R.
Instead of continuing to learn more about R programming, in this session we are going to jump directly to rendering plots.
......@@ -139,7 +127,7 @@ The objectives of this session will be to:
- Learn the different aesthetics in R plots
- Compose complex graphics
## Tidyverse
### Tidyverse
The `tidyverse` package is a collection of R packages designed for data science that include `ggplot2`.
......@@ -160,7 +148,7 @@ Luckily for you, `tidyverse` is preinstalled on your Rstudio server. So you just
library("tidyverse")
```
## Toy data set `mpg`
### Toy data set `mpg`
This dataset contains a subset of the fuel economy data that the EPA makes available on [fueleconomy.gov](http://fueleconomy.gov).
It contains only models which had a new release every year between 1999 and 2008.
......@@ -209,13 +197,13 @@ new_mpg
Here we can see that `new_mpg` is a `tibble`; we will come back to tibbles later.
## New script
### New script
Like in the last session, instead of typing your commands directly in the console, you are going to write them in an R script.
![](./img/formationR_session2_scriptR.png)
# First plot with `ggplot2`
## First plot with `ggplot2`
We are going to make the simplest plot possible to study the relationship between two variables: the scatterplot.
......@@ -247,7 +235,15 @@ ggplot(data = <DATA>) +
What happens when you use only the command `ggplot(data = mpg)`?
</div>
<details><summary>Solution</summary>
<p>
```{r only_ggplot, cache = TRUE, fig.width=4.5, fig.height=2}
ggplot(data = new_mpg)
```
</p>
</details>
<div class="pencadre">
Make a scatterplot of `hwy` ( fuel efficiency ) vs. `cyl` ( number of cylinders ).
</div>
......@@ -261,12 +257,22 @@ ggplot(data = new_mpg, mapping = aes(x = hwy, y = cyl)) +
```
</p>
<div class="pencadre">
What seems to be the problem ?
</div>
<details><summary>Solution</summary>
<p>
Dots with the same coordinates are superposed.
</p>
</details>
</details>
# Aesthetic mappings
## Aesthetic mappings
`ggplot2` will automatically assign a unique level of the aesthetic (here a unique color) to each unique value of the variable, a process known as scaling. `ggplot2` will also add a legend that explains which levels correspond to which values.
......@@ -276,7 +282,7 @@ Try the following aesthetic:
- `alpha`
- `shape`
## `color` mapping
### `color` mapping
```{r new_mpg_plot_e, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
......@@ -284,21 +290,21 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) +
```
## `size` mapping
### `size` mapping
```{r new_mpg_plot_f, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, size = class)) +
geom_point()
```
## `alpha` mapping
### `alpha` mapping
```{r new_mpg_plot_g, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, alpha = class)) +
geom_point()
```
## `shape` mapping
### `shape` mapping
```{r new_mpg_plot_h, cache = TRUE, fig.width=8, fig.height=4.5, warning=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, shape = class)) +
......@@ -335,7 +341,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
</p>
</details>
## Mapping a **continuous** variable to a color.
### Mapping a **continuous** variable to a color.
You can also map a continuous variable to a color.
......@@ -357,7 +363,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = displ < 5)) +
</p>
</details>
# Facets
## Facets
You can create multiple plots at once by faceting. For this you can use the command `facet_wrap`.
This command takes a formula as input.
......@@ -372,7 +378,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
```
<div class="pencadre">
Now try to facet your plot by `fl + class`
Now try to facet your plot by `fuel + class`
</div>
......@@ -381,14 +387,14 @@ Now try to facet your plot by `fl + class`
Formulas allow you to express complex relationships between variables in R!
```{r new_mpg_plot_l, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fl + class, nrow = 2)
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(~ fuel + class, nrow = 2)
```
</p>
</details>
# Composition
## Composition
There are different ways to represent the information :
......@@ -426,7 +432,7 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
\
We can use different `data` for different layers (you will learn more about `filter()` later)
We can use different `data` (here the `new_mpg` and `mpg` tables) for different layers (you will learn more about `filter()` later)
```{r new_mpg_plot_t, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
......@@ -434,14 +440,14 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = filter(mpg, class == "subcompact"))
```
# Challenge !
## Challenge !
## First challenge
### First challenge
<div class="pencadre">
Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
</div>
```R
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
......@@ -451,71 +457,117 @@ ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
- What does the `se` argument to `geom_smooth()` do?
</div>
## Second challenge
<details><summary>Solution</summary>
<p>
```{r soluce_challenge_1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) +
geom_point(show.legend = FALSE) +
geom_smooth(se = FALSE)
```
</p>
</details>
### Second challenge
<div class="pencadre">
How does being a `2seater` car impact the engine size versus fuel efficiency relationship?
How does being a `Two Seaters` car (*class column*) impact the relationship between engine size (*displ column*) and fuel efficiency (*hwy column*)?
1. Make a plot of `hwy` as a function of `displ`.
2. *Colorize* the points of the `Two Seaters` class in another color.
3. *Split* this plot by *class*.
Make a plot *colorizing* this information.
</div>
<details><summary>Solution</summary>
<details><summary>Solution 1</summary>
<p>
```{r new_mpg_plot_color_2seater1, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point()
```
</p>
</details>
<details><summary>Solution 2</summary>
<p>
```{r new_mpg_plot_color_2seater2, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red")
```
</p>
</details>
<details><summary>Solution 3</summary>
<p>
```{r new_mpg_plot_color_2seater, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
```{r new_mpg_plot_color_2seater_facet, cache = TRUE, fig.width=8, fig.height=4.5}
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red") +
facet_wrap(~class)
```
</p>
</details>
<div class="pencadre">
Write a `function` called `plot_color_2seater` that takes the variable `mpg` as its sole argument and plots the same graph.
Write a `function` called `plot_color_a_class` that takes a class as its argument and plots the same graph for that class.
</div>
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_color_2seater_fx, cache = TRUE, fig.width=8, fig.height=4.5}
plot_color_2seater <- function(mpg) {
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
plot_color_a_class <- function(my_class) {
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_point(data = filter(mpg, class == "2seater"), color = "red")
geom_point(data = filter(new_mpg, class == my_class), color = "red") +
facet_wrap(~class)
}
plot_color_2seater(mpg)
plot_color_a_class("Two Seaters")
plot_color_a_class("Compact Cars")
```
</p>
</details>
## Third challenge
### Third challenge
<div class="pencadre">
Recreate the R code necessary to generate the following graph
Recreate the R code necessary to generate the following graph (see "linetype" option of "geom_smooth")
</div>
```{r new_mpg_plot_u, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
<details><summary>Solution</summary>
<p>
```{r new_mpg_plot_v, eval=F}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
geom_point() +
geom_smooth(mapping = aes(linetype = drv))
geom_smooth(linetype = "dashed", color = "black") +
facet_wrap(~fuel)
```
</p>
</details>
## See you in [R.3: Transformations with ggplot2](https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/)
### See you in [R.3: Transformations with ggplot2](/session_3/session_3.html)
# To go further: publication ready plots
## To go further: publication ready plots
Once you have created the graph you need for your publication, you have to save it.
You can do it with the the `ggsave` function.
You can do it with the `ggsave` function.
First save your plot in a variable :
......
......@@ -2,35 +2,19 @@
title: 'R.3: Transformations with ggplot2'
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2022"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "light"
lightbox: true
css: "../www/style_Rmd.css"
---
```{r include=FALSE}
library(fontawesome)
```
&ensp;`r fa(name = "fas fa-house", fill = "grey", height = "1em")` &ensp;https://can.gitbiopages.ens-lyon.fr/R_basis/
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
In the last session, we have seen how to use `ggplot2` and [The Grammar of Graphics](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl). The goal of this practical is to practice more advanced features of `ggplot2`.
......@@ -53,7 +37,7 @@ library("tidyverse")
Like in the previous sessions, it's good practice to create a new **.R** file to write your code instead of using the R terminal directly.
# `ggplot2` statistical transformations
## `ggplot2` statistical transformations
In the previous session, we have plotted the data as they are by using the variable values as **x** or **y** coordinates, color shade, size or transparency.
When dealing with categorical variables, also called **factors**, it can be interesting to perform some simple statistical transformations.
......@@ -73,7 +57,7 @@ We are going to use the `diamonds` data set included in `tidyverse`.
str(diamonds)
```
## Introduction to `geom_bar`
### Introduction to `geom_bar`
We have seen scatterplots (`geom_point()`) and smoothed line plots (`geom_smooth()`).
Now let's look at barplots with `geom_bar()`:
......@@ -88,7 +72,7 @@ More diamonds are available with high quality cuts.
On the x-axis, the chart displays **cut**, a variable from diamonds. On the y-axis, it displays **count**, **but count is not a variable in diamonds!**
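As a rough sketch of what happens behind the scenes, the counts that `geom_bar()` displays can be computed by hand with `count()` and drawn with `geom_col()`. The name `cut_counts` is only illustrative and is not part of the course material.

```r
library(tidyverse)

# Compute by hand the table that geom_bar() builds through its default stat:
# one row per level of `cut`, with the number of diamonds in column `n`.
cut_counts <- diamonds %>%
  count(cut)

# Plotting the precomputed counts with geom_col() should give the same bars
# as ggplot(diamonds, aes(x = cut)) + geom_bar().
ggplot(data = cut_counts, mapping = aes(x = cut, y = n)) +
  geom_col()
```

Looking at `cut_counts` makes it explicit that `count` (here called `n`) is a derived value and not a column of `diamonds`.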
## **geom** and **stat**
### **geom** and **stat**
The algorithm used to calculate new values for a graph is called a **stat**, short for statistical transformation.
The figure below describes how this process works with `geom_bar()`.
......@@ -104,7 +88,7 @@ ggplot(data = diamonds, mapping = aes(x = cut)) +
Every **geom** has a default **stat**; and every **stat** has a default **geom**. This means that you can typically use **geoms** without worrying about the underlying statistical transformation. There are three reasons you might need to use a **stat** explicitly:
## Why **stat** ?
### Why **stat** ?
You might want to override the default stat.
For example, in the following `demo` dataset we already have a variable for the **counts** per `cut`.
......@@ -159,7 +143,7 @@ If group is not used, the proportion is calculated with respect to the data that
</p>
</details>
## More details with `stat_summary`
### More details with `stat_summary`
<div class="pencadre">
You might want to draw greater attention to the statistical transformation in your code.
......@@ -194,7 +178,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
</p>
</details>
# Coloring area plots
## Coloring area plots
<div class="pencadre">
You can color a bar chart using either the `color` aesthetic, or, more usefully `fill`:
......@@ -229,7 +213,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
</p>
</details>
# Position adjustments
## Position adjustments
The stacking of the `fill` parameter is performed by the position adjustment `position`
......@@ -301,7 +285,7 @@ ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
</p>
</details>
# Coordinate systems
## Coordinate systems
The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.
......@@ -349,9 +333,9 @@ ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
By combining the right **geom**, **coordinates** and **faceting** functions, you can build a large number of different plots to present your results.
# See you in [R.4: data transformation](https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/)
## See you in [R.4: data transformation](/session_4/session_4.html)
# To go further: animated plots from xls files
## To go further: animated plots from xls files
In order to read information from an xls file, we will use the `openxlsx` package. To generate animations we will use the `gganimate` package. The additional `gifski` package will allow R to save your animation in the GIF format (Graphics Interchange Format).
......@@ -432,7 +416,8 @@ For this we need to add a `transition_time` layer that will take as an argument
ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
geom_point() +
scale_x_log10() +
transition_time(year)
transition_time(year) +
labs(title = 'Year: {as.integer(frame_time)}')
```
</p>
</details>
\ No newline at end of file
session_4/img/transform-logical.png (binary image changed, from 70.2 KiB to 82.8 KiB)
---
title: "R#5: Piping and grouping"
title: "R.5: Piping and grouping"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2022"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "light"
lightbox: true
css: "../www/style_Rmd.css"
---
```{r include=FALSE}
library(fontawesome)
```
&ensp;`r fa(name = "fas fa-house", fill = "grey", height = "1em")` &ensp;https://can.gitbiopages.ens-lyon.fr/R_basis/
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
The goal of this practical is to practice combining data transformation with `tidyverse`.
The objectives of this session will be to:
......@@ -53,7 +40,7 @@ library("nycflights13")
</p>
</details>
# Combining multiple operations with the pipe
## Combining multiple operations with the pipe
<div id="pencadre">
Find the 10 most delayed flights using a ranking function such as `min_rank()`.
......@@ -95,12 +82,12 @@ Working with the pipe is one of the key criteria for belonging to the `tidyverse
The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:
## When not to use the pipe
### When not to use the pipe
- Your pipes are longer than (say) ten steps. In that case, create intermediate objects with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or splits the results.
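The first point can be sketched as follows: a pipeline is cut into two named steps so that each intermediate result can be inspected. This is only an illustration; the names `delays_by_dest` and `worst_dest` and the threshold of 100 flights are assumptions made for the example.

```r
library(tidyverse)
library(nycflights13)

# Step 1: one row per destination, stored under a meaningful name
delays_by_dest <- flights %>%
  group_by(dest) %>%
  summarise(
    n_flights = n(),
    avg_delay = mean(arr_delay, na.rm = TRUE)
  )

# Step 2: reuse the intermediate table; it can be printed or checked
# on its own before going further
worst_dest <- delays_by_dest %>%
  filter(n_flights > 100) %>%
  arrange(desc(avg_delay)) %>%
  head(10)

worst_dest
```

If the final table looks wrong, `delays_by_dest` can be examined directly, which is harder to do in the middle of one long pipe.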
# Grouping variable
## Grouping variable
The `summarise()` function collapses a data frame to a single row.
Check the difference between `summarise()` and `mutate()` with the following commands:
......@@ -114,7 +101,7 @@ flights %>%
Where `mutate()` computes the `mean` of the whole `dep_delay` column and repeats it on every row (which is not useful here), `summarise()` collapses the data frame to a single row containing that `mean`.
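A minimal sketch on a toy table makes the difference in output shape visible. The `delays` tibble and its values are made up for illustration.

```r
library(dplyr)

# Toy data: three flights with their departure delays
delays <- tibble(
  flight = c("A", "B", "C"),
  dep_delay = c(10, 20, 30)
)

# mutate() keeps the 3 rows and adds a column holding the same mean (20) on each row
with_mean <- delays %>%
  mutate(mean_delay = mean(dep_delay))

# summarise() collapses the table to 1 row containing the mean (20)
only_mean <- delays %>%
  summarise(mean_delay = mean(dep_delay))

with_mean
only_mean
```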
## The power of `summarise()` with `group_by()`
### The power of `summarise()` with `group_by()`
The `group_by()` function changes the unit of analysis from the complete dataset to individual groups.
Individual groups are defined by categorical variables or **factors**.
......@@ -138,7 +125,7 @@ ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
Why did we `group_by` `year` and `month` and not only `year` ?
</div>
## Missing values
### Missing values
<div class="pencadre">
You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
......@@ -155,7 +142,7 @@ flights %>%
Aggregation functions obey the usual rule of missing values: **if there’s any missing value in the input, the output will be a missing value**.
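This rule can be checked on a small vector in base R. The vector `delays` below is made up for illustration.

```r
# A vector with one missing value
delays <- c(5, NA, 15)

# Any NA in the input makes the aggregate NA
mean(delays)                # NA

# na.rm = TRUE drops the missing values before aggregating
mean(delays, na.rm = TRUE)  # 10

# Counting the missing values tells you how much data was dropped
sum(is.na(delays))          # 1
```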
## Counts
### Counts
Whenever you do any aggregation, it’s always a good idea to include a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
......@@ -163,29 +150,29 @@ Whenever you do any aggregation, it’s always a good idea to include either a c
summ_delay_filghts <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(delay < 40 & delay > -20)
filter(avg_delay < 40 & avg_delay > -20)
ggplot(data = summ_delay_filghts, mapping = aes(x = dist, y = delay, size = count)) +
ggplot(summ_delay_filghts, mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
```
<div class="pencadre">
Imagine that we want to explore the relationship between the distance and average delay for each location and recreate the above figure.
Imagine that we want to explore the relationship between the average distance (`distance`) and average delay (`arr_delay`) for each location (`dest`) and recreate the above figure.
Here are the steps to prepare this data and draw the plot:
1. Group flights by destination.
2. Summarize to compute distance, average delay, and number of flights using `n()`.
3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
2. Summarize to compute average distance (`avg_distance`), average delay (`avg_delay`), and number of flights using `n()` (`n_flights`).
3. Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport.
4. Filter to remove noisy points with an average delay greater than 40 or less than -20
5. Create a `mapping` on `dist`, `delay` and `count` as `size`.
6. Use the layer `geom_point()` and `geom_smooth()`
5. Create a `mapping` on `avg_distance`, `avg_delay` and `n_flights` as `size`.
6. Use the layer `geom_point()` and `geom_smooth()` (use method = lm)
7. We can hide the legend with the layer `theme(legend.position='none')`
</div>
......@@ -195,13 +182,13 @@ here are three steps to prepare this data:
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(delay < 40 & delay > -20) %>%
ggplot(mapping = aes(x = dist, y = delay, size = count)) +
filter(avg_delay < 40 & avg_delay > -20) %>%
ggplot(mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position='none')
......@@ -210,7 +197,7 @@ flights %>%
</details>
## Ungrouping
### Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
......@@ -225,19 +212,21 @@ flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
# Grouping challenges
## Grouping challenges
## First challenge
### First challenge
<div class="pencadre">
Look at the number of canceled flights per day. Is there a pattern?
(A canceled flight is a flight where the `dep_time` or the `arr_time` is `NA`)
**Remember to always try to decompose complex questions into smaller and simple problems**
- What are `canceled` flights?
- Who can I create a `canceled` flights variable?
- We need to define the day of the week `wday` variable (`strftime(x,'%A')` give you the name of the day from a POSIXct date).
- How can you create a `canceled` flights variable which will be TRUE if the flight is canceled or FALSE if not?
- We need to define the day of the week `wday` variable (Monday, Tuesday, ...). To do that, you can use `strftime(x,'%A')` to get the name of the day of a date `x` in the POSIXct format, as in the `time_hour` column. For example, `strftime("2013-01-01 05:00:00 EST",'%A')` returns "Tuesday".
- We can count the number of canceled flight (`cancel_day`) by day of the week (`wday`).
- We can pipe transformed and filtered tibble into a `ggplot` function.
- We can use `geom_col` to have a barplot of the number of `cancel_day` for each `wday`.
......@@ -264,7 +253,7 @@ flights %>%
</p>
</details>
## Second challenge
### Second challenge
<div class="pencadre">
Is the proportion of canceled flights by day of the week related to the average departure delay?
......@@ -279,7 +268,7 @@ flights %>%
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
mutate(
summarise(
prop_cancel_day = sum(canceled)/n(),
av_delay = mean(dep_delay, na.rm = TRUE)
) %>%
......@@ -357,7 +346,7 @@ flights %>%
</p>
</details>
## Third challenge
### Third challenge
<div class="pencadre">
Which carrier has the worst delays?
......@@ -379,7 +368,7 @@ flights %>%
</details>
<div class="pencadre">
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n())`)
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n=n())`)
</div>
<details><summary>Solution</summary>
......@@ -399,4 +388,4 @@ flights %>%
</p>
</details>
## See you in [R.6: tidydata](https://can.gitbiopages.ens-lyon.fr/R_basis/session_6/)
### See you in [R.6: tidydata](/session_6/session_6.html)
......@@ -2,42 +2,25 @@
title: "R.6: tidydata"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr);\nCarine Rey [carine.rey@ens-lyon.fr](mailto:carine.rey@ens-lyon.fr)"
date: "2022"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "light"
lightbox: true
css: "../www/style_Rmd.css"
---
```{r include=FALSE}
library(fontawesome)
```
&ensp;`r fa(name = "fas fa-house", fill = "grey", height = "1em")` &ensp;https://can.gitbiopages.ens-lyon.fr/R_basis/
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
Until now we have worked with data already formatted in a *nice way*.
In the `tidyverse`, data formatted in a *nice way* are called **tidy**.
The goal of this practical is to understand how to transform an ugly blob of information into a **tidy** data set.
## Tidydata
### Tidydata
There are three interrelated rules which make a dataset tidy:
......@@ -80,9 +63,9 @@ The data is a subset of the data contained in the World Health Organization Glob
</p>
</details>
# Pivoting data
## Pivoting data
## pivot longer
### pivot longer
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_longer.png')
......@@ -117,7 +100,7 @@ long_example <- wide_example %>%
pivot_longer(-X1, names_to = "V1", values_to = "V2")
```
### Exercise
#### Exercise
<div class="pencadre">
Visualize the `table4a` dataset (you can use the `View()` function).
......@@ -154,7 +137,7 @@ table4a %>%
</p>
</details>
## pivot wider
### pivot wider
```{r, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/pivot_wider.png')
......@@ -170,7 +153,7 @@ long_example %>% pivot_wider(names_from = V1,
```
### Exercise
#### Exercise
<div class="pencadre">
Visualize the `table2` dataset
......@@ -191,9 +174,9 @@ table2 %>%
</p>
</details>
# Merging data
## Merging data
## Relational data
### Relational data
To avoid having a huge table and to save space, information is often split across different tables.
......@@ -209,7 +192,7 @@ flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
```
## Relational schema
### Relational schema
The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
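One way to check that a variable really is a key is to count its values and look for duplicates. The sketch below uses a made-up table `planes_toy`; the same pattern should apply to the real tables of this session.

```r
library(dplyr)

# Toy table: each plane is identified by its tail number
planes_toy <- tibble(
  tailnum = c("N101", "N102", "N103"),
  seats = c(150, 180, 150)
)

# A key identifies each observation uniquely, so no value may appear twice
dup_tailnum <- planes_toy %>%
  count(tailnum) %>%
  filter(n > 1)
dup_tailnum  # zero rows: `tailnum` can serve as a key

# `seats` is not a key: the value 150 appears twice
dup_seats <- planes_toy %>%
  count(seats) %>%
  filter(n > 1)
dup_seats    # one row
```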
......@@ -217,7 +200,7 @@ The relationships between tables can be seen in a relational graph. The variable
knitr::include_graphics('img/relational-nycflights.png')
```
## Joins
### Joins
If you have to combine data from two tables into a new table, you will use joins.
......@@ -233,7 +216,7 @@ Small concrete examples:
knitr::include_graphics('img/overview_joins.png')
```
### `inner_join()`
#### `inner_join()`
keeps observations in `x` AND `y`
......@@ -242,7 +225,7 @@ flights2 %>%
inner_join(airlines)
```
### `left_join()`
#### `left_join()`
keeps all observations in `x`
......@@ -251,7 +234,7 @@ flights2 %>%
left_join(airlines)
```
### `right_join()`
#### `right_join()`
keeps all observations in `y`
......@@ -260,7 +243,7 @@ flights2 %>%
right_join(airlines)
```
### `full_join()`
#### `full_join()`
keeps all observations in `x` and `y`
......@@ -269,7 +252,7 @@ flights2 %>%
full_join(airlines)
```
## Defining the key columns
### Defining the key columns
The default, `by = NULL`, uses all variables that appear in both tables, the so-called natural join.
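As a small sketch, the natural join and a join with an explicit key give the same result when the only shared column is the key. The tables `flights_toy` and `airlines_toy` are made up for illustration.

```r
library(dplyr)

flights_toy <- tibble(
  carrier = c("AA", "UA", "ZZ"),
  dest = c("LAX", "ORD", "JFK")
)
airlines_toy <- tibble(
  carrier = c("AA", "UA"),
  name = c("American Airlines", "United Air Lines")
)

# Natural join: dplyr finds the shared column (`carrier`) and reports it in a message
natural <- left_join(flights_toy, airlines_toy)

# Explicit key: same result, but the intent is visible in the code
explicit <- left_join(flights_toy, airlines_toy, by = "carrier")

explicit  # the "ZZ" row is kept, with NA in `name`
```

Spelling out `by` is a defensive choice: the join no longer changes silently if another column with a shared name is added to one of the tables.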
......@@ -308,7 +291,7 @@ flights2 %>%
left_join(airports, by = c("origin" = "faa"), suffix = c(".dest",".origin"))
```
## Filtering joins
### Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
......@@ -323,7 +306,7 @@ flights %>%
semi_join(top_dest)
```
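The difference between the two filtering joins can be sketched on toy tables (`flights_toy` and `airlines_toy` are made up for illustration): `semi_join()` keeps the rows of `x` that have a match in `y`, `anti_join()` keeps those that do not.

```r
library(dplyr)

flights_toy <- tibble(
  carrier = c("AA", "UA", "ZZ"),
  dest = c("LAX", "ORD", "JFK")
)
airlines_toy <- tibble(
  carrier = c("AA", "UA"),
  name = c("American Airlines", "United Air Lines")
)

# Rows of flights_toy with a known carrier; no column from airlines_toy is added
matched <- semi_join(flights_toy, airlines_toy, by = "carrier")

# Rows of flights_toy whose carrier is missing from airlines_toy
unmatched <- anti_join(flights_toy, airlines_toy, by = "carrier")

matched    # carriers AA and UA
unmatched  # carrier ZZ
```

`anti_join()` is a convenient way to diagnose mismatched keys before running a mutating join.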
## Set operations
### Set operations
These expect the x and y inputs to have the same variables, and treat the observations like sets:
......@@ -335,4 +318,4 @@ These expect the x and y inputs to have the same variables, and treat the observ
knitr::include_graphics('img/overview_set.png')
```
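A minimal sketch of the set operations on two toy tables with the same columns (`df1` and `df2` are made up for illustration):

```r
library(dplyr)

df1 <- tibble(x = c(1, 2), y = c(1, 1))
df2 <- tibble(x = c(1, 1), y = c(1, 2))

# Rows present in both tables
in_both <- intersect(df1, df2)   # the row (1, 1)

# Unique rows present in either table
in_any <- union(df1, df2)        # 3 rows

# Rows of the first table that are absent from the second
only_df1 <- setdiff(df1, df2)    # the row (2, 1)
only_df2 <- setdiff(df2, df1)    # the row (1, 2)
```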
## See you in [R.7: String & RegExp](https://can.gitbiopages.ens-lyon.fr/R_basis/session_7/)
### See you in [R.7: String & RegExp](/session_7/session_7.html)
......@@ -2,36 +2,22 @@
title: "R.7: String & RegExp"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2022"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "light"
lightbox: true
css: "../www/style_Rmd.css"
---
```{r include=FALSE}
library(fontawesome)
```
&ensp;`r fa(name = "fas fa-house", fill = "grey", height = "1em")` &ensp;https://can.gitbiopages.ens-lyon.fr/R_basis/
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
In the previous session, we have often overlooked a particular type of data, the **string**.
In R a sequence of characters is stored as a string.
......@@ -50,9 +36,9 @@ library(tidyverse)
</p>
</details>
# String basics
## String basics
## String definition
### String definition
A string can be defined within double `"` or single `'` quotes.
......@@ -82,7 +68,7 @@ single_quote <- '\'' # or "'"
If you want to include a literal backslash, you’ll need to double it up: `"\\"`.
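A short base R sketch of the doubled backslash (the variable name `x` is only illustrative): the string is typed with two backslashes but contains a single one.

```r
# Typed with a doubled backslash
x <- "a\\b"

# print() shows the escaped representation: "a\\b"
print(x)

# writeLines() shows the raw content of the string: a\b
writeLines(x)

# The string holds 3 characters: a, \ and b
nchar(x)
```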
## String representation
### String representation
The printed representation of a string is not the same as the string itself.
......@@ -97,7 +83,7 @@ writeLines(x)
Some characters have a special representation, they are called **special characters**.
The most common are `"\n"`, newline, and `"\t"`, tabulation, but you can see the complete list by requesting help on `"`: `?'"'`
## String operation
### String operation
You can perform basic operations on strings like
......@@ -134,7 +120,7 @@ str_to_lower(x)
str_sort(x)
```
# Matching patterns with regular expressions
## Matching patterns with regular expressions
Regexps are a very terse language that allows you to describe patterns in strings.
......@@ -196,13 +182,13 @@ writeLines(x)
str_view(x, "\\\\")
```
## Exercises
### Exercises
- Explain why each of these strings doesn’t match a \: "`\`", "`\\`", "`\\\`".
- How would you match the sequence `"'\`?
- What patterns will the regular expression `\..\..\..` match? How would you represent it as a string?
## Anchors
### Anchors
Until now we searched for patterns anywhere in the target string. But we can use anchors to be more precise.
......@@ -223,7 +209,7 @@ x <- c("apple pie", "apple", "apple cake")
str_view(x, "^apple$")
```
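
To see what each anchor does on its own, here is a minimal sketch that uses `str_detect()` instead of `str_view()`, so the result is a plain logical vector. It assumes the `stringr` package is installed and reuses the three "apple" strings from the example above.

```r
library(stringr)

x <- c("apple pie", "apple", "apple cake")

# "^" anchors the pattern at the start of the string
starts_with_apple <- str_detect(x, "^apple")

# "$" anchors the pattern at the end of the string
ends_with_pie <- str_detect(x, "pie$")

# Both anchors together force the pattern to match the whole string
only_apple <- str_detect(x, "^apple$")

starts_with_apple  # TRUE TRUE TRUE
ends_with_pie      # TRUE FALSE FALSE
only_apple         # FALSE TRUE FALSE
```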
## Exercises
### Exercises
- How would you match the literal string `"$^$"`?
- Given the corpus of common words in stringr::words, create regular expressions that find all words that:
......@@ -234,7 +220,7 @@ str_view(x, "^apple$")
Since this list is long, you might want to use the match argument to `str_view()` to show only the matching or non-matching words.
## Character classes and alternatives
### Character classes and alternatives
In regular expressions, we have special characters and patterns that match groups of characters.
......@@ -255,7 +241,7 @@ You can use alternations to pick between one or more alternative patterns. For e
str_view(c("grey", "gray"), "gr(e|a)y")
```
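
The sketch below compares a character class, an alternation, a digit class and a negated class on the same input. It assumes `stringr` is installed; the vector `x` is made up for the illustration.

```r
library(stringr)

x <- c("grey", "gray", "groy", "gr8y")

# [ea] matches exactly one character among "e" and "a"
class_match <- str_detect(x, "gr[ea]y")

# (e|a) expresses the same choice with an alternation
alt_match <- str_detect(x, "gr(e|a)y")

# \d matches any digit (the backslash is doubled inside an R string)
digit_match <- str_detect(x, "gr\\dy")

# [^ea] matches any single character except "e" and "a"
negated_match <- str_detect(x, "gr[^ea]y")

class_match    # TRUE TRUE FALSE FALSE
digit_match    # FALSE FALSE FALSE TRUE
negated_match  # FALSE FALSE TRUE TRUE
```

For single characters, a character class and an alternation are interchangeable; alternation becomes necessary when the alternatives are longer than one character.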
## Exercises
### Exercises
Create regular expressions to find all words that:
......@@ -264,7 +250,7 @@ Create regular expressions to find all words that:
- End with ed, but not with eed.
- End with ing or ise.
## Repetition
### Repetition
Now that you know how to search for groups of characters you can define the number of times you want to see them.
......@@ -292,7 +278,7 @@ str_view(x, "C{2,}")
str_view(x, "C{2,3}")
```
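
To make the effect of each quantifier visible as text, this sketch uses `str_extract()`, which returns the first match. It assumes `stringr` is installed; the Roman numeral is just a convenient string with repeated letters.

```r
library(stringr)

x <- "MDCCCLXXXVIII"

# "?" means zero or one: one C, optionally followed by a second C
opt <- str_extract(x, "CC?")

# "+" means one or more, and is greedy: it takes as many C as possible
plus <- str_extract(x, "C+")

# {2} means exactly two, {2,3} means between two and three
exactly_two <- str_extract(x, "C{2}")
two_to_three <- str_extract(x, "C{2,3}")

# Adding "?" after a quantifier makes it lazy: it takes as few as possible
lazy <- str_extract(x, "C{2,3}?")

c(opt, plus, exactly_two, two_to_three, lazy)
# "CC" "CCC" "CC" "CCC" "CC"
```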
## Exercises
### Exercises
- Describe in words what these regular expressions match: (read carefully to see if I’m using a regular expression or a string that defines a regular expression.)
- `^.*$`
......@@ -305,7 +291,7 @@ str_view(x, "C{2,3}")
- Have two or more vowel-consonant pairs in a row.
## Grouping
### Grouping
You learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2 etc.). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with back references, like `\1`, `\2` etc.
......@@ -313,7 +299,7 @@ You learned about parentheses as a way to disambiguate complex expressions. Pare
str_view(fruit, "(..)\\1", match = TRUE)
```
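
As a smaller illustration of the same back reference, the sketch below applies `(..)\1` to three hand-picked words instead of the whole `fruit` vector. It assumes `stringr` is installed.

```r
library(stringr)

x <- c("banana", "coconut", "apple")

# (..) captures any two characters, \\1 requires the same two characters
# to appear again immediately afterwards
has_repeated_pair <- str_detect(x, "(..)\\1")

# str_extract() shows which part of each word matched (NA if none)
repeated_part <- str_extract(x, "(..)\\1")

has_repeated_pair  # TRUE TRUE FALSE
repeated_part      # "anan" "coco" NA
```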
## Exercises
### Exercises
- Describe, in words, what these expressions will match:
- `"(.)\1\1"`
......@@ -326,7 +312,7 @@ str_view(fruit, "(..)\\1", match = TRUE)
- Contain a repeated pair of letters (e.g. `“church”` contains `“ch”` repeated twice.)
- Contain one letter repeated in at least three places (e.g. `“eleven”` contains three `“e”`s.)
## Detect matches
### Detect matches
```{r str_view_match, eval=T, cache=T}
x <- c("apple", "banana", "pear")
......@@ -345,7 +331,7 @@ What proportion of common words ends with a vowel?
mean(str_detect(words, "[aeiou]$"))
```
## Combining detection
### Combining detection
Find all words containing at least one vowel, and negate
......@@ -360,7 +346,7 @@ no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
```
## With tibble
### With tibble
```{r str_detecttibble, eval=T, cache=T}
df <- tibble(
......@@ -371,7 +357,7 @@ df %>%
filter(str_detect(word, "x$"))
```
## Extract matches
### Extract matches
```{r str_sentences, eval=T, cache=T}
head(sentences)
......@@ -385,7 +371,7 @@ colour_match <- str_c(colours, collapse = "|")
colour_match
```
## Extract matches
### Extract matches
We can select the sentences that contain a colour, and then extract the colour to figure out which one it is:
......@@ -395,7 +381,7 @@ matches <- str_extract(has_colour, colour_match)
head(matches)
```
## Grouped matches
### Grouped matches
Imagine we want to extract nouns from the sentences. As a heuristic, we’ll look for any word that comes after “a” or “the”.
......@@ -415,11 +401,11 @@ has_noun %>%
str_match(noun)
```
## Exercises
### Exercises
- Find all words that come after a `number` like `one`, `two`, `three` etc. Pull out both the number and the word.
## Replacing matches
### Replacing matches
Instead of replacing with a fixed string, you can use back references to insert components of the match. In the following code, I flip the order of the second and third words.
......@@ -429,13 +415,13 @@ sentences %>%
head(5)
```
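
The same idea can be checked on a single short string, where the result is easy to verify by eye. This is a minimal sketch assuming `stringr` is installed; the input sentence is invented.

```r
library(stringr)

x <- "the quick brown fox"

# Three capturing groups, one per word; the replacement reuses them
# in a different order (\\1, then \\3, then \\2)
flipped <- str_replace(x, "(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2")

flipped  # "the brown quick fox"
```

Only the first three words are matched, so the rest of the string is left untouched.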
## Exercises
### Exercises
- Replace all forward slashes in a string with backslashes.
- Implement a simple version of `str_to_lower()` using `str_replace_all()`.
- Switch the first and last letters in words. Which of those strings are still words?
## Splitting
### Splitting
```{r splitting, eval=T, cache=T}
sentences %>%
......@@ -443,4 +429,4 @@ sentences %>%
str_split("\\s")
```
## See you in [R.8: Factors](https://can.gitbiopages.ens-lyon.fr/R_basis/session_8/)
\ No newline at end of file
### See you in [R.8: Factors](/session_8/session_8.html)
\ No newline at end of file
......@@ -2,36 +2,22 @@
title: "R.8: Factors"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2022"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "light"
lightbox: true
css: "../www/style_Rmd.css"
---
```{r include=FALSE}
library(fontawesome)
```
&ensp;`r fa(name = "fas fa-house", fill = "grey", height = "1em")` &ensp;https://can.gitbiopages.ens-lyon.fr/R_basis/
if("conflicted" %in% .packages())
conflicted::conflicts_prefer(dplyr::filter)
```
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
## Introduction
In this session, you will learn more about the factor type in R.
Factors can be very useful, but you have to be mindful of the implicit conversions between simple vectors and factors!
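
One classic pitfall of these conversions can be sketched with base R alone. The values below are invented for the illustration: converting a factor directly to a number returns the internal level codes, not the values that are displayed.

```r
# Numbers stored as text, then turned into a factor
f <- factor(c("10", "5", "20"))

# Levels are sorted as text, not as numbers
levels(f)  # "10" "20" "5"

# as.integer() returns the position of each value in levels(f)
codes <- as.integer(f)

# Going through as.character() first recovers the original numbers
values <- as.numeric(as.character(f))

codes   # 1 3 2
values  # 10 5 20
```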
......@@ -49,7 +35,7 @@ library(tidyverse)
</p>
</details>
# Creating factors
## Creating factors
Imagine that you have a variable that records month:
......@@ -98,7 +84,7 @@ f2
levels(f2)
```
# General Social Survey
## General Social Survey
```{r race_count, eval=T, cache=T}
gss_cat %>%
......@@ -108,12 +94,12 @@ gss_cat %>%
By default, `ggplot2` will drop levels that don’t have any values. You can force them to display with:
```{r race_plot, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(gss_cat, aes(race)) +
ggplot(gss_cat, aes(x = race)) +
geom_bar() +
scale_x_discrete(drop = FALSE)
```
# Modifying factor order
## Modifying factor order
It’s often useful to change the order of the factor levels in a visualisation.
......@@ -125,7 +111,7 @@ relig_summary <- gss_cat %>%
tvhours = mean(tvhours, na.rm = TRUE),
n = n()
)
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
ggplot(relig_summary, aes(x = tvhours, y = relig)) + geom_point()
```
It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using `fct_reorder()`. `fct_reorder()` takes three arguments:
......@@ -135,7 +121,7 @@ It is difficult to interpret this plot because there’s no overall pattern. We
- Optionally, `fun`, a function that’s used if there are multiple values of `x` for each value of `f`. The default value is `median`.
```{r tv_hour_order, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
geom_point()
```
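
To see what `fct_reorder()` does to the levels without drawing a plot, here is a minimal sketch on invented data. It assumes the `forcats` package is installed and uses the `.fun` argument name from the `forcats` function signature.

```r
library(forcats)

# Two levels, each with two associated values
f <- factor(c("a", "a", "b", "b"))
x <- c(1, 10, 2, 3)

# Default summary is the median: a = 5.5, b = 2.5, so "b" comes first
by_median <- levels(fct_reorder(f, x))

# With the minimum instead: a = 1, b = 2, so "a" comes first
by_min <- levels(fct_reorder(f, x, .fun = min))

by_median  # "b" "a"
by_min     # "a" "b"
```

The values in the factor are unchanged; only the order of its levels, and therefore the order on a plot axis, is affected.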
......@@ -144,11 +130,11 @@ As you start making more complicated transformations, I’d recommend moving the
```{r tv_hour_order_mutate, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
relig_summary %>%
mutate(relig = fct_reorder(relig, tvhours)) %>%
ggplot(aes(tvhours, relig)) +
ggplot(aes(x = tvhours, y = relig)) +
geom_point()
```
# `fct_reorder2()`
## `fct_reorder2()`
Another type of reordering is useful when you are colouring the lines on a plot. `fct_reorder2()` reorders the factor by the `y` values associated with the largest `x` values. This makes the plot easier to read because the line colours line up with the legend.
......@@ -161,17 +147,17 @@ by_age <- gss_cat %>%
```
```{r fct_reorder2a, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(by_age, aes(age, prop, colour = marital)) +
ggplot(by_age, aes(x = age, y = prop, colour = marital)) +
geom_line(na.rm = TRUE)
```
```{r fct_reorder2b, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
ggplot(by_age, aes(x = age, y = prop, colour = fct_reorder2(marital, age, prop))) +
geom_line() +
labs(colour = "marital")
```
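
The reordering rule of `fct_reorder2()` can also be checked on a tiny invented data set, as in the sketch below. It assumes `forcats` is installed and relies on the default behaviour of `fct_reorder2()`, which sorts levels in decreasing order of the `y` value found at the largest `x`.

```r
library(forcats)

# Two lines ("a" and "b"), each observed at x = 1 and x = 2
f <- factor(c("a", "a", "b", "b"))
x <- c(1, 2, 1, 2)
y <- c(5, 1, 2, 9)

# At the largest x (x = 2), line "b" has y = 9 and line "a" has y = 1,
# so "b" becomes the first level
reordered <- levels(fct_reorder2(f, x, y))

reordered  # "b" "a"
```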
# Materials
## Materials
There is a lot of material online for R, and more particularly for the `tidyverse` and `Rstudio`
......@@ -179,10 +165,14 @@ You can find cheat sheet for all the packages of the `tidyverse` on this page:
[https://www.rstudio.com/resources/cheatsheets/](https://www.rstudio.com/resources/cheatsheets/)
The `Rstudio` websites are also a good place to learn more about R and the meta-packages maintained by the `Rstudio` community:
- [https://www.rstudio.com/resources/webinars/](https://www.rstudio.com/resources/webinars/)
- [https://www.rstudio.com/products/rpackages/](https://www.rstudio.com/products/rpackages/)
For example [rmarkdown](https://rmarkdown.rstudio.com/) is a great way to turn your analyses into high quality documents, reports, presentations and dashboards.
For example [rmarkdown](https://rmarkdown.rstudio.com/) is a great way to turn your analyses into high quality documents, reports, presentations and dashboards:
- A comprehensive guide: [https://bookdown.org/yihui/rmarkdown/](https://bookdown.org/yihui/rmarkdown/)
- The cheatsheet [https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf](https://raw.githubusercontent.com/rstudio/cheatsheets/main/rmarkdown-2.0.pdf)
In addition, most packages provide **vignette**s on how to perform an analysis from scratch. On the [bioconductor.org](http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) website (specialised in R packages for biologists), you will find direct links to each package's vignettes.
......
#! /usr/bin/bash
# USAGE
# wget https://gitbio.ens-lyon.fr/can/R_basis/-/raw/main/src/create_users_from_user_list_csv.sh
# upload r_user_list_<day_number>_<day>.csv from your computer to the rstudio server
# sudo bash create_users_from_user_list_csv.sh r_user_list_<day_number>_<day>.csv
# The CSV file is expected as the first (and only) argument
USER_PASSWORD_FILENAME="$1"
while IFS=';' read -r NAME SURNAME EMAIL LAB COMMENT STATUS USERNAME PASSWD ; do
  # Skip the header and any line without an email address
  if [[ $EMAIL =~ "@" ]]
  then
    echo "=========================================="
    echo "user: $NAME $SURNAME $EMAIL $LAB"
    echo "r_login: $USERNAME"
    echo "r_passwd: $PASSWD"
    adduser "${USERNAME}" --gecos 'First Last,RoomNumber,WorkPhone,HomePhone' --disabled-password --force-badname > /dev/null
    echo "${USERNAME}:${PASSWD}" | chpasswd > /dev/null
  fi
done < "$USER_PASSWORD_FILENAME"
echo "=========================================="
#! /usr/bin/bash
# USAGE
# wget https://gitbio.ens-lyon.fr/can/R_basis/-/raw/master/src/create_users_from_user_pwd_list.sh
# upload X_user_pwd_list.tsv from your computer to the rstudio server
# sudo bash create_users_from_user_pwd_list.sh X_user_pwd_list.tsv
# The TSV file is expected as the first (and only) argument
USER_PASSWORD_FILENAME="$1"
while IFS=$'\t' read -r GROUPE NAME SURNAME MAIL LOGIN_CBP PASSWD_CBP LABO R_USERNAME R_PASSWD ; do
  # Skip the header and any line without an email address
  if [[ $MAIL =~ "@" ]]
  then
    echo "=========================================="
    echo "user: $NAME $SURNAME $MAIL $LABO group:$GROUPE"
    # Groups whose name contains "L" share the generic "TP" computer login
    if ! [[ $GROUPE =~ "L" ]]
    then
      echo "computer_login: $LOGIN_CBP"
      echo "computer_passwd: $PASSWD_CBP"
    else
      echo "computer_login: TP"
      echo "computer_passwd:"
    fi
    echo "r_login: $R_USERNAME"
    echo "r_passwd: $R_PASSWD"
    adduser "${R_USERNAME}" --gecos 'First Last,RoomNumber,WorkPhone,HomePhone' --disabled-password --force-badname > /dev/null
    echo "${R_USERNAME}:${R_PASSWD}" | chpasswd > /dev/null
  fi
done < "$USER_PASSWORD_FILENAME"
echo "=========================================="
\ No newline at end of file