Verified Commit 940456f2 authored by Laurent Modolo

start working on session_6
---
title: "R.5: Piping and grouping"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
  rmdformats::downcute:
    self_contain: true
    use_bookdown: true
    default_style: "dark"
    lightbox: true
    css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
  position = c('top', 'right'),
  color = "white",
  tooltip_message = 'Click to copy',
  tooltip_success = 'Copied !')
```
# Introduction

The goal of this practical is to practice combining data transformations with the `tidyverse`.
The objectives of this session are to:

- combine multiple operations with the pipe `%>%`
- work on subgroups of the data with `group_by()`
<div class="pencadre">
For this session we are going to work with a new dataset included in the `nycflights13` package.
Install this package and load it.
As usual you will also need the `tidyverse` library.
</div>
<details><summary>Solution</summary>
<p>
```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
library("nycflights13")
```
</p>
</details>
# Combining multiple operations with the pipe
<div id="pencadre">
Find the 10 most delayed flights using the ranking function `min_rank()`.
</div>
<details><summary>Solution</summary>
<p>
```{r pipe_example_a, include=TRUE}
flights_md <- mutate(flights,
  most_delay = min_rank(desc(dep_delay)))
flights_md <- filter(flights_md, most_delay <= 10)
flights_md <- arrange(flights_md, most_delay)
```
</p>
</details>
We don't want to create useless intermediate variables, so we can use the pipe operator `%>%`
(or `Ctrl + Shift + M` in RStudio).
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
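As a minimal sketch of this rewriting (the pipe `%>%` comes from the `magrittr` package, which the `tidyverse` re-exports), the nested and piped forms below are strictly equivalent:

```{r pipe_desugar, include=TRUE}
library(magrittr)  # provides %>% (re-exported by the tidyverse)

# x %>% f() %>% g() is rewritten behind the scenes as g(f(x))
sum(sqrt(c(1, 4, 9)))            # nested form, reads inside-out
c(1, 4, 9) %>% sqrt() %>% sum()  # piped form, reads left-to-right
```

Both return `6`: the pipe only changes how you write the call, not what is computed.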
<div id="pencadre">
Use the pipe operator to rewrite your previous code with only **one** variable assignment.
</div>
<details><summary>Solution</summary>
<p>
```{r pipe_example_b, include=TRUE}
flights_md2 <- flights %>%
  mutate(most_delay = min_rank(desc(dep_delay))) %>%
  filter(most_delay <= 10) %>%
  arrange(most_delay)
```
</p>
</details>
Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered and uses `+` instead of `%>%`. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn't quite ready for prime time yet.
## When not to use the pipe

The pipe is a powerful tool, but it's not the only tool at your disposal, and it doesn't solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. You should reach for another tool when:
- Your pipes are longer than (say) ten steps. In that case, create intermediate functions with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
- You have multiple inputs or outputs. If there isn't one primary object being transformed, but two or more objects being combined together, don't use the pipe. You can create a function that combines or splits the results.
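For instance, when two objects are combined, a named function can be clearer than hiding one input inside a pipe. Here is a sketch with a hypothetical helper, `count_flights_by_origin()`, that combines the `flights` and `airports` tables from `nycflights13`:

```{r no_pipe_sketch, message=FALSE}
library(dplyr)
library(nycflights13)

# Hypothetical helper: two primary inputs (flights and airports),
# so we name the combination instead of burying one input in a pipe.
count_flights_by_origin <- function(flights_tbl, airports_tbl) {
  flights_tbl %>%
    count(origin) %>%
    left_join(airports_tbl, by = c("origin" = "faa"))
}

count_flights_by_origin(flights, airports)
```

Inside the function a short pipe is still fine; the point is that the two-input combination gets a meaningful name at the call site.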
# Grouping variables
The `summarise()` function collapses a data frame to a single row.
Check the difference between `summarise()` and `mutate()` with the following commands:
```{r load_data, eval=FALSE}
flights %>%
  mutate(delay = mean(dep_delay, na.rm = TRUE))
flights %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```
Whereas `mutate()` computes the `mean` of `dep_delay` row by row (which is not useful here), `summarise()` computes the `mean` of the whole `dep_delay` column.
## The power of `summarise()` with `group_by()`
The `group_by()` function changes the unit of analysis from the complete dataset to individual groups.
Individual groups are defined by categorical variables or **factors**.
Then, the functions you already know, when applied to a grouped data frame, will automatically operate *by group*.
You can use the following code to compute the average delay per month across years.
```{r summarise_group_by, include=TRUE, message=FALSE, fig.width=8, fig.height=3.5}
flights_delay <- flights %>%
  group_by(year, month) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE),
            sd = sd(dep_delay, na.rm = TRUE)) %>%
  arrange(month)
ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
  geom_bar(stat = "identity", color = "black", fill = "#619CFF") +
  geom_errorbar(mapping = aes(ymin = 0, ymax = delay + sd)) +
  theme(axis.text.x = element_blank())
```
<div class="pencadre">
Why did we `group_by` `year` and `month` and not only `year`?
</div>
## Missing values
<div class="pencadre">
You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
</div>
<details><summary>Solution</summary>
<p>
```{r summarise_group_by_NA, include=TRUE}
flights %>%
  group_by(dest) %>%
  summarise(
    dist = mean(distance),
    delay = mean(arr_delay)
  )
```
</p>
</details>
Aggregation functions obey the usual rule of missing values: **if there’s any missing value in the input, the output will be a missing value**.
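A quick illustration of this rule with `mean()`:

```{r na_rule_example, include=TRUE}
mean(c(1, 2, NA))                # NA: one missing input poisons the output
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5: missing values are dropped first
```

This is why the `na.rm = TRUE` argument appears in most of the `summarise()` calls of this session.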
## Counts
Whenever you do any aggregation, it's always a good idea to include either a count (`n()`) or a count of non-missing values (`sum(!is.na(x))`). That way you can check that you're not drawing conclusions based on very small amounts of data.
```{r summarise_group_by_count, include = T, echo=F, warning=F, message=F, fig.width=8, fig.height=3.5}
summ_delay_flights <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(dest != "HNL") %>%
  filter(delay < 40 & delay > -20)
ggplot(data = summ_delay_flights, mapping = aes(x = dist, y = delay, size = count)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme(legend.position = 'none')
```
<div class="pencadre">
Imagine that we want to explore the relationship between the distance and average delay for each location and recreate the above figure.
Here are the steps to prepare this data:
1. Group flights by destination.
2. Summarize to compute distance, average delay, and number of flights using `n()`.
3. Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport.
4. Filter to remove noisy points with a delay greater than 40 or less than -20.
5. Create a `mapping` on `dist`, `delay` and `count` as `size`.
6. Use the layer `geom_point()` and `geom_smooth()`
7. We can hide the legend with the layer `theme(legend.position='none')`
</div>
<details><summary>Solution</summary>
<p>
```{r summarise_group_by_count_b, include = T, eval=F, warning=F, message=F, fig.width=8, fig.height=3.5}
flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(dest != "HNL") %>%
  filter(delay < 40 & delay > -20) %>%
  ggplot(mapping = aes(x = dist, y = delay, size = count)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme(legend.position = 'none')
```
</p>
</details>
## Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
<div class="pencadre">
Try the following example
</div>
```{r ungroup, eval=T, message=FALSE, cache=T}
flights %>%
  group_by(year, month, day) %>%
  ungroup() %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```
# Grouping challenges
## First challenge
<div class="pencadre">
Look at the number of canceled flights per day. Is there a pattern?
**Remember to always try to decompose complex questions into smaller, simpler problems**

- What are `canceled` flights?
- How can I find `canceled` flights?
- We need to define the day of the week (`wday`) variable (`strftime(x, '%A')` gives you the name of the day from a POSIXct date).
- We can count the number of canceled flights (`cancel_day`) by day of the week (`wday`).
- We can pipe the transformed and filtered tibble into a `ggplot` function.
- We can use `geom_col` to have a barplot of the number of `cancel_day` for each `wday`.
- You can use the function `fct_reorder()` to reorder the `wday` by number of `cancel_day` and make the plot easier to read.
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_a, eval=T, message=FALSE, cache=T}
flights %>%
  mutate(canceled = is.na(dep_time) | is.na(arr_time)) %>%
  filter(canceled) %>%
  mutate(wday = strftime(time_hour, '%A')) %>%
  group_by(wday) %>%
  summarise(cancel_day = n()) %>%
  ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
  geom_col()
```
</p>
</details>
## Second challenge
<div class="pencadre">
Is the proportion of canceled flights by day of the week related to the average departure delay?
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_b1, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
  mutate(canceled = is.na(dep_time) | is.na(arr_time)) %>%
  mutate(wday = strftime(time_hour, '%A')) %>%
  group_by(wday) %>%
  mutate(
    prop_cancel_day = sum(canceled) / sum(!canceled),
    av_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  ggplot(mapping = aes(x = av_delay, y = prop_cancel_day, color = wday)) +
  geom_point()
```
Which day would you prefer to book a flight?
</p>
</details>
<div class="pencadre">
We can add error bars to this plot to justify our decision.
Brainstorm a way to get access to the mean and standard deviation of `prop_cancel_day` and `av_delay`.
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_b2, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
  mutate(canceled = is.na(dep_time) | is.na(arr_time)) %>%
  mutate(wday = strftime(time_hour, '%A')) %>%
  group_by(day) %>%
  mutate(
    prop_cancel_day = sum(canceled) / sum(!canceled),
    av_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  group_by(wday) %>%
  summarize(
    mean_cancel_day = mean(prop_cancel_day, na.rm = TRUE),
    sd_cancel_day = sd(prop_cancel_day, na.rm = TRUE),
    mean_av_delay = mean(av_delay, na.rm = TRUE),
    sd_av_delay = sd(av_delay, na.rm = TRUE)
  ) %>%
  ggplot(mapping = aes(x = mean_av_delay, y = mean_cancel_day, color = wday)) +
  geom_point() +
  geom_errorbarh(mapping = aes(
    xmin = mean_av_delay - sd_av_delay,
    xmax = mean_av_delay + sd_av_delay
  )) +
  geom_errorbar(mapping = aes(
    ymin = mean_cancel_day - sd_cancel_day,
    ymax = mean_cancel_day + sd_cancel_day
  ))
```
</p>
</details>
<div class="pencadre">
Now that you are aware of the interest of using `geom_errorbar`, what `hour` of the day should you fly if you want to avoid delays as much as possible?
</div>
<details><summary>Solution</summary>
<p>
```{r group_filter_b3, eval=T, warning=F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
  group_by(hour) %>%
  summarise(
    mean_delay = mean(arr_delay, na.rm = TRUE),
    sd_delay = sd(arr_delay, na.rm = TRUE)
  ) %>%
  ggplot() +
  geom_errorbar(mapping = aes(
    x = hour,
    ymax = mean_delay + sd_delay,
    ymin = mean_delay - sd_delay)) +
  geom_point(mapping = aes(x = hour, y = mean_delay))
```
</p>
</details>
## Third challenge
<div class="pencadre">
Which carrier has the worst delays?
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_c1, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
  group_by(carrier) %>%
  summarise(
    carrier_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
  geom_col(alpha = 0.5)
```
</p>
</details>
<div class="pencadre">
Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n())`)
</div>
<details><summary>Solution</summary>
<p>
```{r grouping_challenges_c2, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
  group_by(carrier, dest) %>%
  summarise(
    carrier_delay = mean(arr_delay, na.rm = TRUE),
    number_of_flight = n()
  ) %>%
  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
  geom_boxplot() +
  geom_jitter(height = 0)
```
</p>
</details>
---
title: "R.6: tidydata"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2021"
output:
  rmdformats::downcute:
    self_contain: true
    use_bookdown: true
    default_style: "dark"
    lightbox: true
    css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
  position = c('top', 'right'),
  color = "white",
  tooltip_message = 'Click to copy',
  tooltip_success = 'Copied !')
```
# Introduction
Until now we have worked with data already formatted in a *nice way*.
In the `tidyverse`, data formatted in a *nice way* are called **tidy**.
The goal of this practical is to understand how to transform an ugly blob of information into a **tidy** dataset.
## Tidydata
There are three interrelated rules which make a dataset tidy:

- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
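As a small hand-made illustration of these three rules (the values are taken from the `table1` dataset used below), a tidy table has one column per variable, one row per observation, and one value per cell:

```{r tidy_rules_sketch, include=TRUE}
library(tibble)

# Three variables (country, year, cases), each in its own column;
# each row is one country-year observation; each cell holds one value.
tidy_example <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  year    = c(1999, 1999, 1999),
  cases   = c(745, 37737, 212258)
)
tidy_example
```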
Doing this kind of transformation is often called **data wrangling**, due to the feeling that we have to *wrangle* with the data to force them into a **tidy** format.
But once this step is finished, most of the subsequent analyses will be really fast to do!
<div class="pencadre">
As usual we will need the `tidyverse` library.
</div>
<details><summary>Solution</summary>
<p>
```{r load_data, eval=T, message=F}
library(tidyverse)
```
</p>
</details>
For this practical we are going to use the `table1` to `table5` datasets, which demonstrate multiple ways to lay out the same tabular data.
<div class="pencadre">
Use the help to learn more about these datasets.
</div>
<details><summary>Solution</summary>
<p>
`table1`, `table2`, `table3`, `table4a`, `table4b`, and `table5` all display the number of TB (Tuberculosis) cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.
The data is a subset of the data contained in the World Health Organization Global Tuberculosis Report.
</p>
</details>
# Pivoting data
## pivot longer

<div class="pencadre">
Visualize the `table4a` dataset (you can use the `View()` function).

```{r table4a, eval=F, message=T}
View(table4a)
```

Is the data **tidy**? How would you transform this dataset to make it **tidy**?
</div>
<details><summary>Solution</summary>
<p>
We have information about three variables in `table4a`: `country`, `year` and the number of `cases`.
However, the `year` variable is stored as column names.
We want to pivot the horizontal year columns vertically, making the table longer.
You can use the `pivot_longer()` function to make your table longer and have one observation per row and one variable per column.
For this, we need to:

- specify which columns to select (all except `country`)
- give the name of the new variable (`year`)
- give the name of the variable stored in the cells of the year columns (`case`)
```{r pivot_longer, eval=T, message=T}
table4a %>%
  pivot_longer(-country,
               names_to = "year",
               values_to = "case")
```
</p>
</details>
## pivot wider
```{r table2, eval=T, message=T}
table2
```
<div class="pencadre">
Visualize the `table2` dataset
Is the data **tidy**? How would you transform this dataset to make it **tidy**? (You can also make a guess from the name of the subsection.)
</div>
<details><summary>Solution</summary>
<p>
The column `count` stores two types of information: the `population` size of the country and the number of `cases` in the country.
You can use the `pivot_wider()` function to make your table wider and have one observation per row and one variable per column.
```{r pivot_wider, eval=T, message=T}
table2 %>%
  pivot_wider(names_from = type,
              values_from = count)
```
</p>
</details>
# Merging data
## Relational data