From f1cc320cc4b3a1892f1ead725622aa29efe7c37c Mon Sep 17 00:00:00 2001
From: Laurent Modolo <laurent.modolo@ens-lyon.fr>
Date: Fri, 10 Sep 2021 11:12:52 +0200
Subject: [PATCH] split session 4 into session 4 and 5

---
 session_4/session_4.Rmd  | 154 ---------------
 session_5/sesssion_5.Rmd | 396 +++++++++++++++++++++++++++++++++++++++
 session_5/slides.Rmd     | 285 ----------------------------
 3 files changed, 396 insertions(+), 439 deletions(-)
 create mode 100644 session_5/sesssion_5.Rmd
 delete mode 100644 session_5/slides.Rmd

diff --git a/session_4/session_4.Rmd b/session_4/session_4.Rmd
index 93d0a36..ecf70e9 100644
--- a/session_4/session_4.Rmd
+++ b/session_4/session_4.Rmd
@@ -33,7 +33,6 @@ The objectives of this session will be to:
 - Arrange rows with `arrange()`
 - Select columns with `select()`
 - Add new variables with `mutate()`
-- Combining multiple operations with the pipe `%>%`
 
 <div class="pencadre">
 For this session we are going to work with a new dataset included in the `nycflights13` package.
@@ -411,156 +410,3 @@ mutate(
 - Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means. 
 - Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`
 - Ranking: there are a number of ranking functions, but you should start with `min_rank()`. There is also `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`
-
- 
-# Combining multiple operations with the pipe
-
-
-<div id="pencadre">
-Find the 10 most delayed flights using a ranking function. `min_rank()`
-</div>
-
-<details><summary>Solution</summary>
-<p>
-```{r pipe_example_a, include=TRUE}
-flights_md <- mutate(flights,
-                     most_delay = min_rank(desc(dep_delay)))
-flights_md <- filter(flights_md, most_delay < 10)
-flights_md <- arrange(flights_md, most_delay)
-```
-</p>
-</details>
-
-
-We don't want to create useless intermediate variables so we can use the pipe operator: `%>%`
-( or `ctrl + shift + M`). 
-
-Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. 
-
-<div id="pencadre">
-Try to pipe operator to rewrite your precedent code with only **one** variable assignment.
-</div>
- 
-
-<details><summary>Solution</summary>
-<p>
-```{r pipe_example_b, include=TRUE}
-flights_md2 <- flights %>%
-    mutate(most_delay = min_rank(desc(dep_delay))) %>% 
-    filter(most_delay < 10) %>% 
-    arrange(most_delay)
-```
-</p>
-</details>
-
-Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered and use `+` instead of `%>%`. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet.
-
-# Grouped summaries with `summarise()`
-
-`summarise()` collapses a data frame to a single row:
-
-Check the difference between `summarise()` and `mutate()` with the following commands:
-
-```{r load_data, eval=FALSE}
-flights %>% 
-  mutate(delay = mean(dep_delay, na.rm = TRUE))
-flights %>% 
-  summarise(delay = mean(dep_delay, na.rm = TRUE))
-```
-
-## The power of `summarise()` with `group_by()`
-
-The `group_by()` function changes the unit of analysis from the complete dataset to individual groups.
-Then, when you use the function you already know on grouped data frame and they’ll be automatically applied *by group*.
-
-You can use the following code to compute the average delay per months across years.
-
-```{r summarise_group_by, include=TRUE, fig.width=8, fig.height=3.5}
-flights_delay <- flights %>% 
-  group_by(year, month) %>% 
-  summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>% 
-  arrange(month)
-
-ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
-  geom_bar(stat="identity", color="black", fill = "#619CFF") +
-  geom_errorbar(mapping = aes( ymin=0, ymax=delay+sd)) + 
-  theme(axis.text.x = element_blank())
-```
-<div class="pencadre">
-Why did we `group_by` `year` and `month` and not only `year` ?
-</div>
-
-
-## Missing values
-
-<div class="pencadre">
-You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
-</div>
-
-```{r summarise_group_by_NA, include=TRUE}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    dist = mean(distance),
-    delay = mean(arr_delay)
-  )
-```
-
-Aggregation functions obey the usual rule of missing values: **if there’s any missing value in the input, the output will be a missing value**.
-
-# Counts
-
-Whenever you do any aggregation, it’s always a good idea to include either a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
-
-```{r summarise_group_by_count, include = T, echo=F, warning=F, message=F, fig.width=8, fig.height=3.5}
-summ_delay_filghts <- flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  ) %>% 
-  filter(dest != "HNL") %>% 
-  filter(delay < 40 & delay > -20)
-
-  
-
-ggplot(data = summ_delay_filghts, mapping = aes(x = dist, y = delay, size = count)) +
-  geom_point() +
-  geom_smooth(method = lm, se = FALSE) +
-  theme(legend.position='none')
-```
-
-<div class="pencadre">
-Imagine that we want to explore the relationship between the distance and average delay for each location and recreate the above figure. 
-here are three steps to prepare this data: 
-
-1. Group flights by destination.
-2. Summarize to compute distance, average delay, and number of flights using `n()`.
-3. Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
-4. Filter to remove noisy points with delay superior to 40 or inferior to -20
-5. Create a `mapping` on `dist`, `delay` and `count` as `size`.
-6. Use the layer `geom_point()` and `geom_smooth()`
-7. We can hide the legend with the layer `theme(legend.position='none')`
-</div>
-
-<details><summary>Solution</summary>
-<p>
-```{r summarise_group_by_count_b, include = T, eval=F, warning=F, message=F, fig.width=8, fig.height=3.5}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  ) %>% 
-  filter(dest != "HNL") %>% 
-  filter(delay < 40 & delay > -20) %>% 
-  ggplot(mapping = aes(x = dist, y = delay, size = count)) +
-  geom_point() +
-  geom_smooth(method = lm, se = FALSE) +
-  theme(legend.position='none')
-```
-</p>
-</details>
- 
\ No newline at end of file
diff --git a/session_5/sesssion_5.Rmd b/session_5/sesssion_5.Rmd
new file mode 100644
index 0000000..05c9201
--- /dev/null
+++ b/session_5/sesssion_5.Rmd
@@ -0,0 +1,396 @@
+---
+title: "R#5: Piping and grouping"
+author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
+date: "2021"
+output:
+  rmdformats::downcute:
+    self_contained: true
+    use_bookdown: true
+    default_style: "dark"
+    lightbox: true
+    css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
+---
+
+```{r setup, include=FALSE}
+rm(list=ls())
+knitr::opts_chunk$set(echo = TRUE)
+knitr::opts_chunk$set(comment = NA)
+```
+```{r klippy, echo=FALSE, include=TRUE}
+klippy::klippy(
+  position = c('top', 'right'),
+  color = "white",
+  tooltip_message = 'Click to copy',
+  tooltip_success = 'Copied!')
+```
+
+# Introduction
+
+The goal of this practical is to practice combining data transformations with the `tidyverse`.
+The objectives of this session will be to:
+
+- Combine multiple operations with the pipe `%>%`
+- Work on subgroups of the data with `group_by()`
+
+<div class="pencadre">
+For this session we are going to work with a new dataset included in the `nycflights13` package.
+Install this package and load it.
+As usual you will also need the `tidyverse` library.
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r packageloaded, include=TRUE, message=FALSE}
+library("tidyverse")
+library("nycflights13")
+```
+</p>
+</details>
+
+# Combining multiple operations with the pipe
+
+<div id="pencadre">
+Find the 10 most delayed flights using a ranking function such as `min_rank()`.
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r pipe_example_a, include=TRUE}
+flights_md <- mutate(flights,
+                     most_delay = min_rank(desc(dep_delay)))
+flights_md <- filter(flights_md, most_delay <= 10)
+flights_md <- arrange(flights_md, most_delay)
+```
+</p>
+</details>
+
+
+We don't want to create useless intermediate variables, so we can use the pipe operator `%>%`
+(keyboard shortcut in RStudio: `ctrl + shift + M`).
+
+Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. 
+
+<div id="pencadre">
+Try to use the pipe operator to rewrite your previous code with only **one** variable assignment.
+</div>
+ 
+<details><summary>Solution</summary>
+<p>
+```{r pipe_example_b, include=TRUE}
+flights_md2 <- flights %>%
+    mutate(most_delay = min_rank(desc(dep_delay))) %>% 
+    filter(most_delay <= 10) %>%
+    arrange(most_delay)
+```
+</p>
+</details>
+
+Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered and uses `+` instead of `%>%`. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet.
+
+## When not to use the pipe
+
+The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. You should reach for another tool when:
+
+- Your pipes are longer than (say) ten steps. In that case, create intermediate variables with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
+- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or splits the results.
+
+# Grouping variable
+
+The `summarise()` function collapses a data frame to a single row.
+Check the difference between `summarise()` and `mutate()` with the following commands:
+
+```{r load_data, eval=FALSE}
+flights %>% 
+  mutate(delay = mean(dep_delay, na.rm = TRUE))
+flights %>% 
+  summarise(delay = mean(dep_delay, na.rm = TRUE))
+```
+
+Whereas `mutate()` computes the `mean` of the whole `dep_delay` column and repeats it on every row (which is not useful), `summarise()` collapses the data frame to a single row containing that `mean`.
+
+## The power of `summarise()` with `group_by()`
+
+The `group_by()` function changes the unit of analysis from the complete dataset to individual groups.
+Individual groups are defined by categorical variables or **factors**.
+Then, when you use the functions you already know on a grouped data frame, they’ll be automatically applied *by group*.
+
+You can use the following code to compute the average delay per month across years.
+
+```{r summarise_group_by, include=TRUE, fig.width=8, fig.height=3.5}
+flights_delay <- flights %>% 
+  group_by(year, month) %>% 
+  summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>% 
+  arrange(month)
+
+ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
+  geom_bar(stat="identity", color="black", fill = "#619CFF") +
+  geom_errorbar(mapping = aes( ymin=0, ymax=delay+sd)) + 
+  theme(axis.text.x = element_blank())
+```
+
+<div class="pencadre">
+Why did we `group_by()` both `year` and `month`, and not only `year`?
+</div>
+
+## Missing values
+
+<div class="pencadre">
+You may have wondered about the `na.rm` argument we used above. What happens if we don’t set it?
+</div>
+
+```{r summarise_group_by_NA, include=TRUE}
+flights %>% 
+  group_by(dest) %>% 
+  summarise(
+    dist = mean(distance),
+    delay = mean(arr_delay)
+  )
+```
+
+Aggregation functions obey the usual rule of missing values: **if there’s any missing value in the input, the output will be a missing value**.
+
+## Counts
+
+Whenever you do any aggregation, it’s always a good idea to include a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
+
+```{r summarise_group_by_count, include = T, echo=F, warning=F, message=F, fig.width=8, fig.height=3.5}
+summ_delay_flights <- flights %>% 
+  group_by(dest) %>% 
+  summarise(
+    count = n(),
+    dist = mean(distance, na.rm = TRUE),
+    delay = mean(arr_delay, na.rm = TRUE)
+  ) %>% 
+  filter(dest != "HNL") %>% 
+  filter(delay < 40 & delay > -20)
+
+  
+
+ggplot(data = summ_delay_flights, mapping = aes(x = dist, y = delay, size = count)) +
+  geom_point() +
+  geom_smooth(method = lm, se = FALSE) +
+  theme(legend.position='none')
+```
+
+<div class="pencadre">
+Imagine that we want to explore the relationship between the distance and average delay for each location and recreate the above figure. 
+Here are the steps to prepare this data and plot it: 
+
+1. Group flights by destination.
+2. Summarize to compute distance, average delay, and number of flights using `n()`.
+3. Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport.
+4. Filter to remove noisy points with a delay greater than 40 or lower than -20.
+5. Create a `mapping` on `dist`, `delay` and `count` as `size`.
+6. Use the layers `geom_point()` and `geom_smooth()`.
+7. We can hide the legend with the layer `theme(legend.position='none')`.
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r summarise_group_by_count_b, include = T, eval=F, warning=F, message=F, fig.width=8, fig.height=3.5}
+flights %>% 
+  group_by(dest) %>% 
+  summarise(
+    count = n(),
+    dist = mean(distance, na.rm = TRUE),
+    delay = mean(arr_delay, na.rm = TRUE)
+  ) %>% 
+  filter(dest != "HNL") %>% 
+  filter(delay < 40 & delay > -20) %>% 
+  ggplot(mapping = aes(x = dist, y = delay, size = count)) +
+  geom_point() +
+  geom_smooth(method = lm, se = FALSE) +
+  theme(legend.position='none')
+```
+</p>
+</details>
+
+
+## Ungrouping
+
+If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
+
+<div class="pencadre">
+Try the following example
+</div>
+
+```{r ungroup, eval=T, message=FALSE, cache=T}
+flights %>% 
+  group_by(year, month, day) %>% 
+  ungroup() %>%
+  summarise(delay = mean(dep_delay, na.rm = TRUE))
+```
+
+# Grouping challenges
+
+## First challenge
+
+<div class="pencadre">
+
+Look at the number of canceled flights per day. Is there a pattern?
+
+**Remember to always try to decompose complex questions into smaller and simple problems**
+
+- What are `canceled` flights?
+- How can I identify `canceled` flights?
+- We need to define the day of the week `wday` variable (`strftime(x,'%A')` gives you the name of the day from a POSIXct date).
+- We can count the number of canceled flights (`cancel_day`) by day of the week (`wday`).
+- We can pipe the transformed and filtered tibble into the `ggplot()` function.
+- We can use `geom_col` to have a barplot of the number of `cancel_day` for each `wday`.
+- You can use the function `fct_reorder()` to reorder the `wday` by number of `cancel_day` and make the plot easier to read.
+
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r grouping_challenges_a, eval=T, message=FALSE, cache=T}
+flights %>% 
+  mutate(
+    canceled = is.na(dep_time) | is.na(arr_time)
+  ) %>% 
+  filter(canceled) %>% 
+  mutate(wday = strftime(time_hour,'%A')) %>% 
+  group_by(wday) %>% 
+  summarise(
+    cancel_day = n()
+  ) %>%
+  ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
+  geom_col()
+```
+</p>
+</details>
+
+## Second challenge
+
+<div class="pencadre">
+Is the proportion of canceled flights by day of the week related to the average departure delay?
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r grouping_challenges_b1, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
+flights %>% 
+  mutate(
+    canceled = is.na(dep_time) | is.na(arr_time)
+  ) %>% 
+  mutate(wday = strftime(time_hour,'%A')) %>% 
+  group_by(wday) %>% 
+  mutate(
+    prop_cancel_day = mean(canceled), # share of canceled flights
+    av_delay = mean(dep_delay, na.rm = TRUE)
+  ) %>%
+  ungroup() %>% 
+  ggplot(mapping = aes(x = av_delay, y = prop_cancel_day, color = wday)) +
+  geom_point()
+```
+
+On which day would you prefer to book a flight?
+</p>
+</details>
+
+<div class="pencadre">
+We can add error bars to this plot to justify our decision.
+Brainstorm a way to have access to the mean and standard deviation of the `prop_cancel_day` and `av_delay`.
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r grouping_challenges_b2, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
+flights %>% 
+  mutate(
+    canceled = is.na(dep_time) | is.na(arr_time)
+  ) %>% 
+  mutate(wday = strftime(time_hour,'%A')) %>% 
+  group_by(wday, month, day) %>% # one group per calendar date
+  summarise(
+    prop_cancel_day = mean(canceled), # share of canceled flights on that date
+    av_delay = mean(dep_delay, na.rm = TRUE)
+  ) %>%
+  group_by(wday) %>% 
+  summarize(
+    mean_cancel_day = mean(prop_cancel_day, na.rm = TRUE),
+    sd_cancel_day = sd(prop_cancel_day, na.rm = TRUE),
+    mean_av_delay = mean(av_delay, na.rm = TRUE),
+    sd_av_delay = sd(av_delay, na.rm = TRUE)
+  ) %>% 
+  ggplot(mapping = aes(x = mean_av_delay, y = mean_cancel_day, color = wday)) +
+  geom_point() +
+  geom_errorbarh(mapping = aes(
+    xmin = -sd_av_delay + mean_av_delay,
+    xmax = sd_av_delay + mean_av_delay
+  )) +
+  geom_errorbar(mapping = aes(
+    ymin = -sd_cancel_day + mean_cancel_day,
+    ymax = sd_cancel_day + mean_cancel_day
+  ))
+```
+</p>
+</details>
+
+<div class="pencadre">
+Now that you are aware of the value of `geom_errorbar`, at what `hour` of the day should you fly if you want to avoid delays as much as possible?
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r group_filter_b3, eval=T, warning=F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
+flights %>% 
+  group_by(hour) %>% 
+  summarise(
+    mean_delay = mean(arr_delay, na.rm = T),
+    sd_delay = sd(arr_delay, na.rm = T),
+  ) %>% 
+  ggplot() +
+  geom_errorbar(mapping = aes(
+    x = hour,
+    ymax = mean_delay + sd_delay,
+    ymin = mean_delay - sd_delay)) +
+  geom_point(mapping = aes(
+    x = hour,
+    y = mean_delay,
+  ))
+```
+</p>
+</details>
+
+## Third challenge
+
+<div class="pencadre">
+Which carrier has the worst delays?
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r grouping_challenges_c, eval=F, echo = T, message=FALSE, cache=T}
+flights %>% 
+  group_by(carrier) %>% 
+  summarise(
+    carrier_delay = mean(arr_delay, na.rm = T)
+  ) %>%
+  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
+  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
+  geom_col(alpha = 0.5)
+```
+</p>
+</details>
+
+<div class="pencadre">
+Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about `group_by(carrier, dest) %>% summarise(n())`)
+</div>
+
+<details><summary>Solution</summary>
+<p>
+```{r grouping_challenges_d, eval=F, echo = T, message=FALSE, cache=T}
+flights %>% 
+  group_by(carrier, dest) %>% 
+  summarise(
+    carrier_delay = mean(arr_delay, na.rm = T),
+    number_of_flight = n()
+  ) %>% ungroup() %>%
+  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
+  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
+  geom_boxplot() +
+  geom_jitter(height = 0)
+```
+</p>
+</details>
diff --git a/session_5/slides.Rmd b/session_5/slides.Rmd
deleted file mode 100644
index 369cda9..0000000
--- a/session_5/slides.Rmd
+++ /dev/null
@@ -1,285 +0,0 @@
----
-title: "R#5: data transformation"
-author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
-date: "28 Nov 2019"
-output:
-  slidy_presentation:
-    highlight: tango
-  beamer_presentation:
-    theme: metropolis
-    slide_level: 3
-    fig_caption: no
-    df_print: tibble
-    highlight: tango
-    latex_engine: xelatex
----
-
-```{r setup, include=FALSE, cache=TRUE}
-knitr::opts_chunk$set(echo = FALSE)
-library(tidyverse)
-```
-
-## Grouped summaries with `summarise()`
-
-`summarise()` collapses a data frame to a single row:
-
-```{r load_data, eval=T, message=FALSE, cache=T}
-library(nycflights13)
-library(tidyverse)
-flights %>% 
-  summarise(delay = mean(dep_delay, na.rm = TRUE))
-```
-
-## The power of `summarise()` with `group_by()`
-
-This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the `dplyr` verbs on a grouped data frame they’ll be automatically applied “by group”.
-
-
-```{r summarise_group_by, eval=T, message=FALSE, cache=T}
-flights %>% 
-  group_by(year, month, day) %>% 
-  summarise(delay = mean(dep_delay, na.rm = TRUE))
-```
-
-**5_a**
-
-## Challenge with `summarise()` and `group_by()`
-
-Imagine that we want to explore the relationship between the distance and average delay for each location. 
-here are three steps to prepare this data: 
-
-- Group flights by destination.
-- Summarise to compute distance, average delay, and number of flights.
-- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
-
-```{r summarise_group_by_ggplot_a, eval = F}
-flights %>% 
-  group_by(dest)
-```
-
-## Challenge with `summarise()` and `group_by()`
-
-Imagine that we want to explore the relationship between the distance and average delay for each location. 
-
-- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
-
-```{r summarise_group_by_ggplot_b, eval = F}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  )
-```
-
-## Missing values
-
-You may have wondered about the na.rm argument we used above. What happens if we don’t set it?
-
-```{r summarise_group_by_NA, cache = TRUE, fig.width=8, fig.height=4.5, message = FALSE}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    dist = mean(distance),
-    delay = mean(arr_delay)
-  )
-```
-
-Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
-
-## Counts
-
-Whenever you do any aggregation, it’s always a good idea to include either a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
-
-```{r summarise_group_by_count, cache = TRUE, fig.width=8, fig.height=4.5, message = FALSE}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  )
-```
-
-## Challenge with `summarise()` and `group_by()`
-
-Imagine that we want to explore the relationship between the distance and average delay for each location. 
-
-- Summarise to compute distance, average delay, and number of flights.
-- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
-
-```{r summarise_group_by_ggplot_c, eval = F}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  ) %>% 
-  filter(count > 20, dest != "HNL")
-```
-
-## Challenge with `summarise()` and `group_by()`
-
-Imagine that we want to explore the relationship between the distance and average delay for each location. 
-
-```{r summarise_group_by_ggplot_d, eval = F}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  ) %>% 
-  filter(count > 20, dest != "HNL") %>% 
-  ggplot(mapping = aes(x = dist, y = delay)) +
-  geom_point(aes(size = count), alpha = 1/3) +
-  geom_smooth(se = FALSE)
-```
-
-**5_b**
-
-## Challenge with `summarise()` and `group_by()`
-
-```{r summarise_group_by_ggplot, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
-flights %>% 
-  group_by(dest) %>% 
-  summarise(
-    count = n(),
-    dist = mean(distance, na.rm = TRUE),
-    delay = mean(arr_delay, na.rm = TRUE)
-  ) %>% 
-  filter(count > 20, dest != "HNL") %>% 
-  ggplot(mapping = aes(x = dist, y = delay)) +
-  geom_point(aes(size = count), alpha = 1/3) +
-  geom_smooth(se = FALSE)
-```
-
-## Ungrouping
-
-
-If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
-
-```{r ungroup, eval=T, message=FALSE, cache=T}
-flights %>% 
-  group_by(year, month, day) %>% 
-  ungroup() %>%
-  summarise(delay = mean(dep_delay, na.rm = TRUE))
-```
-
-## Grouping challenges
-
-- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
-- Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
-
-
-## Grouping challenges
-
-- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
-
-```{r grouping_challenges_a, eval=F, message=FALSE, cache=T}
-flights %>% 
-  mutate(
-    canceled = is.na(dep_time) | is.na(arr_time)
-  ) %>% 
-  filter(canceled) %>% 
-  mutate(wday = strftime(time_hour,'%A')) %>% 
-  group_by(wday) %>% 
-  summarise(
-    cancel_day = n()
-  ) %>%
-  ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
-  geom_col()
-```
-
-**5_b**
-
-## Grouping challenges
-
-- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
-
-```{r grouping_challenges_b, eval=T, echo = F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
-flights %>% 
-  mutate(
-    canceled = is.na(dep_time) | is.na(arr_time)
-  ) %>% 
-  mutate(wday = strftime(time_hour,'%A')) %>% 
-  group_by(wday) %>% 
-  summarise(
-    cancel_day = n()
-  ) %>%
-  ggplot(mapping = aes(x = wday, y = cancel_day)) +
-  geom_col()
-```
-
-## Grouping challenges
-
-- Which carrier has the worst delays?
-
-```{r grouping_challenges_c, eval=F, echo = T, message=FALSE, cache=T}
-flights %>% 
-  group_by(carrier) %>% 
-  summarise(
-    carrier_delay = mean(arr_delay, na.rm = T)
-  ) %>%
-  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
-  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
-  geom_col(alpha = 0.5)
-```
-
-**5_c**
-
-## Grouping challenges
-
-- Which carrier has the worst delays?
-
-```{r grouping_challenges_d, eval=T, echo = F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
-flights %>% 
-  group_by(carrier) %>% 
-  summarise(
-    carrier_delay = mean(arr_delay, na.rm = T)
-  ) %>%
-  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
-  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
-  geom_col(alpha = 0.5)
-```
-
-## Grouped mutates (and filters)
-
-Grouping is also useful in conjunction with `mutate()` and `filter()`
-
-- Find all groups bigger than a threshold:
-- Standardise to compute per group metrics:
-
-```{r group_filter, eval=F}
-flights %>% 
-  group_by(dest, year) %>% 
-  filter(n() > 10000) %>% 
-  filter(arr_delay > 0) %>% 
-  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
-  select(year:day, dest, arr_delay, prop_delay)
-```
-
-## Goup by challenges
-
-- What time of day should you fly if you want to avoid delays as much as possible?
-
-```{r group_filter_b, eval=T, echo = F, warning=F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
-flights %>% 
-  group_by(hour) %>% 
-  summarise(
-    mean_delay = mean(arr_delay, na.rm = T),
-    sd_delay = sd(arr_delay, na.rm = T),
-  ) %>% 
-  ggplot() +
-  geom_errorbar(mapping = aes(
-    x = hour,
-    ymax = mean_delay + sd_delay,
-    ymin = mean_delay - sd_delay)) +
-  geom_point(mapping = aes(
-    x = hour,
-    y = mean_delay,
-  ))
-```
- **5_d**
\ No newline at end of file
-- 
GitLab