---
title: "R#5: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "28 Nov 2019"
output:
  slidy_presentation:
    highlight: tango
  beamer_presentation:
    theme: metropolis
    slide_level: 3
    fig_caption: no
    df_print: tibble
    highlight: tango
    latex_engine: xelatex
---

```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
```

## Grouped summaries with `summarise()`

`summarise()` collapses a data frame to a single row:

```{r load_data, eval=T, message=FALSE, cache=T}
library(nycflights13)
library(tidyverse)
flights %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```

## The power of `summarise()` with `group_by()`

This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the `dplyr` verbs on a grouped data frame they’ll be automatically applied “by group”.


```{r summarise_group_by, eval=T, message=FALSE, cache=T}
flights %>% 
  group_by(year, month, day) %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```

**5_a**

## Challenge with `summarise()` and `group_by()`

Imagine that we want to explore the relationship between the distance and average delay for each location. 
here are three steps to prepare this data: 

- Group flights by destination.
- Summarise to compute distance, average delay, and number of flights.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.

```{r summarise_group_by_ggplot_a, eval = F}
flights %>% 
  group_by(dest)
```

## Challenge with `summarise()` and `group_by()`

Imagine that we want to explore the relationship between the distance and average delay for each location. 

- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.

```{r summarise_group_by_ggplot_b, eval = F}
flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )
```

## Missing values

You may have wondered about the na.rm argument we used above. What happens if we don’t set it?

```{r summarise_group_by_NA, cache = TRUE, fig.width=8, fig.height=4.5, message = FALSE}
flights %>% 
  group_by(dest) %>% 
  summarise(
    dist = mean(distance),
    delay = mean(arr_delay)
  )
```

Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.

## Counts

Whenever you do any aggregation, it’s always a good idea to include either a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.

```{r summarise_group_by_count, cache = TRUE, fig.width=8, fig.height=4.5, message = FALSE}
flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  )
```

## Challenge with `summarise()` and `group_by()`

Imagine that we want to explore the relationship between the distance and average delay for each location. 

- Summarise to compute distance, average delay, and number of flights.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.

```{r summarise_group_by_ggplot_c, eval = F}
flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")
```

## Challenge with `summarise()` and `group_by()`

Imagine that we want to explore the relationship between the distance and average delay for each location. 

```{r summarise_group_by_ggplot_d, eval = F}
flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL") %>% 
  ggplot(mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
```

**5_b**

## Challenge with `summarise()` and `group_by()`

```{r summarise_group_by_ggplot, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL") %>% 
  ggplot(mapping = aes(x = dist, y = delay)) +
  geom_point(aes(size = count), alpha = 1/3) +
  geom_smooth(se = FALSE)
```

## Ungrouping


If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.

```{r ungroup, eval=T, message=FALSE, cache=T}
flights %>% 
  group_by(year, month, day) %>% 
  ungroup() %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
```

## Grouping challenges

- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
- Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)


## Grouping challenges

- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)

```{r grouping_challenges_a, eval=F, message=FALSE, cache=T}
flights %>% 
  mutate(
    canceled = is.na(dep_time) | is.na(arr_time)
  ) %>% 
  filter(canceled) %>% 
  mutate(wday = strftime(time_hour,'%A')) %>% 
  group_by(wday) %>% 
  summarise(
    cancel_day = n()
  ) %>%
  ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
  geom_col()
```

**5_b**

## Grouping challenges

- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)

```{r grouping_challenges_b, eval=T, echo = F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>% 
  mutate(
    canceled = is.na(dep_time) | is.na(arr_time)
  ) %>% 
  mutate(wday = strftime(time_hour,'%A')) %>% 
  group_by(wday) %>% 
  summarise(
    cancel_day = n()
  ) %>%
  ggplot(mapping = aes(x = wday, y = cancel_day)) +
  geom_col()
```

## Grouping challenges

- Which carrier has the worst delays?

```{r grouping_challenges_c, eval=F, echo = T, message=FALSE, cache=T}
flights %>% 
  group_by(carrier) %>% 
  summarise(
    carrier_delay = mean(arr_delay, na.rm = T)
  ) %>%
  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
  geom_col(alpha = 0.5)
```

**5_c**

## Grouping challenges

- Which carrier has the worst delays?

```{r grouping_challenges_d, eval=T, echo = F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>% 
  group_by(carrier) %>% 
  summarise(
    carrier_delay = mean(arr_delay, na.rm = T)
  ) %>%
  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
  geom_col(alpha = 0.5)
```

## Grouped mutates (and filters)

Grouping is also useful in conjunction with `mutate()` and `filter()`

- Find all groups bigger than a threshold:
- Standardise to compute per group metrics:

```{r group_filter, eval=F}
flights %>% 
  group_by(dest, year) %>% 
  filter(n() > 10000) %>% 
  filter(arr_delay > 0) %>% 
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
  select(year:day, dest, arr_delay, prop_delay)
```

## Goup by challenges

- What time of day should you fly if you want to avoid delays as much as possible?

```{r group_filter_b, eval=T, echo = F, warning=F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>% 
  group_by(hour) %>% 
  summarise(
    mean_delay = mean(arr_delay, na.rm = T),
    sd_delay = sd(arr_delay, na.rm = T),
  ) %>% 
  ggplot() +
  geom_errorbar(mapping = aes(
    x = hour,
    ymax = mean_delay + sd_delay,
    ymin = mean_delay - sd_delay)) +
  geom_point(mapping = aes(
    x = hour,
    y = mean_delay,
  ))
```
 **5_d**