Skip to content
Snippets Groups Projects

Compare revisions

Changes are shown as if the source revision was being merged into the target revision. Learn more about comparing revisions.

Source

Select target project
No results found
Select Git revision
  • main
  • master
2 results

Target

Select target project
  • LBMC/hub/formations/R_basis
  • can/R_basis
2 results
Select Git revision
  • main
  • master
  • quarto-rebuild
3 results
Show changes
Showing
with 2769 additions and 781 deletions
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
---
title: "R#5: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "28 Nov 2019"
output:
slidy_presentation:
highlight: tango
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
---
```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
```
## Grouped summaries with `summarise()`
`summarise()` collapses a data frame to a single row:
```{r load_data, eval=T, message=FALSE, cache=T}
library(nycflights13)
library(tidyverse)
flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
## The power of `summarise()` with `group_by()`
This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the `dplyr` verbs on a grouped data frame they’ll be automatically applied “by group”.
```{r summarise_group_by, eval=T, message=FALSE, cache=T}
flights %>%
group_by(year, month, day) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
**5_a**
## Challenge with `summarise()` and `group_by()`
Imagine that we want to explore the relationship between the distance and average delay for each location.
here are three steps to prepare this data:
- Group flights by destination.
- Summarise to compute distance, average delay, and number of flights.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
```{r summarise_group_by_ggplot_a, eval = F}
flights %>%
group_by(dest)
```
## Challenge with `summarise()` and `group_by()`
Imagine that we want to explore the relationship between the distance and average delay for each location.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
```{r summarise_group_by_ggplot_b, eval = F}
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
```
## Missing values
You may have wondered about the na.rm argument we used above. What happens if we don’t set it?
```{r summarise_group_by_NA, cache = TRUE, fig.width=8, fig.height=4.5, message = FALSE}
flights %>%
group_by(dest) %>%
summarise(
dist = mean(distance),
delay = mean(arr_delay)
)
```
Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
## Counts
Whenever you do any aggregation, it’s always a good idea to include either a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
```{r summarise_group_by_count, cache = TRUE, fig.width=8, fig.height=4.5, message = FALSE}
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
```
## Challenge with `summarise()` and `group_by()`
Imagine that we want to explore the relationship between the distance and average delay for each location.
- Summarise to compute distance, average delay, and number of flights.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
```{r summarise_group_by_ggplot_c, eval = F}
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
```
## Challenge with `summarise()` and `group_by()`
Imagine that we want to explore the relationship between the distance and average delay for each location.
```{r summarise_group_by_ggplot_d, eval = F}
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL") %>%
ggplot(mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
```
**5_b**
## Challenge with `summarise()` and `group_by()`
```{r summarise_group_by_ggplot, echo = FALSE, cache = TRUE, fig.width=8, fig.height=4.5, message=FALSE}
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL") %>%
ggplot(mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
```
## Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use `ungroup()`.
```{r ungroup, eval=T, message=FALSE, cache=T}
flights %>%
group_by(year, month, day) %>%
ungroup() %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
## Grouping challenges
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
- Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about `flights %>% group_by(carrier, dest) %>% summarise(n())`)
## Grouping challenges
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
```{r grouping_challenges_a, eval=F, message=FALSE, cache=T}
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
filter(canceled) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
summarise(
cancel_day = n()
) %>%
ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
geom_col()
```
**5_b**
## Grouping challenges
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (`strftime(x,'%A')` give you the name of the day from a POSIXct date)
```{r grouping_challenges_b, eval=T, echo = F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
summarise(
cancel_day = n()
) %>%
ggplot(mapping = aes(x = wday, y = cancel_day)) +
geom_col()
```
## Grouping challenges
- Which carrier has the worst delays?
```{r grouping_challenges_c, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
group_by(carrier) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T)
) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_col(alpha = 0.5)
```
**5_c**
## Grouping challenges
- Which carrier has the worst delays?
```{r grouping_challenges_d, eval=T, echo = F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
group_by(carrier) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T)
) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_col(alpha = 0.5)
```
## Grouped mutates (and filters)
Grouping is also useful in conjunction with `mutate()` and `filter()`
- Find all groups bigger than a threshold:
- Standardise to compute per group metrics:
```{r group_filter, eval=F}
flights %>%
group_by(dest, year) %>%
filter(n() > 10000) %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
```
## Goup by challenges
- What time of day should you fly if you want to avoid delays as much as possible?
```{r group_filter_b, eval=T, echo = F, warning=F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
group_by(hour) %>%
summarise(
mean_delay = mean(arr_delay, na.rm = T),
sd_delay = sd(arr_delay, na.rm = T),
) %>%
ggplot() +
geom_errorbar(mapping = aes(
x = hour,
ymax = mean_delay + sd_delay,
ymin = mean_delay - sd_delay)) +
geom_point(mapping = aes(
x = hour,
y = mean_delay,
))
```
**5_d**
\ No newline at end of file
session_6/img/overview_joins.png

50.5 KiB

session_6/img/overview_set.png

11.5 KiB

session_6/img/pivot_longer.png

21.1 KiB

session_6/img/pivot_wider.png

21.7 KiB

This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
FROM rocker/tidyverse
RUN apt-get update \
&& apt-get install -y \
libxt6 \
cargo
RUN Rscript -e "install.packages('rmdformats')"
#session 1
RUN Rscript -e "install.packages('rvest')"
RUN Rscript -e "install.packages('remotes'); remotes::install_github('rlesur/klippy')"
#session 3
RUN Rscript -e "install.packages('gganimate')"
RUN Rscript -e "install.packages('gifski')"
RUN Rscript -e "install.packages('openxlsx')"
#session4
RUN Rscript -e "install.packages(c('ghibli', 'nycflights13','viridis','ggrepel'))"
\ No newline at end of file
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
This diff is collapsed.
File moved
This diff is collapsed.