session_5.Rmd
title: "R#5: Piping and grouping"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "2022"
output:
  rmdformats::downcute:
    self_contained: true
    use_bookdown: true
    default_style: "light"
    lightbox: true
    css: "../www/style_Rmd.css"
library(fontawesome)
r fa(name = "fas fa-house", fill = "grey", height = "1em") https://can.gitbiopages.ens-lyon.fr/R_basis/
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
klippy::klippy(
  position = c('top', 'right'),
  color = "white",
  tooltip_message = 'Click to copy',
  tooltip_success = 'Copied !')
Introduction
The goal of this practical is to practice combining data transformations with the tidyverse.
The objectives of this session will be to:
- Combine multiple operations with the pipe %>%
- Work on subgroups of the data with group_by
Solution
```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
library("nycflights13")
```
Combining multiple operations with the pipe
Solution
```{r pipe_example_a, include=TRUE}
flights_md <- mutate(flights, most_delay = min_rank(desc(dep_delay)))
flights_md <- filter(flights_md, most_delay < 10)
flights_md <- arrange(flights_md, most_delay)
```
We don't want to create useless intermediate variables, so we can use the pipe operator %>% (RStudio keyboard shortcut: Ctrl + Shift + M).
Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
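As a minimal sketch of this equivalence, the example below checks that the nested and the piped forms return the same result. The tibble df and its column x are hypothetical toy data, not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
df <- tibble(x = c(3, 1, 2))

# Nested form: g(f(x, y), z)
nested <- arrange(filter(df, x > 1), x)

# Piped form: x %>% f(y) %>% g(z)
piped <- df %>%
  filter(x > 1) %>%
  arrange(x)

# Both forms give the same tibble
identical(nested, piped)
```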
Solution
```{r pipe_example_b, include=TRUE}
flights_md2 <- flights %>%
  mutate(most_delay = min_rank(desc(dep_delay))) %>%
  filter(most_delay < 10) %>%
  arrange(most_delay)
```
Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered and uses + instead of %>%. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.
When not to use the pipe
The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:
- Your pipes are longer than (say) ten steps. In that case, create intermediate objects with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or splits the results.
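The first point can be sketched as follows: instead of one long pipe, each stage is stored under a meaningful name so that it can be inspected on its own. The tibble trips and the names trips_clean and trips_ranked are hypothetical and only serve the illustration.

```r
library(dplyr)

# Hypothetical toy data standing in for a larger table
trips <- tibble(
  carrier = c("AA", "AA", "UA", "UA"),
  delay   = c(10, NA, 5, 25)
)

# Step 1: cleaning, kept as a named intermediate result that can be checked
trips_clean <- trips %>%
  filter(!is.na(delay))

# Step 2: ranking, built from the checked intermediate result
trips_ranked <- trips_clean %>%
  mutate(rank = min_rank(desc(delay))) %>%
  arrange(rank)

trips_ranked
```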
Grouping variable
The summarise() function collapses a data frame to a single row.
Check the difference between summarise() and mutate() with the following commands:
flights %>%
  mutate(delay = mean(dep_delay, na.rm = TRUE))
flights %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
Where mutate() computes the mean of the whole dep_delay column and repeats that same value on every row (which is not useful), summarise() returns the mean as a single row.
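A minimal sketch on a hypothetical four-row tibble (toy is not part of nycflights13) makes the difference in output shape visible: mutate() keeps every row, while summarise() returns a single one.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(dep_delay = c(2, 4, NA, 6))

# mutate() keeps all rows and adds the same mean to each of them
with_mutate <- toy %>%
  mutate(delay = mean(dep_delay, na.rm = TRUE))

# summarise() collapses the table to one row
with_summarise <- toy %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))

nrow(with_mutate)     # 4
nrow(with_summarise)  # 1
```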
The power of summarise() with group_by()
The group_by() function changes the unit of analysis from the complete dataset to individual groups.
Individual groups are defined by categorical variables or factors.
Then, when you use the functions you already know on a grouped data frame, they’ll be automatically applied by group.
You can use the following code to compute the average delay per month.
flights_delay <- flights %>%
  group_by(year, month) %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>%
  arrange(month)
ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
  geom_bar(stat="identity", color="black", fill = "#619CFF") +
  geom_errorbar(mapping = aes( ymin=0, ymax=delay+sd)) +
  theme(axis.text.x = element_blank())
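To see what group_by() changes, here is a small sketch on hypothetical toy data (the tibble toy is an assumption made for the example). The same summarise() call gives one row for the whole table, or one row per group once the table is grouped.

```r
library(dplyr)

# Hypothetical toy data: two flights in each of two months
toy <- tibble(
  month     = c(1, 1, 2, 2),
  dep_delay = c(10, 20, 0, 4)
)

# Without grouping: a single mean for the whole table
overall <- toy %>%
  summarise(delay = mean(dep_delay))

# With grouping: one mean per month
by_month <- toy %>%
  group_by(month) %>%
  summarise(delay = mean(dep_delay))

overall$delay   # 8.5
by_month$delay  # 15 2
```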
Missing values
flights %>%
  group_by(dest) %>%
  summarise(
    dist = mean(distance),
    delay = mean(arr_delay)
  )
Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
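The rule can be checked on a short hypothetical vector (delays is made up for the example). The na.rm = TRUE argument of mean() drops the missing values before computing.

```r
# Hypothetical delays with one missing value
delays <- c(5, NA, 10)

# Any missing value in the input makes the result missing
mean_with_na <- mean(delays)

# na.rm = TRUE removes missing values before aggregating
mean_without_na <- mean(delays, na.rm = TRUE)

mean_with_na     # NA
mean_without_na  # 7.5
```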
Counts
Whenever you do any aggregation, it’s always a good idea to include a count (n()). That way you can check that you’re not drawing conclusions based on very small amounts of data.
summ_delay_flights <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(dest != "HNL") %>%
  filter(delay < 40 & delay > -20)
ggplot(data = summ_delay_flights, mapping = aes(x = dist, y = delay, size = count)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme(legend.position='none')
- Group flights by destination.
- Summarise to compute distance, average delay, and number of flights using n().
- Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport.
- Filter to remove noisy points with a delay above 40 or below -20.
- Create a mapping on dist, delay and count as size.
- Use the layers geom_point() and geom_smooth().
- We can hide the legend with the layer theme(legend.position='none').
Solution
```{r summarise_group_by_count_b, include = T, eval=F, warning=F, message=F, fig.width=8, fig.height=3.5}
flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(dest != "HNL") %>%
  filter(delay < 40 & delay > -20) %>%
  ggplot(mapping = aes(x = dist, y = delay, size = count)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme(legend.position='none')
```
Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
flights %>%
  group_by(year, month, day) %>%
  ungroup() %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
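As a small sketch, dplyr's group_vars() can be used to confirm that the grouping is gone after ungroup(). The tibble toy is hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  month     = c(1, 1, 2),
  dep_delay = c(10, 20, 30)
)

grouped <- toy %>% group_by(month)
group_vars(grouped)    # "month"

ungrouped <- grouped %>% ungroup()
group_vars(ungrouped)  # character(0)

# On ungrouped data, summarise() returns a single row again
overall <- ungrouped %>%
  summarise(delay = mean(dep_delay))
overall$delay          # 20
```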
Grouping challenges
First challenge
Look at the number of canceled flights per day. Is there a pattern?
Remember to always try to decompose complex questions into smaller, simpler problems:
- What are canceled flights?
- How can I create a canceled flights variable?
- We need to define the day of the week wday variable (strftime(x,'%A') gives you the name of the day from a POSIXct date).
- We can count the number of canceled flights (cancel_day) by day of the week (wday).
- We can pipe the transformed and filtered tibble into a ggplot function.
- We can use geom_col to have a barplot of the number of cancel_day for each wday.
- You can use the function fct_reorder() to reorder the wday by number of cancel_day and make the plot easier to read.
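As a hint for the wday step, here is a minimal sketch of strftime() on a single hypothetical POSIXct value (x is made up for the example). The name returned by '%A' depends on the language of the current locale, so the sketch also shows '%u', which returns the weekday number.

```r
# Hypothetical date-time; 2013-01-01 was a Tuesday
x <- as.POSIXct("2013-01-01 05:00:00", tz = "UTC")

# Weekday name, written in the language of the current locale
day_name <- strftime(x, "%A", tz = "UTC")

# Weekday number as text: "1" is Monday, "7" is Sunday
day_number <- strftime(x, "%u", tz = "UTC")

day_name
day_number  # "2"
```

The tz argument is passed explicitly here so that the result does not depend on the time zone of the R session.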
Solution
```{r grouping_challenges_a, eval=T, message=FALSE, cache=T}
flights %>%
  mutate(
    canceled = is.na(dep_time) | is.na(arr_time)
  ) %>%
  filter(canceled) %>%
  mutate(wday = strftime(time_hour,'%A')) %>%
  group_by(wday) %>%
  summarise(
    cancel_day = n()
  ) %>%
  ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
  geom_col()
```
Second challenge
Solution
```{r grouping_challenges_b1, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
  mutate(
    canceled = is.na(dep_time) | is.na(arr_time)
  ) %>%
  mutate(wday = strftime(time_hour,'%A')) %>%
  group_by(wday) %>%
  mutate(
    prop_cancel_day = sum(canceled)/n(),
    av_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  ggplot(mapping = aes(x = av_delay, y = prop_cancel_day, color = wday)) +
  geom_point()
```
Which day would you prefer to book a flight?
Solution
```{r grouping_challenges_b2, eval=T, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
  mutate(
    canceled = is.na(dep_time) | is.na(arr_time)
  ) %>%
  mutate(wday = strftime(time_hour,'%A')) %>%
  group_by(day) %>%
  mutate(
    prop_cancel_day = sum(canceled)/sum(!canceled),
    av_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  group_by(wday) %>%
  summarize(
    mean_cancel_day = mean(prop_cancel_day, na.rm = TRUE),
    sd_cancel_day = sd(prop_cancel_day, na.rm = TRUE),
    mean_av_delay = mean(av_delay, na.rm = TRUE),
    sd_av_delay = sd(av_delay, na.rm = TRUE)
  ) %>%
  ggplot(mapping = aes(x = mean_av_delay, y = mean_cancel_day, color = wday)) +
  geom_point() +
  geom_errorbarh(mapping = aes(
    xmin = -sd_av_delay + mean_av_delay,
    xmax = sd_av_delay + mean_av_delay
  )) +
  geom_errorbar(mapping = aes(
    ymin = -sd_cancel_day + mean_cancel_day,
    ymax = sd_cancel_day + mean_cancel_day
  ))
```
Solution
```{r group_filter_b3, eval=T, warning=F, message=FALSE, cache=T, fig.width=8, fig.height=3.5}
flights %>%
  group_by(hour) %>%
  summarise(
    mean_delay = mean(arr_delay, na.rm = T),
    sd_delay = sd(arr_delay, na.rm = T)
  ) %>%
  ggplot() +
  geom_errorbar(mapping = aes(
    x = hour,
    ymax = mean_delay + sd_delay,
    ymin = mean_delay - sd_delay)) +
  geom_point(mapping = aes(
    x = hour,
    y = mean_delay
  ))
```
Third challenge
Solution
```{r grouping_challenges_c2, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
  group_by(carrier) %>%
  summarise(
    carrier_delay = mean(arr_delay, na.rm = T)
  ) %>%
  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
  geom_col(alpha = 0.5)
```
Solution
```{r grouping_challenges_c1, eval=F, echo = T, message=FALSE, cache=T}
flights %>%
  group_by(carrier, dest) %>%
  summarise(
    carrier_delay = mean(arr_delay, na.rm = T),
    number_of_flight = n()
  ) %>%
  # summarise() leaves the result grouped by carrier, and mutate() cannot
  # modify a grouping variable, so the grouping is removed first
  ungroup() %>%
  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
  geom_boxplot() +
  geom_jitter(height = 0)
```