-
Laurent Modolo authoredLaurent Modolo authored
- Grouped summaries with summarise()
- The power of summarise() with group_by()
- Challenge with summarise() and group_by()
- Challenge with summarise() and group_by()
- Missing values
- Counts
- Challenge with summarise() and group_by()
- Challenge with summarise() and group_by()
- Challenge with summarise() and group_by()
- Ungrouping
- Grouping challenges
- Grouping challenges
- Grouping challenges
- Grouping challenges
- Grouping challenges
- Grouped mutates (and filters)
- Goup by challenges
- Goup by challenges
title: "R#5: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "28 Nov 2019"
output:
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
slidy_presentation:
highlight: tango
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
summarise()
Grouped summaries with summarise()
collapses a data frame to a single row:
library(nycflights13)
library(tidyverse)
flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
summarise()
with group_by()
The power of This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr
verbs on a grouped data frame they’ll be automatically applied “by group”.
flights %>%
group_by(year, month, day) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
5_a
summarise()
and group_by()
Challenge with Imagine that we want to explore the relationship between the distance and average delay for each location. here are three steps to prepare this data:
- Group flights by destination.
- Summarise to compute distance, average delay, and number of flights.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
flights %>%
group_by(dest)
summarise()
and group_by()
Challenge with Imagine that we want to explore the relationship between the distance and average delay for each location.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
Missing values
You may have wondered about the na.rm argument we used above. What happens if we don’t set it?
flights %>%
group_by(dest) %>%
summarise(
dist = mean(distance),
delay = mean(arr_delay)
)
Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
Counts
Whenever you do any aggregation, it’s always a good idea to include either a count (n()
). That way you can check that you’re not drawing conclusions based on very small amounts of data.
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
summarise()
and group_by()
Challenge with Imagine that we want to explore the relationship between the distance and average delay for each location.
- Summarise to compute distance, average delay, and number of flights.
- Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL")
summarise()
and group_by()
Challenge with Imagine that we want to explore the relationship between the distance and average delay for each location.
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL") %>%
ggplot(mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
5_b
summarise()
and group_by()
Challenge with flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "HNL") %>%
ggplot(mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) +
geom_smooth(se = FALSE)
Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use ungroup()
.
flights %>%
group_by(year, month, day) %>%
ungroup() %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
Grouping challenges
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (
strftime(x,'%A')
give you the name of the day from a POSIXct date) - Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about
flights %>% group_by(carrier, dest) %>% summarise(n())
)
Grouping challenges
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (
strftime(x,'%A')
give you the name of the day from a POSIXct date)
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
summarise(
cancel_day = n()
) %>%
ggplot(mapping = aes(x = wday, y = cancel_day)) +
geom_col()
5_b
Grouping challenges
- Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (
strftime(x,'%A')
give you the name of the day from a POSIXct date)
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
mutate(wday = strftime(time_hour,'%A')) %>%
group_by(wday) %>%
summarise(
cancel_day = n()
) %>%
ggplot(mapping = aes(x = wday, y = cancel_day)) +
geom_col()
Grouping challenges
- Which carrier has the worst delays?
flights %>%
group_by(carrier) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T)
) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_col(alpha = 0.5)
5_c
Grouping challenges
- Which carrier has the worst delays?
flights %>%
group_by(carrier) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T)
) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_col(alpha = 0.5)
Grouped mutates (and filters)
Grouping is also useful in conjunction with mutate()
and filter()
- Find all groups bigger than a threshold:
- Standardise to compute per group metrics:
flights %>%
group_by(dest, year) %>%
filter(n() > 10000) %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, dest, arr_delay, prop_delay)
Goup by challenges
- What time of day should you fly if you want to avoid delays as much as possible?
flights %>%
group_by(hour) %>%
summarise(
mean_delay = mean(arr_delay, na.rm = T),
sd_delay = sd(arr_delay, na.rm = T),
) %>%
ggplot() +
geom_errorbar(mapping = aes(
x = hour,
ymax = mean_delay + sd_delay,
ymin = mean_delay - sd_delay)) +
geom_point(mapping = aes(
x = hour,
y = mean_delay,
))
5_d
Goup by challenges
- What time of day should you fly if you want to avoid delays as much as possible?
flights %>%
group_by(hour) %>%
summarise(
mean_delay = mean(arr_delay, na.rm = T),
sd_delay = sd(arr_delay, na.rm = T),
) %>%
ggplot() +
geom_errorbar(mapping = aes(
x = hour,
ymax = mean_delay + sd_delay,
ymin = mean_delay - sd_delay)) +
geom_point(mapping = aes(
x = hour,
y = mean_delay,
))