Skip to content
Snippets Groups Projects
Select Git revision
  • b9a4a8e3a0eb1144069db79dd96b15ab239a6f6b
  • main default
  • master
3 results

slides.Rmd

Blame
  • Laurent Modolo's avatar
    7772b8fd
    History
    title: "R#5: data transformation"
    author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
    date: "28 Nov 2019"
    output:
      slidy_presentation:
        highlight: tango
      beamer_presentation:
        theme: metropolis
        slide_level: 3
        fig_caption: no
        df_print: tibble
        highlight: tango
        latex_engine: xelatex
    knitr::opts_chunk$set(echo = FALSE)
    library(tidyverse)

    Grouped summaries with summarise()

    summarise() collapses a data frame to a single row:

    library(nycflights13)
    library(tidyverse)
    flights %>% 
      summarise(delay = mean(dep_delay, na.rm = TRUE))

    The power of summarise() with group_by()

    This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.

    flights %>% 
      group_by(year, month, day) %>% 
      summarise(delay = mean(dep_delay, na.rm = TRUE))

    5_a

    Challenge with summarise() and group_by()

    Imagine that we want to explore the relationship between the distance and average delay for each location. here are three steps to prepare this data:

    • Group flights by destination.
    • Summarise to compute distance, average delay, and number of flights.
    • Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
    flights %>% 
      group_by(dest)

    Challenge with summarise() and group_by()

    Imagine that we want to explore the relationship between the distance and average delay for each location.

    • Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
    flights %>% 
      group_by(dest) %>% 
      summarise(
        count = n(),
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE)
      )

    Missing values

    You may have wondered about the na.rm argument we used above. What happens if we don’t set it?

    flights %>% 
      group_by(dest) %>% 
      summarise(
        dist = mean(distance),
        delay = mean(arr_delay)
      )

    Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.

    Counts

    Whenever you do any aggregation, it’s always a good idea to include either a count (n()). That way you can check that you’re not drawing conclusions based on very small amounts of data.

    flights %>% 
      group_by(dest) %>% 
      summarise(
        count = n(),
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE)
      )

    Challenge with summarise() and group_by()

    Imagine that we want to explore the relationship between the distance and average delay for each location.

    • Summarise to compute distance, average delay, and number of flights.
    • Filter to remove noisy points and Honolulu airport, which is almost twice as far away as the next closest airport.
    flights %>% 
      group_by(dest) %>% 
      summarise(
        count = n(),
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE)
      ) %>% 
      filter(count > 20, dest != "HNL")

    Challenge with summarise() and group_by()

    Imagine that we want to explore the relationship between the distance and average delay for each location.

    flights %>% 
      group_by(dest) %>% 
      summarise(
        count = n(),
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE)
      ) %>% 
      filter(count > 20, dest != "HNL") %>% 
      ggplot(mapping = aes(x = dist, y = delay)) +
      geom_point(aes(size = count), alpha = 1/3) +
      geom_smooth(se = FALSE)

    5_b

    Challenge with summarise() and group_by()

    flights %>% 
      group_by(dest) %>% 
      summarise(
        count = n(),
        dist = mean(distance, na.rm = TRUE),
        delay = mean(arr_delay, na.rm = TRUE)
      ) %>% 
      filter(count > 20, dest != "HNL") %>% 
      ggplot(mapping = aes(x = dist, y = delay)) +
      geom_point(aes(size = count), alpha = 1/3) +
      geom_smooth(se = FALSE)

    Ungrouping

    If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

    flights %>% 
      group_by(year, month, day) %>% 
      ungroup() %>%
      summarise(delay = mean(dep_delay, na.rm = TRUE))

    Grouping challenges

    • Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (strftime(x,'%A') give you the name of the day from a POSIXct date)
    • Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights %>% group_by(carrier, dest) %>% summarise(n()))

    Grouping challenges

    • Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (strftime(x,'%A') give you the name of the day from a POSIXct date)
    flights %>% 
      mutate(
        canceled = is.na(dep_time) | is.na(arr_time)
      ) %>% 
      filter(canceled) %>% 
      mutate(wday = strftime(time_hour,'%A')) %>% 
      group_by(wday) %>% 
      summarise(
        cancel_day = n()
      ) %>%
      ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
      geom_col()

    5_b

    Grouping challenges

    • Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay? (strftime(x,'%A') give you the name of the day from a POSIXct date)
    flights %>% 
      mutate(
        canceled = is.na(dep_time) | is.na(arr_time)
      ) %>% 
      mutate(wday = strftime(time_hour,'%A')) %>% 
      group_by(wday) %>% 
      summarise(
        cancel_day = n()
      ) %>%
      ggplot(mapping = aes(x = wday, y = cancel_day)) +
      geom_col()

    Grouping challenges

    • Which carrier has the worst delays?
    flights %>% 
      group_by(carrier) %>% 
      summarise(
        carrier_delay = mean(arr_delay, na.rm = T)
      ) %>%
      mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
      ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
      geom_col(alpha = 0.5)

    5_c

    Grouping challenges

    • Which carrier has the worst delays?
    flights %>% 
      group_by(carrier) %>% 
      summarise(
        carrier_delay = mean(arr_delay, na.rm = T)
      ) %>%
      mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
      ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
      geom_col(alpha = 0.5)

    Grouped mutates (and filters)

    Grouping is also useful in conjunction with mutate() and filter()

    • Find all groups bigger than a threshold:
    • Standardise to compute per group metrics:
    flights %>% 
      group_by(dest, year) %>% 
      filter(n() > 10000) %>% 
      filter(arr_delay > 0) %>% 
      mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
      select(year:day, dest, arr_delay, prop_delay)

    Goup by challenges

    • What time of day should you fly if you want to avoid delays as much as possible?
    flights %>% 
      group_by(hour) %>% 
      summarise(
        mean_delay = mean(arr_delay, na.rm = T),
        sd_delay = sd(arr_delay, na.rm = T),
      ) %>% 
      ggplot() +
      geom_errorbar(mapping = aes(
        x = hour,
        ymax = mean_delay + sd_delay,
        ymin = mean_delay - sd_delay)) +
      geom_point(mapping = aes(
        x = hour,
        y = mean_delay,
      ))

    5_d