Skip to content
Snippets Groups Projects
title: "R#4: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "08 Nov 2019"
output:
  beamer_presentation:
    theme: metropolis
    slide_level: 3
    fig_caption: no
    df_print: tibble
    highlight: tango
    latex_engine: xelatex
  slidy_presentation:
    highlight: tango
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)

R#4: data transformation

The goal of this practical is to practices data transformation with tidyverse. The objectives of this session will be to:

  • Filter rows with filter()
  • Arrange rows with arrange()
  • Select columns with select()
  • Add new variables with mutate()
  • Combining multiple operations with the pipe %>%

nycflights13

nycflights13::flightscontains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights

library(nycflights13)
library(tidyverse)

nycflights13

flights
  • int stands for integers.
  • dbl stands for doubles, or real numbers.
  • chr stands for character vectors, or strings.
  • dttm stands for date-times (a date + a time).
  • lgl stands for logical, vectors that contain only TRUE or FALSE.
  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
  • date stands for dates.

Filter rows with filter()

filter() allows you to subset observations based on their values.

filter(flights, month == 1, day == 1)

Filter rows with filter()

dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-

jan1 <- filter(flights, month == 1, day == 1)

R either prints out the results, or saves them to a variable.

(dec25 <- filter(flights, month == 12, day == 25))

Logical operators

Multiple arguments to filter() are combined with “and”: every expression must be true in order for a row to be included in the output.

knitr::include_graphics('img/transform-logical.png')

Logical operators

Test the following operations:

filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)

Missing values

One important feature of R that can make comparison tricky are missing values, or NAs (“not availables”).

NA > 5
10 == NA
NA + 10
NA / 2

Missing values

NA == NA
is.na(NA)

Filter challenges

Find all flights that:

  • Had an arrival delay of two or more hours
  • Were operated by United, American, or Delta
  • Departed between midnight and 6am (inclusive)

Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

Why is NA ^ 0 not NA? Why is NA | TRUE not NA? Why is FALSE & NA not NA? Can you figure out the general rule? (NA * 0 is a tricky counter-example!)

Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order.

arrange(flights, year, month, day)

Use desc() to re-order by a column in descending order:

arrange(flights, desc(dep_delay))

Missing values are always sorted at the end:

arrange(tibble(x = c(5, 2, NA)), x)
arrange(tibble(x = c(5, 2, NA)), desc(x))

Arrange challenges

  • Sort flights to find the most delayed flights. Find the flights that left earliest.
  • Sort flights to find the fastest flights.
  • Which flights traveled the longest? Which traveled the shortest?

Select columns with select()

select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))

Select columns with select()

here are a number of helper functions you can use within select():

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
  • num_range("x", 1:3): matches x1, x2 and x3.

See ?select for more details.

Select challenges

  • Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
  • What does the one_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
  • Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))

Add new variables with mutate()

It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)

4_a

Add new variables with mutate()

You can refer to columns that you’ve just created:

mutate(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

Useful creation functions

  • Offsets: lead() and lag() allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).
  • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.
  • Logical comparisons, <, <=, >, >=, !=, and ==
  • Ranking: there are a number of ranking functions, but you should start with min_rank(). There is also row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()

Mutate challenges

  • Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

\pause

mutate(
  flights,
  dep_time = (dep_time %/% 100) * 60 +
    dep_time %% 100,
  sched_dep_time = (sched_dep_time %/% 100) * 60 +
    sched_dep_time %% 100
)

4_b

Mutate challenges

  • Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

\pause

mutate(
  flights,
  dep_time = (dep_time %/% 100) * 60 + 
    dep_time %% 100,
  sched_dep_time = (sched_dep_time %/% 100) * 60 +
    sched_dep_time %% 100
)

4_c

Mutate challenges

  • Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for min_rank()

\pause

flights_md <- mutate(flights, most_delay = min_rank(desc(dep_delay)))
filter(flights_md, most_delay < 10)

4_d

Combining multiple operations with the pipe

We don't want to create useless intermediate variables so we can use the pipe operator: %>% (ctrl + shift + M).

flights_md <- mutate(flights,
                     most_delay = min_rank(desc(dep_delay)))
flights_md <- filter(flights_md, most_delay < 10)
flights_md <- arrange(flights_md, most_delay)

Combining multiple operations with the pipe

We don't want to create useless intermediate variables so we can use the pipe operator: %>% (ctrl + shift + M).

flights %>%
  mutate(most_delay = min_rank(desc(dep_delay))) %>% 
  filter(most_delay < 10) %>% 
  arrange(most_delay)

Combining multiple operations with the pipe

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.

You can access the transmitted variables with .

flights %>%
  mutate(most_delay = min_rank(desc(dep_delay))) %>% 
  filter(., most_delay < 10) %>% 
  arrange(., most_delay)

Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.