-
Laurent Modolo authoredLaurent Modolo authored
- R#4: data transformation
- nycflights13
- nycflights13
- Filter rows with filter()
- Filter rows with filter()
- Logical operators
- Logical operators
- Missing values
- Missing values
- Filter challenges
- Arrange rows with arrange()
- Arrange challenges
- Select columns with select()
- Select columns with select()
- Select challenges
- Add new variables with mutate()
- Add new variables with mutate()
- Useful creation functions
- Mutate challenges
- Mutate challenges
- Mutate challenges
- Combining multiple operations with the pipe
- Combining multiple operations with the pipe
- Combining multiple operations with the pipe
title: "R#4: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "08 Nov 2019"
output:
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
slidy_presentation:
highlight: tango
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
R#4: data transformation
The goal of this practical is to practices data transformation with tidyverse
.
The objectives of this session will be to:
- Filter rows with
filter()
- Arrange rows with
arrange()
- Select columns with
select()
- Add new variables with
mutate()
- Combining multiple operations with the pipe
%>%
nycflights13
nycflights13::flights
contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights
library(nycflights13)
library(tidyverse)
nycflights13
flights
- int stands for integers.
- dbl stands for doubles, or real numbers.
- chr stands for character vectors, or strings.
- dttm stands for date-times (a date + a time).
- lgl stands for logical, vectors that contain only TRUE or FALSE.
- fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
- date stands for dates.
filter()
Filter rows with filter()
allows you to subset observations based on their values.
filter(flights, month == 1, day == 1)
filter()
Filter rows with dplyr
functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-
jan1 <- filter(flights, month == 1, day == 1)
R either prints out the results, or saves them to a variable.
(dec25 <- filter(flights, month == 12, day == 25))
Logical operators
Multiple arguments to filter()
are combined with “and”: every expression must be true in order for a row to be included in the output.
knitr::include_graphics('img/transform-logical.png')
Logical operators
Test the following operations:
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
Missing values
One important feature of R that can make comparison tricky are missing values, or NA
s (“not availables”).
NA > 5
10 == NA
NA + 10
NA / 2
Missing values
NA == NA
is.na(NA)
Filter challenges
Find all flights that:
- Had an arrival delay of two or more hours
- Were operated by United, American, or Delta
- Departed between midnight and 6am (inclusive)
Another useful dplyr filtering helper is between()
. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
How many flights have a missing dep_time
? What other variables are missing? What might these rows represent?
Why is NA ^ 0
not NA
? Why is NA | TRUE
not NA
? Why is FALSE & NA
not NA
? Can you figure out the general rule? (NA * 0
is a tricky counter-example!)
arrange()
Arrange rows with arrange()
works similarly to filter()
except that instead of selecting rows, it changes their order.
arrange(flights, year, month, day)
Use desc()
to re-order by a column in descending order:
arrange(flights, desc(dep_delay))
Missing values are always sorted at the end:
arrange(tibble(x = c(5, 2, NA)), x)
arrange(tibble(x = c(5, 2, NA)), desc(x))
Arrange challenges
- Sort flights to find the most delayed flights. Find the flights that left earliest.
- Sort flights to find the fastest flights.
- Which flights traveled the longest? Which traveled the shortest?
select()
Select columns with select()
allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
select()
Select columns with here are a number of helper functions you can use within select()
:
-
starts_with("abc")
: matches names that begin with “abc”. -
ends_with("xyz")
: matches names that end with “xyz”. -
contains("ijk")
: matches names that contain “ijk”. -
matches("(.)\\1")
: selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings. -
num_range("x", 1:3)
: matchesx1
,x2
andx3
.
See ?select
for more details.
Select challenges
- Brainstorm as many ways as possible to select
dep_time
,dep_delay
,arr_time
, andarr_delay
fromflights
. - What does the
one_of()
function do? Why might it be helpful in conjunction with this vector?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
mutate()
Add new variables with It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate()
.
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
4_a
mutate()
Add new variables with You can refer to columns that you’ve just created:
mutate(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
Useful creation functions
- Offsets:
lead()
andlag()
allow you to refer to leading or lagging values. This allows you to compute running differences (e.g.x - lag(x)
) or find when values change (x != lag(x)
). - Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes:
cumsum()
,cumprod()
,cummin()
,cummax()
; and dplyr providescummean()
for cumulative means. - Logical comparisons,
<
,<=
,>
,>=
,!=
, and==
- Ranking: there are a number of ranking functions, but you should start with
min_rank()
. There is alsorow_number()
,dense_rank()
,percent_rank()
,cume_dist()
,ntile()
Mutate challenges
- Currently
dep_time
andsched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
\pause
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
4_b
Mutate challenges
- Compare
dep_time
,sched_dep_time
, anddep_delay
. How would you expect those three numbers to be related?
\pause
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
4_c
Mutate challenges
- Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for
min_rank()
\pause
flights_md <- mutate(flights, most_delay = min_rank(desc(dep_delay)))
filter(flights_md, most_delay < 10)
4_d
Combining multiple operations with the pipe
We don't want to create useless intermediate variables so we can use the pipe operator: %>%
(ctrl + shift + M
).
flights_md <- mutate(flights,
most_delay = min_rank(desc(dep_delay)))
flights_md <- filter(flights_md, most_delay < 10)
flights_md <- arrange(flights_md, most_delay)
Combining multiple operations with the pipe
We don't want to create useless intermediate variables so we can use the pipe operator: %>%
(ctrl + shift + M
).
flights %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(most_delay < 10) %>%
arrange(most_delay)
Combining multiple operations with the pipe
Behind the scenes, x %>% f(y)
turns into f(x, y)
, and x %>% f(y) %>% g(z)
turns into g(f(x, y), z)
and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
You can access the transmitted variables with .
flights %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(., most_delay < 10) %>%
arrange(., most_delay)
Working with the pipe is one of the key criteria for belonging to the tidyverse
. The only exception is ggplot2
: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2
, ggvis
, which does use the pipe, isn’t quite ready for prime time yet.