Skip to content
Snippets Groups Projects
Unverified Commit f4879e7b authored by Laurent Modolo's avatar Laurent Modolo
Browse files

add session 4 slides

parent 80b4b54a
No related branches found
No related tags found
No related merge requests found
session_4/img/transform-logical.png

70.2 KiB

---
title: "R#4: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "08 Nov 2019"
output:
slidy_presentation:
highlight: tango
beamer_presentation:
theme: metropolis
slide_level: 3
fig_caption: no
df_print: tibble
highlight: tango
latex_engine: xelatex
---
```{r setup, include=FALSE, cache=TRUE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
```
## R#4: data transformation
The goal of this practical is to practices data transformation with `tidyverse`.
The objectives of this session will be to:
- Filter rows with `filter()`
- Arrange rows with `arrange()`
- Select columns with `select()`
- Add new variables with `mutate()`
- Combining multiple operations with the pipe `%>%`
## **nycflights13**
`nycflights13::flights`contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in `?flights`
```{r load_data, eval=T, message=FALSE, cache=T}
library(nycflights13)
library(tidyverse)
```
## **nycflights13**
```{r display_data, eval=F, message=FALSE, cache=T}
flights
```
- **int** stands for integers.
- **dbl** stands for doubles, or real numbers.
- **chr** stands for character vectors, or strings.
- **dttm** stands for date-times (a date + a time).
- **lgl** stands for logical, vectors that contain only TRUE or FALSE.
- **fctr** stands for factors, which R uses to represent categorical variables with fixed possible values.
- **date** stands for dates.
## Filter rows with `filter()`
`filter()` allows you to subset observations based on their values.
```{r filter_month_day, eval=T, message=T, cache=T}
filter(flights, month == 1, day == 1)
```
## Filter rows with `filter()`
`dplyr` functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `<-`
```{r filter_month_day_sav, eval=T, message=F, cache=T}
jan1 <- filter(flights, month == 1, day == 1)
```
R either prints out the results, or saves them to a variable.
```{r filter_month_day_sav_display, eval=T, message=F, cache=T}
(dec25 <- filter(flights, month == 12, day == 25))
```
## Logical operators
Multiple arguments to `filter()` are combined with “and”: every expression must be true in order for a row to be included in the output.
```{r logical_operator, echo=FALSE, out.width='100%'}
knitr::include_graphics('img/transform-logical.png')
```
## Logical operators
Test the following operations:
```{r filter_logical_operators, eval=T, message=F, cache=T}
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
```
## Missing values
One important feature of R that can make comparison tricky are missing values, or `NA`s (“not availables”).
```{r filter_logical_operators_NA, eval=T, message=T, cache=T}
NA > 5
10 == NA
NA + 10
NA / 2
```
## Missing values
```{r filter_logical_operators_test_NA, eval=T, message=T, cache=T}
NA == NA
is.na(NA)
```
## Filter challenges
Find all flights that:
- Had an arrival delay of two or more hours
- Were operated by United, American, or Delta
- Departed between midnight and 6am (inclusive)
Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?
Why is `NA ^ 0` not `NA`? Why is `NA | TRUE` not `NA`? Why is `FALSE & NA` not `NA`? Can you figure out the general rule? (`NA * 0` is a tricky counter-example!)
## Arrange rows with `arrange()`
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
```{r arrange_ymd, eval=F, message=F, cache=T}
arrange(flights, year, month, day)
```
Use `desc()` to re-order by a column in descending order:
```{r arrange_desc, eval=F, message=F, cache=T}
arrange(flights, desc(dep_delay))
```
Missing values are always sorted at the end:
```{r arrange_NA, eval=F, message=F, cache=T}
arrange(tibble(x = c(5, 2, NA)), x)
arrange(tibble(x = c(5, 2, NA)), desc(x))
```
## Arrange challenges
- Sort flights to find the most delayed flights. Find the flights that left earliest.
- Sort flights to find the fastest flights.
- Which flights traveled the longest? Which traveled the shortest?
## Select columns with `select()`
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
```{r select_ymd, eval=F, message=F, cache=T}
select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
```
## Select columns with `select()`
here are a number of helper functions you can use within `select()`:
- `starts_with("abc")`: matches names that begin with “abc”.
- `ends_with("xyz")`: matches names that end with “xyz”.
- `contains("ijk")`: matches names that contain “ijk”.
- `matches("(.)\\1")`: selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
See `?select` for more details.
## Select challenges
- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
- What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
```{r select_one_of, eval=F, message=F, cache=T}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```
- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
```{r select_contains, eval=F, message=F, cache=T}
select(flights, contains("TIME"))
```
## Add new variables with `mutate()`
It’s often useful to add new columns that are functions of existing columns. That’s the job of `mutate()`.
```{r mutate, eval=F, message=F, cache=T}
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
```
**4_a**
## Add new variables with `mutate()`
You can refer to columns that you’ve just created:
```{r mutate_reuse, eval=F, message=F, cache=T}
mutate(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
)
```
## Useful creation functions
- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
- Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. There is also `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`
## Mutate challenges
- Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
\pause
```{r mutate_challenges_a, eval=F, message=F, cache=T}
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
```
**4_b**
## Mutate challenges
- Compare `dep_time`, `sched_dep_time`, and `dep_delay`. How would you expect those three numbers to be related?
\pause
```{r mutate_challenge_b, eval=F, message=F, cache=T}
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
```
**4_c**
## Mutate challenges
- Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for `min_rank()`
\pause
```{r mutate_challenge_c, eval=F, message=F, cache=T}
flights_md <- mutate(flights, most_delay = min_rank(desc(dep_delay)))
filter(flights_md, most_delay < 10)
```
**4_d**
## Combining multiple operations with the pipe
We don't want to create useless intermediate variables so we can use the pipe opperator: `%>%`
(`ctrl + shift + M`).
```{r pipe_example_a, eval=F, message=F, cache=T}
flights_md <- mutate(flights,
most_delay = min_rank(desc(dep_delay)))
flights_md <- filter(flights_md, most_delay < 10)
flights_md <- arrange(flights_md, most_delay)
```
## Combining multiple operations with the pipe
We don't want to create useless intermediate variables so we can use the pipe opperator: `%>%`
(`ctrl + shift + M`).
```{r pipe_example_b, eval=F, message=F, cache=T}
flights %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(most_delay < 10) %>%
arrange(most_delay)
```
## Combining multiple operations with the pipe
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
You can access the transmited variables with `.`
```{r pipe_example_c, eval=F, message=F, cache=T}
flights %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(., most_delay < 10) %>%
arrange(., most_delay)
```
Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet.
\ No newline at end of file
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
\ No newline at end of file
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
\ No newline at end of file
flights_md <- mutate(flights, most_delay = min_rank(desc(dep_delay)))
filter(flights_md, most_delay < 10)
\ No newline at end of file
Source diff could not be displayed: it is too large. Options to address this: view the blob.
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment