The goal of this practical is to practices data transformation with `tidyverse`.
The objectives of this session will be to:
- Filter rows with `filter()`
- Arrange rows with `arrange()`
- Select columns with `select()`
- Add new variables with `mutate()`
- Combining multiple operations with the pipe `%>%`
```R
install.packages("nycflights13")
```
```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
library("nycflights13")
```
\
# Data set : nycflights13
`nycflights13::flights`contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in `?flights`
```{r display_data, include=TRUE}
flights
```
- **int** stands for integers.
- **dbl** stands for doubles, or real numbers.
- **chr** stands for character vectors, or strings.
- **dttm** stands for date-times (a date + a time).
- **lgl** stands for logical, vectors that contain only TRUE or FALSE.
- **fctr** stands for factors, which R uses to represent categorical variables with fixed possible values.
- **date** stands for dates.
\
# Filter rows with `filter()`
`filter()` allows you to subset observations based on their values.
```{r filter_month_day, include=TRUE}
filter(flights, month == 1, day == 1)
```
\
`dplyr` functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `<-`
```{r filter_month_day_sav, include=TRUE}
jan1 <- filter(flights, month == 1, day == 1)
```
\
R either prints out the results, or saves them to a variable.
```{r filter_month_day_sav_display, include=TRUE}
(dec25 <- filter(flights, month == 12, day == 25))
```
\
# Logical operators
Multiple arguments to `filter()` are combined with “and”: every expression must be true in order for a row to be included in the output.
flights_sml <- mutate(flights_sml, gain = dep_delay - arr_delay,
speed = distance / air_time * 60)
```
\
### Useful creation functions
- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
- Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. There is also `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`
\
# Combining multiple operations with the pipe
\
We don't want to create useless intermediate variables so we can use the pipe operator: `%>%`
( or `ctrl + shift + M`).
<div id="pquestion"> - Find the 10 most delayed flights using a ranking function. `min_rank()` </div>
Behind the scenes, `x %>% f(y)` turns into `f(x, y)`, and `x %>% f(y) %>% g(z)` turns into `g(f(x, y), z)` and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
\
Working with the pipe is one of the key criteria for belonging to the `tidyverse`. The only exception is `ggplot2`: it was written before the pipe was discovered. Unfortunately, the next iteration of `ggplot2`, `ggvis`, which does use the pipe, isn’t quite ready for prime time yet.
# Grouped summaries with `summarise()`
`summarise()` collapses a data frame to a single row:
```{r load_data, include=TRUE}
flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))
```
### The power of `summarise()` with `group_by()`
This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the `dplyr` verbs on a grouped data frame they’ll be automatically applied “by group”.
You may have wondered about the na.rm argument we used above. What happens if we don’t set it?
```{r summarise_group_by_NA, include=TRUE}
flights %>%
group_by(dest) %>%
summarise(
dist = mean(distance),
delay = mean(arr_delay)
)
```
Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
# Counts
Whenever you do any aggregation, it’s always a good idea to include either a count (`n()`). That way you can check that you’re not drawing conclusions based on very small amounts of data.
```{r summarise_group_by_count, include = TRUE, warning=F, message=F, fig.width=8, fig.height=3.5}
Another useful dplyr filtering helper is `between()`. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
How many flights have a missing `dep_time`? What other variables are missing? What might these rows represent?
Why is `NA ^ 0` not `NA`? Why is `NA | TRUE` not `NA`? Why is `FALSE & NA` not `NA`? Can you figure out the general rule? (`NA * 0` is a tricky counter-example!)
### Arrange challenges :
- Sort flights to find the most delayed flights. Find the flights that left earliest.
- Sort flights to find the fastest flights.
- Which flights traveled the longest? Which traveled the shortest?
### Select challenges :
- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
- What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
```{r select_one_of, eval=F, message=F, cache=T}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```
- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
- Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.