Skip to content
Snippets Groups Projects
session_4.Rmd 12.4 KiB
Newer Older
Laurent Modolo's avatar
Laurent Modolo committed
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412
---
title: "R.4: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
  rmdformats::downcute:
    self_contain: true
    use_bookdown: true
    default_style: "dark"
    lightbox: true
    css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
---

```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
  position = c('top', 'right'),
  color = "white",
  tooltip_message = 'Click to copy',
  tooltip_success = 'Copied !')
```

# Introduction

The goal of this practical is to practice data transformation with `tidyverse`.
The objectives of this session will be to:

- Filter rows with `filter()`
- Arrange rows with `arrange()`
- Select columns with `select()`
- Add new variables with `mutate()`

<div class="pencadre">
For this session we are going to work with a new dataset included in the `nycflights13` package.
Install this package and load it.
As usual you will also need the `tidyverse` library.
</div>

<details><summary>Solution</summary>
<p>
```R
install.packages("nycflights13")
```

```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
library("nycflights13")
```
</p>
</details>
 
## Data set : nycflights13

`nycflights13::flights`Contains all 336,776 flights that departed from New York City in 2013.
The data comes from the US Bureau of Transportation Statistics, and is documented in `?flights`

```{r display_data, include=TRUE}
flights
```

## Data type

In programming languages, all variables are not equal.
When you display a `tibble` you can see the **type** of a column.
Here is a list of common variable **types** that you will encounter

- **int** stands for integers.
- **dbl** stands for doubles or real numbers.
- **chr** stands for character vectors or strings.
- **dttm** stands for date-times (a date + a time).
- **lgl** stands for logical, vectors that contain only `TRUE` or `FALSE`.
- **fctr** stands for factors, which R uses to represent categorical variables with fixed possible values.
- **date** stands for dates.

You cannot add an **int** to a **chr**, but you can add an **int** to a **dbl** the results will be a **dbl**.
 
# `filter` rows

Variable **types** are important to keep in mind for comparisons.
The `filter()` function allows you to subset observations based on their values. 

<div class="pencadre">

What is the results of the following `filter` command ?

```{r filter_month_day, include=TRUE, eval=FALSE}
filter(flights, month == 1, day == 1)
```
</div>
 
`dplyr` functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `<-`

<div class="pencadre">
Save the previous command in a `jan1` variable
</div
<details><summary>Solution</summary>
<p>
```{r filter_month_day_sav, include=TRUE}
jan1 <- filter(flights, month == 1, day == 1)
```
</p>
</details>
 
<div class="pencadre">
R either prints out the results, or saves them to a variable.
What happens when you put your variable assignment code between parenthesis `(` `)` ?

```{r filter_month_day_sav_display, eval=FALSE}
(dec25 <- filter(flights, month == 12, day == 25))
```
</div>
 
## Logical operators

Multiple arguments to `filter()` are combined with **AND**: every expression must be `TRUE` in order for a row to be included in the output.

In R you can use the symbols `&`, `|`, `!` and the function `xor()` to build other kinds of tests.

![](./img/transform-logical.png)
<div class="pencadre">
Test the following operations:

```{r filter_logical_operators_a, eval=FALSE}
filter(flights, month == 11 | month == 12)
```
```{r filter_logical_operators_b, eval=FALSE}
filter(flights, month %in% c(11, 12))
```
```{r filter_logical_operators_c, eval=FALSE}
filter(flights, !(arr_delay > 120 | dep_delay > 120))
```
```{r filter_logical_operators_d, eval=FALSE}
filter(flights, arr_delay <= 120, dep_delay <= 120)
```
</div>

Combinations of logical operators is a powerful programmatic way to select subset of data.
Keep in mind, however, that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of one long one.

## Missing values

One important feature of R that can make comparison tricky is missing values, or `NA`s for **Not Availables**.
Indeed each of the variable type can contain either a value of this type (i.e., `2` for an **int**) or nothing.
The *nothing recorded in a variable* status is represented with the `NA` symbol.

As operations with `NA` values don't make sense, if you have `NA` somewhere in your operation, the results will be `NA`

```{r filter_logical_operators_NA, include=TRUE}
NA > 5
10 == NA
NA + 10
```

However, you can test for `NA`s with the function `is.na()`:

```{r filter_logical_operators_test_NA, include=TRUE}
is.na(NA)
```

`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly:

```{r filter_logical_operators_test_NA2, include=TRUE}
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
```

## Challenges

<div class="pencadre">
Find all flights that:
- Had an arrival delay of two or more hours (you can check `?flights`)
- Flew to Houston (IAH or HOU)
</div>

<details><summary>Solution</summary>
<p>
```{r filter_chalenges_a, eval=TRUE}
filter(flights, arr_delay >= 60 | arr_delay <= 120)
```
```{r filter_chalenges_b, eval=TRUE}
filter(flights, dest %in% c("IAH", "HOU"))
```
</p>
</details>

<div class="pencadre">
How many flights have a missing `dep_time` ?
</div>

<details><summary>Solution</summary>
<p>
```{r filter_chalenges_c, eval=TRUE}
filter(flights, is.na(dep_time))
```
</p>
</details>

<div class="pencadre">
Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)
</div>

<details><summary>Solution</summary>
<p>
```{r filter_chalenges_d, eval=TRUE}
NA ^ 0 # ^ 0 is always 1 it's an arbitrary rule not a computation
NA | TRUE # if a member of a OR operation is TRUE the results is TRUE
FALSE & NA # if a member of a AN operation is FALSE the results is TRUE
NA * 0 # here we have a true computation
```
</p>
</details>

# Arrange rows with `arrange()`

`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.

```{r arrange_ymd, include=TRUE}
arrange(flights, year, month, day)
```

<div class="pencadre">
Use `desc()` to reorder by a column in descending order:
</div>

<details><summary>Solution</summary>
<p>
```{r arrange_desc, include=TRUE}
arrange(flights, desc(dep_delay))
```
</p>
</details>

## Missing values

Missing values are always sorted at the end:

```{r arrange_NA, include=TRUE}
arrange(tibble(x = c(5, 2, NA)), x)
arrange(tibble(x = c(5, 2, NA)), desc(x))
```

## Challenges
<div class="pencadre">

- Find the most delayed flight.
- Find the flight that left earliest.
- How could you arrange all missing values to the start ?

</div>

<details><summary>Solution</summary>
<p>
Find the most delayed flight.
```{r chalange_arrange_desc_a, include=TRUE}
arrange(flights, desc(dep_delay))
```
Find the flight that left earliest.
```{r chalange_arrange_desc_b, include=TRUE}
arrange(flights, dep_delay)
```
How could you arrange all missing values to the start
```{r chalange_arrange_desc_c, include=TRUE}
arrange(tibble(x = c(5, 2, NA)), desc(is.na(x)))
```
</p>
</details>


# Select columns with `select()`

`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

You can select by column names
```{r select_ymd_a, , include=TRUE}
select(flights, year, month, day)
```

By defining a range of columns
```{r select_ymd_b, , include=TRUE}
select(flights, year:day)
```

Or you can do a negative (`-`) to remove columns.
```{r select_ymd_c, , include=TRUE}
select(flights, -(year:day))
```

## Helper functions

here are a number of helper functions you can use within `select()`:

- `starts_with("abc")`: matches names that begin with `"abc"`.
- `ends_with("xyz")`: matches names that end with `"xyz"`.
- `contains("ijk")`: matches names that contain `"ijk"`.
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.

See `?select` for more details.

## Challenges

<div class="pencadre">

- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
<details><summary>Solution</summary>
<p>
```{r challenge_select_a, eval=FALSE}
select(flights, contains("time") | contains("delay"))
select(flights, contains("_") & !starts_with("sched") & !starts_with("time"))
```
</p>
</details>
- What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
```{r select_one_of, eval=T, message=F, cache=T}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```
<details><summary>Solution</summary>
<p>
```{r challenge_select_b, eval=FALSE}
select(flights, one_of(vars))
```
</p>
</details>
- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
```{r select_contains, eval=F, message=F, cache=T}
select(flights, contains("TIME"))
```
<details><summary>Solution</summary>
<p>
```{r challenge_select_c, eval=FALSE}
select(flights, contains("TIME", ignore.case = FALSE))
```
</p>
</details>

</div>
 
# Add new variables with `mutate()`

It’s often useful to add new columns that are functions of existing columns. That’s the job of `mutate()`.

<div class="pencadre">
First let's create a smaller dataset to work on `flights_sml` that contains
- columns from `year` to `day`
- columns that ends with `delays`
- the `distance` and `air_time` columns
</div>

<details><summary>Solution</summary>
<p>
```{r mutate, include=TRUE}
(flights_sml <- select(flights,  year:day, ends_with("delay"), distance, air_time))
```
</p>
</details>

## `mutate()`

```R
mutate(tbl, new_var_a = opperation_a, ..., new_var_n = opperation_n)
```
`mutate()` allows you to add new columns (`new_var_a`, ... , `new_var_n`) and to fill them with the results of an operation.

We can create a `gain` column to check if the pilot managed to compensate is departure delay
```{r mutate_gain}
mutate(flights_sml, gain = dep_delay - arr_delay)
```

<div class="pencadre">
Using `mutate` add a new column `gain` and `speed` that contains the average speed of the plane to the `flights_sml` tibble.
</div>

<details><summary>Solution</summary>
<p>
```{r mutate_reuse, include=TRUE}
flights_sml <- mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
```
</details>
</p>


<div class="pencadre">
Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of the number of minutes since midnight.
</div>

<details><summary>Solution</summary>
<p>
```{r mutate_challenges_a, eval=F, message=F, cache=T}
mutate(
  flights,
  dep_time = (dep_time %/% 100) * 60 +
    dep_time %% 100,
  sched_dep_time = (sched_dep_time %/% 100) * 60 +
    sched_dep_time %% 100
)
```
</details>
</p>

## Useful creation functions

- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means. 
- Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. There is also `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`

## See you in [R#5: Pipping and grouping](http://perso.ens-lyon.fr/laurent.modolo/R/session_5/)
hpolvech's avatar
hpolvech committed



# To go further:  Data transformation and color sets.

There are a number of color palettes available in R, thanks to different packages such as `RColorBrewer`, `Viridis` or `Ghibli`. 
We will use them here to decorate our graphs, either on data already studied in the training, `mpg`, or on more specialized data such as lists of differentially expressed genes ( GSE86356 )

```{r install_colorPal, eval=F}
install.packages(c("ghibli", "RColorBrewer", "viridis"))
```