Newer
Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
---
title: "R.4: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "dark"
lightbox: true
css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
---
```{r setup, include=FALSE}
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
```
# Introduction
The goal of this practical is to practice data transformation with `tidyverse`.
The objectives of this session will be to:
- Filter rows with `filter()`
- Arrange rows with `arrange()`
- Select columns with `select()`
- Add new variables with `mutate()`
<div class="pencadre">
For this session we are going to work with a new dataset included in the `nycflights13` package.
Install this package and load it.
As usual you will also need the `tidyverse` library.
</div>
<details><summary>Solution</summary>
<p>
```R
install.packages("nycflights13")
```
```{r packageloaded, include=TRUE, message=FALSE}
library("tidyverse")
library("nycflights13")
```
</p>
</details>
## Data set : nycflights13
`nycflights13::flights`Contains all 336,776 flights that departed from New York City in 2013.
The data comes from the US Bureau of Transportation Statistics, and is documented in `?flights`
```{r display_data, include=TRUE}
flights
```
## Data type
In programming languages, all variables are not equal.
When you display a `tibble` you can see the **type** of a column.
Here is a list of common variable **types** that you will encounter
- **int** stands for integers.
- **dbl** stands for doubles or real numbers.
- **chr** stands for character vectors or strings.
- **dttm** stands for date-times (a date + a time).
- **lgl** stands for logical, vectors that contain only `TRUE` or `FALSE`.
- **fctr** stands for factors, which R uses to represent categorical variables with fixed possible values.
- **date** stands for dates.
You cannot add an **int** to a **chr**, but you can add an **int** to a **dbl** the results will be a **dbl**.
# `filter` rows
Variable **types** are important to keep in mind for comparisons.
The `filter()` function allows you to subset observations based on their values.
<div class="pencadre">
What is the results of the following `filter` command ?
```{r filter_month_day, include=TRUE, eval=FALSE}
filter(flights, month == 1, day == 1)
```
</div>
`dplyr` functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, `<-`
<div class="pencadre">
Save the previous command in a `jan1` variable
</div
<details><summary>Solution</summary>
<p>
```{r filter_month_day_sav, include=TRUE}
jan1 <- filter(flights, month == 1, day == 1)
```
</p>
</details>
<div class="pencadre">
R either prints out the results, or saves them to a variable.
What happens when you put your variable assignment code between parenthesis `(` `)` ?
```{r filter_month_day_sav_display, eval=FALSE}
(dec25 <- filter(flights, month == 12, day == 25))
```
</div>
## Logical operators
Multiple arguments to `filter()` are combined with **AND**: every expression must be `TRUE` in order for a row to be included in the output.
In R you can use the symbols `&`, `|`, `!` and the function `xor()` to build other kinds of tests.

<div class="pencadre">
Test the following operations:
```{r filter_logical_operators_a, eval=FALSE}
filter(flights, month == 11 | month == 12)
```
```{r filter_logical_operators_b, eval=FALSE}
filter(flights, month %in% c(11, 12))
```
```{r filter_logical_operators_c, eval=FALSE}
filter(flights, !(arr_delay > 120 | dep_delay > 120))
```
```{r filter_logical_operators_d, eval=FALSE}
filter(flights, arr_delay <= 120, dep_delay <= 120)
```
</div>
Combinations of logical operators is a powerful programmatic way to select subset of data.
Keep in mind, however, that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of one long one.
## Missing values
One important feature of R that can make comparison tricky is missing values, or `NA`s for **Not Availables**.
Indeed each of the variable type can contain either a value of this type (i.e., `2` for an **int**) or nothing.
The *nothing recorded in a variable* status is represented with the `NA` symbol.
As operations with `NA` values don t make sense, if you have `NA` somewhere in your operation, the results will be `NA`
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
```{r filter_logical_operators_NA, include=TRUE}
NA > 5
10 == NA
NA + 10
```
However, you can test for `NA`s with the function `is.na()`:
```{r filter_logical_operators_test_NA, include=TRUE}
is.na(NA)
```
`filter()` only includes rows where the condition is `TRUE`; it excludes both `FALSE` and `NA` values. If you want to preserve missing values, ask for them explicitly:
```{r filter_logical_operators_test_NA2, include=TRUE}
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
```
## Challenges
<div class="pencadre">
Find all flights that:
- Had an arrival delay of two or more hours (you can check `?flights`)
- Flew to Houston (IAH or HOU)
</div>
<details><summary>Solution</summary>
<p>
```{r filter_chalenges_a, eval=TRUE}
filter(flights, arr_delay >= 60 | arr_delay <= 120)
```
```{r filter_chalenges_b, eval=TRUE}
filter(flights, dest %in% c("IAH", "HOU"))
```
</p>
</details>
<div class="pencadre">
How many flights have a missing `dep_time` ?
</div>
<details><summary>Solution</summary>
<p>
```{r filter_chalenges_c, eval=TRUE}
filter(flights, is.na(dep_time))
```
</p>
</details>
<div class="pencadre">
Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)
</div>
<details><summary>Solution</summary>
<p>
```{r filter_chalenges_d, eval=TRUE}
NA ^ 0 # ^ 0 is always 1 it's an arbitrary rule not a computation
NA | TRUE # if a member of a OR operation is TRUE the results is TRUE
FALSE & NA # if a member of a AN operation is FALSE the results is TRUE
NA * 0 # here we have a true computation
```
</p>
</details>
# Arrange rows with `arrange()`
`arrange()` works similarly to `filter()` except that instead of selecting rows, it changes their order.
```{r arrange_ymd, include=TRUE}
arrange(flights, year, month, day)
```
<div class="pencadre">
Use `desc()` to reorder by a column in descending order:
</div>
<details><summary>Solution</summary>
<p>
```{r arrange_desc, include=TRUE}
arrange(flights, desc(dep_delay))
```
</p>
</details>
## Missing values
Missing values are always sorted at the end:
```{r arrange_NA, include=TRUE}
arrange(tibble(x = c(5, 2, NA)), x)
arrange(tibble(x = c(5, 2, NA)), desc(x))
```
## Challenges
<div class="pencadre">
- Find the most delayed flight.
- Find the flight that left earliest.
- How could you arrange all missing values to the start ?
</div>
<details><summary>Solution</summary>
<p>
Find the most delayed flight.
```{r chalange_arrange_desc_a, include=TRUE}
arrange(flights, desc(dep_delay))
```
Find the flight that left earliest.
```{r chalange_arrange_desc_b, include=TRUE}
arrange(flights, dep_delay)
```
How could you arrange all missing values to the start
```{r chalange_arrange_desc_c, include=TRUE}
arrange(tibble(x = c(5, 2, NA)), desc(is.na(x)))
```
</p>
</details>
# Select columns with `select()`
`select()` allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
You can select by column names
```{r select_ymd_a, , include=TRUE}
select(flights, year, month, day)
```
By defining a range of columns
```{r select_ymd_b, , include=TRUE}
select(flights, year:day)
```
Or you can do a negative (`-`) to remove columns.
```{r select_ymd_c, , include=TRUE}
select(flights, -(year:day))
```
## Helper functions
here are a number of helper functions you can use within `select()`:
- `starts_with("abc")`: matches names that begin with `"abc"`.
- `ends_with("xyz")`: matches names that end with `"xyz"`.
- `contains("ijk")`: matches names that contain `"ijk"`.
- `num_range("x", 1:3)`: matches `x1`, `x2` and `x3`.
See `?select` for more details.
## Challenges
<div class="pencadre">
- Brainstorm as many ways as possible to select `dep_time`, `dep_delay`, `arr_time`, and `arr_delay` from `flights`.
<details><summary>Solution</summary>
<p>
```{r challenge_select_a, eval=FALSE}
select(flights, contains("time") | contains("delay"))
select(flights, contains("_") & !starts_with("sched") & !starts_with("time"))
```
</p>
</details>
- What does the `one_of()` function do? Why might it be helpful in conjunction with this vector?
```{r select_one_of, eval=T, message=F, cache=T}
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
```
<details><summary>Solution</summary>
<p>
```{r challenge_select_b, eval=FALSE}
select(flights, one_of(vars))
```
</p>
</details>
- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
```{r select_contains, eval=F, message=F, cache=T}
select(flights, contains("TIME"))
```
<details><summary>Solution</summary>
<p>
```{r challenge_select_c, eval=FALSE}
select(flights, contains("TIME", ignore.case = FALSE))
```
</p>
</details>
</div>
# Add new variables with `mutate()`
It’s often useful to add new columns that are functions of existing columns. That’s the job of `mutate()`.
<div class="pencadre">
First let s create a smaller dataset to work on `flights_sml` that contains
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
- columns from `year` to `day`
- columns that ends with `delays`
- the `distance` and `air_time` columns
</div>
<details><summary>Solution</summary>
<p>
```{r mutate, include=TRUE}
(flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time))
```
</p>
</details>
## `mutate()`
```R
mutate(tbl, new_var_a = opperation_a, ..., new_var_n = opperation_n)
```
`mutate()` allows you to add new columns (`new_var_a`, ... , `new_var_n`) and to fill them with the results of an operation.
We can create a `gain` column to check if the pilot managed to compensate is departure delay
```{r mutate_gain}
mutate(flights_sml, gain = dep_delay - arr_delay)
```
<div class="pencadre">
Using `mutate` add a new column `gain` and `speed` that contains the average speed of the plane to the `flights_sml` tibble.
</div>
<details><summary>Solution</summary>
<p>
```{r mutate_reuse, include=TRUE}
flights_sml <- mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
```
</details>
</p>
<div class="pencadre">
Currently `dep_time` and `sched_dep_time` are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of the number of minutes since midnight.
</div>
<details><summary>Solution</summary>
<p>
```{r mutate_challenges_a, eval=F, message=F, cache=T}
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
```
</details>
</p>
## Useful creation functions
- Offsets: `lead()` and `lag()` allow you to refer to leading or lagging values. This allows you to compute running differences (e.g. `x - lag(x)`) or find when values change (`x != lag(x)`).
- Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: `cumsum()`, `cumprod()`, `cummin()`, `cummax()`; and dplyr provides `cummean()` for cumulative means.
- Logical comparisons, `<`, `<=`, `>`, `>=`, `!=`, and `==`
- Ranking: there are a number of ranking functions, but you should start with `min_rank()`. There is also `row_number()`, `dense_rank()`, `percent_rank()`, `cume_dist()`, `ntile()`
## See you in [R#5: Pipping and grouping](http://perso.ens-lyon.fr/laurent.modolo/R/session_5/)
# To go further: Data transformation and color sets.
There are a number of color palettes available in R, thanks to different packages such as `RColorBrewer`, `Viridis` or `Ghibli`.
We will use them here to decorate our graphs, either on data already studied in the training, `mpg`, or on more specialized data such as lists of differentially expressed genes ( [GSE86356](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86356) )
```{r install_colorPal, eval=F}
install.packages(c("ghibli", "RColorBrewer", "viridis"))
```
```{r load_library}
library(tidyverse)
library(RColorBrewer)
library(ghibli)
library(viridis)
```
Using `mpg` and the 'ggplot2' package, reproduce the graph studied in session 2, 3.1: color mapping.
Modify the colors representing the class of cars with the palettes `Dark2` of [RColorBrewer](https://www.datanovia.com/en/fr/blog/palette-de-couleurs-rcolorbrewer-de-a-a-z/), then `MononokeMedium` from [Ghibli](https://github.com/ewenme/ghibli).
```{r mpg_color}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()
```
Go to the links to find the appropriate function: they are very similar between the two packages.
<details><summary>Solution</summary>
<p>
```{r mpg_color1}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_color_brewer(palette = "Dark2")
```
```{r mpg_color2}
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_colour_ghibli_d("MononokeMedium")
```
</p>
</details>
The choice of colors is very important for the comprehension of a graphic. Some palettes are not suitable for everyone. For example, for people with color blindness, color gradients from green to red, or from yellow to blue should be avoided.
To display only Brewer palettes that are colorblind friendly, specify the option `colorblindFriendly = TRUE` as follows:
```{r colorblindBrewer}
display.brewer.all(colorblindFriendly = TRUE)
```
`viridis` package provide a series of color maps that are designed to improve graph readability for readers with common forms of color blindness and/or color vision deficiency.
For the next part, we will use a real data set. Anterior tibial muscle tissue was collected from 20 patients, with or without confirmed myotonic dystrophy type 1 (DM1). Illumina RNAseq was performed on these samples and the sequencing data are available on GEO with the identifier GSE86356.
First, we will use the gene count table of these samples, formatted for use in ggplot2 ( `pivot_longer()` [function](https://tidyr.tidyverse.org/reference/pivot_longer.html) ).
Open the csv file using the `read_csv2()` function. The file is located at "http://perso.ens-lyon.fr/laurent.modolo/R/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv".
<details><summary>Solution</summary>
<p>
```{r read_csv1}
expr_DM1 <- read_csv2("http://perso.ens-lyon.fr/laurent.modolo/R/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv")
With this tibble, use `ggplot2` and the `geom_tile()` function to make a heatmap.
Fit the samples on the x-axis and the genes on the y-axis.
**Tip:** Transform the counts into log10(x + 1) for a better visualization.
<details><summary>Solution</summary>
<p>
ggplot(expr_DM1, aes(samples, Genes, fill= log1p(counts))) +
geom_tile() +
labs(y="Genes", x = "Samples") +
theme(
axis.text.y = element_text(size= 4),
axis.text.x = element_text(size = 4, angle = 90)
)
**Nota bene :** The elements of the axes, and the theme in general, are modified in the `theme()` function.
With the default color gradient, even with the transformation, the heatmap is difficult to study.
R interprets a large number of colors, indicated in RGB, hexadimal, or just by name. For example :
<center>
{width=400px}
</center>
With `scale_fill_gradient2()` function, change the colors of the gradient, taking white for the minimum value and 'springgreen4' for the maximum value.
<details><summary>Solution</summary>
<p>
```{r heatmapGreen}
ggplot(expr_DM1, aes(samples, Genes, fill= log1p(counts))) +
geom_tile() +
scale_fill_gradient2(low = "white", high = "springgreen4") +
labs(y="Genes", x = "Samples") +
theme(
axis.text.y = element_text(size= 4),
axis.text.x = element_text(size = 4, angle = 90)
)
```
</p>
</details>
It s better, but still not perfect!
Now let s use the [viridis color gradient](https://gotellilab.github.io/GotelliLabMeetingHacks/NickGotelli/ViridisColorPalette.html) for this graph.
<details><summary>Solution</summary>
<p>
```{r heatmapViridis}
ggplot(expr_DM1, aes(samples, Genes, fill= log1p(counts))) +
geom_tile() +
scale_fill_viridis_c() +
labs(y="Genes", x = "Samples") +
theme(
axis.text.y = element_text(size= 4),
axis.text.x = element_text(size = 4, angle = 90)
)
```
</p>
</details>
For this last exercise, we will use the results of the differential gene expression analysis between DM1 vs WT conditions.
Open the csv file using the `read_csv2()` function. The file is located at "http://perso.ens-lyon.fr/laurent.modolo/R/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv".
<details><summary>Solution</summary>
<p>
```{r read_csv2}
tab <- read_csv2("http://perso.ens-lyon.fr/laurent.modolo/R/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv")
To make a Volcano plot, displaying different information about the significativity of the variation thanks to the colors, we will have to make a series of modifications on this table.
With `mutate()` and `ifelse()` [fonctions](https://dplyr.tidyverse.org/reference/if_else.html), we will have to create :
- a column 'sig' : it indicates if the gene is significant ( TRUE or FALSE ).
**Thresholds :** baseMean > 20 and padj < 0.05 and abs(log2FoldChange) >= 1.5
- a column 'UpDown' : it indicates if the gene is Up regulated or Down regulated.
<details><summary>Solution</summary>
<p>
```{r sig}
tab.sig <- tab %>%
mutate(sig = baseMean > 20 & padj < 0.05 & abs(log2FoldChange) >= 1.5 ) %>%
mutate(UpDown = ifelse(
baseMean > 20 & padj < 0.05 & log2FoldChange >= 1.5,
"Up",
ifelse(
baseMean > 20 & padj < 0.05 & log2FoldChange <= -1.5,
"Down",
"NO"
)))
tab.sig
We want to see the top10 DEGs on the graph. For this, we will use the package `ggrepel`.
Install and load the `ggrepl` package.
<details><summary>Solution</summary>
<p>
```{r ggrepel}
install.packages("ggrepel")
library(ggrepel)
```
Let s **filter** our table into a new variable, top10, to keep only the top 10 according to the adjusted pvalue. The **smaller** the adjusted pvalue, the more significant.
**Tips :** You can use the [function](https://dplyr.tidyverse.org/reference/slice.html) `slice_min()`
<details><summary>Solution</summary>
<p>
```{r top10}
top10 <- tab.sig %>%
filter(sig == TRUE) %>%
slice_min(n = 10, padj)
```
</p>
</details>
The data is ready to be used to make a volcano plot!
To make the graph below, use `ggplot2`, the functions `geom_point()`, `geom_hline()`, `geom_vline()`, `theme_minimal()`, `theme()` (to remove the legend), `geom_label_repel()` and the function `scale_color_manual()` for the colors.
</div>
**Tips 1 :** Don t forget the transformation of the adjusted pvalue.
**Tips 3 :** `geom_label_repel()` function needs a new parameter 'data' and 'label' in aes parameters.
ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) +
geom_point() +
scale_color_manual(values=c("steelblue", "lightgrey", "firebrick" )) +
geom_hline(yintercept=-log10(0.05), col="black") +
geom_vline(xintercept=c(-1.5, 1.5), col="black") +
theme_minimal() +
theme(
legend.position="none"
) +
labs(y="-log10(p-value)", x = "log2(FoldChange)") +
geom_label_repel(data = top10, mapping = aes(label = gene_symbol))
```{r VolcanoPlotSolut, echo = TRUE, results = 'hide'}
ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) +
geom_point() +
scale_color_manual(values=c("steelblue", "lightgrey", "firebrick" )) +
geom_hline(yintercept=-log10(0.05), col="black") +
geom_vline(xintercept=c(-1.5, 1.5), col="black") +
theme_minimal() +
theme(
legend.position="none"
) +
labs(y="-log10(p-value)", x = "log2(FoldChange)") +
geom_label_repel(data = top10, mapping = aes(label = gene_symbol))