- Introduction
- Data set : nycflights13
- Data type
- filter rows
- Logical operators
- Missing values
- Challenges
- Arrange rows with arrange()
- Missing values
- Challenges
- Select columns with select()
- Helper functions
- Challenges
- Add new variables with mutate()
- mutate()
- Useful creation functions
- See you in R#5: Pipping and grouping
- To go further: Data transformation and color sets.
title: "R.4: data transformation"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
date: "2021"
output:
rmdformats::downcute:
self_contain: true
use_bookdown: true
default_style: "dark"
lightbox: true
css: "http://perso.ens-lyon.fr/laurent.modolo/R/src/style.css"
rm(list=ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
klippy::klippy(
position = c('top', 'right'),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
Introduction
The goal of this practical is to practice data transformation with tidyverse
.
The objectives of this session will be to:
- Filter rows with
filter()
- Arrange rows with
arrange()
- Select columns with
select()
- Add new variables with
mutate()
Solution
```R install.packages("nycflights13") ```
library("tidyverse")
library("nycflights13")
Data set : nycflights13
nycflights13::flights
Contains all 336,776 flights that departed from New York City in 2013.
The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights
flights
Data type
In programming languages, all variables are not equal.
When you display a tibble
you can see the type of a column.
Here is a list of common variable types that you will encounter
- int stands for integers.
- dbl stands for doubles or real numbers.
- chr stands for character vectors or strings.
- dttm stands for date-times (a date + a time).
-
lgl stands for logical, vectors that contain only
TRUE
orFALSE
. - fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
- date stands for dates.
You cannot add an int to a chr, but you can add an int to a dbl the results will be a dbl.
filter
rows
Variable types are important to keep in mind for comparisons.
The filter()
function allows you to subset observations based on their values.
What is the results of the following filter
command ?
filter(flights, month == 1, day == 1)
dplyr
functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-
```{r filter_month_day_sav, include=TRUE} jan1 <- filter(flights, month == 1, day == 1) ```
(dec25 <- filter(flights, month == 12, day == 25))
Logical operators
Multiple arguments to filter()
are combined with AND: every expression must be TRUE
in order for a row to be included in the output.
In R you can use the symbols &
, |
, !
and the function xor()
to build other kinds of tests.
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
Combinations of logical operators is a powerful programmatic way to select subset of data. Keep in mind, however, that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of one long one.
Missing values
One important feature of R that can make comparison tricky is missing values, or NA
s for Not Availables.
Indeed each of the variable type can contain either a value of this type (i.e., 2
for an int) or nothing.
The nothing recorded in a variable status is represented with the NA
symbol.
As operations with NA
values don't make sense, if you have NA
somewhere in your operation, the results will be NA
NA > 5
10 == NA
NA + 10
However, you can test for NA
s with the function is.na()
:
is.na(NA)
filter()
only includes rows where the condition is TRUE
; it excludes both FALSE
and NA
values. If you want to preserve missing values, ask for them explicitly:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
Challenges
Solution
```{r filter_chalenges_a, eval=TRUE} filter(flights, arr_delay >= 60 | arr_delay <= 120) ``` ```{r filter_chalenges_b, eval=TRUE} filter(flights, dest %in% c("IAH", "HOU")) ```
Solution
```{r filter_chalenges_c, eval=TRUE} filter(flights, is.na(dep_time)) ```
Solution
```{r filter_chalenges_d, eval=TRUE} NA ^ 0 # ^ 0 is always 1 it's an arbitrary rule not a computation NA | TRUE # if a member of a OR operation is TRUE the results is TRUE FALSE & NA # if a member of a AN operation is FALSE the results is TRUE NA * 0 # here we have a true computation ```
arrange()
Arrange rows with arrange()
works similarly to filter()
except that instead of selecting rows, it changes their order.
arrange(flights, year, month, day)
Solution
```{r arrange_desc, include=TRUE} arrange(flights, desc(dep_delay)) ```
Missing values
Missing values are always sorted at the end:
arrange(tibble(x = c(5, 2, NA)), x)
arrange(tibble(x = c(5, 2, NA)), desc(x))
Challenges
- Find the most delayed flight.
- Find the flight that left earliest.
- How could you arrange all missing values to the start ?
Solution
Find the most delayed flight. ```{r chalange_arrange_desc_a, include=TRUE} arrange(flights, desc(dep_delay)) ``` Find the flight that left earliest. ```{r chalange_arrange_desc_b, include=TRUE} arrange(flights, dep_delay) ``` How could you arrange all missing values to the start ```{r chalange_arrange_desc_c, include=TRUE} arrange(tibble(x = c(5, 2, NA)), desc(is.na(x))) ```
select()
Select columns with select()
allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
You can select by column names
select(flights, year, month, day)
By defining a range of columns
select(flights, year:day)
Or you can do a negative (-
) to remove columns.
select(flights, -(year:day))
Helper functions
here are a number of helper functions you can use within select()
:
-
starts_with("abc")
: matches names that begin with"abc"
. -
ends_with("xyz")
: matches names that end with"xyz"
. -
contains("ijk")
: matches names that contain"ijk"
. -
num_range("x", 1:3)
: matchesx1
,x2
andx3
.
See ?select
for more details.
Challenges
- Brainstorm as many ways as possible to select
dep_time
,dep_delay
,arr_time
, andarr_delay
fromflights
.
Solution
```{r challenge_select_a, eval=FALSE} select(flights, contains("time") | contains("delay")) select(flights, contains("_") & !starts_with("sched") & !starts_with("time")) ```
Solution
```{r challenge_select_b, eval=FALSE} select(flights, one_of(vars)) ```
Solution
```{r challenge_select_c, eval=FALSE} select(flights, contains("TIME", ignore.case = FALSE)) ```
mutate()
Add new variables with It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate()
.
Solution
```{r mutate, include=TRUE} (flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)) ```
mutate()
mutate(tbl, new_var_a = opperation_a, ..., new_var_n = opperation_n)
mutate()
allows you to add new columns (new_var_a
, ... , new_var_n
) and to fill them with the results of an operation.
We can create a gain
column to check if the pilot managed to compensate is departure delay
mutate(flights_sml, gain = dep_delay - arr_delay)
Solution
```{r mutate_reuse, include=TRUE} flights_sml <- mutate(flights_sml, gain = dep_delay - arr_delay, speed = distance / air_time * 60 ) ```
Solution
```{r mutate_challenges_a, eval=F, message=F, cache=T} mutate( flights, dep_time = (dep_time %/% 100) * 60 + dep_time %% 100, sched_dep_time = (sched_dep_time %/% 100) * 60 + sched_dep_time %% 100 ) ```
Useful creation functions
- Offsets:
lead()
andlag()
allow you to refer to leading or lagging values. This allows you to compute running differences (e.g.x - lag(x)
) or find when values change (x != lag(x)
). - Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes:
cumsum()
,cumprod()
,cummin()
,cummax()
; and dplyr providescummean()
for cumulative means. - Logical comparisons,
<
,<=
,>
,>=
,!=
, and==
- Ranking: there are a number of ranking functions, but you should start with
min_rank()
. There is alsorow_number()
,dense_rank()
,percent_rank()
,cume_dist()
,ntile()
R#5: Pipping and grouping
See you inTo go further: Data transformation and color sets.
There are a number of color palettes available in R, thanks to different packages such as RColorBrewer
, Viridis
or Ghibli
.
We will use them here to decorate our graphs, either on data already studied in the training, mpg
, or on more specialized data such as lists of differentially expressed genes ( GSE86356 )
install.packages(c("ghibli", "RColorBrewer", "viridis"))