Skip to content
Snippets Groups Projects
Select Git revision
  • 9adeb3c1dbd5515b6720835681eb852d80ea81a8
  • main default
  • master protected
  • quarto-rebuild
4 results

session_4.Rmd

Blame
  • Forked from LBMC / Hub / formations / R_basis
    21 commits ahead of the upstream repository.
    Gilquin's avatar
    Gilquin authored
    correct unconsistent code formatting with styler package
    correct deprecated warnings
    9adeb3c1
    History
    title: "R.4: data transformation"
    author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr), Hélène Polvèche [hpolveche@istem.fr](mailto:hpolveche@istem.fr)"
    date: "2022"
    library(fontawesome)
    
    if ("conflicted" %in% .packages()) {
      conflicted::conflicts_prefer(dplyr::filter)
    }
    rm(list = ls())
    knitr::opts_chunk$set(echo = TRUE)
    knitr::opts_chunk$set(comment = NA)

    Introduction

    The goal of this session is to practice data transformation with tidyverse. The objectives will be to:

    • Filter rows with filter()
    • Arrange rows with arrange()
    • Select columns with select()
    • Add new variables with mutate()

    For this session, we are going to work with a new dataset included in the nycflights13 package.

    Install this package and load it. As usual you will also need the `tidyverse` library.
    Solution

    install.packages("nycflights13")
    library("tidyverse")
    library("nycflights13")

    Data set : nycflights13

    nycflights13::flights contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights

    ?flights

    You can display the first rows of the dataset to have an overview of the data.

    flights

    You can use the function colnames(dataset) to get all the column names of a table:

    colnames(flights)

    Data type

    In programming languages, variables can have different types. When you display a tibble you can see the type of a column. Here is a list of common variable types that you will encounter:

    • int stands for integers.
    • dbl stands for doubles or real numbers.
    • chr stands for character vectors or strings.
    • dttm stands for date-times (a date + a time).
    • lgl stands for logical, vectors that contain only TRUE or FALSE.
    • fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
    • date stands for dates.

    It's important for you to know about and understand the different types because certain operations are only allowed between certain types. For instance, you cannot add an int to a chr, but you can add an int to a dbl the results will be a dbl.

    filter rows

    Variable types are important to keep in mind for comparisons. The filter() function allows you to subset observations based on their values.

    The good reflex to take when you meet a new function of a package is to look at the help with ?function_name to learn how to use it and to know the different arguments.

    ?filter

    Use test to filter on a column

    You can use the relational operators (<,>,==,<=,>=,!=) to make a test on a column and keep rows for which the results is TRUE.

    filter(flights, air_time >= 680)
    filter(flights, carrier == "HA")
    filter(flights, origin != "JFK")

    The operator %in% is very useful to test if a value is in a list.

    filter(flights, carrier %in% c("OO", "AS"))
    filter(flights, month %in% c(5, 6, 7, 12))

    dplyr functions never modify their inputs, so if you want to save the result, you'll need to use the assignment operator, <-.

    Save the flights longer than 680 minutes in a `long_flights` variable.
    Solution

    long_flights <- filter(flights, air_time >= 680)

    Logical operators to filter on several columns

    Multiple arguments to filter() are combined with AND: every expression must be TRUE in order for a row to be included in the output.

    filter(flights, month == 12, day == 25)

    In R you can use the symbols & (and), | (or), ! (not) and the function xor() to build other kinds of tests.

    Display the `long_flights` variable and predict the results of:
    filter(long_flights, day <= 15 & carrier == "HA")
    filter(long_flights, day <= 15 | carrier == "HA")
    filter(long_flights, (day <= 15 | carrier == "HA") & (!month > 2))
    Solution

    long_flights
    
    filter(long_flights, day <= 15 & carrier == "HA")
    filter(long_flights, day <= 15 | carrier == "HA")
    filter(long_flights, (day <= 15 | carrier == "HA") & (!month > 2))
    Test the following operations and translate them with words.
    filter(flights, month == 11 | month == 12)
    filter(flights, month %in% c(11, 12))
    filter(flights, !(arr_delay > 120 | dep_delay > 120))
    filter(flights, arr_delay <= 120 & dep_delay <= 120)
    filter(flights, arr_delay <= 120, dep_delay <= 120)

    Combining logical operators is a powerful programmatic way to select subset of data. However, keep in mind that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of a long one.

    R either prints out the results, or saves them to a variable. What happens when you put your variable assignment code between parenthesis `(` `)` ?
    (dec25 <- filter(flights, month == 12, day == 25))

    Missing values

    One important feature of R that can make comparison tricky are missing values, or NAs for Not Availables. Indeed, each of the variable type can contain either a value of this type (i.e., 2 for an int) or nothing. The nothing recorded in a variable status is represented with the NA symbol.

    As operations with NA values don't make sense, if you have NA somewhere in your operation, the results will be NA:

    NA > 5
    10 == NA
    NA + 10

    However, you can test for NAs with the function is.na():

    is.na(NA)

    filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

    df <- tibble(
      x = c("A", "B", "C"),
      y = c(1, NA, 3)
    )
    df
    filter(df, y > 1)
    filter(df, is.na(y) | y > 1)

    Challenges

    Find all flights that:
    • Had an arrival delay (arr_delay) of two or more hours (you can check ?flights)
    • Flew to Houston (IAH or HOU)
    Solution

    ```{r filter_chalenges_b, eval=TRUE} filter(flights, arr_delay >= 120 & dest %in% c("IAH", "HOU")) ```

    How many flights have a missing `dep_time` ?
    Solution

    filter(flights, is.na(dep_time))
    Why is `NA ^ 0` not missing? Why is `NA | TRUE` not missing? Why is `FALSE & NA` not missing? Can you figure out the general rule? (`NA * 0` is a tricky counterexample!)
    Solution

    NA^0 # ^ 0 is always 1 it's an arbitrary rule not a computation
    NA | TRUE # if a member of a OR operation is TRUE the results is TRUE
    FALSE & NA # if a member of a AND operation is FALSE the results is FALSE
    NA * 0 # here we have a true computation

    Arrange rows with arrange()

    arrange() works similarly to filter() except that instead of selecting rows, it changes their order.

    arrange(flights, distance, dep_delay)

    You can use desc() to reorder by a column in descending order:

    arrange(flights, distance, desc(dep_delay))

    Missing values

    Missing values are always sorted at the end:

    df <- tibble(
      x = c("A", "B", "C"),
      y = c(1, NA, 3)
    )
    df
    
    arrange(df, y)
    arrange(df, desc(y))

    Challenges

    • Find the most delayed flight at arrival (arr_delay).
    • Find the flight that left earliest (dep_delay).
    • How could you arrange all missing values to the start in the df tibble ?
    Solution

    Find the most delayed flight at arrival.

    arrange(flights, desc(arr_delay))

    Find the flight that left earliest.

    arrange(flights, dep_delay)

    How could you arrange all missing values to the start in the df tibble ?

    arrange(df, desc(is.na(y)))

    Select columns with select()

    select() lets you quickly zoom in on a useful subset using operations based on variable names.

    You can select by column names:

    select(flights, year, month, day)

    By defining a range of columns:

    select(flights, year:day)

    Or, you can use a negative (-) to remove columns:

    select(flights, -(year:day))

    You can also rename column names on the fly:

    select(flights, Y = year, M = month, D = day)

    Helper functions

    Here are a number of helper functions you can use within select():

    • starts_with("abc"): matches column names that begin with "abc".
    • ends_with("xyz"): matches column names that end with "xyz".
    • contains("ijk"): matches column names that contain "ijk".
    • num_range("x", 1:3): matches x1, x2 and x3.
    • where(test_function): selects columns for which the result is TRUE.

    See ?select for more details.

    Challenges

    • Brainstorm as many ways as possible to select only dep_time, dep_delay, arr_time, and arr_delay from flights. You can associate several selections arguments with | , & and !.

    The simplest way to start:

    df_dep_arr <- select(flights, dep_time, dep_delay, arr_time, arr_delay)
    colnames(df_dep_arr)
    Other solutions

    select(flights, dep_time, dep_delay, arr_time, arr_delay)
    select(flights, starts_with("dep"), starts_with("arr"))
    select(flights, starts_with("dep") | starts_with("arr"))
    select(flights, matches("^(dep|arr)"))
    select(flights, dep_time:arr_delay & !starts_with("sched"))
    • What does the any_of() function do?
    • Why might it be helpful in conjunction with this vector? What is the difference with all_of() (hint : add "toto" to vars) ?
    vars <- c("year", "month", "day", "dep_delay", "arr_delay")
    Solution

    select(flights, any_of(vars))
    select(flights, all_of(vars))

    From the help message (?all_of()) :

    • all_of() is for strict selection. If any of the variables in the character vector is missing, an error is thrown.
    • any_of() doesn't check for missing variables. It is particularly useful with negative selections, when you would like to make sure a variable is removed.
    vars <- c(vars, "toto")
    select(flights, any_of(vars))
    select(flights, all_of(vars))
    • Select all columns which contain character values ? numeric values ?
    Solution

    select(flights, where(is.character))
    select(flights, where(is.numeric))
    • Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
    select(flights, contains("TIME"))
    Solution

    select(flights, contains("TIME", ignore.case = FALSE))

    Add new variables with mutate()

    It's often useful to add new columns that are functions of existing columns. That's the job of mutate().

    First let's create a thinner dataset to work on `flights_thin` that contains:
    • columns from year to day
    • columns that ends with delay
    • the distance and air_time columns
    • the dep_time and sched_dep_time columns

    Then let's create an even smaller toy dataset to test your commands before using them on the larger one (It a good reflex to take). For that you can use the function head or sample_n for a random sampling alternative.

    • select only 5 rows
    Solution

    (flights_thin <- select(flights, year:day, ends_with("delay"), distance, air_time, contains("dep_time")))
    (flights_thin_toy <- head(flights_thin, n = 5))
    (flights_thin_toy2 <- sample_n(flights_thin, size = 5))

    mutate()

    mutate(tbl, new_var_a = opperation_a, ..., new_var_n = opperation_n)

    mutate() allows you to add new columns (new_var_a, ... , new_var_n) and to fill them with the results of an operation.

    We can create a gain column, which can be the difference between departure and arrival delays, to check whether the pilot has managed to compensate for his departure delay.

    mutate(flights_thin_toy, gain = dep_delay - arr_delay)

    Using mutate to add a new column gain and speed that contains the average speed of the plane to the flights_thin_toy tibble (speed = distance / time).

    Solution

    flights_thin_toy <- mutate(flights_thin_toy,
      gain = dep_delay - arr_delay,
      speed = distance / air_time * 60
    )
    flights_thin_toy
    Currently `dep_time` and `sched_dep_time` are convenient to look at, but difficult to work with, as they're not really continuous numbers (see the help to get more information on these columns). In the flight dataset, convert them to a more convenient representation of the number of minutes since midnight.

    Hints :

    • dep_time and sched_dep_time are in the HHMM format (see the help to get these information). So you have to first get the number of hours HH, convert them in minutes and then add the number of minutes MM.

    • For example: 20:03 will be display 2003, so to convert it in minutes you have to do 20 * 60 + 03 (= 1203).

    • To split the number HHMM in hours (HH) and minutes (MM) you have to use an euclidean division of HHMM by 100 to get the number of hours as the divisor and the number of minute as the remainder. For that, use the modulo operator %% to get the remainder and it's friend %/% which returns the divisor.

    HH <- 2003 %/% 100
    HH
    MM <- 2003 %% 100
    MM
    HH * 60 + MM

    It is always a good idea to decompose a problem in small parts. First, only start with dep_time. Build the HH and MM columns. Then, try to write both conversions in one row.

    Partial solution

    mutate(
      flights_thin_toy,
      HH = dep_time %/% 100,
      MM = dep_time %% 100,
      dep_time2 = HH * 60 + MM
    )

    Note: You can use the .after option to tell where to put the new columns,

    mutate(
      flights_thin_toy,
      HH = dep_time %/% 100,
      MM = dep_time %% 100,
      dep_time2 = HH * 60 + MM,
      .after = "dep_time"
    )

    or .keep = "used" to keep only the columns used for the calculus which can be usefull for debugging,

    mutate(
      flights_thin_toy,
      HH = dep_time %/% 100,
      MM = dep_time %% 100,
      dep_time2 = HH * 60 + MM,
      .keep = "used"
    )

    In one row (or you can also remove columns HH and MM using select):

    mutate(
      flights_thin_toy,
      dep_time2 = dep_time %/% 100 * 60 + dep_time %% 100,
      .after = "dep_time"
    )

    Note: You can also directly replace a column by the result of the mutate operation,

    mutate(
      flights_thin_toy,
      dep_time = dep_time * 60 + dep_time
    )
    Final solution

    mutate(
      flights,
      dep_time = (dep_time %/% 100) * 60 + dep_time %% 100,
      sched_dep_time = (sched_dep_time %/% 100) * 60 + sched_dep_time %% 100
    )

    Useful creation functions

    • Offsets: lead(x) and lag(x) allow you to refer to the previous or next values of the column x.
      This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).
    • R provides functions for running cumulative sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.
    • Logical comparisons, <, <=, >, >=, !=, and ==.
    • Ranking: there are a number of ranking functions, the most frequently used being min_rank(). They differ by the way ties are treated, etc. Try ?mutate, ?min_rank, ?rank, for more information.

    See you in R.5: Pipping and grouping {.unnumbered .unlisted}

    To go further: Data transformation and color sets.

    There are a number of color palettes available in R, thanks to different packages such as RColorBrewer, Viridis or Ghibli. We will use them here to decorate our graphs, either on data already studied in the training, mpg, or on more specialized data such as lists of differentially expressed genes ( GSE86356 )

    install.packages(c("ghibli", "RColorBrewer", "viridis"))
    library(tidyverse)
    
    library(RColorBrewer)
    library(ghibli)
    library(viridis)

    RColorBrewer & Ghibli

    Using mpg and the ggplot2 package, reproduce the graph studied in @sec-color-mapping. Modify the colors representing the class of cars with the palettes Dark2 of RColorBrewer, then MononokeMedium from Ghibli.

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
      geom_point()

    Go to the links to find the appropriate function: they are very similar between the two packages.

    Solution

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
      geom_point() +
      scale_color_brewer(palette = "Dark2")
    ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
      geom_point() +
      scale_colour_ghibli_d("MononokeMedium")

    The choice of colors is very important for the comprehension of a graphic. Some palettes are not suitable for everyone. For example, for people with color blindness, color gradients from green to red, or from yellow to blue should be avoided.

    To display only Brewer palettes that are colorblind friendly, specify the option colorblindFriendly = TRUE as follows:

    display.brewer.all(colorblindFriendly = TRUE)

    Viridis

    The viridis package provides a series of color maps that are designed to improve graph readability for readers with common forms of color blindness and/or color vision deficiency.

    For the next part, we will use a real data set. Anterior tibial muscle tissue was collected from 20 patients, with or without confirmed myotonic dystrophy type 1 (DM1). Illumina RNAseq was performed on these samples and the sequencing data are available on GEO with the identifier GSE86356.

    First, we will use the gene count table of these samples, formatted for use in ggplot2 ( pivot_longer() function ).

    Open the csv file using the read_csv2() function. The file is located at "https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv".

    Solution

    Download the file "Expression_matrice_pivot_longer_DEGs_GSE86356.csv" and save it in your working directory. You may have to set you working directory using setwd()

    expr_DM1 <- read_csv2("Expression_matrice_pivot_longer_DEGs_GSE86356.csv")
    
    expr_DM1

    or you can read it from the following url:

    (expr_DM1 <- read_csv2("https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv"))

    With this tibble, use ggplot2 and the geom_tile() function to make a heatmap. Fit the samples on the x-axis and the genes on the y-axis.

    Tip: Transform the counts into log10(x + 1) for a better visualization.

    Solution

    (DM1_tile_base <-
      ggplot(expr_DM1, aes(samples, Genes, fill = log1p(counts))) +
      geom_tile() +
      labs(y = "Genes", x = "Samples") +
      theme(
        axis.text.y = element_text(size = 6),
        axis.text.x = element_text(size = 6, angle = 90)
      ))

    Nota bene: The elements of the axes, and the theme in general, are modified in the theme() function.

    With the default color gradient, even with the transformation, the heatmap is difficult to study.

    R interprets a large number of colors, indicated in RGB, hexadecimal, or just by name. For example :

    ![](./img/colorsR.png){width=400px}

    With scale_fill_gradient2() function, change the colors of the gradient, taking "white" for the minimum value and "springgreen4" for the maximum value.

    Solution

    DM1_tile_base + scale_fill_gradient2(low = "white", high = "springgreen4")
    

    It's better, but still not perfect! Now let's use the viridis color gradient for this graph.

    Solution

    DM1_tile_base + scale_fill_viridis_c()

    Volcano Plot

    For this last exercise, we will use the results of the differential gene expression analysis between DM1 vs WT conditions.

    Open the csv file using the read_csv2() function. The file is located at "http://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv".

    Solution

    Download the file "EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv" and save it in your working directory.

    tab <- read_csv2("EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv")
    
    tab
    tab <- read_csv2("https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv")
    tab

    To make a Volcano plot, displaying different information on the significance of variation using colors, we will have to make a series of modifications on this table.

    With mutate() and ifelse() fonctions, we will have to create:

    • a column 'sig': it indicates if the gene is significant ( TRUE or FALSE ).
      Thresholds: baseMean > 20 and padj < 0.05 and abs(log2FoldChange) >= 1.5

    • a column 'UpDown': it indicates if the gene is significantly up-regulated (Up), down-regulated (Down), or not significantly regulated (NO).

    Solution

    (
      tab.sig <- mutate(
        tab,
        sig = baseMean > 20 & padj < 0.05 & abs(log2FoldChange) >= 1.5,
        UpDown = ifelse(sig, ### we can use in the same mutate a column created by a previous line
          ifelse(log2FoldChange > 0, "Up", "Down"), "NO"
        )
      )
    )

    We want to see the top10 DEGs on the graph. For this, we will use the package ggrepel. Install and load the ggrepel package.

    Solution

    install.packages("ggrepel")
    library(ggrepel)

    Let's filter out the table into a new variable, top10, to keep only the significant differentially expressed genes, those with the top 10 adjusted pvalue. The smaller the adjusted pvalue, the more significant the gene.

    Tips: You can use the function slice_min()

    Solution

    (top10 <- arrange(tab.sig, desc(sig), padj))
    (top10 <- mutate(top10, row_N = row_number()))
    (top10 <- filter(top10, row_N <= 10))
    (top10 <- filter(tab.sig, sig == TRUE))
    (top10 <- slice_min(top10, padj, n = 10))

    The data is ready to be used to make a volcano plot!

    To make the graph below, use `ggplot2`, the functions `geom_point()`, `geom_hline()`, `geom_vline()`, `theme_minimal()`, `theme()` (to remove the legend), `geom_label_repel()` and the function `scale_color_manual()` for the colors.
    • Tips 1: Don't forget the transformation of the adjusted pvalue.
    • Tips 2: Feel free to search your favorite Web browser for help.
    • Tips 3: geom_label_repel() function needs a new parameter 'data' and 'label' in aes parameters.
    ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) +
      geom_point() +
      scale_color_manual(values = c("steelblue", "lightgrey", "firebrick")) +
      geom_hline(yintercept = -log10(0.05), col = "black") +
      geom_vline(xintercept = c(-1.5, 1.5), col = "black") +
      theme_minimal() +
      theme(legend.position = "none") +
      labs(y = "-log10(p-value)", x = "log2(FoldChange)") +
      geom_label_repel(data = top10, mapping = aes(label = gene_symbol))
    
    Solution

    ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) +
      geom_point() +
      scale_color_manual(values = c("steelblue", "lightgrey", "firebrick")) +
      geom_hline(yintercept = -log10(0.05), col = "black") +
      geom_vline(xintercept = c(-1.5, 1.5), col = "black") +
      theme_minimal() +
      theme(legend.position = "none") +
      labs(y = "-log10(p-value)", x = "log2(FoldChange)") +
      geom_label_repel(data = top10, mapping = aes(label = gene_symbol))