    ---
    # https://www.gnu.org/licenses/agpl-3.0.txt
    title: "Introduction to clustering"
    author: "Ghislain Durif, Laurent Modolo, Franck Picard"
    output:
      rmdformats::downcute:
        self_contained: true
        use_bookdown: true
        default_style: "light"
        lightbox: true
        css: "./www/style_Rmd.css"
    ---

    This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

    if (!require("remotes"))
      install.packages("remotes")
    if (!require("tidyverse"))
      install.packages("tidyverse")
    library(tidyverse) # to manipulate data and make plots
    if (!require("Seurat"))
      install.packages("Seurat")
    library(Seurat) # single-cell RNA-Seq analysis toolkit
    if (!require("factoextra"))
      install.packages("factoextra")
    library(factoextra) # to visualize PCA and clustering results
    if (!require("fontawesome"))
      install.packages("fontawesome")
    library(fontawesome) # icons for the rendered page
    if (!require("igraph"))
      install.packages("igraph")
    library(igraph) # graph manipulation and community detection
    if (!require("cccd"))
      install.packages("cccd")
    library(cccd) # k-nearest-neighbor graphs with nng()
    if (!require("mclust"))
      install.packages("mclust")
    library(mclust) # model-based clustering and adjustedRandIndex()
    
    rm(list = ls())
    knitr::opts_chunk$set(echo = TRUE)
    knitr::opts_chunk$set(comment = NA)
    if (!require("klippy")) {
      remotes::install_github("rlesur/klippy")
    }
    klippy::klippy(
      position = c('top', 'right'),
      color = "white",
      tooltip_message = 'Click to copy',
      tooltip_success = 'Copied!')

    Introduction

    The goal of single-cell transcriptomics is to measure the transcriptional states of large numbers of cells simultaneously. The input to a single-cell RNA sequencing (scRNAseq) method is a collection of cells. Formally, the desired output is a cells (N) x transcripts or genes (M) matrix X^{N \times M} that describes, for each cell, the abundance of its constituent transcripts or genes. More generally, single-cell genomics methods seek to measure not just transcriptional state, but also other modalities in cells, e.g., protein abundances, epigenetic states, cellular morphology, etc.
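    For intuition, here is a toy sketch of such a matrix (the values and names are made up purely for illustration):

    # toy cells (N = 3) x genes (M = 4) count matrix, illustrative values only
    toy <- matrix(
      c(0, 2, 5, 0,
        1, 0, 3, 7,
        0, 0, 4, 1),
      nrow = 3, byrow = TRUE,
      dimnames = list(paste0("cell_", 1:3), paste0("gene_", 1:4))
    )
    toy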

    We will be analyzing a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. It contains 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.

    Loading the data

    if (!file.exists("practical_b.Rdata")) {
      if (!require("SeuratData"))
        remotes::install_github('satijalab/seurat-data', upgrade = T)
      library(SeuratData)
      InstallData("pbmc3k")
      pbmc <- LoadData("pbmc3k", type = "pbmc3k.final")
      pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
      pbmc <- ScaleData(pbmc)
      pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
      data <- Assays(pbmc, slot = "RNA")@scale.data
      cell_annotation <- pbmc@meta.data$seurat_annotations
      colnames(data) <- str_c(1:ncol(data), "_", cell_annotation)
      var_gene_2000 <- VariableFeatures(pbmc)
      save(data, cell_annotation, var_gene_2000, file = "practical_b.Rdata")
    }
    load("practical_b.Rdata", verbose = T)
    library(tidyverse)
    library(factoextra)
    library(igraph)
    library(cccd)
    library(mclust)
    # alternatively, load the data directly from the course server
    load(url("https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/practical_b.Rdata"), verbose = T)

    You loaded 3 variables

    • data a single-cell RNA-Seq expression matrix (genes in rows, cells in columns)
    • cell_annotation a vector containing the cell type of each cell
    • var_gene_2000 an ordered vector of the 2000 most variable genes
    Check the dimensions of your data
    Solution

    The scRNASeq data

    dim(data)

    The number of cells and their annotated cell types

    length(cell_annotation)
    table(cell_annotation)

    The 2000 most variable genes

    length(var_gene_2000)
    head(var_gene_2000)
    Why do you think that we need a list of the 2000 most variable genes?

    Distances

    The clustering algorithms seen this morning rely on Gram or distance matrices. You can compute the Euclidean distance matrix of `data` with the dist() function (but don't try to run it on all `r nrow(data)` genes)

    The following code computes the cell-to-cell Euclidean distances for the 10 most variable genes and the first 100 cells

    c2c_dist_10 <- data[var_gene_2000[1:10], 1:100] %>% 
      t() %>% 
      dist()
    Use the following code to study the impact of the number of genes on the distances
    c2c_dist_10 %>%
      as.vector() %>% 
      as_tibble() %>% 
      ggplot() +
      geom_histogram(aes(x = value))

    What happens when the number of dimensions increases?

    Solution

    ```{r, cache=T}
    c2c_dist_n <- tibble(
      n_var = c(seq(from = 10, to = 200, by = 50), seq(from = 200, to = 2000, by = 500), 2000)
    ) %>%
      mutate(
        cell_dist = purrr::map(n_var, function(n_var, data, var_gene_2000){
          data[var_gene_2000[1:n_var], 1:100] %>%
            t() %>%
            dist() %>%
            as.vector() %>%
            as_tibble()
        }, data = data, var_gene_2000 = var_gene_2000)
      )
    c2c_dist_n %>%
      unnest(c(cell_dist)) %>%
      ggplot() +
      geom_histogram(aes(x = value)) +
      facet_wrap(~n_var)
    ```

    As the number of dimensions grows, pairwise Euclidean distances tend to concentrate around a common value, so the contrast between near and far cells fades (the curse of dimensionality). To circumvent this problem we are going to use PCA as a dimension-reduction technique.
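    To see this concentration effect concretely, here is a minimal simulation (independent standard Gaussian coordinates, an assumption made only for this sketch): as the dimension d grows, the spread of pairwise distances shrinks relative to their mean, so all points become almost equidistant.

    tibble(d = c(2, 10, 100, 1000)) %>%
      mutate(stats = purrr::map(d, function(d) {
        x <- matrix(rnorm(100 * d), nrow = 100) # 100 random points in d dimensions
        dd <- as.vector(dist(x))                # all pairwise distances
        tibble(mean_dist = mean(dd), sd_dist = sd(dd))
      })) %>%
      unnest(stats) %>%
      mutate(relative_spread = sd_dist / mean_dist) # shrinks as d grows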

    Use the `prcomp()` function to compute `data_pca` from the 600 most variable genes. You can check the results with the following code:
      fviz_pca_ind(
        data_pca,
        geom = "point",
        col.ind = cell_annotation
      )
    Solution

    ```{r}
    data_pca <- data[var_gene_2000[1:600], ] %>%
      t() %>%
      prcomp()
    ```

    data_pca %>%
      fviz_pca_ind(
        geom = "point",
        col.ind = cell_annotation
      )
    You can use the `fviz_eig()` function to choose the number of dimensions that you are going to use to compute your distance matrix. Save this matrix in the `data_dist` variable
    Solution

    Check the variability explained by the axes of the PCA

    fviz_eig(data_pca)

    Compute the distance matrix on the first 2 PCs

    data_dist <- dist(data_pca$x[,1:2])
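    As a quick sanity check (standard `prcomp()` output), you can look at the cumulative proportion of variance carried by these two PCs:

    # cumulative fraction of total variance captured by PC1 and PC2
    summary(data_pca)$importance["Cumulative Proportion", 1:2]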

    Hierarchical Clustering

    The hclust() function performs a hierarchical clustering analysis from a distance matrix (with complete linkage by default; see its `method` argument).

    data_hclust <- hclust(data_dist)
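    A side note: the linkage criterion can change the result noticeably. Here is a minimal sketch (our own illustration, using cutree(), which is introduced just below) comparing the cluster sizes obtained with a few `method` values:

    # cluster sizes for k = 9 under different linkage criteria
    sapply(c("complete", "average", "ward.D2"), function(m) {
      data_dist %>%
        hclust(method = m) %>%
        cutree(k = 9) %>%
        table()
    })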
    You can use the `plot()` function to plot the resulting dendrogram.
    Solution

    ```{r}
    data_hclust %>% plot()
    ```

    Too much information drowns the information: the cutree() function can help you solve this problem by cutting the dendrogram into a chosen number of groups.
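    For instance, a minimal sketch (the `clusters_hclust` name is ours) cutting the tree into 9 groups and counting the cells per cluster:

    # cut the dendrogram into k = 9 groups and count cells per cluster
    clusters_hclust <- cutree(data_hclust, k = 9)
    table(clusters_hclust)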

    Which choice of `k` would you take?

    Modify the following code to display your clustering.

    data_pca %>%
      fviz_pca_ind(
        geom = "point",
        col.ind = cell_annotation # replace this with an as.factor() of your clusters
      )
    Solution

    ```{r}
    data_pca %>%
      fviz_pca_ind(
        geom = "point",
        col.ind = as.factor(cutree(data_hclust, k = 9))
      )
    ```

    The adjusted Rand index (ARI) compares two classifications. Its expected value is zero for random partitions, and it is bounded above by 1, which is reached when the two partitions agree perfectly.
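    As a sanity check, a minimal sketch with made-up labels (for illustration only): a partition compared with itself gives 1, while two independent random labelings give an ARI close to 0.

    set.seed(1)
    a <- sample(1:9, 100, replace = TRUE) # random labels, illustration only
    b <- sample(1:9, 100, replace = TRUE)
    adjustedRandIndex(a, a) # perfect agreement: 1
    adjustedRandIndex(a, b) # independent random labels: close to 0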

    Use the `adjustedRandIndex()` function to compare the `cell_annotation` to your hierarchical clustering
    Solution

    ```{r}
    adjustedRandIndex(
      cutree(data_hclust, k = 9),
      cell_annotation
    )
    ```

    Modify the following code to study the relation between the adjusted Rand index and the number of PCs used to compute the distance
    data_pca$x[,1:3] %>% 
      dist() %>% 
      hclust() %>% 
      cutree(k = 9) %>% 
      adjustedRandIndex(cell_annotation)

    What can you conclude?

    Solution

    tibble(
      n_pcs = seq(from = 3, to = 100, by = 10)
    ) %>% 
      mutate(
        ari = purrr::map(n_pcs, function(n_pcs, data_pca, cell_annotation){
          data_pca$x[,1:n_pcs] %>% 
            dist() %>% 
            hclust() %>% 
            cutree(k = 9) %>% 
            adjustedRandIndex(cell_annotation)
        }, data_pca = data_pca, cell_annotation = cell_annotation)
      ) %>% 
      unnest(ari) %>% 
      ggplot() +
      geom_line(aes(x = n_pcs, y = ari))
    Is it a PCA problem?
    Solution

    ```{r}
    tibble(
      n_gene = seq(from = 3, to = 600, by = 20)
    ) %>%
      mutate(
        ari = purrr::map(n_gene, function(n_gene, data, var_gene_2000, cell_annotation){
          data[var_gene_2000[1:n_gene], ] %>%
            t() %>%
            dist() %>%
            hclust() %>%
            cutree(k = 9) %>%
            adjustedRandIndex(cell_annotation)
        }, data = data, var_gene_2000 = var_gene_2000, cell_annotation = cell_annotation)
      ) %>%
      unnest(ari) %>%
      ggplot() +
      geom_line(aes(x = n_gene, y = ari))
    ```

    k-means clustering

    The kmeans() function performs a k-means clustering analysis. Note that, unlike hclust(), it expects a data matrix rather than a distance matrix: passing data_dist here means that each cell is described by its vector of distances to all the other cells.

    data_kmeans <- kmeans(data_dist, centers = 9)
    Why is the `centers` parameter required for `kmeans()` and not for the `hclust()` function?
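    A side note: k-means depends on its random initialization, so two runs can give different partitions. A minimal sketch (the `data_kmeans_multi` name is ours) using the standard `nstart` argument of `kmeans()` to rerun the algorithm from several random starts and keep the best solution:

    # rerun k-means from 10 random starts and keep the lowest total WSS
    data_kmeans_multi <- kmeans(data_dist, centers = 9, nstart = 10)
    table(data_kmeans_multi$cluster)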

    We want to compare the cell annotation to our clustering.

    data_pca %>%
      fviz_pca_ind(
        geom = "point",
        col.ind = cell_annotation
      )
    Using the `str()` function to explore your k-means results, make the following plot.
    data_pca %>%
      fviz_pca_ind(
        geom = "point",
        col.ind = as.factor(data_kmeans$cluster)
      )
    Solution

    ```{r, eval=F}
    data_pca %>%
      fviz_pca_ind(
        geom = "point",
        col.ind = as.factor(data_kmeans$cluster)
      )
    ```

    Use the `adjustedRandIndex()` function to compare the `cell_annotation` to your k-means clustering.
    Solution

    ```{r}
    adjustedRandIndex(
      data_kmeans$cluster,
      cell_annotation
    )
    ```

    Maybe the real number of clusters in the PC data is not k=9. We can use different metrics to evaluate the effect of the number of clusters on the quality of the clustering.

    • WSS: the Within-Cluster Sum of Squared errors (each point vs the centroid of its own cluster) for different values of k: \sum_{i=1}^n \| x_i - c_{(i)} \|^2, with c_{(i)} the centroid of the cluster containing x_i
    • The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation): s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, with a(i) the mean distance between i and the other cells of its own cluster, and b(i) the smallest mean distance between i and the cells of another cluster. We plot the average silhouette width \frac{1}{n}\sum_{i=1}^n s(i) (see the sketch after this list for a direct computation)
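    To connect the silhouette formula to code, here is a minimal sketch computing the average silhouette width of the k-means partition directly; it relies on the `cluster` package, an extra dependency not loaded above:

    if (!require("cluster"))
      install.packages("cluster")
    library(cluster) # for silhouette()
    sil <- silhouette(data_kmeans$cluster, data_dist)
    mean(sil[, "sil_width"]) # average s(i) over all cells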
    Use the `fviz_nbclust()` function to make these plots from your k-means results.
    Solution

    For the total within-cluster-sum of squared errors

    fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "wss")

    For the average silhouette width

    fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "silhouette")
    With `fviz_nbclust()` you can make the same analysis for your hierarchical clustering. The function `hcut()` allows you to perform `hclust()` and `cutree()` in one step.
    Solution

    ```{r}
    fviz_nbclust(data_pca$x[, 1:3], hcut, method = "wss")
    ```

    fviz_nbclust(data_pca$x[, 1:3], hcut, method = "silhouette")

    Graph-based clustering

    We are going to use the cluster_louvain() function to perform a graph-based clustering. This function takes as input an undirected graph instead of a distance matrix.

    The nng() function computes a k-nearest-neighbor graph. With the mutual = T option, an edge is kept only when each cell belongs to the k nearest neighbors of the other, which makes the graph undirected.

    Check the effect of `mutual = T` on `data_knn` with the following code
    data_knn <- data_dist %>%
      as.matrix() %>% 
      nng(k = 30, mutual = T)
    Solution

    ```{r, eval=F}
    data_knn_F <- data_dist %>%
      as.matrix() %>%
      nng(k = 30, mutual = F)

    str(data_knn)
    str(data_knn_F)
    ```
    
    The `cluster_louvain()` function implements the multi-level modularity optimization algorithm for finding community structure in a graph. Use this function on `data_knn` to create a `data_louvain` variable.

    You can check the clustering results with `membership(data_louvain)`.

    For which `resolution` value do you get 9 clusters?

    Solution

    ```{r}
    data_louvain <- data_knn %>%
      cluster_louvain(resolution = 0.41)
    data_pca %>%
      fviz_pca_ind(
        geom = "point",
        col.ind = as.factor(membership(data_louvain))
      )
    ```
    Use the `adjustedRandIndex()` function to compare the `cell_annotation` to your graph-based clustering.
    Solution

    ```{r}
    adjustedRandIndex(
      membership(data_louvain),
      cell_annotation
    )
    ```
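    To see how the `resolution` parameter drives the granularity of the Louvain clustering, here is a minimal sketch (our own illustration) counting the communities found for a few values:

    # number of communities found by Louvain at different resolutions
    sapply(c(0.1, 0.41, 1, 2), function(res) {
      data_knn %>%
        cluster_louvain(resolution = res) %>%
        length()
    })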

    The .Rmd file corresponding to this page is available here under the AGPL3 License