# Practice: Introduction to clustering

*Ghislain Durif, Laurent Modolo, Franck Picard*

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
if (!require("remotes"))
install.packages("remotes")
if (!require("tidyverse"))
install.packages("tidyverse")
library(tidyverse) # to manipule data and make plot
if (!require("Seurat"))
install.packages("Seurat")
if (!require("factoextra"))
install.packages("factoextra")
library(factoextra) # manipulate pca results
if (!require("fontawesome"))
install.packages("fontawesome")
library(fontawesome)
if (!require("igraph"))
install.packages("igraph")
library(igraph)
if (!require("cccd"))
install.packages("cccd")
library(cccd)
if (!require("mclust"))
install.packages("mclust")
library(mclust)
if (!require("umap"))
install.packages("umap")
rm(list = ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
if("conflicted" %in% .packages()) conflicted::conflicts_prefer(dplyr::filter)
## Introduction
The goal of single-cell transcriptomics is to measure the transcriptional states of large numbers of cells simultaneously. The input to a single-cell RNA sequencing (scRNAseq) method is a collection of cells. Formally, the desired output is a transcript (or gene) × cell count matrix recording the expression level of each gene in each cell.
We will be analyzing a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. It contains 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.
## Loading the data
```{r load_data}
if (!file.exists("practical_b.Rdata")) {
  if (!require("SeuratData"))
    remotes::install_github('satijalab/seurat-data', upgrade = T)
  library(SeuratData)
  InstallData("pbmc3k")
  pbmc <- LoadData("pbmc3k", type = "pbmc3k.final")
  pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
  pbmc <- ScaleData(pbmc)
  pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
  data <- Assays(pbmc, slot = "RNA")@scale.data
  cell_annotation <- pbmc@meta.data$seurat_annotations
  colnames(data) <- str_c(1:ncol(data), "_", cell_annotation)
  var_gene_2000 <- VariableFeatures(pbmc)
  save(data, cell_annotation, var_gene_2000, file = "practical_b.Rdata")
}
load("practical_b.Rdata", verbose = T)
```
Alternatively, you can load the required libraries and fetch the prepared data directly from the course server:

```{r load_url}
library(tidyverse)
library(factoextra)
library(igraph)
library(cccd)
library(mclust)
library(umap)
load(url("https://lbmc.gitbiopages.ens-lyon.fr/hub/formations/ens_m1_ml/practical_b.Rdata"), verbose = T)
```
You loaded 3 variables:

- `data`: a single-cell RNA-sequencing count matrix
- `cell_annotation`: a vector containing the cell type of each cell
- `var_gene_2000`: an ordered vector of the 2,000 most variable genes

Check the dimension and content of each of these variables.
Solution

The scRNAseq data:

```{r}
dim(data)
```

The number of cell types:

```{r}
length(cell_annotation)
table(cell_annotation)
```

The most variable genes:

```{r}
length(var_gene_2000)
head(var_gene_2000)
```
## Distances

The clustering algorithms seen this morning rely on Gram matrices. You can compute the Euclidean distance matrix of `data` with the `dist()` function (but don't try to run it on the `r nrow(data)` genes).

The following code computes the cell-to-cell Euclidean distances for the 10 most variable genes and the first 100 cells:
```{r}
c2c_dist_10 <- data[var_gene_2000[1:10], 1:100] %>%
  t() %>%
  dist()
c2c_dist_10 %>%
  as.vector() %>%
  as_tibble() %>%
  ggplot() +
  geom_histogram(aes(x = value))
```
What happens when the number of dimensions increases?
Solution
```{r, cache=T}
c2c_dist_n <- tibble(
  n_var = c(seq(from = 10, to = 200, by = 50), seq(from = 200, to = 2000, by = 500), 2000)
) %>%
  mutate(
    cell_dist = purrr::map(n_var, function(n_var, data, var_gene_2000) {
      data[var_gene_2000[1:n_var], 1:100] %>%
        t() %>%
        dist() %>%
        as.vector() %>%
        as_tibble()
    }, data = data, var_gene_2000 = var_gene_2000)
  )
c2c_dist_n %>%
  unnest(c(cell_dist)) %>%
  ggplot() +
  geom_histogram(aes(x = value)) +
  facet_wrap(~n_var)
```
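The histograms concentrate around a single value as more genes are added: in high dimension, pairwise Euclidean distances tend to become nearly equal, so they carry less and less information for clustering. Here is a quick synthetic illustration of this concentration effect, using simulated standard Gaussian points (an illustrative sketch, unrelated to the PBMC data):

```{r}
set.seed(1)
purrr::map_dfr(c(2, 10, 100, 1000), function(d) {
  x <- matrix(rnorm(100 * d), ncol = d) # 100 random points in dimension d
  dd <- as.vector(dist(x))
  # the relative spread of pairwise distances shrinks as d grows
  tibble(d = d, relative_spread = (max(dd) - min(dd)) / min(dd))
})
```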
To circumvent this problem we are going to use PCA as a dimension-reduction technique. Compute a `data_pca` PCA object (for instance with `prcomp()` on the most variable genes) so that the following code works:
```{r, eval=F}
fviz_pca_ind(
  data_pca,
  geom = "point",
  col.ind = cell_annotation
)
```
Solution
```{r}
data_pca <- data[var_gene_2000[1:600], ] %>%
  t() %>%
  prcomp()
```
```{r}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = cell_annotation
  )
```
Check the variability explained by the axes of the PCA:

```{r}
fviz_eig(data_pca)
```

Compute the distance matrix on the first 2 PCs:

```{r}
data_dist <- dist(data_pca$x[, 1:2])
```
## Hierarchical clustering
The `hclust()` function performs a hierarchical clustering analysis from a distance matrix.

```{r}
data_hclust <- hclust(data_dist)
```

Plot the resulting dendrogram.
Solution
```{r}
data_hclust %>%
  plot()
```
Too much information can drown the information; the `cutree()` function can help you solve this problem. Cut the tree into 9 clusters and use them to color the PCA plot below instead of `cell_annotation`:

```{r}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = cell_annotation # replace with your clusters, wrapped in as.factor()
  )
```
Solution
```{r}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(cutree(data_hclust, k = 9))
  )
```
The adjusted Rand index (ARI) can be computed to compare two classifications. This index has an expected value of zero in the case of random partitions, and it is bounded above by 1 in the case of perfect agreement between two partitions.
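As a quick sanity check of these two properties, here is a toy sketch (the labels are made up for illustration; it only uses `adjustedRandIndex()` from the already-loaded `mclust` package):

```{r}
toy <- rep(1:3, each = 50)           # an arbitrary partition of 150 points
adjustedRandIndex(toy, toy)          # perfect agreement: ARI = 1
set.seed(1)
adjustedRandIndex(toy, sample(toy))  # random relabelling: ARI close to 0
```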
Use `adjustedRandIndex()` to compare your clustering to the cell annotation.

Solution
```{r}
adjustedRandIndex(
  cutree(data_hclust, k = 9),
  cell_annotation
)
```
The same comparison can be done with a clustering computed on the first 3 PCs:

```{r}
data_pca$x[, 1:3] %>%
  dist() %>%
  hclust() %>%
  cutree(k = 9) %>%
  adjustedRandIndex(cell_annotation)
```
What happens when you increase the number of PCs? What can you conclude?
Solution
```{r}
tibble(
  n_pcs = seq(from = 3, to = 100, by = 10)
) %>%
  mutate(
    ari = purrr::map(n_pcs, function(n_pcs, data_pca, cell_annotation) {
      data_pca$x[, 1:n_pcs] %>%
        dist() %>%
        hclust() %>%
        cutree(k = 9) %>%
        adjustedRandIndex(cell_annotation)
    }, data_pca = data_pca, cell_annotation = cell_annotation)
  ) %>%
  unnest(ari) %>%
  ggplot() +
  geom_line(aes(x = n_pcs, y = ari))
```
Is the PCA step necessary? Compute the ARI of a hierarchical clustering on the raw data as a function of the number of most variable genes used.

Solution
```{r}
tibble(
  n_gene = seq(from = 3, to = 600, by = 20)
) %>%
  mutate(
    ari = purrr::map(n_gene, function(n_gene, data, var_gene_2000, cell_annotation) {
      data[var_gene_2000[1:n_gene], ] %>%
        t() %>%
        dist() %>%
        hclust() %>%
        cutree(k = 9) %>%
        adjustedRandIndex(cell_annotation)
    }, data = data, var_gene_2000 = var_gene_2000, cell_annotation = cell_annotation)
  ) %>%
  unnest(ari) %>%
  ggplot() +
  geom_line(aes(x = n_gene, y = ari))
```
## k-means clustering
The `kmeans()` function performs a k-means clustering analysis from a distance matrix.

```{r}
data_kmeans <- kmeans(data_dist, centers = 9)
```
We want to compare the cell annotation to our clustering.

```{r}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = cell_annotation
  )
```

Make the same plot colored by your k-means clusters.
Solution
```{r}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(data_kmeans$cluster)
  )
```
Compute the ARI of the k-means clustering.

Solution
```{r}
adjustedRandIndex(
  data_kmeans$cluster,
  cell_annotation
)
```
Maybe the real number of clusters in the PC data is not 9. Two classical criteria can help choose the number of clusters $k$ (a worked sketch of both follows this list):

- WSS: the within-cluster sum of squared errors (each point vs. its cluster centroid) for different values of $k$: $\sum_{i=1}^n (x_i - c_i)^2$, where $c_i$ is the centroid of the cluster containing point $x_i$.
- The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation): $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, with $a(i)$ the mean distance between $i$ and cells in the same cluster and $b(i)$ the mean distance between $i$ and cells in different clusters. We plot the average $\frac{1}{n}\sum_{i=1}^n s(i)$.
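Before automating the search over $k$, it can help to compute both criteria once by hand. Here is a sketch for the single value $k = 9$ on the first 3 PCs (it assumes the `cluster` package, which ships with R, for `silhouette()`):

```{r}
library(cluster) # provides silhouette()
km <- kmeans(data_pca$x[, 1:3], centers = 9)
# WSS: squared distance of each point to the centroid of its cluster
sum((data_pca$x[, 1:3] - km$centers[km$cluster, ])^2)
km$tot.withinss # kmeans() reports the same quantity
# average silhouette width over all cells
sil <- silhouette(km$cluster, dist(data_pca$x[, 1:3]))
mean(sil[, "sil_width"])
```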
Solution

For the total within-cluster sum of squared errors:

```{r}
fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "wss")
```

For the average silhouette width:

```{r}
fviz_nbclust(data_pca$x[, 1:3], kmeans, method = "silhouette")
```
Do the same for a hierarchical clustering (the `hcut()` function).

Solution

For the total within-cluster sum of squared errors:

```{r}
fviz_nbclust(data_pca$x[, 1:3], hcut, method = "wss")
```

For the average silhouette width:

```{r}
fviz_nbclust(data_pca$x[, 1:3], hcut, method = "silhouette")
```
## Graph-based clustering
We are going to use the `cluster_louvain()` function to perform a graph-based clustering. This function takes as input an undirected graph instead of a distance matrix. The `nng()` function computes a k-nearest neighbor (kNN) graph; with the `mutual = T` option, this graph is undirected.
```{r}
data_knn <- data_dist %>%
  as.matrix() %>%
  nng(k = 30, mutual = T)
```

Compare the graphs obtained with `mutual = T` and `mutual = F`.
Solution

```{r, eval=F}
data_knn <- data_dist %>%
  as.matrix() %>%
  nng(k = 30, mutual = T)
data_knn_F <- data_dist %>%
  as.matrix() %>%
  nng(k = 30, mutual = F)
str(data_knn)
str(data_knn_F)
```
Why do we need a kNN graph?
The `cluster_louvain()` function implements the multi-level modularity optimization algorithm for finding community structure in a graph. Use this function on `data_knn` to create a `data_louvain` variable.
You can check the clustering results with `membership(data_louvain)`.
<div class="pencadre">
For which `resolution` value do you get 9 clusters ?
</div>
Solution

```{r}
data_louvain <- data_knn %>%
  cluster_louvain(resolution = 0.41)
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(membership(data_louvain))
  )
```
Compute the ARI of the Louvain clustering.

Solution
```{r}
adjustedRandIndex(
  membership(data_louvain),
  cell_annotation
)
```
## Graph-based dimension reduction
Uniform Manifold Approximation and Projection (UMAP) is an algorithm for dimension reduction. Its details are described by McInnes, Healy, and Melville, and its official implementation is available through the Python package umap-learn.
```{r}
library(umap)
data_umap <- umap(data_pca$x[, 1:10])
data_umap$layout %>%
  as_tibble(.name_repair = "universal") %>%
  mutate(cell_type = cell_annotation) %>%
  ggplot() +
  geom_point(aes(x = ...1, y = ...2, color = cell_type))
```
The `.Rmd` file corresponding to this page is available here under the AGPL-3 License.
## Implementing your own k-means clustering algorithm

You are going to implement your own k-means clustering algorithm. Starting from k randomly chosen centroids, iterate the following two steps until the assignments stop changing:

- Assign each point to the cluster with the nearest centroid
- Compute the new cluster centroids

Justify each of your functions. A minimal sketch is given below.
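Here is a minimal sketch of one possible implementation (Lloyd's algorithm). The function name `kmeans_example` matches the test code that follows; the `max_iter` safeguard and the random initialization scheme are choices of this sketch, not requirements of the exercise:

```{r}
kmeans_example <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # initialization: pick k distinct points at random as starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # step 1: assign each point to the nearest centroid
    # d2[i, j] = squared Euclidean distance between point i and centroid j
    d2 <- apply(centroids, 1, function(m) colSums((t(x) - m)^2))
    new_cluster <- max.col(-d2) # argmin over centroids for each point
    if (all(new_cluster == cluster)) break # assignments stable: converged
    cluster <- new_cluster
    # step 2: recompute each centroid as the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  cluster
}
```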
You can check your implementation on the first 2 PCs:

```{r}
data_pca %>%
  fviz_pca_ind(
    geom = "point",
    col.ind = as.factor(kmeans_example(data_pca$x[, 1:2], k = 9))
  )
```