---
title: "single-cell RNA-Seq data: Clustering"
title: "single-cell RNA-Seq data: Dimension reduction"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "Friday 3 June 2022"
date: "Thursday 30 June 2022"
output:
  beamer_presentation:
    df_print: tibble
classoption: aspectratio=169
---
# Introduction
## Programme
1. Single-cell RNASeq data from 10X Sequencing (Friday 3 June 2022 - 14:00)
2. Normalization and spurious effects (Wednesday 8 June 2022 - 14:00)
3. Dimension reduction and data visualization (Monday 13 June 2022 - 15:00)
4. Clustering and annotation (Thursday 23 June 2022 - 14:00)
    - types of clustering
    - distance
    - k-means
    - hclust
    - Louvain
    - Supervised Clustering
    - Cell-type annotation
    - Detection of rare cell types
5. Pseudo-time and velocity inference (Thursday 30 June 2022 - 14:00)
6. Differential expression analysis (Friday 8 July 2022 - 14:00)
## Different kinds of clustering
\includegraphics[width=\textwidth]{img/learning_type.png}
# Distances
## Cell-to-cell distance
\begin{center}
\[
X_{genes \times cells} =
\begin{bmatrix}
x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,c} \\
x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,c} \\
x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,c} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & x_{n,3} & \cdots & x_{n,c} \\
\end{bmatrix}
\]
\end{center}
We have roughly $25$ to $34 \times 10^3$ rows (genes or transcripts) and up to $10^6$ columns (cells)
## Cell-to-cell distance
\begin{center}
\[
D_{cells \times cells} =
\begin{bmatrix}
d_{1,1} & d_{1,2} & d_{1,3} & \cdots & d_{1,c} \\
d_{2,1} & d_{2,2} & d_{2,3} & \cdots & d_{2,c} \\
d_{3,1} & d_{3,2} & d_{3,3} & \cdots & d_{3,c} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
d_{c,1} & d_{c,2} & d_{c,3} & \cdots & d_{c,c} \\
\end{bmatrix}
\]
\end{center}
We have up to $10^6$ rows and $10^6$ columns (cells)
## Classical distance
### 3 properties of a distance
\includegraphics[width=\textwidth]{img/distance.png}
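For reference, the three standard properties (presumably those illustrated above) are:
\begin{itemize}
\item $d(x,y) \geq 0$ and $d(x,y) = 0 \iff x = y$ (identity)
\item $d(x,y) = d(y,x)$ (symmetry)
\item $d(x,z) \leq d(x,y) + d(y,z)$ (triangle inequality)
\end{itemize}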
## Classical distance
### Manhattan $\sum_{i=1}^n |x_i-y_i|$
\begin{center}
\includegraphics[width=0.5\textwidth]{img/manhattan.png}
\end{center}
## Classical distance
### Euclidean $\sqrt{\sum_{i=1}^n (x_i-y_i)^2}$
\begin{center}
\includegraphics[width=0.5\textwidth]{img/euclidienne.png}
\end{center}
## Classical distance
### Minkowski ${\left(\sum_{i=1}^n |x_i-y_i|^p\right)^{\frac{1}{p}}}$
\begin{center}
\includegraphics[width=0.7\textwidth]{img/hermann-minkowski.png}
\end{center}
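A minimal sketch in base R, assuming a toy matrix `x` of simulated cells; all three distances are available through `dist()`:

```{r, eval=FALSE}
# Toy data: 5 cells (rows) described by 10 genes (columns)
set.seed(1)
x <- matrix(rnorm(50), nrow = 5)

dist(x, method = "manhattan")         # sum of absolute differences
dist(x, method = "euclidean")         # the default
dist(x, method = "minkowski", p = 3)  # general L^p distance
```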
## Statistical divergence
### Kullback-Leibler divergence $D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$
The amount of information lost when $Q$ is used to approximate $P$
\begin{columns}
\column{0.3\textwidth}
\begin{itemize}
\item $\log \frac{P(i)}{Q(i)} = 0$ when $P = Q$
\item $\log \frac{P(i)}{Q(i)} > 0$ when $P > Q$
\item $\log \frac{P(i)}{Q(i)} < 0$ when $P < Q$
\end{itemize}
weighted by $P$
\column{0.7\textwidth}
\begin{center}
\href{https://lilianweng.github.io/posts/2018-08-12-vae/}{
\includegraphics[width=\textwidth]{img/KL.png}
}
\end{center}
\end{columns}
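The divergence is a one-liner in R; a sketch assuming two discrete distributions `p` and `q` over the same support:

```{r, eval=FALSE}
p <- c(0.1, 0.4, 0.5)
q <- c(0.3, 0.3, 0.4)
sum(p * log(p / q))  # D_KL(P || Q)
sum(q * log(q / p))  # D_KL(Q || P): generally different, KL is not symmetric
```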
## Statistical distance
### Kantorovich / Wasserstein: optimal transport
\vspace{-1em}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\href{https://lilianweng.github.io/posts/2018-08-12-vae/}{
\includegraphics[width=\textwidth]{img/kantorovich_1.png}
}
\end{center}
\column{0.5\textwidth}
\begin{center}
\href{https://lilianweng.github.io/posts/2018-08-12-vae/}{
\includegraphics[width=\textwidth]{img/kantorovich_2.png}
}
\end{center}
\end{columns}
\vspace{-1em}
\[W_p(\mu,\nu):=\left( \inf_{\pi\in\Pi(\mu,\nu)} \int_{\mathcal X \times \mathcal X} d(x,y)^p \,\mathrm{d}\pi(x,y) \right)^{1/p}\]
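For intuition, a hedged sketch in R: for two equal-size samples in one dimension, the empirical $W_1$ distance reduces to the mean absolute difference between sorted values (the CRAN package `transport` offers `wasserstein1d()` for the general 1D case; its use here is an assumption, not the course's toolchain):

```{r, eval=FALSE}
set.seed(1)
a <- rnorm(100, mean = 0)
b <- rnorm(100, mean = 2)
# Empirical 1-Wasserstein distance for equal-size samples:
mean(abs(sort(a) - sort(b)))
# With the transport package (if installed): transport::wasserstein1d(a, b, p = 1)
```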
## Curse of Dimensionality
\begin{center}
\begin{columns}
\column{0.55\textwidth}
\href{https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages}{
\includegraphics[width=\textwidth]{img/curse_of_dimensionality.png}
}
Euclidean distances between 200 random cells
\column{0.5\textwidth}
\begin{itemize}
\item In the Euclidean space, the density is the number of points per unit volume
\item As dimensionality increases, the volume increases rapidly
\item Unless the number of points increases exponentially with dimensionality, the density tends to 0
\end{itemize}
\end{columns}
\end{center}
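A quick simulation, as a sketch, reproduces this concentration effect: as dimension grows, pairwise distances become large but nearly identical, so they lose contrast:

```{r, eval=FALSE}
set.seed(1)
for (d in c(2, 10, 100, 1000)) {
  x <- matrix(runif(200 * d), nrow = 200)  # 200 random cells in dimension d
  dd <- dist(x)                            # pairwise Euclidean distances
  cat(sprintf("dim %4d: mean %6.2f, sd/mean %.3f\n",
              d, mean(dd), sd(dd) / mean(dd)))
}
```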
# Clustering
## Cell-to-cell distance
### We want to find clusters of *similar* cells
\begin{center}
\[
D_{cells \times cells} =
\begin{bmatrix}
d_{1,1} & d_{1,2} & d_{1,3} & \cdots & d_{1,c} \\
d_{2,1} & d_{2,2} & d_{2,3} & \cdots & d_{2,c} \\
d_{3,1} & d_{3,2} & d_{3,3} & \cdots & d_{3,c} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
d_{c,1} & d_{c,2} & d_{c,3} & \cdots & d_{c,c} \\
\end{bmatrix}
\]
\end{center}
We have up to $10^6$ rows and $10^6$ columns (cells)
## k-means algorithm
### Finding $k$ clusters of *similar* cells
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\href{https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages}{
\includegraphics[width=\textwidth]{img/kmean_1.png}
}
\column{0.5\textwidth}
The algorithm randomly chooses a centroid for each cluster. In our example, we choose a $k$ of 3, and therefore the algorithm randomly picks 3 centroids.
\end{columns}
\end{center}
## k-means algorithm
### Finding $k$ clusters of *similar* cells
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\href{https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages}{
\includegraphics[width=\textwidth]{img/kmean_2.png}
}
\column{0.5\textwidth}
The algorithm assigns each point to the closest centroid to get initial clusters.
\end{columns}
\end{center}
## k-means algorithm
### Finding $k$ clusters of *similar* cells
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\href{https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages}{
\includegraphics[width=\textwidth]{img/kmean_3.png}
}
\column{0.5\textwidth}
For every cluster, the algorithm recomputes the centroid by taking the average of all points in the cluster.
Since the centroids change, the algorithm then re-assigns the points to the closest centroid.
\end{columns}
\end{center}
## k-means algorithm
### Finding $k$ clusters of *similar* cells
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\href{https://developers.google.com/machine-learning/clustering/algorithm/advantages-disadvantages}{
\includegraphics[width=\textwidth]{img/kmean_4.png}
}
\column{0.5\textwidth}
The algorithm repeats the calculation of centroids and assignment of points until points stop changing clusters.
When clustering large datasets, you stop the algorithm before reaching convergence, using other criteria instead.
\end{columns}
\end{center}
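In R, this whole loop is wrapped by `stats::kmeans()`; a minimal sketch on simulated 2D data (not the course dataset):

```{r, eval=FALSE}
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
km <- kmeans(x, centers = 3, nstart = 10)  # nstart = random restarts
table(km$cluster)  # cluster sizes
km$centers         # final centroids
```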
## k-means algorithm
### Choice of $k$, the number of clusters
### The Elbow Method
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\href{https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb}{
\includegraphics[width=\textwidth]{img/kmean_WSS.png}
}
\column{0.5\textwidth}
Compute the Within-Cluster Sum of Squared Errors (WSS, the squared distance of each point to its centroid) for different values of $k$, and pick the $k$ at the ``elbow'' where the curve stops dropping sharply
\end{columns}
\end{center}
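A sketch of the elbow curve, reusing the toy matrix `x` from the k-means example above:

```{r, eval=FALSE}
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b", xlab = "k", ylab = "WSS")  # look for the elbow
```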
## k-means algorithm
### Choice of $k$, the number of clusters
### The Silhouette Method
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\href{https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb}{
\includegraphics[width=\textwidth]{img/kmean_silhouette.png}
}
\column{0.5\textwidth}
The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).
\[s(i) = \frac{b(i) - a(i)}{\max\{a(i),b(i)\}} \]
with
\begin{itemize}
\item $a(i)$ the mean distance between $i$ and all other cells in the same cluster
\item $b(i)$ the smallest mean distance between $i$ and the cells of any other cluster
\end{itemize}
We plot $\frac{1}{n}\sum_{i=1}^n s(i)$
\end{columns}
\end{center}
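With the `cluster` package (an assumption; any silhouette implementation works), the mean silhouette per $k$ can be sketched as:

```{r, eval=FALSE}
library(cluster)
d <- dist(x)  # x: toy matrix from the k-means example
sil <- sapply(2:10, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
plot(2:10, sil, type = "b", xlab = "k", ylab = "mean silhouette")
```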
## k-means algorithm: clustree
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\href{https://academic.oup.com/gigascience/article/7/7/giy083/5052205?login=false}{
\includegraphics[width=0.8\textwidth]{img/clustree_1.png}
}
\column{0.5\textwidth}
\href{https://academic.oup.com/gigascience/article/7/7/giy083/5052205?login=false}{
\includegraphics[width=0.8\textwidth]{img/clustree_2.png}
}
\end{columns}
\end{center}
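The figures are from the clustree paper linked above; a hedged usage sketch, assuming one column of cluster labels per resolution, named with a common prefix:

```{r, eval=FALSE}
library(clustree)
assignments <- data.frame(
  k1 = kmeans(x, centers = 1, nstart = 10)$cluster,
  k2 = kmeans(x, centers = 2, nstart = 10)$cluster,
  k3 = kmeans(x, centers = 3, nstart = 10)$cluster,
  k4 = kmeans(x, centers = 4, nstart = 10)$cluster
)
clustree(assignments, prefix = "k")  # how cells move between clusterings as k grows
```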
## hclust algorithm
### We aggregate clusters using an inter-cluster distance
\begin{center}
\begin{columns}
\column{0.5\textwidth}
We start with clusters of one cell each.
At each step we merge the two closest clusters, with closeness defined by the linkage:
\begin{itemize}
\item {\bf Single linkage}
\item Complete linkage
\item Average linkage
\end{itemize}
\column{0.5\textwidth}
\href{https://www.r-bloggers.com/2017/12/how-to-perform-hierarchical-clustering-using-r/}{
\includegraphics[width=\textwidth]{img/hclust_single_linkage.png}
}
\end{columns}
\end{center}
## hclust algorithm
### We aggregate clusters using an inter-cluster distance
\begin{center}
\begin{columns}
\column{0.5\textwidth}
We start with clusters of one cell each.
At each step we merge the two closest clusters, with closeness defined by the linkage:
\begin{itemize}
\item Single linkage
\item {\bf Complete linkage}
\item Average linkage
\end{itemize}
\column{0.5\textwidth}
\href{https://www.r-bloggers.com/2017/12/how-to-perform-hierarchical-clustering-using-r/}{
\includegraphics[width=\textwidth]{img/hclust_complete_linkage.png}
}
\end{columns}
\end{center}
## hclust algorithm
### We aggregate clusters using an inter-cluster distance
\begin{center}
\begin{columns}
\column{0.5\textwidth}
We start with clusters of one cell each.
At each step we merge the two closest clusters, with closeness defined by the linkage:
\begin{itemize}
\item Single linkage
\item Complete linkage
\item {\bf Average linkage}
\end{itemize}
\column{0.5\textwidth}
\href{https://www.r-bloggers.com/2017/12/how-to-perform-hierarchical-clustering-using-r/}{
\includegraphics[width=\textwidth]{img/hclust_average_linkage.png}
}
\end{columns}
\end{center}
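All three linkages are available in base R through `hclust()`; a minimal sketch on the toy matrix `x`:

```{r, eval=FALSE}
d <- dist(x)
plot(hclust(d, method = "single"))    # nearest-neighbour linkage
plot(hclust(d, method = "complete"))  # farthest-neighbour linkage
plot(hclust(d, method = "average"))   # mean pairwise distance (UPGMA)
```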
## hclust algorithm
### Choice of $k$, the number of clusters
\begin{center}
\href{https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/}{
\includegraphics[width=0.7\textwidth]{img/hclust_k2.png}
}
\end{center}
## hclust algorithm
### Choice of $k$, the number of clusters
\begin{center}
\href{https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/}{
\includegraphics[width=0.7\textwidth]{img/hclust_k4.png}
}
\end{center}
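Cutting the dendrogram at a chosen $k$ (or at a height) is done with `cutree()`:

```{r, eval=FALSE}
hc <- hclust(dist(x), method = "average")
clusters <- cutree(hc, k = 4)  # or cut at a height: cutree(hc, h = 2)
table(clusters)
```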
## Neighborhood graph
## $k$-NN graph
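One possible construction in R, as a sketch: the CRAN packages `FNN` and `igraph` are assumptions here, not necessarily the course's toolchain (Bioconductor's `scran::buildSNNGraph()` is a common alternative):

```{r, eval=FALSE}
library(FNN)
library(igraph)
k <- 10
nn <- get.knn(x, k = k)$nn.index          # n x k matrix of neighbour indices
edges <- cbind(rep(seq_len(nrow(x)), k),  # each cell...
               as.vector(nn))             # ...linked to its k nearest neighbours
g <- simplify(graph_from_edgelist(edges, directed = FALSE))
```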
## Louvain algorithm
### Move nodes
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_1.png}
}
\end{center}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_2.png}
}
\end{center}
\end{columns}
## Louvain algorithm
### Aggregate
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_2.png}
}
\end{center}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_3.png}
}
\end{center}
\end{columns}
## Louvain algorithm
### Move nodes
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_3.png}
}
\end{center}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_4.png}
}
\end{center}
\end{columns}
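With the $k$-NN graph `g` sketched earlier, `igraph` provides a Louvain implementation:

```{r, eval=FALSE}
set.seed(1)
communities <- cluster_louvain(g)  # maximises modularity by moving and aggregating nodes
table(membership(communities))     # cluster label per cell
```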
# Rand index
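For reference, the standard definition: given two clusterings of $c$ cells, with $a$ the number of cell pairs grouped together in both and $b$ the number of pairs kept apart in both,
\[ RI = \frac{a + b}{\binom{c}{2}} \]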
# References
1. [Gibson, Greg. ‘Perspectives on Rigor and Reproducibility in Single Cell Genomics’. PLOS Genetics 18, no. 5 (10 May 2022): e1010210.](https://doi.org/10.1371/journal.pgen.1010210)