Verified Commit 3065c700 authored by Laurent Modolo

4_clustering: update

parent 782394bd
Pipeline #330 failed with stage
in 47 seconds
......@@ -24,26 +24,21 @@ classoption: aspectratio=169
1. Single-cell RNASeq data from 10X Sequencing (Friday 3 June 2022 - 14:00)
2. Normalization and spurious effects (Wednesday 8 June 2022 - 14:00)
3. Dimension reduction and data visualization (Monday 13 June 2022 - 15:00)
4. Clustering and annotation (Thursday 23 June 2022 - 14:00)
5. Pseudo-time and velocity inference (Thursday 30 June 2022 - 14:00)
6. Differential expression analysis (Friday 8 July 2022 - 14:00)
4. Clustering and annotation (Thursday 30 June 2022 - 14:00)
5. Pseudo-time and velocity inference (Friday 8 July 2022 - 14:00)
6. Differential expression analysis (Monday 11 July 2022 - 15:30)
## Programme
1. Single-cell RNASeq data from 10X Sequencing (Friday 3 June 2022 - 14:00)
2. Normalization and spurious effects (Wednesday 8 June 2022 - 14:00)
3. Dimension reduction and data visualization (Monday 13 June 2022 - 15:00)
4. Clustering and annotation (Thursday 23 June 2022 - 14:00)
- types of clustering
- distance
- k-means
- hclust
- Louvain
- Supervised Clustering
- Cell-type annotation
- Detection of rare cell-type
5. Pseudo-time and velocity inference (Thursday 30 June 2022 - 14:00)
6. Differential expression analysis (Friday 8 July 2022 - 14:00)
4. Clustering and annotation (Thursday 30 June 2022 - 14:00)
- Distances
- Clustering
- Classification
5. Pseudo-time and velocity inference (Friday 8 July 2022 - 14:00)
6. Differential expression analysis (Monday 11 July 2022 - 15:30)
## Different kinds of clustering
......@@ -156,6 +151,145 @@ weighted by $P$
\[W_p(\mu,\nu):=\left( \inf_{\pi\in\Pi(\mu,\nu)} \int_{\mathcal X} d(x,y)^p \mathrm{d}\pi(x,y) \right)^{1/p}\]
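For two one-dimensional samples of the same size and $p \geq 1$, the infimum above is reached by matching the sorted values, so the distance can be computed directly. The following is a minimal sketch under that assumption; the function name `wasserstein_1d` and the toy vectors are illustrative and not part of the course material.

```r
# Empirical p-Wasserstein distance between two 1-D samples of equal size.
# In one dimension the optimal coupling pairs the sorted values.
wasserstein_1d <- function(a, b, p = 1) {
  stopifnot(length(a) == length(b), p >= 1)
  mean(abs(sort(a) - sort(b))^p)^(1 / p)
}

a <- c(0, 1, 2)
b <- c(3, 1, 2) # sorted: 1, 2, 3, i.e. `a` shifted by one unit
w1 <- wasserstein_1d(a, b, p = 1)
w2 <- wasserstein_1d(a, b, p = 2)
```

Both `w1` and `w2` equal 1 here because every sorted value is shifted by exactly one unit.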
## Cell to cell correlations
### Pearson $0.7992528$
```{r cor_pearson, include=F, echo=F, warning=F, message=F, cache = T}
library(tidyverse)
library(copula)
cop <- normalCopula(param = 0.8, dim = 2)
x <- rMvdc(
  1000,
  mvdc(
    cop,
    margins = c("norm", "gamma"),
    paramMargins = list(c(0, 1), c(3, 5))
  )
) %>%
  # braces stop magrittr from also passing the matrix as the first argument
  {
    tibble(
      x1 = .[, 1],
      x2 = .[, 2]
    )
  }
x %>%
ggplot(aes(x = x1, y = x2)) +
geom_point(alpha = 0) +
geom_density2d_filled() +
geom_smooth(method = "lm", se = F) +
theme_bw() +
theme(legend.position = "none")
ggsave("img/cor_pearson.pdf", width = 4, height = 4)
```
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/cor_pearson}
\end{center}
\column{0.5\textwidth}
\[\rho_{X,Y}= \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}\]
where:
\begin{itemize}
\item $\operatorname{cov}$ is the covariance
\item $\sigma_X$ is the standard deviation of $X$
\item $\sigma_Y$ is the standard deviation of $Y$
\end{itemize}
\end{columns}
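The formula above can be checked against R's built-in `cor()`. This is a small sketch on toy vectors (`a` and `b` are made-up values, not the simulated data of the slide).

```r
# Pearson correlation written exactly as the formula: cov(X, Y) / (sd(X) * sd(Y))
pearson <- function(a, b) cov(a, b) / (sd(a) * sd(b))

a <- c(1, 2, 3, 4, 5)
b <- c(2, 1, 4, 3, 5)

rho_manual <- pearson(a, b)
rho_builtin <- cor(a, b, method = "pearson")
```

On these values both computations give $0.8$.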
## Cell to cell correlations
### Spearman $0.8169925$
```{r cor_spearman, include=F, echo=F, warning=F, message=F, cache = T}
x %>%
mutate(
r1 = rank(x1),
r2 = rank(x2)
) %>%
ggplot(aes(x = r1, y = r2)) +
geom_point(alpha = 0) +
geom_density2d_filled() +
geom_smooth(method = "lm", se = F) +
theme_bw() +
theme(legend.position = "none")
ggsave("img/cor_spearman.pdf", width = 4, height = 4)
```
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/cor_spearman}
\end{center}
\column{0.5\textwidth}
\[
r_s =
\rho_{\operatorname{R}(X),\operatorname{R}(Y)} =
\frac{\operatorname{cov}(\operatorname{R}(X), \operatorname{R}(Y))}
{\sigma_{\operatorname{R}(X)} \sigma_{\operatorname{R}(Y)}},
\]
where:
\begin{itemize}
\item $\operatorname{R}(X)$ is the rank of $X$
\item $\operatorname{R}(Y)$ is the rank of $Y$
\item $\operatorname{cov}$ is the covariance
\item $\sigma_{\operatorname{R}(X)}$ is the standard deviation of the ranks of $X$
\item $\sigma_{\operatorname{R}(Y)}$ is the standard deviation of the ranks of $Y$
\end{itemize}
\end{columns}
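Since the Spearman coefficient is the Pearson coefficient of the ranks, the two can be compared in a few lines. The vectors below are toy values chosen to be monotone but not linear.

```r
a <- c(1, 2, 3, 4, 5)
b <- c(1, 8, 27, 64, 200) # increasing, but far from a straight line

# Pearson correlation computed on the ranks
spearman_manual <- cor(rank(a), rank(b), method = "pearson")
spearman_builtin <- cor(a, b, method = "spearman")

# Pearson on the raw values is lower, because the relation is not linear
pearson_raw <- cor(a, b, method = "pearson")
```

Here the Spearman coefficient is exactly $1$, while the Pearson coefficient on the raw values stays below $1$.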
## Cell to cell correlations
### Kendall $0.6231151$
```{r cor_kendall, include=F, echo=F, warning=F, message=F, cache = T}
x %>%
ggplot(aes(x = x1, y = x2)) +
geom_point(alpha = 0) +
geom_density2d_filled() +
theme_bw() +
geom_rect(aes(xmin = 0, ymin = 5, xmax = max(x1), ymax = max(x2)), color = "red", fill = NA) +
geom_rect(aes(xmax = 0, ymax = 5, xmin = min(x1), ymin = min(x2)), color = "red", fill = NA) +
geom_point(aes(x = 0, y = 5), size = 2) +
geom_label(aes(x = mean(c(0, max(x1))), y = mean(c(5, max(x2))), label = "concordant pairs"), color = "red") +
geom_label(aes(x = mean(c(0, min(x1))), y = mean(c(5, min(x2))), label = "concordant pairs"), color = "red") +
geom_label(aes(x = mean(c(0, min(x1))), y = mean(c(5, max(x2))), label = "discordant pairs")) +
geom_label(aes(x = mean(c(0, max(x1))), y = mean(c(5, min(x2))), label = "discordant pairs")) +
theme(legend.position = "none")
ggsave("img/cor_kendall.pdf", width = 4, height = 4)
```
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/cor_kendall}
\end{center}
\column{0.5\textwidth}
\[
\tau = \frac{(A) - (B)}{
{n \choose 2} }
\]
where:
\begin{itemize}
\item $A$ is the number of concordant pairs
\item $B$ is the number of discordant pairs
\item ${n \choose 2} = {n (n-1) \over 2}$ is the binomial coefficient for the number of ways to choose $2$ items from $n$ items.
\end{itemize}
\end{columns}
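Counting the concordant and discordant pairs by hand reproduces the result of `cor(..., method = "kendall")` when there are no ties. A minimal sketch on toy vectors (the helper name `kendall_tau` is illustrative):

```r
# Kendall's tau from the definition: (concordant - discordant) / choose(n, 2)
kendall_tau <- function(a, b) {
  idx <- combn(length(a), 2) # all pairs of observations
  s <- sign(a[idx[1, ]] - a[idx[2, ]]) * sign(b[idx[1, ]] - b[idx[2, ]])
  (sum(s > 0) - sum(s < 0)) / ncol(idx)
}

a <- c(1, 2, 3, 4, 5)
b <- c(2, 1, 4, 3, 5)

tau_manual <- kendall_tau(a, b)
tau_builtin <- cor(a, b, method = "kendall")
```

Out of the 10 pairs, 8 are concordant and 2 are discordant, so $\tau = 0.6$.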
## SIMLR
### The choice of the distance metric is crucial
\begin{center}
\href{https://doi.org/10.1038/nmeth.4207}{
\includegraphics[width=\textwidth]{img/simlr.png}
}
\end{center}
## Curse of Dimensionality
\begin{center}
\begin{columns}
......@@ -173,6 +307,38 @@ Euclidian distances between 200 random cells
\end{columns}
\end{center}
## Curse of sparsity
\begin{center}
\begin{columns}
\column{0.55\textwidth}
\begin{center}
\[
\begin{bmatrix}
x_{1,1} & x_{1,2} \\
0 & 0 \\
0 & 0 \\
0 & x_{3,2} \\
0 & 0 \\
\vdots & \vdots \\
x_{c,1} & 0 \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
When we have a large number of $0$s:
\begin{itemize}
\item distances are low
\item correlations are high
\end{itemize}
We may need to filter out genes based on their number of $0$s
\end{columns}
\end{center}
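One possible way to filter genes on their number of $0$s is sketched below. It assumes a cells $\times$ genes count matrix as in the slide; the toy matrix and the threshold `max_zero_fraction` are hypothetical and should be adapted to the data.

```r
# Toy count matrix: 4 cells (rows) x 3 genes (columns)
counts <- matrix(
  c(5, 0, 0, 0,
    0, 0, 0, 3,
    2, 1, 0, 4),
  nrow = 4,
  dimnames = list(paste0("cell", 1:4), paste0("gene", 1:3))
)

# Fraction of cells in which each gene is not detected
zero_fraction <- colMeans(counts == 0)

# Keep genes detected in at least half of the cells (hypothetical threshold)
max_zero_fraction <- 0.5
filtered <- counts[, zero_fraction <= max_zero_fraction, drop = FALSE]
```

With this threshold only `gene3` is kept, since the two other genes are $0$ in three cells out of four.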
# Clustering
## Cell-to-cell distance
......@@ -388,7 +554,7 @@ At each step we merge clusters with their closest neighbor
## hclust algorithm
### choice of $k$ the number of clusters
### choice of $k$ the number of clusters: $k = 2$
\begin{center}
\href{https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/}{
......@@ -398,7 +564,7 @@ At each step we merge clusters with their closest neighbor
## hclust algorithm
### choice of $k$ the number of clusters
### choice of $k$ the number of clusters: $k = 3$
\begin{center}
\href{https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/}{
......@@ -406,31 +572,50 @@ At each step we merge clusters with their closest neighbor
}
\end{center}
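In R, the tree is built once with `hclust()` and then cut for different values of $k$ with `cutree()`. The points below are toy coordinates, and average linkage is only one possible choice.

```r
# Three well separated pairs of points in 2-D
pts <- rbind(
  c(0, 0), c(0, 1),
  c(5, 5), c(5, 6),
  c(12, 0), c(12, 1)
)

tree <- hclust(dist(pts), method = "average")

# The same tree gives different partitions depending on k
clusters_k2 <- unname(cutree(tree, k = 2))
clusters_k3 <- unname(cutree(tree, k = 3))
```

With $k = 3$ each pair forms its own cluster; with $k = 2$ the two closest pairs are merged.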
## Neighborhood graph
# Neighborhood graph
## $k$-NN graph
### Instead of considering the relations between every pair of cells, we can focus on the $k$ nearest neighbors
\begin{center}
\href{http://www.biomedcentral.com/1471-2164/13/S7/S27}{
\includegraphics[width=\textwidth]{img/knn_k2.png}
}
k = 2
\end{center}
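A $k$-NN graph can be derived from a distance matrix by keeping, for each cell, only its $k$ closest cells. The sketch below uses base R and Euclidean distances; the helper `knn_adjacency` and the toy points are illustrative, and dedicated packages would be preferable on real data.

```r
# Directed k-NN adjacency matrix: adj[i, j] == 1 if j is one of the
# k nearest neighbors of i
knn_adjacency <- function(points, k) {
  d <- as.matrix(dist(points))
  diag(d) <- Inf # a cell is not its own neighbor
  n <- nrow(d)
  adj <- matrix(0L, n, n)
  for (i in seq_len(n)) {
    adj[i, order(d[i, ])[seq_len(k)]] <- 1L
  }
  adj
}

# Two groups of three cells on a line
pts <- cbind(c(0, 1, 2, 10, 11, 12))
adj <- knn_adjacency(pts, k = 2)
```

Every row of `adj` contains exactly $k = 2$ edges, and no edge links the two groups.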
## SNN graph
### The number of shared neighbors allows us to put weights on the graph
\begin{center}
\href{http://www.biomedcentral.com/1471-2164/13/S7/S27}{
\includegraphics[width=\textwidth]{img/knn_k2.png}
}
k = 2
\end{center}
## Louvain algorithm
### Move nodes
### Move nodes to optimize the modularity
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_1.png}
\includegraphics[width=0.7\textwidth]{img/louvain_1.png}
}
\end{center}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_2.png}
\includegraphics[width=0.7\textwidth]{img/louvain_2.png}
}
\end{center}
\end{columns}
The modularity measures the density of links inside communities compared to links between communities
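The modularity of a partition can be computed directly from the adjacency matrix with the usual definition $Q = \frac{1}{2m}\sum_{i,j}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)$. The sketch below is a base R illustration on a toy undirected graph (two triangles joined by one edge); it only scores a given partition and does not implement the Louvain moves themselves.

```r
# Modularity of a partition of an undirected, unweighted graph
modularity_q <- function(adj, membership) {
  m2 <- sum(adj)                       # 2m: each edge is counted twice
  deg <- rowSums(adj)                  # node degrees k_i
  same <- outer(membership, membership, "==")
  sum((adj - outer(deg, deg) / m2) * same) / m2
}

# Two triangles (1-2-3 and 4-5-6) connected by the edge 3-4
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
adj <- matrix(0, 6, 6)
adj[edges] <- 1
adj[edges[, 2:1]] <- 1 # symmetric adjacency matrix

q_two_communities <- modularity_q(adj, c(1, 1, 1, 2, 2, 2))
q_one_community <- modularity_q(adj, rep(1, 6))
```

Splitting the graph into its two triangles gives $Q = 5/14 \approx 0.357$, while putting every node in a single community gives $Q = 0$.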
## Louvain algorithm
### Aggregate
......@@ -439,41 +624,219 @@ At each step we merge clusters with their closest neighbor
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_2.png}
\includegraphics[width=0.7\textwidth]{img/louvain_2.png}
}
\end{center}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_3.png}
\includegraphics[width=0.7\textwidth]{img/louvain_3.png}
}
\end{center}
\end{columns}
The modularity measures the density of links inside communities compared to links between communities
## Louvain algorithm
### Move nodes
### Move nodes to optimize the modularity
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_3.png}
\includegraphics[width=0.7\textwidth]{img/louvain_3.png}
}
\end{center}
\column{0.5\textwidth}
\begin{center}
\href{https://www.nature.com/articles/s41598-019-41695-z/}{
\includegraphics[width=\textwidth]{img/louvain_4.png}
\includegraphics[width=0.7\textwidth]{img/louvain_4.png}
}
\end{center}
\end{columns}
The modularity measures the density of links inside communities compared to links between communities
## Validation methods
\begin{center}
Partition the data in two and compare the two clusterings
\end{center}
### Adjusted Rand Index (ARI)
**ARI** is close to $0$ for a poor matching (a random clustering, and it can be negative) and equals $1$ for a perfect agreement
### Adjusted mutual information (AMI)
**AMI** takes a value of $1$ when the two clusterings are identical and $0$ when the **MI** between two partitions equals the value expected due to chance alone.
### V-measure
Harmonic mean of the **homogeneity** (each cluster contains only samples of a single class) and the **completeness** (all the samples of a given class are put in the same cluster by the clustering algorithm)
\begin{center}
Cells which flip assignment must be labeled as ambiguous
\end{center}
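As an illustration of the first index, the ARI can be computed from the contingency table of two clusterings. This is a base R sketch of the standard formula; the function name `adjusted_rand_index` and the label vectors are illustrative.

```r
# Adjusted Rand Index between two label vectors of the same cells
adjusted_rand_index <- function(a, b) {
  tab <- table(a, b)
  n_pairs <- function(n) n * (n - 1) / 2 # choose(n, 2)
  sum_cells <- sum(n_pairs(tab))
  sum_rows <- sum(n_pairs(rowSums(tab)))
  sum_cols <- sum(n_pairs(colSums(tab)))
  expected <- sum_rows * sum_cols / n_pairs(length(a))
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Same partition with swapped label names: perfect agreement
ari_same <- adjusted_rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))

# Partitions that systematically disagree: worse than chance
ari_opposed <- adjusted_rand_index(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

The first comparison gives $1$ (the index ignores the names of the labels) and the second gives $-0.5$, which shows that the ARI can fall below $0$.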
## SC3
\begin{center}
\href{https://www.nature.com/articles/nmeth.4236}{
\includegraphics[width=\textwidth]{img/sc3.png}
}
\end{center}
## Mixture model
### We model the probability of belonging to one group
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\href{https://en.wikipedia.org/wiki/Mixture_model}{
\includegraphics[width=0.7\textwidth]{img/mixture_model.png}
}
\end{center}
\column{0.5\textwidth}
We can fit
\[p(x, \theta) = \sum_{i=1}^{K}\alpha_i p_i(x, \theta_{i})\]
with
\begin{itemize}
\item $K$ the number of clusters
\item $\alpha_i$ the proportion of cluster $i$
\item $\theta = (\theta_1, \dots, \theta_K)$ the model parameters, with $\theta_i$ the parameters of cluster $i$
\end{itemize}
\end{columns}
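Such a model is typically fitted with the EM algorithm. The sketch below assumes one-dimensional Gaussian components ($\theta_i = (\mu_i, \sigma_i)$), $K = 2$ and simulated data; these are illustrative choices, and real expression data would call for other distributions and a dedicated package.

```r
set.seed(1)
# Simulated expression of one gene in two groups of cells
expr <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 6, sd = 1))

K <- 2
alpha <- rep(1 / K, K)                          # mixing proportions
mu <- as.numeric(quantile(expr, c(0.25, 0.75))) # initial means
sigma <- rep(sd(expr), K)                       # initial standard deviations

for (iter in 1:100) {
  # E-step: probability of each cell belonging to each component
  dens <- sapply(seq_len(K), function(i) alpha[i] * dnorm(expr, mu[i], sigma[i]))
  resp <- dens / rowSums(dens)
  # M-step: update the parameters from the weighted cells
  nk <- colSums(resp)
  alpha <- nk / length(expr)
  mu <- colSums(resp * expr) / nk
  centered <- expr - rep(mu, each = length(expr))
  sigma <- sqrt(colSums(resp * centered^2) / nk)
}

# Hard assignment: the most probable component for each cell
cluster <- apply(resp, 1, which.max)
```

The matrix `resp` holds the probability of belonging to each group, which is the quantity modeled on the slide; `cluster` is only a hard summary of it.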
# Rand index
## Mixture model
### We cannot directly cluster the multidimensional distribution of gene expression
partition the data in two and compare the two clusterings
\begin{center}
\href{https://doi.org/10.1093/bioinformatics/btac136}{
\includegraphics[width=\textwidth]{img/nnIFA_1.png}
}
\end{center}
cells which flip assignment must be labeled as ambiguous
## Mixture model
### We cannot directly cluster the multidimensional distribution of gene expression
\begin{center}
\href{https://doi.org/10.1093/bioinformatics/btac136}{
\includegraphics[width=\textwidth]{img/nnIFA_2.png}
}
\end{center}
## Mixture model
### $X \simeq W H$
\begin{center}
\href{https://arxiv.org/abs/2104.13171}{
\includegraphics[width=\textwidth]{img/ssNMF.png}
}
\end{center}
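The factorization $X \simeq W H$ with non-negative factors can be approximated with multiplicative updates. The sketch below is the plain (unsupervised) NMF with Frobenius loss on simulated data, not the semi-supervised variant of the figure; the matrix sizes, the rank `k` and the number of iterations are arbitrary assumptions.

```r
set.seed(1)
# Simulated non-negative matrix of rank 2 (20 rows x 10 columns)
W_true <- matrix(runif(20 * 2), 20, 2)
H_true <- matrix(runif(2 * 10), 2, 10)
X <- W_true %*% H_true

k <- 2
W <- matrix(runif(nrow(X) * k), nrow(X), k)
H <- matrix(runif(k * ncol(X)), k, ncol(X))
eps <- 1e-9 # avoids divisions by zero

err_start <- norm(X - W %*% H, "F")
for (iter in 1:500) {
  # Multiplicative updates keep W and H non-negative
  H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)
  W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)
}
err_end <- norm(X - W %*% H, "F")
```

The reconstruction error `err_end` is lower than `err_start`, and the largest entry in each row of `W` (or each column of `H`) can be read as a soft cluster assignment.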
## Mixture model
### scDeepCluster
\begin{center}
\href{https://www.nature.com/articles/s42256-019-0037-0/}{
\includegraphics[width=\textwidth]{img/scDeepCluster.png}
}
\end{center}
# Annotation
## SingleR (Spearman correlation)
\begin{center}
\href{https://www.nature.com/articles/s41590-018-0276-y}{
\includegraphics[width=0.9\textwidth]{img/rsingler.png}
}
\end{center}
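The core idea of a correlation-based annotation can be sketched as follows: each cell receives the label of the reference profile with which it has the highest Spearman correlation. This is a simplified illustration on made-up values, not the actual SingleR implementation (which adds gene selection and iterative fine-tuning); the gene, cell and cell-type names are hypothetical.

```r
# Reference expression profiles: genes (rows) x cell types (columns)
reference <- cbind(
  T_cell = c(10, 8, 0, 1, 0),
  B_cell = c(0, 1, 9, 10, 2)
)
rownames(reference) <- paste0("gene", 1:5)

# Cells to annotate: genes (rows) x cells (columns)
query <- cbind(
  cellA = c(7, 9, 1, 0, 0),
  cellB = c(1, 0, 5, 8, 1)
)

# Spearman correlation of every cell with every reference profile
scores <- cor(query, reference, method = "spearman")

# Label of the best correlated reference for each cell
labels <- colnames(reference)[apply(scores, 1, which.max)]
```

Here `cellA` is labeled `T_cell` and `cellB` is labeled `B_cell`.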
## Cell-ID (multiple correspondence analysis)
\begin{center}
\href{https://doi.org/10.1038/s41587-021-00896-6}{
\includegraphics[width=0.7\textwidth]{img/cell_id.png}
}
\end{center}
## Harmony
\begin{center}
\href{https://doi.org/10.1038/s41592-019-0619-0}{
\includegraphics[width=\textwidth]{img/harmony_1.png}
}
\end{center}
## Harmony
\begin{center}
\href{https://doi.org/10.1038/s41592-019-0619-0}{
\includegraphics[width=0.7\textwidth]{img/harmony_2.png}
}
\end{center}
## GNN
\begin{center}
\href{https://distill.pub/2021/gnn-intro/}{
\includegraphics[width=\textwidth]{img/GNN_1.png}
}
\end{center}
## GNN
\begin{center}
\href{https://distill.pub/2021/gnn-intro/}{
\includegraphics[width=\textwidth]{img/GNN_2.png}
}
\end{center}
## GNN
\begin{center}
\href{https://distill.pub/2021/gnn-intro/}{
\includegraphics[width=\textwidth]{img/GNN_3.png}
}
\end{center}
## GNN
\begin{center}
\href{https://distill.pub/2021/gnn-intro/}{
\includegraphics[width=\textwidth]{img/GNN_4.png}
}
\end{center}
## GNN
\begin{center}
\href{https://distill.pub/2021/gnn-intro/}{
\includegraphics[width=\textwidth]{img/GNN_5.png}
}
\end{center}
## GNN
### scGAE
\begin{center}
\href{https://doi.org/10.1101/2021.02.16.431357}{
\includegraphics[width=\textwidth]{img/scGAE.png}
}
\end{center}
# single-cell RNA-Seq pseudo-time and velocity inference *Friday 8 July 2022*
# References