Verified Commit 0b77e007 authored by Laurent Modolo

3_dimension_reduciton update

parent b8f4ed5b
\]
\end{center}
We have $25$--$34 \times 10^{3}$ rows (genes or transcripts) and up to $10^6$ columns (cells)
## Geometric interpretation of $X_{cells \times genes}$
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
x_{3,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=0.6\textwidth]{img/cell_vector_1.png}
\end{center}
\end{columns}
\end{center}
## Geometric interpretation of $X_{cells \times genes}$
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
x_{3,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=0.6\textwidth]{img/cell_vector_2.png}
\end{center}
\end{columns}
\end{center}
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
x_{3,i} \\
\vdots \\
x_{n,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=0.6\textwidth]{img/cell_vector_2.png}
\end{center}
\end{columns}
We have $25$--$34 \times 10^{3}$ rows (genes or transcripts)
\end{center}
## Real dimension of counts data
\begin{center}
\includegraphics[width=0.6\textwidth]{img/cell_vector_3.png}
\end{center}
\begin{center}
Could cell vectors cover $\mathbb{R}_{+}^{n}$?
\end{center}
## Real dimension of counts data
\begin{center}
\includegraphics[width=0.6\textwidth]{img/cell_vector_4.png}
\end{center}
\begin{center}
Some cell transcription states cannot be reached
\end{center}
## Real dimension of counts data
\begin{center}
\includegraphics[width=0.6\textwidth]{img/cell_vector_5.png}
\end{center}
\begin{center}
The cell transcription space lies in a subset $\Omega \subset \mathbb{R}_{+}^{n}$ that is much smaller than $\mathbb{R}_{+}^{n}$
\end{center}
## Real dimension of counts data
\begin{center}
\includegraphics[height=4cm]{./img/wadington_landscape.png}
\end{center}
\begin{center}
There are multiple factors that define the transcription landscape
\end{center}
## Real dimension of counts data
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/two_gene_corr.png}
\end{center}
\end{columns}
\end{center}
## Real dimension of counts data
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/two_gene_corr_elispe.png}
\end{center}
\end{columns}
\end{center}
## What can we do about the dimension of the data?
$X_{cells \times genes}$, with $25$--$34 \times 10^{3}$ genes/transcripts and up to $10^6$ cells
\begin{itemize}
\item When the dimension increases, data become sparse
\item Neighbors are far away
\item Correlations become spurious
\item Algorithmic constraints
\item Numerical instabilities
\item Data are concentrated on local subspaces
\end{itemize}
Dimension reduction is mandatory for any analysis (clustering, visualization, inference)
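To make the first two points above concrete, here is a minimal sketch (toy uniform data, not from the original slides) of how nearest and farthest neighbors become almost indistinguishable as the dimension grows:

```python
# Toy illustration of distance concentration: the ratio between the
# nearest and the farthest neighbor distance tends toward 1 as the
# number of dimensions p grows.
import numpy as np

rng = np.random.default_rng(0)
for p in (2, 10, 100, 1000):
    X = rng.uniform(size=(500, p))
    d = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from the first point
    print(p, round(d.min() / d.max(), 3))     # approaches 1 in high dimension
```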
## Count matrices are sparse for scRNA-Seq
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{itemize}
\item low amount of mRNA
\item many count values are equal to zero
\end{itemize}
\begin{tikzpicture}[fill=blue,fill opacity=0.7,draw,scale=3,rounded corners=0.5pt]
\def\cubecol{blue}
\def\opacity{0.8}
\filldraw (-0.5,-0.5,-0.5) -- ++(1,0,0) -- ++(0,1,0) -- ++(-1, 0, 0) -- cycle;
\filldraw (-0.5,-0.5,-0.5) -- ++(1,0,0) -- ++(0,0,1) -- ++(-1, 0, 0) -- cycle;
\filldraw (-0.5,-0.5,-0.5) -- ++(0,1,0) -- ++(0,0,1) -- ++(0, -1, 0) -- cycle;
\filldraw[fill=blue!20] (0.5,0.5,0.5) -- ++(-1,0,0) -- ++(0,-1,0) -- ++(1, 0, 0) -- cycle;
\filldraw[fill=blue!50!black!50] (0.5,0.5,0.5) -- ++(-1,0,0) -- ++(0,0,-1) -- ++(1, 0, 0) -- cycle;
\filldraw[fill=blue!20!black!80] (0.5,0.5,0.5) -- ++(0,-1,0) -- ++(0,0,-1) -- ++(0, 1, 0) -- cycle;
\end{tikzpicture}
{\bf zeros $\neq$ dropouts}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/detection_vs_mean.png}
\end{center}
\end{columns}
\end{center}
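A minimal sketch of the two quantities behind the figure (per-gene detection rate versus mean expression), on toy negative binomial counts rather than real scRNA-Seq data:

```python
# Toy count matrix (cells x genes); real scRNA-Seq matrices are usually
# even sparser than this.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=1, p=0.7, size=(1000, 2000))

sparsity = (counts == 0).mean()             # overall fraction of zeros
detection_rate = (counts > 0).mean(axis=0)  # fraction of cells where each gene is seen
mean_expr = counts.mean(axis=0)             # mean count per gene
print(f"{sparsity:.2f} of the entries are zero")
# detection_rate can then be plotted against mean_expr, as in the figure.
```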
# Linear dimension reduction
## Matrix factorization
### Low dimensional representation
What is the meaning of ${\bf \approx}$?
\end{center}
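One way to spell out this $\approx$ (using the notation of the PCA slide below, with $K$ the target dimension):
\[
\mathbf{X}_{n \times p} \;\approx\; \mathbf{U}\,\mathbf{V}^{\top},
\qquad \mathbf{U}\in\mathbb{R}^{n\times K},\ \mathbf{V}\in\mathbb{R}^{p\times K},\ K \ll \min(n, p)
\]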
# Linear dimension reduction
## PCA
The most widely used dimension reduction technique is PCA (Principal Component Analysis)
\begin{itemize}
\item Find a linear projection of $\mathbf{X}$ with maximum variance
\vskip 0.3cm
\item PCA algorithm: \hskip 0.8cm $\underset{\substack{\mathbf{U}\in\mathbb{R}^{n\times K},\ \mathbf{V}\in\mathbb{R}^{p\times K}}}{\operatorname{argmin}} \ \big\Vert \mathbf{X} - \mathbf{U}\mathbf{V}^{\top}\big\Vert_F^{\,2}$ \\
\vskip 0.4cm
\item \textbf{Least squares approximation} (see the sketch below)
\end{itemize}
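A minimal numerical sketch of this least-squares criterion (toy data, not part of the original slides): the rank-$K$ solution can be read off the truncated SVD of the centered matrix.

```python
# Best rank-K least-squares approximation of a centered matrix X
# (Eckart--Young): truncate the SVD at K components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(100, 500)).astype(float)  # toy cells x genes counts
X = X - X.mean(axis=0)                               # center each gene

K = 10
U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)
U = U_svd[:, :K] * s[:K]   # cell coordinates (scores), n x K
V = Vt[:K].T               # gene loadings, p x K

X_hat = U @ V.T                              # rank-K reconstruction
err = np.linalg.norm(X - X_hat, "fro") ** 2  # the Frobenius criterion above
```

In practice, `sklearn.decomposition.PCA(n_components=K)` centers the data and computes an equivalent projection.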
## PCA
### Preprocessing before dimension reduction
\begin{itemize}
\item data centering: subtracting the mean for each gene
\item data scaling: dividing by the standard deviation of the gene
\begin{itemize}
\item ensure equal contribution from each gene
\item shrinkage of features containing strong signal
\item inflation of features with no signal
\end{itemize}
\item library normalization
\begin{itemize}
\item make observations comparable to each other
\end{itemize}
\item variance stabilization
\begin{itemize}
\item avoid bias toward the highly abundant features
\end{itemize}
\end{itemize}
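A minimal sketch of these steps on a toy count matrix (the $10^4$ scale factor and the $\log(1+x)$ transform are common choices, not the only ones):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 1000)).astype(float)  # toy cells x genes

# library normalization: make cells comparable (counts per 10k)
lib_size = counts.sum(axis=1, keepdims=True)
norm = counts / lib_size * 1e4

# variance stabilization: damp the weight of highly abundant genes
logged = np.log1p(norm)

# per-gene centering and scaling (guarding against zero-variance genes)
mu = logged.mean(axis=0)
sd = logged.std(axis=0)
scaled = (logged - mu) / np.where(sd > 0, sd, 1.0)
```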
## PCA
\begin{center}
\begin{columns}
\column{0.5\textwidth}
Intuition behind {\it a linear projection of $\mathbf{X}$ with maximum variance}
\begin{center}
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/two_gene_corr_lm_1.png}
\end{center}
\end{columns}
\end{center}
## PCA
\begin{center}
\begin{columns}
\column{0.5\textwidth}
Intuition behind {\it a linear projection of $\mathbf{X}$ with maximum variance}
\begin{center}
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/two_gene_corr_lm_2.png}
\end{center}
\end{columns}
\end{center}
## PCA
\begin{center}
\begin{columns}
\column{0.5\textwidth}
Intuition behind {\it a linear projection of $\mathbf{X}$ with maximum variance}
\begin{center}
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/two_gene_corr_lm_3.png}
\end{center}
\end{columns}
\end{center}
## PCA
\begin{center}
\begin{columns}
\column{0.5\textwidth}
Intuition behind {\it a linear projection of $\mathbf{X}$ with maximum variance}
\begin{center}
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/two_gene_corr_lm_4.png}
\end{center}
\end{columns}
\end{center}
## PCA
\begin{center}
\begin{columns}
\column{0.5\textwidth}
Intuition behind {\it a linear projection of $\mathbf{X}$ with maximum variance}
\begin{center}
\[
X_{1 \times genes} =
\begin{bmatrix}
x_{1,i} \\
x_{2,i} \\
\end{bmatrix}
\]
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/two_gene_corr_lm_4.png}
\end{center}
\end{columns}
\end{center}
\begin{center}
{\bf We iterate this process over the $n$ dimensions}
\end{center}
## PCA
### The choice of the number of components
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{itemize}
\item each new axis is orthogonal to the previous ones
\item each axis is a linear projection carrying the maximum remaining variance (see the sketch below)
\end{itemize}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/number_pc_1.png}
\end{center}
\end{columns}
\end{center}
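A minimal sketch of one common way to pick the number of components (toy data standing in for a preprocessed matrix): keep the smallest $K$ reaching a chosen fraction of the variance, or look for an elbow in the scree plot.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))             # stand-in for a centered/scaled matrix
X[:, :5] += 3 * rng.normal(size=(300, 1))   # inject a few correlated directions

pca = PCA(n_components=50).fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
K = int(np.searchsorted(cumvar, 0.90) + 1)  # smallest K explaining 90% of variance
```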
# Non-linear dimension reduction
# t-SNE
## Beyond Linear projections
\begin{itemize}
\item Linear methods are powerful for planar structures
\item High dimensional datasets are characterized by multiscale properties (local / global structures)
\item May not be the most powerful for manifolds
\item Non-linear projection methods aim at preserving local characteristics of distances
\end{itemize}
\begin{center}
\includegraphics[width=0.4\textwidth]{img/som_vs_pca.png}
\end{center}
## t-SNE
### t-Distributed Stochastic Neighbor Embedding
t-SNE creates compelling two-dimensional *maps* from data with hundreds or even thousands of dimensions.
\begin{itemize}
\item $(x_1, \hdots, x_n)$ are the points in the high dimensional space $\mathbb{R}^p$,
\item Consider a similarity between points:
$$
p_{j | i} = \frac{ \exp(- \| x_i - x_j \|^2 / 2 \sigma_i^2 ) }{\sum_{k \neq i} \exp(- \| x_i - x_k \|^2 / 2 \sigma_i^2)},
\,\, p_{ij} = (p_{j | i} + p_{i | j})/ 2N
$$
\item $\sigma$ smooths the data (linked to the regularity of the target manifold)
\item $\sigma_i$ should adjust to the local density (neighborhood of point $i$)
\item $\sigma_i$ is chosen such that the entropy of $p_{\cdot | i}$ is fixed to a given value, the so-called perplexity (a small numerical sketch follows)
\end{itemize}
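A small numerical sketch of $p_{j|i}$ and of the perplexity used to calibrate $\sigma_i$ (toy data; the reference implementation runs a binary search on $\sigma_i$ for each point):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))   # toy points in high dimension
i, sigma_i = 0, 1.0             # one point and a candidate bandwidth

d2 = np.sum((X - X[i]) ** 2, axis=1)       # squared distances to x_i
logits = -d2 / (2 * sigma_i ** 2)
logits[i] = -np.inf                        # a point is not its own neighbor
p = np.exp(logits - logits.max())
p /= p.sum()                               # conditional similarities p_{j|i}

H = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # Shannon entropy (bits)
perplexity = 2 ** H                        # sigma_i is tuned until this hits the target
```

In practice, `sklearn.manifold.TSNE(perplexity=30)` performs this calibration internally.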
## t-SNE
$p_{j|i}$ can be interpreted as the probability that $x_i$ would pick $x_j$ as its neighbor, if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$.
\begin{center}
\href{https://towardsdatascience.com/t-sne-clearly-explained-d84c537f53a}{
\includegraphics[width=0.45\textwidth]{img/tsne_1.png}
}
\end{center}
## t-SNE
$p_{j|i}$ can be interpreted as the probability that $x_i$ would pick $x_j$ as its neighbor, if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$.
\begin{center}
\href{https://towardsdatascience.com/t-sne-clearly-explained-d84c537f53a}{
\includegraphics[width=0.45\textwidth]{img/tsne_2.png}
}
\end{center}
## t-SNE
In the low-dimensional map, the similarity $q_{ij}$ plays the same role, but with neighbors picked in proportion to their probability density under a Student $t$ distribution (heavier tails than the Gaussian).
\begin{center}
\href{https://fr.wikipedia.org/wiki/Loi_de_Student}{
\includegraphics[width=0.5\textwidth]{img/tsne_3.png}
}
\end{center}
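For completeness (this step is only implied by the slide): the similarities between map points use a Student $t$ with one degree of freedom, and the embedding $(y_1, \hdots, y_n)$ minimizes the Kullback--Leibler divergence between the two distributions:
\[
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}},
\qquad
C = KL(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\]
The heavy tails of the Student $t$ keep moderately distant points from collapsing onto each other in the map.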
## t-SNE
### Non-linear projection
\begin{center}
\includegraphics[width=0.8\textwidth]{img/tsne_complex_shape.png}
\end{center}
## t-SNE
### Hyperparameters really matter
\begin{center}
\includegraphics[width=\textwidth]{img/hyperparameters_really_matter.png}
\end{center}
The original paper says, *“The performance of SNE is fairly robust to changes in the perplexity, and typical values are between 5 and 50.”*
## t-SNE
### Cluster sizes in a t-SNE plot mean nothing
\begin{center}
\includegraphics[width=\textwidth]{img/tsne_cluster_sizes.png}
\end{center}
## t-SNE
### Distances between clusters might not mean anything
\begin{center}
\includegraphics[width=\textwidth]{img/tsne_cluster_dist.png}
\end{center}
# UMAP
## UMAP
### Uniform manifold approximation and projection
UMAP consists of two steps: first, compute a graph representing your data; second, learn an embedding for that graph:
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=\textwidth]{img/umap_steps.png}
}
\end{center}
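A minimal usage sketch, assuming the `umap-learn` package (toy data; `n_neighbors` is the knob discussed a few slides below, `min_dist` controls how tightly points are packed in the embedding):

```python
import numpy as np
import umap  # umap-learn package

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))  # stand-in for a preprocessed matrix

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1,
                    metric="euclidean", random_state=0)
embedding = reducer.fit_transform(X)  # (1000, 2) coordinates for plotting
```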
## UMAP
### Construction of the graph
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_1.png}
}
\end{center}
## UMAP
### Linear distance
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_2.png}
}
\end{center}
## UMAP
### Disconnected graph: the points are not uniformly distributed
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_3.png}
}
\end{center}
## UMAP
### Locally varying metric
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_4.png}
}
\end{center}
## UMAP
### We actually have a local metric space associated with each point
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_5.png}
}
\end{center}
## UMAP
### Fuzzy confidence decays with distance beyond the first nearest neighbor
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_6.png}
}
\end{center}
## UMAP
### Local metrics are not compatible
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_7.png}
}
\end{center}
## UMAP
### The combined weight is the probability that at least one of the edges exists
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_8.png}
}
\end{center}
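Concretely, if the two directed edges between a pair of points carry weights $a$ and $b$, the symmetrized graph keeps
\[
w = a + b - a \cdot b,
\]
which is exactly the probability that at least one of the two edges exists when they are treated as independent.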
## UMAP
### Choice of the number of neighbors: size of the local metric spaces
\begin{center}
\begin{columns}
\column{0.33\textwidth}
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=\textwidth]{img/umap_n_2.png}
}
\end{center}
\column{0.33\textwidth}
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=\textwidth]{img/umap_n_10.png}
}
\end{center}
\column{0.33\textwidth}
\begin{center}
\href{https://arxiv.org/abs/1802.03426}{
\includegraphics[width=\textwidth]{img/umap_n_100.png}
}
\end{center}
\end{columns}
\end{center}
## UMAP
### Distances between clusters might not mean anything
\begin{center}
\href{https://arxiv.org/abs/1802.03426}{
\includegraphics[width=0.5\textwidth]{img/umap_global.png}
}
\end{center}
## UMAP vs t-SNE vs PCA
\begin{center}
\href{https://arxiv.org/abs/1802.03426}{
\includegraphics[width=0.7\textwidth]{img/umap_tsne_pca.png}
}
\end{center}
# Auto-encoder
# Variational Auto-encoder
## Auto-encoder
### Autoencoders are trained to minimise reconstruction errors.
\begin{center}
\href{https://gricad-gitlab.univ-grenoble-alpes.fr/talks/fidle/-/wikis/home}{