Verified Commit c7755c0f authored by Laurent Modolo's avatar Laurent Modolo

dimension_reduction.Rmd update

parent 0b77e007
Pipeline #328 failed with stage
in 48 seconds
@@ -19,23 +19,23 @@ classoption: aspectratio=169
# Introduction
## Program
1. Single-cell RNASeq data from 10X Sequencing (Friday 3 June 2022 - 14:00)
2. Normalization and spurious effects (Wednesday 8 June 2022 - 14:00)
3. Dimension reduction and data visualization (Monday 13 June 2022 - 15:00)
4. Clustering and annotation (Thursday 23 June 2022 - 14:00)
5. Pseudo-time and velocity inference (Thursday 30 June 2022 - 14:00)
6. Differential expression analysis (Friday 8 July 2022 - 14:00)
## Program
1. Single-cell RNASeq data from 10X Sequencing (Friday 3 June 2022 - 14:00)
2. Normalization and spurious effects (Wednesday 8 June 2022 - 14:00)
3. Dimension reduction and data visualization (Monday 13 June 2022 - 15:00)
- Dimension of the data
- Linear dimension reduction
- Non-Linear dimension reduction
- t-SNE
- UMAP
- Auto-encoder
@@ -140,7 +140,7 @@ x_{n,c} \\
We have $25-34^5$ rows (genes or transcripts)
\end{center}
## Real dimension of count data
\begin{center}
\includegraphics[width=0.6\textwidth]{img/cell_vector_3.png}
@@ -149,7 +149,7 @@ We have $25-34^5$ rows (genes or transcripts)
Could a cell vector cover $\mathbb{R}_{+}^{n}$?
\end{center}
## Real dimension of count data
\begin{center}
\includegraphics[width=0.6\textwidth]{img/cell_vector_4.png}
@@ -158,7 +158,7 @@ We have $25-34^5$ rows (genes or transcripts)
Some cell transcription states cannot be reached
\end{center}
## Real dimension of count data
\begin{center}
\includegraphics[width=0.6\textwidth]{img/cell_vector_5.png}
@@ -167,17 +167,8 @@ We have $25-34^5$ rows (genes or transcripts)
The cell transcription space lies in $\Omega \subset \mathbb{R}_{+}^{n}$, with $\Omega$ much smaller than $\mathbb{R}_{+}^{n}$
\end{center}
## Real dimension of count data
\begin{center}
\begin{columns}
@@ -200,7 +191,7 @@ x_{2,i} \\
\end{columns}
\end{center}
## Real dimension of count data
\begin{center}
\begin{columns}
@@ -223,6 +214,16 @@ x_{2,i} \\
\end{columns}
\end{center}
## Real dimension of count data
\begin{center}
\includegraphics[height=4cm]{./img/wadington_landscape.png}
\end{center}
\begin{center}
There are multiple factors that define the transcription landscape
\end{center}
## What can we do about the dimension of the data
$X_{cells \times genes}$, with $25-34^5$ genes / transcripts and up to $10^6$ cells
@@ -238,21 +239,25 @@ $X_{cells \times genes}$, with $25-34^5$ genes / transcripts and up to $10^6$ ce
Dimension reduction is mandatory for any analysis (clustering, visualization, inference)
## Count matrices are sparse for scRNASeq
\begin{center}
\begin{columns}
\column{0.45\textwidth}
\begin{itemize}
\item low amount of mRNA
\item we have many count values equal to zero
\end{itemize}
\vspace{2em}
\begin{center}
{\bf zeros $\neq$ dropouts}
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=1.2\textwidth]{img/detection_vs_mean.png}
\end{center}
\end{columns}
\end{center}
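Since counts are dominated by zeros, even a quick simulation makes the point. A minimal numpy sketch (simulated Poisson counts with gene-level Gamma means, not a real dataset; all parameter values are illustrative):

```python
import numpy as np

# Toy illustration (simulated data): scRNA-seq count matrices are mostly
# zeros because each cell carries a low amount of mRNA.
rng = np.random.default_rng(0)
n_cells, n_genes = 100, 1000

# Small Gamma-distributed mean expression per gene, then Poisson counts,
# which produces many zero entries.
mean_expression = rng.gamma(shape=0.3, scale=2.0, size=n_genes)
counts = rng.poisson(mean_expression, size=(n_cells, n_genes))

sparsity = np.mean(counts == 0)  # fraction of zero entries
print(f"fraction of zeros: {sparsity:.2f}")
```

With these (arbitrary) parameters, well over half of the entries are zero, matching the sparsity observed in real 10X data.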
@@ -261,7 +266,7 @@ Dimension reduction is mandatory for any analysis (clustering, visualization, in
## Matrix factorization
### Low-dimensional representation
\begin{itemize}
\item Cells: $ {\bf U} \in \mathbb{R}^{n\times \textcolor{red}{K}} $
\item Genes: $ {\bf V} \in \mathbb{R}^{p\times \textcolor{red}{K}} $
@@ -274,24 +279,24 @@ Dimension reduction is mandatory for any analysis (clustering, visualization, in
\end{center}
\begin{center}
What is the meaning of ${\bf \approx}$?
\end{center}
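The shapes involved in ${\bf X} \approx {\bf U}{\bf V}^{T}$ can be checked with a toy numpy sketch (random matrices, with sizes chosen arbitrarily for illustration):

```python
import numpy as np

# Shape bookkeeping for the factorization X ≈ U Vᵀ:
# U holds a K-dimensional representation of each cell, V of each gene.
n, p, K = 500, 2000, 10          # cells, genes, latent dimension
rng = np.random.default_rng(1)
U = rng.normal(size=(n, K))      # cells x K
V = rng.normal(size=(p, K))      # genes x K
X_approx = U @ V.T               # cells x genes reconstruction

assert X_approx.shape == (n, p)
# Storing U and V costs (n + p) * K numbers instead of n * p:
compression = (n * p) / ((n + p) * K)
print(f"compression factor: {compression:.0f}x")  # -> 40x
```

The point of the factorization is exactly this compression: $(n + p)K$ numbers instead of $np$.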
## PCA
The most widely used dimension reduction technique is PCA (Principal Component Analysis)
\begin{itemize}
\item Find a linear projection of $\mathbf{X}$ with maximum variance
\vskip 0.3cm
\item PCA algorithm: \hskip 0.8cm $\underset{\substack{\mathbf{U}\in\mathbb{R}^{n\times K}, \mathbf{V}\in\mathbb{R}^{p\times K}}}{argmin} \ \big\Vert \mathbf{X} - \mathbf{U}\mathbf{V}^{T}\big\Vert_F^{\,2}$ \\
\vskip 0.4cm
\item \textbf{Least squares approximation}
\end{itemize}
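The least-squares problem above is solved exactly by the truncated SVD of the centered matrix (Eckart–Young theorem). A minimal numpy sketch, using random data purely for illustration:

```python
import numpy as np

# PCA as least squares: the rank-K truncated SVD of the centered matrix
# minimizes ||X - U Vᵀ||_F² over all rank-K factorizations.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 50))
X = X - X.mean(axis=0)          # center each gene (column)

K = 5
u, s, vt = np.linalg.svd(X, full_matrices=False)
U = u[:, :K] * s[:K]            # cell scores,   200 x K
V = vt[:K].T                    # gene loadings,  50 x K

residual = np.linalg.norm(X - U @ V.T) ** 2
# Any other rank-K factorization does worse, e.g. random factors:
residual_rand = np.linalg.norm(
    X - rng.normal(size=(200, K)) @ rng.normal(size=(K, 50))
) ** 2
print(residual < residual_rand)  # -> True
```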
## PCA
### Preprocessing before dimension reduction
\begin{itemize}
\item data centering: subtracting the mean for each gene
\item data scaling: dividing by the standard deviation of the gene
@@ -437,7 +442,7 @@ x_{2,i} \\
## PCA
### The choice of the number of components
\begin{center}
@@ -456,9 +461,66 @@ x_{2,i} \\
\end{columns}
\end{center}
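The usual heuristic for choosing the number of components can be sketched in numpy: inspect the share of variance carried by each component and look for an elbow. The simulation (three strong directions buried in noise) is an assumption made for illustration only:

```python
import numpy as np

# Simulated data with 3 strong latent directions plus isotropic noise.
rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 40)) * 5
X = latent + rng.normal(size=(300, 40))
X = X - X.mean(axis=0)

# Squared singular values give the variance explained per component.
s = np.linalg.svd(X, compute_uv=False)
explained = s**2 / np.sum(s**2)
print(np.round(explained[:6], 3))  # a sharp elbow after component 3
```

In real data the elbow is rarely this sharp, which is why the choice of $K$ remains partly empirical.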
## Count matrices are sparse for scRNASeq
\begin{center}
\begin{columns}
\column{0.45\textwidth}
\begin{itemize}
\item low amount of mRNA
\item we have many count values equal to zero
\end{itemize}
\vspace{2em}
\begin{center}
{\bf zeros $\neq$ dropouts}
\end{center}
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=1.2\textwidth]{img/detection_vs_mean.png}
\end{center}
\end{columns}
\end{center}
## Sparse matrix factorization
### Low-dimensional representation
\begin{itemize}
\item Cells: $ {\bf U} \in \mathbb{R}^{n\times \textcolor{red}{K}} $
\item Genes: $ {\bf V} \in \mathbb{R}^{p\times \textcolor{red}{K}} $
\end{itemize}
\vspace{1em}
\begin{center}
\includegraphics[height=4cm]{./img/sparce_matrix_factorization.png}
\end{center}
## Count matrix factorization
### Non-negative matrix factorization
\begin{center}
\href{https://doi.org/10.1093/bioinformatics/btz177}{
\includegraphics[width=\textwidth]{./img/count_matrix_factorization.png}
}
\end{center}
\vspace{-2em}
\[
\mathbf{X} \sim \mathcal{P}\left(\Delta\right), \Delta \simeq \mathbf{U}\mathbf{V}^{T}
\]
\[
\mathbf{U}_{ik} \sim \text{Gamma}\left(\alpha_{1,k}, \sigma_{1,k}\right)
\]
\[
\mathbf{V}_{jk} \sim \text{Gamma}\left(\alpha_{2,k}, \sigma_{2,k}\right)
\]
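As a stand-in for the probabilistic model above (not the cited paper's exact algorithm), the classic Lee–Seung multiplicative updates for non-negative matrix factorization minimize the generalized KL divergence, which corresponds to the Poisson likelihood:

```python
import numpy as np

# KL-loss NMF via Lee–Seung multiplicative updates on simulated counts.
rng = np.random.default_rng(4)
X = rng.poisson(5.0, size=(60, 40)).astype(float)

K = 4
U = rng.uniform(0.5, 1.5, size=(60, K))  # non-negative cell factors
V = rng.uniform(0.5, 1.5, size=(40, K))  # non-negative gene factors

def kl_loss(X, R):
    # Generalized KL divergence between counts X and reconstruction R
    # (dropping terms that depend on X only).
    return np.sum(R - X * np.log(R + 1e-12))

before = kl_loss(X, U @ V.T)
for _ in range(50):
    R = U @ V.T + 1e-12
    U *= ((X / R) @ V) / V.sum(axis=0)
    R = U @ V.T + 1e-12
    V *= ((X / R).T @ U) / U.sum(axis=0)
after = kl_loss(X, U @ V.T)
print(before > after)  # -> True: the updates decrease the loss
```

The updates keep $\mathbf{U}$ and $\mathbf{V}$ non-negative by construction, which is what makes the factors interpretable as additive expression programs.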
# Non-Linear dimension reduction
## Beyond Linear projections
@@ -466,7 +528,7 @@ x_{2,i} \\
\item Linear methods are powerful for planar structures
\item High dimensional datasets are characterized by multiscale properties (local / global structures)
\item May not be the most powerful for manifolds
\item Non-Linear projection methods aim at preserving local characteristics of distances
\end{itemize}
\begin{center}
@@ -476,17 +538,17 @@ x_{2,i} \\
## t-SNE
### t-Distributed Stochastic Neighbor Embedding
Create compelling two-dimensional *maps* from data with hundreds or even thousands of dimensions.
\begin{itemize}
\item $(x_1, \hdots, x_n)$ are the points in the high-dimensional space $\mathbb{R}^p$,
\item Consider a similarity between points:
$$
p_{j | i} = \frac{ \exp(- \| x_i - x_j \|^2 / 2 \sigma_i^2 ) }{\sum_{k \neq i} \exp(- \| x_i - x_k \|^2 / 2 \sigma_i^2)},
\,\, p_{ij} = (p_{i | j} + p_{j | i})/ 2N
$$
\item $\sigma$ smooths the data (linked to the regularity of the target manifold)
\item $\sigma_i$ should adjust to local density (neighborhood of point $i$)
\item each $\sigma_i$ is chosen such that the entropy of $p_{\cdot | i}$ matches a user-defined value, the so-called perplexity
\end{itemize}
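The perplexity calibration can be sketched as a per-point binary search on $\sigma_i$ (a simplified numpy version of the standard t-SNE preprocessing; function names are illustrative):

```python
import numpy as np

def conditional_p(dists_i, sigma):
    # Conditional similarities p_{.|i} from squared distances to point i
    # (self excluded), with a Gaussian kernel of bandwidth sigma.
    logits = -dists_i / (2.0 * sigma**2)
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def calibrate_sigma(dists_i, perplexity, iters=50):
    # Geometric binary search so that the entropy of p_{.|i}
    # equals log2(perplexity).
    lo, hi = 1e-10, 1e10
    target = np.log2(perplexity)
    for _ in range(iters):
        sigma = np.sqrt(lo * hi)
        p = conditional_p(dists_i, sigma)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        if entropy > target:
            hi = sigma                   # too smooth: shrink sigma
        else:
            lo = sigma                   # too peaked: grow sigma
    return sigma

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
d = np.sum((X[0] - X[1:]) ** 2, axis=1)  # squared distances from point 0
sigma0 = calibrate_sigma(d, perplexity=30)
p = conditional_p(d, sigma0)
print(f"sigma_0 = {sigma0:.3f}")
```

Dense neighborhoods get a small $\sigma_i$ and sparse ones a large $\sigma_i$, which is exactly the local-density adjustment listed above.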
@@ -512,7 +574,7 @@ $p_{ji}$ can be interpreted as the probability that $x_i$ would pick $x_j$ as it
## t-SNE
$p_{ji}$ can be interpreted as the probability that $x_i$ would pick $x_j$ as its neighbor if a neighbor were picked in proportion to their probability density under a Student distribution.
\begin{center}
\href{https://fr.wikipedia.org/wiki/Loi_de_Student}{
@@ -585,7 +647,7 @@ UMAP is comprised of two steps: First, compute a graph representing your data, s
\end{center}
## UMAP
### Disconnected graph: the points are not uniformly distributed
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_3.png}
@@ -609,7 +671,7 @@ UMAP is comprised of two steps: First, compute a graph representing your data, s
\end{center}
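The graph-building step can be caricatured in a few lines of numpy: a k-nearest-neighbor graph whose fuzzy membership weights equal 1 for the first nearest neighbor and decay beyond it (the bandwidth is fixed here instead of calibrated per point, an intentional simplification of what UMAP actually does):

```python
import numpy as np

# Toy sketch of UMAP's first step: k-NN graph with fuzzy edge weights.
rng = np.random.default_rng(6)
X = rng.normal(size=(50, 5))
k = 5

# Pairwise distances; exclude self-edges by setting the diagonal to inf.
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)
knn = np.argsort(dists, axis=1)[:, :k]   # indices of the k neighbors

# rho_i = distance to the first neighbor; weights decay beyond it.
rho = dists[np.arange(50), knn[:, 0]]
weights = np.exp(-(np.take_along_axis(dists, knn, axis=1) - rho[:, None]))

print(np.allclose(weights[:, 0], 1.0))  # -> True: first neighbor weight 1
```

Because each point is guaranteed one edge of weight 1, the graph stays connected locally even when the data density varies wildly.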
## UMAP
### Fuzzy confidence decays with distance beyond the first nearest neighbor
\begin{center}
\href{https://umap-learn.readthedocs.io}{
\includegraphics[width=0.9\textwidth]{img/umap_6.png}
@@ -683,7 +745,7 @@ UMAP is comprised of two steps: First, compute a graph representing your data, s
## Auto-encoder
### Autoencoders are trained to minimize reconstruction errors.
\begin{center}
\href{https://gricad-gitlab.univ-grenoble-alpes.fr/talks/fidle/-/wikis/home}{
@@ -693,7 +755,7 @@ UMAP is comprised of two steps: First, compute a graph representing your data, s
## Auto-encoder
### Autoencoders are trained to minimize reconstruction errors.
\begin{center}
\href{https://gricad-gitlab.univ-grenoble-alpes.fr/talks/fidle/-/wikis/home}{
@@ -703,7 +765,7 @@ UMAP is comprised of two steps: First, compute a graph representing your data, s
## Auto-encoder
### Autoencoders are trained to minimize reconstruction errors.
\begin{center}
\href{https://gricad-gitlab.univ-grenoble-alpes.fr/talks/fidle/-/wikis/home}{
@@ -726,8 +788,6 @@ UMAP is comprised of two steps: First, compute a graph representing your data, s
## Variational Auto-encoder
Auto-encoders are not often used for scRNASeq data.
An autoencoder performs a direct projection into the latent space and an upsampling from this latent space.
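To make "trained to minimize reconstruction error" concrete, here is a minimal linear autoencoder with tied weights trained by plain gradient descent. This is a sketch: with a squared loss and no non-linearity it recovers the PCA subspace, whereas real scRNASeq autoencoders use deep non-linear encoders and decoders.

```python
import numpy as np

# Linear autoencoder: encode with W, decode with W.T (tied weights).
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 20))
X = X - X.mean(axis=0)

K, lr = 3, 0.01
W = rng.normal(scale=0.1, size=(20, K))

def loss(W):
    Z = X @ W            # project into the latent space
    Xhat = Z @ W.T       # upsample back to gene space
    return np.mean((X - Xhat) ** 2)

before = loss(W)
for _ in range(500):
    Z = X @ W
    E = Z @ W.T - X      # reconstruction error
    # Gradient of the mean squared reconstruction error w.r.t. W:
    grad = 2.0 / X.size * (E.T @ Z + X.T @ (E @ W))
    W -= lr * grad
print(loss(W) < before)  # -> True: training reduced the error
```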
\begin{center}
@@ -738,7 +798,7 @@ An variational auto-encoder performs a direct projection into a parameter space.
## Variational Auto-encoder
A variational auto-encoder performs a direct projection into a parameter space.
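A sketch of that idea: instead of mapping a cell directly to a latent point, the encoder outputs distribution parameters (a mean and a log-variance), and the latent code is drawn with the reparameterization trick so the sampling step stays differentiable. The random linear encoder below is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=20)               # one cell's preprocessed profile
K = 3                                 # latent dimension

# Illustrative linear "encoder" heads for the two parameter vectors.
W_mu = rng.normal(scale=0.1, size=(K, 20))
W_logvar = rng.normal(scale=0.1, size=(K, 20))

mu = W_mu @ x                         # latent mean
logvar = W_logvar @ x                 # latent log-variance
eps = rng.normal(size=K)              # noise drawn outside the network
z = mu + np.exp(0.5 * logvar) * eps   # reparameterized latent sample

print(z.shape)  # -> (3,)
```

Because the randomness lives in `eps` rather than in the network, gradients flow through `mu` and `logvar` during training.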
\begin{center}
\href{https://gricad-gitlab.univ-grenoble-alpes.fr/talks/fidle/-/wikis/home}{
@@ -747,8 +807,26 @@ An variational auto-encoder performs a direct projection into a parameter space.
\end{center}
## Mixture of different methods
\begin{center}
\href{https://doi.org/10.1093/bib/bbab345}{
\includegraphics[width=\textwidth]{img/dim_red_comp_1.png}
}
\end{center}
## Mixture of different methods
\begin{center}
\href{https://www.nature.com/articles/s41587-021-00875-x}{
\includegraphics[width=\textwidth]{img/dim_red_comp_2.png}
}
\end{center}
## Computational complexity
\begin{center}
\includegraphics[width=\textwidth]{img/computational_complexity.png}
\end{center}
# single-cell RNA-Seq data: Clustering & Annotation *Thursday 23 June 2022*