Commit ead99e62 authored by Laurent Modolo

2_normalization.Rmd: update

parent fe2e795a
Pipeline #324 failed with stage
in 28 seconds
......@@ -957,6 +957,8 @@ With $K$ the $k$-compatibility class counts and $\beta$ the transcript quantific
\includegraphics[width=\textwidth]{img/scasa_vs_other.png}
\end{center}
# scRNA data normalization: Friday 8 June 2022
## References
......
---
title: "single-cell RNA-Seq: Normalization"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "Friday 3 June 2022"
date: "Friday 8 June 2022"
output:
beamer_presentation:
df_print: tibble
......@@ -107,7 +107,6 @@ classoption: aspectratio=169
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\begin{tikzpicture}
\fill
(0.5,3.5) node {\bf $\text{gene}_1$}
......@@ -115,13 +114,13 @@ classoption: aspectratio=169
-- (0.5,1.5) node {\bf $\vdots$}
-- (0.5,0.5) node {\bf $\text{gene}_n$};
\fill
(1.5,4.5) node {\bf{$\text{bc}_1$}}
(1.5,4.5) node {\bf $\text{bc}_1$}
-- (1.5,3.5) node {mRNA}
-- (1.5,2.5) node {mRNA}
-- (1.5,1.5) node {$\vdots$}
-- (1.5,0.5) node {mRNA};
\fill
(2.5,4.5) node {\color{red}\bf{$\text{bc}_2$}}
(2.5,4.5) node {\color{red}\bf $\text{bc}_2$}
-- (2.5,3.5) node {\color{red}mRNA}
-- (2.5,2.5) node {\color{red}mRNA}
-- (2.5,1.5) node {\color{red}$\vdots$}
......@@ -133,14 +132,13 @@ classoption: aspectratio=169
-- (3.5,1.5) node {$\ddots$}
-- (3.5,0.5) node {$\cdots$};
\fill
(4.5,4.5) node {\bf{$\text{bc}_c$}}
(4.5,4.5) node {\bf $\text{bc}_c$}
-- (4.5,3.5) node {mRNA}
-- (4.5,2.5) node {mRNA}
-- (4.5,1.5) node {$\vdots$}
-- (4.5,0.5) node {mRNA};
\draw (1,0) grid (5,4);
\end{tikzpicture}
\end{center}
\column{0.5\textwidth}
......@@ -206,7 +204,6 @@ Most of the droplets will be empty
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\begin{tikzpicture}
\fill
(0.5,3.5) node {\bf $\text{gene}_1$}
......@@ -214,55 +211,327 @@ Most of the droplets will be empty
-- (0.5,1.5) node {\bf $\vdots$}
-- (0.5,0.5) node {\bf $\text{gene}_n$};
\fill
(1.5,4.5) node {\bf{$\text{cell}_1$}}
(1.5,4.5) node {\bf $\text{cell}_1$}
-- (1.5,3.5) node {mRNA}
-- (1.5,2.5) node {mRNA}
-- (1.5,1.5) node {$\vdots$}
-- (1.5,0.5) node {mRNA};
\fill
(2.5,4.5) node {\color{red}\bf{$\text{2 cells}_2$}}
(2.5,4.5) node {\color{red}\bf $\text{2 cells}_2$}
-- (2.5,3.5) node {\color{red}mRNA}
-- (2.5,2.5) node {\color{red}mRNA}
-- (2.5,1.5) node {\color{red}$\vdots$}
-- (2.5,0.5) node {\color{red}mRNA};
\fill
(3.5,4.5) node {\bf{$\cdots$}}
(3.5,4.5) node {\bf $\cdots$}
-- (3.5,3.5) node {$\cdots$}
-- (3.5,2.5) node {$\cdots$}
-- (3.5,1.5) node {$\ddots$}
-- (3.5,0.5) node {$\cdots$};
\fill
(4.5,4.5) node {\bf{$\text{cell}_c$}}
(4.5,4.5) node {\bf $\text{cell}_c$}
-- (4.5,3.5) node {mRNA}
-- (4.5,2.5) node {mRNA}
-- (4.5,1.5) node {$\vdots$}
-- (4.5,0.5) node {mRNA};
\draw (1,0) grid (5,4);
\end{tikzpicture}
\end{center}
\column{0.5\textwidth}
{\large Some cells are many cells.}
{\large Some cells are many cells:}
\begin{itemize}
\item not all tissues are easily dissociable
\item two cells glued together will share the same droplet
\item two different cells can share the same droplet by chance
\end{itemize}
\vspace{1em}
cell barcodes corresponding to $n$-plets should be in the minority if the preparation went well.
Cell barcodes corresponding to $n$-plets should be in the minority if the preparation went well.
\end{columns}
\end{center}
## Cell filtering
apoptotic cells express MT genes
\begin{center}
\includegraphics[width=0.75\textwidth]{img/mouse_human_mix.png}
\end{center}
## Cell filtering
\begin{block}{hypothesis}
Cell barcodes corresponding to $n$-plets should be in the minority if the preparation went well.
\end{block}
### Algorithm
1. Simulate thousands of doublets by adding together two randomly chosen single-cell profiles.
2. For each original cell, compute the density of simulated doublets in the surrounding neighborhood.
3. For each original cell, compute the density of other observed cells in the neighborhood.
4. Return the ratio between the two densities as a **doublet score** for each cell.
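
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any specific tool: the function name `doublet_scores`, the log-normalization, the squared Euclidean distance and the use of the $k$-th nearest observed cell to define the neighborhood are all assumptions made for the example.

```python
import numpy as np

def doublet_scores(counts, n_sim=2000, k=20, seed=0):
    """counts: cells x genes matrix of raw counts. Returns one score per cell.

    Hypothetical helper: neighborhood definition and normalization are assumptions.
    """
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]

    # 1. simulate doublets by adding two randomly chosen single-cell profiles
    i = rng.integers(0, n_cells, size=n_sim)
    j = rng.integers(0, n_cells, size=n_sim)
    simulated = counts[i] + counts[j]

    def lognorm(m):
        # scale each profile to the same depth, then log-transform
        depth = m.sum(axis=1, keepdims=True)
        return np.log1p(m / np.maximum(depth, 1) * 1e4)

    def sqdist(a, b):
        # pairwise squared Euclidean distances between rows of a and rows of b
        return (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2 * a @ b.T

    obs, sim = lognorm(counts), lognorm(simulated)
    d_obs = sqdist(obs, obs)
    d_sim = sqdist(obs, sim)

    # neighborhood of a cell: ball reaching its k-th nearest observed cell
    # (index 0 of the sorted row is the cell itself)
    radius = np.sort(d_obs, axis=1)[:, k]

    # 2. density of simulated doublets in the neighborhood
    dens_sim = (d_sim <= radius[:, None]).sum(axis=1) / n_sim
    # 3. density of other observed cells in the neighborhood (self excluded)
    dens_obs = ((d_obs <= radius[:, None]).sum(axis=1) - 1) / (n_cells - 1)

    # 4. ratio of the two densities
    return dens_sim / dens_obs
```

Cells sitting between two populations attract many simulated doublets but few observed neighbors, so they get a high score. The threshold used to call a doublet is left to the user in this sketch.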
## Cell filtering
\begin{center}
\includegraphics[width=\textwidth]{img/doublet_detection_comparison.png}
\end{center}
Different algorithms are available to compare cells to synthetic doublets
## Cell filtering
\begin{center}
\includegraphics[width=0.8\textwidth]{img/features_for_QC_1.png}
\end{center}
\vspace{-1.5em}
We can use hard thresholds to remove putative poor quality cells
\vspace{-0.5em}
\begin{itemize}
\item apoptotic cells express MT genes
\item inefficient RT or PCR amplification
\end{itemize}
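
As an illustration of hard thresholds, the sketch below flags cells to keep from a count matrix. The function name `qc_filter`, the parameter names and the default threshold values are hypothetical; suitable cut-offs depend on the tissue and protocol and should be read from the distributions shown above.

```python
import numpy as np

def qc_filter(counts, is_mt, max_mt_frac=0.2, min_genes=200, min_molecules=500):
    """counts: cells x genes matrix; is_mt: boolean mask of mitochondrial genes.

    Returns a boolean vector, True for cells passing all thresholds.
    Default thresholds are placeholders, not recommendations.
    """
    n_molecules = counts.sum(axis=1)          # mRNA molecules per cell
    n_genes = (counts > 0).sum(axis=1)        # expressed genes per cell
    # fraction of molecules coming from MT genes (high in apoptotic cells)
    mt_frac = counts[:, is_mt].sum(axis=1) / np.maximum(n_molecules, 1)
    return (
        (mt_frac <= max_mt_frac)
        & (n_genes >= min_genes)
        & (n_molecules >= min_molecules)
    )
```

The returned mask can be used to subset the matrix, for example `counts[qc_filter(counts, is_mt)]`.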
## Cell filtering
\begin{center}
\includegraphics[width=0.8\textwidth]{img/features_for_QC_2.png}
\end{center}
Cells expressing few genes also contain few mRNA molecules
# Normalization
## Counts model
\begin{center}
\includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
\end{center}
## Counts distribution
### Random variable
A variable whose value depends on the outcome of a random phenomenon or experiment.
### For a given gene:
We consider $X$ a **random variable** and $x$ a realisation of $X$: the number of mRNAs observed in a cell.
\begin{itemize}
\item The random variable $X$ follows a statistical distribution $F$
\item We write $X \sim F$
\end{itemize}
## Counts model
\begin{center}
\includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
\end{center}
With a transcription rate $\lambda_g(t)$, the observed mRNA count follows a Poisson distribution $\mathcal{P}(\lambda_g(t))$.
## Counts distribution
$P(X = x)$ for $\mathcal{P}(\lambda_g)$
\begin{center}
\includegraphics[width=0.6\textwidth]{./img/poisson.png}
\end{center}
$\lambda_g$, the rate of mRNA production, is equal to both the mean and the
variance of the number of mRNAs.
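
For reference, the Poisson probability mass function plotted above and its first two moments can be written as:

$$P(X = x) = \frac{\lambda_g^{x} e^{-\lambda_g}}{x!}, \qquad \mathbb{E}[X] = \operatorname{Var}(X) = \lambda_g$$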
## Counts
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\begin{tikzpicture}
\fill
(0.5,3.5) node {\bf $\text{gene}_1$}
-- (0.5,2.5) node {\bf $\text{gene}_2$}
-- (0.5,1.5) node {\bf $\vdots$}
-- (0.5,0.5) node {\bf $\text{gene}_n$};
\fill
(1.5,4.5) node {\bf{$\text{cell}_1$}}
-- (1.5,3.5) node {mRNA}
-- (1.5,2.5) node {\color{red}mRNA}
-- (1.5,1.5) node {$\vdots$}
-- (1.5,0.5) node {mRNA};
\fill
(2.5,4.5) node {\bf{$\text{cell}_2$}}
-- (2.5,3.5) node {mRNA}
-- (2.5,2.5) node {\color{red}mRNA}
-- (2.5,1.5) node {$\vdots$}
-- (2.5,0.5) node {mRNA};
\fill
(3.5,4.5) node {\bf{$\cdots$}}
-- (3.5,3.5) node {$\cdots$}
-- (3.5,2.5) node {\color{red}$\cdots$}
-- (3.5,1.5) node {$\ddots$}
-- (3.5,0.5) node {$\cdots$};
\fill
(4.5,4.5) node {\bf{$\text{cell}_c$}}
-- (4.5,3.5) node {mRNA}
-- (4.5,2.5) node {\color{red}mRNA}
-- (4.5,1.5) node {$\vdots$}
-- (4.5,0.5) node {mRNA};
\draw (1,0) grid (5,4);
\end{tikzpicture}
\end{center}
\column{0.6\textwidth}
For a gene $g$, {\bf each cell is an observation} of the mRNA count of $g$
As we have a large number of cells, we have access to the:
\begin{itemize}
\item empirical mean
\item empirical variance
\item empirical distribution
\end{itemize}
\vspace{1em}
Bulk RNA-Seq: $\sim 3$ observations per gene
\end{columns}
\end{center}
## Counts distributions
$P(X = x)$ for $\mathcal{P}(\mu)$
\begin{center}
\includegraphics[width=0.6\textwidth]{./img/poisson.png}
\end{center}
$\mu$, the rate of mRNA production, is equal to the variance of the number
of mRNAs.
**We often have more variability! (broader distributions)**
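
A quick way to check this on data is to compare the empirical variance of each gene to its empirical mean: under a Poisson model the ratio should be close to 1. The sketch below is illustrative only. The function name `mean_variance_per_gene` is hypothetical, and the simulated data use a Gamma-distributed rate as one assumed source of extra variability.

```python
import numpy as np

def mean_variance_per_gene(counts):
    """counts: cells x genes matrix.

    Returns the empirical mean, the empirical variance and the
    variance / mean ratio of each gene (NaN when the mean is 0).
    """
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    ratio = np.divide(var, mean, out=np.full(mean.shape, np.nan), where=mean > 0)
    return mean, var, ratio

# simulated example: one Poisson gene, one gene whose rate varies between cells
rng = np.random.default_rng(0)
n_cells = 5000
poisson_gene = rng.poisson(4.0, size=n_cells)
rates = rng.gamma(shape=2.0, scale=2.0, size=n_cells)  # assumed rate distribution, mean 4
mixed_gene = rng.poisson(rates)

counts = np.column_stack([poisson_gene, mixed_gene])
mean, var, ratio = mean_variance_per_gene(counts)
print(ratio)  # first gene close to 1, second gene clearly above 1
```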
## Counts model
\begin{center}
\includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
\end{center}
Cells are not exact replicates of one another: a large number of factors can differ between two cells
$\lambda_g(t)$ is a **random variable**
## Counts distributions
\begin{center}
\begin{columns}
\column{0.4\textwidth}
$X \sim \mathcal{P}(\lambda)$: $\sigma^2 = \lambda$
\vspace{2em}
$X \sim \mathcal{NB}(\lambda, \sigma)$: $\sigma^2 = \lambda + \alpha \lambda^2$
\column{0.6\textwidth}
\vspace{1em}
\includegraphics[width=0.9\textwidth]{./img/mu_vs_var.png}
\end{columns}
\end{center}
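
One common way to connect the two variance formulas, given here as a sketch under an assumption not stated on the slide: suppose the rate is itself a random variable $\Lambda$, Gamma distributed across cells with mean $\lambda$ and variance $\alpha \lambda^2$, and $X \mid \Lambda \sim \mathcal{P}(\Lambda)$. The law of total variance then gives:

$$\operatorname{Var}(X) = \mathbb{E}\left[\operatorname{Var}(X \mid \Lambda)\right] + \operatorname{Var}\left(\mathbb{E}[X \mid \Lambda]\right) = \mathbb{E}[\Lambda] + \operatorname{Var}(\Lambda) = \lambda + \alpha \lambda^2$$

With $\alpha = 0$ the rate is constant and the Poisson case $\sigma^2 = \lambda$ is recovered.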
## Counts distributions
$P(X = x)$ for $\mathcal{NB}(\mu, \sigma)$
\begin{center}
\includegraphics[width=0.8\textwidth]{./img/poisson.png}
\end{center}
## Counts distributions
$P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 10)$
\begin{center}
\includegraphics[width=0.8\textwidth]{./img/NB_sigma_10.png}
\end{center}
## Counts distributions
$P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 2)$
\begin{center}
\includegraphics[width=0.8\textwidth]{./img/NB_sigma_2.png}
\end{center}
## Counts distributions
$P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 1)$
\begin{center}
\includegraphics[width=0.8\textwidth]{./img/NB_sigma_1.png}
\end{center}
## Variance of count data
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\begin{tikzpicture}
\fill
(0.5,3.5) node {\bf $\text{gene}_1$}
-- (0.5,2.5) node {\bf $\text{gene}_2$}
-- (0.5,1.5) node {\bf $\vdots$}
-- (0.5,0.5) node {\bf $\text{gene}_n$};
\fill
(1.5,4.5) node {\bf{$\text{cell}_1$}}
-- (1.5,3.5) node {mRNA}
-- (1.5,2.5) node {\color{red}mRNA}
-- (1.5,1.5) node {$\vdots$}
-- (1.5,0.5) node {mRNA};
\fill
(2.5,4.5) node {\bf{$\text{cell}_2$}}
-- (2.5,3.5) node {mRNA}
-- (2.5,2.5) node {\color{red}mRNA}
-- (2.5,1.5) node {$\vdots$}
-- (2.5,0.5) node {mRNA};
\fill
(3.5,4.5) node {\bf{$\cdots$}}
-- (3.5,3.5) node {$\cdots$}
-- (3.5,2.5) node {\color{red}$\cdots$}
-- (3.5,1.5) node {$\ddots$}
-- (3.5,0.5) node {$\cdots$};
\fill
(4.5,4.5) node {\bf{$\text{cell}_c$}}
-- (4.5,3.5) node {mRNA}
-- (4.5,2.5) node {\color{red}mRNA}
-- (4.5,1.5) node {$\vdots$}
-- (4.5,0.5) node {mRNA};
\draw (1,0) grid (5,4);
\end{tikzpicture}
\end{center}
\column{0.6\textwidth}
Which mRNA count comparison seems the most significant to you:
\begin{itemize}
\item $50$ vs $5$
\item $10050$ vs $10000$
\end{itemize}
We want to consider that larger
\end{columns}
\end{center}
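
To make the two comparisons concrete, the sketch below reports the absolute difference, the fold change and the difference expressed in units of its standard deviation. It assumes two independent Poisson counts, for which the variance of the difference is the sum of the two means (estimated here by the counts themselves). The function name `compare_counts` is hypothetical.

```python
import math

def compare_counts(a, b):
    """Compare two counts a and b, assumed to be independent Poisson observations.

    Returns (absolute difference, fold change, difference in standard deviations).
    """
    diff = a - b
    fold = a / b
    # Poisson: variance = mean, so Var(a - b) is estimated by a + b
    z = diff / math.sqrt(a + b)
    return diff, fold, z

for a, b in [(50, 5), (10050, 10000)]:
    diff, fold, z = compare_counts(a, b)
    print(f"{a} vs {b}: difference={diff}, fold change={fold:.3f}, z={z:.2f}")
```

The absolute differences are similar ($45$ and $50$), but relative to the expected noise the first comparison is far larger than the second, because the variance grows with the mean.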
# Variance stabilization
# Depth normalization
......