Verified Commit 21b8b0f5 authored by Laurent Modolo's avatar Laurent Modolo
Browse files

2_normalization.Rmd: update

parent ead99e62
Pipeline #325 failed with stage
in 40 seconds
......@@ -10,3 +10,4 @@
*.tex
*.fdb_latexmk
.DS_Store
.Rproj.user
---
title: "single-cell RNA-Seq: Normalization"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "Friday 8 June 2022"
date: "Wednesday 8 June 2022"
output:
beamer_presentation:
df_print: tibble
......@@ -239,7 +239,7 @@ Most of the droplets will be empty
\column{0.5\textwidth}
{\large Some cells are many cellsr:}
{\large Some cells are many cells:}
\begin{itemize}
\item not all tissues are easily dissociable
......@@ -249,7 +249,7 @@ Most of the droplets will be empty
\vspace{1em}
Cell barcode corresponding to $n$-plet should be in monority the the preparation went well.
Cell barcode corresponding to $n$-plet should be in the minority if the preparation went well.
\end{columns}
\end{center}
......@@ -263,7 +263,7 @@ Cell barcode corresponding to $n$-plet should be in monority the the preparation
## Cell filtering
\begin{block}{hypothesis}
Cell barcode corresponding to $n$-plet should be in monority the the preparation went well.
Cell barcode corresponding to $n$-plet should be in the minority if the preparation went well.
\end{block}
......@@ -280,7 +280,7 @@ Cell barcode corresponding to $n$-plet should be in monority the the preparation
\includegraphics[width=\textwidth]{img/doublet_detection_comparison.png}
\end{center}
Different algorithm are available to compare cells to synthetic doublets
Different algorithms are available to compare cells to synthetic doublets
## Cell filtering
......@@ -301,37 +301,96 @@ We can use hard thresholds to remove putative poor quality cells
\includegraphics[width=0.8\textwidth]{img/features_for_QC_2.png}
\end{center}
Cell expressing few genes also contains few mRNA molecule
Cell expressing few genes also contains few mRNA molecules
## Gene filtering
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
\begin{tikzpicture}
\fill
(0.5,3.5) node {\bf $\text{gene}_1$}
-- (0.5,2.5) node {\bf $\text{gene}_2$}
-- (0.5,1.5) node {\bf $\vdots$}
-- (0.5,0.5) node {\bf $\text{gene}_n$};
\fill
(1.5,4.5) node {\bf{$\text{cell}_1$}}
-- (1.5,3.5) node {mRNA}
-- (1.5,2.5) node {\color{red}0}
-- (1.5,1.5) node {$\vdots$}
-- (1.5,0.5) node {mRNA};
\fill
(2.5,4.5) node {\bf{$\text{0 cell}_2$}}
-- (2.5,3.5) node {mRNA}
-- (2.5,2.5) node {\color{red}0}
-- (2.5,1.5) node {$\vdots$}
-- (2.5,0.5) node {mRNA};
\fill
(3.5,4.5) node {\bf{$\cdots$}}
-- (3.5,3.5) node {$\cdots$}
-- (3.5,2.5) node {$\cdots$}
-- (3.5,1.5) node {$\ddots$}
-- (3.5,0.5) node {$\cdots$};
\fill
(4.5,4.5) node {\bf{$\text{cell}_c$}}
-- (4.5,3.5) node {mRNA}
-- (4.5,2.5) node {\color{red}0}
-- (4.5,1.5) node {$\vdots$}
-- (4.5,0.5) node {mRNA};
\draw (1,0) grid (5,4);
\end{tikzpicture}
\end{center}
\column{0.5\textwidth}
{\large We don't need to keep genes not expressed across cell-types}
\vspace {1em}
\begin{enumerate}
\item Remove gene $i$ if $\sum_{j=1}^{N} x_{ij} = 0$
\item Remove gene $i$ if $\frac{1}{N}\sum_{j=1}^{N} x_{ij} < 1$
\end{enumerate}
\vspace {1em}
We cannot correctly estimate the variability of gene $i$ if condition {\bf 2} is not fulfilled.
\end{columns}
\end{center}
# Normalization
## Counts model
## Expression model
\begin{center}
\includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
\end{center}
## Counts distribution
## Measurement model
### Random variable
A variable whose values depends on outcomes of a random phenomenon or experiment.
A variable whose values depend on outcomes of a random phenomenon or experiment.
### For a given gene:
We consider $X$ a **random variable** with $x$ a realisation of $X$ the number of mRNA's observed in a cell.
We consider $X$ a **random variable** with $x$ a realization of $X$ the number of mRNAs observed for a gene in a cell at time $t$
\begin{itemize}
\item The random variable $X$ follow a statitical distribution $F$
\item The random variable $X$ follows a statistical distribution $F$
\item We write $X \sim F$
\end{itemize}
## Counts model
## Measurement model
\begin{center}
\includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
\end{center}
With a transcription rate $\lambda_g(t)$ the observed mRNA count follow a Poisson distribution $\mathcal{P}(\lambda_g(t))$.
With a transcription rate $\lambda_g(t)$ the observed mRNA count follow a Poisson distribution
\[X \sim \mathcal{P}(\lambda_g(t))\]
## Counts distribution
......@@ -341,8 +400,7 @@ $P(X = x)$ for $\mathcal{P}(\lambda_g)$
\includegraphics[width=0.6\textwidth]{./img/poisson.png}
\end{center}
$\lambda_g$ the rate of mRNA production is equal to the variance in the number
of mRNA.
The expectation of $X$, $E(X)$ is equal to it's variance $Var(X)$ (both are equal to $\lambda_g$)
## Counts
......@@ -388,16 +446,20 @@ of mRNA.
For a gene $g$, {\bf each cell is an observation} of the mRNA count of $g$
\vspace{1em}
As we have a large number of cells, we have access to the:
\vspace{-1em}
\begin{itemize}
\item empirical mean
\item empirical variance
\item empirical distribution
\end{itemize}
of $x_g$
\vspace{1em}
bulk RNASeq $\sim 3$ observation per gene
In bulk RNASeq, we have $\sim 3$ observation per gene the task is more difficult.
\end{columns}
\end{center}
......@@ -410,33 +472,48 @@ $P(X = x)$ for $\mathcal{P}(\mu)$
\includegraphics[width=0.6\textwidth]{./img/poisson.png}
\end{center}
$\mu$ the rate of mRNA production is equal to the variability in the number
of mRNA.
**We often have more variability! (broader distributions)**
\begin{center}
{\bf We observe more variability! (broader distributions)}
\end{center}
## Counts model
## Expression model
\begin{center}
\includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
\end{center}
Cells are not exact replicates of one anothers: a large numbers of factors can be different between two cells
Cells are not exact replicates of one another: a large number of factors can be different between two cells
$\lambda_g(t)$ is a **random variable**
\begin{center}
$\lambda_g(t)$ is also a {\bf random variable}
\end{center}
## Counts distributions
\begin{center}
\begin{columns}
\column{0.4\textwidth}
$X \sim \mathcal{P}(\lambda)$: $\sigma^2 = \lambda$
\column{0.5\textwidth}
We have a hierarchical model:
\[
\begin{aligned}[t]
\lambda \sim & F(.) \\
X | \lambda \sim & \mathcal{P}(\lambda)
\end{aligned}[t]
\]
\vspace{2em}
$X \sim \mathcal{NB}(\lambda, \sigma)$: $\sigma^2 = \lambda + \alpha \lambda^2$
If $F = \mathcal{Gamma(\lambda, \alpha)}$, $X$ follow a Negative-Binomial distribution
\column{0.6\textwidth}
\[
X \sim \mathcal{NB}(\lambda, \alpha)
\]
with $Var(X) = \lambda + \alpha \lambda^2$
\column{0.5\textwidth}
\vspace{1em}
\includegraphics[width=0.9\textwidth]{./img/mu_vs_var.png}
......@@ -445,7 +522,7 @@ $X \sim \mathcal{NB}(\lambda, \sigma)$: $\sigma^2 = \lambda + \alpha \lambda^2$
## Counts distributions
$P(X = x)$ for $\mathcal{NB}(\mu, \sigma)$
$P(X = x)$ for $\mathcal{P}(\lambda)$
\begin{center}
\includegraphics[width=0.8\textwidth]{./img/poisson.png}
......@@ -454,7 +531,7 @@ $P(X = x)$ for $\mathcal{NB}(\mu, \sigma)$
## Counts distributions
$P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 10)$
$P(X = x)$ for $\mathcal{NB}(\lambda, \alpha = 10)$
\begin{center}
\includegraphics[width=0.8\textwidth]{./img/NB_sigma_10.png}
......@@ -463,7 +540,7 @@ $P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 10)$
## Counts distributions
$P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 2)$
$P(X = x)$ for $\mathcal{NB}(\lambda, \alpha = 2)$
\begin{center}
\includegraphics[width=0.8\textwidth]{./img/NB_sigma_2.png}
......@@ -471,7 +548,7 @@ $P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 2)$
## Counts distributions
$P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 1)$
$P(X = x)$ for $\mathcal{NB}(\lambda, \alpha = 1)$
\begin{center}
\includegraphics[width=0.8\textwidth]{./img/NB_sigma_1.png}
......@@ -519,29 +596,321 @@ $P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 1)$
\column{0.6\textwidth}
\[
Var(X) = \lambda + \alpha \lambda^2
\]
Which mRNA comparison seems the most significant to you :
\[
\begin{aligned}
x_{1,1} = 50 & \text{ vs } x_{1,2} = 5 \\
x_{1,1} = 10050 & \text{ vs } x_{2,2} = 10000
\end{aligned}
\]
\vspace{1em}
We don't want gene with high variance to have an outsized impact on the analysis.
\end{columns}
\end{center}
## Variance in count depth
### The technical replicate unit is the cell
\begin{center}
\includegraphics[width=\textwidth]{img/cell_depth.png}
\end{center}
## Goals of the normalization step
### Ideally, we want the normalization step to be able to:
- remove differences of depth between cells
- stabilize the variability of the count data
- be monotonous
- make different data sets similar
## Depth normalization
The simplest depth normalization is to scale every column of the $X$ matrix to a given factor
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\[
x_{ij}^{norm} = \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}} \times \alpha
\]
\begin{itemize}
\item $50$ vs $5$
\item $10050$ vs $10000$
\item $\alpha = 10^{6}$: CPM
\item $\alpha = 10^{4}$: CP10K
\end{itemize}
We want to consider that larger
\column{0.4\textwidth}
\vspace{1em}
\begin{center}
\includegraphics[width=\textwidth]{img/total_umi_dist.png}
\end{center}
\end{columns}
\end{center}
## Variance stabilization
The simplest variance stabilization is the natural $log$ transform (base 10)
\begin{center}
\begin{columns}
\column{0.5\textwidth}
\begin{center}
$P(X_{ij} = 0) > 0$
\includegraphics[width=\textwidth]{img/log10_vs_log1p.png}
\end{center}
\column{0.5\textwidth}
\begin{center}
We use $log(x + 1)$ or $log1p(x)$
\includegraphics[width=\textwidth]{img/var_log1p.png}
\end{center}
\end{columns}
\end{center}
# Variance stabilization
## Monotonicity of the normalization
The $log1p(CPM)$ is monotonous:
\[
\forall x_i, x_j, \text{such that}, x_i < x_j, log1p\left(\frac{x_i}{\sum x} 10^6\right) < log1p\left(\frac{x_j}{\sum x} 10^6\right)
\]
\vspace{2em}
\begin{center}
{\bf This property is crucial for the task of finding marker genes}
\end{center}
## Variance stabilization
\begin{center}
\includegraphics[width=0.8\textwidth]{img/log_transform_6_groups.png}
\end{center}
\begin{center}
Variance stabilization for 6 groups of increasing gene mean
\end{center}
## SCTransform
\begin{center}
\begin{columns}
\column{0.4\textwidth}
SCTransform use a {\bf linear model} of the {\bf expected count} per gene
\[
X_{ij} \sim GammaPoisson\left(\lambda_{ij}, \alpha_{i}\right)
\]
\[
\log\left(E\left(x_{j}\right)\right) = \beta_{i0} + \beta_{is} \log \left(s\right)
\]
with $s$ the vector of column sums of the count matrix $X$ (total UMI per cell)
\column{0.5\textwidth}
\begin{center}
\includegraphics[width=\textwidth]{img/regression_linaire.png}
\end{center}
\end{columns}
\end{center}
## SCTransform
### We fit the model for each gene
\[
\log\left(E\left(x_{j}\right)\right) = \beta_{i0} + \beta_{is} \log \left(s\right)
\]
\begin{center}
\includegraphics[width=\textwidth]{img/sctransform_reg_parameter.png}
\end{center}
\begin{center}
The pink line corresponds to a smooth average of the parameter values
\end{center}
## SCTransform
Modeling each gene separately results in overfitting, particularly for low-abundance genes that are detected in only a minor subset of cells and are modeled with a high variance.
\begin{center}
\includegraphics[width=\textwidth]{img/sctransform_regularization.png}
\end{center}
## SCTransform
### Pearson residuals
\[
\begin{aligned}
r_{ij} &= \frac{x - \lambda_{ij}}{\sqrt{\lambda_{ij} + \alpha \lambda_{ij}^2}} \\
\lambda_{ij} &= exp\left(\beta_{i0} + \beta_{is} \log \left(s\right)\right)
\end{aligned}
\]
\begin{center}
SCTransform use the {\bf Pearson residuals} as the normalized count values
\end{center}
\vspace{2em}
We can *regress* other factors
\[
\log\left(E\left(x_{j}\right)\right) = \beta_{i0} + \beta_{is} \log \left(s\right) + \dots + \beta_{iv} \log \left(v\right)
\]
## Sanity
\begin{center}
\includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
\end{center}
## Sanity
\begin{center}
\begin{columns}
\column{0.33\textwidth}
transcription activity in cell $c$
\column{0.33\textwidth}
mRNA count in cell $c$
\column{0.33\textwidth}
cell state
\end{columns}
\end{center}
\begin{center}
\includegraphics[width=\textwidth]{img/sanity_model.png}
\end{center}
\begin{center}
\begin{columns}
\column{0.33\textwidth}
\[
\vec{\alpha}_c =
\begin{bmatrix}
\frac{x_{1c}}{\sum x_c}\\
\frac{x_{2c}}{\sum x_c}\\
\vdots \\
\frac{x_{nc}}{\sum x_c}\\
\end{bmatrix}
\]
\column{0.33\textwidth}
\[
m_{gc}|a_{gc} \sim \mathcal{P}\left(a_{gc}\right)
\]
\column{0.33\textwidth}
\[
n_{gc}|p_{c}, a_{gc} \sim \mathcal{P}\left(p_{c} a_{gc}\right)
\]
\end{columns}
\end{center}
## Sanity
\begin{center}
\begin{columns}
\column{0.33\textwidth}
transcription activity in cell $c$
\column{0.33\textwidth}
mRNA count in cell $c$
\column{0.33\textwidth}
cell state
\end{columns}
\end{center}
\begin{center}
\includegraphics[width=\textwidth]{img/sanity_model.png}
\end{center}
The probability of obtaining the UMI counts $\vec{n}_c$ given the cell state $\vec{\alpha}_c$ is a product over genes of Poisson distributions with means $N_{c}\alpha_{gc}$, where $N_c$ is the total UMI count in cell $c$.
\begin{center}
Sanity will estimate $log\left(\alpha_{gi}\right) \sim \mathcal{N}\left(\mu, \sigma\right)$
\end{center}
## Goals of Normalization
\begin{center}
\includegraphics[width=\textwidth]{./img/normalization_marker_gene.png}
\end{center}
# Depth normalization
# Monotonicity of the normalization
# batch effects
# heterogeneous data
# Goals of Normalization
## batch effects
\begin{center}
\includegraphics[width=\textwidth]{./img/batch_effect.png}
\end{center}
**batch** effects appear when you mix data from different:
- experiments
- time points
- plates
- sequencing lane
- etc.
## batch effects
\begin{center}
\begin{columns}
\column{0.5\textwidth}
{\bf SCTransform}
We can regress out a categorial variable
\column{0.5\textwidth}
{\bf Sanity}
We can run Sanity on each separate batch and combine the results.
\end{columns}
\end{center}
## Harmony
\begin{center}
\includegraphics[width=\textwidth]{./img/harmony.png}
\end{center}
# single-cell RNA-Seq dimension reduction *Monday 13 June 2022*
- Variance stabilization
- Depth normalization
- Monotonicity of the normalization
- Make different data set similar
\ No newline at end of file
## References
\ No newline at end of file
library(tidyverse)
setwd("2_normalization/")