normalization.Rmd
- Introduction
- Introduction
- Program
- Introduction
- Program
- Quality control
- Cell filtering
- Cell filtering
- Cell filtering
- Cell filtering
- Cell filtering
- Cell filtering
- Cell filtering
- hypothesis
- Algorithm
- Cell filtering
- Cell filtering
- Cell filtering
- Gene filtering
- Normalization
- Expression model
- Measurement model
- Random variable
- For a given gene:
- Measurement model
- Counts distribution
- Counts
- Counts distributions
- Expression model
- Counts distributions
- Counts distributions
- Counts distributions
- Counts distributions
- Counts distributions
- Variance of count data
- Variance in count depth
- The technical replicate unit is the cell
- Goals of the normalization step
- Ideally, we want the normalization step to be able to:
- Depth normalization
- Variance stabilization
- Monotonicity of the normalization
- Variance stabilization
- SCTransform
- SCTransform
- We fit the model for each gene
- SCTransform
- SCTransform
- Pearson residuals
- Sanity
- Sanity
- Sanity
- Goals of Normalization
- batch effects
- batch effects
- batch effects
- Harmony
- single-cell RNA-Seq dimension reduction Monday 13 June 2022
- References
title: "single-cell RNA-Seq: Normalization"
author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
date: "Wednesday 8 June 2022"
output:
beamer_presentation:
df_print: tibble
fig_caption: no
highlight: tango
latex_engine: xelatex
slide_level: 2
theme: metropolis
ioslides_presentation:
highlight: tango
slidy_presentation:
highlight: tango
classoption: aspectratio=169
Introduction
Introduction
Program
- Single-cell RNASeq data from 10X Sequencing (Friday 3 June 2022 - 14:00)
- Normalization and spurious effects (Wednesday 8 June 2022 - 14:00)
- Dimension reduction and data visualization (Monday 13 June 2022 - 15:00)
- Clustering and annotation (Thursday 23 June 2022 - 14:00)
- Pseudo-time and velocity inference (Thursday 30 June 2022 - 14:00)
- Differential expression analysis (Friday 8 July 2022 - 14:00)
Introduction
Program
- Single-cell RNASeq data from 10X Sequencing (Friday 3 June 2022 - 14:00)
- Normalization and spurious effects (Wednesday 8 June 2022 - 14:00)
- Quality control
- Normalization
- Variance stabilization
- Depth normalization
- The monotonicity of the normalization
- batch effects
- Heterogeneous data
- Dimension reduction and data visualization (Monday 13 June 2022 - 15:00)
- Clustering and annotation (Thursday 23 June 2022 - 14:00)
- Pseudo-time and velocity inference (Thursday 30 June 2022 - 14:00)
- Differential expression analysis (Friday 8 July 2022 - 14:00)
Quality control
Cell filtering
\begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} \begin{tikzpicture} \fill (0.5,3.5) node {\bf
\column{0.5\textwidth}
{\large Some cells are not cells.}
\begin{itemize} \item matrix columns are defined by {\bf cell barcode sequences} \item {\bf cell barcode sequences identify droplet} in the 10X protocol \end{itemize}
\end{columns} \end{center}
Cell filtering
\begin{center} \begin{columns} \column{0.5\textwidth} \begin{tikzpicture} \fill (0.5,3.5) node {\bf
\column{0.5\textwidth}
{\large Some cells are not cells.}
\begin{itemize} \item {\bf v2} chemistry
\vspace{1em}
To avoid cell barcode collision we need [ \text{\bf cell number} \ll \text{\bf cell barcode number} ]
Most of the droplets will be empty
\end{columns} \end{center}
Cell filtering
\begin{center} \begin{columns} \column{0.35\textwidth} Sequenced empty droplets: \begin{itemize} \item do not express many genes \item looks like experimental noise \end{itemize}
\vspace{1em}
The number of UMI per cell barcode
\column{0.7\textwidth} \vspace{1.5em} \includegraphics[width=\textwidth]{img/cell_barcode_rank_vs_umi.png} \end{columns} \end{center}
Cell filtering
\begin{center} \begin{columns} \column{0.35\textwidth} We have {\bf two populations} of cell barcode: \begin{itemize} \item a {\bf low} total UMI counts one \item a {\bf high} total UMI counts one \end{itemize}
\column{0.7\textwidth} \vspace{1.5em} \includegraphics[width=\textwidth]{img/cell_barcode_rank_vs_umi.png} \end{columns} \end{center}
Cell filtering
\begin{center} \begin{columns} \column{0.5\textwidth} \begin{tikzpicture} \fill (0.5,3.5) node {\bf
\column{0.5\textwidth}
{\large Some cells are many cells:}
\begin{itemize} \item not all tissues are easily dissociable \item two cells glued together will share the same droplet \item two different cells can share the same droplet by chance \end{itemize}
\vspace{1em}
Cell barcode corresponding to
\end{columns} \end{center}
Cell filtering
\begin{center} \includegraphics[width=0.75\textwidth]{img/mouse_human_mix.png} \end{center}
Cell filtering
hypothesis
Cell barcode corresponding to
Algorithm
- Simulate thousands of doublets by adding together two randomly chosen single-cell profiles.
- For each original cell, compute the density of simulated doublets in the surrounding neighborhood.
- For each original cell, compute the density of other observed cells in the neighborhood.
- Return the ratio between the two density as a doublet score for each cell.
Cell filtering
\begin{center} \includegraphics[width=\textwidth]{img/doublet_detection_comparison.png} \end{center}
Different algorithms are available to compare cells to synthetic doublets
Cell filtering
\begin{center} \includegraphics[width=0.8\textwidth]{img/features_for_QC_1.png} \end{center} \vspace{-1.5em} We can use hard thresholds to remove putative poor quality cells \vspace{-0.5em} \begin{itemize} \item apoptotic cells express MT genes \item incefficient RT or PCR amplification \end{itemize}
Cell filtering
\begin{center} \includegraphics[width=0.8\textwidth]{img/features_for_QC_2.png} \end{center}
Cell expressing few genes also contains few mRNA molecules
Gene filtering
\begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} \begin{tikzpicture} \fill (0.5,3.5) node {\bf
\column{0.5\textwidth}
{\large We don't need to keep genes not expressed across cell-types}
\vspace {1em} \begin{enumerate} \item Remove gene i if \sum_{j=1}^{N} x_{ij} = 0 \item Remove gene i if \frac{1}{N}\sum_{j=1}^{N} x_{ij} < 1 \end{enumerate}
\vspace {1em}
We cannot correctly estimate the variability of gene i if condition {\bf 2} is not fulfilled.
\end{columns} \end{center}
Normalization
Expression model
\begin{center} \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png} \end{center}
Measurement model
Random variable
A variable whose values depend on outcomes of a random phenomenon or experiment.
For a given gene:
We consider X a random variable with x a realization of X the number of mRNAs observed for a gene in a cell at time t
\begin{itemize} \item The random variable X follows a statistical distribution F \item We write X \sim F \end{itemize}
Measurement model
\begin{center} \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png} \end{center}
With a transcription rate \lambda_g(t) the observed mRNA count follow a Poisson distribution
[X \sim \mathcal{P}(\lambda_g(t))]
Counts distribution
P(X = x) for \mathcal{P}(\lambda_g)
\begin{center} \includegraphics[width=0.6\textwidth]{img/poisson.png} \end{center}
The expectation of X, E(X) is equal to it's variance Var(X) (both are equal to \lambda_g)
Counts
\begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} \begin{tikzpicture} \fill (0.5,3.5) node {\bf \text{gene}_1} -- (0.5,2.5) node {\bf \text{gene}_2} -- (0.5,1.5) node {\bf \vdots} -- (0.5,0.5) node {\bf \text{gene}_n}; \fill (1.5,4.5) node {\bf{\text{cell}_1}} -- (1.5,3.5) node {mRNA} -- (1.5,2.5) node {\color{red}mRNA} -- (1.5,1.5) node {\vdots} -- (1.5,0.5) node {mRNA}; \fill (2.5,4.5) node {\bf{\text{cell}_2}} -- (2.5,3.5) node {mRNA} -- (2.5,2.5) node {\color{red}mRNA} -- (2.5,1.5) node {\vdots} -- (2.5,0.5) node {mRNA}; \fill (3.5,4.5) node {\bf{\cdots}} -- (3.5,3.5) node {\cdots} -- (3.5,2.5) node {\color{red}\cdots} -- (3.5,1.5) node {\ddots} -- (3.5,0.5) node {\cdots}; \fill (4.5,4.5) node {\bf{\text{cell}_c}} -- (4.5,3.5) node {mRNA} -- (4.5,2.5) node {\color{red}mRNA} -- (4.5,1.5) node {\vdots} -- (4.5,0.5) node {mRNA}; \draw (1,0) grid (5,4); \end{tikzpicture} \end{center}
\column{0.6\textwidth}
For a gene g, {\bf each cell is an observation} of the mRNA count of g
\vspace{1em} As we have a large number of cells, we have access to the: \vspace{-1em} \begin{itemize} \item empirical mean \item empirical variance \item empirical distribution \end{itemize}
of x_g
\vspace{1em}
In bulk RNASeq, we have \sim 3 observation per gene the task is more difficult.
\end{columns} \end{center}
Counts distributions
P(X = x) for \mathcal{P}(\mu)
\begin{center} \includegraphics[width=0.6\textwidth]{img/poisson.png} \end{center}
\begin{center} {\bf We observe more variability! (broader distributions)} \end{center}
Expression model
\begin{center} \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png} \end{center}
Cells are not exact replicates of one another: a large number of factors can be different between two cells
\begin{center} \lambda_g(t) is also a {\bf random variable} \end{center}
Counts distributions
\begin{center} \begin{columns} \column{0.5\textwidth}
We have a hierarchical model:
[ \begin{aligned}[t] \lambda \sim & F(.) \ X | \lambda \sim & \mathcal{P}(\lambda) \end{aligned}[t] ]
\vspace{2em}
If F = \mathcal{Gamma(\lambda, \alpha)}, X follow a Negative-Binomial distribution
[ X \sim \mathcal{NB}(\lambda, \alpha) ]
with Var(X) = \lambda + \alpha \lambda^2
\column{0.5\textwidth} \vspace{1em} \includegraphics[width=0.9\textwidth]{img/mu_vs_var.png}
\end{columns} \end{center}
Counts distributions
P(X = x) for \mathcal{P}(\lambda)
\begin{center} \includegraphics[width=0.8\textwidth]{img/poisson.png} \end{center}
Counts distributions
P(X = x) for \mathcal{NB}(\lambda, \alpha = 10)
\begin{center} \includegraphics[width=0.8\textwidth]{img/NB_sigma_10.png} \end{center}
Counts distributions
P(X = x) for \mathcal{NB}(\lambda, \alpha = 2)
\begin{center} \includegraphics[width=0.8\textwidth]{img/NB_sigma_2.png} \end{center}
Counts distributions
P(X = x) for \mathcal{NB}(\lambda, \alpha = 1)
\begin{center} \includegraphics[width=0.8\textwidth]{img/NB_sigma_1.png} \end{center}
Variance of count data
\begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} \begin{tikzpicture} \fill (0.5,3.5) node {\bf \text{gene}_1} -- (0.5,2.5) node {\bf \text{gene}_2} -- (0.5,1.5) node {\bf \vdots} -- (0.5,0.5) node {\bf \text{gene}_n}; \fill (1.5,4.5) node {\bf{\text{cell}_1}} -- (1.5,3.5) node {mRNA} -- (1.5,2.5) node {\color{red}mRNA} -- (1.5,1.5) node {\vdots} -- (1.5,0.5) node {mRNA}; \fill (2.5,4.5) node {\bf{\text{cell}_2}} -- (2.5,3.5) node {mRNA} -- (2.5,2.5) node {\color{red}mRNA} -- (2.5,1.5) node {\vdots} -- (2.5,0.5) node {mRNA}; \fill (3.5,4.5) node {\bf{\cdots}} -- (3.5,3.5) node {\cdots} -- (3.5,2.5) node {\color{red}\cdots} -- (3.5,1.5) node {\ddots} -- (3.5,0.5) node {\cdots}; \fill (4.5,4.5) node {\bf{\text{cell}_c}} -- (4.5,3.5) node {mRNA} -- (4.5,2.5) node {\color{red}mRNA} -- (4.5,1.5) node {\vdots} -- (4.5,0.5) node {mRNA}; \draw (1,0) grid (5,4); \end{tikzpicture} \end{center}
\column{0.6\textwidth}
[ Var(X) = \lambda + \alpha \lambda^2 ]
Which mRNA comparison seems the most significant to you : [ \begin{aligned} x_{1,1} = 50 & \text{ vs } x_{1,2} = 5 \ x_{1,1} = 10050 & \text{ vs } x_{2,2} = 10000 \end{aligned} ] \vspace{1em}
We don't want gene with high variance to have an outsized impact on the analysis.
\end{columns} \end{center}
Variance in count depth
The technical replicate unit is the cell
\begin{center} \includegraphics[width=\textwidth]{img/cell_depth.png} \end{center}
Goals of the normalization step
Ideally, we want the normalization step to be able to:
- remove differences of depth between cells
- stabilize the variability of the count data
- be monotonous
- make different data sets similar
Depth normalization
The simplest depth normalization is to scale every column of the X matrix to a given factor
\begin{center} \begin{columns} \column{0.5\textwidth}
[ x_{ij}^{norm} = \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}} \times \alpha ]
\begin{itemize} \item \alpha = 10^{6}: CPM \item \alpha = 10^{4}: CP10K \end{itemize}
\column{0.4\textwidth} \vspace{1em} \begin{center} \includegraphics[width=\textwidth]{img/total_umi_dist.png} \end{center}
\end{columns} \end{center}
Variance stabilization
The simplest variance stabilization is the natural log transform (base 10)
\begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} P(X_{ij} = 0) > 0
\includegraphics[width=\textwidth]{img/log10_vs_log1p.png}
\end{center}
\column{0.5\textwidth} \begin{center} We use log(x + 1) or log1p(x)
\includegraphics[width=\textwidth]{img/var_log1p.png} \end{center}
\end{columns} \end{center}
Monotonicity of the normalization
The log1p(CPM) is monotonous:
[ \forall x_i, x_j, \text{such that}, x_i < x_j, log1p\left(\frac{x_i}{\sum x} 10^6\right) < log1p\left(\frac{x_j}{\sum x} 10^6\right) ]
\vspace{2em}
\begin{center} {\bf This property is crucial for the task of finding marker genes} \end{center}
Variance stabilization
\begin{center} \includegraphics[width=0.8\textwidth]{img/log_transform_6_groups.png} \end{center}
\begin{center} Variance stabilization for 6 groups of increasing gene mean \end{center}
SCTransform
\begin{center} \begin{columns} \column{0.4\textwidth}
SCTransform use a {\bf linear model} of the {\bf expected count} per gene
[ X_{ij} \sim GammaPoisson\left(\lambda_{ij}, \alpha_{i}\right) ]
[ \log\left(E\left(x_{j}\right)\right) = \beta_{i0} + \beta_{is} \log \left(s\right) ]
with s the vector of column sums of the count matrix X (total UMI per cell)
\column{0.5\textwidth} \begin{center} \includegraphics[width=\textwidth]{img/regression_linaire.png} \end{center}
\end{columns} \end{center}
SCTransform
We fit the model for each gene
[ \log\left(E\left(x_{j}\right)\right) = \beta_{i0} + \beta_{is} \log \left(s\right) ]
\begin{center} \includegraphics[width=\textwidth]{img/sctransform_reg_parameter.png} \end{center}
\begin{center} The pink line corresponds to a smooth average of the parameter values \end{center}
SCTransform
Modeling each gene separately results in overfitting, particularly for low-abundance genes that are detected in only a minor subset of cells and are modeled with a high variance.
\begin{center} \includegraphics[width=\textwidth]{img/sctransform_regularization.png} \end{center}
SCTransform
Pearson residuals
[ \begin{aligned} r_{ij} &= \frac{x - \lambda_{ij}}{\sqrt{\lambda_{ij} + \alpha \lambda_{ij}^2}} \ \lambda_{ij} &= exp\left(\beta_{i0} + \beta_{is} \log \left(s\right)\right) \end{aligned} ]
\begin{center} SCTransform use the {\bf Pearson residuals} as the normalized count values \end{center}
\vspace{2em} We can regress other factors
[ \log\left(E\left(x_{j}\right)\right) = \beta_{i0} + \beta_{is} \log \left(s\right) + \dots + \beta_{iv} \log \left(v\right) ]
Sanity
\begin{center} \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png} \end{center}
Sanity
\begin{center} \begin{columns} \column{0.33\textwidth}
transcription activity in cell c
\column{0.33\textwidth}
mRNA count in cell c
\column{0.33\textwidth}
cell state
\end{columns} \end{center} \begin{center} \includegraphics[width=\textwidth]{img/sanity_model.png} \end{center}
\begin{center} \begin{columns} \column{0.33\textwidth}
[ \vec{\alpha}c = \begin{bmatrix} \frac{x{1c}}{\sum x_c}\ \frac{x_{2c}}{\sum x_c}\ \vdots \ \frac{x_{nc}}{\sum x_c}\ \end{bmatrix} ]
\column{0.33\textwidth}
[ m_{gc}|a_{gc} \sim \mathcal{P}\left(a_{gc}\right) ]
\column{0.33\textwidth}
[ n_{gc}|p_{c}, a_{gc} \sim \mathcal{P}\left(p_{c} a_{gc}\right) ]
\end{columns} \end{center}
Sanity
\begin{center} \begin{columns} \column{0.33\textwidth}
transcription activity in cell c
\column{0.33\textwidth}
mRNA count in cell c
\column{0.33\textwidth}
cell state
\end{columns} \end{center} \begin{center} \includegraphics[width=\textwidth]{img/sanity_model.png} \end{center}
The probability of obtaining the UMI counts \vec{n}_c given the cell state \vec{\alpha}_c is a product over genes of Poisson distributions with means N_{c}\alpha_{gc}, where N_c is the total UMI count in cell c.
\begin{center} Sanity will estimate log\left(\alpha_{gi}\right) \sim \mathcal{N}\left(\mu, \sigma\right) \end{center}
Goals of Normalization
\begin{center} \includegraphics[width=\textwidth]{img/normalization_marker_gene.png} \end{center}
batch effects
batch effects
\begin{center} \includegraphics[width=\textwidth]{img/batch_effect.png} \end{center}
batch effects appear when you mix data from different:
- experiments
- time points
- plates
- sequencing lane
- etc.
batch effects
\begin{center} \begin{columns} \column{0.5\textwidth} {\bf SCTransform}
We can regress out a categorial variable \column{0.5\textwidth} {\bf Sanity}
We can run Sanity on each separate batch and combine the results.
\end{columns} \end{center}
Harmony
\begin{center} \includegraphics[width=\textwidth]{img/harmony.png} \end{center}