2_normalization.Rmd: update

Verified Commit ead99e62 authored Jun 5, 2022 by Laurent Modolo
--- a/1_scrnaseq_data/scrnaseq_data.Rmd
+++ b/1_scrnaseq_data/scrnaseq_data.Rmd
@@ -957,6 +957,8 @@ With $K$ the $k$-compatibility class counts and $\beta$ the transcript quantific
 \includegraphics[width=\textwidth]{img/scasa_vs_other.png}
 \end{center}

+# scRNA data normalization: Friday 8 June 2022
+

 ## References


--- a/2_normalization/img/NB_sigma_1.png
+++ b/2_normalization/img/NB_sigma_1.png
--- a/2_normalization/img/NB_sigma_10.png
+++ b/2_normalization/img/NB_sigma_10.png
--- a/2_normalization/img/NB_sigma_2.png
+++ b/2_normalization/img/NB_sigma_2.png
--- a/2_normalization/img/doublet_detection_comparison.png
+++ b/2_normalization/img/doublet_detection_comparison.png
--- a/2_normalization/img/features_for_QC_1.png
+++ b/2_normalization/img/features_for_QC_1.png
--- a/2_normalization/img/features_for_QC_2.png
+++ b/2_normalization/img/features_for_QC_2.png
--- a/2_normalization/img/mouse_human_mix.png
+++ b/2_normalization/img/mouse_human_mix.png
--- a/2_normalization/img/mu_vs_var.png
+++ b/2_normalization/img/mu_vs_var.png
--- a/2_normalization/img/poisson.png
+++ b/2_normalization/img/poisson.png
--- a/2_normalization/img/sanity_model_a.png
+++ b/2_normalization/img/sanity_model_a.png
--- a/2_normalization/img/sanity_model_a_bis.png
+++ b/2_normalization/img/sanity_model_a_bis.png
--- a/2_normalization/normalization.Rmd
+++ b/2_normalization/normalization.Rmd
 ---
 title: "single-cell RNA-Seq: Normalization"
 author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
-date: "Friday 3 June 2022"
+date: "Friday 8 June 2022"
 output:
  beamer_presentation:
    df_print: tibble
@@ -107,7 +107,6 @@ classoption: aspectratio=169
 \begin{center}
 \begin{columns}
 \column{0.5\textwidth}
-\begin{center}
 \begin{tikzpicture}
  \fill
      (0.5,3.5) node {\bf $\text{gene}_1$}
@@ -115,13 +114,13 @@ classoption: aspectratio=169
   -- (0.5,1.5) node {\bf $\vdots$}
   -- (0.5,0.5) node {\bf $\text{gene}_n$};
  \fill
-      (1.5,4.5) node {\bf{$\text{bc}_1$}}
+      (1.5,4.5) node {\bf $\text{bc}_1$}
   -- (1.5,3.5) node {mRNA}
   -- (1.5,2.5) node {mRNA}
   -- (1.5,1.5) node {$\vdots$}
   -- (1.5,0.5) node {mRNA};
  \fill
-      (2.5,4.5) node {\color{red}\bf{$\text{bc}_2$}}
+      (2.5,4.5) node {\color{red}\bf $\text{bc}_2$}
   -- (2.5,3.5) node {\color{red}mRNA}
   -- (2.5,2.5) node {\color{red}mRNA}
   -- (2.5,1.5) node {\color{red}$\vdots$}
@@ -133,14 +132,13 @@ classoption: aspectratio=169
   -- (3.5,1.5) node {$\ddots$}
   -- (3.5,0.5) node {$\cdots$};
  \fill
-      (4.5,4.5) node {\bf{$\text{bc}_c$}}
+      (4.5,4.5) node {\bf $\text{bc}_c$}
   -- (4.5,3.5) node {mRNA}
   -- (4.5,2.5) node {mRNA}
   -- (4.5,1.5) node {$\vdots$}
   -- (4.5,0.5) node {mRNA};
  \draw (1,0) grid (5,4);
 \end{tikzpicture}
-\end{center}

 \column{0.5\textwidth}

@@ -206,7 +204,6 @@ Most of the droplets will be empty
 \begin{center}
 \begin{columns}
 \column{0.5\textwidth}
-\begin{center}
 \begin{tikzpicture}
  \fill
      (0.5,3.5) node {\bf $\text{gene}_1$}
@@ -214,55 +211,327 @@ Most of the droplets will be empty
   -- (0.5,1.5) node {\bf $\vdots$}
   -- (0.5,0.5) node {\bf $\text{gene}_n$};
  \fill
-      (1.5,4.5) node {\bf{$\text{cell}_1$}}
+      (1.5,4.5) node {\bf $\text{cell}_1$}
   -- (1.5,3.5) node {mRNA}
   -- (1.5,2.5) node {mRNA}
   -- (1.5,1.5) node {$\vdots$}
   -- (1.5,0.5) node {mRNA};
  \fill
-      (2.5,4.5) node {\color{red}\bf{$\text{2 cells}_2$}}
+      (2.5,4.5) node {\color{red}\bf $\text{2 cells}_2$}
   -- (2.5,3.5) node {\color{red}mRNA}
   -- (2.5,2.5) node {\color{red}mRNA}
   -- (2.5,1.5) node {\color{red}$\vdots$}
   -- (2.5,0.5) node {\color{red}mRNA};
  \fill
-      (3.5,4.5) node {\bf{$\cdots$}}
+      (3.5,4.5) node {\bf $\cdots$}
   -- (3.5,3.5) node {$\cdots$}
   -- (3.5,2.5) node {$\cdots$}
   -- (3.5,1.5) node {$\ddots$}
   -- (3.5,0.5) node {$\cdots$};
  \fill
-      (4.5,4.5) node {\bf{$\text{cell}_c$}}
+      (4.5,4.5) node {\bf $\text{cell}_c$}
   -- (4.5,3.5) node {mRNA}
   -- (4.5,2.5) node {mRNA}
   -- (4.5,1.5) node {$\vdots$}
   -- (4.5,0.5) node {mRNA};
  \draw (1,0) grid (5,4);
 \end{tikzpicture}
-\end{center}

 \column{0.5\textwidth}

-{\large Some cells are many cells.}
+{\large Some cells are many cells:}

 \begin{itemize}
  \item not all tissues are easily dissociable
  \item two cells glued together will share the same droplet
+  \item two different cells can share the same droplet by chance
 \end{itemize}

 \vspace{1em}

-cell barcode corresponding to $n$-plet should be in monority the the preparation went well.
+Cell barcodes corresponding to $n$-plets should be in the minority if the preparation went well.

 \end{columns}
 \end{center}

 ## Cell filtering

-apoptotic cells express MT genes
+\begin{center}
+  \includegraphics[width=0.75\textwidth]{img/mouse_human_mix.png}
+\end{center}
+
+## Cell filtering
+
+\begin{block}{hypothesis}
+  Cell barcodes corresponding to $n$-plets should be in the minority if the preparation went well.
+\end{block}
+
+
+### Algorithm
+
+1. Simulate thousands of doublets by adding together two randomly chosen single-cell profiles.
+2. For each original cell, compute the density of simulated doublets in the surrounding neighborhood.
+3. For each original cell, compute the density of other observed cells in the neighborhood.
+4. Return the ratio between the two densities as a **doublet score** for each cell.
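The four steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method of any particular doublet-detection tool: the function names, the fixed `radius` used to define a neighborhood, and the small pseudo-density in the denominator are all hypothetical choices made for the example, and expression profiles are plain lists of numbers.

```python
import math
import random


def simulate_doublets(profiles, n_doublets, rng):
    """Step 1: add together two randomly chosen single-cell profiles."""
    doublets = []
    for _ in range(n_doublets):
        a, b = rng.sample(profiles, 2)
        doublets.append([x + y for x, y in zip(a, b)])
    return doublets


def count_within(center, points, radius):
    """Number of points within `radius` of `center` (Euclidean distance)."""
    return sum(1 for p in points if math.dist(center, p) <= radius)


def doublet_scores(profiles, n_doublets=1000, radius=5.0, seed=0):
    """Steps 2 to 4: ratio of simulated-doublet density to observed-cell density."""
    rng = random.Random(seed)
    doublets = simulate_doublets(profiles, n_doublets, rng)
    scores = []
    for i, cell in enumerate(profiles):
        others = profiles[:i] + profiles[i + 1:]
        # Each density is the fraction of its population found in the neighborhood.
        density_simulated = count_within(cell, doublets, radius) / len(doublets)
        density_observed = count_within(cell, others, radius) / len(others)
        # Hypothetical pseudo-density: avoids dividing by zero for isolated cells.
        scores.append(density_simulated / (density_observed + 1.0 / len(others)))
    return scores
```

A cell sitting where many simulated doublets land but few observed cells live gets a high score. Real tools usually work in a reduced space (for example after PCA) and with nearest-neighbor graphs rather than a fixed radius.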
+
+## Cell filtering
+
+\begin{center}
+  \includegraphics[width=\textwidth]{img/doublet_detection_comparison.png}
+\end{center}
+
+Different algorithms are available to compare cells to synthetic doublets.
+
+## Cell filtering
+
+\begin{center}
+  \includegraphics[width=0.8\textwidth]{img/features_for_QC_1.png}
+\end{center}
+\vspace{-1.5em}
+We can use hard thresholds to remove putative poor-quality cells:
+\vspace{-0.5em}
+\begin{itemize}
+  \item apoptotic cells express mitochondrial (MT) genes
+  \item inefficient RT or PCR amplification
+\end{itemize}
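Hard-threshold filtering can be sketched as follows. The threshold values are left to the caller because the slides do not give any, and the function names are hypothetical. The only convention assumed is that mitochondrial gene symbols start with `MT-` (human) or `mt-` (mouse).

```python
def qc_metrics(gene_names, counts, mt_prefix="MT-"):
    """Per-cell QC metrics from one cell's count vector."""
    total = sum(counts)
    detected = sum(1 for c in counts if c > 0)
    # Case-insensitive match so that both "MT-CO1" and "mt-Nd1" are counted.
    mt_counts = sum(
        c for gene, c in zip(gene_names, counts)
        if gene.upper().startswith(mt_prefix)
    )
    return {
        "total_counts": total,
        "genes_detected": detected,
        "mt_fraction": mt_counts / total if total > 0 else 0.0,
    }


def passes_qc(metrics, min_counts, min_genes, max_mt_fraction):
    """Keep a cell only if it clears every hard threshold."""
    return (
        metrics["total_counts"] >= min_counts
        and metrics["genes_detected"] >= min_genes
        and metrics["mt_fraction"] <= max_mt_fraction
    )
```

In practice the thresholds are chosen by looking at the distribution of each metric across cells, as in the figure above, rather than fixed in advance.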
+
+## Cell filtering
+
+\begin{center}
+  \includegraphics[width=0.8\textwidth]{img/features_for_QC_2.png}
+\end{center}
+
+Cells expressing few genes also contain few mRNA molecules.
+

 # Normalization

+## Counts model
+
+\begin{center}
+  \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
+\end{center}
+
+## Counts distribution
+
+### Random variable
+A variable whose value depends on the outcome of a random phenomenon or experiment.
+
+### For a given gene:
+We consider $X$, a **random variable** for the number of mRNAs observed in a cell, with $x$ a realisation of $X$.
+
+\begin{itemize}
+  \item The random variable $X$ follows a statistical distribution $F$
+  \item We write $X \sim F$
+\end{itemize}
+
+## Counts model
+
+\begin{center}
+  \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
+\end{center}
+
+With a transcription rate $\lambda_g(t)$, the observed mRNA count follows a Poisson distribution $\mathcal{P}(\lambda_g(t))$.
+
+## Counts distribution
+
+$P(X = x)$ for $\mathcal{P}(\lambda_g)$
+
+\begin{center}
+\includegraphics[width=0.6\textwidth]{./img/poisson.png}
+\end{center}
+
+$\lambda_g$, the rate of mRNA production, is equal to both the mean and the
+variance of the number of mRNAs.
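The equality between the Poisson rate and the variance can be checked empirically. The sketch below uses only the Python standard library; the sampler is Knuth's multiplication method, chosen here for brevity (it is only suitable for small rates), and the rate value `lam = 4.0` is an arbitrary example.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    # Multiply uniforms until the product drops below exp(-lam).
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


rng = random.Random(42)
lam = 4.0  # arbitrary example rate
counts = [sample_poisson(lam, rng) for _ in range(20000)]

mean = statistics.fmean(counts)
variance = statistics.pvariance(counts)
print(f"rate = {lam}, empirical mean = {mean:.2f}, empirical variance = {variance:.2f}")
```

Both empirical values should land close to the rate, up to sampling noise.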
+
+## Counts
+
+\begin{center}
+\begin{columns}
+\column{0.5\textwidth}
+\begin{center}
+\begin{tikzpicture}
+  \fill
+      (0.5,3.5) node {\bf $\text{gene}_1$}
+   -- (0.5,2.5) node {\bf $\text{gene}_2$}
+   -- (0.5,1.5) node {\bf $\vdots$}
+   -- (0.5,0.5) node {\bf $\text{gene}_n$};
+  \fill
+      (1.5,4.5) node {\bf{$\text{cell}_1$}}
+   -- (1.5,3.5) node {mRNA}
+   -- (1.5,2.5) node {\color{red}mRNA}
+   -- (1.5,1.5) node {$\vdots$}
+   -- (1.5,0.5) node {mRNA};
+  \fill
+      (2.5,4.5) node {\bf{$\text{cell}_2$}}
+   -- (2.5,3.5) node {mRNA}
+   -- (2.5,2.5) node {\color{red}mRNA}
+   -- (2.5,1.5) node {$\vdots$}
+   -- (2.5,0.5) node {mRNA};
+  \fill
+      (3.5,4.5) node {\bf{$\cdots$}}
+   -- (3.5,3.5) node {$\cdots$}
+   -- (3.5,2.5) node {\color{red}$\cdots$}
+   -- (3.5,1.5) node {$\ddots$}
+   -- (3.5,0.5) node {$\cdots$};
+  \fill
+      (4.5,4.5) node {\bf{$\text{cell}_c$}}
+   -- (4.5,3.5) node {mRNA}
+   -- (4.5,2.5) node {\color{red}mRNA}
+   -- (4.5,1.5) node {$\vdots$}
+   -- (4.5,0.5) node {mRNA};
+  \draw (1,0) grid (5,4);
+\end{tikzpicture}
+\end{center}
+
+\column{0.6\textwidth}
+
+For a gene $g$, {\bf each cell is an observation} of the mRNA count of $g$
+
+As we have a large number of cells, we have access to the:
+\begin{itemize}
+  \item empirical mean
+  \item empirical variance 
+  \item empirical distribution 
+\end{itemize}
+
+\vspace{1em}
+
+bulk RNA-Seq: $\sim 3$ observations per gene
+
+\end{columns}
+\end{center}
+
+## Counts distributions
+
+$P(X = x)$ for $\mathcal{P}(\mu)$
+
+\begin{center}
+\includegraphics[width=0.6\textwidth]{./img/poisson.png}
+\end{center}
+
+$\mu$, the rate of mRNA production, is equal to the variance of the number
+of mRNAs.
+
+**We often have more variability! (broader distributions)**
+
+## Counts model
+
+\begin{center}
+  \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png}
+\end{center}
+
+Cells are not exact replicates of one another: a large number of factors can differ between two cells.
+
+$\lambda_g(t)$ is a **random variable**
+
+## Counts distributions
+
+\begin{center}
+\begin{columns}
+\column{0.4\textwidth}
+$X \sim \mathcal{P}(\lambda)$: $\mathrm{Var}(X) = \lambda$
+
+\vspace{2em}
+
+$X \sim \mathcal{NB}(\lambda, \sigma)$: $\mathrm{Var}(X) = \lambda + \sigma \lambda^2$
+
+\column{0.6\textwidth}
+\vspace{1em}
+\includegraphics[width=0.9\textwidth]{./img/mu_vs_var.png}
+
+\end{columns}
+\end{center}
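One way to see where the quadratic term comes from is to let the Poisson rate itself be random. The sketch below draws the rate from a Gamma distribution and then draws a Poisson count with that rate, which gives a negative binomial count whose variance is mean + dispersion × mean². The Gamma choice, the function names, and the example values (`mean_rate = 5.0`, `dispersion = 0.5`) are assumptions made for the illustration.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value (Knuth's method, fine for moderate lam)."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_negative_binomial(mean_rate, dispersion, rng):
    """Gamma-Poisson mixture: the Poisson rate is itself a random variable."""
    # Gamma(shape, scale) has mean shape * scale = mean_rate
    # and variance shape * scale**2 = dispersion * mean_rate**2.
    shape = 1.0 / dispersion
    scale = mean_rate * dispersion
    cell_rate = rng.gammavariate(shape, scale)
    return sample_poisson(cell_rate, rng)


rng = random.Random(7)
mean_rate, dispersion = 5.0, 0.5  # arbitrary example values
counts = [sample_negative_binomial(mean_rate, dispersion, rng) for _ in range(20000)]

empirical_mean = statistics.fmean(counts)
empirical_variance = statistics.pvariance(counts)
expected_variance = mean_rate + dispersion * mean_rate ** 2
print(f"mean = {empirical_mean:.2f}, variance = {empirical_variance:.2f}, "
      f"expected variance = {expected_variance:.2f}")
```

With a dispersion close to zero the Gamma distribution concentrates around its mean and the counts become Poisson again.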
+
+## Counts distributions
+
+$P(X = x)$ for $\mathcal{NB}(\mu, \sigma)$
+
+\begin{center}
+\includegraphics[width=0.8\textwidth]{./img/poisson.png}
+\end{center}
+
+
+## Counts distributions
+
+$P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 10)$
+
+\begin{center}
+\includegraphics[width=0.8\textwidth]{./img/NB_sigma_10.png}
+\end{center}
+
+
+## Counts distributions
+
+$P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 2)$
+
+\begin{center}
+\includegraphics[width=0.8\textwidth]{./img/NB_sigma_2.png}
+\end{center}
+
+## Counts distributions
+
+$P(X = x)$ for $\mathcal{NB}(\mu, \sigma = 1)$
+
+\begin{center}
+\includegraphics[width=0.8\textwidth]{./img/NB_sigma_1.png}
+\end{center}
+
+## Variance of count data
+
+\begin{center}
+\begin{columns}
+\column{0.5\textwidth}
+\begin{center}
+\begin{tikzpicture}
+  \fill
+      (0.5,3.5) node {\bf $\text{gene}_1$}
+   -- (0.5,2.5) node {\bf $\text{gene}_2$}
+   -- (0.5,1.5) node {\bf $\vdots$}
+   -- (0.5,0.5) node {\bf $\text{gene}_n$};
+  \fill
+      (1.5,4.5) node {\bf{$\text{cell}_1$}}
+   -- (1.5,3.5) node {mRNA}
+   -- (1.5,2.5) node {\color{red}mRNA}
+   -- (1.5,1.5) node {$\vdots$}
+   -- (1.5,0.5) node {mRNA};
+  \fill
+      (2.5,4.5) node {\bf{$\text{cell}_2$}}
+   -- (2.5,3.5) node {mRNA}
+   -- (2.5,2.5) node {\color{red}mRNA}
+   -- (2.5,1.5) node {$\vdots$}
+   -- (2.5,0.5) node {mRNA};
+  \fill
+      (3.5,4.5) node {\bf{$\cdots$}}
+   -- (3.5,3.5) node {$\cdots$}
+   -- (3.5,2.5) node {\color{red}$\cdots$}
+   -- (3.5,1.5) node {$\ddots$}
+   -- (3.5,0.5) node {$\cdots$};
+  \fill
+      (4.5,4.5) node {\bf{$\text{cell}_c$}}
+   -- (4.5,3.5) node {mRNA}
+   -- (4.5,2.5) node {\color{red}mRNA}
+   -- (4.5,1.5) node {$\vdots$}
+   -- (4.5,0.5) node {mRNA};
+  \draw (1,0) grid (5,4);
+\end{tikzpicture}
+\end{center}
+
+\column{0.6\textwidth}
+
+Which mRNA count comparison seems the most significant to you:
+\begin{itemize}
+  \item $50$ vs $5$
+  \item $10050$ vs $10000$
+\end{itemize}
+
+We want to consider that larger 
+
+
+
+\end{columns}
+\end{center}
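The arithmetic behind the two comparisons can be laid out explicitly. This sketch only computes a few summaries for each pair. The last one expresses the difference in units of a Poisson standard deviation, using the smaller count as a rough stand-in for the mean, which is an assumption made for the illustration.

```python
import math


def compare_counts(a, b):
    """Summaries of the gap between two counts, with a > b > 0."""
    difference = a - b
    return {
        "difference": difference,
        "fold_change": a / b,
        "log2_fold_change": math.log2(a / b),
        # Under a Poisson model the standard deviation is sqrt(mean);
        # b is used as a rough stand-in for the mean (assumption).
        "difference_in_poisson_sd": difference / math.sqrt(b),
    }


for a, b in [(50, 5), (10050, 10000)]:
    print(a, "vs", b, compare_counts(a, b))
```

The absolute differences are similar (45 and 50), while the fold changes are 10 and 1.005: the same gap means very different things at low and high expression.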
+
 # Variance stabilization

 # Depth normalization