Skip to content
Snippets Groups Projects
Select Git revision
  • 0f213906768fc2052be8bf45f10591bd46b971da
  • master default protected
2 results

normalization.Rmd

Blame
  • Laurent Modolo's avatar
    Laurent Modolo authored
    0f213906
    History
    title: "single-cell RNA-Seq: Normalization"
    author: "Laurent Modolo [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr)"
    date: "Wednesday 8 June 2022"
    output:
      beamer_presentation:
        df_print: tibble
        fig_caption: no
        highlight: tango
        latex_engine: xelatex
        slide_level: 2
        theme: metropolis
      ioslides_presentation:
        highlight: tango
      slidy_presentation:
        highlight: tango
    classoption: aspectratio=169  

    Introduction

    Introduction

    Program

    1. Single-cell RNASeq data from 10X Sequencing (Friday 3 June 2022 - 14:00)
    2. Normalization and spurious effects (Wednesday 8 June 2022 - 14:00)
    3. Dimension reduction and data visualization (Monday 13 June 2022 - 15:00)
    4. Clustering and annotation (Thursday 23 June 2022 - 14:00)
    5. Pseudo-time and velocity inference (Thursday 30 June 2022 - 14:00)
    6. Differential expression analysis (Friday 8 July 2022 - 14:00)

    Introduction

    Program

    1. Single-cell RNASeq data from 10X Sequencing (Friday 3 June 2022 - 14:00)
    2. Normalization and spurious effects (Wednesday 8 June 2022 - 14:00)
    • Quality control
    • Normalization
      • Variance stabilization
      • Depth normalization
      • The monotonicity of the normalization
    • batch effects
    • Heterogeneous data
    1. Dimension reduction and data visualization (Monday 13 June 2022 - 15:00)
    2. Clustering and annotation (Thursday 23 June 2022 - 14:00)
    3. Pseudo-time and velocity inference (Thursday 30 June 2022 - 14:00)
    4. Differential expression analysis (Friday 8 July 2022 - 14:00)

    Quality control

    Cell filtering

    \begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} \begin{tikzpicture} \fill (0.5,3.5) node {\bf

    gene1\text{gene}_1
    } -- (0.5,2.5) node {\bf
    gene2\text{gene}_2
    } -- (0.5,1.5) node {\bf
    \vdots
    } -- (0.5,0.5) node {\bf
    genen\text{gene}_n
    }; \fill (1.5,4.5) node {\bf{
    cell1\text{cell}_1
    }} -- (1.5,3.5) node {mRNA} -- (1.5,2.5) node {mRNA} -- (1.5,1.5) node {
    \vdots
    } -- (1.5,0.5) node {mRNA}; \fill (2.5,4.5) node {\color{red}\bf{
    0 cell2\text{0 cell}_2
    }} -- (2.5,3.5) node {\color{red}mRNA} -- (2.5,2.5) node {\color{red}mRNA} -- (2.5,1.5) node {\color{red}
    \vdots
    } -- (2.5,0.5) node {\color{red}mRNA}; \fill (3.5,4.5) node {\bf{
    \cdots
    }} -- (3.5,3.5) node {
    \cdots
    } -- (3.5,2.5) node {
    \cdots
    } -- (3.5,1.5) node {
    \ddots
    } -- (3.5,0.5) node {
    \cdots
    }; \fill (4.5,4.5) node {\bf{
    cellc\text{cell}_c
    }} -- (4.5,3.5) node {mRNA} -- (4.5,2.5) node {mRNA} -- (4.5,1.5) node {
    \vdots
    } -- (4.5,0.5) node {mRNA}; \draw (1,0) grid (5,4); \end{tikzpicture} \end{center}

    \column{0.5\textwidth}

    {\large Some cells are not cells.}

    \begin{itemize} \item matrix columns are defined by {\bf cell barcode sequences} \item {\bf cell barcode sequences identify droplet} in the 10X protocol \end{itemize}

    \end{columns} \end{center}

    Cell filtering

    \begin{center} \begin{columns} \column{0.5\textwidth} \begin{tikzpicture} \fill (0.5,3.5) node {\bf

    gene1\text{gene}_1
    } -- (0.5,2.5) node {\bf
    gene2\text{gene}_2
    } -- (0.5,1.5) node {\bf
    \vdots
    } -- (0.5,0.5) node {\bf
    genen\text{gene}_n
    }; \fill (1.5,4.5) node {\bf
    bc1\text{bc}_1
    } -- (1.5,3.5) node {mRNA} -- (1.5,2.5) node {mRNA} -- (1.5,1.5) node {
    \vdots
    } -- (1.5,0.5) node {mRNA}; \fill (2.5,4.5) node {\color{red}\bf
    bc2\text{bc}_2
    } -- (2.5,3.5) node {\color{red}mRNA} -- (2.5,2.5) node {\color{red}mRNA} -- (2.5,1.5) node {\color{red}
    \vdots
    } -- (2.5,0.5) node {\color{red}mRNA}; \fill (3.5,4.5) node {\bf{
    \cdots
    }} -- (3.5,3.5) node {
    \cdots
    } -- (3.5,2.5) node {
    \cdots
    } -- (3.5,1.5) node {
    \ddots
    } -- (3.5,0.5) node {
    \cdots
    }; \fill (4.5,4.5) node {\bf
    bcc\text{bc}_c
    } -- (4.5,3.5) node {mRNA} -- (4.5,2.5) node {mRNA} -- (4.5,1.5) node {
    \vdots
    } -- (4.5,0.5) node {mRNA}; \draw (1,0) grid (5,4); \end{tikzpicture}

    \column{0.5\textwidth}

    {\large Some cells are not cells.}

    \begin{itemize} \item {\bf v2} chemistry

    737,000\sim 737,000
    cell barcodes \item {\bf v3} chemistry
    3,500,000\sim 3,500,000
    cell barcodes \end{itemize}

    \vspace{1em}

    To avoid cell barcode collision we need [ \text{\bf cell number} \ll \text{\bf cell barcode number} ]

    Most of the droplets will be empty

    \end{columns} \end{center}

    Cell filtering

    \begin{center} \begin{columns} \column{0.35\textwidth} Sequenced empty droplets: \begin{itemize} \item do not express many genes \item looks like experimental noise \end{itemize}

    \vspace{1em}

    The number of UMI per cell barcode

    \column{0.7\textwidth} \vspace{1.5em} \includegraphics[width=\textwidth]{img/cell_barcode_rank_vs_umi.png} \end{columns} \end{center}

    Cell filtering

    \begin{center} \begin{columns} \column{0.35\textwidth} We have {\bf two populations} of cell barcode: \begin{itemize} \item a {\bf low} total UMI counts one \item a {\bf high} total UMI counts one \end{itemize}

    \column{0.7\textwidth} \vspace{1.5em} \includegraphics[width=\textwidth]{img/cell_barcode_rank_vs_umi.png} \end{columns} \end{center}

    Cell filtering

    \begin{center} \begin{columns} \column{0.5\textwidth} \begin{tikzpicture} \fill (0.5,3.5) node {\bf

    gene1\text{gene}_1
    } -- (0.5,2.5) node {\bf
    gene2\text{gene}_2
    } -- (0.5,1.5) node {\bf
    \vdots
    } -- (0.5,0.5) node {\bf
    genen\text{gene}_n
    }; \fill (1.5,4.5) node {\bf
    cell1\text{cell}_1
    } -- (1.5,3.5) node {mRNA} -- (1.5,2.5) node {mRNA} -- (1.5,1.5) node {
    \vdots
    } -- (1.5,0.5) node {mRNA}; \fill (2.5,4.5) node {\color{red}\bf
    2 cells2\text{2 cells}_2
    } -- (2.5,3.5) node {\color{red}mRNA} -- (2.5,2.5) node {\color{red}mRNA} -- (2.5,1.5) node {\color{red}
    \vdots
    } -- (2.5,0.5) node {\color{red}mRNA}; \fill (3.5,4.5) node {\bf
    \cdots
    } -- (3.5,3.5) node {
    \cdots
    } -- (3.5,2.5) node {
    \cdots
    } -- (3.5,1.5) node {
    \ddots
    } -- (3.5,0.5) node {
    \cdots
    }; \fill (4.5,4.5) node {\bf
    cellc\text{cell}_c
    } -- (4.5,3.5) node {mRNA} -- (4.5,2.5) node {mRNA} -- (4.5,1.5) node {
    \vdots
    } -- (4.5,0.5) node {mRNA}; \draw (1,0) grid (5,4); \end{tikzpicture}

    \column{0.5\textwidth}

    {\large Some cells are many cells:}

    \begin{itemize} \item not all tissues are easily dissociable \item two cells glued together will share the same droplet \item two different cells can share the same droplet by chance \end{itemize}

    \vspace{1em}

    Cell barcode corresponding to

    nn
    -plet should be in the minority if the preparation went well.

    \end{columns} \end{center}

    Cell filtering

    \begin{center} \includegraphics[width=0.75\textwidth]{img/mouse_human_mix.png} \end{center}

    Cell filtering

    hypothesis

    Cell barcode corresponding to

    nn
    -plet should be in the minority if the preparation went well.

    Algorithm

    1. Simulate thousands of doublets by adding together two randomly chosen single-cell profiles.
    2. For each original cell, compute the density of simulated doublets in the surrounding neighborhood.
    3. For each original cell, compute the density of other observed cells in the neighborhood.
    4. Return the ratio between the two density as a doublet score for each cell.

    Cell filtering

    \begin{center} \includegraphics[width=\textwidth]{img/doublet_detection_comparison.png} \end{center}

    Different algorithms are available to compare cells to synthetic doublets

    Cell filtering

    \begin{center} \includegraphics[width=0.8\textwidth]{img/features_for_QC_1.png} \end{center} \vspace{-1.5em} We can use hard thresholds to remove putative poor quality cells \vspace{-0.5em} \begin{itemize} \item apoptotic cells express MT genes \item incefficient RT or PCR amplification \end{itemize}

    Cell filtering

    \begin{center} \includegraphics[width=0.8\textwidth]{img/features_for_QC_2.png} \end{center}

    Cell expressing few genes also contains few mRNA molecules

    Gene filtering

    \begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} \begin{tikzpicture} \fill (0.5,3.5) node {\bf

    gene1\text{gene}_1
    } -- (0.5,2.5) node {\bf \text{gene}_2} -- (0.5,1.5) node {\bf \vdots} -- (0.5,0.5) node {\bf \text{gene}_n}; \fill (1.5,4.5) node {\bf{\text{cell}_1}} -- (1.5,3.5) node {mRNA} -- (1.5,2.5) node {\color{red}0} -- (1.5,1.5) node {\vdots} -- (1.5,0.5) node {mRNA}; \fill (2.5,4.5) node {\bf{\text{0 cell}_2}} -- (2.5,3.5) node {mRNA} -- (2.5,2.5) node {\color{red}0} -- (2.5,1.5) node {\vdots} -- (2.5,0.5) node {mRNA}; \fill (3.5,4.5) node {\bf{\cdots}} -- (3.5,3.5) node {\cdots} -- (3.5,2.5) node {\cdots} -- (3.5,1.5) node {\ddots} -- (3.5,0.5) node {\cdots}; \fill (4.5,4.5) node {\bf{\text{cell}_c}} -- (4.5,3.5) node {mRNA} -- (4.5,2.5) node {\color{red}0} -- (4.5,1.5) node {\vdots} -- (4.5,0.5) node {mRNA}; \draw (1,0) grid (5,4); \end{tikzpicture} \end{center}

    \column{0.5\textwidth}

    {\large We don't need to keep genes not expressed across cell-types}

    \vspace {1em} \begin{enumerate} \item Remove gene i if \sum_{j=1}^{N} x_{ij} = 0 \item Remove gene i if \frac{1}{N}\sum_{j=1}^{N} x_{ij} < 1 \end{enumerate}

    \vspace {1em}

    We cannot correctly estimate the variability of gene i if condition {\bf 2} is not fulfilled.

    \end{columns} \end{center}

    Normalization

    Expression model

    \begin{center} \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png} \end{center}

    Measurement model

    Random variable

    A variable whose values depend on outcomes of a random phenomenon or experiment.

    For a given gene:

    We consider X a random variable with x a realization of X the number of mRNAs observed for a gene in a cell at time t

    \begin{itemize} \item The random variable X follows a statistical distribution F \item We write X \sim F \end{itemize}

    Measurement model

    \begin{center} \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png} \end{center}

    With a transcription rate \lambda_g(t) the observed mRNA count follow a Poisson distribution

    [X \sim \mathcal{P}(\lambda_g(t))]

    Counts distribution

    P(X = x) for \mathcal{P}(\lambda_g)

    \begin{center} \includegraphics[width=0.6\textwidth]{img/poisson.png} \end{center}

    The expectation of X, E(X) is equal to it's variance Var(X) (both are equal to \lambda_g)

    Counts

    \begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} \begin{tikzpicture} \fill (0.5,3.5) node {\bf \text{gene}_1} -- (0.5,2.5) node {\bf \text{gene}_2} -- (0.5,1.5) node {\bf \vdots} -- (0.5,0.5) node {\bf \text{gene}_n}; \fill (1.5,4.5) node {\bf{\text{cell}_1}} -- (1.5,3.5) node {mRNA} -- (1.5,2.5) node {\color{red}mRNA} -- (1.5,1.5) node {\vdots} -- (1.5,0.5) node {mRNA}; \fill (2.5,4.5) node {\bf{\text{cell}_2}} -- (2.5,3.5) node {mRNA} -- (2.5,2.5) node {\color{red}mRNA} -- (2.5,1.5) node {\vdots} -- (2.5,0.5) node {mRNA}; \fill (3.5,4.5) node {\bf{\cdots}} -- (3.5,3.5) node {\cdots} -- (3.5,2.5) node {\color{red}\cdots} -- (3.5,1.5) node {\ddots} -- (3.5,0.5) node {\cdots}; \fill (4.5,4.5) node {\bf{\text{cell}_c}} -- (4.5,3.5) node {mRNA} -- (4.5,2.5) node {\color{red}mRNA} -- (4.5,1.5) node {\vdots} -- (4.5,0.5) node {mRNA}; \draw (1,0) grid (5,4); \end{tikzpicture} \end{center}

    \column{0.6\textwidth}

    For a gene g, {\bf each cell is an observation} of the mRNA count of g

    \vspace{1em} As we have a large number of cells, we have access to the: \vspace{-1em} \begin{itemize} \item empirical mean \item empirical variance \item empirical distribution \end{itemize}

    of x_g

    \vspace{1em}

    In bulk RNASeq, we have \sim 3 observation per gene the task is more difficult.

    \end{columns} \end{center}

    Counts distributions

    P(X = x) for \mathcal{P}(\mu)

    \begin{center} \includegraphics[width=0.6\textwidth]{img/poisson.png} \end{center}

    \begin{center} {\bf We observe more variability! (broader distributions)} \end{center}

    Expression model

    \begin{center} \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png} \end{center}

    Cells are not exact replicates of one another: a large number of factors can be different between two cells

    \begin{center} \lambda_g(t) is also a {\bf random variable} \end{center}

    Counts distributions

    \begin{center} \begin{columns} \column{0.5\textwidth}

    We have a hierarchical model:

    [ \begin{aligned}[t] \lambda \sim & F(.) \ X | \lambda \sim & \mathcal{P}(\lambda) \end{aligned}[t] ]

    \vspace{2em}

    If F = \mathcal{Gamma(\lambda, \alpha)}, X follow a Negative-Binomial distribution

    [ X \sim \mathcal{NB}(\lambda, \alpha) ]

    with Var(X) = \lambda + \alpha \lambda^2

    \column{0.5\textwidth} \vspace{1em} \includegraphics[width=0.9\textwidth]{img/mu_vs_var.png}

    \end{columns} \end{center}

    Counts distributions

    P(X = x) for \mathcal{P}(\lambda)

    \begin{center} \includegraphics[width=0.8\textwidth]{img/poisson.png} \end{center}

    Counts distributions

    P(X = x) for \mathcal{NB}(\lambda, \alpha = 10)

    \begin{center} \includegraphics[width=0.8\textwidth]{img/NB_sigma_10.png} \end{center}

    Counts distributions

    P(X = x) for \mathcal{NB}(\lambda, \alpha = 2)

    \begin{center} \includegraphics[width=0.8\textwidth]{img/NB_sigma_2.png} \end{center}

    Counts distributions

    P(X = x) for \mathcal{NB}(\lambda, \alpha = 1)

    \begin{center} \includegraphics[width=0.8\textwidth]{img/NB_sigma_1.png} \end{center}

    Variance of count data

    \begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} \begin{tikzpicture} \fill (0.5,3.5) node {\bf \text{gene}_1} -- (0.5,2.5) node {\bf \text{gene}_2} -- (0.5,1.5) node {\bf \vdots} -- (0.5,0.5) node {\bf \text{gene}_n}; \fill (1.5,4.5) node {\bf{\text{cell}_1}} -- (1.5,3.5) node {mRNA} -- (1.5,2.5) node {\color{red}mRNA} -- (1.5,1.5) node {\vdots} -- (1.5,0.5) node {mRNA}; \fill (2.5,4.5) node {\bf{\text{cell}_2}} -- (2.5,3.5) node {mRNA} -- (2.5,2.5) node {\color{red}mRNA} -- (2.5,1.5) node {\vdots} -- (2.5,0.5) node {mRNA}; \fill (3.5,4.5) node {\bf{\cdots}} -- (3.5,3.5) node {\cdots} -- (3.5,2.5) node {\color{red}\cdots} -- (3.5,1.5) node {\ddots} -- (3.5,0.5) node {\cdots}; \fill (4.5,4.5) node {\bf{\text{cell}_c}} -- (4.5,3.5) node {mRNA} -- (4.5,2.5) node {\color{red}mRNA} -- (4.5,1.5) node {\vdots} -- (4.5,0.5) node {mRNA}; \draw (1,0) grid (5,4); \end{tikzpicture} \end{center}

    \column{0.6\textwidth}

    [ Var(X) = \lambda + \alpha \lambda^2 ]

    Which mRNA comparison seems the most significant to you : [ \begin{aligned} x_{1,1} = 50 & \text{ vs } x_{1,2} = 5 \ x_{1,1} = 10050 & \text{ vs } x_{2,2} = 10000 \end{aligned} ] \vspace{1em}

    We don't want gene with high variance to have an outsized impact on the analysis.

    \end{columns} \end{center}

    Variance in count depth

    The technical replicate unit is the cell

    \begin{center} \includegraphics[width=\textwidth]{img/cell_depth.png} \end{center}

    Goals of the normalization step

    Ideally, we want the normalization step to be able to:

    • remove differences of depth between cells
    • stabilize the variability of the count data
    • be monotonous
    • make different data sets similar

    Depth normalization

    The simplest depth normalization is to scale every column of the X matrix to a given factor

    \begin{center} \begin{columns} \column{0.5\textwidth}

    [ x_{ij}^{norm} = \frac{x_{ij}}{\sum_{i=1}^{n} x_{ij}} \times \alpha ]

    \begin{itemize} \item \alpha = 10^{6}: CPM \item \alpha = 10^{4}: CP10K \end{itemize}

    \column{0.4\textwidth} \vspace{1em} \begin{center} \includegraphics[width=\textwidth]{img/total_umi_dist.png} \end{center}

    \end{columns} \end{center}

    Variance stabilization

    The simplest variance stabilization is the natural log transform (base 10)

    \begin{center} \begin{columns} \column{0.5\textwidth} \begin{center} P(X_{ij} = 0) > 0

    \includegraphics[width=\textwidth]{img/log10_vs_log1p.png}

    \end{center}

    \column{0.5\textwidth} \begin{center} We use log(x + 1) or log1p(x)

    \includegraphics[width=\textwidth]{img/var_log1p.png} \end{center}

    \end{columns} \end{center}

    Monotonicity of the normalization

    The log1p(CPM) is monotonous:

    [ \forall x_i, x_j, \text{such that}, x_i < x_j, log1p\left(\frac{x_i}{\sum x} 10^6\right) < log1p\left(\frac{x_j}{\sum x} 10^6\right) ]

    \vspace{2em}

    \begin{center} {\bf This property is crucial for the task of finding marker genes} \end{center}

    Variance stabilization

    \begin{center} \includegraphics[width=0.8\textwidth]{img/log_transform_6_groups.png} \end{center}

    \begin{center} Variance stabilization for 6 groups of increasing gene mean \end{center}

    SCTransform

    \begin{center} \begin{columns} \column{0.4\textwidth}

    SCTransform use a {\bf linear model} of the {\bf expected count} per gene

    [ X_{ij} \sim GammaPoisson\left(\lambda_{ij}, \alpha_{i}\right) ]

    [ \log\left(E\left(x_{j}\right)\right) = \beta_{i0} + \beta_{is} \log \left(s\right) ]

    with s the vector of column sums of the count matrix X (total UMI per cell)

    \column{0.5\textwidth} \begin{center} \includegraphics[width=\textwidth]{img/regression_linaire.png} \end{center}

    \end{columns} \end{center}

    SCTransform

    We fit the model for each gene

    [ \log\left(E\left(x_{j}\right)\right) = \beta_{i0} + \beta_{is} \log \left(s\right) ]

    \begin{center} \includegraphics[width=\textwidth]{img/sctransform_reg_parameter.png} \end{center}

    \begin{center} The pink line corresponds to a smooth average of the parameter values \end{center}

    SCTransform

    Modeling each gene separately results in overfitting, particularly for low-abundance genes that are detected in only a minor subset of cells and are modeled with a high variance.

    \begin{center} \includegraphics[width=\textwidth]{img/sctransform_regularization.png} \end{center}

    SCTransform

    Pearson residuals

    [ \begin{aligned} r_{ij} &= \frac{x - \lambda_{ij}}{\sqrt{\lambda_{ij} + \alpha \lambda_{ij}^2}} \ \lambda_{ij} &= exp\left(\beta_{i0} + \beta_{is} \log \left(s\right)\right) \end{aligned} ]

    \begin{center} SCTransform use the {\bf Pearson residuals} as the normalized count values \end{center}

    \vspace{2em} We can regress other factors

    [ \log\left(E\left(x_{j}\right)\right) = \beta_{i0} + \beta_{is} \log \left(s\right) + \dots + \beta_{iv} \log \left(v\right) ]

    Sanity

    \begin{center} \includegraphics[width=\textwidth]{img/sanity_model_a_bis.png} \end{center}

    Sanity

    \begin{center} \begin{columns} \column{0.33\textwidth}

    transcription activity in cell c

    \column{0.33\textwidth}

    mRNA count in cell c

    \column{0.33\textwidth}

    cell state

    \end{columns} \end{center} \begin{center} \includegraphics[width=\textwidth]{img/sanity_model.png} \end{center}

    \begin{center} \begin{columns} \column{0.33\textwidth}

    [ \vec{\alpha}c = \begin{bmatrix} \frac{x{1c}}{\sum x_c}\ \frac{x_{2c}}{\sum x_c}\ \vdots \ \frac{x_{nc}}{\sum x_c}\ \end{bmatrix} ]

    \column{0.33\textwidth}

    [ m_{gc}|a_{gc} \sim \mathcal{P}\left(a_{gc}\right) ]

    \column{0.33\textwidth}

    [ n_{gc}|p_{c}, a_{gc} \sim \mathcal{P}\left(p_{c} a_{gc}\right) ]

    \end{columns} \end{center}

    Sanity

    \begin{center} \begin{columns} \column{0.33\textwidth}

    transcription activity in cell c

    \column{0.33\textwidth}

    mRNA count in cell c

    \column{0.33\textwidth}

    cell state

    \end{columns} \end{center} \begin{center} \includegraphics[width=\textwidth]{img/sanity_model.png} \end{center}

    The probability of obtaining the UMI counts \vec{n}_c given the cell state \vec{\alpha}_c is a product over genes of Poisson distributions with means N_{c}\alpha_{gc}, where N_c is the total UMI count in cell c.

    \begin{center} Sanity will estimate log\left(\alpha_{gi}\right) \sim \mathcal{N}\left(\mu, \sigma\right) \end{center}

    Goals of Normalization

    \begin{center} \includegraphics[width=\textwidth]{img/normalization_marker_gene.png} \end{center}

    batch effects

    batch effects

    \begin{center} \includegraphics[width=\textwidth]{img/batch_effect.png} \end{center}

    batch effects appear when you mix data from different:

    • experiments
    • time points
    • plates
    • sequencing lane
    • etc.

    batch effects

    \begin{center} \begin{columns} \column{0.5\textwidth} {\bf SCTransform}

    We can regress out a categorial variable \column{0.5\textwidth} {\bf Sanity}

    We can run Sanity on each separate batch and combine the results.

    \end{columns} \end{center}

    Harmony

    \begin{center} \includegraphics[width=\textwidth]{img/harmony.png} \end{center}

    single-cell RNA-Seq dimension reduction Monday 13 June 2022

    References