diff --git a/6_dea/dea.Rmd b/6_dea/dea.Rmd index 7658d3c9317ab290748ca86cb0daddae170df44a..233458b8ec1459ad970a3f310b43a0163f1943ec 100644 --- a/6_dea/dea.Rmd +++ b/6_dea/dea.Rmd @@ -26,7 +26,7 @@ classoption: aspectratio=169 - Differential expression analysis between groups - Regression analysis - Multiple testing - - Multivariate Differential expression analysis + - Multivariate differential expression analysis # Hypothesis testing @@ -38,7 +38,7 @@ classoption: aspectratio=169 We reject the hypothesis at risk $\alpha$, the probability that the null hypothesis was true for the observed value. ### $p$-value - The $p$-value is the probability to observe a value as or more extreme under the null hypothesis model. + the $p$-value is the probability to observe a value as or more extreme under the null hypothesis model. ## Hypothesis testing @@ -162,7 +162,7 @@ receiver operating characteristic \end{tikzpicture} \end{center} \column{0.5\textwidth} -For a given gene $x_i$ we can test: +For a given gene $x_i$ we can test \vspace{1em} \begin{itemize} \item $H_0$: $E\left(x_i\right) = E\left(x_{i'}\right)$ @@ -206,7 +206,7 @@ $P(X = x)$ for $\mathcal{NB}(\lambda, \alpha = 1)$ \includegraphics[width=0.7\textwidth]{img/NB_sigma_1.png} \end{center} -## Non-parametric approaches +## Nonparametric approaches ### We don't try to model the data distribution @@ -214,16 +214,16 @@ Instead we work with: - ranks of the values - the sign of the difference between two groups (Wilcoxon) -- the distribution of differances +- the distribution of differences -If we know the distribution the parametric approach is often more powerfull +If we know the distribution, the parametric approach is often more powerful. 
### Often limited to the 2 groups setting ## Wilcoxon rank sum test -### $H_0$: the median are equal +### $H_0$: the medians are equal \begin{center} \href{https://www.nature.com/articles/s41467-021-27464-5}{ @@ -233,7 +233,7 @@ If we know the distribution the parametric approach is often more powerfull ## WaddR -### Base on 2-Wasserstein distance +### Based on 2-Wasserstein distance \begin{center} \href{https://pubmed.ncbi.nlm.nih.gov/33792651/}{ @@ -241,7 +241,7 @@ If we know the distribution the parametric approach is often more powerfull } \end{center} -## Model based approaches +## Model-based approaches \begin{center} \begin{columns} @@ -282,14 +282,14 @@ X \sim \pi \delta_0 + \left(1 - \pi\right) \mathcal{NB}(\lambda, \alpha) \end{center} -## Model based approaches +## Model-based approaches ### NB distributed counts with excess of zeros \begin{center} \includegraphics[width=0.8\textwidth]{img/ziNB_1} \end{center} -## Model based approaches +## Model-based approaches ### Mixture of two NB distributions @@ -297,9 +297,32 @@ X \sim \pi \delta_0 + \left(1 - \pi\right) \mathcal{NB}(\lambda, \alpha) \includegraphics[width=0.8\textwidth]{img/ziNB_2} \end{center} -## Model based approaches -### $y = \beta_0 + \beta_1 x$ +## Model-based approaches + +### GLM framework + +\[ +X_i \sim \mathcal{NB}(\lambda, \alpha) +\] + +\[E(X_i|\mathbf{Y}) = \boldsymbol{\mu}_i = g^{-1}(\mathbf{Y}\boldsymbol{\beta})\] + +with: +\begin{itemize} + \item $\boldsymbol{\mu}_i$ the mean of the gene $i$ distribution + \item $g$ is the link function + \item $\beta$ the unknown parameters of the model +\end{itemize} + +\[E(X_i|\mathbf{Y}) = \boldsymbol{\mu}_i = g^{-1}(Y_1 \beta_1 + \dots + Y_n \beta_n)\] + +### We can also model the variance as a function of the mean +\[ Var(X_i|\mathbf{Y}) = V( \boldsymbol{\mu}_i ) = \operatorname{V}(g^{-1}(\mathbf{Y}\boldsymbol{\beta})).\] + +## Model-based approaches + +### $\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1$ \begin{center} 
\includegraphics[width=0.7\textwidth]{img/lm_2_groups_b0_3_b1_05.png} @@ -307,19 +330,19 @@ X \sim \pi \delta_0 + \left(1 - \pi\right) \mathcal{NB}(\lambda, \alpha) $\beta_0 = 3$, $\beta_1 = 0.5$ -## Model based approaches +## Model-based approaches -### $y = \beta_0 + \beta_1 x$ +### $\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1$ ### Wald test: \[H_0: \beta_1 = 0\] ### Likelihood ratio test (LTR) -\[H_0: L\left(y = \beta_0\right) = L\left(y = \beta_0 + \beta_1 x\right)\] +\[H_0: L\left(\boldsymbol{\mu}_i = \beta_0\right) = L\left(\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1\right)\] -## Model based approaches +## Model-based approaches -### $y = \beta_0 + \beta_1 x$ +### $\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1$ \begin{center} \includegraphics[width=0.7\textwidth]{img/lm_b0_3_b1_05.png} @@ -327,62 +350,83 @@ $\beta_0 = 3$, $\beta_1 = 0.5$ $\beta_0 = 3$, $\beta_1 = 0.5$ -## Model based approaches +## Model-based approaches -### $y = \beta_0 + \beta_1 x$ +### $\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1$ \begin{center} \href{https://cole-trapnell-lab.github.io/monocle3/}{ - \includegraphics[width=0.7\textwidth]{img/deg_pseudotime.png} + \includegraphics[width=0.6\textwidth]{img/deg_pseudotime.png} } \end{center} -## Model based approaches +## Model-based approaches -### $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ +### $\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1 + \beta_2 y_2$ \begin{center} \includegraphics[width=0.7\textwidth]{img/lm_2_groups_b0_b0_3_b1_05.png} \end{center} -$\beta_0 = 3$, $\beta_1 = 0.5$ $\beta_2 = 5$ +$\beta_0 = 3$, $\beta_1 = 0.5$, $\beta_2 = 5$ + +## Model-based approaches + +### $\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1 + \beta_2 y_2$ + +\begin{center} + \href{https://www.sciencedirect.com/science/article/pii/S2211124721005192}{ + \includegraphics[width=0.9\textwidth]{img/deg_time_group.png} + } +\end{center} -## Model based approaches +## Model-based approaches -### $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ +### $\boldsymbol{\mu}_i = \beta_0 + 
\beta_1 y_1 + \beta_2 y_2 + \beta_3 y_1 y_2$ \begin{center} \includegraphics[width=0.7\textwidth]{img/lm_2_groups_b0_b0_3_b1_05_interaction.png} \end{center} -$\beta_0 = 3$, $\beta_1 = 0.5$ $\beta_2 = 5$ +$\beta_0 = 3$, $\beta_1 = 0.5$, $\beta_2 = 5$, $\beta_3 = -0.4$ -## Model based approaches +## Model-based approaches -### $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ +### $\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1 + \beta_2 y_2 + \beta_3 y_1 y_2$ \begin{center} \includegraphics[width=0.7\textwidth]{img/lm_2_groups_2_factors_b0_b0_3_b1_05_interaction.png} \end{center} -$\beta_0 = 3$, $\beta_1 = 0.5$ $\beta_2 = 5$ +$\beta_0 = 3$, $\beta_1 = 0.5$, $\beta_2 = 5$, $\beta_3 = -0.4$ -## Model based approaches +## Model-based approaches -### $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ +### $\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1 + \beta_2 y_2$ \begin{center} - href{doi: 10.1093/nar/gky675}{ - \includegraphics[width=0.6\textwidth]{img/deg_time_group.png} + \href{https://doi.org/10.1093/nar/gky675}{ + \includegraphics[width=0.35\textwidth]{img/deg_time_group_inter.png} + } +\end{center} + +## Model-based approaches + +### $\boldsymbol{\mu}_i = \beta_0 + \beta_1 y_1 + \beta_2 Z$ + +$Z \sim \mathcal{N}(\mu_z, \sigma_z)$ + +\begin{center} + \href{https://www.sciencedirect.com/science/article/pii/S2211124721005192}{ + \includegraphics[width=0.35\textwidth]{img/deg_time_mixed.png} } \end{center} # Multiple hypotheses testing -## Multiple hypotheses problem +## Multiple hypothesis problem \begin{center} - \only<1>{\includegraphics[width=10cm]{img/dnorm_abs}\\[-2.5em]} - \only<1>{\includegraphics[width=10cm]{img/pval_alpha}} + \only<1>{\includegraphics[width=10cm]{img/pval_2_0.05}} \only<2>{\includegraphics[width=10cm]{img/pval_alpha_random_H0_1} \begin{center} n = 10 @@ -426,7 +470,7 @@ $\beta_0 = 3$, $\beta_1 = 0.5$ $\beta_2 = 5$ \end{center} -## Multiple hypotheses solutions +## Multiple hypothesis solutions \begin{block}{Family Wise Error Rate (FWER)} \begin{itemize} @@ -438,13 +482,13 
@@ $\beta_0 = 3$, $\beta_1 = 0.5$ $\beta_2 = 5$ \end{block} \begin{example} \begin{center} - \emph{``We reject 14 hypothesis with a FWER of 0.05''} - \emph{``We reject 14 hypothesis at a level of 0.05 after Bonferoni correction''} + \emph{``We reject 14 hypotheses with a FWER of 0.05''} + \emph{``We reject 14 hypotheses at a level of 0.05 after Bonferroni correction''} \end{center} Means: 14 hypotheses are not following the null distribution and we make this statement with a probability 0.05 of having fewer than one false positives in the 14 tests. \end{example} -## Multiple hypotheses solutions +## Multiple hypothesis solutions \begin{block}{False Discovery Rate (FDR)} \begin{itemize} @@ -456,26 +500,28 @@ $\beta_0 = 3$, $\beta_1 = 0.5$ $\beta_2 = 5$ \end{block} \only<1>{ \vspace{2em} +\begin{center} \begin{tabular}{l|ccc} - hypothesis & Claimed non-significant & Claimed significant & Total\\ + hypothesis & Claimed nonsignificant & Claimed significant & Total\\ \hline Null & TN & FP & $m_0$\\ Non-null & FN & TP & $m_1$\\ Total & S & R & $m$ \end{tabular} +\end{center} } \only<2-3>{ \begin{example} \begin{center} - \emph{``We reject 254 hypothesis with a FDR of 0.05''} - \emph{``We reject 254 hypothesis with a level of 0.05 after BH correction''} + \emph{``We reject 254 hypotheses with a FDR of 0.05''} + \emph{``We reject 254 hypotheses with a level of 0.05 after BH correction''} \end{center} Means: 254 hypotheses are not following the null distribution and we expect on average 5\% or less of false positives in the 254. 
\end{example} } \only<3>{ \begin{center} - The number of FPs increases with the number of TPs + {\bf The number of FPs increases with the number of TPs} \end{center} } @@ -485,18 +531,18 @@ $\beta_0 = 3$, $\beta_1 = 0.5$ $\beta_2 = 5$ \end{center} $$\Pr\left(FP < 1\right) < \alpha_{FWER}$$ $$\Pr\left(\mathbb{E}\left[\frac{FP}{R}\right | R > 0]\right)\Pr\left(R > 0\right) < \alpha_{FDR}$$ -When $TP \leq 1$ FWER and FDR control are identical.\\ +when $TP \leq 1$ FWER and FDR control are identical.\\ The difference increases with the number of $TP$s ## FDR control \begin{center} -\includegraphics[width=12cm]{img/pval_hist_H0_H1}\\[-1em] +\includegraphics[width=11cm]{img/pval_hist_H0_H1}\\[-1em] \pause -When we analyse data we hope to get a mixture between:\\ -\includegraphics[width=12cm]{img/pval_hist_H0}\\[-2em] +When we analyze data we hope to get a mixture between:\\ +\includegraphics[width=11cm]{img/pval_hist_H0}\\[-2em] \pause -\includegraphics[width=12cm]{img/pval_hist_H1} +\includegraphics[width=11cm]{img/pval_hist_H1} \end{center} ## FDR control: local FDR ($\ell FDR$) of Efron @@ -525,11 +571,19 @@ When we analyse data we hope to get a mixture between:\\ } \end{center} +## Post-selection inference + +\begin{center} + \href{https://pubmed.ncbi.nlm.nih.gov/30206223/}{ + \includegraphics[width=0.75\textwidth]{img/post_inference_example.png} + } +\end{center} + ## SimCD \begin{center} \href{https://arxiv.org/abs/2104.01512v1}{ - \includegraphics[width=0.6\textwidth]{img/simCD.png} + \includegraphics[width=\textwidth]{img/simCD.png} } \end{center} diff --git a/6_dea/img/deg_time_group.png b/6_dea/img/deg_time_group.png index 1761b1b55158f6b4eefd9997fd1a664c3b258e35..f79d4afa427bb7ee129e80082e3727f4a753aef5 100644 Binary files a/6_dea/img/deg_time_group.png and b/6_dea/img/deg_time_group.png differ diff --git a/6_dea/img/deg_time_group_inter.png b/6_dea/img/deg_time_group_inter.png new file mode 100644 index 
0000000000000000000000000000000000000000..1761b1b55158f6b4eefd9997fd1a664c3b258e35 Binary files /dev/null and b/6_dea/img/deg_time_group_inter.png differ diff --git a/6_dea/img/deg_time_mixed.png b/6_dea/img/deg_time_mixed.png new file mode 100644 index 0000000000000000000000000000000000000000..6dd95aeaa7f986b9a72ee6c44c754d6c4bed6bbd Binary files /dev/null and b/6_dea/img/deg_time_mixed.png differ diff --git a/6_dea/img/post_inference_example.png b/6_dea/img/post_inference_example.png new file mode 100644 index 0000000000000000000000000000000000000000..d81c23a857a00cb1d6939a6cd3922e0923ad094a Binary files /dev/null and b/6_dea/img/post_inference_example.png differ