diff --git a/Practical_c.Rmd b/Practical_c.Rmd index a6a7d918e3a905da61e308c4abc67a4fe1fde21f..95b53ca97f6e1bf10ea927edf52a8ddaa1736d2c 100644 --- a/Practical_c.Rmd +++ b/Practical_c.Rmd @@ -190,6 +190,8 @@ Which graphical representation of the data does give an insight about the distri For instance, a box-plot (and derivative) +--- + </p> </details> @@ -212,6 +214,8 @@ ggplot(yeast_av_data, aes(factor(YDL200C_427), A101)) + geom_boxplot() + theme_bw() ``` +--- + </p> </details> @@ -259,6 +263,8 @@ Detail the general framework to build a statistical test? - Requirements: know the probability distribution of $T$ assuming $\H_0$ is **true** - Verify if the value $t$ is "probable" according to the distibution of $T$ under $\H_0$ +--- + </p> </details> @@ -271,6 +277,9 @@ What are the possible answers of a statistical test? <p> - possible answers: "$\H_0$ is false with a given risk" (reject $\H_0$, the test result is significant) vs "the result is not significant" (given this sample, we don't know if $\H_0$ is true or false) - the answer is **never**: ~~$\H_0$ is true~~ + +--- + </p> </details> @@ -354,6 +363,8 @@ ggplot(data.frame(x = c(0, 10)), aes(x)) + size = 3, colour = "blue") ``` +--- + </p> </details> @@ -381,6 +392,7 @@ Reasoning: - assuming $\H_0$ $\to$ the statistics $T$ follows given probability distribution - "Inductive contraposition": p-value is small $\to$ observed value $t$ for the statistics $T$ is unlikely considering its probability distribution under $\H_0$ $\to$ it is unlikely that $T$ follows this distribution $\to$ it is unlikely that $\H_0$ is true +--- </p> </details> @@ -397,6 +409,7 @@ What are type I and type II errors? - Type II error: not reject $\H_0$ conditionally to the fact that $\H_0$ is false +--- </p> </details> @@ -415,6 +428,8 @@ What are type I and type II risks? - Power $= 1 - \beta = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false}$ +--- + </p> </details> @@ -438,11 +453,13 @@ The type 1 risk is a conditional probability assuming that $\H_0$ is true. The type 1 risk $\alpha$ is generally chosen (e.g. $\alpha = 5\%$). It is important to evaluate the power of the test which can not be done in general (it requires either to design a test where the distribution of statistic $T$ is known assuming that tjhe alternative hypostheis $\H_1$ is true, or to evaluate the power using simulations where the truth about $\H_0$ and $\H_1$ is konwn. +--- + </p> </details> <div class="pencadre"> -How to get the answer a statistical test? +How to get the answer of a statistical test? </div> <details><summary>Solution</summary> @@ -577,6 +594,8 @@ ggplot(data.frame(x = c(0, 10)), aes(x)) + size = 3, colour = "blue") ``` +--- + </p> </details> @@ -599,46 +618,48 @@ ggplot(data.frame(x = c(0, 10)), aes(x)) + <!-- </details> --> -### p-values and significancy +### p-values, significancy and power We run a Student T-test on a sample of observations regarding a single variable. We want to know if the average of this variable is null. -The data are generated from the simulation of a random Gaussian variable of mean 0 and variance 1. Since, we are working with simulations, we can repeat the experiment as much as we want, i.e. generate multiple samples of observed values for the considered variables. - -> Note: it is only possible because we are working with simulated data. When analysing real experimental data, it is generally very complicated (or even impossible) to repeat the experiments numerous times in order to generate multiple independent samples of observed values. +The data are generated from the simulation of a random Gaussian variable of mean $\mu=0$ and variance $\sigma^2=1$. **Here, we know the true mean $\mu$ and variance $\sigma^2$, however, in general data analysis, they are usually unknown, hence the motivation to run a test on a possible value for $\mu$.** -In the T-test, we test the hypotheses "$\H_0$: $\mu = 0$" versus "$\H_1$: $\mu \ne 0$" where $\mu$ is the population mean of the variable (that is unknown). +Since, we are working with simulations, we can repeat the experiment as much as we want, i.e. generate multiple samples of observed values for the considered variables. -We repeat the data generation 10000 times and compute the p-value for T-test on each repetition. +> Note: it is only possible because we are working with simulated data. When analysing real experimental data, it is generally very complicated or costly (or even impossible) to repeat the experiments numerous times in order to generate multiple independent samples of observed values. -Below is the empirical distribution of the collected p-values (10000 computations of the same p-value regarding the same test on the same variable but with different samples each time). +In the T-test, we test the hypotheses "$\H_0$: $\mu = \mu_0$" versus "$\H_1$: $\mu \ne \mu_0$" where $\mu$ is the population mean of the variable (that is unknown), for a given value $\mu_0$. In the following, we will choose $\mu_0=0$ and test "$\H_0$: $\mu = 0$" versus "$\H_1$: $\mu \ne 0$". -```{r, echo=FALSE, message=FALSE, warning=FALSE} -## compute p-values for the T-test -compute_p_values <- function(sample_size, mu, mu0) { +Here is a function to generate a sample and compute the p-values associated to the T-test: +```{r} +## function to generate a sample and compute the p-value for the T-test +#' @param sample_size the size of the simulated sample +#' @param mu the mean of the Gaussian distribution used to generate the data (default is 0) +#' @param mu0 the tested value for the mean in the T-test (default is 0) +compute_p_value <- function(sample_size, mu = 0, mu0 = 0) { + # simulation of a Gaussian variable obs_values <- rnorm(sample_size, mean = mu) + # Computation of the observed value for the statistic associated to the T-test stat_value <- sqrt(sample_size) * (mean(obs_values) - mu0) / sd(obs_values) + # Computation of the p-value + # (under H0, the statistic follows a Student T-distribution with n-1 degrees of freedom) p_value <- 2*(1-pt(abs(stat_value), df = sample_size - 1)) + # or use t.test() return(p_value) } +``` -# experience setup -sample_size <- 100 -n_repetition <- 10000 -mu_values <- 0 -mu0 <- 0 +We **repeat** the experiment 10000 times: we repeat the data generation with $\mu=0$ (**hence $\H_0$ is known to be true**) and we compute the p-value for the T-test on each repetition. -setup_grid <- expand_grid(rep = 1:n_repetition, mu = mu_values) +```{r} +n_repetition <- 10000 +p_values <- replicate(n_repetition, compute_p_value(sample_size = 100, mu = 0)) +``` -result <- do.call("bind_rows", pblapply( - split(setup_grid, seq(nrow(setup_grid))), function(item) { - p_val <- compute_p_values(sample_size, item$mu, mu0) - return(tibble(rep=item$rep, p_value=p_val, mu = item$mu, mu0 = mu0)) - }, - cl = 4 -)) +Below is the empirical distribution of the collected p-values (10000 computations of the **same** p-value regarding the **same** test on the **same** variable but with **different samples** each time). -ggplot(result) + geom_histogram(aes(x=p_value)) + +```{r, echo=FALSE, message=FALSE, warning=FALSE} +ggplot(data.frame(p_value=p_values)) + geom_histogram(aes(x=p_value)) + geom_vline(xintercept=0.05, col = "red") + annotate( "text", label=TeX("$\\alpha=5\\%$"), x=0.07, y=100, colour="red", @@ -653,18 +674,136 @@ What can you say about this figure? What could be the problem? especially regard <details><summary>Solution</summary> <p> -In a non negligible number of samples, the null hypothesis was rejected (p-value $\leq\alpha$) whereas it is true. In this case, we find a significant result $\mu\ne 0$ despite being wrong. +In a non negligible number of samples, the null hypothesis was rejected (p-value $\leq\alpha$) whereas it is known to be true. In this case, we find a significant result $\mu\ne 0$ despite being wrong. However, in the majority of the studies, the null hypothesis is correctly not rejected. -**Important:** +--- -- confirm a detected effect with additional experiments/studies -- the more (independent) studies, the lower risk of incorrect conclusion +</p> +</details> + +We follow the same process but this time, we generate data where the population mean is $\mu=0.25$ (hence not 0 and **$\H_0$ is known to be false**) and we run the T-test on 10000 repetitions of the experiment. + +```{r} +n_repetition <- 10000 +p_values <- replicate(n_repetition, compute_p_value(sample_size = 100, mu = 0.25)) +``` + +Below is the empirical distribution of the collected p-values + +```{r, echo=FALSE, message=FALSE, warning=FALSE} +ggplot(data.frame(p_value=p_values)) + geom_histogram(aes(x=p_value)) + + geom_vline(xintercept=0.05, col = "red") + + annotate( + "text", label=TeX("$\\alpha=5\\%$"), x=0.07, y=2000, colour="red", + angle = 90) + + theme_bw() +``` + +<div class="pencadre"> +What can you say about this figure? What could be the problem? especially regarding the result of the test? +</div> + +<details><summary>Solution</summary> +<p> + +In a non negligible number of samples, the null hypothesis was not rejected (p-value $\leq\alpha$) whereas it is known to be false. In this case, we cannot find a significant result $\mu\ne0$ despite being right. + +However, in the majority of the studies, the null hypothesis is correctly rejected. + +--- </p> </details> +**It is important to evaluate the power of a test (despite being generally impossible with real experimental data)**. + +<div class="pencadre"> +Imagine a procedure based on simulation to estimate the power of the previous T-test depending on the type I risk $\alpha$? +</div> + +<details><summary>Solution</summary> +<p> + +The power of a test is $1 - \beta = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false}$ where $\beta = \PP(\text{not reject } \H_0\ \vert\ \H_0 \text{ is false}$ is the type II risk. + +To estimate $\beta$, we need to repeat the same experiment multiple time and estimate the corresponding probability. + +A simple way is to generate data where $\H_0$ is known to be false, i.e. $\PP(\H_0 \text{ is false})=1$. + +We have (c.f. [here](https://en.wikipedia.org/wiki/Conditional_probability)): +\[ +\begin{aligned} +& \text{power}\\ +& = \PP(\text{reject } \H_0\ \vert\ \H_0 \text{ is false})\\ +& = \frac{\PP(\text{reject } \H_0\ \ \text{and}\ \H_0 \text{ is false})}{\PP(\H_0 \text{ is false})} +\end{aligned} +\] + +In this case, we have then: $\text{power} = \PP(\text{reject } \H_0\ \ \text{and}\ \H_0 \text{ is false})$ and we can estimate this probability by counting the number of times reject $\H_0$ among the repetition of the experiment depending on the type I risk $\alpha$ that we choose. + +We repeat the experiment, generating data where $\mu\ne0$ (here $\mu = 0.2$), hence "$\H_0: \mu=0$" is known to be false. + +```{r} +n_repetition <- 10000 +p_values <- replicate(n_repetition, compute_p_value(sample_size = 100, mu = 0.25)) +``` + +We compute the estimation of the power, being the number of times $\H_0$ was rejected over the experiment repetitions over the total number of repetitions, depending on the value that we choose for $\alpha$ + +```{r} +alpha_values <- seq(0, 1, 0.001) # values from 0 to 1 +power_values <- sapply( + alpha_values, + function(alpha) { + estimated_power <- sum(p_values<=alpha)/length(p_values) + return(estimated_power) + } +) +``` + +--- + +</p> +</details> + +Here is the representation of the estimated power depending on alpha: + +```{r, echo=FALSE, message=FALSE, warning=FALSE} +ggplot(data.frame(alpha=alpha_values, power=power_values)) + + geom_point(aes(x=alpha, y=power), size=0.1) + + geom_vline(xintercept=0.05, col = "red") + + annotate( + "text", label=TeX("$\\alpha=5\\%$"), x=0.07, y=0.75, colour="red", + angle = 90) + + theme_bw() +``` + +--- + +</p> +</details> + +<div class="pencadre"> +What can you say about this representation? +</div> + +<details><summary>Solution</summary> +<p> + +Decreasing $\alpha$ to reduce the type I error decreases the power of the test. + +--- + +</p> +</details> + +**Important:** + +- confirm a detected effect with additional experiments/studies +- the more (independent) studies, the lower risk of incorrect conclusion + ## Linear model and analysis of variance We will now use a linear model, and especially an analysis of variance (ANOVA), to test the association between the morphological trait `A101` and the SNPs `YAL069W_1` or `YDL200C_427` (separately). @@ -695,6 +834,8 @@ Between `A101` and the SNP `YAL069W_1`: anova(lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata)) ``` +--- + </p> </details> @@ -723,6 +864,7 @@ mod2 <- lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata) plot(mod2, which = 2) ``` +--- </p> </details> @@ -737,6 +879,8 @@ Do you find a significant effect of the respective SNPs on the considered morpho Solution +--- + </p> </details> @@ -759,6 +903,8 @@ Between `A101` and the SNP `YAL069W_1`: anova(lm(A101 ~ factor(YDL200C_427), data = yeast_av_subdata)) ``` +--- + </p> </details>